FR-IQA Research Paper

Full-Reference Objective Image Quality Assessment and Automated Artifact Detection

A Mathematical and Data-Driven Synthesis

Author — Artem Katolikov · Date — March 17, 2026 · Status — Preprint
The objective evaluation of image quality and the automated, spatially precise detection of visual artifacts remain among the most complex challenges in computational vision. This report synthesizes a fully automated, data-driven Full-Reference Image Quality Assessment (FR-IQA) system designed to replace subjective human Mean Opinion Score (MOS) evaluations. By integrating dispersion-based texture tracking, multivariate uncertainty modeling, optimal transport color science, and targeted frequency-domain spatial heuristics, the proposed Unified Probabilistic Image Quality and Artifact Locator (UPIQAL) framework provides both a global similarity score and a highly localized, mathematically interpretable diagnostic artifact heatmap.

State-of-the-Art Review

Historically, the assessment of image fidelity relied on human visual inspection, formalized through the Mean Opinion Score (MOS) framework. However, the subjective nature, high cost, and inherent unscalability of MOS have driven the need for robust, automated, data-driven computational models.

The urgency for such models has accelerated sharply with the advent of advanced generative architectures, including text-to-image diffusion models and highly parameterized Generative Adversarial Networks (GANs). In these contemporary paradigms, Image Quality Assessment (IQA) models no longer serve merely as post-generation evaluation metrics; they are actively deployed as critical perceptual reward signals within frameworks such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) to align synthesized outputs with human visual preferences.

Shift 1 · Pixel → Structure

Pixel-wise MSE and PSNR treat all localized errors with equal weight, entirely disregarding structural context, spatial frequency dependencies, luminance masking, and contrast masking. SSIM operates via a sliding Gaussian window to compute localized similarities in luminance, contrast, and structure — the first step away from mathematical error and toward perceptual similarity.

To rectify the perceptual inadequacy of pixel-wise error, the Structural Similarity Index Measure (SSIM) and its multi-scale extension (MS-SSIM) were introduced. Subsequent hand-crafted derivatives — FSIM (phase congruency and gradient magnitude), GMSD (gradient magnitude similarity deviation), and VIF (Gaussian scale mixtures) — sought to further approximate the Human Visual System (HVS). Despite higher correlations with MOS than PSNR, all traditional metrics apply fixed, deterministic statistical models uniformly across all image regions and are heavily sensitive to non-structural alterations such as global color shifts.
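The sliding-window computation SSIM performs can be sketched in a few lines of numpy; the Gaussian width and the stabilizing constants $C_1$, $C_2$ below are the conventional choices for 8-bit images, not values prescribed by this report:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ssim_map(x, y, sigma=1.5, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Per-pixel SSIM via a sliding Gaussian window (illustrative sketch)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x = gaussian_filter(x, sigma)                      # local means
    mu_y = gaussian_filter(y, sigma)
    var_x = gaussian_filter(x * x, sigma) - mu_x ** 2     # local variances
    var_y = gaussian_filter(y * y, sigma) - mu_y ** 2
    cov_xy = gaussian_filter(x * y, sigma) - mu_x * mu_y  # cross-covariance
    # Combined luminance / contrast-structure comparison.
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den
```

Averaging the returned map gives the familiar scalar SSIM score; identical inputs yield exactly 1.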

Shift 2 · Structure → Deep Features

LPIPS shifts the comparison from the raw pixel space to the latent feature space of VGG / AlexNet / SqueezeNet. Hierarchical convolutional layers capture both low-level edges and high-level semantic abstractions, yielding strong correlation with human assessments — at the cost of interpretability and robustness to adversarial perturbations.

DISTS reconciles SSIM's structural rigor with LPIPS's semantic depth: it extracts hierarchical VGG16 feature maps and computes their global spatial statistics (means, variances, cross-covariances), explicitly separating structural fidelity from textural similarity. A-DISTS further introduces a spatial dispersion index to locally adapt the pooling strategy between structure and texture — without supervised retraining on subjective scores.

Shift 3 · Determinism → Probability

SUSS frames perceptual evaluation as explicit probabilistic density estimation, modeling an image through a set of perceptual components represented by structured multivariate Normal distributions. The final score becomes a weighted sum of component log-probabilities — effectively a Mahalanobis distance in a learned perceptual space.

For color degradation, metrics such as EDOKS (Earth Mover's Distance and Oklab Similarity) convert images into the Oklab perceptual color space and use the Earth Mover's Distance to measure the minimum mathematical cost required to transform one color distribution into another. For structural artifacts, explicit mathematical heuristics often outperform CNN feature extraction: JPEG blocking is best quantified using cross-difference filtering with a contrario statistical validation; Gibbs ringing is isolated by dilated-edge variance ratios; Gaussian noise is estimated via wavelet decomposition.

Comparison Table

Metric families compared by mechanism, feature space, advantages, and known failure modes.

| Metric / Framework | Mechanism | Feature Space | Advantages | Limitations |
|---|---|---|---|---|
| SSIM / MS-SSIM | Local mean, variance, covariance pooling | Raw pixel (spatial) | Interpretable; basic luminance / contrast masking | Fails on texture resampling, color shifts, generative outputs |
| LPIPS | Euclidean / cosine distance of normalized activations | Deep latent (VGG / AlexNet) | Aligns with human semantics; captures high-level features | Heavy; opaque spatial maps; sensitive to adversarial noise |
| DISTS | Global statistical pooling of deep features | Deep latent (VGG16) | Robust to texture variance and mild spatial misalignment | Global pooling ignores local semantics; over-forgives blur |
| A-DISTS | Dispersion-index-based adaptive pooling | Deep latent (VGG16) | Content-aware; separates texture and structure probabilistically | Dispersion-index computation cost; weaker on very small patches |
| SUSS | Multivariate Normal · Mahalanobis distance | Probabilistic pixel / feature | Uncertainty-aware; localized interpretable anomaly maps | Heavy generative training; dense covariance inversion |
| EDOKS | Optimal transport (EMD) in perceptually uniform space | Oklab color space | Accurate color perception; immune to RGB bias | EMD optimization is $O(n^3 \log n)$ without Sinkhorn |

Algorithm Formulation

No single existing metric simultaneously satisfies the requirements for deep semantic understanding, probabilistic spatial interpretability, precise color modeling, and deterministic artifact isolation. Therefore, the Unified Probabilistic Image Quality and Artifact Locator (UPIQAL) synthesizes the dispersion-based texture tracking of A-DISTS, the multivariate uncertainty modeling of SUSS, the optimal transport color science of EDOKS, and targeted frequency-domain heuristics.

UPIQAL ingests a reference image $I_r$ and a distorted target image $I_t$. It operates through five cascading modules, terminating in a unified aggregation head that outputs a scalar score and a multi-channel spatial diagnostic tensor.

01 Universal Preprocessing & Normalization

Standard algorithms frequently fail when deployed across disparate domains — natural photographs, MRI, low-light endoscopic imagery. UPIQAL subjects $I_r$ and $I_t$ to a rigorous normalization protocol. Min-max scaling bounds intensities to $[0, 1]$; for modalities lacking standard reference ranges, a piece-wise linear histogram matching algorithm aligns the target histogram modes and percentiles (e.g. $p_{1}$ and $p_{99}$) to the reference, neutralizing irrelevant systemic intensity shifts before perceptual comparison.
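As an illustration of the percentile step, a two-point linear simplification of the piece-wise matching described above (the function name and the guard constant are ours, not a fixed part of UPIQAL):

```python
import numpy as np

def align_percentiles(target, reference, lo=1.0, hi=99.0):
    """Linearly map the target's [p_lo, p_hi] intensity range onto the
    reference's, neutralizing global gain/offset shifts before comparison.
    A two-point simplification of full piece-wise histogram matching."""
    t_lo, t_hi = np.percentile(target, [lo, hi])
    r_lo, r_hi = np.percentile(reference, [lo, hi])
    scale = (r_hi - r_lo) / max(t_hi - t_lo, 1e-12)  # guard flat inputs
    return (target - t_lo) * scale + r_lo
```

After alignment, the target's $p_{1}$ and $p_{99}$ coincide with the reference's, so a pure affine intensity shift contributes nothing to the downstream perceptual score.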

02 Chromatic Transport Evaluator

Deep convolutional feature extractors are notoriously poor at pure color evaluation — their kernels are biased toward spatial frequencies and edge geometries. UPIQAL isolates chromatic evaluation into a parallel module based on EDOKS. Images are transformed from the non-linear sRGB space into the Oklab perceptual color space, which decouples lightness $L$ from the opponent channels $a$ (green–red) and $b$ (blue–yellow) while maintaining approximate perceptual uniformity.

Within Oklab, dense overlapping 3D color histograms are extracted and compared via the Earth Mover's Distance (EMD) — framing color degradation as linear optimal transport. This produces a dense Color_Degradation_Map that accurately penalizes hue and saturation shifts without being confounded by structural alterations.
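The transport intuition is easiest to see in one dimension, where EMD has a closed form as the $L_1$ distance between cumulative histograms; a minimal sketch (the module itself operates on dense 3D Oklab histograms, for which the Sinkhorn solver of the next sections is used):

```python
import numpy as np

def emd_1d(hist_p, hist_q, bin_width=1.0):
    """Earth Mover's Distance between two 1D histograms of equal mass.
    In 1D the optimal transport cost is the area between the two CDFs."""
    p = hist_p / hist_p.sum()
    q = hist_q / hist_q.sum()
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum() * bin_width
```

Shifting all mass by $k$ bins costs exactly $k$ — the "work" of moving earth — whereas a bin-wise distance such as $\chi^2$ would saturate immediately, which is precisely why EMD is preferred for hue shifts.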

03 Hierarchical Deep Statistical Extractor

To assess the interplay of structure and texture, UPIQAL integrates an advanced A-DISTS variant. Reference and target propagate through a fixed, pre-trained VGG16 backbone; feature maps are intercepted at five hierarchical stages — from relu1_2 (fine gradients) up to relu5_3 (deep semantic objects). Rather than executing a direct Euclidean subtraction — which triggers the LPIPS vulnerability of over-penalizing valid generative textures — the module applies localized pooling with a 2D Hanning window, outputting spatial means, variances, and cross-covariances for every channel at every location.

A spatial dispersion index — the ratio of local variance to local mean — is passed through a sigmoid to generate a texture probability map $P_{\mathrm{tex}}$. Where $P_{\mathrm{tex}} \to 1$ (grass, water) the system prioritizes variance similarity, ignoring exact pixel alignment; where $P_{\mathrm{tex}} \to 0$ (rigid geometry) it strictly penalizes structural warping or blurring via cross-covariance.

04 Probabilistic Uncertainty Mapper

Inspired by the SUSS framework, UPIQAL models residual feature differences between $I_r$ and $I_t$ as samples drawn from a structured multivariate Normal distribution. During an offline, self-supervised generative phase, the model learns the covariance matrices that define the acceptable limits of human-imperceptible augmentations for varying image contents.

At inference, the Mahalanobis distance between the target's deep residuals and the learned distribution of the reference yields an exact mathematical representation of perceptual deviation. Locations with high $D_M^2$ fall into the low-probability tails of the natural distribution and are classified as perceptual anomalies — this produces the continuous Global_Anomaly_Map.

05 Spatial Artifact Heuristics Engine

The probabilistic and deep modules identify where a perceptual failure occurs; classifying what specific artifact is present requires targeted spatial and frequency heuristics.

  • JPEG Blocking — horizontal and vertical cross-difference filters detect grid discontinuities; an a contrario validation model computes the Number of False Alarms (NFA) for periodic gradient spikes at modulo-8 intervals.
  • Gibbs Ringing — a binary edge skeleton is dilated to form a proximity mask; the ratio of localized edge-vicinity variance to background variance isolates unnatural oscillation energy.
  • Blur — edge-spread analysis + attenuation of high-frequency components in the earliest convolutional layers.
  • Gaussian Noise — multi-level wavelet decomposition; noise standard deviation estimated via median of the highest-frequency diagonal subband.
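The noise estimator in the last bullet reduces, at a single decomposition level, to the classical median-of-diagonal-subband rule; a self-contained Haar sketch (a production version would use a full multi-level wavelet library):

```python
import numpy as np

def estimate_noise_sigma(img):
    """Robust Gaussian-noise estimate: median absolute value of the
    single-level Haar diagonal (HH) subband, scaled by 1/0.6745."""
    img = img.astype(np.float64)
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2  # even crop
    a = img[0:h:2, 0:w:2]
    b = img[0:h:2, 1:w:2]
    c = img[1:h:2, 0:w:2]
    d = img[1:h:2, 1:w:2]
    hh = (a - b - c + d) / 2.0            # Haar diagonal detail subband
    return np.median(np.abs(hh)) / 0.6745  # MAD -> sigma for a Normal
```

The median-based estimate is insensitive to the sparse large coefficients contributed by image edges, which is what makes the diagonal subband a reliable noise probe.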

Discrete masks are spatially intersected with the Global_Anomaly_Map. A fully connected regression head aggregates weighted sums of local log-probabilities, outputting a single, comprehensive FR-IQA scalar calibrated to substitute human MOS.

Mathematical Foundation

1 · Statistical Pooling & Adaptive Feature Separation

Feature representations at stage $k$ of VGG16 are tensors $F_r^{(k)}$ and $F_t^{(k)}$. A continuous 2D Hanning window $w(u,v)$ over kernel size $K$ provides the localized spatial kernel:

$$w(u,v) \;=\; \frac{1}{Z}\left(1 - \cos\frac{2\pi u}{K - 1}\right)\left(1 - \cos\frac{2\pi v}{K - 1}\right)$$
(1)

where $Z$ is a normalization constant ensuring $\sum_{u,v} w(u,v) = 1$. The localized expected value for channel $c$ at coordinate $(x,y)$ is the window-weighted convolution, with a small positive constant $\epsilon$ guarding the ratio-based statistics that follow against division by zero:

$$\mu^{(k)}_{r,c}(x,y) \;=\; \bigl(F^{(k)}_{r,c} \ast w\bigr)(x,y) \;+\; \epsilon$$
(2)

Localized variances $\sigma^{2}_{r}$, $\sigma^{2}_{t}$ and cross-covariance $\sigma_{rt}$ follow the standard definitions centered around these expectations. From SSIM, the luminance and structural similarity terms are:

$$l(x) \;=\; \frac{2\mu_r\mu_t + C_1}{\mu_r^{2} + \mu_t^{2} + C_1}, \qquad s(x) \;=\; \frac{2\sigma_{rt} + C_2}{\sigma_r^{2} + \sigma_t^{2} + C_2}$$
(3)

The dispersion index is $\delta(x,y) = \sigma^{2}(x,y) / \mu(x,y)$. A logistic sigmoid maps it to a texture probability:

$$P_{\mathrm{tex}}(x,y) \;=\; \frac{1}{1 + \exp\!\bigl(-\alpha\,(\delta(x,y) - \beta)\bigr)}$$
(4)

Aggregation linearly interpolates the weights so that high-dispersion texture regions prioritize the mean-based term $l(x)$ (forgiving exact spatial alignment), while low-dispersion structural regions prioritize the cross-covariance term $s(x)$ (strictly penalizing geometric warping).
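Equations (1)–(4) and this blending rule fit in one short routine; $\alpha$, $\beta$, the kernel size, and the stabilizers below are illustrative placeholders rather than trained values:

```python
import numpy as np
from scipy.ndimage import convolve

def hann_window(K):
    """Normalized 2D Hanning kernel, Eq. (1)."""
    w1 = np.hanning(K)
    w = np.outer(w1, w1)
    return w / w.sum()

def adaptive_similarity(fr, ft, K=7, c1=1e-4, c2=1e-4,
                        alpha=4.0, beta=0.5, eps=1e-8):
    """Local l(x), s(x) terms (Eq. 3) blended by the texture probability
    P_tex (Eq. 4) derived from the reference's dispersion index."""
    w = hann_window(K)
    mu_r = convolve(fr, w) + eps           # Eq. (2), epsilon-stabilized
    mu_t = convolve(ft, w) + eps
    var_r = convolve(fr * fr, w) - mu_r ** 2
    var_t = convolve(ft * ft, w) - mu_t ** 2
    cov = convolve(fr * ft, w) - mu_r * mu_t
    l = (2 * mu_r * mu_t + c1) / (mu_r ** 2 + mu_t ** 2 + c1)
    s = (2 * cov + c2) / (var_r + var_t + c2)
    delta = var_r / mu_r                    # dispersion index
    p_tex = 1.0 / (1.0 + np.exp(-alpha * (delta - beta)))  # Eq. (4)
    # Texture regions weight the mean term; structure regions the covariance term.
    return p_tex * l + (1.0 - p_tex) * s
```

Applied per channel to the intercepted VGG16 feature maps, this yields the adaptive pooling described above.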

2 · Probabilistic Uncertainty Modeling

Let the residual vector at location $(x,y)$ across concatenated channels be $\Delta f = f_t - f_r \in \mathbb{R}^{D}$. Acceptable human-imperceptible perturbations are modeled as a multivariate Normal $\mathcal{N}(\mathbf{0},\,\Sigma_{x,y})$:

$$p(\Delta f \mid \Sigma) \;=\; \frac{1}{(2\pi)^{D/2}\,\lvert\Sigma\rvert^{1/2}}\;\exp\!\left(-\frac{1}{2}\,\Delta f^{\top}\,\Sigma^{-1}\,\Delta f\right)$$
(5)

The exponent is the squared Mahalanobis distance $D_{M}^{2} = \Delta f^{\top}\,\Sigma^{-1}\,\Delta f$. Because $D$ is massive, the precision matrix is parameterized via a Cholesky decomposition:

$$\Sigma^{-1} \;=\; L\,L^{\top}, \qquad D_{M}^{2} \;=\; \bigl\lVert L^{\top}\,\Delta f \bigr\rVert_{2}^{2}$$
(6)

where $L$ is a sparse, structured lower-triangular matrix learned during training. The spatial tensor of $D_M^2$ is the Global_Anomaly_Map; the final score is the negative sum of log-probabilities, normalized to $[0,1]$.
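A small dense numpy check makes Eq. (6) concrete — the Cholesky-factored quadratic form reproduces the direct Mahalanobis computation (the deployed system uses a sparse structured $L$; a random dense precision matrix stands in here):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
# Build a random SPD precision matrix and its Cholesky factor L (Sigma^-1 = L L^T).
A = rng.normal(size=(D, D))
precision = A @ A.T + D * np.eye(D)
L = np.linalg.cholesky(precision)

delta_f = rng.normal(size=D)                    # residual feature vector
d2_direct = delta_f @ precision @ delta_f       # Delta_f^T Sigma^-1 Delta_f
d2_chol = np.sum((L.T @ delta_f) ** 2)          # ||L^T Delta_f||^2
```

The factored form never materializes $\Sigma^{-1}$ explicitly: one triangular matrix-vector product and a squared norm replace the full quadratic form, which is what makes the per-location anomaly map tractable at high $D$.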

3 · Optimal Transport for Chromatic Degradation

A reference patch is represented by a color signature $P = \{(m_i, w_i)\}$ where $m_i \in \mathbb{R}^3$ are Oklab coordinates and $w_i$ is the histogram mass. The target patch $Q = \{(n_j, v_j)\}$ is similarly defined. The ground distance $d_{ij} = \lVert m_i - n_j \rVert$ is the perceptual color difference. The Earth Mover's Distance is the linear program:

$$\mathrm{EMD}(P, Q) \;=\; \min_{f_{ij} \,\ge\, 0}\; \sum_{i}\sum_{j} d_{ij}\, f_{ij}$$
(7)

subject to the mass-preservation constraints:

$$\sum_{j} f_{ij} \,\le\, w_{i}, \qquad \sum_{i} f_{ij} \,\le\, v_{j}, \qquad \sum_{i,j} f_{ij} \;=\; \min\!\left(\sum_{i} w_{i},\; \sum_{j} v_{j}\right)$$
(8)

Exact LP is $\mathcal{O}(n^{3}\log n)$. The pipeline instead uses the Sinkhorn–Knopp algorithm, adding an entropy regularizer $\lambda\,H(f)$ that yields a matrix-scaling iteration suitable for GPU hardware.

4 · Mathematical Heuristics for Artifact Localization

Blocking. Let $Y$ be the luminance channel. The horizontal cross-difference filter is $D_{h}(x,y) = Y(x + 1,\,y) - Y(x,\,y)$. To detect $8 \times 8$ DCT boundaries, an accumulation vector $A_{h}[k]$ is computed for offsets $k \in \{0, \ldots, 7\}$:

$$A_{h}[k] \;=\; \sum_{x \,\equiv\, k \,(\mathrm{mod}\ 8)} \; \sum_{y} \bigl\lvert D_{h}(x,y) \bigr\rvert$$
(9)

An a contrario validation step computes the Number of False Alarms (NFA). If $\mathrm{NFA}(k) < \tau$, the entire grid phase is flagged as blocking distortion.
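Eq. (9) and the phase selection can be sketched directly; the a contrario NFA validation is omitted here and replaced by a simple argmax over the accumulator:

```python
import numpy as np

def blocking_phase(Y):
    """Accumulate |horizontal cross-differences| per column offset mod 8
    (Eq. 9); return the dominant grid phase and the accumulator A_h."""
    D_h = np.abs(np.diff(Y.astype(np.float64), axis=1))  # Y(x+1,y) - Y(x,y)
    A_h = np.zeros(8)
    for k in range(8):
        A_h[k] = D_h[:, k::8].sum()        # columns with x = k (mod 8)
    return int(np.argmax(A_h)), A_h
```

On a blocky image, one offset collects nearly all of the gradient energy while the other seven stay near the natural-image baseline; the NFA test then decides whether that spike is statistically significant rather than thresholding it ad hoc.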

Ringing. A Sobel operator $S$ extracts the edge map $E$. Morphological dilation $E_{d} = E \oplus B$ defines the proximity zone. Local variances inside and outside are compared:

$$R(x,y) \;=\; \frac{\mathrm{Var}_{E_{d}}\!\bigl(Y\bigr)}{\mathrm{Var}_{\overline{E_{d}}}\!\bigl(Y\bigr)}$$
(10)

If $R > \gamma$, high-frequency parasitic oscillations are confirmed and the region is classified into the Ringing_Mask.
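A compact sketch of the edge-dilation variance ratio in Eq. (10); the edge threshold and dilation radius are illustrative, and the edge skeleton itself is excluded from the proximity zone so that the step edge does not dominate the ratio:

```python
import numpy as np
from scipy.ndimage import sobel, binary_dilation

def ringing_ratio(Y, edge_thresh=50.0, dilate_iter=3, eps=1e-6):
    """Eq. (10): luminance variance in the dilated edge vicinity divided
    by variance away from edges; large values indicate Gibbs ringing."""
    Y = Y.astype(np.float64)
    grad = np.hypot(sobel(Y, axis=0), sobel(Y, axis=1))
    edges = grad > edge_thresh                     # binary edge skeleton E
    near = binary_dilation(edges, iterations=dilate_iter) & ~edges  # E_d \ E
    far = ~(near | edges)
    if near.sum() == 0 or far.sum() == 0:
        return 0.0
    return Y[near].var() / (Y[far].var() + eps)
```

Per-region thresholding of this ratio against $\gamma$ then populates the Ringing_Mask.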

Implementation Pipeline

The three parallel execution branches converge into an aggregation head that produces the final score and diagnostic tensor.

Pipeline overview — inputs $I_r$, $I_t$ (reference · target) feed three parallel branches: Deep Feature Statistics (VGG16 · A-DISTS adaptive pooling), Chromatic Transport (Oklab · Earth Mover's Distance via Sinkhorn), and Spatial Heuristics (blocking · ringing · noise · blur). The branches converge in an aggregation head (log-probability sum) that emits the final score (scalar $\in [0, 1]$) plus a five-channel diagnostic map.

The pipeline outputs a comprehensive dictionary:

  • score — scalar $[0,1]$: the FR-IQA score, calibrated to substitute human MOS.
  • anomaly — continuous spatial map of $D_M^2$: absolute perceptual deviation.
  • color — dense Oklab EMD map: chromatic degradation magnitude.
  • structure — local structural similarity from SSIM $\times$ deep features.
  • blocking — binary mask of detected DCT block boundaries (NFA-validated).
  • ringing — dilated-edge variance-ratio mask for Gibbs oscillations.
  • dominant_artifact — argmax over HF-energy-gated severity contributions.

By synthesizing deterministic frequency-domain heuristics with the semantic depth of convolutional features, and enveloping both within a rigorous multivariate probabilistic framework, UPIQAL offers an exhaustive, fully automated, and highly interpretable solution to the complexities of modern Image Quality Assessment.

References

  1. Zoom-IQA: Image Quality Assessment with Reliable Region-Aware Reasoning. arXiv:2601.02918.
  2. Structural similarity index measure. Wikipedia.
  3. Structured Uncertainty Similarity Score (SUSS). arXiv:2512.03701.
  4. Analysis of PSNR, SSIM, LPIPS in the context of human perception. ResearchGate, 2026.
  5. R-LPIPS: An Adversarially Robust Perceptual Similarity Metric. arXiv:2307.15157.
  6. New Texture Similarity Metrics. University of Michigan.
  7. SSIM vs. LPIPS — Patsnap Eureka.
  8. What Is LPIPS and How It Measures Perceptual Similarity — Patsnap Eureka.
  9. Learned Perceptual Image Patch Similarity (LPIPS) — PyTorch-Metrics docs.
  10. Evaluation of Objective Image Quality Metrics for High-Fidelity Image Compression. arXiv:2509.13150.
  11. Optimization results by DISTS, LPIPS, and DeepWSD — ResearchGate.
  12. Comparison of Full-Reference Image Quality Models for Optimization. PMC.
  13. dingkeyan93/DISTS — GitHub.
  14. Full-Reference Image Quality Assessment with Transformer and DISTS. MDPI.
  15. DISTS PyTorch implementation. GitHub.
  16. Image Quality Assessment: Unifying Structure and Texture Similarity. NYU CNS.
  17. Locally Adaptive Structure and Texture Similarity for Image Quality Assessment. arXiv:2110.08521.
  18. A-DISTS publication. CityU Scholars.
  19. A-DISTS review. Liner.
  20. Adaptive Structure and Texture Similarity Metric. ResearchGate.
  21. SUSS preprint — ResearchGate.
  22. Localizing Perceptual Artifacts in Synthetic Images. MDPI Electronics.
  23. A New Image Similarity Metric — Perceptual and Transparent Geometric and Chromatic Assessment. arXiv:2601.19680.
  24. Similarity and Quality Metrics for MR Image-to-Image Translation. arXiv:2405.08431.
  25. The Earth Mover's Distance (EMD). Stanford InfoLab.
  26. Helmlab vs Oklab vs CIEDE2000 — Medium, 2026.
  27. MUG: A Parameterless No-Reference JPEG Quality Evaluator. arXiv:1609.03461.
  28. Local JPEG Grid Detector via Blocking Artifacts. IPOL Journal.
  29. Local JPEG Grid Detector PDF. IPOL.
  30. Image Deconvolution Ringing Artifact Detection and Removal via PSF Frequency Analysis. ResearchGate.
  31. Image Quality Assessment for Gibbs Ringing Reduction. MDPI Algorithms.
  32. Edge Map Guided Adaptive Post-Filter for Blocking and Ringing Artifacts. MERL TR2004-003.
  33. Removal of blocking and ringing artifacts in JPEG-coded images. ResearchGate.
  34. Measurement of ringing artifacts in JPEG images. ResearchGate.
  35. Similarity and quality metrics for MR image-to-image translation. ResearchGate.
  36. Detection of Image Artifacts Using Improved Cascade Region-Based CNN. MDPI.
  37. Reference-Free Image Quality Metric for Degradation and Reconstruction Artifacts. arXiv:2405.02208.
  38. The Earth Mover's Distance as a Metric for Image Retrieval. CMU SCS.
  39. Multivariate normal distribution (Reddit discussion).
  40. No-Reference Quality Assessment — UT Austin LIVE.
  41. Quantification of ring artifact visibility in CT. IACL SPIE.
  42. Quantification of ring artifact visibility in CT (PDF). ResearchGate.
  43. No-reference IQA based on noise, blurring and blocking. ResearchGate.
  44. Blur and Ringing Artifact Measurement in Image Compression using Wavelet Transform. Academia.edu.
  45. sunwei925/UIQA — GitHub.
  46. What is Adaptive Average Pooling — Stack Overflow.