3 PERCEPTUAL METRIC
From the nonlinear image representation it is
possible to define a perceptual image distortion
metric similar to that proposed by Teo and Heeger
(1994). For that, we simply add an error pooling
stage. This computes a Minkowski sum with
exponent 2 of the differences \Delta r_i (multiplied by
constants k_i that adjust the overall gain) between the
nonlinear outputs from the reference image and
those from the distorted image (Valerio et al., 2004):

    \Delta r = \left( \sum_i (k_i \, \Delta r_i)^2 \right)^{1/2}    (2)
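The pooling in Eq. (2) is just a weighted L2 norm of the response differences. A minimal NumPy sketch (the gains k_i and the responses below are toy values, not the ones fitted in the paper):

```python
import numpy as np

def perceptual_distortion(r_ref, r_dist, k):
    """Minkowski pooling (exponent 2) of weighted response differences.

    r_ref, r_dist : nonlinear (divisively normalized) responses of the
                    reference and distorted images, as flat arrays.
    k             : per-coefficient gains k_i (hypothetical values here).
    """
    delta = k * (r_ref - r_dist)        # k_i * Delta r_i
    return np.sqrt(np.sum(delta ** 2))  # (sum_i (k_i Delta r_i)^2)^(1/2)

# Toy usage with made-up responses and uniform gains.
r_ref = np.array([0.8, 0.1, -0.3, 0.5])
r_dist = np.array([0.7, 0.2, -0.3, 0.4])
k = np.ones(4)
d = perceptual_distortion(r_ref, r_dist, k)
```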
This perceptual metric has two main differences
with respect to that by Teo and Heeger (1994). First,
the divisive normalization considers neighbouring
responses not only in orientation but also in
position and scale. Second, the parameters of the
divisive normalization are adapted to natural image
statistics instead of being fixed exclusively to fit
psychophysical data.
4 CODING RESULTS
In order to compare the coding efficiency of the 9/7
transform alone and our nonlinear transform (the 9/7
transform plus the divisive normalization), we have
conducted a series of compression experiments with
a simplified JPEG 2000 codec. Basically, the coding
is as follows. First, the input image is preprocessed
(the nominal dynamic range of the samples is
adjusted by subtracting a bias of 2^(P-1), where P is
the number of bits per sample, from each of the
sample values). Then, the intracomponent transform takes
place. This can be the 9/7 transform or our nonlinear
transform. In both cases, we use the implementation
of the 9/7 transform in the JasPer software (Adams
and Kossentini, 2000). After quantization is
performed in the encoder (we fix the quantizer step
size at one, that is, there is no quantization), tier-1
coding takes place.
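The level-shift preprocessing step can be sketched as follows (the function name is ours, not JasPer's; P = 8 for an 8 bpp image):

```python
import numpy as np

def level_shift(samples, P=8):
    """Subtract the bias 2**(P-1) so that unsigned P-bit samples
    become signed values centred on zero (JPEG 2000 DC level shift)."""
    return samples.astype(np.int32) - (1 << (P - 1))

img = np.array([[0, 128, 255]], dtype=np.uint8)
shifted = level_shift(img)  # [[-128, 0, 127]]
```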
In the tier-1 coder, each subband is partitioned
into code blocks (the code block size is 64x64), and
each of the code blocks is independently coded. The
coding is performed using a bit-plane coder. There is
only one coding pass per bit plane and the samples
are scanned in a fixed order as follows. The code
block is partitioned into horizontal stripes, each
having a nominal height of four samples. The stripes
are scanned from top to bottom. Within a stripe,
columns are scanned from left to right. Within a
column, samples are scanned from top to bottom.
The sign of each sample is coded with a single
binary symbol right before its most significant bit.
The bit-plane encoding process generates a sequence
of symbols that are entropy coded. For the purposes
of entropy coding, a simple adaptive binary
arithmetic coder is used. All of the coding passes of
a code block form a single codeword (per-segment
termination).
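The stripe-oriented scan order described above can be expressed as a coordinate generator (stripe height of four as in the text; this is an illustration of the scan order, not the JasPer implementation):

```python
def scan_order(height, width, stripe_height=4):
    """Yield (row, col) pairs in JPEG 2000 stripe scan order:
    stripes top to bottom, columns left to right within a stripe,
    samples top to bottom within a column."""
    for stripe_top in range(0, height, stripe_height):
        for col in range(width):
            for row in range(stripe_top,
                             min(stripe_top + stripe_height, height)):
                yield row, col

# Two stripes of a tall, narrow block for illustration.
order = list(scan_order(8, 2))
```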
Tier-1 coding is followed by tier-2 coding, in
which the coding pass information is packaged.
Each packet consists of two parts: header and body.
The header indicates which coding passes are
included in the packet, while the body contains the
actual coding pass data. The coding passes included
in the packet are always the most significant ones,
and we use a fixed-point representation with 13 bits
after the binary point, so that we only need to store
the maximum number of bit planes of each code
block.
In tier-2 coding, rate control is achieved through
the selection of the subset of coding passes to
include in the code stream. The encoder knows the
contribution that each coding pass makes to the rate,
and can also calculate the distortion reduction
associated with each coding pass. Using this
information, the encoder can then include the coding
passes in order of decreasing distortion reduction per
unit rate until the bit budget has been exhausted.
This approach is very flexible and permits the use of
different distortion metrics.
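The rate-control strategy amounts to a greedy selection by distortion reduction per unit rate; a sketch with hypothetical (rate, distortion reduction) pairs, ignoring the constraint that passes within a code block must be taken in order:

```python
def select_passes(passes, budget):
    """Greedy rate control: include coding passes in order of decreasing
    distortion reduction per unit rate until the bit budget is spent.

    passes : list of (rate_bits, distortion_reduction) tuples (hypothetical).
    budget : total bit budget.
    """
    ranked = sorted(passes, key=lambda p: p[1] / p[0], reverse=True)
    chosen, spent = [], 0
    for rate, d_red in ranked:
        if spent + rate <= budget:
            chosen.append((rate, d_red))
            spent += rate
    return chosen, spent

chosen, spent = select_passes([(100, 50.0), (200, 40.0), (150, 90.0)], 300)
```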
Figs. 1 and 2 show some compression results
with the codec described above. In both figures the
input image is a 128x128 patch of the 8 bpp
“Baboon” image (this size is chosen for simplicity,
since it yields only one code block per subband), and
we consider only the lowest-scale vertical subband.
The results are very different
depending on the distortion metric used. As we can
see in Fig. 1, if we use the classical mean squared
error (MSE) as distortion metric (note that the MSE
is not very well matched to perceived visual quality)
the 9/7 transform yields better results than the
nonlinear transform. However, the nonlinear
transform yields better perceptual quality than the
9/7 transform (see Fig. 2).
In Fig. 3 we can see that the MSE, or
equivalently the peak signal-to-noise ratio (PSNR),
is not very well matched to perceived visual quality.
So, despite their very different MSE (the PSNR
corresponding to the 9/7 transform is more than 10
dB greater than that of the nonlinear transform), the
two decoded images shown in the figure are almost
visually indistinguishable.
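The PSNR figures compared in Fig. 3 follow from the MSE in the usual way; a sketch for 8 bpp images (the example arrays are toy data):

```python
import numpy as np

def psnr(original, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB (peak = 255 for 8 bpp images)."""
    mse = np.mean((original.astype(np.float64)
                   - decoded.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((4, 4), dtype=np.uint8)
b = np.full((4, 4), 10, dtype=np.uint8)
value = psnr(a, b)  # MSE = 100, so 10*log10(255**2 / 100)
```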
NONLINEAR PRIMARY CORTICAL IMAGE REPRESENTATION FOR JPEG 2000 - Applying natural image statistics
and visual perception to image compression