This is safe because:
* prediction only computes averages, or explicitly clamps for
TM_PRED / B_TM_PRED. Since the inputs are in [0, 255], so will the
outputs.
* Addition of IDCT and prediction buffer is immediately clamped back
to [0, 255]
No behavior change, and matches what both libwebp and the reference
implementation in rfc6386 do.
https://datatracker.ietf.org/doc/html/rfc6386#section-14.5 says:
"""
The summing procedure is fairly straightforward, having only a couple
of details. The prediction and residue buffers are both arrays of
16-bit signed integers. Each individual (Y, U, and V pixel) result
is calculated first as a 32-bit sum of the prediction and residue,
and is then saturated to 8-bit unsigned range (using, say, the
clamp255 function defined above) before being stored as an 8-bit
unsigned pixel value.
"""
It's IMHO not 100% clear if the clamping is supposed to happen
immediately (so that it affects prediction inputs for the next
macroblock) or later.
But vp8_dixie_idct_add() on page 173 in
https://datatracker.ietf.org/doc/html/rfc6386#section-20.8 does:
recon[0] = CLAMP_255(predict[0] + ((a1 + d1 + 4) >> 3));
So it does look like it should happen immediately.
(I'm a bit confused why the spec then says "The prediction and residue
buffers are both arrays of 16-bit signed integers", since the
prediction buffer can just be an u8 buffer now, without changing
behavior.
The spec has that clamp at the end of
https://datatracker.ietf.org/doc/html/rfc6386#section-12.2:
The exact algorithm is as follows:
[...]
b[r][c] = clamp255(L[r]+ A[c] - P);
For the test images I'm looking at, it doesn't seem to make a
dramatic difference, but omitting it in `B_TM_PRED` did make
a dramatic difference, so add it. (Also, the spec demands it.)
Each secondary partition has an independent BooleanDecoder.
Their bitstreams interleave per macroblock row, that is the first
macroblock row is read from the first decoder, the second from the
second, ..., until it wraps around again.
All partitions share a single prediction state though: The second
macroblock row (which reads coefficients off the second decoder) is
predicted using the result of decoding the frist macroblock row (which
reads coefficients off the first decoder).
So if I understand things right, in theory the coefficient reading could
be parallelized, but prediction can't be. (IDCT can also be
parallelized, but that's true with just a single partition too.)
I created the test image by running
examples/cwebp -low_memory -partitions 3 -o foo.webp \
~/src/serenity/Tests/LibGfx/test-inputs/4.webp
using a cwebp hacked up as described in #19149. Since creating
multi-partition lossy webps requires hacking up `cwebp`, they're likely
very rare in practice. (But maybe other programs using the libwebp API
create them.)
Fixes#19149.
With this, webp lossy support is complete (*) :^)
And with that, webp support is complete: Lossless, lossy, lossy with
alpha, animated lossless, animated lossy, animated lossy with alpha all
work.
(*: Loop filtering isn't implemented yet, which has a minor visual
effect on the output. But it's only visible when carefully comparing
a webp decoded without loop filtering to the same decoded with it.
But it's technically a part of the spec that's still missing.
The upsampling of UV in the YUV->RGB code is also low-quality. This
produces somewhat visible banding in practice in some images (e.g.
in the fire breather's face in 5.webp), so we should probably improve
that at some point. Our JPG decoder has the same issue.)
This basically adds the line
u8 token = TRY(
tree_decode(decoder, COEFFICIENT_TREE,
header.coefficient_probabilities[plane][band][tricky],
last_decoded_value == DCT_0 ? 2 : 0));
and calls it once for the 16 coefficients of a discrete cosine transform
that covers a 4x4 pixel subblock.
And then it does this 24 or 25 times per macroblock, for the 4x4 Y
subblocks and the 2x2 each U and V subblocks (plus an optional Y2 block
for some macroblocks).
It then adds a whole bunch of machinery to be able to compute `plane`,
`band`, and in particular `tricky` (which depends on if the
corresponding left or above subblock has non-zero coefficients).
(It also does dequantization, and does an inverse Walsh-Hadamard
transform when needed, to end up with complete DCT coefficients
in all cases.)
Pure code move (except of removing `static` on the two public functions
in the new header), not behavior change.
There isn't a lot of lossy decoder yet, but it'll make implementing it
more convenient.
No behavior change.