The adobe specification doesn't even consider JPEG images with a single
component. So let's not consider the content of the App14 segment for
grayscale images.
This is clearly something I missed during the first implementation. The
specification is crystal clear about it: "The quantization elements
shall be specified in zig-zag scan order."
This patch fixes the weird behavior we had when using the quantization
table.
This adds a decoder for the TinyVG vector format (https://tinyvg.tech/).
TinyVG is a very simple binary vector format, but it is good enough to
represent a lot of SVGs, without needing the full web engine.
The main use case for Serenity is for scalable icons (which .tvg easily
handles).
The ideal size is the size the user will display the image. Raster
formats should ignore this parameter, but vector formats can use
it to generate a bitmap of the ideal size.
Decoding progressive JPEGs involves a much more complicated logic than
sequential JPEGs. Thanks to template specialization, this patch allow us
to skip the additional cost of progressive images when it's not needed.
It gives a nice 10% improvements on sequential JPEGs :^)
This solution is a middle ground between re-computing `cos` every time
and a much more mathematically complicated approach (as we have in the
decoder).
While still being far from optimal it already gives us a 10x
improvement, not that bad :^)
Co-authored-by: Tim Flynn <trflynn89@pm.me>
This encoder is very naive as it only output SOF0 images and uses
pre-defined Huffman tables.
There is also a small bug with quantization which make using it
over-degrade the quality.
There was a TODO questioning whether breaking on >4bpp images
was the correct behaviour when RLE4 was detected. There is no
indication in the spec that RLE4 can be used with anything >4bpp,
so I believe this doesn't require additional follow-up.
MS Spec:
https://learn.microsoft.com/en-us/windows/win32/gdi/bitmap-compression
simple-vp8l-alpha-used-false.webp is a copy of simple-vp8l.webp,
with the byte at offset 0x18 changed from 0x10 to 0x00 -- that
is, the bit in the VP8L header that stores `is_alpha_used` is cleared.
We would already allocated a BGRx8888 instead of a BGRA8888 bitmap,
but keep actual alpha data in the `x` channel.
That lead to at least `image` still writing a PNG with an alpha channel.
So explicitly set the alpha channel to 0xff when is_alpha_used is false,
to make sure all consumers of decoded lossless webp data have behavior
consistent with other webp readers.
In practice, webp encoders usually don't write files that have
`is_alpha_used` set to false and then write actual alpha data to their
output. So this is rarely observable. However, for example for
lossy+ALPH webp files, the lossless webp used to store the ALPH channel
has `is_alpha_used` set to false and all channels but green are 0
(since the lossless green channel stores the alpha channel of a
lossy+ALPH webp). So if we dump such a bitmap to a standalone webp
file (e.g. with the temporary debugging code in fc3249a1ca),
then without this commit here, `image` would convert that webp to
a fully transparent webp, while other webp software would correctly
display the green image with opaque alpha.
Some of these functions can be called millions of times even for images
of moderate size. Inlining these functions really helps the compiler and
gives performance improvements up to 10%.
Inside each scan, raw data is read with the following rules:
- Each `0x00` that is preceded by `0xFF` should be discarded
- If multiple `0xFF` are following, only consider one.
That, plus the fact that we don't know the size of the scan beforehand
made us put a prepared stream in a vector for an easy, later on, usage.
This patch remove this duplication by realizing these operations in a
stream-friendly way.
This class is similar to some extent to an implementation of `Stream`,
but specialized for JPEG reading.
Technically, it is composed of a `Vector` to bufferize the input and
provides a simple interface to read bytes.
This patch aims for a future better and specialized control over the
input data. The encapsulation already allows us to get rid of the last
`seek` in the decoder, thus the ability to use the decoder with a raw
`Stream`.
As it provides faster `read` routines, this patch already reduces time
spend on reading.
Errors are now deferred until `finish_decode()` is finished, meaning
branches to return errors only need to occur at the end of a ranged
decode. If VPX_DEBUG is enabled, a debug message will be printed
immediately when an overread occurs.
Average decoding times for `Tests/LibGfx/test-inputs/4.webp` improve
by about 4.7% with this change, absolute decode times changing from
27.4ms±1.1ms down to 26.1ms±1.0ms.
This does a few things:
- The decoder uses a 32- or 64-bit integer as a reservoir of the data
being decoded, rather than one single byte as it was previously.
- `read_bool()` only refills the reservoir (value) when the size drops
below one byte. Previously, it would read out a bit-sized range from
the data to completely refill the 8-bit value, doing much more work
than necessary for each individual read.
- VP9-specific code for reading the marker bit was moved to its own
function in Context.h.
- A debug flag `VPX_DEBUG` was added to optionally enable checking of
the final bits in a VPX ranged arithmetic decode and ensure that it
contains all zeroes. These zeroes are a bitstream requirement for
VP9, and are also present for all our lossy WebP test inputs
currently. This can be useful to test whether all the data present in
the range has been consumed.
A lot of the size of this diff comes from the removal of error handling
from all the range decoder reads in LibVideo/VP9 and LibGfx/WebP (VP8),
since it is now checked only at the end of the range.
In a benchmark decoding `Tests/LibGfx/test-inputs/4.webp`, decode times
are improved by about 22.8%, reducing average runtime from 35.5ms±1.1ms
down to 27.4±1.1ms.
This should cause no behavioral changes.
The spec says "Decoders are not required to use this information in any
specified way" about this field, but that's probably a typo and belongs
in the previous section. At least, images in the wild look wrong
without this, for example
https://fjord.dropboxstatic.com/warp/conversion/dropbox/warp/en-us/dropbox/Integrations_4@2x.png?id=ce8269af-ef1a-460a-8199-65af3dd978a3&output_type=webp
Implementation-wise, this now copies both uncompressed and compressed
data to yet another buffer for processing the alpha, then does
filtering on that buffer, and then copies the filtered alpha data
into the final image. (The RGB data comes from a lossy webp.)
This is a bit wasteful and we could probably manage without that
local copy, but that'd make the code more convoluted, so this is
good enough for now, at least until we've added tests for this case.
No test, because I haven't yet found out how to create images in this
format.
Else, WebP files with a broken header just return "Decoding failed"
without any more details. This way, there's some debug logging with
more details.
Maybe we'll want to remove this again since it might lead to duplicate
error messages for files that have their error not in the header.
We'll see how this feels. (But most files don't have any errors, so
it's probably fine.)
The spec doesn't talk about this happening in the text, but
`dequant_init()` in 20.4 processes segment adjustment and quantization
index adjustment in the same variable `q` before clamping.
Since we had to adjust the latter step in the previous commit, do
it for the former step too.
I haven't seen this happen in the wild yet (and now, I hopefully
never will notice it if it happens).
It's up to callers of the ImageDecoderPlugin to honor loop_count().
The ImageDecoderPlugin doesn't have to look at it when decoding frames.
No behavior change.
The spec says that the AC dequantization factor for Y2 data should
be at least 8, so do that.
This only has a very small effect (only the first two AC table
entries are < 8 after multiplying with 155 / 100, so this would
have only a small effect on brightness), and this case is hit
exactly 0 times in all my test images. But it's still good to match
the spec.
For a 1024x1024 image, saves about a quarter MB of memory use while
decoding (compared to the decompressed image data itself needing
4 MiB). Not a huge win, but also very easy to do, so might as well.
No behavior change, no measurable performance impact.
This is safe because:
* prediction only computes averages, or explicitly clamps for
TM_PRED / B_TM_PRED. Since the inputs are in [0, 255], so will the
outputs.
* Addition of IDCT and prediction buffer is immediately clamped back
to [0, 255]
No behavior change, and matches what both libwebp and the reference
implementation in rfc6386 do.
https://datatracker.ietf.org/doc/html/rfc6386#section-14.5 says:
"""
The summing procedure is fairly straightforward, having only a couple
of details. The prediction and residue buffers are both arrays of
16-bit signed integers. Each individual (Y, U, and V pixel) result
is calculated first as a 32-bit sum of the prediction and residue,
and is then saturated to 8-bit unsigned range (using, say, the
clamp255 function defined above) before being stored as an 8-bit
unsigned pixel value.
"""
It's IMHO not 100% clear if the clamping is supposed to happen
immediately (so that it affects prediction inputs for the next
macroblock) or later.
But vp8_dixie_idct_add() on page 173 in
https://datatracker.ietf.org/doc/html/rfc6386#section-20.8 does:
recon[0] = CLAMP_255(predict[0] + ((a1 + d1 + 4) >> 3));
So it does look like it should happen immediately.
(I'm a bit confused why the spec then says "The prediction and residue
buffers are both arrays of 16-bit signed integers", since the
prediction buffer can just be an u8 buffer now, without changing
behavior.
The spec has that clamp at the end of
https://datatracker.ietf.org/doc/html/rfc6386#section-12.2:
The exact algorithm is as follows:
[...]
b[r][c] = clamp255(L[r]+ A[c] - P);
For the test images I'm looking at, it doesn't seem to make a
dramatic difference, but omitting it in `B_TM_PRED` did make
a dramatic difference, so add it. (Also, the spec demands it.)
Each secondary partition has an independent BooleanDecoder.
Their bitstreams interleave per macroblock row, that is the first
macroblock row is read from the first decoder, the second from the
second, ..., until it wraps around again.
All partitions share a single prediction state though: The second
macroblock row (which reads coefficients off the second decoder) is
predicted using the result of decoding the frist macroblock row (which
reads coefficients off the first decoder).
So if I understand things right, in theory the coefficient reading could
be parallelized, but prediction can't be. (IDCT can also be
parallelized, but that's true with just a single partition too.)
I created the test image by running
examples/cwebp -low_memory -partitions 3 -o foo.webp \
~/src/serenity/Tests/LibGfx/test-inputs/4.webp
using a cwebp hacked up as described in #19149. Since creating
multi-partition lossy webps requires hacking up `cwebp`, they're likely
very rare in practice. (But maybe other programs using the libwebp API
create them.)
Fixes#19149.
With this, webp lossy support is complete (*) :^)
And with that, webp support is complete: Lossless, lossy, lossy with
alpha, animated lossless, animated lossy, animated lossy with alpha all
work.
(*: Loop filtering isn't implemented yet, which has a minor visual
effect on the output. But it's only visible when carefully comparing
a webp decoded without loop filtering to the same decoded with it.
But it's technically a part of the spec that's still missing.
The upsampling of UV in the YUV->RGB code is also low-quality. This
produces somewhat visible banding in practice in some images (e.g.
in the fire breather's face in 5.webp), so we should probably improve
that at some point. Our JPG decoder has the same issue.)