This now searches the memory in blocks, which should be slightly more
efficient. However, it doesn't make much difference (e.g. ~1% in LZMA
compression) in most real-world applications, as the non-hint function
is more expensive by orders of magnitude.
The "operation modes" of this function have very different focuses, and
trying to combine both in a way that shares as much code as possible
probably results in the worst performance for both.
Instead, split up the function into "existing distances" and "no
existing distances" so that we can optimize either case separately.
We will be adding extra logic to the CircularBuffer to optimize
searching, but this would negatively impact the performance of
CircularBuffer users that don't need that functionality.
This should keep the `read_some` function a bit flatter and shorter, and
make it easier to compare the match type decoding process against the
specification.
WebP lossless can have up to 2328 symbols. This code assumed the deflate
max of 288, leading to crashes for WebP lossless files using more than
288 symbols (such as Tests/LibGfx/test-inputs/simple-vp8l.webp).
Nothing writes WebP files at this point, so the m_bit_codes and
m_bit_code_lengths arrays aren't ever used in practice with more than
288 entries.
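A sketch of the resulting sizing (the read-side member names are
illustrative; m_bit_codes and m_bit_code_lengths are the write-side
arrays mentioned above):

    #include <AK/Array.h>
    #include <AK/Types.h>

    static constexpr size_t max_deflate_symbols = 288;
    static constexpr size_t max_webp_lossless_symbols = 2328;

    // The decoding tables must cover the full WebP lossless alphabet...
    Array<u16, max_webp_lossless_symbols> m_symbol_codes;
    // ...while the encoding-side tables keep the deflate bound for now,
    // since nothing writes WebP files yet.
    Array<u16, max_deflate_symbols> m_bit_codes;
    Array<u16, max_deflate_symbols> m_bit_code_lengths;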
Missing:
* Transform support (used by virtually all lossless WebP files)
* Meta prefix / entropy image support
Working:
* Decoding of regular image streams
* Color cache (sketched below)
This happens to be enough to be able to decode
Tests/LibGfx/test-inputs/extended-lossless.webp
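The color cache follows the scheme from the WebP lossless
specification: recently decoded ARGB colors go into a table of
2^cache_bits entries, indexed by a multiplicative hash. A minimal
sketch (not the decoder's exact code):

    #include <AK/Types.h>
    #include <AK/Vector.h>

    struct ColorCache {
        Vector<u32> entries; // 2^bits entries, zero-initialized
        u8 bits { 0 };       // the spec allows 1 to 11 bits

        // The hash multiplier is the one given in the specification.
        u32 index_of(u32 argb) const { return (0x1e35a7bdu * argb) >> (32 - bits); }

        void insert(u32 argb) { entries[index_of(argb)] = argb; }
        u32 lookup(u32 index) const { return entries[index]; }
    };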
The canonical prefix code is very similar to deflate's, enough so that
this can use Compress::CanonicalCode (and take advantage of all the
recent performance improvements there).
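For reference, the shared construction works roughly like this (a
sketch of the RFC 1951 scheme, not Compress::CanonicalCode itself):

    #include <AK/Array.h>
    #include <AK/Span.h>
    #include <AK/Types.h>

    // Count the codes of each length, derive the first code per length,
    // then hand out consecutive codes to symbols of equal length.
    void assign_canonical_codes(ReadonlySpan<u8> code_lengths, Span<u32> codes)
    {
        Array<u32, 16> length_counts {}; // both formats cap lengths at 15
        for (auto length : code_lengths)
            ++length_counts[length];
        length_counts[0] = 0;

        u32 code = 0;
        Array<u32, 16> next_code {};
        for (size_t length = 1; length < 16; ++length) {
            code = (code + length_counts[length - 1]) << 1;
            next_code[length] = code;
        }

        for (size_t i = 0; i < code_lengths.size(); ++i) {
            if (code_lengths[i] != 0)
                codes[i] = next_code[code_lengths[i]]++;
        }
    }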
deflate_special_code_length_copy has value 16, so it should be
before the two zero-filling branches for codes 17 and 18.
Also, the initial `if` refers to deflate_special_code_length_copy as
well, so if it appears again right in the next `else`, one has to keep
it on the mental stack for a shorter time when reading this code.
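In sketch form, the chain then reads straight down the symbol values
(the repeat counts are deflate's):

    static constexpr u32 deflate_special_code_length_copy = 16;

    void handle_code_length_symbol(u32 symbol)
    {
        if (symbol < deflate_special_code_length_copy) {
            // 0-15: a literal code length.
        } else if (symbol == deflate_special_code_length_copy) {
            // 16: repeat the previous code length 3-6 times.
        } else if (symbol == 17) {
            // 17: repeat a zero length 3-10 times.
        } else {
            // 18: repeat a zero length 11-138 times.
        }
    }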
No behavior change.
Alternatively, we could remove the `else` after the `continue`, but
all branches here should be equally prominent, so this seems a bit
nicer.
No behavior change.
This is very similar to the LittleEndianInputBitStream bit buffer change
from 8e834d4bb2.
We currently buffer one byte of data for the underlying stream. And when
we put bits onto that buffer, we do so 1 bit at a time.
This replaces the u8 buffer with a u64. And instead of looping at all,
we perform bitwise operations to write the desired number of bits.
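In sketch form (illustrative names; write_byte_to_stream stands in for
the underlying stream write):

    #include <AK/Error.h>
    #include <AK/Types.h>

    u64 m_bit_buffer { 0 };
    size_t m_bit_count { 0 };

    ErrorOr<void> write_bits(u64 bits, size_t count)
    {
        // Assumes the new bits still fit into the u64; the real code
        // must flush first once the buffer gets close to full.
        m_bit_buffer |= bits << m_bit_count;
        m_bit_count += count;

        // Flush every complete byte, instead of looping per bit.
        while (m_bit_count >= 8) {
            TRY(write_byte_to_stream(static_cast<u8>(m_bit_buffer)));
            m_bit_buffer >>= 8;
            m_bit_count -= 8;
        }
        return {};
    }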
Using the "enwik8" file as a test (100MB uncompressed, commonly used in
benchmarks: https://www.mattmahoney.net/dc/enwik8.zip), compression time
decreases from:
13.62s to 10.9s on Serenity (cold)
13.62s to 9.22s on Serenity (warm)
2.93s to 2.32s on Linux
One caveat is that this requires explicitly flushing any leftover bits
when the caller is done with the stream. The byte buffer implementation
implicitly flushed its data every time the buffer was byte-aligned, as
doing so would always fill the byte. This is no longer the case. But for
now, this should be fine as the one user of this class, DEFLATE, already
has a "flush everything now that we're done" finalizer.
This is copy-pasted from the gzip utility, along with its existing TODO.
This is currently only needed by that utility, but this gives us API
symmetry with GzipDecompressor, and helps ensure we won't end up in a
situation where only one utility receives optimizations that should be
received by all interested parties.
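Assumed usage, mirroring the decompression side (the exact signature
is an assumption):

    #include <LibCompress/Gzip.h>

    // Compress a buffer and round-trip it through the symmetric APIs.
    auto compressed = TRY(Compress::GzipCompressor::compress_all(input_bytes));
    auto round_tripped = TRY(Compress::GzipDecompressor::decompress_all(compressed.bytes()));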
Instead of reading bytes from the output stream into a buffer, just to
immediately write them back out, we can skip the middle-man and copy the
bytes directly into the output buffer.
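Illustrative before/after (names are hypothetical):

    // Before: bounce the data through a temporary buffer.
    Array<u8, 4096> temporary;
    auto chunk = TRY(dictionary.read_some(temporary.span()));
    chunk.copy_to(output_bytes);

    // After: let the dictionary fill the caller's buffer directly.
    auto written = TRY(dictionary.read_some(output_bytes));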
The constructor is now only concerned with creating the required
streams, which means that it no longer fails for XZ streams with
invalid headers. Instead, everything is parsed and validated during the
first read, preparing us for files with multiple streams.
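In sketch form (hypothetical member and function names):

    ErrorOr<Bytes> XzDecompressor::read_some(Bytes bytes)
    {
        // Header parsing moved out of the constructor, so an invalid
        // header now surfaces as an error on the first read.
        if (!m_parsed_stream_header) {
            TRY(parse_and_validate_stream_header());
            m_parsed_stream_header = true;
        }

        // ... block decoding into `bytes` continues from here ...
        return bytes;
    }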
We currently decode back-references one byte at a time, while writing
that byte back out to the output buffer. This is only necessary when the
back-reference refers to itself, i.e. when the back-reference distance
is less than its length. In other cases, we can read the entire back-
reference block in one shot.
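The distinction in sketch form (illustrative, using a flat output
buffer):

    #include <AK/Types.h>
    #include <string.h>

    void copy_back_reference(u8* output, size_t position, size_t distance, size_t length)
    {
        if (distance < length) {
            // The reference overlaps itself: a byte may depend on a byte
            // produced earlier in this same copy, so go one at a time.
            for (size_t i = 0; i < length; ++i)
                output[position + i] = output[position + i - distance];
        } else {
            // No overlap: read the whole block in one shot.
            memcpy(output + position, output + position - distance, length);
        }
    }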
Using the "enwik8" file as a test (100MB uncompressed, commonly used in
benchmarks: https://www.mattmahoney.net/dc/enwik8.zip), decompression
time decreases from:
5.8s to 4.89s on Serenity (cold)
2.3s to 1.72s on Serenity (warm)
1.6s to 1.06s on Linux