mirror of https://github.com/RGBCube/serenity synced 2025-07-27 19:17:44 +00:00

LibCompress: Use prefix tables to decode Huffman codes up to 8 bits long

Huffman codes have a useful property in that they are prefix codes. That
is, the bit sequence representing one Huffman-coded symbol is never a
prefix of the sequence for another symbol. This allows us to create a
table in which every index whose leading bits match a Huffman code maps
to the entry for that code.
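
To make the table idea concrete, here is a minimal sketch (not the LibCompress implementation) that fills a table for a toy code limited to 3 bits. The names `Entry`, `Code`, `table_bits` and `build_table` are hypothetical, and the sketch assumes the first bit read is the most significant bit of the index, which is a simplification of how Deflate orders bits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical table entry; a length of 0 means "no code maps here".
struct Entry {
    uint16_t symbol;
    uint16_t length;
};

// Hypothetical description of one Huffman code.
struct Code {
    uint16_t symbol;
    uint16_t bits;   // code bits, right-aligned
    uint16_t length; // number of bits in the code
};

// Assumption for this sketch: no code is longer than 3 bits.
constexpr std::size_t table_bits = 3;

// Fill every index whose leading `length` bits equal the code. Because
// no code is a prefix of another, the filled ranges never overlap.
std::array<Entry, 1u << table_bits> build_table(Code const* codes, std::size_t count)
{
    std::array<Entry, 1u << table_bits> table {};
    for (std::size_t i = 0; i < count; ++i) {
        Code const& code = codes[i];
        assert(code.length >= 1 && code.length <= table_bits);

        std::size_t shift = table_bits - code.length;
        std::size_t first = static_cast<std::size_t>(code.bits) << shift;
        std::size_t span = std::size_t { 1 } << shift;
        for (std::size_t index = first; index < first + span; ++index)
            table[index] = { code.symbol, code.length };
    }
    return table;
}
```

With the codes A=0, B=10, C=110 and D=111, indices 0 to 3 map to A, 4 and 5 map to B, 6 maps to C and 7 maps to D, so a single lookup on the next 3 input bits yields both the symbol and the number of bits to consume.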

With Deflate, we can have codes up to 16 bits in length, so a prefix
table covering every code would need 2^16 entries. Instead of creating a
table that fits all possible codes, we use a cutoff of 8 bits. Codes
longer than 8 bits fall back to the binary search method.
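
The cutoff and fallback can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this commit: the `Decoder` class, its members and the 16-bit `window` argument are hypothetical, the first bit read is assumed to sit in the most significant bit of the window, and the slow path here binary searches a list sorted by (length, bits), which may differ from how LibCompress organizes its search.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr std::size_t max_prefixed_length = 8; // cutoff for the fast table
constexpr std::size_t max_code_length = 16;    // longest code considered

struct Code {
    uint16_t symbol;
    uint16_t bits;   // code bits, right-aligned
    uint16_t length; // number of bits in the code
};

struct Decoded {
    uint16_t symbol;
    uint16_t length; // 0 in a table slot means "no short code here"
};

class Decoder {
public:
    explicit Decoder(std::vector<Code> const& codes)
    {
        for (auto const& code : codes) {
            assert(code.length >= 1 && code.length <= max_code_length);
            if (code.length <= max_prefixed_length) {
                // Short code: fill every 8-bit index that starts with it.
                std::size_t shift = max_prefixed_length - code.length;
                std::size_t first = static_cast<std::size_t>(code.bits) << shift;
                std::size_t span = std::size_t { 1 } << shift;
                for (std::size_t i = first; i < first + span; ++i)
                    m_table[i] = { code.symbol, code.length };
            } else {
                // Long code: keep it for the binary search fallback.
                m_long_codes.push_back(code);
            }
        }
        std::sort(m_long_codes.begin(), m_long_codes.end(), by_key);
    }

    // `window` holds the next 16 input bits, first bit read in the MSB.
    std::optional<Decoded> decode(uint16_t window) const
    {
        // Fast path: one lookup on the leading 8 bits.
        Decoded const& entry = m_table[window >> (max_code_length - max_prefixed_length)];
        if (entry.length != 0)
            return entry;

        // Slow path: binary search for each candidate length above the cutoff.
        for (uint16_t length = max_prefixed_length + 1; length <= max_code_length; ++length) {
            uint16_t bits = window >> (max_code_length - length);
            Code probe { 0, bits, length };
            auto it = std::lower_bound(m_long_codes.begin(), m_long_codes.end(), probe, by_key);
            if (it != m_long_codes.end() && it->length == length && it->bits == bits)
                return Decoded { it->symbol, it->length };
        }
        return std::nullopt;
    }

private:
    static bool by_key(Code const& a, Code const& b)
    {
        return a.length != b.length ? a.length < b.length : a.bits < b.bits;
    }

    std::array<Decoded, 1u << max_prefixed_length> m_table {};
    std::vector<Code> m_long_codes;
};
```

The fast table in this sketch has 256 entries of 4 bytes each, about 1 KiB, whereas a table indexed by all 16 bits would need 65536 entries. Only inputs whose leading 8 bits match no short code pay for the slower search.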

Using the "enwik8" file as a test (100MB uncompressed, commonly used in
benchmarks: https://www.mattmahoney.net/dc/enwik8.zip), decompression
time decreases from 3.527s to 2.585s on Linux.
Timothy Flynn 2023-03-28 14:45:20 -04:00 committed by Andreas Kling
parent 8e834d4bb2
commit 5aaefe4e62
2 changed files with 63 additions and 11 deletions

@@ -30,10 +30,20 @@ public:
    static Optional<CanonicalCode> from_bytes(ReadonlyBytes);
private:
    static constexpr size_t max_allowed_prefixed_code_length = 8;
    struct PrefixTableEntry {
        u16 symbol_value { 0 };
        u16 code_length { 0 };
    };
    // Decompression - indexed by code
    Vector<u16> m_symbol_codes;
    Vector<u16> m_symbol_values;
    Array<PrefixTableEntry, 1 << max_allowed_prefixed_code_length> m_prefix_table {};
    size_t m_max_prefixed_code_length { 0 };
    // Compression - indexed by symbol
    Array<u16, 288> m_bit_codes {}; // deflate uses a maximum of 288 symbols (maximum of 32 for distances)
    Array<u16, 288> m_bit_code_lengths {};