LibCompress: Use prefix tables to decode Huffman codes up to 8 bits long

Huffman codes have a useful property in that they are prefix codes. That is, the bit sequence encoding one symbol is never a prefix of the sequence encoding another symbol. This allows us to create a table in which every index whose leading bits match a given Huffman code maps to that code's symbol and length, so a single lookup on the next few input bits decodes a symbol.

Deflate allows codes up to 15 bits long, so a table covering every possible code would need 2^15 entries. Instead of creating a table large enough to fit all possible codes, we use a cutoff of 8 bits. Codes longer than 8 bits fall back to the binary search method.

Using the "enwik8" file as a test (100MB uncompressed, commonly used in benchmarks: https://www.mattmahoney.net/dc/enwik8.zip), decompression time decreases from 3.527s to 2.585s on Linux.
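
The lookup idea can be illustrated with a minimal standalone sketch. It is not the LibCompress implementation: it uses the standard library instead of AK, the names `Code`, `PrefixTable`, `build_prefix_table` and `lookup` are hypothetical, and it assumes codes are read most significant bit first so that a code is literally a prefix of its table indices. Deflate's bitstream is read least significant bit first, so a real decoder would likely index by the bit-reversed code instead.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical names throughout; this is an illustration, not the LibCompress API.
constexpr std::size_t max_prefixed_code_length = 8;

struct PrefixTableEntry {
    std::uint16_t symbol_value { 0 };
    std::uint16_t code_length { 0 }; // 0 means no short code maps to this index
};

struct Code {
    std::uint16_t bits;   // code value, most significant bit first (assumption)
    std::uint16_t length; // number of bits in the code
    std::uint16_t symbol; // decoded symbol
};

using PrefixTable = std::array<PrefixTableEntry, (1u << max_prefixed_code_length)>;

// Fill every index whose leading `length` bits equal the code.
PrefixTable build_prefix_table(std::vector<Code> const& codes)
{
    PrefixTable table {};
    for (auto const& code : codes) {
        if (code.length == 0 || code.length > max_prefixed_code_length)
            continue; // longer codes are left to a slower fallback path
        std::size_t shift = max_prefixed_code_length - code.length;
        std::size_t first = static_cast<std::size_t>(code.bits) << shift;
        std::size_t count = std::size_t { 1 } << shift;
        for (std::size_t i = 0; i < count; ++i) {
            table[first + i].symbol_value = code.symbol;
            table[first + i].code_length = code.length;
        }
    }
    return table;
}

// Decode from the next 8 input bits. An empty result means the code is
// longer than the cutoff and the caller has to use its fallback search.
std::optional<PrefixTableEntry> lookup(PrefixTable const& table, std::uint8_t next_bits)
{
    PrefixTableEntry entry = table[next_bits];
    if (entry.code_length == 0)
        return std::nullopt;
    return entry;
}
```

After a successful lookup the caller consumes only `code_length` bits of input, not all 8; the remaining bits belong to the next symbol. With the codes `0`, `10` and `110` plus one 9-bit code starting with `111`, indices 0 to 127 map to the first symbol, 128 to 191 to the second, 192 to 223 to the third, and 224 to 255 stay empty and trigger the fallback.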
2025-07-27 19:17:44 +00:00 · 2023-03-28 14:45:20 -04:00 · 5aaefe4e62
commit 5aaefe4e62
parent 8e834d4bb2
2 changed files with 63 additions and 11 deletions
--- a/Userland/Libraries/LibCompress/Deflate.h
+++ b/Userland/Libraries/LibCompress/Deflate.h
@@ -30,10 +30,20 @@ public:
    static Optional<CanonicalCode> from_bytes(ReadonlyBytes);

 private:
+    static constexpr size_t max_allowed_prefixed_code_length = 8;
+
+    struct PrefixTableEntry {
+        u16 symbol_value { 0 };
+        u16 code_length { 0 };
+    };
+
    // Decompression - indexed by code
    Vector<u16> m_symbol_codes;
    Vector<u16> m_symbol_values;

+    Array<PrefixTableEntry, 1 << max_allowed_prefixed_code_length> m_prefix_table {};
+    size_t m_max_prefixed_code_length { 0 };
+
    // Compression - indexed by symbol
    Array<u16, 288> m_bit_codes {}; // deflate uses a maximum of 288 symbols (maximum of 32 for distances)
    Array<u16, 288> m_bit_code_lengths {};