wc: Add benchmarking documentation

2025-09-14 11:07:59 +00:00 · 2021-08-25 20:40:30 +02:00 · 2021-08-25 20:40:30 +02:00 · 94e33c97f3
commit 94e33c97f3
parent 1358aeecdd
1 changed files with 124 additions and 0 deletions
--- a/src/uu/wc/BENCHMARKING.md
+++ b/src/uu/wc/BENCHMARKING.md
@@ -0,0 +1,124 @@
 # Benchmarking wc
 <!-- spell-checker:ignore (words) uuwc uucat largefile somefile Mshortlines moby lwcm cmds tablefmt -->
 Much of what makes wc fast is avoiding unnecessary work. It has multiple strategies, depending on which data is requested.
 ## Strategies
 ### Counting bytes
 In the case of `wc -c` the content of the input doesn't have to be inspected at all; only its size has to be known. That enables a few optimizations.
 #### File size
 If it can, wc reads the file size directly. This is not interesting to benchmark, except to see if it still works. Try `wc -c largefile`.
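As a rough illustration of the idea (not the actual uutils code, which is written in Rust), the Python sketch below asks the operating system for the size of an open file instead of reading it. The helper name `size_if_regular_file` is made up for this example, and the sketch ignores details such as the current file offset.

```python
import os
import stat
from typing import Optional


def size_if_regular_file(fd: int) -> Optional[int]:
    """Return the size in bytes of the file behind fd without reading it.

    Returns None when fd is not a regular file (for example a pipe),
    in which case the size is not known up front and the input must be
    consumed some other way.
    """
    st = os.fstat(fd)
    if stat.S_ISREG(st.st_mode):
        return st.st_size
    return None
```

With `wc -c < largefile` the standard input is itself a regular file, which is why a shortcut like this can also apply there.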
 #### `splice()`
 On Linux `splice()` is used to get the input's length while discarding it directly.
 The best way I've found to generate a fast input to test `splice()` is to pipe the output of uutils `cat` into it. GNU `cat` is slower and therefore less suitable. Also note that if a file is given to wc directly as its standard input (as in `wc -c < largefile`), the first strategy kicks in instead. Try `uucat somefile | wc -c`.
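To make the technique concrete, here is a minimal Python sketch of counting bytes by splicing a pipe into `/dev/null`. It is an illustration under assumptions, not the uutils implementation: `count_bytes_from_pipe` and `CHUNK` are hypothetical names, `os.splice` is only available on Linux with Python 3.10 or newer, and the sketch falls back to plain reads when splicing is unavailable or fails.

```python
import os

CHUNK = 1 << 20  # hypothetical request size: up to 1 MiB per call


def count_bytes_from_pipe(fd: int) -> int:
    """Count the bytes readable from fd, discarding the data.

    Tries splice() into /dev/null first so the data never has to be
    copied into this process; falls back to read() otherwise.
    """
    total = 0
    splice = getattr(os, "splice", None)  # missing on non-Linux or Python < 3.10
    if splice is not None:
        null_fd = os.open(os.devnull, os.O_WRONLY)
        try:
            while True:
                moved = splice(fd, null_fd, CHUNK)
                if moved == 0:  # end of input
                    return total
                total += moved
        except OSError:
            pass  # splice not usable for this fd; continue with read() below
        finally:
            os.close(null_fd)
    while True:
        data = os.read(fd, CHUNK)
        if not data:
            return total
        total += len(data)
```

At least one side of a `splice()` call has to be a pipe, which matches the advice above to feed the input through `uucat somefile | wc -c`.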
 ### Counting lines
 In the case of `wc -l` or `wc -cl` the input doesn't have to be decoded. It's read in chunks and the `bytecount` crate is used to count the newlines.
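The chunked approach can be sketched in a few lines of Python. This is only an analogy: `bytes.count` stands in for the `bytecount` crate, and the function name and default chunk size are assumptions made for the example.

```python
import io
from typing import BinaryIO


def count_newlines(stream: BinaryIO, chunk_size: int = 64 * 1024) -> int:
    """Count b"\\n" bytes in a binary stream without decoding it."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read means end of input
            return total
        total += chunk.count(b"\n")


# Usage: a tiny chunk size forces several reads.
print(count_newlines(io.BytesIO(b"a\nb\n\nc"), chunk_size=2))  # prints 3
```

Because a newline is a single byte, chunk boundaries never split it, so no state has to be carried between reads.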
 It's useful to vary the line length in the input. GNU wc seems particularly slow on short lines.
 ### Processing Unicode
 This is the most general strategy, and it's necessary for counting words, characters, and line lengths. Individual steps are still switched on and off depending on what must be reported.
 Try varying which of the `-w`, `-m`, `-l` and `-L` flags are used. (The `-c` flag is unlikely to make a difference.)
 Passing no flags is equivalent to passing `-wcl`. That case should perhaps be given special attention as it's the default.
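The following Python sketch shows the general shape of such a strategy: decode the input incrementally and only do the counting steps that were requested. It is a simplified illustration, not the uutils algorithm. The function name and parameters are invented here, `-L` is left out, and the treatment of whitespace and invalid byte sequences is an assumption that may differ from what wc does.

```python
import codecs
import io
from typing import BinaryIO, Dict


def count_text(
    stream: BinaryIO,
    want_chars: bool = True,
    want_words: bool = True,
    want_lines: bool = True,
    chunk_size: int = 64 * 1024,
) -> Dict[str, int]:
    """Count characters, words and lines; skipped counts stay at 0."""
    # The incremental decoder keeps partial multi-byte sequences between chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    chars = words = lines = 0
    in_word = False  # carried across chunks so a split word is counted once
    while True:
        chunk = stream.read(chunk_size)
        text = decoder.decode(chunk, final=not chunk)
        if want_chars:
            chars += len(text)
        if want_lines:
            lines += text.count("\n")
        if want_words:
            for ch in text:
                if ch.isspace():
                    in_word = False
                elif not in_word:
                    in_word = True
                    words += 1
        if not chunk:
            break
    return {"chars": chars, "words": words, "lines": lines}
```

Switching off `want_words` removes the per-character loop, which is the kind of difference that varying the flags in a benchmark is meant to expose.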
 ## Generating files
 To generate a file with many very short lines, run `yes | head -c50000000 > 25Mshortlines`.
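Since it is useful to vary the line length, a small generator script can help. The sketch below is one possible way to do it; `write_fixed_lines` is a hypothetical helper, not part of uutils. Calling it with a line length of 1 and 50000000 bytes produces the same content as the `yes | head` command above.

```python
def write_fixed_lines(path: str, line_length: int, total_bytes: int) -> int:
    """Write lines of line_length 'y' characters until total_bytes is reached.

    Only whole lines are written, so the file can be slightly smaller
    than total_bytes. Returns the number of lines written.
    """
    line = b"y" * line_length + b"\n"
    count = total_bytes // len(line)
    batch_lines = 4096  # write many lines per call to keep generation fast
    batch = line * batch_lines
    full, rest = divmod(count, batch_lines)
    with open(path, "wb") as f:
        for _ in range(full):
            f.write(batch)
        f.write(line * rest)
    return count


# Usage (assumed file names):
# write_fixed_lines("25Mshortlines", 1, 50_000_000)
# write_fixed_lines("longlines", 4095, 50_000_000)
```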
 To get a file with less artificial contents, download a book from Project Gutenberg and concatenate it many times:
 ```
 wget https://www.gutenberg.org/files/2701/2701-0.txt -O moby.txt
 cat moby.txt moby.txt moby.txt moby.txt > moby4.txt
 cat moby4.txt moby4.txt moby4.txt moby4.txt > moby16.txt
 cat moby16.txt moby16.txt moby16.txt moby16.txt > moby64.txt
 ```
 And get one with lots of Unicode too:
 ```
 wget https://www.gutenberg.org/files/30613/30613-0.txt -O odyssey.txt
 cat odyssey.txt odyssey.txt odyssey.txt odyssey.txt > odyssey4.txt
 cat odyssey4.txt odyssey4.txt odyssey4.txt odyssey4.txt > odyssey16.txt
 cat odyssey16.txt odyssey16.txt odyssey16.txt odyssey16.txt > odyssey64.txt
 cat odyssey64.txt odyssey64.txt odyssey64.txt odyssey64.txt > odyssey256.txt
 ```
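If chaining `cat` by hand gets tedious, the repetition can also be scripted. This is a small sketch with a made-up helper name, `repeat_file`; it yields the same bytes as the `cat` chains above when called with 64 or 256 copies.

```python
import shutil


def repeat_file(src: str, dst: str, times: int) -> None:
    """Write `times` back-to-back copies of src into dst."""
    with open(dst, "wb") as out:
        for _ in range(times):
            with open(src, "rb") as inp:
                shutil.copyfileobj(inp, out)


# Usage (assuming the downloads above succeeded):
# repeat_file("moby.txt", "moby64.txt", 64)
# repeat_file("odyssey.txt", "odyssey256.txt", 256)
```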
 Finally, it's interesting to try a binary file. Look for one with `du -sh /usr/bin/* | sort -h`. On my system `/usr/bin/docker` is a good candidate as it's fairly large.
 ## Running benchmarks
 Use [`hyperfine`](https://github.com/sharkdp/hyperfine) to compare the performance. For example, `hyperfine 'wc somefile' 'uuwc somefile'`.
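When the inputs are large, it may be worth adding warmup runs so that every timed run reads from a warm page cache. The sketch below only builds the command line rather than running it; `hyperfine_argv` is a hypothetical helper, and the warmup count of 3 is an arbitrary choice.

```python
import shlex
from typing import List, Optional


def hyperfine_argv(
    commands: List[str], warmup: int = 3, export_json: Optional[str] = None
) -> List[str]:
    """Build a hyperfine invocation comparing the given shell commands."""
    argv = ["hyperfine", "--warmup", str(warmup)]
    if export_json is not None:
        argv.append(f"--export-json={export_json}")
    argv.extend(commands)
    return argv


argv = hyperfine_argv(["wc somefile", "uuwc somefile"], export_json="out.json")
print(shlex.join(argv))
# prints: hyperfine --warmup 3 --export-json=out.json 'wc somefile' 'uuwc somefile'
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once `hyperfine` and both binaries are on the `PATH`.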
 If you want to get fancy and exhaustive, generate a table. Each cell is the mean run time of GNU `wc` divided by the mean run time of `uuwc`, so values above 1 mean that uutils is faster:
 |                         |   moby64.txt |   odyssey256.txt |   25Mshortlines |   /usr/bin/docker |
 |-------------------------|--------------|------------------|-----------------|-------------------|
 | `wc <FILE>`             |       1.3965 |           1.6182 |          5.2967 |            2.2294 |
 | `wc -c <FILE>`          |       0.8134 |           1.2774 |          0.7732 |            0.9106 |
 | `uucat <FILE> \| wc -c` |       2.7760 |           2.5565 |          2.3769 |            2.3982 |
 | `wc -l <FILE>`          |       1.1441 |           1.2854 |          2.9681 |            1.1493 |
 | `wc -L <FILE>`          |       2.1087 |           1.2551 |          5.4577 |            2.1490 |
 | `wc -m <FILE>`          |       2.7272 |           2.1704 |          7.3371 |            3.4347 |
 | `wc -w <FILE>`          |       1.9007 |           1.5206 |          4.7851 |            2.8529 |
 | `wc -lwcmL <FILE>`      |       1.1687 |           0.9169 |          4.4092 |            2.0663 |
 Beware that:
 - Results are fuzzy and change from run to run
 - You'll often want to check versions of uutils wc against each other instead of against GNU
 - This takes a lot of time to generate
 - This only shows the relative speedup, not the absolute time, which may be misleading if the time is very short
 The table was created by the following Python script:
 ```python
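 import json
 import subprocess
 from tabulate import tabulate  # third-party: pip install tabulate
 # bins[0] is the baseline (GNU wc), bins[1] is the candidate (uutils wc).
 bins = ["wc", "uuwc"]
 files = ["moby64.txt", "odyssey256.txt", "25Mshortlines", "/usr/bin/docker"]
 # Command templates: {cmd} is replaced by each binary, {file} by each input.
 cmds = [
     "{cmd} {file}",
     "{cmd} -c {file}",
     "uucat {file} | {cmd} -c",
     "{cmd} -l {file}",
     "{cmd} -L {file}",
     "{cmd} -m {file}",
     "{cmd} -w {file}",
     "{cmd} -lwcmL {file}",
 ]
 table = []
 for cmd in cmds:
     # The row label always shows plain `wc` and a <FILE> placeholder.
     row = ["`" + cmd.format(cmd="wc", file="<FILE>") + "`"]
     for file in files:
         # Benchmark both binaries on the same input and export the timings.
         subprocess.run(
             [
                 "hyperfine",
                 cmd.format(cmd=bins[0], file=file),
                 cmd.format(cmd=bins[1], file=file),
                 "--export-json=out.json",
             ],
             check=True,
         )
         with open("out.json") as f:
             res = json.load(f)["results"]
         # Relative speedup: baseline mean time divided by candidate mean time.
         row.append(round(res[0]["mean"] / res[1]["mean"], 4))
     table.append(row)
 print(tabulate(table, [""] + files, tablefmt="github"))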
 ```
 (You may have to adjust the `bins` and `files` variables depending on your setup, and please do add other interesting cases to `cmds`.)
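To soften the caveat about only seeing relative speedups, each cell could also carry the absolute mean times. The sketch below shows one way to format such a cell from the `results` list in hyperfine's JSON export. `format_cell` is a hypothetical helper, and it assumes the baseline command was passed to hyperfine first, as in the script above.

```python
from typing import Dict, List


def format_cell(results: List[Dict[str, float]]) -> str:
    """Format speedup plus absolute mean times (seconds) for one benchmark.

    results[0] is assumed to be the baseline and results[1] the candidate.
    """
    base = results[0]["mean"]
    new = results[1]["mean"]
    ratio = base / new
    return f"{ratio:.4f} ({base:.3f}s / {new:.3f}s)"


# Usage with made-up numbers, not real measurements:
print(format_cell([{"mean": 0.5}, {"mean": 0.25}]))  # prints 2.0000 (0.500s / 0.250s)
```

Seeing the absolute times next to the ratio makes it easier to spot cells where both runs are so short that the ratio is mostly noise.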