From 94e33c97f363097242e98f537f3aa1acbb61baf3 Mon Sep 17 00:00:00 2001
From: Jan Verbeek
Date: Wed, 25 Aug 2021 20:40:30 +0200
Subject: [PATCH] wc: Add benchmarking documentation

---
 src/uu/wc/BENCHMARKING.md | 124 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)
 create mode 100644 src/uu/wc/BENCHMARKING.md

diff --git a/src/uu/wc/BENCHMARKING.md b/src/uu/wc/BENCHMARKING.md
new file mode 100644
index 000000000..b9d8cc22d
--- /dev/null
+++ b/src/uu/wc/BENCHMARKING.md
@@ -0,0 +1,124 @@
+# Benchmarking wc
+
+
+
+Much of what makes wc fast is avoiding unnecessary work. It has multiple strategies, depending on which data is requested.
+
+## Strategies
+
+### Counting bytes
+
+In the case of `wc -c` the content of the input doesn't have to be inspected at all, only the size has to be known. That enables a few optimizations.
+
+#### File size
+
+If it can, wc reads the file size directly. This is not interesting to benchmark, except to see if it still works. Try `wc -c largefile`.
+
+#### `splice()`
+
+On Linux `splice()` is used to get the input's length while discarding it directly.
+
+The best way I've found to generate a fast input to test `splice()` is to pipe the output of uutils `cat` into it. Note that GNU `cat` is slower and therefore less suitable, and that if a file is given as its input directly (as in `wc -c < largefile`) the first strategy kicks in. Try `uucat somefile | wc -c`.
+
+### Counting lines
+
+In the case of `wc -l` or `wc -cl` the input doesn't have to be decoded. It's read in chunks and the `bytecount` crate is used to count the newlines.
+
+It's useful to vary the line length in the input. GNU wc seems particularly bad at short lines.
+
+### Processing unicode
+
+This is the most general strategy, and it's necessary for counting words, characters, and line lengths. Individual steps are still switched on and off depending on what must be reported.
+
+Try varying which of the `-w`, `-m`, `-l` and `-L` flags are used. (The `-c` flag is unlikely to make a difference.)
+
+Passing no flags is equivalent to passing `-wcl`. That case should perhaps be given special attention as it's the default.
+
+## Generating files
+
+To generate a file with many very short lines, run `yes | head -c50000000 > 25Mshortlines`.
+
+To get a file with less artificial contents, download a book from Project Gutenberg and concatenate it a lot of times:
+
+```
+wget https://www.gutenberg.org/files/2701/2701-0.txt -O moby.txt
+cat moby.txt moby.txt moby.txt moby.txt > moby4.txt
+cat moby4.txt moby4.txt moby4.txt moby4.txt > moby16.txt
+cat moby16.txt moby16.txt moby16.txt moby16.txt > moby64.txt
+```
+
+And get one with lots of unicode too:
+
+```
+wget https://www.gutenberg.org/files/30613/30613-0.txt -O odyssey.txt
+cat odyssey.txt odyssey.txt odyssey.txt odyssey.txt > odyssey4.txt
+cat odyssey4.txt odyssey4.txt odyssey4.txt odyssey4.txt > odyssey16.txt
+cat odyssey16.txt odyssey16.txt odyssey16.txt odyssey16.txt > odyssey64.txt
+cat odyssey64.txt odyssey64.txt odyssey64.txt odyssey64.txt > odyssey256.txt
+```
+
+Finally, it's interesting to try a binary file. Look for one with `du -sh /usr/bin/* | sort -h`. On my system `/usr/bin/docker` is a good candidate as it's fairly large.
+
+## Running benchmarks
+
+Use [`hyperfine`](https://github.com/sharkdp/hyperfine) to compare the performance. For example, `hyperfine 'wc somefile' 'uuwc somefile'`.
+
+If you want to get fancy and exhaustive, generate a table:
+
+|                        |   moby64.txt |   odyssey256.txt |   25Mshortlines |   /usr/bin/docker |
+|------------------------|--------------|------------------|-----------------|-------------------|
+| `wc `                  |       1.3965 |           1.6182 |          5.2967 |            2.2294 |
+| `wc -c `               |       0.8134 |           1.2774 |          0.7732 |            0.9106 |
+| `uucat | wc -c`        |       2.7760 |           2.5565 |          2.3769 |            2.3982 |
+| `wc -l `               |       1.1441 |           1.2854 |          2.9681 |            1.1493 |
+| `wc -L `               |       2.1087 |           1.2551 |          5.4577 |            2.1490 |
+| `wc -m `               |       2.7272 |           2.1704 |          7.3371 |            3.4347 |
+| `wc -w `               |       1.9007 |           1.5206 |          4.7851 |            2.8529 |
+| `wc -lwcmL `           |       1.1687 |           0.9169 |          4.4092 |            2.0663 |
+
+Beware that:
+- Results are fuzzy and change from run to run
+- You'll often want to check versions of uutils wc against each other instead of against GNU
+- This takes a lot of time to generate
+- This only shows the relative speedup, not the absolute time, which may be misleading if the time is very short
+
+Created by the following Python script:
+```python
+import json
+import subprocess
+
+from tabulate import tabulate
+
+bins = ["wc", "uuwc"]
+files = ["moby64.txt", "odyssey256.txt", "25Mshortlines", "/usr/bin/docker"]
+cmds = [
+    "{cmd} {file}",
+    "{cmd} -c {file}",
+    "uucat {file} | {cmd} -c",
+    "{cmd} -l {file}",
+    "{cmd} -L {file}",
+    "{cmd} -m {file}",
+    "{cmd} -w {file}",
+    "{cmd} -lwcmL {file}",
+]
+
+table = []
+for cmd in cmds:
+    row = ["`" + cmd.format(cmd="wc", file="") + "`"]
+    for file in files:
+        subprocess.run(
+            [
+                "hyperfine",
+                cmd.format(cmd=bins[0], file=file),
+                cmd.format(cmd=bins[1], file=file),
+                "--export-json=out.json",
+            ],
+            check=True,
+        )
+        with open("out.json") as f:
+            res = json.load(f)["results"]
+        row.append(round(res[0]["mean"] / res[1]["mean"], 4))
+    table.append(row)
+print(tabulate(table, [""] + files, tablefmt="github"))
+```
+(You may have to adjust the `bins` and `files` variables depending on your setup, and please do add other interesting cases to `cmds`.)
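
The last caveat in the patch's "Beware that" list (the table shows only the relative speedup, not the absolute time) could be softened by printing both. The following is a minimal sketch of that idea, not part of the patch. The helper name `summarize`, the `min_seconds` threshold, and the `(!)` marker are all hypothetical choices made for illustration. The only thing it relies on is the `mean` field (in seconds) of each entry in the `results` array, which the script in the patch already reads.

```python
import json


def summarize(results, min_seconds=0.01):
    """Return (ratio, label) for a two-command hyperfine result list.

    `results` is the "results" array from hyperfine's --export-json output.
    `min_seconds` is an arbitrary threshold picked for this sketch; runs
    faster than that are marked because their ratio is easy to misread.
    """
    baseline = results[0]["mean"]   # first command, e.g. GNU wc
    candidate = results[1]["mean"]  # second command, e.g. uuwc
    ratio = round(baseline / candidate, 4)
    label = f"{ratio} ({baseline:.3f}s / {candidate:.3f}s)"
    if min(baseline, candidate) < min_seconds:
        label += " (!)"
    return ratio, label


if __name__ == "__main__":
    # Hand-written sample data standing in for a real out.json.
    sample = json.loads('{"results": [{"mean": 0.5}, {"mean": 0.25}]}')
    print(summarize(sample["results"])[1])
```

To try it, the table-generating script could append `summarize(res)[1]` to `row` instead of the bare ratio. That makes the cells wider but shows at a glance when a large ratio comes from a run of only a few milliseconds.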