mirror of
https://github.com/RGBCube/uutils-coreutils
synced 2025-07-27 19:17:43 +00:00
wc: Add benchmarking documentation
This commit is contained in:
parent
1358aeecdd
commit
94e33c97f3
1 changed files with 124 additions and 0 deletions
124
src/uu/wc/BENCHMARKING.md
Normal file
# Benchmarking wc

<!-- spell-checker:ignore (words) uuwc uucat largefile somefile Mshortlines moby lwcm cmds tablefmt -->

Much of what makes wc fast is avoiding unnecessary work. It has multiple strategies, depending on which data is requested.

## Strategies

### Counting bytes

In the case of `wc -c`, the content of the input doesn't have to be inspected at all; only its size has to be known. That enables a few optimizations.

#### File size

If it can, wc reads the file size directly instead of reading the file's contents. This is not interesting to benchmark, except to check that the optimization still works. Try `wc -c largefile`.
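
The idea behind this strategy can be illustrated with a minimal Python sketch. It is not the uutils implementation (which is written in Rust); the function names are hypothetical and only contrast asking the filesystem for the size with reading every byte.

```python
import os


def size_from_metadata(path):
    """Return the size the filesystem reports, without reading any data."""
    return os.stat(path).st_size


def size_by_reading(path, chunk_size=64 * 1024):
    """Return the size by reading every byte, as a slow point of comparison."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total
```

Both functions should agree on a regular file; only the second one gets slower as the file grows.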

#### `splice()`

On Linux, `splice()` is used to get the input's length while discarding the data directly.

The best way I've found to generate a fast input for testing `splice()` is to pipe the output of uutils `cat` into wc. Note that GNU `cat` is slower and therefore less suitable, and that if a file is given as the input directly (as in `wc -c < largefile`), the first strategy kicks in. Try `uucat somefile | wc -c`.
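
For readers unfamiliar with `splice()`, here is a rough Python sketch of counting bytes this way. It assumes the data is discarded into `/dev/null` and that `os.splice()` is available (Linux, Python 3.10 or newer, input is a pipe); the function name is hypothetical, and the `read()` fallback is an addition of this sketch rather than a description of what uutils does.

```python
import os


def count_bytes_from_fd(fd, chunk_size=1024 * 1024):
    """Count the bytes available on fd, discarding them along the way."""
    total = 0
    if hasattr(os, "splice"):
        devnull = os.open(os.devnull, os.O_WRONLY)
        try:
            while True:
                # Move data from fd to /dev/null without copying it to user space.
                moved = os.splice(fd, devnull, chunk_size)
                if moved == 0:  # end of input
                    return total
                total += moved
        except OSError:
            pass  # e.g. fd is not a pipe; keep counting with read() below
        finally:
            os.close(devnull)
    # Fallback: read the remaining data and throw it away.
    while chunk := os.read(fd, chunk_size):
        total += len(chunk)
    return total
```

Calling `count_bytes_from_fd(0)` in a script that receives piped input would count the bytes on standard input.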

### Counting lines

In the case of `wc -l` or `wc -cl`, the input doesn't have to be decoded. It's read in chunks, and the `bytecount` crate is used to count the newlines.
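
As a rough illustration of this strategy, the following Python sketch counts newline bytes chunk by chunk without decoding anything. It uses `bytes.count` where uutils uses the `bytecount` crate, and the function name and chunk size are assumptions of the sketch.

```python
import io


def count_newlines(stream, chunk_size=64 * 1024):
    """Count b"\\n" bytes in a binary stream, reading it in fixed-size chunks."""
    count = 0
    while chunk := stream.read(chunk_size):
        count += chunk.count(b"\n")
    return count


sample = io.BytesIO(b"one\ntwo\n\nthree")
print(count_newlines(sample))  # 3: the unterminated last line is not counted
```

Because a newline is a single byte, it can never be split across a chunk boundary, so the chunk size does not affect the result.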

It's useful to vary the line length in the input. GNU wc seems particularly slow on short lines.

### Processing Unicode

This is the most general strategy, and it's necessary for counting words, characters, and line lengths. Individual steps are still switched on and off depending on what must be reported.

Try varying which of the `-w`, `-m`, `-l` and `-L` flags are used. (The `-c` flag is unlikely to make a difference.)

Passing no flags is equivalent to passing `-wcl`. That case should perhaps be given special attention, as it's the default.
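
One way to vary the flags systematically is to enumerate every non-empty combination. The sketch below is only a suggestion: the function name is hypothetical, and it emits templates in the same `{cmd}`/`{file}` placeholder style as the `cmds` list in the script at the end of this document.

```python
from itertools import combinations

FLAGS = "wmlL"


def flag_commands(flags=FLAGS):
    """Return one command template per non-empty combination of flags."""
    cmds = []
    for n in range(1, len(flags) + 1):
        for combo in combinations(flags, n):
            cmds.append("{cmd} -" + "".join(combo) + " {file}")
    return cmds


for template in flag_commands():
    print(template.format(cmd="wc", file="<FILE>"))
```

With four flags this yields 15 templates, so benchmarking all of them against several files takes a while.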

## Generating files

To generate a file with many very short lines, run `yes | head -c50000000 > 25Mshortlines`.
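
To try other line lengths, a small generator helps. The following is a sketch under stated assumptions: the function name and parameters are hypothetical, and with `line_length=2` it writes the same `y` lines as the `yes` command above.

```python
import io


def write_lines(stream, total_bytes, line_length):
    """Write total_bytes of 'y' lines, each line_length bytes including '\\n'.

    The last line is cut short if total_bytes is not a multiple of line_length.
    """
    if line_length < 1:
        raise ValueError("line_length must be at least 1")
    line = b"y" * (line_length - 1) + b"\n"
    # Write in blocks of roughly 1 MiB to keep the number of write calls low.
    block = line * max(1, (1024 * 1024) // line_length)
    written = 0
    while written < total_bytes:
        piece = block[: total_bytes - written]
        stream.write(piece)
        written += len(piece)
    return written


# Example (creates a 50 MB file, like the command above):
#     with open("25Mshortlines", "wb") as f:
#         write_lines(f, 50_000_000, 2)
demo = io.BytesIO()
write_lines(demo, 10, 2)
print(demo.getvalue())  # b'y\ny\ny\ny\ny\n'
```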

To get a file with less artificial contents, download a book from Project Gutenberg and concatenate it many times:

```
wget https://www.gutenberg.org/files/2701/2701-0.txt -O moby.txt
cat moby.txt moby.txt moby.txt moby.txt > moby4.txt
cat moby4.txt moby4.txt moby4.txt moby4.txt > moby16.txt
cat moby16.txt moby16.txt moby16.txt moby16.txt > moby64.txt
```

And get one with lots of Unicode too:

```
wget https://www.gutenberg.org/files/30613/30613-0.txt -O odyssey.txt
cat odyssey.txt odyssey.txt odyssey.txt odyssey.txt > odyssey4.txt
cat odyssey4.txt odyssey4.txt odyssey4.txt odyssey4.txt > odyssey16.txt
cat odyssey16.txt odyssey16.txt odyssey16.txt odyssey16.txt > odyssey64.txt
cat odyssey64.txt odyssey64.txt odyssey64.txt odyssey64.txt > odyssey256.txt
```
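
The repeated `cat` invocations above can also be collapsed into one helper if an arbitrary repeat count is wanted. This is a minimal sketch; `repeat_file` is a hypothetical name, and the intermediate files (`moby4.txt` and so on) are not created.

```python
import shutil


def repeat_file(src, dst, times):
    """Write `times` back-to-back copies of src into dst."""
    with open(dst, "wb") as out:
        for _ in range(times):
            with open(src, "rb") as inp:
                shutil.copyfileobj(inp, out)
```

For example, `repeat_file("moby.txt", "moby64.txt", 64)` should produce the same contents as the three `cat` steps.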

Finally, it's interesting to try a binary file. Look for one with `du -sh /usr/bin/* | sort -h`. On my system, `/usr/bin/docker` is a good candidate, as it's fairly large.
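
If a scripted search is preferred, the following sketch lists the largest regular files in a directory. The function name is hypothetical, and unlike `du` it reports the apparent file size rather than disk usage, so the ordering may differ slightly.

```python
import os


def largest_files(directory, limit=5):
    """Return (size, path) pairs for the largest regular files in directory."""
    sizes = []
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                if entry.is_file():  # follows symlinks by default
                    sizes.append((entry.stat().st_size, entry.path))
            except OSError:
                continue  # skip broken symlinks and unreadable entries
    return sorted(sizes, reverse=True)[:limit]


if os.path.isdir("/usr/bin"):
    for size, path in largest_files("/usr/bin"):
        print(size, path)
```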

## Running benchmarks

Use [`hyperfine`](https://github.com/sharkdp/hyperfine) to compare the performance. For example, `hyperfine 'wc somefile' 'uuwc somefile'`.
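
When driving `hyperfine` from a script, it can help to build the argument list in one place. The sketch below is a suggestion, not part of the original workflow: the helper names are hypothetical, and the warmup default is an arbitrary choice (hyperfine's `--warmup` option runs the commands a few times before measuring, which fills the file cache).

```python
import subprocess


def hyperfine_args(commands, warmup=3, export_json=None):
    """Build the argument list for a hyperfine run comparing `commands`."""
    args = ["hyperfine", "--warmup", str(warmup)]
    if export_json:
        args.append(f"--export-json={export_json}")
    args.extend(commands)
    return args


def run_hyperfine(commands, **kwargs):
    """Run hyperfine; raises CalledProcessError if it exits with an error."""
    subprocess.run(hyperfine_args(commands, **kwargs), check=True)
```

For example, `run_hyperfine(["wc somefile", "uuwc somefile"])` runs the comparison above with three warmup runs.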
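
If you want to get fancy and exhaustive, generate a table of relative speedups (the mean time of GNU `wc` divided by the mean time of `uuwc`, so values above 1 mean that uutils is faster):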

|                         | moby64.txt   | odyssey256.txt   | 25Mshortlines   | /usr/bin/docker   |
|-------------------------|--------------|------------------|-----------------|-------------------|
| `wc <FILE>`             | 1.3965       | 1.6182           | 5.2967          | 2.2294            |
| `wc -c <FILE>`          | 0.8134       | 1.2774           | 0.7732          | 0.9106            |
| `uucat <FILE> \| wc -c` | 2.7760       | 2.5565           | 2.3769          | 2.3982            |
| `wc -l <FILE>`          | 1.1441       | 1.2854           | 2.9681          | 1.1493            |
| `wc -L <FILE>`          | 2.1087       | 1.2551           | 5.4577          | 2.1490            |
| `wc -m <FILE>`          | 2.7272       | 2.1704           | 7.3371          | 3.4347            |
| `wc -w <FILE>`          | 1.9007       | 1.5206           | 4.7851          | 2.8529            |
| `wc -lwcmL <FILE>`      | 1.1687       | 0.9169           | 4.4092          | 2.0663            |

Beware that:

- Results are fuzzy and change from run to run
- You'll often want to check versions of uutils wc against each other instead of against GNU
- The table takes a long time to generate
- The table only shows the relative speedup, not the absolute time, which may be misleading if the time is very short

The table was created by the following Python script:

```python
import json
import subprocess

from tabulate import tabulate

# Binaries to compare: the baseline first, then the build under test.
bins = ["wc", "uuwc"]
files = ["moby64.txt", "odyssey256.txt", "25Mshortlines", "/usr/bin/docker"]
# Command templates; {cmd} and {file} are filled in for each benchmark.
cmds = [
    "{cmd} {file}",
    "{cmd} -c {file}",
    "uucat {file} | {cmd} -c",
    "{cmd} -l {file}",
    "{cmd} -L {file}",
    "{cmd} -m {file}",
    "{cmd} -w {file}",
    "{cmd} -lwcmL {file}",
]

table = []
for cmd in cmds:
    # Row label, shown with a placeholder instead of a real file name.
    row = ["`" + cmd.format(cmd="wc", file="<FILE>") + "`"]
    for file in files:
        # Benchmark both binaries on this file and export the results as JSON.
        subprocess.run(
            [
                "hyperfine",
                cmd.format(cmd=bins[0], file=file),
                cmd.format(cmd=bins[1], file=file),
                "--export-json=out.json",
            ],
            check=True,
        )
        with open("out.json") as f:
            res = json.load(f)["results"]
        # Relative speedup: mean time of bins[0] divided by mean time of bins[1].
        row.append(round(res[0]["mean"] / res[1]["mean"], 4))
    table.append(row)
print(tabulate(table, [""] + files, tablefmt="github"))
```

(You may have to adjust the `bins` and `files` variables depending on your setup, and please do add other interesting cases to `cmds`.)
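
To address the caveat that the table hides absolute times, each cell could carry them alongside the ratio. The following is a sketch only: `format_cell` is a hypothetical helper that takes the `results` list from hyperfine's JSON export (mean times are in seconds) and could replace the `round(...)` expression in the script.

```python
def format_cell(results):
    """Format the speedup plus both mean times from hyperfine's "results" list.

    results[0] is the baseline command and results[1] the one under test.
    """
    base = results[0]["mean"]
    test = results[1]["mean"]
    return f"{base / test:.4f} ({base * 1000:.1f} ms vs {test * 1000:.1f} ms)"


print(format_cell([{"mean": 0.5}, {"mean": 0.25}]))
# 2.0000 (500.0 ms vs 250.0 ms)
```

A cell like this makes it easier to spot cases where a large ratio comes from runs that only take a few milliseconds.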