split: add BENCHMARKING.md documentation file

2025-09-14 11:07:59 +00:00 · 2021-12-31 13:16:38 -05:00 · 2021-12-31 13:16:38 -05:00 · b37718de10
commit b37718de10
parent 70ca1f45ea
1 changed file with 47 additions and 0 deletions
--- a/src/uu/split/BENCHMARKING.md
+++ b/src/uu/split/BENCHMARKING.md
@@ -0,0 +1,47 @@
+<!-- spell-checker:ignore testfile -->
+
+# Benchmarking to measure performance
+
+To compare the performance of the `uutils` version of `split` with the
+GNU version of `split`, you can use a benchmarking tool like
+[hyperfine][0]. On Ubuntu 18.04 or later, you can install `hyperfine` by
+running
+
+    sudo apt-get install hyperfine
+
+Next, build the `split` binary under the release profile:
+
+    cargo build --release -p uu_split
+
+Now, get an input file to test `split` on. The `split` program has three
+main modes of operation: chunk by lines (`-l`), chunk by bytes (`-b`),
+and chunk by lines with a byte limit (`-C`). You may want to test the
+performance of `split` with various shapes and sizes of input files and
+under various modes of operation. For example, to test chunking by bytes
+on a large input file, you can create a file named `testfile.txt`
+containing one million null bytes like this:
+
+    printf "%0.s\0" {1..1000000} > testfile.txt
+
+As another example, to test chunking by bytes on a large real-world
+input file, you could download a [database dump of Wikidata][1] or one
+of the related files that the Wikimedia project provides. [This file][2],
+once decompressed with `gunzip`, contains about 130 million lines.
+
+Finally, compare the two versions of `split` by running, for example,
+
+    cd /tmp && hyperfine \
+        --prepare 'rm x* || true' \
+        "split -b 1000 $OLDPWD/testfile.txt" \
+        "$OLDPWD/target/release/split -b 1000 $OLDPWD/testfile.txt"
+
+Since `split` creates a lot of files, I recommend changing to `/tmp`
+before running the benchmark; `$OLDPWD` then refers to the directory
+you started in, assumed here to be the repository root. The `--prepare`
+argument to `hyperfine` runs a specified command before each timing
+run. We specify `rm x* || true` so that the output files from the
+previous run of `split` are removed before each run begins.
+
+[0]: https://github.com/sharkdp/hyperfine
+[1]: https://www.wikidata.org/wiki/Wikidata:Database_download
+[2]: https://dumps.wikimedia.org/wikidatawiki/20211001/wikidatawiki-20211001-pages-logging.xml.gz
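
The workflow described in the added file can be sketched as a single script. This is a minimal sketch and not part of the commit: the names `uu_split`, `workdir`, and `testlines.txt` are illustrative, and it assumes the script is started from the repository root after `cargo build --release -p uu_split`, with `head -c`, `seq`, and `mktemp -d` available. It uses a private scratch directory rather than `/tmp` itself, so that the `x*` cleanup cannot touch unrelated files, and it covers all three modes (`-b`, `-l`, `-C`).

```shell
#!/usr/bin/env bash
# Sketch only: variable and file names below are illustrative assumptions.
# Assumes the current directory is the repository root and that
# `cargo build --release -p uu_split` has already been run.
uu_split="$PWD/target/release/split"

# Private scratch directory, so `rm -f x*` only ever sees split's output.
workdir=$(mktemp -d)
cd "$workdir"

# One million null bytes, without relying on Bash brace expansion.
head -c 1000000 /dev/zero > testfile.txt
# One million short lines, for the line-oriented modes.
seq 1 1000000 > testlines.txt

# Benchmark only when both hyperfine and the release binary are present.
if command -v hyperfine > /dev/null && [ -x "$uu_split" ]; then
    # -b: chunk by bytes, -l: chunk by lines, -C: lines with a byte limit.
    for args in "-b 1000 testfile.txt" "-l 1000 testlines.txt" "-C 1000 testlines.txt"; do
        hyperfine \
            --prepare 'rm -f x*' \
            "split $args" \
            "$uu_split $args"
    done
fi
```

The first command in each `hyperfine` call is whichever `split` is found on `PATH`, which is assumed to be the GNU version. Paths containing spaces would need extra quoting inside the benchmarked command strings.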