From b37718de1027d3d9d77fbeaf117245754cbc8d0a Mon Sep 17 00:00:00 2001 From: Jeffrey Finkelstein Date: Fri, 31 Dec 2021 13:16:38 -0500 Subject: [PATCH] split: add BENCHMARKING.md documentation file --- src/uu/split/BENCHMARKING.md | 47 ++++++++++++++++++++++++++++++++++++ 1 file changed, 47 insertions(+) create mode 100644 src/uu/split/BENCHMARKING.md diff --git a/src/uu/split/BENCHMARKING.md b/src/uu/split/BENCHMARKING.md new file mode 100644 index 000000000..9c8d5d17c --- /dev/null +++ b/src/uu/split/BENCHMARKING.md @@ -0,0 +1,47 @@ + + +# Benchmarking to measure performance + +To compare the performance of the `uutils` version of `split` with the +GNU version of `split`, you can use a benchmarking tool like +[hyperfine][0]. On Ubuntu 18.04 or later, you can install `hyperfine` by +running + + sudo apt-get install hyperfine + +Next, build the `split` binary under the release profile: + + cargo build --release -p uu_split + +Now, get a text file to test `split` on. The `split` program has three +main modes of operation: chunk by lines, chunk by bytes, and chunk by +lines with a byte limit. You may want to test the performance of `split` +with various shapes and sizes of input files and under various modes of +operation. For example, to test chunking by bytes on a large input file, +you can create a file named `testfile.txt` containing one million null +bytes like this: + + printf "%0.s\0" {1..1000000} > testfile.txt + +For another example, to test chunking by bytes on a large real-world +input file, you could download a [database dump of Wikidata][1] or some +related files that the Wikimedia project provides. For example, [this +file][2] contains about 130 million lines. + +Finally, you can compare the performance of the two versions of `split` +by running, for example, + + cd /tmp && hyperfine \ + --prepare 'rm x* || true' \ + "split -b 1000 testfile.txt" \ + "target/release/split -b 1000 testfile.txt" + +Since `split` creates a lot of files on the filesystem, I recommend +changing to the `/tmp` directory before running the benchmark. The +`--prepare` argument to `hyperfine` runs a specified command before each +timing run. We specify `rm x* || true` so that the output files from the +previous run of `split` are removed before each run begins. + +[0]: https://github.com/sharkdp/hyperfine +[1]: https://www.wikidata.org/wiki/Wikidata:Database_download +[2]: https://dumps.wikimedia.org/wikidatawiki/20211001/wikidatawiki-20211001-pages-logging.xml.gz