mirror of https://github.com/RGBCube/uutils-coreutils synced 2025-07-28 03:27:44 +00:00

wc: Do a chunked read with proper UTF-8 handling

This brings the results mostly in line with GNU wc and solves nasty
behavior with long lines.
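
As an illustration of the idea in the commit message, here is a minimal sketch of counting characters from a chunked read when a multi-byte UTF-8 sequence may be split across two chunks. It is not the code from this commit: the function name `count_chars_chunked`, the `chunk_size` parameter, and the choice to skip invalid bytes without counting them are all assumptions made for the example.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not taken from the commit): counts Unicode scalar
/// values in `reader`, reading `chunk_size` bytes at a time.
/// Assumption: invalid bytes, and a truncated sequence at end of input,
/// are skipped and not counted.
fn count_chars_chunked<R: Read>(mut reader: R, chunk_size: usize) -> io::Result<usize> {
    // An incomplete UTF-8 sequence is at most 3 bytes long, so a buffer of
    // 4 or more bytes always leaves room to read at least one new byte.
    assert!(chunk_size >= 4, "chunk_size must be at least 4");
    let mut buf = vec![0u8; chunk_size];
    let mut carried = 0; // bytes of an incomplete sequence kept at buf[..carried]
    let mut count = 0;

    loop {
        let n = reader.read(&mut buf[carried..])?;
        if n == 0 {
            // End of input: any carried bytes form a truncated sequence.
            return Ok(count);
        }
        let filled = carried + n;
        let mut start = 0;
        carried = 0;

        while start < filled {
            match std::str::from_utf8(&buf[start..filled]) {
                Ok(s) => {
                    count += s.chars().count();
                    start = filled;
                }
                Err(e) => {
                    // Count the valid prefix before the problem bytes.
                    let valid = e.valid_up_to();
                    count += std::str::from_utf8(&buf[start..start + valid])
                        .unwrap()
                        .chars()
                        .count();
                    start += valid;
                    match e.error_len() {
                        // Definitely invalid bytes: skip them and continue.
                        Some(bad) => start += bad,
                        // Sequence cut off by the chunk boundary: move it to
                        // the front so the next read can complete it.
                        None => {
                            buf.copy_within(start..filled, 0);
                            carried = filled - start;
                            start = filled;
                        }
                    }
                }
            }
        }
    }
}
```

Because the buffer has a fixed size, memory use does not grow with line length, which is the property a line-by-line read lacks on very long lines. A call such as `count_chars_chunked(std::io::stdin().lock(), 64 * 1024)` would count characters on standard input under the assumptions above.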
Jan Verbeek 2021-08-25 13:26:44 +02:00 committed by Michael Debertol
parent 48437fc49d
commit 6f7d740592
8 changed files with 105 additions and 138 deletions

Binary file not shown.

tests/fixtures/wc/UTF_8_weirdchars.txt

@@ -0,0 +1,25 @@
zero-width space inbetween these: xx
and inbetween two spaces: [ ]
and at the end of the line:
non-breaking space: x x [   ]  
simple unicode: xµx [ µ ] µ
wide: xx [ ]
simple emoji: x👩x [ 👩 ] 👩
complex emoji: x👩🔬x [ 👩‍🔬 ] 👩‍🔬
, !
line feed: x x [ ]
vertical tab: x x [ ]
horizontal tab: x x [ ]
this should be the longest line:
1234567 12345678 123456781234567812345678
Control character: xx [  ]