serenity


Author	SHA1	Message	Date
Timothy Flynn	9e5abec6f1	AK: Invalidate UTF-8 encoded code points larger than U+10ffff On oss-fuzz, the LibJS REPL is provided a file encoded with Windows-1252 with the following contents: /ô¡°½/ The REPL assumes the input file is UTF-8. So in Windows-1252, the above is represented as [0x2f 0xf4 0xa1 0xb0 0xbd 0x2f]. The inner 4 bytes are actually a valid UTF-8 encoding if we only look at the most significant bits to parse leading/continuation bytes. However, it decodes to the code point U+121c3d, which is not a valid code point. This commit adds additional validation to ensure the decoded code point itself is also valid.	2022-04-05 00:14:29 +01:00
Andreas Kling	1be4cbd639	AK: Make Utf8View constructors inline and remove C string constructor Using StringView instead of C strings is basically always preferable. The only reason to use a C string is because you are calling a C API.	2021-09-18 19:54:24 +02:00
Timothy Flynn	87848cdf7d	AK: Track byte length, rather than code point length, in Utf8View::trim Utf8View::trim uses Utf8View::substring_view to return its result, which requires the input to be a byte offset/length rather than code point length.	2021-07-17 16:59:59 +01:00
DexesTTP	e01f1c949f	AK: Do not VERIFY on invalid code point bytes in UTF8View The previous behavior was to always VERIFY that the UTF-8 bytes were valid when iterating over the code points of a UTF8View. This change makes it so we instead output the 0xFFFD 'REPLACEMENT CHARACTER' code point when encountering invalid bytes, and keep iterating the view after skipping one byte. Leaving the decision to the consumer would break symmetry with the UTF32View API, which would in turn require heavy refactoring and/or code duplication in generic code such as the one found in Gfx::Painter and the Shell. To make it easier for the consumers to detect the original bytes, we provide a new method on the iterator that returns a Span over the data that has been decoded. This method is immediately used in the TextNode::compute_text_for_rendering method, which previously did this in an ad-hoc way. This also adds tests for the new behavior in TestUtf8.cpp, as well as reinforcements to the existing tests to check if the underlying bytes match up with their expected values.	2021-06-03 18:28:27 +04:30
Andreas Kling	407d6cd9e4	AK: Rename Utf8CodepointIterator => Utf8CodePointIterator	2021-06-01 09:45:52 +02:00
Max Wipfli	14506e8f5e	AK: Implement Utf8CodepointIterator::peek(size_t) This adds a peek method for Utf8CodepointIterator, which enables it to be used in some parsing cases where peeking is necessary. peek(0) is equivalent to operator*, except that peek() does not contain any assertions and will just return an empty Optional<u32>. This also implements a test case for iterating UTF-8.	2021-06-01 09:28:05 +02:00
Brian Gianforcaro	67322b0702	Tests: Move AK tests to Tests/AK	2021-05-06 17:54:28 +02:00

Renamed from AK/Tests/TestUtf8.cpp

7 commits
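
The two behaviors described in commits 9e5abec6f1 (reject decoded code points above U+10FFFF) and e01f1c949f (emit U+FFFD and skip one byte on invalid input) can be illustrated with a minimal standalone sketch. This is not the AK implementation: `decode_one` and `decode_lossy` are hypothetical names, the code uses only the C++ standard library rather than `Utf8View`, and it deliberately omits other checks a full decoder would need (overlong encodings, surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of AK): tries to decode one code point that
// starts at bytes[i]. Returns the number of bytes consumed, or 0 if the
// sequence is invalid.
size_t decode_one(const std::vector<uint8_t>& bytes, size_t i, uint32_t& code_point)
{
    assert(i < bytes.size());
    uint8_t lead = bytes[i];
    size_t length = 0;

    // Classify the leading byte by its most significant bits.
    if ((lead & 0x80) == 0x00) {
        length = 1;
        code_point = lead;
    } else if ((lead & 0xE0) == 0xC0) {
        length = 2;
        code_point = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        length = 3;
        code_point = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        length = 4;
        code_point = lead & 0x07;
    } else {
        return 0; // Stray continuation byte or invalid leading byte.
    }

    if (i + length > bytes.size())
        return 0; // Truncated sequence.

    // Each continuation byte must look like 10xxxxxx and contributes 6 bits.
    for (size_t k = 1; k < length; ++k) {
        uint8_t byte = bytes[i + k];
        if ((byte & 0xC0) != 0x80)
            return 0;
        code_point = (code_point << 6) | (byte & 0x3F);
    }

    // The bit patterns alone are not enough: the decoded value itself must
    // not exceed the largest Unicode code point.
    if (code_point > 0x10FFFF)
        return 0;

    return length;
}

// Hypothetical helper (not part of AK): decodes a whole buffer, emitting
// U+FFFD for every invalid byte and resuming one byte later.
std::vector<uint32_t> decode_lossy(const std::vector<uint8_t>& bytes)
{
    std::vector<uint32_t> result;
    size_t i = 0;
    while (i < bytes.size()) {
        uint32_t code_point = 0;
        size_t consumed = decode_one(bytes, i, code_point);
        if (consumed == 0) {
            result.push_back(0xFFFD);
            i += 1;
        } else {
            result.push_back(code_point);
            i += consumed;
        }
    }
    return result;
}
```

For the byte sequence quoted in 9e5abec6f1, [0x2f 0xf4 0xa1 0xb0 0xbd 0x2f], the inner four bytes pass the leading/continuation bit checks but decode to 0x121c3d, so this sketch rejects them at the range check and falls back to replacement characters.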