Mirror of https://github.com/RGBCube/serenity (synced 2026-01-12 22:50:59 +00:00)

Emoji sequences in the grapheme segmentation spec are a bit tricky (rule GB11 of UAX #29):

    \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}

Our current strategy of tracking a boolean to indicate whether we are in an
emoji sequence was causing us to break up emoji made of multiple
sub-sequences. For example, in the "family: man, woman, girl, boy" sequence:

    U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466

We would break at indices 0 (correctly) and 6 (incorrectly).

Instead of tracking a boolean, it's quite a bit simpler to reason about
emoji sequences by just skipping past them entirely. Note that in cases
like the above emoji, we skip one sub-sequence at a time.
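
The skipping idea can be sketched roughly as follows. This is a standalone illustration, not the LibUnicode code: the names `is_extend`, `is_extended_pictographic`, `skip_emoji_sequence`, and `find_grapheme_boundaries` are hypothetical, the property checks are crude range approximations rather than real Unicode data, and only the Extend/ZWJ and emoji-sequence rules are modeled.

```cpp
#include <cstddef>
#include <vector>

constexpr char32_t ZWJ = 0x200D;

// ASSUMPTION: crude stand-in for Grapheme_Cluster_Break=Extend. Only covers
// variation selectors and emoji skin tone modifiers.
bool is_extend(char32_t cp)
{
    return (cp >= 0xFE00 && cp <= 0xFE0F) || (cp >= 0x1F3FB && cp <= 0x1F3FF);
}

// ASSUMPTION: crude stand-in for \p{Extended_Pictographic}. A real
// implementation would consult generated Unicode property tables.
bool is_extended_pictographic(char32_t cp)
{
    return cp >= 0x1F300 && cp <= 0x1FAFF && !is_extend(cp);
}

// Precondition: code_points[index] is Extended_Pictographic.
// Repeatedly skips one "Extend* ZWJ Extended_Pictographic" sub-sequence and
// returns the index of the last pictographic code point reached.
size_t skip_emoji_sequence(std::vector<char32_t> const& code_points, size_t index)
{
    for (;;) {
        size_t next = index + 1;
        while (next < code_points.size() && is_extend(code_points[next]))
            ++next;

        bool has_zwj_then_pictographic = next + 1 < code_points.size()
            && code_points[next] == ZWJ
            && is_extended_pictographic(code_points[next + 1]);
        if (!has_zwj_then_pictographic)
            return index;

        // Skip past this sub-sequence and look for another one.
        index = next + 1;
    }
}

// Returns boundary positions as code point indices, including the start and
// the end of the text. Every rule other than "no break before Extend or ZWJ"
// and the emoji sequence rule is ignored in this sketch.
std::vector<size_t> find_grapheme_boundaries(std::vector<char32_t> const& code_points)
{
    std::vector<size_t> boundaries;
    if (code_points.empty())
        return boundaries;

    boundaries.push_back(0);

    size_t index = 0;
    while (index < code_points.size()) {
        if (is_extended_pictographic(code_points[index]))
            index = skip_emoji_sequence(code_points, index);

        // Never break before a trailing Extend or ZWJ.
        size_t next = index + 1;
        while (next < code_points.size() && (is_extend(code_points[next]) || code_points[next] == ZWJ))
            ++next;

        boundaries.push_back(next);
        index = next;
    }

    return boundaries;
}
```

Under these assumptions, the seven code points of the family emoji above yield boundaries only at 0 and 7 (the start and end of the sequence), while two adjacent pictographs with no ZWJ between them are still split apart. Because the skip is stateless, there is no flag to reset when a sequence ends or is interrupted.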

| File |
|---|
| CharacterTypes.cpp |
| CharacterTypes.h |
| CMakeLists.txt |
| CurrencyCode.cpp |
| CurrencyCode.h |
| Emoji.cpp |
| Emoji.h |
| Forward.h |
| Normalize.cpp |
| Normalize.h |
| Segmentation.cpp |
| Segmentation.h |
| String.cpp |
| UnicodeUtils.cpp |
| UnicodeUtils.h |