mirror of https://github.com/RGBCube/serenity synced 2025-10-24 22:22:32 +00:00
Commit graph

23 commits

Author SHA1 Message Date
Timothy Flynn
5cbf054651 LibUnicode: Fix typos causing text segmentation on mid-word punctuation
For example the words "can't" and "32.3" should not have boundaries
detected on the "'" and "." code points, respectively.

The String test cases are updated here because "b'ar" is now considered
one word.
2023-02-15 12:36:47 +01:00
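To illustrate the behavior described in this commit, here is a minimal sketch in Python. It is not the LibUnicode implementation. It approximates the Unicode word boundary rules with a regular expression, assuming only that "'" and "." do not break a word when they sit between two letters or between two digits. The name `split_words` is hypothetical, and cases such as letters adjacent to digits are deliberately not modeled.

```python
import re

# Simplified stand-in for the Unicode text segmentation rules:
#   - a run of letters may continue across "'" or "." if letters follow
#   - a run of digits may continue across "'" or "." if digits follow
# Everything else (spaces, trailing punctuation) separates words.
_WORD = re.compile(r"[^\W\d_]+(?:['.][^\W\d_]+)*|\d+(?:['.]\d+)*")


def split_words(text):
    """Return the words in text, keeping mid-word punctuation attached."""
    return _WORD.findall(text)
```

With this sketch, `split_words("can't stop at 32.3!")` keeps "can't" and "32.3" intact, while the trailing "!" is dropped because it is not followed by another letter or digit.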
Timothy Flynn
8f2589b3b0 LibUnicode: Parse and generate case folding code point data
Case folding rules have a similar mapping style as special casing rules,
where one code point may map to zero or more case folding rules. These
will be used for case-insensitive string comparisons. To see how case
folding can differ from other casing rules, consider "ß" (U+00DF):

    >>> "ß".lower()
    'ß'

    >>> "ß".upper()
    'SS'

    >>> "ß".title()
    'Ss'

    >>> "ß".casefold()
    'ss'
2023-01-18 14:43:40 +00:00
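As a small illustration of why case folding matters for case-insensitive comparison, the following Python sketch compares strings by their folded forms. The helper name `equals_ignoring_case` is hypothetical and this is not the LibUnicode API.

```python
def equals_ignoring_case(a, b):
    """Compare two strings using full case folding rather than lowercasing."""
    return a.casefold() == b.casefold()


# Lowercasing alone leaves "ß" untouched, so the two spellings differ.
lowercase_match = "Straße".lower() == "STRASSE".lower()

# Case folding maps "ß" to "ss", so the two spellings compare equal.
casefold_match = equals_ignoring_case("Straße", "STRASSE")
```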
Timothy Flynn
bc51017a03 LibUnicode: Support full case folding for titlecasing a string
Unicode declares that to titlecase a string, the first cased code point
after each word boundary should be transformed to its titlecase mapping.
All other code points are transformed to their lowercase mapping.
2023-01-16 18:33:44 -05:00
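The titlecasing rule described in this commit can be sketched in Python as follows. This is a rough approximation under stated assumptions, not the LibUnicode code: word boundaries are taken to be whitespace only, and "cased" is approximated by the general categories Lu, Ll and Lt. The names `is_cased` and `to_titlecase` are hypothetical.

```python
import unicodedata


def is_cased(ch):
    # Approximation: the real Cased property covers more than these categories.
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")


def to_titlecase(text):
    out = []
    seen_cased = False  # whether the current word already had a cased code point
    for ch in text:
        if ch.isspace():
            # Assumed word boundary: the next cased code point gets titlecased.
            seen_cased = False
            out.append(ch)
        elif not seen_cased and is_cased(ch):
            out.append(ch.title())  # titlecase mapping of a single code point
            seen_cased = True
        else:
            out.append(ch.lower())  # every other code point is lowercased
    return "".join(out)
```

Note that the first cased code point is not necessarily the first code point of the word: in `"'quoted' text"` the leading apostrophe is skipped and the "q" is titlecased.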
Timothy Flynn
b562348d31 LibUnicode: Generate simple case folding mappings for titlecase
Note we already generate the special case foldings for titlecase.
2023-01-16 18:33:44 -05:00
Timothy Flynn
1ff29afc45 LibUnicode+LibJS+LibWeb: Propagate OOM from Unicode case transformations 2023-01-09 22:48:15 +00:00
Timothy Flynn
f38c68177b LibUnicode: Update code point ideographic replacements for Unicode 15 2022-10-07 18:17:40 +01:00
thankyouverycool
5658524aa3 Tests: Add Unicode tests for CharacterType block properties 2022-02-15 10:13:19 -05:00
Timothy Flynn
6efbafa6e0 Everywhere: Update copyrights with my new serenityos.org e-mail :^) 2022-01-31 18:23:22 +00:00
Timothy Flynn
7e6ad172a4 LibUnicode: Support code point names that apply to ranges of code points
For example, consider the following adjacent entries in UnicodeData.txt:

    3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
    4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

Our current implementation would assign the display name "CJK Ideograph
Extension A" to code points U+3400 & U+4DBF, but not to the code points
in between. Not only should those code points be assigned a name, but
the Unicode spec also has formatting rules on what the names should be
(the names for these ranged code points are not as they appear in
UnicodeData.txt).

The spec also defines names for code point ranges that actually are
listed individually in UnicodeData.txt. For example:

    2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;
    2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;;
    2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;;

Code points are only coalesced into a range if all fields after the name
are equivalent. Our parser will insert the range and its name formatting
pattern when it comes across the first code point in that range, then
ignore other code points in that range. This reduces the number of names
we generate by nearly 2,000.
2021-11-30 11:24:02 +01:00
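A minimal Python sketch of the range-plus-pattern idea follows. It assumes a table that stores one entry per range with a name prefix, and formats the code point in uppercase hexadecimal on lookup. The table contents, the end of the second range, and the names `NAME_RANGES` and `ranged_name` are assumptions for illustration, not the generated LibUnicode data.

```python
import bisect

# (first, last, name prefix). Hypothetical table holding two example ranges.
NAME_RANGES = [
    (0x3400, 0x4DBF, "CJK UNIFIED IDEOGRAPH-"),
    (0x2F800, 0x2FA1D, "CJK COMPATIBILITY IDEOGRAPH-"),
]
_FIRSTS = [first for first, _, _ in NAME_RANGES]


def ranged_name(code_point):
    """Return the formatted name if code_point falls in a known range."""
    # Find the last range whose first code point is <= code_point.
    i = bisect.bisect_right(_FIRSTS, code_point) - 1
    if i < 0:
        return None
    first, last, prefix = NAME_RANGES[i]
    if code_point > last:
        return None
    return f"{prefix}{code_point:04X}"
```

Storing one entry per range means a code point in the middle of the range, such as U+3401, gets a name without having its own row in the table.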
Timothy Flynn
50158abaf1 LibUnicode: Implement locale-aware BEFORE_DOT special casing
Note that the algorithm in the Unicode spec is for checking that a code
point precedes U+0307, but the special casing condition NotBeforeDot is
interested in the inverse of this rule.
2021-09-06 15:24:27 +01:00
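The relationship between the two conditions can be sketched in Python. This assumes the usual definition of the condition: the code point is followed by U+0307, with only combining marks whose combining class is neither 0 nor 230 allowed in between. The function names are hypothetical and this is not the LibUnicode code.

```python
import unicodedata

COMBINING_DOT_ABOVE = "\u0307"


def is_before_dot(text, index):
    """True if text[index] is followed by U+0307, allowing intervening marks."""
    for ch in text[index + 1:]:
        if ch == COMBINING_DOT_ABOVE:
            return True
        combining_class = unicodedata.combining(ch)
        # A starter (class 0) or another mark above (class 230) blocks the match.
        if combining_class in (0, 230):
            return False
    return False


def is_not_before_dot(text, index):
    # The special casing condition is the inverse of the check above.
    return not is_before_dot(text, index)
```

For example, `is_before_dot("I\u0323\u0307", 0)` holds because U+0323 (a mark below) may intervene, while `is_before_dot("I\u0301\u0307", 0)` does not because U+0301 is a mark above.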
Timothy Flynn
436faf9fd9 LibUnicode: Implement locale-aware MORE_ABOVE special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
1427ebc622 LibUnicode: Implement locale-aware AFTER_SOFT_DOTTED special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
0053d48c41 LibUnicode: Implement locale-aware AFTER_I special casing 2021-09-06 15:24:27 +01:00
Timothy Flynn
1e91334008 LibUnicode: Handle edge-case script extensions, Common and Inherited
These script extensions have some peculiar behavior in the Unicode spec.
The UCD ScriptExtensions.txt file does not contain these scripts. Rather,
it is implied that the code points which have these scripts as an
extension are the code points that both:

  1. Have Common or Inherited as their primary script value
  2. Do not have any other script value in their script extension lists

Because these are not explicitly listed in the UCD, we must manually form
these script extensions.
2021-08-11 13:11:01 +02:00
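The two conditions above can be sketched in Python as a set comprehension. The sample data is a toy table chosen for illustration and is not copied from the UCD; the names `PRIMARY_SCRIPT`, `EXPLICIT_EXTENSIONS` and `implied_extension` are hypothetical.

```python
# Toy data, for illustration only. Real values come from the UCD files.
PRIMARY_SCRIPT = {
    0x0020: "Common",
    0x3001: "Common",
    0x0041: "Latin",
    0x0951: "Inherited",
    0x20D0: "Inherited",
}
# Code points that have an explicit script extension list.
EXPLICIT_EXTENSIONS = {
    0x3001: {"Bopomofo", "Hangul", "Han", "Hiragana", "Katakana", "Yi"},
    0x0951: {"Devanagari", "Bengali"},
}


def implied_extension(script):
    """Code points whose script extension is implied to be `script`."""
    return {
        code_point
        for code_point, primary in PRIMARY_SCRIPT.items()
        # Condition 1: the primary script is Common or Inherited (as requested).
        if primary == script
        # Condition 2: no other script appears in an explicit extension list.
        and not EXPLICIT_EXTENSIONS.get(code_point)
    }
```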
Timothy Flynn
47bb350ebd LibUnicode: Generate separate tables for scripts and script extensions
Notice that unlike the note in populate_general_category_unions(),
script extensions do indeed have code point ranges which overlap. Thus,
this commit adds code to handle that, and hooks it into the GC unions.
2021-08-11 13:11:01 +02:00
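The commit does not say how overlapping ranges are handled. One common approach, shown here as a Python sketch and not as the generator's actual code, is to sort the ranges and merge any that overlap or touch before building a union table. The name `merge_ranges` is hypothetical.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent (first, last) code point ranges."""
    merged = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            # Overlaps or directly follows the previous range: extend it.
            prev_first, prev_last = merged[-1]
            merged[-1] = (prev_first, max(prev_last, last))
        else:
            merged.append((first, last))
    return merged
```

The result is sorted and non-overlapping, which keeps a later binary search over the table well defined.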
Timothy Flynn
5ac23d244d LibUnicode: Generate separate tables for Unicode properties
Similar to General Categories, this generates separate tables for the
Property list.
2021-08-11 13:11:01 +02:00
Timothy Flynn
b06c104076 LibUnicode: Include Unassigned code points in the Other General Category
Now that the generator parses unassigned General Category properties, it
can include Unassigned (Cn) in the Other (C) category.
2021-08-11 13:11:01 +02:00
Timothy Flynn
7dce2bfe23 LibUnicode: Generate separate tables for General Category properties
Previously, each code point's General Category was part of the generated
UnicodeData structure. This ultimately presented two problems, one
functional and one performance related:

  * Some General Categories are applied to unassigned code points, for
    example the Unassigned (Cn) category. Unassigned code points are
    strictly excluded from UnicodeData.txt, so by relying on that file,
    the generator is unable to handle these categories.

  * Lookups for General Categories are slower when searching through the
    large UnicodeData hash map. Even though lookups are O(1), the hash
    function turned out to be slower than binary searching through a
    category-specific table.

So, now a table is generated for each General Category. When querying a
code point for a category, a binary search is done over the code point
ranges in that category's table to check if the code point has that
category.

Further, General Categories are now parsed from the UCD file
DerivedGeneralCategory.txt. This file is a normal "prop list" file and
contains the categories for unassigned code points.
2021-08-11 13:11:01 +02:00
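The per-category table and binary search described in this commit can be sketched in Python. The sample lines imitate the "prop list" layout and are an assumption made for illustration; `parse_prop_list` and `has_category` are hypothetical names, and the real generator emits C++ tables at build time.

```python
import bisect

# A few sample lines in the "prop list" layout: range ; value # comment
SAMPLE = """\
0000..001F    ; Cc #  [32] <control>..<control>
0020          ; Zs #       SPACE
0041..005A    ; Lu #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0378..0379    ; Cn #   [2] <reserved-0378>..<reserved-0379>
"""


def parse_prop_list(text):
    """Build {category: sorted list of (first, last)} from prop list text."""
    tables = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_points, category = (field.strip() for field in line.split(";"))
        first, _, last = code_points.partition("..")
        entry = (int(first, 16), int(last or first, 16))
        tables.setdefault(category, []).append(entry)
    for ranges in tables.values():
        ranges.sort()
    return tables


def has_category(tables, code_point, category):
    """Binary search the category's table for a range containing code_point."""
    ranges = tables.get(category, [])
    # Index of the last range whose first code point is <= code_point.
    i = bisect.bisect_right(ranges, (code_point, 0x10FFFF)) - 1
    return i >= 0 and ranges[i][0] <= code_point <= ranges[i][1]
```

Because unassigned ranges such as the Cn line appear in the input, they end up in a table like any other category, which is the functional problem the commit set out to solve.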
Timothy Flynn
c4bfda7f7f LibUnicode: Handle code points that are both cased and case-ignorable
Apparently, some code points fit both categories, for example U+0345
(COMBINING GREEK YPOGEGRAMMENI). Handle this fact when determining if
a code point is a final code point in a string.
2021-07-28 23:42:29 +02:00
Timothy Flynn
7827aede6f LibUnicode: Check word break when deciding on case-ignorable code points 2021-07-28 23:42:29 +02:00
Timothy Flynn
c45a014645 LibUnicode: Check property list when deciding if a code point is cased 2021-07-28 23:42:29 +02:00
Timothy Flynn
39f971e42b LibUnicode: Begin implementing special Unicode case folding
This implements unconditional special case folding, and conditional
folding for non-locale cases. Worth noting that the only conditional,
non-locale special case is for converting an uppercase sigma to
lowercase.
2021-07-27 21:04:36 +01:00
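The two kinds of special casing mentioned here can be observed with Python's built-in string methods, which are used below purely as an illustration and are not the LibUnicode API. The variable names are arbitrary.

```python
# Unconditional special casing: one code point maps to several code points.
sharp_s_upper = "ß".upper()
ligature_upper = "\ufb01".upper()  # U+FB01 LATIN SMALL LIGATURE FI

# Conditional, non-locale special casing: an uppercase sigma lowercases to
# the final form only when it ends a word.
word_lower = "ΟΔΟΣ".lower()
lone_sigma_lower = "Σ".lower()
```

In `word_lower` the last code point is U+03C2 (final sigma), whereas the isolated sigma in `lone_sigma_lower` becomes U+03C3.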
Timothy Flynn
4dda3edc9e LibUnicode: Introduce a Unicode library for interacting with UCD files
The Unicode standard publishes the Unicode Character Database (UCD) with
information about every code point, such as each code point's upper case
mapping. LibUnicode exists to download and parse UCD files at build time
and to provide accessors to that data.

As a start, LibUnicode includes upper- and lower-case code point
converters.
2021-07-26 17:03:55 +01:00
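As a rough picture of what parsing a UCD file for case mappings involves, the Python sketch below reads two UnicodeData.txt lines and extracts the simple uppercase and lowercase mappings (the fields at index 12 and 13 of each semicolon-separated record). The function names are hypothetical; LibUnicode itself does this at build time in C++.

```python
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""


def parse_simple_case_mappings(text):
    """Return (to_upper, to_lower) dicts keyed by code point."""
    to_upper, to_lower = {}, {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 15:
            continue  # skip anything that is not a full record
        code_point = int(fields[0], 16)
        if fields[12]:
            to_upper[code_point] = int(fields[12], 16)
        if fields[13]:
            to_lower[code_point] = int(fields[13], 16)
    return to_upper, to_lower


def to_uppercase(code_point, to_upper):
    # Code points without a mapping are returned unchanged.
    return to_upper.get(code_point, code_point)
```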