There are a couple of minor nuances with parsing script values, compared
to other properties. In Scripts.txt, the UCD file lists the full name of
each script; other properties, like General Category, list the shorter
name in their primary files. This means that the aliases listed in
PropertyValueAliases.txt are reversed for script values.
It takes a non-neglible amount of time to parse all of the UCD files and
generate the Unicode data files. To help compile times, only invoke the
generator once.
This downloads the PropertyValueAliases.txt UCD file, which contains a
set of General Category aliases.
This changes the General Category enumeration to now be generated as a
bitmask. This is to easily allow General Category unions. For example,
the LC (Cased_Letter) category is the union of the Ll, Lu, and Lt
categories.
DerivedCoreProperties are pseudo-properties that are the union of other
categories and properties. For example, the derived property Math is the
union of the general category Sm and the property Other_Math.
Parsing these is necessary for implementing Unicode property escapes.
But it also has the added benefit that LibUnicode now does not need to
derive some of these properties at runtime.
The previous logic had several checks for Lagom directories and
subdirectories. All we really want to do for these header checks is make
sure that the files end up in an included folder prefixed with
LibUnicode. We also don't need to hard code the path to the generator,
the $<TARGET_FILES> generator expression can create the path for us.
Note that unlike the main property list, each code point has only one
word break property. Code points that do not have a word break property
are to be assigned the property "Other".
This adds a SpecialCasing structure to the generated UnicodeData.h/cpp
files. This structure contains casing rules for code points which have
non-1-to-1 upper-to-lower case code point mappings. Further, these rules
may be limited to specific locales or other context.
This is primarily to allow using LibUnicode within LibJS and its REPL.
Note: this seems to be the first time that a Lagom dependency requires
generated source files. For this to work, some of Lagom's CMakeLists.txt
commands needed to be re-organized to include the CMake files that fetch
and parse UnicodeData.txt. The paths required to invoke the generator
also differ depending on what is currently building (SerenityOS vs.
Lagom as part of the Serenity build vs. a standalone Lagom build).
The Unicode standard publishes the Unicode Character Database (UCD) with
information about every code point, such as each code point's upper case
mapping. LibUnicode exists to download and parse UCD files at build time
and to provide accessors to that data.
As a start, LibUnicode includes upper- and lower-case code point
converters.