The mappings are exposed via `Unicode::code_point_decomposition(u32)`
and `Unicode::code_point_decompositions()`, the latter being useful for
reverse searching a code point from its decomposition.
The normalization code does not make use of `Quick_Check` props (https://www.unicode.org/reports/tr44/#Decompositions_and_Normalization),
meaning no quick check optimizations.
According to TR #51, the "best definition of the full set [of emojis] is
in the emoji-test.txt file". This defines not only the emoji themselves,
but the order in which they should be displayed, and what "group" of
emojis they belong to.
Currently, LibUnicodeData contains the generated UCD and CLDR data. Move
the UCD data to the main LibUnicode library, and rename LibUnicodeData
to LibLocaleData. This is another prepatory change to migrate to
LibLocale.
To prepare for placing all CLDR generated data in a new library,
LibLocale, this moves the code generators for the CLDR data to the
LibLocale subfolder.
Plural rules in the CLDR are of the form:
"cs": {
"pluralRule-count-one": "i = 1 and v = 0 @integer 1",
"pluralRule-count-few": "i = 2..4 and v = 0 @integer 2~4",
"pluralRule-count-many": "v != 0 @decimal 0.0~1.5, 10.0, 100.0 ...",
"pluralRule-count-other": "@integer 0, 5~19, 100, 1000, 10000 ..."
}
The syntax is described here:
https://unicode.org/reports/tr35/tr35-numbers.html#Plural_rules_syntax
There are up to 2 sets of rules for each locale, a cardinal set and an
ordinal set. The approach here is to generate a C++ function for each
set of rules. Each condition in the rules (e.g. "i = 1 and v = 0") is
transpiled to a C++ if-statement within its function. Then lookup tables
are generated to match locales to their generated functions.
NOTE: -Wno-parentheses-equality is added to the LibUnicodeData compile
flags because the generated plural rules have lots of extra parentheses
(because e.g. we need to selectively negate and combine rules). The code
to generate only exactly the right number of parentheses is quite hairy,
so this just tells the compiler to ignore the extras.
Relative-time format patterns are of one of two forms:
* Tensed - refer to the past or the future, e.g. "N years ago" or
"in N years".
* Numbered - refer to a specific numeric value, e.g. "in 1 year"
becomes "next year" and "in 0 years" becomes "this year".
In ECMA-402, tensed and numbered refer to the numeric formatting options
of "always" and "auto", respectively.
This is no longer needed now that LibTimeZone is included within LibC.
Remove the direct linkage so that others do not mistakenly copy-paste
the CMakeLists text elsewhere.
This adds an API to use LibTimeZone to convert a time zone such as
"America/New_York" to a GMT offset string like "GMT-5" (short form) or
"GMT-05:00" (long form).
LibUnicode no longer needs to generate a list of time zone names that it
parsed from metaZones.json. We can defer to the TZDB for a golden list
of time zones.
The generated data for libunicodedata.so is quite large, and loading it
is a price paid by nearly every application by way of depending on
LibRegex. In order to defer this cost until an application actually uses
one of the surrounding APIs, dynamically load the generated symbols.
To be able to load the symbols dynamically, the generated methods must
have demangled names. Typically, this is accomplished with `extern "C"`
blocks. The clang toolchain complains about this here because the types
returned from the generators are strictly C++ types. So to demangle the
names, we use the asm() compiler directive to manually define a symbol
name; the caveat is that we *must* be sure the symbols are unique. As an
extra precaution, we prefix each symbol name with "unicode_". For more
details, see: https://gcc.gnu.org/onlinedocs/gcc/Asm-Labels.html
This symbol loader used in this implementation provides the additional
benefit of removing many [[maybe_unused]] attributes from the LibUnicode
methods. Internally, if ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF, the
loader is able to stub out the function pointers it returns.
Note that as of this commit, LibUnicode is still directly linked against
LibUnicodeData. This commit is just a first step towards removing that.
This breaks LibUnicode into two libraries: LibUnicode containing the
public APIs for accessing the library, and LibUnicodeData containing the
generated source files. LibUnicodeData has compile options optimized for
size, which save about 1MB of data in total.
Currently, we generate separate data files for locale and number format
related tables/methods, but provide public accessors for all of the data
in one Locale.h file. Rather than continuing this trend for date-time,
relative time, etc. formatting, it's a bit easier to reason about if the
public accessors are also in separate files.
This data is published under ISO-4217 as an XML file. Since we can't
parse XML files yet, and the data isn't very large, it was translated to
C++ manually here.
Moving this helper CMake file to the centralized Meta/CMake folder helps
to get a better grasp on what extra files are required for the build,
and what files are generated.
While we're at it, don't use add_compile_definitions for
ENABLE_UNICODE_DATA, which only needs to be seen by LibUnicode sources.
ECMA-402 requires validating user input against the EBNF grammar for
Unicode locales described in TR-35: https://www.unicode.org/reports/tr35
This commit adds validators for that grammar, as well as other helper to
e.g. canonicalize a locale string.
The Unicode standard publishes the Unicode Character Database (UCD) with
information about every code point, such as each code point's upper case
mapping. LibUnicode exists to download and parse UCD files at build time
and to provide accessors to that data.
As a start, LibUnicode includes upper- and lower-case code point
converters.