Previously, we were breaking up digits into groups without regard for
the locale's minimumGroupingDigits value in the CLDR. This value is 1 in
most locales, but is 2 in locales such as pl-PL. What this means is that
in those locales, the group separator should only be inserted if the
thousands group has at least 2 digits. So 1000 is formatted as "1,000"
in en-US, but "1000" in pl-PL. And 10000 is "10,000" in en-US and
"10 000" in pl-PL.
Currently, the UnicodeLocale generator collects a list of known locales
from the CLDR before processing language display names. For each locale,
the identifier is broken into language, script, and region subtags, and
we create a list of seen languages. When processing display names, we
skip languages we hadn't seen in that first step.
This is insufficient for language display names like "en-GB", which do
not have an locale entry in the CLDR, and thus are skipped. So instead,
create the list of known languages by actually reading through the list
of languages which have a display name.
These patterns indicate how to display locale strings when that locale
contains multiple subtags. For example, "en-US" would be displayed as
"English (United States)".
Note there's a bit of an unfortunate duplication in the calendar enum
generated by UnicodeLocale and the existing enum generated by
UnicodeDateTimeFormat. The former contains every calendar known to the
CLDR, whereas the latter contains the calendars we've actually parsed
for DateTimeFormat (currently only Gregorian). The new enum generated
here can be removed once DateTimeFormat knows about all calendars.
Our generator is currently preferring the DST variant of the time zone
display names over the non-DST variant. LibTimeZone currently does not
have DST support, and operates in a mode that basically assumes DST does
not exist. Swap the display names for now just to be consistent until we
have DST support.
Note we will need to generate both of these variants and select the
appropriate one at runtime once we have DST support.
Now that number systems are generated as an enum, we can generated the
number system data in the order of that enum. This lets us perform
lookups of that data by index instead of a loop of string comparisons.
We had a hard-coded table of number system digits copied from ECMA-402.
Turns out these digits are in the CLDR, so let's parse the digits from
there instead of hard-coding them.
This adds an API to use LibTimeZone to convert a time zone such as
"America/New_York" to a GMT offset string like "GMT-5" (short form) or
"GMT-05:00" (long form).
The generate_mapping helper generates a series of structs like:
Array<SomeType, 1> s_mapping_key_0 {};
Array<SomeType, 2> s_mapping_key_1 {};
Array<SomeType, 3> s_mapping_key_2 {};
Array<Span<SomeType const>> s_mapping { {
s_mapping_key_0.span(),
s_mapping_key_1.span(),
s_mapping_key_2.span(),
} };
Where the names of the struct were generated by the format_mapping_name
lambda inside the helper. Rather than this lambda making assumptions on
how each generator wants to name its structs, add a parameter for the
caller to provide a naming formatter.
This is because the TimeZoneData generator will want pretty specific
identifier formatting rules.
LibUnicode no longer needs to generate a list of time zone names that it
parsed from metaZones.json. We can defer to the TZDB for a golden list
of time zones.
The generator parses metaZones.json to form a mapping of meta zones to
time zones (AKA "golden zone" in TR-35). This parser errantly assumed
this was a 1-to-1 mapping.
In Unicode::get_time_zone_name(), we don't need to require that the time
zone is UTC for long- and short-style name lookups. This is required for
other styles, because they will depend on TZDB data - so move the VERIFY
to that scope.
When searching for the locale-specific flexible day period for a given
hour, we were neglecting to handle cases where the period crosses 00:00.
For example, the en locale defines a day period range of [21:00, 06:00).
When given the hour of 05:00, we were checking if (21 <= 5 && 5 < 6),
thus not recognizing that the hour falls in that period.
This is a temporary mechanism while LibUnicode is in an in-between state
where some symbols are weakly linked and others are dynamically loaded.
The latter require an asm() label to be loaded.
Currently, we load the generated Unicode symbols with dlopen at runtime.
This is unnecessary as of 565a880ce5.
Applications that want Unicode data now link directly against the shared
library holding that data. So the same functionality can be achieved
with weak symbols.
ECMA-402 now supports short-offset, long-offset, short-generic, and
long-generic time zone name formatting. For example, in the en-US locale
the America/Eastern time zone would be formatted as:
short-offset: GMT-5
long-offset: GMT-05:00
short-generic: ET
long-generic: Eastern Time
We currently only support the UTC time zone, however. Therefore, this
very minimal implementation does not consider GMT offset or generic
display names. Instead, the CLDR defines specific strings for UTC.
The generated data for libunicodedata.so is quite large, and loading it
is a price paid by nearly every application by way of depending on
LibRegex. In order to defer this cost until an application actually uses
one of the surrounding APIs, dynamically load the generated symbols.
To be able to load the symbols dynamically, the generated methods must
have demangled names. Typically, this is accomplished with `extern "C"`
blocks. The clang toolchain complains about this here because the types
returned from the generators are strictly C++ types. So to demangle the
names, we use the asm() compiler directive to manually define a symbol
name; the caveat is that we *must* be sure the symbols are unique. As an
extra precaution, we prefix each symbol name with "unicode_". For more
details, see: https://gcc.gnu.org/onlinedocs/gcc/Asm-Labels.html
This symbol loader used in this implementation provides the additional
benefit of removing many [[maybe_unused]] attributes from the LibUnicode
methods. Internally, if ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF, the
loader is able to stub out the function pointers it returns.
Note that as of this commit, LibUnicode is still directly linked against
LibUnicodeData. This commit is just a first step towards removing that.
The variable `s_time_zone_list_index_type` seems to be unused (detected
when compiling with clang), and it seems logical to bind it even it if
it is not used for now.