Similar to NumberFormat, replace the segments of date-time patterns with
partitions that can be split at runtime. Also generate the pattern style
fields for e.g. era, day, hour, etc.
Add unique storage for parsed CalendarPattern structures to ensure only
one copy of each structure is generated.
This doesn't have any impact on libunicode.so with the current generated
data. Rather, this prevents the amount of generated data from needlessly
growing astronomically once date-time patterns are fully parsed. There
will be 173,459 patterns parsed, of which only 22,495 (about 12%) are
unique. This change will save a few MB, and will also help compilation
times.
Currently, there's only a handful of entries in these arrays, so it is
not a huge deal to generate them inline with the struct that holds them.
But they will each soon contain a few hundred entries. Generate them out
of line for easier viewing in the generated code.
Add unique storage for parsed NumberFormat structures to ensure only one
copy of each structure is generated. Reduces libunicode.so on x86 from
13.2 MB to 11.4 MB.
UniqueStringStorage is used to ensure only one copy of a string will be
generated, and interested parties store just an index into the generated
storage. Generalize this class to allow any* type to be stored uniquely.
* To actually be storable, the type must have both an AK::Format and an
AK::Traits overload available.
This is not a calendar supported by ECMA-402, so let's not waste space
with its data.
Further, don't generate "gregorian" as a valid Unicode locale extension
keyword. It's an invalid type identifier, thus cannot be used in locales
such as "en-u-ca-gregorian".
For example, consider the following adjacent entries in UnicodeData.txt:
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
Our current implementation would assign the display name "CJK Ideograph
Extension A" to code points U+3400 & U+4DBF, but not to the code points
in between. Not only should those code points be assigned a name, but
the Unicode spec also has formatting rules on what the names should be
(the names for these ranged code points are not as they appear in
UnicodeData.txt).
The spec also defines names for code point ranges that actually are
listed individually in UnicodeData.txt. For example:
2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;
2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;;
2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;;
Code points are only coalesced into a range if all fields after the name
are equivalent. Our parser will insert the range and its name formatting
pattern when it comes across the first code point in that range, then
ignore other code points in that range. This reduces the number of names
we generated by nearly 2,000.
Unlike most data in the CLDR, hour cycles are not stored on a per-locale
basis. Instead, they are keyed by a string that is usually a region, but
sometimes is a locale. Therefore, given a locale, to determine the hour
cycles for that locale, we:
1. Check if the locale itself is assigned hour cycles.
2. If the locale has a region, check if that region is assigned hour
cycles.
3. Otherwise, maximize that locale, and if the maximized locale has
a region, check if that region is assigned hour cycles.
4. If the above all fail, fallback to the "001" region.
Further, each locale's default hour cycle is the first assigned hour
cycle.
This hasn't mattered yet by chance, because the source for all enums
contains names of the same case. But the enum generated for hour cycle
regions will have mixed case. Sort them case-insensitively in order to
traverse these names in the same order in both generate_enum and
generate_mapping.
Similar to number formatting, the data for date-time formatting will be
located in its own generated file. This extracts the cldr-dates package
from the CLDR and sets up the generator plumbing to create the date-time
data files.
Currently, we generate separate data files for locale and number format
related tables/methods, but provide public accessors for all of the data
in one Locale.h file. Rather than continuing this trend for date-time,
relative time, etc. formatting, it's a bit easier to reason about if the
public accessors are also in separate files.
As noted by ECMA-402, if a supported locale contains all of a language,
script, and region subtag, then the implementation must also support the
locale without the script subtag. The most complicated example of this
is the zh-TW locale.
The list of locales in the CLDR database does not include zh-TW or its
maximized zh-Hant-TW variant. Instead, it inlcudes the zh-Hant locale.
However, zh-Hant-TW is listed in the default-content locale list in the
cldr-core package. This defines an alias from zh-Hant-TW to zh-Hant. We
must then also support the zh-Hant-TW alias without the script subtag:
zh-TW. This transitively maps zh-TW to zh-Hant, which is a case quite
heavily tested by test262.
Previously, we were just copying the locale data into default-content
locales (for example, copying the "en" data into "en-US"). Instead, we
can just define the default-content locales as aliases to their main
locales.
This will be used for locale aliases as well. Also rename the "property"
field in this struct to "name", as it no longer is only used for
property aliases.
Also add slightly richer parse errors now that we can include a string
literal with returned errors.
This will allow us to use TRY() when working with JSON data.
This wasn't the case for compact patterns, but unit patterns can contain
multiple (up to 2, really) identifiers that must each be recognized by
LibJS.
Each generated NumberFormat object now stores an array of identifiers
parsed. The format pattern itself is encoded with the index into this
array for that identifier, e.g. the compact format string "0K" will
become "{number}{compactIdentifier:0}".
This field is currently used to store the StringView into the compact
name/symbol in the format string. Units will need to store a similar
field, so rename the field to be more generic, and extract the parser
for it.
The compact scale of each formatting rule was precomputed in commit:
be69eae651
Using the formula: compact scale = magnitude - pattern scale
This computation was off-by-one.
For example, consider the format key "10000-count-one", which maps to
"00 thousand" in en-US. What we are really after is the exponent that
best represents the string "thousand" for values greater than 10000
and less than 100000 (the next format key). We were previously doing:
log10(10000) - "00 thousand".count("0") = 2
Which clearly isn't what we want. Instead, if we do:
log10(10000) + 1 - "00 thousand".count("0") = 3
We get the correct exponent for each format key for each locale.
This commit also renames the generated variable from "compact_scale" to
"exponent" to match the terminology used in ECMA-402.
For example, in en-US, the decimal, long compact pattern for numbers
between 10,000 and 100,000 is "00 thousand". In that pattern, "thousand"
is the compact identifier, and the generated format pattern is now
"{number} {compactIdentifier}". This also generates that identifier as
its own field in the NumberFormat structure.
Most locales have a single grouping size (the number of integer digits
to be written before inserting a grouping separator). However some have
a primary and secondary size. We parse the primary size as the size used
for the least significant integer digits, and the secondary size for the
most significant.
In order to implement Intl.NumberFormat.prototype.formatToParts, do not
replace {currency} keys in the format pattern before ECMA-402 tells us
to. Otherwise, the array return by formatToParts will not contain the
expected currency key.
Early replacement was done to avoid resolving the currency display more
than once, as it involves a couple of round trips to search through
LibUnicode data. So this adds a non-standard method to NumberFormat to
do this resolution and cache the result.
Another side effect of this change is that LibUnicode must replace unit
format patterns of the form "{0} {1}" during code generation. These were
previously skipped during code generation because LibJS would just
replace the keys with the currency display at runtime. But now that the
currency display injection is delayed, any {0} or {1} keys in the format
pattern will cause PartitionNumberPattern to abort.
Currencies are a bit strange; the layout of currency data in the CLDR is
not particularly compatible with what ECMA-402 expects. For example, the
currency format in the "en" and "ar" locales for the Latin script are:
en: "¤#,##0.00"
ar: "¤\u00A0#,##0.00"
Note how the "ar" locale has a non-breaking space after the currency
symbol (¤), but "en" does not. This does not mean that this space will
appear in the "ar"-formatted string, nor does it mean that a space won't
appear in the "en"-formatted string. This is a runtime decision based on
the currency display chosen by the user ("$" vs. "USD" vs. "US dollar")
and other rules in the Unicode TR-35 spec.
ECMA-402 shies away from the nuances here with "implementation-defined"
steps. LibUnicode will store the data parsed from the CLDR however it is
presented; making decisions about spacing, etc. will occur at runtime
based on user input.
For example, there isn't a unique set of data for the en-US locale;
rather, it defaults to the data for the en locale. See this commit for
much more detail: 357c97dfa8
These are used when formatting a number as currency with a display
option of "name" (e.g. for USD, the name is "US Dollars" in en-US).
These patterns appear in the CLDR in a different manner than other
number formats that are pluralized. They are of the form "{0} {1}",
therefore do not undergo subpattern replacements.
Currently, LibUnicode is only parsing and generating the "long" style of
currency display names. However, the CLDR contains "short" and "narrow"
forms as well that need to be handled. Parse these, and update LibJS to
actually respect the "style" option provided by the user for displaying
currencies with Intl.DisplayNames.
Note: There are some discrepencies between the engines on how style is
handled. In particular, running:
new Intl.DisplayNames('en', {type:'currency', style:'narrow'}).of('usd')
Gives:
SpiderMoney: "USD"
V8: "US Dollar"
LibJS: "$"
And running:
new Intl.DisplayNames('en', {type:'currency', style:'short'}).of('usd')
Gives:
SpiderMonkey: "$"
V8: "US Dollar"
LibJS: "$"
My best guess is V8 isn't handling style, and just returning the long
form (which is what LibJS did before this commit). And SpiderMoney can
handle some styles, but if they don't have a value for the requested
style, they fall back to the canonicalized code passed into of().
The data used for number formatting is going to grow quite a bit when
the cldr-units package is parsed. To prevent the generated UnicodeLocale
file from growing outrageously large, the number formatting data can go
into its own file. To prepare for this, move code that will be common
between the generators for UnicodeLocale and UnicodeNumberFormat to the
utility header.
This will be needed for the ComputeExponentForMagnitude AO for compact
formatting, namely step 5b:
Let exponent be an implementation- and locale-dependent (ILD) integer
by which to scale a number of the given magnitude in compact notation
for the current locale.
A number formatting pattern in the CLDR contains one or two entries,
delimited by a semi-colon. Previously, LibUnicode was just storing the
entire pattern as one string. This changes the generator to split the
pattern on that delimiter and generate the 3 unique patterns expected by
ECMA-402.
The rules for generating the 3 patterns are as follows:
* If the pattern contains 1 entry, it is the zero pattern. The positive
pattern is the zero pattern prepended with {plusSign}. The negative
pattern is the zero pattern prepended with {minusSign}.
* If the pattern contains 2 entries, the first is the zero pattern, and
the second is the negative pattern. The positive pattern is the zero
pattern prepended with {plusSign}.