mirror of https://github.com/RGBCube/serenity synced 2025-05-15 08:54:57 +00:00
191 commits

Author SHA1 Message Date
Timothy Flynn
97508b74eb LibUnicode: Remove declaration of function which moved to another header
Unicode::get_number_system_symbol is declared in UnicodeNumberFormat and
defined in UnicodeNumberFormat.cpp.
2021-12-21 13:09:49 -08:00
Timothy Flynn
92233660b8 LibUnicode: Compile generated sources optimized for size
This breaks LibUnicode into two libraries: LibUnicode containing the
public APIs for accessing the library, and LibUnicodeData containing the
generated source files. LibUnicodeData has compile options optimized for
size, which save about 1MB of data in total.
2021-12-15 13:26:03 +00:00
Timothy Flynn
62ff029890 LibUnicode: Generate CalendarSymbols in a predetermined order
Similar to commit 2a7f36b392, this change moves the generated
CalendarSymbol enumeration to the public LibUnicode/NumberFormat.h
header with a pre-defined set of symbols that we need. This is to
prepare for uniquely generating the CalendarSymbols structure.
2021-12-13 21:28:56 -08:00
Timothy Flynn
2a7f36b392 LibJS+LibUnicode: Generate unique numeric symbol lists
There are 443 number system objects generated, each of which held an
array of number system symbols. Of those 443 arrays, only 39 are unique.

To uniquely store these, this change moves the generated NumericSymbol
enumeration to the public LibUnicode/NumberFormat.h header with a pre-
defined set of symbols that we need. This is to ensure the generated,
unique arrays are created in a known order with known symbols. While it
is unfortunate to no longer discover these symbols at generation time,
it does allow us to ignore unwanted symbols and perform fewer
string-to-enumeration conversions at lookup time.
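
A minimal sketch of the deduplication idea described above, written in Python for illustration rather than the generator's own C++. The function name `build_unique_storage` and the sample symbol lists are hypothetical; each number system ends up holding an index into a shared table of unique lists.

```python
def build_unique_storage(lists):
    """Deduplicate lists; return (unique_lists, index_per_input_list)."""
    unique = []      # each distinct list is stored exactly once
    index_of = {}    # maps a list (as a tuple) to its slot in `unique`
    indices = []     # one index per input list, in input order
    for symbols in lists:
        key = tuple(symbols)
        if key not in index_of:
            index_of[key] = len(unique)
            unique.append(key)
        indices.append(index_of[key])
    return unique, indices


# Hypothetical sample data: three number systems, two distinct symbol lists.
number_systems = [
    [".", ",", "NaN"],
    [",", ".", "NaN"],
    [".", ",", "NaN"],
]
unique, indices = build_unique_storage(number_systems)
```

Because the symbols are in a known, fixed order, two lists compare equal exactly when every symbol matches position by position.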
2021-12-11 14:17:47 +00:00
Timothy Flynn
a417c23de0 LibUnicode: Parse and generate per-locale day period ranges 2021-12-10 21:27:24 +00:00
Timothy Flynn
fa8e881cfa LibUnicode: Parse and generate secondary day period symbols
Generate morning2, afternoon2, evening2, and night2 symbols.
2021-12-10 21:27:24 +00:00
Timothy Flynn
76aab821f4 LibJS+LibUnicode: Rename some Unicode::DayPeriod values
In the CLDR, there aren't "night" values, there are "night1" & "night2"
values. This is for locales which use a different name for nighttime
depending on the hour. For example, the ja locale uses "夜" between the
hours of 19:00 and 23:00, and "夜中" between the hours of 23:00 and
04:00. Our CLDR parser is currently ignoring "night2", so this rename
is to prepare for that.

We could probably come up with better names, but in the end, the API in
LibUnicode will be such that outside callers won't even see Night1, etc.
2021-12-10 21:27:24 +00:00
Timothy Flynn
2024d9e9ea LibUnicode: Add method to combine two format pattern skeletons
The fields of the generated elements must be in the same order as the
table here:
https://unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table

Further, only one field from each group of fields is allowed.
2021-12-09 23:43:04 +00:00
Timothy Flynn
9d4c4303fd LibUnicode: Parse and generate date time range format patterns 2021-12-09 23:43:04 +00:00
Timothy Flynn
fe84a365c2 LibUnicode: Parse and generate format pattern skeletons
Pattern skeletons are more or less the "key" of format patterns. Every
format pattern is assigned a skeleton. Interval patterns (which are not
yet parsed) are also assigned a skeleton - this is used to match them to
an "owning" format pattern. So we will use the skeleton generated here
to match format patterns at runtime with their available interval
patterns.

An alternative approach would be to append interval patterns directly to
their owning format pattern, but this has some drawbacks:

    1. Skeletons aren't totally unique. A skeleton may appear in both
       the "dateFormats" and "availableFormats" objects, in which case
       the same interval formats would be generated more than once.

    2. Otherwise unique format patterns may only differ by the interval
       patterns assigned to them. This would cause the UniqueStorage for
       the format patterns to increase in size, impacting both compile
       times and libunicode.so size.
2021-12-09 23:43:04 +00:00
Timothy Flynn
b76e44f66f LibUnicode: Parse and generate time zone names in long and short form 2021-12-08 11:29:36 +00:00
Timothy Flynn
2bbf8aa24c LibUnicode: Generate era, month, weekday and day period calendar symbols
The parsing in parse_calendar_symbols() might be a bit more verbose than
it really needs to be, but it is to ensure the symbols are generated in
a known order that we can control with enumerations.
2021-12-08 11:29:36 +00:00
Timothy Flynn
6ace4000bf LibJS+LibUnicode: Supply field type in CalendarPattern's for-each method
Some callers will want different behavior depending on what field is
being provided to the callback.
2021-12-08 11:29:36 +00:00
Timothy Flynn
f02ecc1da2 LibUnicode: Fix copy-paste error in calendar_pattern_style_to_string
The string returned must be lowercase.
2021-12-01 16:36:26 +00:00
Timothy Flynn
7e6ad172a4 LibUnicode: Support code point names that apply to ranges of code points
For example, consider the following adjacent entries in UnicodeData.txt:

    3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
    4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

Our current implementation would assign the display name "CJK Ideograph
Extension A" to code points U+3400 & U+4DBF, but not to the code points
in between. Not only should those code points be assigned a name, but
the Unicode spec also has formatting rules on what the names should be
(the names for these ranged code points are not as they appear in
UnicodeData.txt).

The spec also defines names for code point ranges that actually are
listed individually in UnicodeData.txt. For example:

    2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;
    2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;;
    2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;;

Code points are only coalesced into a range if all fields after the name
are equivalent. Our parser will insert the range and its name formatting
pattern when it comes across the first code point in that range, then
ignore other code points in that range. This reduces the number of names
we generate by nearly 2,000.
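
The First/Last handling described above can be sketched roughly as follows. This is an illustrative Python sketch, not the generator's code; `parse_ranges` and `range_label` are hypothetical names, and the sketch only records the label of each range without applying the spec's name formatting rules.

```python
def parse_ranges(lines):
    """Collect (first, last, label) tuples from '<..., First>' / '<..., Last>' pairs."""
    ranges = []
    first = None
    for line in lines:
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name = fields[1]
        if name.startswith("<") and name.endswith(", First>"):
            first = code_point
        elif name.startswith("<") and name.endswith(", Last>"):
            label = name[1:-len(", Last>")]
            ranges.append((first, code_point, label))
            first = None
    return ranges


def range_label(ranges, code_point):
    """Return the label of the range containing code_point, or None."""
    for first, last, label in ranges:
        if first <= code_point <= last:
            return label
    return None


lines = [
    "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
    "4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;",
]
ranges = parse_ranges(lines)
```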
2021-11-30 11:24:02 +01:00
Timothy Flynn
16151aa7d5 LibJS+LibUnicode: Implement the Intl.DateTimeFormat constructor 2021-11-29 22:48:46 +00:00
Timothy Flynn
6dbdfb6ba1 LibUnicode: Add special handling of hour cycle (hc) Unicode keywords
For other keywords, allowed values per locale are generated at compile
time. But since the CLDR doesn't present hour cycles on a per-locale
basis, and hour cycle lookups depend on runtime data, we must handle
hour cycle keyword lookups differently than other keywords.
2021-11-29 22:48:46 +00:00
Timothy Flynn
48ce72e472 LibUnicode: Parse and generate regional hour cycles
Unlike most data in the CLDR, hour cycles are not stored on a per-locale
basis. Instead, they are keyed by a string that is usually a region, but
sometimes is a locale. Therefore, given a locale, to determine the hour
cycles for that locale, we:

    1. Check if the locale itself is assigned hour cycles.
    2. If the locale has a region, check if that region is assigned hour
       cycles.
    3. Otherwise, maximize that locale, and if the maximized locale has
       a region, check if that region is assigned hour cycles.
    4. If the above all fail, fall back to the "001" region.

Further, each locale's default hour cycle is the first assigned hour
cycle.
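
The lookup steps above can be sketched as below. This is a hedged Python illustration: the tables `HOUR_CYCLES` and `LIKELY_SUBTAGS` hold made-up sample data (not CLDR values), the locale parsing is deliberately simplistic, and step 3 is interpreted as "maximize whenever the earlier steps found nothing".

```python
# Sample data for illustration only; keys are regions or locales.
HOUR_CYCLES = {
    "001": ["h23", "h12"],
    "US": ["h12", "h23"],
    "JP": ["h23", "h11"],
}
# Stand-in for locale maximization (adding likely subtags).
LIKELY_SUBTAGS = {"en": "en-Latn-US", "ja": "ja-Jpan-JP"}


def region_of(locale):
    """Return the region subtag of a locale such as 'en-Latn-US', or None."""
    for subtag in locale.split("-")[1:]:
        is_alpha_region = len(subtag) == 2 and subtag.isalpha()
        is_numeric_region = len(subtag) == 3 and subtag.isdigit()
        if is_alpha_region or is_numeric_region:
            return subtag.upper()
    return None


def hour_cycles_for(locale):
    # 1. The locale itself may be a key.
    if locale in HOUR_CYCLES:
        return HOUR_CYCLES[locale]
    # 2. Try the locale's own region.
    region = region_of(locale)
    if region in HOUR_CYCLES:
        return HOUR_CYCLES[region]
    # 3. Otherwise maximize the locale and try the resulting region.
    maximized = LIKELY_SUBTAGS.get(locale)
    if maximized is not None and region_of(maximized) in HOUR_CYCLES:
        return HOUR_CYCLES[region_of(maximized)]
    # 4. Fall back to the "001" (world) region.
    return HOUR_CYCLES["001"]


def default_hour_cycle(locale):
    """The default is the first hour cycle assigned to the locale."""
    return hour_cycles_for(locale)[0]
```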
2021-11-29 22:48:46 +00:00
Timothy Flynn
7872934861 LibUnicode: Parse and generate available candidate format patterns
These formats are used by ECMA-402 when neither a date nor time style is
specified. In that case, these patterns are searched for a best match.
2021-11-29 22:48:46 +00:00
Timothy Flynn
f471ecdbe9 LibUnicode: Parse and generate date, time, and date-time format patterns 2021-11-29 22:48:46 +00:00
Timothy Flynn
914675e826 LibJS+LibUnicode: Separate number formatting methods from Locale.h
Currently, we generate separate data files for locale and number format
related tables/methods, but provide public accessors for all of the data
in one Locale.h file. Rather than continuing this trend for date-time,
relative time, etc. formatting, it's a bit easier to reason about if the
public accessors are also in separate files.
2021-11-29 22:48:46 +00:00
Ben Wiederhake
b06b54772e Meta+LibUnicode: Provide code point names through library 2021-11-20 00:31:55 +01:00
Timothy Flynn
cafb717486 LibUnicode: Parse and generate CLDR unit data for Intl.NumberFormat
The units data is in another CLDR package, cldr-units.
2021-11-16 23:14:09 +00:00
Timothy Flynn
80493908d3 LibUnicode: Tweak the definition of the plurality "many"
As noted at the top of this method, this is a naive implementation of
the Unicode plurality specification. But for now, we should tweak the
definition of "many" to be "more than 2" (which is what I had in mind
when I wrote this, but forgot about fractions).
2021-11-16 23:14:09 +00:00
Timothy Flynn
04b8b87c17 LibJS+LibUnicode: Support multiple identifiers within format pattern
This wasn't the case for compact patterns, but unit patterns can contain
multiple (up to 2, really) identifiers that must each be recognized by
LibJS.

Each generated NumberFormat object now stores an array of identifiers
parsed. The format pattern itself is encoded with the index into this
array for that identifier, e.g. the compact format string "0K" will
become "{number}{compactIdentifier:0}".
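
A rough Python sketch of this encoding, assuming a simplified pattern grammar in which runs of "0" are the number placeholder and any other non-space run is an identifier. The helper name `encode_pattern` is hypothetical, and real CLDR patterns have quoting rules that this ignores.

```python
import re


def encode_pattern(pattern):
    """Return (encoded_pattern, identifiers) for a compact format string."""
    identifiers = []

    def replace(match):
        text = match.group(0)
        if text.strip("0") == "":
            # A run of zeros marks where the number goes.
            return "{number}"
        if text not in identifiers:
            identifiers.append(text)
        # Encode the identifier as an index into the identifiers array.
        return "{compactIdentifier:%d}" % identifiers.index(text)

    encoded = re.sub(r"0+|[^0\s]+", replace, pattern)
    return encoded, identifiers
```

With this sketch, "0K" encodes to "{number}{compactIdentifier:0}" with the identifier list ["K"].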
2021-11-16 23:14:09 +00:00
Timothy Flynn
3b68370212 LibJS+LibUnicode: Rename the generated compact_identifier to identifier
This field is currently used to store the StringView into the compact
name/symbol in the format string. Units will need to store a similar
field, so rename the field to be more generic, and extract the parser
for it.
2021-11-16 23:14:09 +00:00
Timothy Flynn
6d34a0b4e8 LibJS+LibUnicode: Rename method to select a NumberFormat plurality
Instead of currency pattern lookups within select_currency_unit_pattern,
rename the method to select_pattern_with_plurality and accept any list
of patterns. This method will be needed for units.
2021-11-16 23:14:09 +00:00
Timothy Flynn
1f546476d5 LibJS+LibUnicode: Fix computation of compact pattern exponents
The compact scale of each formatting rule was precomputed in commit:
be69eae651

Using the formula: compact scale = magnitude - pattern scale

This computation was off-by-one.

For example, consider the format key "10000-count-one", which maps to
"00 thousand" in en-US. What we are really after is the exponent that
best represents the string "thousand" for values greater than 10000
and less than 100000 (the next format key). We were previously doing:

    log10(10000) - "00 thousand".count("0") = 2

Which clearly isn't what we want. Instead, if we do:

    log10(10000) + 1 - "00 thousand".count("0") = 3

We get the correct exponent for each format key for each locale.

This commit also renames the generated variable from "compact_scale" to
"exponent" to match the terminology used in ECMA-402.
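
The corrected formula can be checked with a few lines of Python. This is only a sketch: `compact_exponent` is a hypothetical helper, and it assumes the format key always begins with a power of ten, so its base-10 logarithm is the digit count minus one.

```python
def compact_exponent(format_key, pattern):
    """Exponent for a key such as '10000-count-one' and its pattern."""
    digits = format_key.split("-")[0]
    log10_of_key = len(digits) - 1  # exact for powers of ten
    return log10_of_key + 1 - pattern.count("0")
```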
2021-11-16 00:56:55 +00:00
Timothy Flynn
48d5684780 LibUnicode: Parse compact identifiers and replace them with a format key
For example, in en-US, the decimal, long compact pattern for numbers
between 10,000 and 100,000 is "00 thousand". In that pattern, "thousand"
is the compact identifier, and the generated format pattern is now
"{number} {compactIdentifier}". This also generates that identifier as
its own field in the NumberFormat structure.
2021-11-16 00:56:55 +00:00
Timothy Flynn
30fbb7d9cd LibUnicode: Parse and generate scientific formatting rules 2021-11-14 17:00:35 +00:00
Timothy Flynn
3b7f5af042 LibUnicode: Generate primary and secondary number grouping sizes
Most locales have a single grouping size (the number of integer digits
to be written before inserting a grouping separator). However some have
a primary and secondary size. We parse the primary size as the size used
for the least significant integer digits, and the secondary size for the
most significant.
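
As an illustration, grouping sizes can be read off a decimal pattern by measuring the comma-separated groups of its integer part. The Python sketch below assumes simple patterns such as "#,##0.###" and "#,##,##0.###"; `grouping_sizes` is a hypothetical name.

```python
def grouping_sizes(pattern):
    """Return (primary, secondary) grouping sizes for a decimal pattern."""
    integer_part = pattern.split(".")[0]
    groups = integer_part.split(",")
    if len(groups) < 2:
        return 0, 0  # no grouping separator in the pattern
    # The rightmost group covers the least significant integer digits.
    primary = len(groups[-1])
    # With only one separator, the secondary size equals the primary size.
    secondary = len(groups[-2]) if len(groups) > 2 else primary
    return primary, secondary
```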
2021-11-14 10:35:19 +00:00
Timothy Flynn
c65dea64bd LibJS+LibUnicode: Don't remove {currency} keys in GetNumberFormatPattern
In order to implement Intl.NumberFormat.prototype.formatToParts, do not
replace {currency} keys in the format pattern before ECMA-402 tells us
to. Otherwise, the array returned by formatToParts will not contain the
expected currency key.

Early replacement was done to avoid resolving the currency display more
than once, as it involves a couple of round trips to search through
LibUnicode data. So this adds a non-standard method to NumberFormat to
do this resolution and cache the result.

Another side effect of this change is that LibUnicode must replace unit
format patterns of the form "{0} {1}" during code generation. These were
previously skipped during code generation because LibJS would just
replace the keys with the currency display at runtime. But now that the
currency display injection is delayed, any {0} or {1} keys in the format
pattern will cause PartitionNumberPattern to abort.
2021-11-13 19:01:25 +00:00
Timothy Flynn
0c9711efba LibUnicode: Handle all space code points when creating currency patterns
Previously, we were checking if the code point immediately before/after
the {currency} key was U+00A0 (non-breaking space). Instead, to handle
other spacing code points, we must check if the surrounding code point
has the separator general category.
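
For illustration, the same check can be expressed with Python's standard `unicodedata` module, where the separator general category covers Zs, Zl, and Zp. The helper names are hypothetical and this is not the LibUnicode implementation.

```python
import unicodedata


def is_separator(character):
    """True if the character's general category is Zs, Zl, or Zp."""
    return unicodedata.category(character).startswith("Z")


def spacing_around_currency(pattern):
    """Return (separator_before, separator_after) for the {currency} key."""
    key = "{currency}"
    index = pattern.index(key)
    before = pattern[index - 1] if index > 0 else ""
    after_index = index + len(key)
    after = pattern[after_index] if after_index < len(pattern) else ""
    return (
        bool(before) and is_separator(before),
        bool(after) and is_separator(after),
    )
```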
2021-11-13 19:01:25 +00:00
Timothy Flynn
ada4bab405 LibUnicode: Remove GeneralCategory::Symbol string lookup
When I originally wrote this method, I had it in LibJS, where we can't
refer to the GeneralCategory enumeration directly. This is a big TODO:
anyone outside of LibUnicode can't assume the generated enumerations
exist and must get these values by string lookup. But this function
ended up living in LibUnicode, which can reference the enumeration.
2021-11-13 19:01:25 +00:00
Timothy Flynn
a701ed52fc LibJS+LibUnicode: Fully implement currency number formatting
Currencies are a bit strange; the layout of currency data in the CLDR is
not particularly compatible with what ECMA-402 expects. For example, the
currency format in the "en" and "ar" locales for the Latin script are:

    en: "¤#,##0.00"
    ar: "¤\u00A0#,##0.00"

Note how the "ar" locale has a non-breaking space after the currency
symbol (¤), but "en" does not. This does not mean that this space will
appear in the "ar"-formatted string, nor does it mean that a space won't
appear in the "en"-formatted string. This is a runtime decision based on
the currency display chosen by the user ("$" vs. "USD" vs. "US dollar")
and other rules in the Unicode TR-35 spec.

ECMA-402 shies away from the nuances here with "implementation-defined"
steps. LibUnicode will store the data parsed from the CLDR however it is
presented; making decisions about spacing, etc. will occur at runtime
based on user input.
2021-11-13 11:52:45 +00:00
Timothy Flynn
9421d5c0cf LibUnicode: Generate currency unit-pattern number formats
These are used when formatting a number as currency with a display
option of "name" (e.g. for USD, the name is "US Dollars" in en-US).

These patterns appear in the CLDR in a different manner than other
number formats that are pluralized. They are of the form "{0} {1}",
therefore do not undergo subpattern replacements.
2021-11-13 11:52:45 +00:00
Timothy Flynn
39e031c4dd LibJS+LibUnicode: Generate all styles of currency localizations
Currently, LibUnicode is only parsing and generating the "long" style of
currency display names. However, the CLDR contains "short" and "narrow"
forms as well that need to be handled. Parse these, and update LibJS to
actually respect the "style" option provided by the user for displaying
currencies with Intl.DisplayNames.

Note: There are some discrepancies between the engines on how style is
handled. In particular, running:

new Intl.DisplayNames('en', {type:'currency', style:'narrow'}).of('usd')

Gives:

  SpiderMonkey: "USD"
  V8: "US Dollar"
  LibJS: "$"

And running:

new Intl.DisplayNames('en', {type:'currency', style:'short'}).of('usd')

Gives:

  SpiderMonkey: "$"
  V8: "US Dollar"
  LibJS: "$"

My best guess is V8 isn't handling style, and just returning the long
form (which is what LibJS did before this commit). And SpiderMonkey can
handle some styles, but if they don't have a value for the requested
style, they fall back to the canonicalized code passed into of().
2021-11-13 11:52:45 +00:00
Timothy Flynn
1f2ac0ab41 LibUnicode: Move number formatting code generator to UnicodeNumberFormat 2021-11-12 20:46:38 +00:00
Timothy Flynn
be69eae651 LibUnicode: Precompute the compact scale of each number formatting rule
This will be needed for the ComputeExponentForMagnitude AO for compact
formatting, namely step 5b:

  Let exponent be an implementation- and locale-dependent (ILD) integer
  by which to scale a number of the given magnitude in compact notation
  for the current locale.
2021-11-12 09:17:08 +00:00
Timothy Flynn
230b133ee3 LibUnicode: Parse number formats into zero/positive/negative patterns
A number formatting pattern in the CLDR contains one or two entries,
delimited by a semi-colon. Previously, LibUnicode was just storing the
entire pattern as one string. This changes the generator to split the
pattern on that delimiter and generate the 3 unique patterns expected by
ECMA-402.

The rules for generating the 3 patterns are as follows:

* If the pattern contains 1 entry, it is the zero pattern. The positive
  pattern is the zero pattern prepended with {plusSign}. The negative
  pattern is the zero pattern prepended with {minusSign}.

* If the pattern contains 2 entries, the first is the zero pattern, and
  the second is the negative pattern. The positive pattern is the zero
  pattern prepended with {plusSign}.
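
The two rules above translate directly into a short Python sketch. `split_patterns` is a hypothetical name, and the pattern strings in the usage note are only examples.

```python
def split_patterns(pattern):
    """Split a CLDR number pattern into (zero, positive, negative) patterns."""
    entries = pattern.split(";")
    zero = entries[0]
    # The positive pattern is always derived from the zero pattern.
    positive = "{plusSign}" + zero
    if len(entries) == 2:
        # An explicit second entry is the negative pattern.
        negative = entries[1]
    else:
        negative = "{minusSign}" + zero
    return zero, positive, negative
```

For example, `split_patterns("#,##0.00;(#,##0.00)")` keeps the parenthesized second entry as the negative pattern instead of prepending {minusSign}.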
2021-11-12 09:17:08 +00:00
Timothy Flynn
1244ebcd4f LibUnicode: Parse and generate standard accounting formatting rules
Also known as "currency-accounting" in some CLDR documentation.
2021-11-12 09:17:08 +00:00
Timothy Flynn
967afc1b84 LibUnicode: Parse and generate standard currency formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
bffd73e0d4 LibUnicode: Parse and generate standard decimal formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
feb8c22a62 LibUnicode: Parse and generate standard percentage formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
4317a1b552 LibUnicode: Parse and generate compact currency formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
604a596c90 LibUnicode: Parse and generate compact decimal formatting rules 2021-11-12 09:17:08 +00:00
Timothy Flynn
12b468a588 LibUnicode: Begin parsing and generating locale number systems
The number system data in the CLDR contains information on how to format
numbers in a locale-dependent manner. Start parsing this data, beginning
with numeric symbol strings. For example the symbol NaN maps to "NaN" in
the en-US locale, and "非數值" in the zh-Hant locale.
2021-11-12 09:17:08 +00:00
Andreas Kling
8b1108e485 Everywhere: Pass AK::StringView by value 2021-11-11 01:27:46 +01:00
Timothy Flynn
d83b262e64 LibUnicode: Generate standalone compile-time array for combining class 2021-10-10 13:49:37 +02:00
Timothy Flynn
9f83774913 LibUnicode: Generate standalone compile-time array for special casing
There are only 112 code points with special casing rules, so this array
is quite small (compared to the 34,626-entry UnicodeData hash map that is
also storing this data). Removing all casing rules from UnicodeData will
happen in a subsequent commit.
2021-10-10 13:49:37 +02:00