1
Fork 0
mirror of https://github.com/RGBCube/serenity synced 2025-10-26 18:32:06 +00:00
Commit graph

209 commits

Author SHA1 Message Date
Timothy Flynn
32c07bc6c3 LibUnicode: Generate per-locale data for the "noon" fixed day period
Note that not all locales have this day period.
2022-07-21 20:36:03 +01:00
Timothy Flynn
0a6363d3e9 LibUnicode: Implement the range pattern processing algorithm
This algorithm is to inject spacing around the range separator under
certain conditions. For example, in en-US, the range [3, 5] should be
formatted as "3–5" if unitless, but as "$3 – $5" for currency.
2022-07-20 22:30:16 +01:00
Timothy Flynn
b2709f161e LibUnicode: Generate per-locale approximately & range separator symbols 2022-07-20 22:30:16 +01:00
Timothy Flynn
998f62936b LibUnicode: Remove obsolete Unicode::get_default_number_system
This has been superseded by get_preferred_keyword_value_for_locale,
which doesn't require allocating a Vector just to return its first
element.
2022-07-15 12:31:43 +02:00
Timothy Flynn
f8f7015419 LibUnicode: Generate a method to lookup locale-preferred keyword values 2022-07-15 12:31:43 +02:00
Timothy Flynn
80568d5776 LibUnicode: Generate a method to lookup available keyword values 2022-07-15 12:31:43 +02:00
Timothy Flynn
c2e5b20eb6 LibUnicode: Generate available values for the keywords co, kf, kn, hc
This also ensures we only include values we actually support in the
generated list of available values.
2022-07-15 12:31:43 +02:00
Timothy Flynn
a337b059dd LibUnicode: Parse and generate per-locale plural ranges 2022-07-12 00:43:34 +01:00
Timothy Flynn
f672b4c151 LibUnicode: Remove now-unused Unicode::select_pattern_with_plurality 2022-07-08 20:33:52 +02:00
Timothy Flynn
232df4196b LibUnicode: Replace NumberFormat::Plurality with Unicode::PluralCategory
To prepare for using plural rules within number & duration format, this
removes the NumberFormat::Plurality enumeration.

This also adds PluralCategory::ExactlyZero & PluralCategory::ExactlyOne.
These are used in locales like French, where PluralCategory::One really
means any value from 0.00 to 1.99. PluralCategory::ExactlyOne means only
the value 1, as the name implies. These exact rules are not known by the
general plural rules, they are explicitly for number / currency format.
2022-07-08 20:33:52 +02:00
Timothy Flynn
cc5c707649 LibJS+LibUnicode: Do not generate the PluralCategory enum
The PluralCategory enum is currently generated for plural rules. Instead
of generating it, this moves the enum to the public LibUnicode header.
While it was nice to auto-discover these values, they are well defined
by TR-35, and we will need their values from within the number format
code generator (which can't rely on the plural rules generator having
run yet). Further, number format will require additional values in the
enum that plural rules doesn't know about.
2022-07-08 20:33:52 +02:00
Timothy Flynn
bf85bf2a9e LibJS: Use Intl.PluralRules within Intl.RelativeFormat
The Polish test cases added here cover previous failures from test262,
due to the way that 0 is specified to be "many" in Polish.
2022-07-08 11:51:54 +02:00
Timothy Flynn
8aeacccd82 LibUnicode: Generate a list of available plural categories per locale
Separate lists are generated for cardinal and ordinal form.
2022-07-08 11:51:54 +02:00
Timothy Flynn
ea78bac36d LibUnicode: Parse and generate per-locale plural rules from the CLDR
Plural rules in the CLDR are of the form:

"cs": {
    "pluralRule-count-one": "i = 1 and v = 0 @integer 1",
    "pluralRule-count-few": "i = 2..4 and v = 0 @integer 2~4",
    "pluralRule-count-many": "v != 0 @decimal 0.0~1.5, 10.0, 100.0 ...",
    "pluralRule-count-other": "@integer 0, 5~19, 100, 1000, 10000 ..."
}

The syntax is described here:
https://unicode.org/reports/tr35/tr35-numbers.html#Plural_rules_syntax

There are up to 2 sets of rules for each locale, a cardinal set and an
ordinal set. The approach here is to generate a C++ function for each
set of rules. Each condition in the rules (e.g. "i = 1 and v = 0") is
transpiled to a C++ if-statement within its function. Then lookup tables
are generated to match locales to their generated functions.

NOTE: -Wno-parentheses-equality is added to the LibUnicodeData compile
flags because the generated plural rules have lots of extra parentheses
(because e.g. we need to selectively negate and combine rules). The code
to generate only exactly the right number of parentheses is quite hairy,
so this just tells the compiler to ignore the extras.
2022-07-08 11:51:54 +02:00
Timothy Flynn
12e7c0808a LibUnicode: Generate per-region week data
This includes:
* The minimum number of days in a week for that week to count as the
  first week of a new year.
* The day to be shown as the first day of the week in a calendar.
* The start/end days of the weekend.

Like the existing hour cycle data, week data is presented per-region in
the CLDR, rather than per-locale. The method to add likely subtags to a
locale to perform region lookups is the same.

The list of regions in the CLDR for hour cycle, minimum days, first day,
and weekend days are quite different. So rather than changing the
existing HourCycleRegion enum to a generic Region enum, we generate
separate enums for each of the week data fields. This allows each lookup
into these fields to remain simple array-based index access, without any
"jumps" for regions that don't have CLDR data for a field.
2022-07-06 16:56:42 +02:00
Timothy Flynn
4868b888be LibUnicode: Generate per-locale text layout information
Currently contains just each locale's character order, but is set up to
easily add other text layout fields from the CLDR if ECMA-402 eventually
requires them.
2022-07-06 16:56:42 +02:00
DexesTTP
7ceeb74535 AK: Use an enum instead of a bool for String::replace(all_occurences)
This commit has no behavior changes.

In particular, this does not fix any of the wrong uses of the previous
default parameter (which used to be 'false', meaning "only replace the
first occurence in the string"). It simply replaces the default uses by
String::replace(..., ReplaceMode::FirstOnly), leaving them incorrect.
2022-07-06 11:12:45 +02:00
Idan Horowitz
573061e76c LibUnicode: Extract the timeSeparator numeric symbol from CLDR
This will be used by Intl.DurationFormat
2022-07-01 01:00:05 +03:00
Timothy Flynn
1f2542247f LibUnicode: Upgrade to CLDR version 41.0.0
Release notes: https://cldr.unicode.org/index/downloads/cldr-41

Note that the HourCycleRegion enum now contains 272 entires, thus needs
to be bumped from u8 to u16.
2022-04-07 08:29:10 -04:00
Timothy Flynn
70ede2825e LibUnicode: Use BCP 47 data to filter valid calendar names 2022-02-16 07:23:07 -05:00
Timothy Flynn
71d86261c3 LibUnicode: Use BCP 47 data to filter valid numbering system names
There isn't too much of an effective difference here other than that the
BCP 47 data contains some aliases we would otherwise not handle.
2022-02-16 07:23:07 -05:00
Timothy Flynn
63c3437274 LibUnicode: Use BCP 47 data to generate available calendars and numbers
BCP 47 will be the single source of truth for known calendar and number
system keywords, and their aliases (e.g. "gregory" is an alias for
"gregorian"). Move the generation of available keywords to where we
parse the BCP 47 data, so that hard-coded aliases may be removed from
other generators.
2022-02-16 07:23:07 -05:00
Timothy Flynn
89ead8c00a LibJS+LibUnicode: Parse Unicode keywords from the BCP 47 CLDR package
We have a fair amount of hard-coded keywords / aliases that can now be
replaced with real data from BCP 47. As a result, the also changes the
awkward way we were previously generating keys. Before, we were more or
less generating keywords as a CSV list of keys, e.g. for the "nu" key,
we'd generate "latn,arab,grek" (ordered by locale preference). Then at
runtime, we'd split on the comma. We now just generate spans of keywords
directly.
2022-02-16 07:23:07 -05:00
thankyouverycool
0505e031f1 Meta+LibUnicode: Download and parse Unicode block properties
This parses Blocks.txt for CharacterType properties and creates
a global display array for use in apps.
2022-02-15 10:13:19 -05:00
Idan Horowitz
4967bcd4ce LibUnicode: Implement sentence segmentation 2022-01-31 21:05:04 +02:00
Idan Horowitz
a593a5c8ab LibUnicode: Implement word segmentation 2022-01-31 21:05:04 +02:00
Idan Horowitz
58b0eed6a7 LibUnicode: Implement grapheme segmentation 2022-01-31 21:05:04 +02:00
Idan Horowitz
2d50c08f34 LibUnicode: Download and parse {Grapheme,Word,Sentence} break props 2022-01-31 21:05:04 +02:00
Timothy Flynn
6efbafa6e0 Everywhere: Update copyrights with my new serenityos.org e-mail :^) 2022-01-31 18:23:22 +00:00
Timothy Flynn
bb0f548614 LibUnicode: Generate a list of available currencies 2022-01-31 00:32:41 +00:00
Timothy Flynn
481ced53d8 LibUnicode: Generate a list of available numbering systems 2022-01-31 00:32:41 +00:00
Timothy Flynn
ebd33e580b LibUnicode: Generate a list of available calendars 2022-01-31 00:32:41 +00:00
Timothy Flynn
f8892fdea2 LibUnicode: Templatize our naive implementation of plurality selection
As we didn't (and still don't) have Intl.PluralRules when we implemented
Intl.NumberFormat, we use a locale-unaware basic implementation to pick
a pattern based on a number's value. Templatize this method for now to
work other other format-like structures (will be used for relative-time
formatting).
2022-01-27 21:16:44 +00:00
Timothy Flynn
789f093b2e LibUnicode: Parse and generate relative-time format patterns
Relative-time format patterns are of one of two forms:

    * Tensed - refer to the past or the future, e.g. "N years ago" or
      "in N years".
    * Numbered - refer to a specific numeric value, e.g. "in 1 year"
      becomes "next year" and "in 0 years" becomes "this year".

In ECMA-402, tensed and numbered refer to the numeric formatting options
of "always" and "auto", respectively.
2022-01-27 21:16:44 +00:00
Timothy Flynn
2d2f713426 LibUnicode: Generate per-locale minimum grouping digit values
Previously, we were breaking up digits into groups without regard for
the locale's minimumGroupingDigits value in the CLDR. This value is 1 in
most locales, but is 2 in locales such as pl-PL. What this means is that
in those locales, the group separator should only be inserted if the
thousands group has at least 2 digits. So 1000 is formatted as "1,000"
in en-US, but "1000" in pl-PL. And 10000 is "10,000" in en-US and
"10 000" in pl-PL.
2022-01-27 20:30:52 +00:00
Timothy Flynn
bced4e9324 LibJS+LibUnicode: Convert Intl.ListFormat to use Unicode::Style
Remove ListFormat's own definition of the Style enum, which was further
duplicated by a generated ListPatternStyle enum with the same values.
2022-01-25 19:02:59 +00:00
Timothy Flynn
e261132e8b LibUnicode: Add helper methods to convert a Style to and from a string
This conversion is duplicated a few times in our Intl implementation, so
let's just define these once and be done with it.
2022-01-25 19:02:59 +00:00
Timothy Flynn
7f6edb7976 LibUnicode: Remove the Unicode::Style::Numeric value
It is unused.
2022-01-25 19:02:59 +00:00
Timothy Flynn
0a4430fc41 LibJS+LibTimeZone+LibUnicode: Remove direct linkage to LibTimeZone
This is no longer needed now that LibTimeZone is included within LibC.
Remove the direct linkage so that others do not mistakenly copy-paste
the CMakeLists text elsewhere.
2022-01-23 12:48:26 +00:00
Timothy Flynn
4400150cd2 LibJS+LibUnicode: Return the appropriate time zone name depending on DST 2022-01-19 21:20:41 +00:00
Timothy Flynn
70f49d0696 LibJS+LibTimeZone+LibUnicode: Indicate whether a time zone is in DST
Return whether the time zone is in DST during the provided time from
TimeZone::get_time_zone_offset,
2022-01-19 21:20:41 +00:00
Timothy Flynn
701b7810ba LibUnicode: Generate code point abbreviations 2022-01-18 15:13:25 +00:00
Timothy Flynn
c86f7a675d LibUnicode: Do not limit language display names to known locales
Currently, the UnicodeLocale generator collects a list of known locales
from the CLDR before processing language display names. For each locale,
the identifier is broken into language, script, and region subtags, and
we create a list of seen languages. When processing display names, we
skip languages we hadn't seen in that first step.

This is insufficient for language display names like "en-GB", which do
not have an locale entry in the CLDR, and thus are skipped. So instead,
create the list of known languages by actually reading through the list
of languages which have a display name.
2022-01-13 23:05:31 +01:00
Timothy Flynn
b0671ceb74 LibUnicode: Add a method to combine locale subtags into a display string
This is just a convenience wrapper around the underlying generated APIs.
2022-01-13 23:05:31 +01:00
Timothy Flynn
91acc2e9c5 LibUnicode: Parse and generate locale display patterns
These patterns indicate how to display locale strings when that locale
contains multiple subtags. For example, "en-US" would be displayed as
"English (United States)".
2022-01-13 23:05:31 +01:00
Timothy Flynn
8126cb2545 LibJS+LibUnicode: Remove unnecessary locale currency mapping wrapper
Before LibUnicode generated methods were weakly linked, we had a public
method (get_locale_currency_mapping) for retrieving currency mappings.
That method invoked one of several style-specific methods that only
existed in the generated UnicodeLocale.

One caveat of weakly linked functions is that every such function must
have a public declaration. The result is that each of those styled
methods are declared publicly, which makes the wrapper redundant
because it is just as easy to invoke the method for the desired style.
2022-01-13 13:43:57 +01:00
Timothy Flynn
0d75949827 LibUnicode: Parse and generate locale display names for date fields 2022-01-13 13:43:57 +01:00
Timothy Flynn
7f162c471d LibUnicode: Parse and generate locale display names for calendars
Note there's a bit of an unfortunate duplication in the calendar enum
generated by UnicodeLocale and the existing enum generated by
UnicodeDateTimeFormat. The former contains every calendar known to the
CLDR, whereas the latter contains the calendars we've actually parsed
for DateTimeFormat (currently only Gregorian). The new enum generated
here can be removed once DateTimeFormat knows about all calendars.
2022-01-13 13:43:57 +01:00
Timothy Flynn
c5138f0f2b LibUnicode: Parse number system digits from the CLDR
We had a hard-coded table of number system digits copied from ECMA-402.
Turns out these digits are in the CLDR, so let's parse the digits from
there instead of hard-coding them.
2022-01-12 10:49:07 +01:00
Timothy Flynn
d50f5e14f8 LibUnicode: Fall back to GMT offset when a time zone name is unavailable
The following table in TR-35 includes a web of fall back rules when the
requested time zone style is unavailable:
https://unicode.org/reports/tr35/tr35-dates.html#dfst-zone

Conveniently, the subset of styles supported by ECMA-402 (and therefore
LibUnicode) all either fall back to GMT offset or to a style that is
unsupported but itself falls back to GMT offset.
2022-01-11 23:56:35 +01:00