In particular, we implicitly required that the caller initializes the
returned instances themselves (solved by making
UniformBumpAllocator::allocate call the constructor), and BumpAllocator
itself cannot handle classes that are not trivially deconstructible
(solved by deleting the method).
Co-authored-by: Ali Mohammad Pur <ali.mpfard@gmail.com>
The implementation of transform_bytecode_repetition_min_max expects at
least min=1, so let's also optimise on our way out of a bug where
'x{0,}' would cause a crash.
Generate a sorted, compressed series of ranges in a match table for
character classes, and use a binary search to find the matches.
This is about a 3-4x speedup for character class match performance. :^)
It's very common to encounter single-character strings in JavaScript on
the web. We can make such strings significantly lighter by having a
1-character inline capacity on the Vectors.
Flagged by pvs-studio, it looks like these were intended to be passed by
reference originally, but it was missed. This avoids excessive argument
copy when searching / matching in the regex API.
Before:
Command: /usr/Tests/LibRegex/Regex --bench
Average time: 5998.29 ms (median: 5991, stddev: 102.18)
After:
Command: /usr/Tests/LibRegex/Regex --bench
Average time: 5623.2 ms (median: 5623, stddev: 86.25)
Previously we would've copied the bytecode instead of moving the chunks
around, use the fancy new DisjointChunks<T> abstraction to make that
happen automagically.
This decreases vector copies and uses of memmove() by nearly 10x :^)
Doing so would increase memory consumption by quite a bit, since many
useless copies of the checkpoints hashmap would be created and later
thrown away.
- Make sure that all the Repeat ops are reset (otherwise the operation
would not be correct when going over the Repeat op a second time)
- Make sure that all matches that are allowed to fail are backed by a
fork, otherwise the last failing fork would not have anywhere to
return to.
Fixes#9707.
For example, consider the following pattern:
new RegExp('\ud834\udf06', 'u')
With this pattern, the regex parser should insert the UTF-8 encoded
bytes 0xf0, 0x9d, 0x8c, and 0x86. However, because these characters are
currently treated as normal char types, they have a negative value since
they are all > 0x7f. Then, due to sign extension, when these characters
are cast to u64, the sign bit is preserved. The result is that these
bytes are inserted as 0xfffffffffffffff0, 0xffffffffffffff9d, etc.
Fortunately, there are only a few places where we insert bytecode with
the raw characters. In these places, be sure to treat the bytes as u8
before they are cast to u64.
Unfortunately, this requires a slight divergence in the way the capture
group names are stored. Previously, the generated byte code would simply
store a view into the regex pattern string, so no string copying was
required.
Now, the escape sequences are decoded into a new string, and a vector
of all parsed capture group names are stored in a vector in the parser
result structure. The byte code then stores a view into the
corresponding string in that vector.
This will allow regex::Lexer users to invoke GenericLexer consumption
methods, such as GenericLexer::consume_escaped_codepoint().
This also allows for de-duplicating common methods between the lexers.
This is primarily to be able to remove the GenericLexer include out of
Format.h as well. A subsequent commit will add AK::Result to
GenericLexer, which will cause naming conflicts with other structures
named Result. This can be avoided (for now) by preventing nearly every
file in the system from implicitly including GenericLexer.
Other changes in this commit are to add the GenericLexer include to
files where it is missing.
Currently, when we need to repeat an instruction N times, we simply add
that instruction N times in a for-loop. This doesn't scale well with
extremely large values of N, and ECMA-262 allows up to N = 2^53 - 1.
Instead, add a new REPEAT bytecode operation to defer this loop from the
parser to the runtime executor. This allows the parser to complete sans
any loops (for this instruction), and allows the executor to bail early
if the repeated bytecode fails.
Note: The templated ByteCode methods are to allow the Posix parsers to
continue using u32 because they are limited to N = 2^20.
This struct holds a counter for the number of executed operations, and
vectors for matches, captures groups, and named capture groups. Each of
the vectors is unused. Remove the struct and just keep a separate
counter for the executed operations.
Combining these into one list helps reduce the size of MatchState, and
as a result, reduces the amount of memory consumed during execution of
very large regex matches.
Doing this also allows us to remove a few regex byte code instructions:
ClearNamedCaptureGroup, SaveLeftNamedCaptureGroup, and NamedReference.
Named groups now behave the same as unnamed groups for these operations.
Note that SaveRightNamedCaptureGroup still exists to cache the matched
group name.
This also removes the recursion level from the MatchState, as it can
exist as a local variable in Matcher::execute instead.
Before the BumpAllocator OOB access issue was understood and fixed, the
chunk size was increased to 8MiB as a workaround in commit:
27d555bab0.
The issue is now resolved by: 0f1425c895.
We can reduce the chunk size to 2MiB, which has the added benefit of
reducing runtime of the RegExp.prototype.exec test.
The grammar for the ECMA-262 CharacterEscape is:
CharacterEscape[U, N] ::
ControlEscape
c ControlLetter
0 [lookahead ∉ DecimalDigit]
HexEscapeSequence
RegExpUnicodeEscapeSequence[?U]
[~U]LegacyOctalEscapeSequence
IdentityEscape[?U, ?N]
It's important to parse the standalone "\0 [lookahead ∉ DecimalDigit]"
before parsing LegacyOctalEscapeSequence. Otherwise, all standalone "\0"
patterns are parsed as octal, which are disallowed in Unicode mode.
Further, LegacyOctalEscapeSequence should also be parsed while parsing
character classes.
* Only alphabetic (A-Z, a-z) characters may be escaped with \c. The loop
currently parsing \c includes code points between the upper/lower case
groups.
* In Unicode mode, all invalid identity escapes should cause a parser
error, even in browser-extended mode.
* Avoid an infinite loop when parsing the pattern "\c" on its own.
In non-Unicode mode, the existing MatchState::string_position is tracked
in code units; in Unicode mode, it is tracked in code points.
In order for some RegexStringView operations to be performant, it is
useful for the MatchState to have a field to always track the position
in code units. This will allow RegexStringView methods (e.g. operator[])
to perform lookups based on code unit offsets, rather than needing to
iterate over the entire string to find a code point offset.
The current method of iterating through the string to access a code
point hurts performance quite badly for very large strings. The test262
test "RegExp/property-escapes/generated/Any.js" previously took 3 hours
to complete; this one change brings it down to under 10 seconds.