...by flattening the underlying bytecode chunks first.
Also avoid calling DisjointChunks::size() inside a loop.
This is a very significant improvement in performance, making the
compilation of a large regex with lots of alternatives take only ~100ms
instead of many minutes (I ran out of patience waiting for it) :^)
The instructions can have dependencies (e.g. Repeat), so only unify
equal blocks instead of consecutive instructions.
Fixes#11247.
Also adds the minimal test case(s) from that issue.
The initial `ForkStay` is only needed if the looping block has a
following block, if there's no following block or the following block
does not attempt to match anything, we should not insert the ForkStay,
otherwise we would be rewriting `a+` as `a*` by allowing the 'end' to be
executed.
Fixes#10952.
This isn't a complete conversion to ErrorOr<void>, but a good chunk.
The end goal here is to propagate buffer allocation failures to the
caller, and allow the use of TRY() with formatting functions.
Previously we were always choosing the "nothing special" code path, even
if the dollar symbol was at the end of the pattern (and therefore should
have been considered special).
Fix that by actually checking if the pattern end follows, and emitting
the correct instruction if necessary.
In particular, we implicitly required that the caller initializes the
returned instances themselves (solved by making
UniformBumpAllocator::allocate call the constructor), and BumpAllocator
itself cannot handle classes that are not trivially deconstructible
(solved by deleting the method).
Co-authored-by: Ali Mohammad Pur <ali.mpfard@gmail.com>
The implementation of transform_bytecode_repetition_min_max expects at
least min=1, so let's also optimise on our way out of a bug where
'x{0,}' would cause a crash.
Generate a sorted, compressed series of ranges in a match table for
character classes, and use a binary search to find the matches.
This is about a 3-4x speedup for character class match performance. :^)
It's very common to encounter single-character strings in JavaScript on
the web. We can make such strings significantly lighter by having a
1-character inline capacity on the Vectors.
Flagged by pvs-studio, it looks like these were intended to be passed by
reference originally, but it was missed. This avoids excessive argument
copy when searching / matching in the regex API.
Before:
Command: /usr/Tests/LibRegex/Regex --bench
Average time: 5998.29 ms (median: 5991, stddev: 102.18)
After:
Command: /usr/Tests/LibRegex/Regex --bench
Average time: 5623.2 ms (median: 5623, stddev: 86.25)
Previously we would've copied the bytecode instead of moving the chunks
around, use the fancy new DisjointChunks<T> abstraction to make that
happen automagically.
This decreases vector copies and uses of memmove() by nearly 10x :^)
Doing so would increase memory consumption by quite a bit, since many
useless copies of the checkpoints hashmap would be created and later
thrown away.
- Make sure that all the Repeat ops are reset (otherwise the operation
would not be correct when going over the Repeat op a second time)
- Make sure that all matches that are allowed to fail are backed by a
fork, otherwise the last failing fork would not have anywhere to
return to.
Fixes#9707.
For example, consider the following pattern:
new RegExp('\ud834\udf06', 'u')
With this pattern, the regex parser should insert the UTF-8 encoded
bytes 0xf0, 0x9d, 0x8c, and 0x86. However, because these characters are
currently treated as normal char types, they have a negative value since
they are all > 0x7f. Then, due to sign extension, when these characters
are cast to u64, the sign bit is preserved. The result is that these
bytes are inserted as 0xfffffffffffffff0, 0xffffffffffffff9d, etc.
Fortunately, there are only a few places where we insert bytecode with
the raw characters. In these places, be sure to treat the bytes as u8
before they are cast to u64.
Unfortunately, this requires a slight divergence in the way the capture
group names are stored. Previously, the generated byte code would simply
store a view into the regex pattern string, so no string copying was
required.
Now, the escape sequences are decoded into a new string, and a vector
of all parsed capture group names are stored in a vector in the parser
result structure. The byte code then stores a view into the
corresponding string in that vector.
This will allow regex::Lexer users to invoke GenericLexer consumption
methods, such as GenericLexer::consume_escaped_codepoint().
This also allows for de-duplicating common methods between the lexers.
This is primarily to be able to remove the GenericLexer include out of
Format.h as well. A subsequent commit will add AK::Result to
GenericLexer, which will cause naming conflicts with other structures
named Result. This can be avoided (for now) by preventing nearly every
file in the system from implicitly including GenericLexer.
Other changes in this commit are to add the GenericLexer include to
files where it is missing.
Currently, when we need to repeat an instruction N times, we simply add
that instruction N times in a for-loop. This doesn't scale well with
extremely large values of N, and ECMA-262 allows up to N = 2^53 - 1.
Instead, add a new REPEAT bytecode operation to defer this loop from the
parser to the runtime executor. This allows the parser to complete sans
any loops (for this instruction), and allows the executor to bail early
if the repeated bytecode fails.
Note: The templated ByteCode methods are to allow the Posix parsers to
continue using u32 because they are limited to N = 2^20.
This struct holds a counter for the number of executed operations, and
vectors for matches, captures groups, and named capture groups. Each of
the vectors is unused. Remove the struct and just keep a separate
counter for the executed operations.
Combining these into one list helps reduce the size of MatchState, and
as a result, reduces the amount of memory consumed during execution of
very large regex matches.
Doing this also allows us to remove a few regex byte code instructions:
ClearNamedCaptureGroup, SaveLeftNamedCaptureGroup, and NamedReference.
Named groups now behave the same as unnamed groups for these operations.
Note that SaveRightNamedCaptureGroup still exists to cache the matched
group name.
This also removes the recursion level from the MatchState, as it can
exist as a local variable in Matcher::execute instead.
Before the BumpAllocator OOB access issue was understood and fixed, the
chunk size was increased to 8MiB as a workaround in commit:
27d555bab0.
The issue is now resolved by: 0f1425c895.
We can reduce the chunk size to 2MiB, which has the added benefit of
reducing runtime of the RegExp.prototype.exec test.