This completely changes how HTMLToken stores its data. Previously,
space was allocated for all token types separately. Now, an
HTMLToken's data is stored in just a String, two booleans and a
Variant.
This change reduces sizeof(HTMLToken) from 68 to 32 bytes. It also
reduces raw tokenization time by around 20 to 50 percent, depending
on the page.
Full document parsing time (with HTMLDocumentParser, on a local HTML
page without any dependency files) is reduced by between 4 and 20
percent, depending on the page.
Since tokenizing an HTML page can easily generate 50,000 tokens or
more, the storage has been designed to avoid heap allocations where
possible, while still keeping the tokens small. The only tokens that
need to allocate on the heap are thus DOCTYPE tokens (at most one per
document) and tag tokens (but only if they have attributes). This
way, only around 5 percent of all generated tokens need to allocate
on the heap (not counting StringImpl allocations).
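A minimal sketch of the resulting layout (member names and the exact
Variant alternatives are assumptions, not the actual LibWeb code):

    #include <AK/OwnPtr.h>
    #include <AK/String.h>
    #include <AK/Variant.h>
    #include <AK/Vector.h>

    class HTMLToken {
    private:
        struct DoctypeData { /* name, public/system identifiers, ... */ };
        struct Attribute { String name; String value; };

        // Character, Comment and EndOfFile tokens fit entirely into
        // the String and the two flags; only DOCTYPE and
        // attribute-carrying tag tokens pay for a heap allocation.
        String m_string_data;
        bool m_self_closing { false };
        bool m_force_quirks { false };
        Variant<Empty, OwnPtr<DoctypeData>, OwnPtr<Vector<Attribute>>> m_data;
    };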
Since all interaction with the HTMLToken class now happens through
getters and setters, HTMLTokenizer and HTMLDocumentParser no longer
need direct access to its members.
This is in preparation for an upcoming change to HTMLToken's storage.
In contrast to the accessors for the other token types, this one
hands out a mutable reference, so users can easily modify parts of
the DoctypeData.
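For example, the accessor might look roughly like this (names and
signatures are assumptions, not the exact LibWeb code):

    // Unlike the read-only getters for the other token types, this
    // hands out a mutable reference, creating the heap-allocated
    // DoctypeData on first access.
    DoctypeData& HTMLToken::ensure_doctype_data()
    {
        VERIFY(is_doctype());
        if (!m_data.has<OwnPtr<DoctypeData>>())
            m_data = OwnPtr<DoctypeData>(make<DoctypeData>());
        return *m_data.get<OwnPtr<DoctypeData>>();
    }

A caller can then mutate individual fields directly, e.g.
token.ensure_doctype_data().force_quirks = true;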
Previously, HTMLToken would expose the Vector<Attribute> directly to
its users. In preparation for a future change, all users now use
implementation-agnostic APIs which do not expose the Vector directly.
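The replacement surface might look roughly like this (method names
are assumptions):

    // Implementation-agnostic attribute access on HTMLToken: callers
    // add, count and iterate without ever seeing the Vector.
    void add_attribute(Attribute);
    size_t attribute_count() const;
    String attribute(FlyString const& name) const;
    void for_each_attribute(Function<IterationDecision(Attribute const&)>) const;

    // Usage:
    token.for_each_attribute([&](auto& attribute) {
        handle(attribute.name, attribute.value);
        return IterationDecision::Continue;
    });

If the Vector is later replaced by another container (or folded into
the Variant mentioned above), none of the call sites have to change.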
This fixes parsing of the following regular expression: /</g;
It also adds a simple script element containing that specific regex
to the HTMLTokenizer regression test.
`Element::tag_name` returns an uppercase version of the tag name,
while the `Web::HTML::TagNames` values are all lowercase. This change
fixes that by using `Element::local_name`, which returns a lowercase
value.
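Illustration, for an HTML <div> element:

    element.tag_name();   // "DIV"
    element.local_name(); // "div"

    // Comparisons against Web::HTML::TagNames values therefore need
    // local_name(); tag_name() would never match.
    if (element.local_name() == HTML::TagNames::div) { /* ... */ }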
Previously, it did not do so, and some code relied on that behavior.
In particular, set_caption, set_t_head and set_t_foot in
HTMLTableElement relied on it. Since this commit is not meant to fix
that, I added an assertion to make it equivalent to a reference for
now.
This allows you to invoke the HTML document parser and retrieve a
document as though it were loaded as a web page, minus any scripting
ability.
This does not currently support XML parsing.
This is used by YouTube (or more accurately, Web Components Polyfills)
to polyfill templates.
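A rough sketch of what driving the parser this way can look like
(helper name, constructor and run() signature are assumptions, not
the actual LibWeb API):

    // Parse markup into a fresh document, as if it had been loaded
    // as a page, but without ever running scripts.
    NonnullRefPtr<DOM::Document> parse_html_document(StringView markup)
    {
        auto document = DOM::Document::create();
        HTML::HTMLDocumentParser parser(document, markup, "UTF-8");
        parser.run(URL("about:blank"));
        return document;
    }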
And use them to highlight JavaScript in HTML source.
This commit also changes how TextDocumentSpan::data is interpreted:
it used to be an opaque pointer, but everyone stuffed enum values
into it, which made the values not unique between highlighters. That
field is now a u64 serial id.
Syntax highlighters don't need to change the way they stuff token
types into that field, but a highlighter that invokes another, nested
highlighter now needs to register the nested token types for use with
token pairs.
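Conceptually (illustrative declarations, not the actual
LibGUI/LibSyntax code):

    struct TextDocumentSpan {
        // Previously: void* data; highlighters cast enum values into
        // it, so two highlighters could produce clashing "pointers".
        // Now it is a plain number with a unique range per
        // registered highlighter.
        u64 data { 0 };
        // ...
    };

    // A highlighter still stuffs its token type straight in:
    span.data = static_cast<u64>(HtmlTokenType::AttributeName);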
This patch aims to fix wrong highlighting for some cases in HTML's
syntax highlighter. The values were determined somewhat
experimentally and are subject to change. Regardless, the
highlighting should be more correct with this patch than without
it. :^)
This changes the HTML SyntaxHighlighter to conform to the now-fixed
rendering of syntax highlighting spans in GUI::TextEditor. It also
avoids emitting tokens if they have a zero or negative length.
This fixes a bug where single-character tokens were not highlighted
properly.
This patch changes HTMLTokenizer::nth_last_position to not fail if
the requested position is not available; instead, it just returns
position (0,0). While this is not the correct solution, it prevents
the tokenizer from crashing just because it cannot find a source
position. This should only affect SyntaxHighlighter.
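The fallback amounts to something like this (member names are
assumptions):

    HTMLTokenizer::Position HTMLTokenizer::nth_last_position(size_t n)
    {
        if (n + 1 > m_source_positions.size()) {
            // Not enough recorded positions: hand back (0,0) instead
            // of crashing; only SyntaxHighlighter consumes these.
            return Position { 0, 0 };
        }
        return m_source_positions.at(m_source_positions.size() - 1 - n);
    }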
This replaces ctype.h with CharacterType.h everywhere I could find
issues with narrowing conversions. While using it will probably make
sense almost everywhere in the future, the most critical places should
have been addressed.
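The problem in a nutshell (the include path follows the header named
above; treat the helper shown here, is_ascii_digit, as one example of
its contents):

    #include <ctype.h>
    // ctype.h functions take an int: passing a char that happens to
    // be negative (e.g. a UTF-8 byte where char is signed) is
    // undefined behavior, and the result depends on the locale.
    bool old_way(char c) { return isdigit(c); }

    #include <AK/CharacterType.h>
    // The replacement takes a u32 code point: no narrowing
    // conversion, no locale, well-defined for any input.
    bool new_way(u32 code_point) { return is_ascii_digit(code_point); }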
We currently only support application/x-www-form-urlencoded for form
submissions, which uses a special percent-encode set when
percent-encoding the body/query. However, we were not using this
percent-encode set.
With the new URL implementation, we can now specify the
percent-encode set to be used, allowing us to use this special set.
This is one of the fixes needed to make the Google cookie consent work.
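For reference, the application/x-www-form-urlencoded percent-encode
set keeps only ASCII alphanumerics and *-._ and serializes spaces as
'+'. A self-contained sketch of that behavior (not LibWeb's actual
code):

    #include <AK/String.h>
    #include <AK/StringBuilder.h>

    String form_urlencode(StringView input)
    {
        StringBuilder builder;
        for (u8 byte : input.bytes()) {
            bool is_safe = (byte >= 'A' && byte <= 'Z')
                || (byte >= 'a' && byte <= 'z')
                || (byte >= '0' && byte <= '9')
                || byte == '*' || byte == '-' || byte == '.' || byte == '_';
            if (is_safe)
                builder.append(static_cast<char>(byte));
            else if (byte == ' ')
                builder.append('+'); // form-urlencoded special case
            else
                builder.appendff("%{:02X}", byte); // e.g. '&' -> "%26"
        }
        return builder.to_string();
    }

Encoding with a generic percent-encode set instead would leave
characters like '&' or '=' unescaped inside values, corrupting the
key=value structure of the submitted body.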