Now that we're using the new HTML parser, we don't have to do the weird
"run the script when inserted into the document, uhh, or when the text
content of the script element changes" dance.
Instead, we just follow the spec, and scripts run the way they should.
<a href="/foo&=bar"> was being tokenized into <a href="/foo&=bar">.
The spec mentions this but I had overlooked it. The bug happens because
we interpreted the "&" as a named character reference.
The only remaining client of the old parser is the fragment parser used
by the Element.innerHTML setter. We'll need to implement a bit more
stuff in the new parser before we can switch that over.
The more generic virtual variant is renamed to node_name() and now only
Element has tag_name(). This removes a huge amount of String ctor/dtor
churn in selector matching.
Get rid of the weird old signature:
- int StringType::to_int(bool& ok) const
And replace it with sensible new signature:
- Optional<int> StringType::to_int() const
This makes us at least parse selectors like [foo=bar\ baz] correctly.
The current solution here is quite hackish but the real fix will come
when we implement a spec-compliant CSS parser.
There was a logic mistake in the entity parser that chose the shorter
matching entity instead of the longer. Fix this and make the entity
lists constexpr while we're here.
This patch introduces support for more than just "absolute px" units in
our Length class. It now also supports "em" and "rem", which are units
relative to the font-size of the current layout node and the <html>
element's layout node respectively.
When handling a "textarea" start tag, we have to ignore the next token
if it's an LF ('\n'). However, we were not switching the tokenizer
state before fetching the lookahead token, and this caused us to force
the tokenizer into the RCDATA state too late, effectively getting it
stuck in that state for way longer than it should be.
Fixes#2508.
Oops, these were still using the byte-offset cursor. My goodness is it
unergonomic to index into UTF-8 strings, but Dr. Bugaev says it's good.
There is lots of room for improvement here. Just like the rest of the
tokenizer and parser. We'll have to do a few optimization passes over
them once they mature.
We already convert the input to UTF-8 before starting the tokenizer,
so all this patch had to do was switch the tokenizer to use an Utf8View
for its input (and to emit 32-bit codepoints.)
Now that we flush characters in a single place, we can call the Text's
children_changed() from there instead of having a goofy targeted hack
for <style> elements. :^)