Sometimes people put a '}' where it doesn't belong, or various other
things go wrong. 99% of the time, it's our fault, but either way,
this patch makes us not crash or infinite-loop in some common cases.
The real solution here is to write a proper CSS lexer-parser according
to the language spec, this is just a hack fix to make more sites load
at all.
In step 4 of the "renstruct the active formatting elements" algorithm it
says:
Rewind: If there are no entries before entry in the list of active
formatting elements, then jump to the step labeled create.
Prior to this patch, the implementation accorded to the spec only for
the first loop iteration.
Unless otherwise stated, we shouldn't stop parsing just because there's
a parse error, so let's allow ourselves to continue.
With this change, we can now tokenize and parse the ACID1 test. :^)
Now that we've gotten rid of the misguided character buffering in the
tokenizer, it actually spits out character tokens that we have to deal
with in the parser.
This patch implements enough to bring us back to speed with simple.html
When watching the video of the new HTML parser I noticed a small copy
and paste error. In one of the cases in `handle_after_head` the code
was checking for end tags when it should check for start tags.
I haven't tested this change, just looking at the spec.
The "audible bell" character ('\a' U+0007) was treated as whitespace
while the "line feed" character ('\n' U+000a) was not.
'\a' is no longer considered whitespace.
'\n' is now considered whitespace.
We can now parse a little DOM like this:
<!DOCTYPE html>
<html>
<head></head>
<body>
<div></div>
</body>
</html>
This is pretty slow work, but the incremental progress is satisfying!
This patch adds a new HTMLDocumentParser class. It keeps a tokenizer
object internally and feeds itself with one token at a time from it.
The names and idioms in this class are expressed as closely to the
actual HTML parsing spec as possible, to make development as easy
and bug free as possible. :^)
This is going to become pretty large, but it's pretty cool!
Instead of emitting data-bearing tokens immediately, do it lazily at
the next state change. This allows us to accumulate full bursts of
text in between tags instead of having one token per character. :^)
This makes it a compile error to omit the END_STATE. Also add some more
missing END_STATE's exposed by this (nice!)
Thanks to @predmond for suggesting the multi-pair trick! :^)
In order to actually view the web as it is, we're gonna need a proper
HTML parser. So let's build one!
This patch introduces the Web::HTMLTokenizer class, which currently
operates on a StringView input stream where it fetches (ASCII only atm)
codepoints and tokenizes acccording to the HTML spec tokenization algo.
The tokenizer state machine looks a bit weird but is written in a way
that tries to mimic the spec as closely as possible, in order to make
development easier and bugs less likely.
This initial version is far from finished, but it can parse a trivial
document with a DOCTYPE and open/close tags. :^)
Many properties can now have percentage values that get resolved in
layout. The reference value (what is this a percentage *of*?) differs
per property, so I've added a helper where you provide a reference
value as an added parameter to the existing length_or_fallback().
Currently we don't deal with them, so they shouldn't return a
SimpleSelector - that'd be a false positive.
Also don't produce a ComplexSelector if no SimpleSelector was parsed.
This fixes a couple of rendering issues on awesomekling.github.io:
link colours, footer size, content max-width (and possibly more!)