When watching the video of the new HTML parser I noticed a small copy
and paste error. In one of the cases in `handle_after_head` the code
was checking for end tags when it should check for start tags.
I haven't tested this change, just looking at the spec.
Add reasonable support for all values of white-space CSS property.
Values of the property are translated into a 3-tuple of rules:
do_collapse: whether whitespace is to be collapsed
do_wrap_lines: whether to wrap on word boundaries when
lines get too long
do_wrap_breaks: whether to wrap on linebreaks
The previously separate handling of per-line splitting and per-word
splitting have been unified. The Word structure is now a more
general Chunk, which represents different amounts of text depending
on splitting rules.
The "audible bell" character ('\a' U+0007) was treated as whitespace
while the "line feed" character ('\n' U+000a) was not.
'\a' is no longer considered whitespace.
'\n' is now considered whitespace.
We can now parse a little DOM like this:
<!DOCTYPE html>
<html>
<head></head>
<body>
<div></div>
</body>
</html>
This is pretty slow work, but the incremental progress is satisfying!
This patch adds a new HTMLDocumentParser class. It keeps a tokenizer
object internally and feeds itself with one token at a time from it.
The names and idioms in this class are expressed as closely to the
actual HTML parsing spec as possible, to make development as easy
and bug free as possible. :^)
This is going to become pretty large, but it's pretty cool!
When hit testing encountered a block with inline children, we assumed
that the inline children are nothing but text boxes. An inline-block
box is actually a block child of a block with inline children, so we
have to handle that scenario as well. :^)
Fixes#2353.
Instead of emitting data-bearing tokens immediately, do it lazily at
the next state change. This allows us to accumulate full bursts of
text in between tags instead of having one token per character. :^)
This makes it a compile error to omit the END_STATE. Also add some more
missing END_STATE's exposed by this (nice!)
Thanks to @predmond for suggesting the multi-pair trick! :^)
In order to actually view the web as it is, we're gonna need a proper
HTML parser. So let's build one!
This patch introduces the Web::HTMLTokenizer class, which currently
operates on a StringView input stream where it fetches (ASCII only atm)
codepoints and tokenizes acccording to the HTML spec tokenization algo.
The tokenizer state machine looks a bit weird but is written in a way
that tries to mimic the spec as closely as possible, in order to make
development easier and bugs less likely.
This initial version is far from finished, but it can parse a trivial
document with a DOCTYPE and open/close tags. :^)
This patch is unfortunately rather large and might make some things feel
bloated, but it is necessary to fix a few flaws in LibJS, primarily
blindly coercing values to numbers without exception checks - i.e.
interpreter.argument(0).to_i32(); // can fail!!!
Some examples where the interpreter would actually crash:
var o = { toString: () => { throw Error() } };
+o;
o - 1;
"foo".charAt(o);
"bar".repeat(o);
To fix this, we now have the following...
to_double(Interpreter&)
to_i32()
to_i32(Interpreter&)
to_size_t()
to_size_t(Interpreter&)
...and a whole lot of exception checking.
There's intentionally no to_double(), use as_double() directly instead.
This way we still can use these convenient utility functions but don't
need to check for exceptions if we are sure the value already is a
number.
Fixes#2267.
Passing a Heap& to it only to then call interpreter() on that is weird.
Let's just give it the Interpreter& directly, like some of the other
to_something() functions.
There are now two API's on Value:
- Value::to_string(Interpreter&) -- may throw.
- Value::to_string_without_side_effects() -- will never throw.
These are some pretty big sweeping changes, so it's possible that I did
some part the wrong way. We'll work it out as we go. :^)
Fixes#2123.