mirror of https://github.com/RGBCube/serenity synced 2025-05-30 22:48:11 +00:00
Commit graph

65 commits

Author SHA1 Message Date
Andreas Kling
39b5494aeb LibWeb: Implement the "after attribute name" tokenizer state
One little step at a time towards parsing the monster blob of HTML we
get from twitter.com :^)
2020-05-27 18:30:29 +02:00
Andreas Kling
1b0c39ca60 LibWeb: Handle more benign parse errors in the "in body" insertion mode 2020-05-27 18:30:29 +02:00
Andreas Kling
1de29e3f59 LibWeb: Implement the "self closing start tag" tokenizer state 2020-05-27 18:30:29 +02:00
Andreas Kling
a5ce09f8e3 LibWeb: Implement partial support for numeric character references 2020-05-27 18:30:27 +02:00
TheDumpap
c700a30ce8 LibWeb: Handle additional parser inputs in "initial" and "before html". 2020-05-27 11:10:54 +02:00
Andreas Kling
7ed80ae96c LibWeb: Make the CSS parser a little more tolerant to invalid CSS
Sometimes people put a '}' where it doesn't belong, or various other
things go wrong. 99% of the time, it's our fault, but either way,
this patch makes us not crash or infinite-loop in some common cases.

The real solution here is to write a proper CSS lexer-parser according
to the language spec, this is just a hack fix to make more sites load
at all.
2020-05-26 22:31:22 +02:00
Linus Groh
72c52466e0 LibWeb: Add more HTML entities
®, ß and all the lowercase and uppercase umlaut characters.
2020-05-26 22:23:09 +02:00
FalseHonesty
4e8bcda4d1 LibWeb: Add HTML copyright escape 2020-05-26 22:02:17 +02:00
Kevin Meyer
b85ab86c84 LibWeb: Fix step within reconstruct the active elements
Step 4 of the "reconstruct the active formatting elements" algorithm
says:
  Rewind: If there are no entries before entry in the list of active
  formatting elements, then jump to the step labeled create.

Prior to this patch, the implementation followed the spec only on
the first loop iteration.
2020-05-26 21:52:46 +02:00
Andreas Kling
ecd25ce6c7 LibWeb: Allow HTML tokenizer to emit more than one token
Tokens are now put on a queue when emitted, and we always pop from that
queue when returning from next_token().
2020-05-26 15:50:05 +02:00
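
The queueing behaviour described in ecd25ce6c7 can be sketched roughly as below. This is a minimal illustration, not the LibWeb code: Token, QueueingTokenizer, consume_one() and the toy rule that '!' emits two tokens are all assumptions made for the example. Only the idea (emit onto a queue, always pop from it in next_token()) comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <string>
#include <utility>

// Hypothetical token type; the real HTMLToken carries much more state.
struct Token {
    std::string data;
};

class QueueingTokenizer {
public:
    explicit QueueingTokenizer(std::string input)
        : m_input(std::move(input))
    {
    }

    // Always pops from the queue; the state machine only runs while the
    // queue is empty and there is input left.
    std::optional<Token> next_token()
    {
        while (m_queue.empty() && m_position < m_input.size())
            consume_one();
        if (m_queue.empty())
            return std::nullopt;
        Token token = std::move(m_queue.front());
        m_queue.pop_front();
        return token;
    }

private:
    // Toy rule (an assumption for illustration): '!' emits two tokens in a
    // single step, every other character emits one.
    void consume_one()
    {
        char c = m_input[m_position++];
        if (c == '!') {
            m_queue.push_back(Token { "bang" });
            m_queue.push_back(Token { "extra" });
        } else {
            m_queue.push_back(Token { std::string(1, c) });
        }
    }

    std::string m_input;
    std::size_t m_position { 0 };
    std::deque<Token> m_queue;
};
```

Because the caller only ever sees the front of the queue, a single state-machine step is free to emit as many tokens as it needs.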
FalseHonesty
b352a6b59d LibWeb: Implement vendor specific CSS color style for System Palette
Add "-libweb-palette-foo-bar" CSS color properties to allow CSS to
style itself using the currently selected System Theme.
2020-05-26 10:17:50 +02:00
Andreas Kling
1e30ef239b LibWeb: Start fleshing out the "in table" parser insertion mode 2020-05-25 20:30:34 +02:00
Andreas Kling
f62a8d3b19 LibWeb: Handle some more parser inputs in the "in head" insertion mode 2020-05-25 20:16:48 +02:00
Andreas Kling
50265858ab LibWeb: Add a PARSE_ERROR() macro to the new HTML parser
Unless otherwise stated, we shouldn't stop parsing just because there's
a parse error, so let's allow ourselves to continue.

With this change, we can now tokenize and parse the ACID1 test. :^)
2020-05-25 20:02:27 +02:00
Andreas Kling
406fd95f32 LibWeb: Flesh out the remaining DOCTYPE related tokenizer states
We can now parse public and system identifiers! Not super useful, but
at least we can do it :^)
2020-05-25 19:51:23 +02:00
Andreas Kling
556a6eea61 LibWeb: Checking for "DOCTYPE" should be case insensitive in tokenizer 2020-05-25 19:51:23 +02:00
Andreas Kling
1df2a3d8ce LibWeb: Use String::is_one_of() a bunch in the HTML parser 2020-05-25 19:51:23 +02:00
Andreas Kling
21b1aba03b LibWeb: Add missing copyright header 2020-05-25 00:25:33 +02:00
Andreas Kling
4cbe202d2c LibWeb: Finally parse enough that we can actually handle welcome.html!
We made it, at last! What a long journey this was. :^)
2020-05-24 23:54:22 +02:00
Andreas Kling
65d8d5e83e LibWeb: Yet more work towards parsing www/welcome.html :^) 2020-05-24 23:54:22 +02:00
Andreas Kling
45da08a1e6 LibWeb: A whole bunch of work towards spec-compliant <script> elements
This is still very unfinished, but there's at least a skeleton of code.
2020-05-24 23:54:22 +02:00
Andreas Kling
5d332c1f11 LibWeb: Parse enough to handle a <style> inside a <head> :^) 2020-05-24 23:54:22 +02:00
Andreas Kling
af8a9331b2 LibWeb: Support comments in the "in head" insertion mode 2020-05-24 23:54:22 +02:00
Andreas Kling
20911efd4d LibWeb: More work on the HTML parser and tokenizer
The parser can now switch the state of the tokenizer! Very webby. :^)
2020-05-24 23:54:22 +02:00
Andreas Kling
31db3f21ae LibWeb: Start implementing character token parsing
Now that we've gotten rid of the misguided character buffering in the
tokenizer, it actually spits out character tokens that we have to deal
with in the parser.

This patch implements enough to bring us back to speed with simple.html
2020-05-24 23:54:22 +02:00
Andreas Kling
53d2f4df70 LibWeb: Factor out the "stack of open elements" into its own class
This will allow us to write more expressive parsing code. :^)
2020-05-24 23:54:22 +02:00
Andreas Kling
96cc1138c0 LibWeb: Remove tokenizer's premature character buffering optimization 2020-05-24 23:54:22 +02:00
Daniel Gustafsson
6561987e9f LibWeb: Fix copy-paste error in HTMLDocumentParser (#2358)
When watching the video of the new HTML parser I noticed a small copy
and paste error. In one of the cases in `handle_after_head` the code
was checking for end tags when it should check for start tags.

I haven't tested this change, just looking at the spec.
2020-05-24 13:48:46 +02:00
Emanuele Torre
3f2158bbfe LibWeb: HtmlTokenizer.cpp: fix ON_WHITESPACE macro
The "audible bell" character ('\a' U+0007) was treated as whitespace
while the "line feed" character ('\n' U+000a) was not.

'\a' is no longer considered whitespace.
'\n' is now considered whitespace.
2020-05-24 09:47:28 +02:00
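
For reference, the whitespace set that the HTML spec's tokenizer states test for is tab, line feed, form feed and space, which is consistent with the fix in 3f2158bbfe. A minimal sketch follows; is_tokenizer_whitespace() is a hypothetical helper written for this example, not the ON_WHITESPACE macro itself.

```cpp
#include <cassert>

// Whitespace as tested by the HTML tokenizer states:
// U+0009 TAB, U+000A LINE FEED, U+000C FORM FEED, U+0020 SPACE.
// Note that '\a' (U+0007, audible bell) is not in this set.
constexpr bool is_tokenizer_whitespace(char c)
{
    return c == '\t' || c == '\n' || c == '\f' || c == ' ';
}
```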
Andreas Kling
e44c87cfff LibWeb: Implement enough HTML parsing to handle a small simple DOM :^)
We can now parse a little DOM like this:

<!DOCTYPE html>
<html>
    <head></head>
    <body>
        <div></div>
    </body>
</html>

This is pretty slow work, but the incremental progress is satisfying!
2020-05-24 00:49:22 +02:00
Andreas Kling
fd1b31d0ff LibWeb: Start building the tree building part of the new HTML parser
This patch adds a new HTMLDocumentParser class. It keeps a tokenizer
object internally and feeds itself with one token at a time from it.

The names and idioms in this class are expressed as closely to the
actual HTML parsing spec as possible, to make development as easy
and bug free as possible. :^)

This is going to become pretty large, but it's pretty cool!
2020-05-24 00:14:23 +02:00
Andreas Kling
e45c8b842c LibWeb: Implement a bit more of DOCTYPE tokenization 2020-05-23 21:08:25 +02:00
Andreas Kling
7be36366be LibWeb: Emit character/comment tokens lazily to accumulate more data
Instead of emitting data-bearing tokens immediately, do it lazily at
the next state change. This allows us to accumulate full bursts of
text in between tags instead of having one token per character. :^)
2020-05-23 18:44:32 +02:00
Andreas Kling
45450c7edc LibWeb: Make BEGIN_STATE and END_STATE include some {{{ and }}}
This makes it a compile error to omit the END_STATE. Also add some more
missing END_STATE's exposed by this (nice!)

Thanks to @predmond for suggesting the multi-pair trick! :^)
2020-05-23 15:25:43 +02:00
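
The brace trick from 45450c7edc can be sketched as follows. The macro bodies here are an assumption for illustration and may differ from the ones in LibWeb; State and step() are hypothetical names. The point is that BEGIN_STATE opens several braces and END_STATE closes the same number, so a forgotten END_STATE leaves the braces unbalanced and the file no longer compiles.

```cpp
#include <cassert>

enum class State { Data, TagOpen };

// Each BEGIN_STATE opens three braces and each END_STATE closes three.
// Omitting an END_STATE therefore leaves unclosed braces, which the
// compiler rejects, instead of silently falling through to the next case.
#define BEGIN_STATE(state) case State::state: { { {
#define END_STATE break; } } }

// Hypothetical transition function: returns the next state for one input
// character.
State step(State current, char c)
{
    State next = current;
    switch (current) {
        BEGIN_STATE(Data)
        if (c == '<')
            next = State::TagOpen;
        END_STATE

        BEGIN_STATE(TagOpen)
        if (c == '>')
            next = State::Data;
        END_STATE
    }
    return next;
}
```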
Andreas Kling
2e4147d0fc LibWeb: Add missing END_STATE for TagName
Fixes #2339.
2020-05-23 10:33:23 +02:00
Andreas Kling
a58500fdc5 LibWeb: Teach HTMLTokenizer how to tokenize comments
We can now correctly tokenize the welcome.html test page. :^)
2020-05-23 01:54:26 +02:00
Andreas Kling
6caa5661f3 LibWeb: Teach HTMLTokenizer how to tokenize attributes
Properly tokenize single-quoted, double-quoted and unquoted attributes!
2020-05-23 01:22:15 +02:00
Andreas Kling
004ef9a86b LibWeb: Minor tweaks to HTMLToken declaration 2020-05-22 23:45:02 +02:00
Andreas Kling
272b35d2e1 LibWeb: Begin work on a spec-compliant HTML parser
In order to actually view the web as it is, we're gonna need a proper
HTML parser. So let's build one!

This patch introduces the Web::HTMLTokenizer class, which currently
operates on a StringView input stream where it fetches (ASCII only atm)
codepoints and tokenizes according to the HTML spec tokenization algo.

The tokenizer state machine looks a bit weird but is written in a way
that tries to mimic the spec as closely as possible, in order to make
development easier and bugs less likely.

This initial version is far from finished, but it can parse a trivial
document with a DOCTYPE and open/close tags. :^)
2020-05-22 21:46:13 +02:00
Sergey Bugaev
c00076de82 LibWeb: Update the CSS prefix to -libweb 2020-05-21 14:15:49 +02:00
Andreas Kling
25cfdf3f67 LibWeb: Parse &quot; into '"' 2020-05-21 12:27:08 +02:00
Hüseyin ASLITÜRK
241df7206e LibWeb: HTML Parser, handle html escaped characters
Convert HTML escaped (&#XXX;) characters to string.
2020-05-21 01:19:42 +02:00
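
A rough sketch of what converting a decimal &#XXX; escape to a string involves, in the spirit of 241df7206e. Both function names are hypothetical and this is not the parser's actual code. It handles decimal references only and does not reject surrogate code points.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Parses a decimal reference of the form "&#NNN;" and returns its code point,
// or nothing if the text is malformed or the value exceeds U+10FFFF.
std::optional<uint32_t> parse_decimal_reference(std::string_view text)
{
    if (text.size() < 4 || text.substr(0, 2) != "&#" || text.back() != ';')
        return std::nullopt;
    uint32_t code_point = 0;
    for (char c : text.substr(2, text.size() - 3)) {
        if (c < '0' || c > '9')
            return std::nullopt;
        code_point = code_point * 10 + static_cast<uint32_t>(c - '0');
        if (code_point > 0x10FFFF)
            return std::nullopt;
    }
    return code_point;
}

// Encodes one code point as UTF-8 (1 to 4 bytes).
std::string encode_utf8(uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```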
Linus Groh
7bfd24ca76 LibWeb: Support the :root pseudo class 2020-05-14 08:49:51 +02:00
Linus Groh
2f29e61203 LibWeb: Make CSS pseudo classes case-insensitive 2020-05-14 08:49:51 +02:00
Linus Groh
cbd746e3ec LibWeb: Support "transparent" CSS color value 2020-05-13 19:25:49 +02:00
Linus Groh
57857cd8f6 LibWeb: Make parsing of most CSS values case-insensitive
These are all valid:

width: AUTO;
height: 10PX;
color: LiMeGrEeN;
2020-05-13 19:25:49 +02:00
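
Case-insensitive matching of CSS keywords, as in 57857cd8f6, comes down to an ASCII-only comparison. A minimal sketch, with helper names that are assumptions for this example rather than LibWeb or AK functions:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Lowercases ASCII letters only; every other character is left untouched.
constexpr char to_ascii_lowercase(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Compares two strings while ignoring ASCII case.
bool equals_ignoring_ascii_case(std::string_view a, std::string_view b)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (to_ascii_lowercase(a[i]) != to_ascii_lowercase(b[i]))
            return false;
    }
    return true;
}
```

With this, "AUTO", "10PX" and "LiMeGrEeN" compare equal to their lowercase spellings.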
Andreas Kling
5f9d80d8bc LibWeb: Add basic support for CSS percentages
Many properties can now have percentage values that get resolved in
layout. The reference value (what is this a percentage *of*?) differs
per property, so I've added a helper where you provide a reference
value as an added parameter to the existing length_or_fallback().
2020-05-11 23:07:30 +02:00
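
The reference-value idea from 5f9d80d8bc might look roughly like this. The Length struct and the exact signature are assumptions made for illustration; only the notion of passing a reference value to length_or_fallback() comes from the commit message.

```cpp
#include <cassert>

// Hypothetical, simplified length type.
struct Length {
    enum class Type { Auto, Px, Percentage };
    Type type { Type::Auto };
    float value { 0 };
};

// Resolves a length to pixels. Percentages are taken relative to the
// caller-supplied reference value, which differs per property.
float length_or_fallback(const Length& length, float fallback, float reference_for_percentage)
{
    switch (length.type) {
    case Length::Type::Px:
        return length.value;
    case Length::Type::Percentage:
        return length.value * reference_for_percentage / 100.0f;
    case Length::Type::Auto:
        break;
    }
    return fallback;
}
```

For example, a 50% length resolved against a reference value of 200 yields 100 pixels.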
Linus Groh
673527d314 LibWeb: Ignore parsed pseudo-element selectors & empty complex selectors
Currently we don't deal with them, so they shouldn't return a
SimpleSelector - that'd be a false positive.

Also don't produce a ComplexSelector if no SimpleSelector was parsed.

This fixes a couple of rendering issues on awesomekling.github.io:
link colours, footer size, content max-width (and possibly more!)
2020-05-11 10:48:54 +02:00
Andreas Kling
8a40294f42 LibWeb: Turn some HTML entities into nicer text in the parser 2020-05-05 15:50:28 +02:00
Andreas Kling
6676f2c259 LibWeb: Don't emit a simple selector if nothing was consumed 2020-05-05 15:50:28 +02:00