We still don't handle non-ASCII input correctly, but at least now we'll
convert e.g ISO-8859-1 to UTF-8 before starting to tokenize.
This patch also makes "view source" work with the new parser. :^)
This is enough to parse the Google front page! (Note: I did have to
hack the tokenizer while parsing Google, in order to avoid named
character references screwing everything up. We'll fix that too soon
enough!)
While we're still supporting both the old and the new parser, we have
to deal with the way they load inline stylesheet (and scripts) a bit
differently.
The old parser loads all the text content up front, and then notifies
the containing element. The new parser creates the containing element
up front and appends text inside it afterwards.
For now, we simply do an empty "children_changed" notification when
first inserting a text node inside an element. This at least prevents
the CSS parser from choking on a single-character stylesheet.
The AAA is a somewhat daunting algorithm you have to run for certain
tag when inserted inside the <body> element. The purpose of it is to
resolve issues with mismatched tags.
This patch implements the first half of the AAA. We also move the
"list of active formatting elements" to its own class, since it kept
accumulating little behaviors. "Marker" entries are now signified by
null Element pointers in the list.
You can now pass "-n" to the browser to use the new HTML parser.
It's not turned on by default since it's still very immature, but this
is a huge step towards bringing it into maturity. :^)
Previously, the Object class had many different types of functions for
each action. For example: get_by_index, get(PropertyName),
get(FlyString). This is a bit verbose, so these methods have been
shortened to simply use the PropertyName structure. The methods then
internally call _by_index if necessary. Note that the _by_index
have been made private to enforce this change.
Secondly, a clear distinction has been made between "putting" and
"defining" an object property. "Putting" should mean modifying a
(potentially) already existing property. This is akin to doing "a.b =
'foo'".
This implies two things about put operations:
- They will search the prototype chain for setters and call them, if
necessary.
- If no property exists with a particular key, the put operation
should create a new property with the default attributes
(configurable, writable, and enumerable).
In contrast, "defining" a property should completely overwrite any
existing value without calling setters (if that property is
configurable, of course).
Thus, all of the many JS objects have had any "put" calls changed to
"define_property" calls. Additionally, "put_native_function" and
"put_native_property" have had their "put" replaced with "define".
Finally, "put_own_property" has been made private, as all necessary
functionality should be exposed with the put and define_property
methods.
Instead of creating extremely common FlyStrings like "id" and "class"
on demand every time they are needed, we now have AttributeNames.h,
which provides Web::HTML::AttributeNames::{id,class_}
This avoids a bunch of string allocations during selector matching.
Instead of string splitting every time you call Element::has_class(),
we now split the "class" attribute value when it changes, and cache
the individual classes as FlyStrings in Element::m_classes.
This makes has_class() significantly faster and moves the pain point
of selector matching somewhere else.
Sometimes people put a '}' where it doesn't belong, or various other
things go wrong. 99% of the time, it's our fault, but either way,
this patch makes us not crash or infinite-loop in some common cases.
The real solution here is to write a proper CSS lexer-parser according
to the language spec, this is just a hack fix to make more sites load
at all.
We now implement the somewhat fuzzy shrink-to-fit algorithm when laying
out inline-block elements with both block and inline children.
Shrink-to-fit works by doing two speculative layouts of the entire
subtree inside the current block, to compute two things:
1. Preferred minimum width: If we made a line break at every chance we
had, how wide would the widest line be?
2. Preferred width: We break only when explicitly told to (e.g "<br>")
How wide would the widest line be?
We then shrink the width of the inline-block element to an appropriate
value based on the above, taking the available width in the containing
block into consideration (sans all the box model fluff.)
To make the speculative layouts possible, plumb a LayoutMode enum
throughout the layout system since it needs to be respected in various
places.
Note that this is quite hackish and I'm sure there are smarter ways to
do a lot of this. But it does kinda work! :^)
In step 4 of the "renstruct the active formatting elements" algorithm it
says:
Rewind: If there are no entries before entry in the list of active
formatting elements, then jump to the step labeled create.
Prior to this patch, the implementation accorded to the spec only for
the first loop iteration.
This was causing very tall lines on many websites. We can now see the
section header thingy on google.com (although it's broken into lines
where it should not be..) :^)
And move canonicalized_path() to a static method on LexicalPath.
This is to make it clear that FileSystemPath/canonicalized_path() only
perform *lexical* canonicalization.
Unless otherwise stated, we shouldn't stop parsing just because there's
a parse error, so let's allow ourselves to continue.
With this change, we can now tokenize and parse the ACID1 test. :^)