1
Fork 0
mirror of https://github.com/RGBCube/serenity synced 2025-10-25 06:52:08 +00:00
Commit graph

66 commits

Author SHA1 Message Date
Shannon Booth
49eb3bfb1d LibWeb: Make Document::run_the_document_write_steps take a StringView
Which flows on down into HTMLTokenizer::insert_input_at_insertion_point.
2023-09-13 07:26:35 +02:00
Timothy Flynn
fea440055a LibWeb: Track the byte offset of an HTMLToken's position
We currently track the [line, column] position of every HTMLToken, as
this is what is needed for LibGUI's syntax highlighting. Some non-LibGUI
purposes (e.g. highlighting HTML with HTML) require a byte offset. Track
both during tokenization.
2023-08-29 08:11:11 -04:00
Timothy Flynn
5a2bf7fdd1 LibWeb: Set the correct end position of HTML attribute names
We were previously setting the end position of attribute names in self-
closing HTML tags to the end of the attribute value. To illustrate the
previous behavior, consider this tag and its attribute's start and end
positions (shown inclusively below):

    <meta charset="UTF-8" />
          ^ name start
                  ^ value start
                        ^ value end
                        ^ name end

Rather than setting the end position of the attribute name when we parse
the closing slash, ensure the end position is already set while we are
in the AttributeName state. We now have:

    <meta charset="UTF-8" />
          ^ name start
                ^ name end
                  ^ value start
                        ^ value end

The tokenizer unit test has been extended to test these positions.
2023-08-25 08:22:24 +02:00
Timothy Flynn
5b2bc90b50 LibWeb: Set consistent positions for the start and end of HTML tags
To illustrate the previous behavior, consider these tags and their start
and end positions (shown inclusively below):

    Start tag:    End tag:
    <span>        </span>
     ^ start       ^ start
         ^end           ^end

The start position of a tag is the first ASCII-alpha code point after
the opening brace. The start position of a close tag is the slash just
before the first ASCII-alpha code point. And the end position of both
is the closing brace. So the opening brace is not included in the
emitted tag, but the closing brace is. And the end tag including the
slash is an oddity that had to be worked around in its only use case
(syntax highlighting).

We now consistently exclude the braces from the emitted tag, and also
exclude the slash from the end tag, so that it does not need to be
accounted for in syntax highlighting. That is, we now have:

    Start tag:    End tag:
    <span>        </span>
     ^ start        ^ start
        ^end           ^end

The tokenizer unit test has been extended to test these positions.
2023-08-25 08:22:24 +02:00
Andreas Kling
70db40c9b0 LibWeb: Don't include Layout/Node.h from DOM/Element.h
This required moving the CSS::StyleProperty destruct out of line.
2023-05-08 09:29:44 +02:00
Sam Atkins
2db168acc1 LibTextCodec+Everywhere: Port Decoders to new Strings 2023-02-19 17:15:47 +01:00
Sam Atkins
f2a9426885 LibTextCodec+Everywhere: Return Optional<Decoder&> from decoder_for() 2023-02-19 17:15:47 +01:00
Linus Groh
6e7459322d AK: Remove StringBuilder::build() in favor of to_deprecated_string()
Having an alias function that only wraps another one is silly, and
keeping the more obvious name should flush out more uses of deprecated
strings.
No behavior change.
2023-01-27 20:38:49 +00:00
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Andreas Kling
c79e8aab0a LibWeb: Make ON_WHITESPACE less heavy in HTML tokenizer
Once we know that the current code point is an ASCII character, we can
just check if it's one of the HTML whitespace characters.

Before this patch, we were using the generic StringView::contains(u32)
path that splats a code point into a StringBuilder and then searches
for it with memmem().

This reduces time spent in the HTML tokenizer from 16% to 6% when
loading the ECMA-262 spec.
2022-11-05 00:31:11 +01:00
Andreas Kling
ab8432783e LibWeb: Implement aborting the HTML parser
This is roughly on-spec, although I had to invent a simple "aborted"
state for the tokenizer.
2022-09-20 23:44:59 +02:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
stelar7
e547f5887e LibWeb: Fix Array OOBs in the HTMLTokenizer
Accessing last() if there are no elements makes WebContent crash :^)
2022-06-03 12:29:11 +01:00
Andreas Kling
1061c863f8 LibWeb: Fix issue where double-quoted doctype system ID was not captured
We were storing double-quoted system ID's in the public ID field.

1% progression on ACID3. :^)
2022-03-02 12:30:15 +01:00
Lorenz Steinert
db789813c9 LibWeb: Add basic support for dynamic markup insertion
This implements basic support for dynamic markup insertion, adding
 * Document::open()
 * Document::write(Vector<String> const&)
 * Document::writeln(Vector<String> const&)
 * Document::close()

The HTMLParser is modified to make it possible to create a
script-created parser which initially only contains a HTMLTokenizer
without any data. Aditionally the HTMLParser::run method gains an
overload which does not modify the Document and does not run
HTMLParser::the_end() so that we can reenter the parser at a later time.
Furthermore all FIXMEs that consern the insertion point are implemented
wich is defined in the HTMLTokenizer. Additionally the following
member-variables of the HTMLParser are now exposed by getter funcions:
 * m_tokenizer
 * m_aborted
 * m_script_nesting_level

The HTMLTokenizer is modified so that it contains an insertion
point which keeps track of where the next input from the Document::write
functions will be inserted. The insertion point is implemented as the
charakter offset into m_decoded_input and a boolean describing if the
insertion point is defined. Functions to update, check and {re}store the
insertion point are also added.
The function HTMLTokenizer::insert_eof is added to tell a script-created
parser that document::close was called and HTMLParser::the_end() should
be called.
Lastly an explicit default constructor is added to HTMLTokenizer to
create a empty HTMLTokenizer into which data can be inserted.
2022-02-21 18:26:43 +01:00
Adam Hodgen
b6eaefa87d LibWeb: Fix 'Comment end state' in HTML Tokenizer
Also, update the expected hash in the LibWeb TestHTMLTokenizer
regression test.

This is due to the "This comment has a few too many dashes." comment
token being updated.
2022-02-21 16:31:45 +01:00
Adam Hodgen
d73bb2633c LibWeb: Implement tokenization newline preprocessing
Newline normalization will replace \r and \r\n with \n.

The spec specifically states
> Before the tokenization stage, the input stream must be preprocessed
> by normalizing newlines.
wheras this is implemented the processing during the tokenization
itself.

This should still exhibit the same behaviour, while keeping the
tokenization logic in the same place.
2022-02-21 16:31:45 +01:00
Adam Hodgen
c6fcdd0f93 LibWeb: Fix off by one error in HTML Tokenizer
In 'NamedCharacterReference' we attempt to lookup the code point by a
identifier, eg apos; becomes '

This is done by passing the entire rest of the document to the
`HTML::code_points_from_entity` function.

However, before this change we didn't sent the final character which
meant if the document ended in a named character reference the lookup
would fail.
2022-02-21 16:31:45 +01:00
Andreas Kling
25504f6a1b LibWeb: Use Vector::clear_with_capacity() in HTMLTokenizer
This avoids constantly reallocating the Vector<HTMLToken>.
2022-02-19 14:45:59 +01:00
Linus Groh
892f6394b8 LibWeb: Implement state switch for "[CDATA[" in HTML parser 2022-02-15 23:24:34 +01:00
Linus Groh
f61fb08492 LibWeb: Add spec links to each HTML tokenizer state section
I didn't add full spec comments this time, but this is better than
nothing :^)
2022-02-15 23:24:34 +01:00
Karol Kosek
c157c2148f LibWeb: Don't emit current token on EOF in HTML Tokenizer
Emitting tokens on EOF caused an infinite loop, freezing the app, which
could be a bit annoying when writing an HTML comment at the end of
the file in Text Editor. :^)
2022-02-14 12:50:44 +03:30
Karol Kosek
fb5e2670d6 LibWeb: Fix highlighting HTML comments
Commit b193351a99 caused the HTML comments to flash when changing
the text cursor. Also, when double-clicking on a comment, the selection
started from the beginning of the file instead.

The following message was displaying when `TOKENIZER_TRACE_DEBUG`
was enabled:

    (Tokenizer::nth_last_position) Invalid position requested: 4th-last
    of 4. Returning (0-0).

Changing the `nth_last_position` to 3 fixes this. I'm guessing that's
because the parser is at that moment on the second hyphen of the `<!--`
string, so it has to go back only by three characters.
2022-02-14 12:50:44 +03:30
MacDue
b193351a99 LibWeb: Fix off-by-one in HTMLTokenizer::restore_to()
The difference should be between m_utf8_iterator and the
the new position, if m_prev_utf8_iterator is used one fewer
source position is popped than required.

This issue was not apparent on most pages since restore_to
used for tokens such  <!doctype> that are normally
followed by a newline that resets the column to zero,
but it can be seen on pages with minified HTML.
2022-02-13 14:51:09 +00:00
Sam Atkins
197759e30f LibWeb: Fix off-by-one error when highlighting unquoted HTML attributes
This fixes #11166
2021-12-10 21:27:13 +01:00
Andreas Kling
8b1108e485 Everywhere: Pass AK::StringView by value 2021-11-11 01:27:46 +01:00
Andreas Kling
f67648f872 LibWeb: Rename HTMLDocumentParser => HTMLParser 2021-09-25 23:36:43 +02:00
ovf
898b8ffcb6 LibWeb: Avoid assertion failure on parsing numeric character references 2021-07-28 18:32:22 +02:00
ovf
13c7d55320 LibWeb: Fix parsing of character references in attribute values 2021-07-27 00:03:43 +02:00
Max Wipfli
ccae0cae45 LibWeb: Rename HTMLToken::doctype_data() => ensure_doctype_data()
This renames the accessor to better reflect what it does, as this will
allocate a DoctypeData struct if there is none.
2021-07-17 16:24:57 +04:30
Max Wipfli
25cba4387b LibWeb: Add HTMLToken(Type) constructor and use it 2021-07-17 16:24:57 +04:30
Max Wipfli
f2e3c770f9 LibWeb: Use setter for HTMLToken::m_{start,end}_position 2021-07-17 16:24:57 +04:30
Max Wipfli
8b31e41692 LibWeb: Change HTMLToken::m_doctype into named DoctypeData struct
This is in preparation for an upcoming storage change of HTMLToken. In
contrast to the other token types, the accessor can hand out a mutable
reference to allow users to change parts of the DoctypeData easily.
2021-07-17 16:24:57 +04:30
Max Wipfli
918bde98b1 LibWeb: Hide implementation details of HTMLToken attribute list
Previously, HTMLToken would expose the Vector<Attribute> directly to
its users. In preparation for a future change, all users now use
implementation-agnostic APIs which do not expose the Vector directly.
2021-07-17 16:24:57 +04:30
Max Wipfli
15d8635afc LibWeb: User getter+setter for HTMLToken tag name and self-closing flag 2021-07-17 16:24:57 +04:30
Max Wipfli
1aeafcc58b LibWeb: Use getter and setter for Character type HTMLTokens
While storing the code point in a UTF-8 encoded String in horrendously
inefficient, this problem will be addressed at a later stage.
2021-07-17 16:24:57 +04:30
Max Wipfli
e8e9426b4f LibWeb: User getter and setter for Comment type HTMLTokens 2021-07-17 16:24:57 +04:30
Max Wipfli
f886aa15b8 LibWeb: Rename HTMLToken::AttributeBuilder struct to Attribute
This does not contain StringBuilders anymore, so it can do with a
simpler name: Attribute.
2021-07-17 16:24:57 +04:30
Max Wipfli
e22a34badb LibWeb: Fix assertion failures in HTMLTokenizer
The *TagName states are all very similar, so it seems to be correct to
apply the fix from #8761 to all of those states.

This fixes #8788.
2021-07-16 11:55:55 +02:00
Max Wipfli
2404ad6897 LibWeb: Fix assertion failure when tokenizing JS regex literals
This fixes parsing the following regular expression: /</g;

It also adds a simple script element to the HTMLTokenizer regression
test, which also contains that specific regex.
2021-07-15 01:47:22 +02:00
Max Wipfli
bb2aed7d76 LibWeb: Correct behavior of Comment* states in HTMLTokenizer
Previously, this would lead to assertion failures when parsing HTML
comments. This fixes #8757.
2021-07-15 00:48:45 +02:00
Max Wipfli
af0b483123 LibWeb: VERIFY an empty builder when emitting tokens in HTMLTokenizer 2021-07-15 00:48:45 +02:00
Max Wipfli
125982943a LibWeb: Change HTMLTokenizer.{cpp,h} to east const style 2021-07-14 23:03:36 +02:00
Gunnar Beutner
300823c314 LibWeb: Use move() when enqueuing tokens in HTMLTokenizer
We're not using the current token anymore once it's enqueued so let's
use move() when enqueuing the tokens.
2021-07-14 23:03:36 +02:00
Gunnar Beutner
c3ad8e9a52 LibWeb: Remove StringBuilder from HTMLToken::m_comment_or_character 2021-07-14 23:03:36 +02:00
Gunnar Beutner
3aa202c432 LibWeb: Remove StringBuilder from HTMLToken::m_tag 2021-07-14 23:03:36 +02:00
Gunnar Beutner
901d71148b LibWeb: Remove StringBuilders from HTMLToken::AttributeBuilder 2021-07-14 23:03:36 +02:00
Gunnar Beutner
992964aa7d LibWeb: Remove StringBuilders from HTMLToken::m_doctype 2021-07-14 23:03:36 +02:00
Gunnar Beutner
d9e52997e2 LibWeb: Use an Optional<String> to track the last HTML start tag
Using an HTMLToken object here is unnecessary because the only
attribute we're interested in is the tag_name.
2021-07-14 23:03:36 +02:00