LibWeb: Use UTF-16 code unit offsets and lengths in CharacterData

We previously assumed that the input offsets and lengths were raw byte offsets into a UTF-8 string. While our String representation may be UTF-8 internally, the external world sees it as UTF-16: offsets are passed in as code unit offsets, and the returned length is in code units.

Before this change, the test included in this commit would crash Ladybird (and otherwise return wrong values).

The implementation here is very inefficient. I am sure there is a much smarter way to write it that would not need a conversion from a UTF-8 string to a UTF-16 string (and then back again).

Fixes: #20971
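
For illustration only (this is not code from this commit): the UTF-16 length of a UTF-8 string can in principle be computed without building a UTF-16 copy, by counting lead bytes. The sketch below is plain standard C++ rather than AK/LibWeb code, the helper name `utf16_length_of_utf8` is hypothetical, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not the LibWeb implementation: counts the UTF-16 code
// units needed to represent a UTF-8 string. Assumes the input is valid UTF-8.
std::size_t utf16_length_of_utf8(std::string_view utf8)
{
    std::size_t code_units = 0;
    for (unsigned char byte : utf8) {
        // Continuation bytes (0b10xxxxxx) belong to a code point that was
        // already counted when its lead byte was seen.
        if ((byte & 0xC0) == 0x80)
            continue;
        // A 4-byte sequence (lead byte 0b11110xxx) encodes a code point above
        // U+FFFF, which UTF-16 represents as a surrogate pair (2 code units).
        code_units += ((byte & 0xF8) == 0xF0) ? 2 : 1;
    }
    return code_units;
}
```

Under these assumptions, a code point such as U+1F643 occupies 4 bytes in UTF-8 but counts as 2 UTF-16 code units, which is the mismatch that the old byte-based length exposed.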
2025-07-24 22:07:34 +00:00 · 2023-12-22 20:41:34 +13:00 · 2023-12-22 20:41:34 +13:00 · d8759d9656
commit d8759d9656
parent d51f84501a
6 changed files with 54 additions and 24 deletions
--- a/Userland/Libraries/LibWeb/DOM/Node.cpp
+++ b/Userland/Libraries/LibWeb/DOM/Node.cpp
@@ -1492,11 +1492,8 @@ size_t Node::length() const
        return 0;

    // 2. If node is a CharacterData node, then return node’s data’s length.
-    if (is_character_data()) {
-        auto* character_data_node = verify_cast<CharacterData>(this);
-        // FIXME: This should be in UTF-16 code units, not byte size.
-        return character_data_node->data().bytes().size();
-    }
+    if (is_character_data())
+        return verify_cast<CharacterData>(*this).length_in_utf16_code_units();

    // 3. Return the number of node’s children.
    return child_count();