
LibPDF: Stop converting encodings in object parser

Per section 3.8.1 of the PDF 1.7 spec, there are multiple logical string types:
* text strings
* ASCII strings
* byte strings

Text strings can be in UTF-16BE, PDFDocEncoding, or (since PDF 2.0)
UTF-8.

Byte strings, however, shouldn't be converted; they should be treated
as binary data.
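
As a rough illustration of the distinction above, the sketch below classifies the raw bytes of a text string by its leading marker. It uses only the standard library, and the names `TextStringEncoding` and `detect_text_string_encoding` are hypothetical; they are not LibPDF API. The idea is that a caller would run this only on strings it knows to be text strings, and never on byte strings.

```cpp
#include <string_view>

// Hypothetical names for illustration only; not part of LibPDF.
enum class TextStringEncoding {
    UTF16BE,
    UTF8, // Only valid since PDF 2.0.
    PDFDocEncoding,
};

// Classifies the raw bytes of a PDF *text string* by its leading marker.
// Byte strings are opaque binary data and must not be passed through this.
inline TextStringEncoding detect_text_string_encoding(std::string_view bytes)
{
    constexpr std::string_view utf16be_marker("\xFE\xFF", 2);
    constexpr std::string_view utf8_marker("\xEF\xBB\xBF", 3);

    // substr() clamps to the available length, so short inputs are safe.
    if (bytes.substr(0, utf16be_marker.size()) == utf16be_marker)
        return TextStringEncoding::UTF16BE;
    if (bytes.substr(0, utf8_marker.size()) == utf8_marker)
        return TextStringEncoding::UTF8;

    // No marker: the text string is in PDFDocEncoding.
    return TextStringEncoding::PDFDocEncoding;
}
```

Keeping detection separate from the parser means the decision to decode is made where the string's logical type is known, for example when reading an outline title.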

This makes us no longer convert strings used for drawing page text.
Table 5.6 "Text-showing operators" lists the operands for text-showing
operators as just "string", not "text string" (even though the body text
confusingly calls these strings "text strings"), so skipping the
conversion there is correct (and matches other viewers).

We also no longer incorrectly convert strings used for crypto data
(such as passwords) if they start with a UTF-16BE or UTF-8 marker.
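
To see why this matters for binary data, here is a small standalone sketch that mimics only the UTF-8 branch of the conversion the parser used to apply. The function name `strip_utf8_marker_like_old_parser` is made up for this example, and it uses `std::string` rather than LibPDF types. A byte string that happens to begin with EF BB BF loses its first three bytes, which would corrupt something like a key or password.

```cpp
#include <string>

// Hypothetical helper mimicking the UTF-8 branch of the removed conversion.
// It is here only to show the effect on binary data; it is not LibPDF code.
inline std::string strip_utf8_marker_like_old_parser(std::string bytes)
{
    // compare() clamps the range to the string's length, so short inputs
    // simply fail to match and are returned unchanged.
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return bytes.substr(3);
    return bytes;
}
```

With a 6-byte input of EF BB BF followed by `key`, the result is only the 3 bytes `key`, so any consumer expecting the original 6 bytes would see different data.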

No behavior change for outlines and info dict entries.

https://pdfa.org/understanding-utf-8-in-pdf-2-0/ has a good overview of
this.

(ASCII strings only contain ASCII characters and behave the same
anyways.)
Nico Weber 2023-11-20 21:04:31 -05:00 committed by Andrew Kaster
parent 8ee0c75f43
commit e39a790c82
3 changed files with 38 additions and 20 deletions


@@ -9,7 +9,6 @@
 #include <LibPDF/Document.h>
 #include <LibPDF/Filter.h>
 #include <LibPDF/Parser.h>
-#include <LibTextCodec/Decoder.h>
 #include <ctype.h>
 namespace PDF {
@@ -262,17 +261,6 @@ PDFErrorOr<NonnullRefPtr<StringObject>> Parser::parse_string()
     if (m_document->security_handler() && m_enable_encryption)
         m_document->security_handler()->decrypt(string_object, m_current_reference_stack.last());
-    auto unencrypted_string = string_object->string();
-    if (unencrypted_string.bytes().starts_with(Array<u8, 2> { 0xfe, 0xff })) {
-        // The string is encoded in UTF16-BE
-        string_object->set_string(TextCodec::decoder_for("utf-16be"sv)->to_utf8(unencrypted_string).release_value_but_fixme_should_propagate_errors().to_deprecated_string());
-    } else if (unencrypted_string.bytes().starts_with(Array<u8, 3> { 239, 187, 191 })) {
-        // The string is encoded in UTF-8. This is the default anyways, but if these bytes
-        // are explicitly included, we have to trim them
-        string_object->set_string(unencrypted_string.substring(3));
-    }
     return string_object;
 }