A certain PDF that was drawing some text used `9 0 0 9 474.54 700.6801 Tm`
to set the text matrix to one that scaled by 9 in one text object.
Then, after ending that text object, it had the following new text
object which contained nothing that invalidated the text matrix:
```
BT
/F1 7 Tf
/DeviceRGB CS
0 0 0 SC
10 TL
86.37849 21.908 Td
(Authorized licensed use limited to: ...) Tj
ET
```
`BT` did reset it as required, but since we didn't mark the matrix
as dirty, we never recomputed it and drew the additional text scaled
up 9x.
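A minimal sketch of the dirty-flag pattern described above. The struct and its member names are hypothetical and much simpler than the real renderer state; it only shows why `BT` has to invalidate the cached value as well as reset the matrix:

```cpp
#include <array>

// Hypothetical, simplified text state; not LibPDF's actual types.
struct TextStateSketch {
    // PDF matrix [a b c d e f]; identity by default.
    std::array<float, 6> text_matrix { 1, 0, 0, 1, 0, 0 };
    // Value derived from text_matrix and cached (here: just the x scale).
    float cached_x_scale { 1 };
    bool matrix_is_dirty { true };

    void set_text_matrix(std::array<float, 6> matrix) // `Tm`
    {
        text_matrix = matrix;
        matrix_is_dirty = true;
    }

    void begin_text() // `BT`
    {
        text_matrix = { 1, 0, 0, 1, 0, 0 };
        // Without this line, the value cached for the previous
        // text object would be reused.
        matrix_is_dirty = true;
    }

    float x_scale()
    {
        if (matrix_is_dirty) {
            cached_x_scale = text_matrix[0];
            matrix_is_dirty = false;
        }
        return cached_x_scale;
    }
};
```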
An image mask is a 1-bit-per-pixel bitmap that's black where the
current color should be painted, and white where it should be
transparent (think: like ink).
load_image() already converts images like this into 8-bit-per-pixel
images that have 0xff, 0xff, 0xff in rgb for opaque (originally 0 bit)
pixels and 0, 0, 0 in rgb for transparent pixels.
So we just copy the image mask's image data into the alpha
channel and replace rgb with the current color, and then draw
it like a regular bitmap.
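Roughly, the conversion looks like the sketch below. `Rgba` and `apply_image_mask` are made-up names for illustration, and the input is assumed to already be in the form load_image() produces for image masks:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical pixel type for illustration.
struct Rgba {
    std::uint8_t r, g, b, a;
};

// `pixels` is assumed to hold r == g == b == 0xff for pixels that
// should be painted and r == g == b == 0 for transparent pixels.
void apply_image_mask(std::vector<Rgba>& pixels, Rgba current_color)
{
    for (auto& pixel : pixels) {
        // All three channels are equal, so any of them is the coverage.
        std::uint8_t coverage = pixel.r;
        pixel = { current_color.r, current_color.g, current_color.b, coverage };
    }
}
```

After this, the bitmap can go through the same drawing path as any other image with an alpha channel.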
0000440.pdf contains an xref stream object (at offset 3643676) starting:
```
294 0 obj <<
/Type /XRef
/Index [0 295]
/Size 295
```
and an object stream object (at offset 3640121) starting:
```
230 0 obj <<
/Type /ObjStm
/N 73
/First 614
```
In both cases, the `obj` and the `<<` are separated by non-newline
whitespace.
633e1632d0 made parse_indirect_value() tolerate this, but it updated
neither parse_xref_stream() (which parses xref streams) nor
parse_compressed_object_with_index() (which parses object streams),
despite all three changes being part of #14873.
Make parse_xref_stream() and parse_compressed_object_with_index()
call parse_indirect_value() to pick up the fix over there. It's a bit
less code too.
(0000440.pdf is the only PDF in my 1000 test PDFs that this helps,
somewhat surprisingly.)
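For illustration, a standalone sketch of a tolerant `N G obj` header parser. This is not the code of parse_indirect_value(); `ObjectHeader` and `parse_object_header` are hypothetical names, and the only point is that any PDF whitespace is accepted between `obj` and the value:

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// PDF whitespace: NUL, tab, LF, form feed, CR, space.
bool is_pdf_whitespace(char c)
{
    return c == '\0' || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

struct ObjectHeader {
    unsigned index;
    unsigned generation;
    std::size_t body_offset; // where the object's value starts
};

std::optional<ObjectHeader> parse_object_header(std::string_view data)
{
    std::size_t pos = 0;
    auto skip_whitespace = [&] {
        while (pos < data.size() && is_pdf_whitespace(data[pos]))
            ++pos;
    };
    auto parse_unsigned = [&]() -> std::optional<unsigned> {
        std::size_t start = pos;
        unsigned value = 0;
        while (pos < data.size() && data[pos] >= '0' && data[pos] <= '9')
            value = value * 10 + static_cast<unsigned>(data[pos++] - '0');
        if (pos == start)
            return std::nullopt;
        return value;
    };

    skip_whitespace();
    auto index = parse_unsigned();
    skip_whitespace();
    auto generation = parse_unsigned();
    skip_whitespace();
    if (!index || !generation || data.substr(pos, 3) != "obj")
        return std::nullopt;
    pos += 3;
    // Accept spaces and tabs, not just a newline, before the value.
    skip_whitespace();
    return ObjectHeader { *index, *generation, pos };
}
```

Having one such helper that all three call sites share is what keeps them from drifting apart again.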
At first I tried implementing the quirk from PDF 1.7 Appendix H,
3.4.4, "File Trailer": """Acrobat viewers require only that the %%EOF
marker appear somewhere within the last 1024 bytes of the file."""
This would've been like #22548 but at end-of-file instead of at
start-of-file.
This helped a bunch of files, but also broke a bunch of files that
had more than 1024 bytes of stuff at the end, and it wouldn't have
helped 0000059.pdf, which has over 40k of \0 bytes after the %%EOF.
So just tolerate whitespace after the %%EOF line, and keep ignoring
an arbitrary amount of other stuff after that like before.
This helps:
* 0000599.pdf
One trailing \0 byte after %%EOF. Due to that byte, the
is_linearized() check fails and we go down the non-linearized
codepath. But with this fix, that code path succeeds.
* 0000937.pdf
Same.
* 0000055.pdf
Has one space followed by a \n after %%EOF
* 0000059.pdf
Has over 40kB of trailing \0 bytes
The following files keep working with it:
* 0000242.pdf
5586 bytes of trailing HTML
* 0000336.pdf
5586 bytes of trailing HTML fragment
* 0000136.pdf
2054 bytes of trailing space characters
This one kind of only worked by accident before since it found
the %%EOF block before the final %%EOF block. Maybe this is
even an intentional XRefStm compat hack? Anyways, now it
finds the final block instead.
* 0000327.pdf
11044 bytes of trailing HTML
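A sketch of just the whitespace-tolerant part, with hypothetical helper names. It treats \0 as whitespace, as PDF does, and says nothing about the separate fallback that ignores other trailing data:

```cpp
#include <string_view>

// PDF whitespace: NUL, tab, LF, form feed, CR, space.
bool is_pdf_whitespace_byte(char c)
{
    return c == '\0' || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

std::string_view strip_trailing_pdf_whitespace(std::string_view data)
{
    while (!data.empty() && is_pdf_whitespace_byte(data.back()))
        data.remove_suffix(1);
    return data;
}

// True if the data ends in %%EOF once trailing whitespace is ignored.
bool ends_with_eof_marker(std::string_view data)
{
    constexpr std::string_view marker = "%%EOF";
    auto stripped = strip_trailing_pdf_whitespace(data);
    return stripped.size() >= marker.size()
        && stripped.substr(stripped.size() - marker.size()) == marker;
}
```

Files with trailing HTML still fail this check, which is why the existing "ignore other stuff" path has to stay.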
It's pretty tricky to do, and also tricky with respect to skipping
trailing bytes after %%EOF: The check requires knowing the full size of
the PDF (which means web servers not sending content lengths are out),
but that size has to be after stripping trailing bytes, which normal
static file servers won't do. So PDF viewers would have to download the
last couple bytes of the PDF unconditionally, then strip trailing bytes
and use the count to figure out the final actual PDF size.
Luckily, we don't incrementally download PDFs from the net but
instead require all data to be available in one chunk, so it's
not currently a problem.
The spec isn't super clear on if this is allowed:
"""Each cross-reference section shall begin with a line containing the
keyword xref. Following this line..."""
"""The two preceding lines shall contain, one per line and in order, the
keyword startxref and..."""
It kind of sounds like anything goes on both lines as long as they
contain `xref` and `startxref`.
In practice, both seem to always occur at the start of their line,
but in 0000780.pdf (and nowhere else), there's one space after each
keyword before the following linebreak, and this makes that file load.
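One way to express the relaxed check, as a sketch. `line_is_keyword` is a hypothetical helper that takes the line without its line break and allows only spaces or tabs after the keyword:

```cpp
#include <string_view>

// Returns true if `line` consists of `keyword` followed only by
// spaces or tabs. The line break itself is assumed to be stripped.
bool line_is_keyword(std::string_view line, std::string_view keyword)
{
    if (line.substr(0, keyword.size()) != keyword)
        return false;
    line.remove_prefix(keyword.size());
    for (char c : line) {
        if (c != ' ' && c != '\t')
            return false;
    }
    return true;
}
```

This still requires the keyword at the start of its line, which matches what files do in practice.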
Per "TABLE 5.11 Entries in an encoding dictionary", /Differences is
optional.
(Per "Encodings for TrueType Fonts" in 5.5.5 Character Encoding,
nonsymbolic truetype fonts are even recommended to have "no Differences
array." But in practice, most seem to have it.)
Fixes crashes on:
* 0000001.pdf
* 0000574.pdf
* 0000337.pdf
All three don't render super great, but at least they no longer crash.
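The shape of the fix, sketched with made-up types (the real encoding code is not shown here): treat /Differences as optional and only apply it on top of the base encoding when it is present.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for an encoding dictionary.
struct DifferencesEntry {
    unsigned char code;
    std::string glyph_name;
};

struct EncodingDictSketch {
    // Absent when the dictionary has no /Differences key.
    std::optional<std::vector<DifferencesEntry>> differences;
};

std::map<unsigned char, std::string> build_encoding(
    std::map<unsigned char, std::string> base, EncodingDictSketch const& dict)
{
    // Only touch the base encoding if /Differences exists.
    if (dict.differences) {
        for (auto const& entry : *dict.differences)
            base[entry.code] = entry.glyph_name;
    }
    return base;
}
```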
Other readers do this too, and files depend on this.
Fixes opening these four files from the PDFA 0000.zip dataset:
* 0000015.pdf
Starts with `C:\web\webeuncet\_cat\_docs\_publics\` before header
* 0000408.pdf
Starts with UTF-8 BOM
* 0000524.pdf
Starts with 867 bytes of HTML containing a PHP backtrace
* 0000680.pdf
Starts with `C:\web\webeuncet\_cat\_docs\_publics\` too
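A sketch of locating the header behind leading junk. The 1024-byte search window is an assumption taken from the Acrobat note in PDF 1.7 Appendix H about the header appearing within the first 1024 bytes; `find_pdf_header` is a hypothetical name:

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Returns the offset of "%PDF-" if it occurs within the first
// `search_window` bytes. The window size is an assumption.
std::optional<std::size_t> find_pdf_header(std::string_view data, std::size_t search_window = 1024)
{
    auto offset = data.substr(0, search_window).find("%PDF-");
    if (offset == std::string_view::npos)
        return std::nullopt;
    return offset;
}
```

All later parsing would then need to account for that offset or work on the data starting at the header.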
A local (non-public) PDF I have lying around contains this in
a page's operator stream:
```
[<00b4003e> 3 <002600480051> 3 <005700550044004f0003> -29
<00330044> 3 <0055> -3 <004e0040> 4 <0003> -29 <004c00560003> -31
<0057004b> 4 <00480003> -37 <0050
>] TJ
```
That is, there's a newline in a hexstring after a character.
This led to `Parser error at offset 5184: Unexpected character`.
The spec says in 3.2.3 String Objects, Hexadecimal Strings:
"""Each pair of hexadecimal digits defines one byte of the string.
White-space characters (such as space, tab, carriage return, line feed,
and form feed) are ignored."""
But we didn't ignore whitespace before or after a character, only
in between the bytes.
The spec also says:
"""If the final digit of a hexadecimal string is missing—that is, if
there is an odd number of digits—the final digit is assumed to be 0."""
In that case, we were skipping the closing `>` twice -- or, more
accurately, we ignored the character after it too. This has been
wrong ever since #6974.
Add a test that fails if either of the two changes isn't present.
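Both rules from the spec can be sketched in a standalone function. This is not the parser's actual code; `HexStringSketch` and `parse_hex_string` are hypothetical, and `end` records where parsing continues so that `>` is consumed exactly once:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

struct HexStringSketch {
    std::vector<std::uint8_t> bytes;
    std::size_t end { 0 }; // index just past the closing `>`
};

int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

std::optional<HexStringSketch> parse_hex_string(std::string_view input)
{
    if (input.empty() || input[0] != '<')
        return std::nullopt;
    HexStringSketch result;
    int pending = -1; // high nibble waiting for its partner, or -1
    for (std::size_t pos = 1; pos < input.size(); ++pos) {
        char c = input[pos];
        if (c == '>') {
            // Odd number of digits: the missing final digit is 0.
            if (pending >= 0)
                result.bytes.push_back(static_cast<std::uint8_t>(pending << 4));
            result.end = pos + 1;
            return result;
        }
        // Whitespace is ignored anywhere, not just between bytes.
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f' || c == '\0')
            continue;
        int value = hex_digit_value(c);
        if (value < 0)
            return std::nullopt;
        if (pending < 0) {
            pending = value;
        } else {
            result.bytes.push_back(static_cast<std::uint8_t>((pending << 4) | value));
            pending = -1;
        }
    }
    return std::nullopt; // no closing `>`
}
```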
In a bunch of cases, this actually ends up simplifying the code as
to_number will handle something such as:
```
Optional<I> opt;
if constexpr (IsSigned<I>)
    opt = view.to_int<I>();
else
    opt = view.to_uint<I>();
```
for us.
The main goal here however is to have a single generic number conversion
API between all of the String classes.
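To illustrate the idea outside of AK, here is a sketch built on the standard library's `std::from_chars`, which already dispatches on signedness through the type. `to_number_sketch` is a made-up name and is not the AK API:

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// One entry point for signed and unsigned integer types.
// Fails on overflow, on a sign for unsigned types, and on trailing
// characters that are not part of the number.
template<typename I>
std::optional<I> to_number_sketch(std::string_view view)
{
    I value {};
    auto const* begin = view.data();
    auto const* end = begin + view.size();
    auto [ptr, ec] = std::from_chars(begin, end, value);
    if (ec != std::errc {} || ptr != end)
        return std::nullopt;
    return value;
}
```

Callers that used to branch with `if constexpr` can then write a single call templated on `I`.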
The idea is to massage the inline image data into something that
looks like a regular image, and then use the normal image drawing code:
We translate the inline image abbreviations to the expanded version at
rendering time, then unfilter (i.e. uncompress) the image data at
rendering time, and then go down the usual image drawing path.
Normal streams are unfiltered when they're first accessed, but
inline image streams live in a page's drawing operators, and this
fits the current approach of parsing a page's operators anew
every time the page is rendered.
(We also need to add some special-case handling for color spaces
of inline images: Inline images can use named color spaces, while
regular images always use direct color space objects.)
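The abbreviation expansion is a plain table lookup. A sketch with hypothetical function names, using the key and filter abbreviations from the PDF spec's inline image tables:

```cpp
#include <string>
#include <string_view>
#include <unordered_map>

// Expands an inline image dictionary key; unknown keys pass through.
std::string expand_inline_image_key(std::string_view key)
{
    static const std::unordered_map<std::string_view, std::string_view> abbreviations {
        { "BPC", "BitsPerComponent" }, { "CS", "ColorSpace" }, { "D", "Decode" },
        { "DP", "DecodeParms" }, { "F", "Filter" }, { "H", "Height" },
        { "IM", "ImageMask" }, { "I", "Interpolate" }, { "W", "Width" },
    };
    auto it = abbreviations.find(key);
    return std::string(it == abbreviations.end() ? key : it->second);
}

// Expands an inline image filter name; unknown names pass through.
std::string expand_inline_image_filter(std::string_view name)
{
    static const std::unordered_map<std::string_view, std::string_view> abbreviations {
        { "AHx", "ASCIIHexDecode" }, { "A85", "ASCII85Decode" }, { "LZW", "LZWDecode" },
        { "Fl", "FlateDecode" }, { "RL", "RunLengthDecode" }, { "CCF", "CCITTFaxDecode" },
        { "DCT", "DCTDecode" },
    };
    auto it = abbreviations.find(name);
    return std::string(it == abbreviations.end() ? name : it->second);
}
```

Passing unknown names through unchanged means a dictionary that already uses the long names needs no special case.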
We create an inline_image_end operator that has all the relevant data
in a synthetic StreamObject.
inline_image_end is still a RENDERER_TODO(), so no real behavior
change. (Previously we'd call only inline_image_begin, so the string the
todo message is about is now a bit different. But no interesting
behavior change.)
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
    $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
        Meta Ports Ladybird Tests Kernel)
    $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
        s/deprecated_string/byte_string/g' $xs
    $ clang-format --style=file -i \
        $(git diff --name-only | grep \.cpp\|\.h)
    $ gn format $(git ls-files '*.gn' '*.gni')
In practice, basically no file has it, since it was only added in 2.0,
and 1.7 explicitly said "in particular, the Type, Subtype, and Length
entries normally found in a stream or image dictionary are unnecessary."
Fixes a crash on page 3 of 0000450.pdf of 0000.zip, where we previously
started interpreting the middle of an inline image content stream as
operators, since it contained `EI` in its pixel data.
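The text above doesn't spell out the mechanism, so the following is only a sketch of one way to avoid a false `EI` match, assuming the number of data bytes is known (for example because the image is unfiltered and its size follows from width, height, and bit depth). Both function names are hypothetical:

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Byte size of unfiltered image data; each row is padded to a whole byte.
std::size_t unfiltered_inline_image_size(std::size_t width, std::size_t height,
    std::size_t bits_per_component, std::size_t components)
{
    std::size_t bytes_per_row = (width * bits_per_component * components + 7) / 8;
    return bytes_per_row * height;
}

// Searches for "EI" only after skipping `min_data_size` bytes of image
// data starting at `data_start`, so "EI" inside the pixels is ignored.
std::optional<std::size_t> find_inline_image_end(std::string_view stream,
    std::size_t data_start, std::size_t min_data_size)
{
    if (data_start > stream.size() || min_data_size > stream.size() - data_start)
        return std::nullopt;
    auto pos = stream.find("EI", data_start + min_data_size);
    if (pos == std::string_view::npos)
        return std::nullopt;
    return pos;
}
```

For filtered data the size is not known up front, so this trick alone would not be enough there.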
Fixes these errors from `Meta/test_pdf.py path/to/0000`, with
0000 being the unzipped 0000.zip from the PDF/A corpus:
Malformed PDF file: Indexed color space lookup table doesn't
match size, in 4 files, on 8 pages, 73 times
path/to/0000/0000206.pdf 2 4 (2x) 5 (3x) 6 (4x)
path/to/0000/0000364.pdf 5 6
path/to/0000/0000918.pdf 5
path/to/0000/0000683.pdf 8
When upsampling e.g. the 4-bit value 0b1101 to 8-bit, we used to repeat
the value to fill the full 8-bits, e.g. 0b11011101. This maps RGB colors
to 8-bit nicely, but is the wrong thing to do for palette indices.
Stop doing this for palette indices.
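The difference in a small sketch, with hypothetical function names. Bit replication spreads sample values over the full 0..255 range, whereas a palette index has to keep its numeric value:

```cpp
#include <cstdint>

// Upsamples a color sample by repeating its bits until 8 bits are
// filled. `bits` is assumed to be 1, 2, 4, or 8.
std::uint8_t upsample_sample_to_8bit(std::uint8_t value, unsigned bits)
{
    std::uint8_t result = 0;
    for (unsigned filled = 0; filled < 8; filled += bits)
        result = static_cast<std::uint8_t>((result << bits) | value);
    return result;
}

// A palette index is an array index, so widening must not change it.
std::uint8_t widen_palette_index(std::uint8_t value)
{
    return value;
}
```

With replication, index 13 in a 16-entry palette would turn into 221, far outside the lookup table.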
Fixes "Indexed color space index out of range" for 11 files in the
PDF/A 0000.zip test set now that we correctly handle palette indices
as of the previous commit.