mirror of https://github.com/RGBCube/serenity synced 2025-05-31 05:38:11 +00:00
Commit graph

490 commits

Author SHA1 Message Date
Nico Weber
a1f17bd643 LibPDF: Skip inline image data in operator stream
Inline images can contain arbitrary binary data in the operator stream,
greatly confusing the operator parser.

Just skip them for now. They'll produce a
`Rendering of feature not supported: draw operation: inline_image_begin`
diag as usual, so we won't forget about it.

After #21536, reduces number of crashes on 300 random PDFs from the web
(the first 300 from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 23 (7%) to 22 (7%).

On a larger sample (`Meta/test_pdf.py -n 500 ~/Downloads/0000`),
reduces number of crashes from 53 (10.6%) with 36 distinct crash
stacks to 46 (9.2%) with 33 distinct stacks.
2023-10-23 07:51:08 +02:00
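One way the skipping in a1f17bd643 could work, as a rough sketch: after the `ID` keyword, scan forward for an `EI` token delimited by whitespace. The function name and the delimiter heuristic are assumptions for illustration and are not taken from LibPDF; binary image data can contain such a byte sequence by chance, so a real parser may need something stricter.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not LibPDF's API. `data_start` is the offset of the
// first byte of image data (just past "ID" and its whitespace byte).
// Returns the offset just past the closing "EI", or npos if none is found.
inline std::size_t skip_inline_image_data(std::string const& stream, std::size_t data_start)
{
    assert(data_start <= stream.size());
    auto is_space = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
    for (std::size_t i = data_start; i + 1 < stream.size(); ++i) {
        if (stream[i] != 'E' || stream[i + 1] != 'I')
            continue;
        // Heuristic: only accept "EI" if it stands alone as a token.
        bool space_before = i > data_start && is_space(stream[i - 1]);
        bool ends_token = i + 2 == stream.size() || is_space(stream[i + 2]);
        if (space_before && ends_token)
            return i + 2;
    }
    return std::string::npos;
}
```

The operator parser would resume at the returned offset and treat everything before it as opaque.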
Nico Weber
1a58fee0fd LibPDF: Don't assert on named simple color space
If a PDF uses `/CustomName cs` and `/CustomName` then points at just a
name like `/DeviceGray` instead of an array, that's ok. Just using
`/DeviceGray cs` is simpler, so this extra level of indirection is
somewhat rare in practice, but it's valid and it does happen. So support
it.

We already have a helper that does the right thing that we just need to
call.

Together with #21524 and #21525, reduces number of crashes on 300 random
PDFs from the web (the first 300 from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 29 (9%) to 25 (8%).
2023-10-21 21:04:26 +02:00
Nico Weber
04aec4a032 LibPDF: Don't log CFF Copyright tag as unknown 2023-10-21 21:04:02 +02:00
Nico Weber
8922574133 LibPDF: Fix assertion when destination page is an index
This isn't correct per spec, but it happens in practice, e.g.
0000847.pdf, 0000327.pdf, 0000124.pdf from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
2023-10-21 09:10:30 +02:00
Nico Weber
fbd00d9c8e LibPDF: Use resolve_to on /Dests entry
Fixes an assertion if /Dests is an indirect object (`24 0 R`)
instead of an inline dictionary.
2023-10-21 09:10:30 +02:00
Nico Weber
8c3478a921 LibPDF: Use resolve_to() helper
No behavior change.
2023-10-21 09:10:30 +02:00
Nico Weber
801cfd5ae3 LibPDF: Let parser process filters by default
This fixes a small bug from 39b2eed3f6: That commit tried to disable
filters for the very first object read, for the case covered in
Tests/LibPDF/password-is-sup.pdf.

However, it accidentally also disabled filters by default.

Most of the time, this isn't really a difference: We call
`set_filters_enabled(true);` very early in
`DocumentParser::initialize_linearization_dict()`, which explicitly
enables filters, and `initialize_linearization_dict()` is the very
first thing called in `DocumentParser::initialize()`.

But there's an early exit in `initialize_linearization_dict()`
when there's nothing looking like an indirect object right
after the header, and in this case we used to not enable
filtering, and would hand compressed streams to the operand parser.

(And due to a 2nd bug, we'd even do this if the header line was
followed by an empty line.)
2023-10-21 09:09:53 +02:00
Nico Weber
cf26fc2393 LibPDF: Make parser skip whitespace after header
0000990.pdf from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
starts like so:

```
%PDF-1.7

4 0 obj
```

parse_header() used to put the cursor at the start of the 2nd,
empty, line. initialize_linearization_dict() would then check
if `m_reader.matches_number()` to see if there could possibly
be a linearization dict.

In this case, there isn't one, but we should detect linearization
dicts even if they're separated by whitespace from the first line.
2023-10-21 09:09:53 +02:00
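A minimal sketch of the idea in cf26fc2393: skip whitespace after the header line before checking whether a number (and so possibly an indirect object) follows. Both function names are made up for this example, and `std::isspace` only approximates the PDF whitespace set.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: advances past any whitespace starting at `pos`.
inline std::size_t skip_whitespace(std::string const& data, std::size_t pos)
{
    assert(pos <= data.size());
    while (pos < data.size() && std::isspace(static_cast<unsigned char>(data[pos])))
        ++pos;
    return pos;
}

// `header_end` is the offset just past the header line's newline.
// Returns true if the next non-whitespace byte could start a number.
inline bool may_have_linearization_dict(std::string const& data, std::size_t header_end)
{
    std::size_t pos = skip_whitespace(data, header_end);
    return pos < data.size() && std::isdigit(static_cast<unsigned char>(data[pos]));
}
```

For the file quoted above, `header_end` is 9; without the skip, the check would look at the empty second line and give up.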
Nico Weber
34cb506bad LibPDF: Replace another TODO with a message
Like ca1a98ba9f, but for stroke color.
2023-10-21 09:09:06 +02:00
Nico Weber
9442782881 LibPDF: Implement text_next_line_show_string_set_spacing
Not used terribly often, but e.g. used in 000333.pdf page 17 in
stillhq.com-pdfdb.
2023-10-20 14:24:31 -04:00
Nico Weber
78dea9500f LibPDF: Make operator parsing use ReadonlySpan instead of Vector
No behavior change.
2023-10-20 14:24:31 -04:00
Nico Weber
e0268dcc87 LibPDF: Allow /Pattern to be used directly as a color space name
Per spec:

"If the color space is one that can be specified by a name and no
additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, and certain
cases of Pattern), the name may be specified directly."

We still don't implement /Pattern color spaces, but now we no longer
crash trying to look up the potentially-nonexistent /ColorSpace
dictionary on the page object when /Pattern is used directly as color
space name.

On top of #21514, reduces number of crashes on 300 random PDFs from the
web (the first 300 from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 42 (14%) to 34 (11%).
2023-10-20 10:35:54 -06:00
Nico Weber
aea0e2f313 LibPDF: Rename ColorSpaceFamily function to may_be_specified_directly()
It used to be called ColorSpaceFamily::never_needs_parameters().

But in the cpp file, the macro arg was called ever_needs_parameters,
and the spec says

"If the color space is one that can be specified by a name and no
additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, and certain
cases of Pattern), the name may be specified directly."

so let's use that language here.

No behavior change.
2023-10-20 10:35:54 -06:00
Nico Weber
095a2a17ed LibPDF: Replace TODO()s in Type0Font code with Errors
...which causes us to not render these fonts instead of crashing.

Reduces number of crashes on 300 random PDFs from the web (the first 300
from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 64 (21%) to 42 (14%).
2023-10-20 10:33:59 -06:00
Nico Weber
33443f7991 LibPDF: Implement ICCBasedColorSpace::number_of_components()
We now no longer crash on images that use an ICC-based color space.
Reduces number of crashes on 300 random PDFs from the web (the first 300
from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 81 (27%) to 64 (21%).

Also fixes all remaining crashes in
411_getting_started_with_instruments.pdf and
513_high_efficiency_image_file_format.pdf.
2023-10-20 08:58:52 +02:00
Nico Weber
f5d3f47af3 LibPDF: Add spec comment about color spaces on images 2023-10-20 08:58:52 +02:00
Nico Weber
7c24a89acf LibPDF: Add spec comment about valid bits_per_component values 2023-10-20 08:58:52 +02:00
Nico Weber
64bb9aa8c7 LibPDF: Fix comment typo 2023-10-20 08:58:52 +02:00
Nico Weber
ea6fed627a LibPDF: Get color rendering intent from image dict
Still not used for anything, so no behavior change.
2023-10-20 08:58:52 +02:00
Nico Weber
ebba24b848 LibPDF: Fix lookup of built-in Bold Italic strings
Liberation*-BoldItalic.ttf apparently self-identifies as "Bold Italic",
not "BoldItalic".
2023-10-19 16:52:49 -04:00
Nico Weber
708d5e2fe6 LibPDF: Implement color_rendering_intent operator
Implements the `ri` operator, and the `RI` key in a graphics state
dictionary.

We don't do anything yet with the color rendering intent except
store it.

No behavior change except removing a few "not yet implemented"
messages.
2023-10-19 16:51:16 -04:00
Nico Weber
609e640530 LibPDF: Try harder to use a RAII object to restore state
Follow-up to #21489. There, I made us use a RAII object.

That's great, but if the embedded instruction stream pushes
its own graphics state, then an early return would cause us to
not process graphics state pop instructions in the embedded stream.

To fix this, remember the graphics stack depth before entering
the nested instruction stream, and explicitly shrink the stack back
to that size upon exit.

Enables us to render all pages of
https://devstreaming-cdn.apple.com/videos/wwdc/2017/821kjtggolzxsv/821/821_get_started_with_display_p3.pdf
without crashing.
2023-10-19 16:49:00 -04:00
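The depth-restoring approach from 609e640530 could look roughly like this. `GraphicsState`, `StackDepthGuard` and the two functions are placeholders invented for the sketch, not the actual LibPDF classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Placeholder for whatever a real graphics state carries.
struct GraphicsState {
    int color_components;
};

// Hypothetical guard: remembers the stack depth on construction and
// shrinks the stack back to that depth on destruction.
class StackDepthGuard {
public:
    explicit StackDepthGuard(std::vector<GraphicsState>& stack)
        : m_stack(stack)
        , m_saved_depth(stack.size())
    {
    }

    ~StackDepthGuard()
    {
        // Drop any states the nested stream pushed but never popped.
        if (m_stack.size() > m_saved_depth)
            m_stack.resize(m_saved_depth);
    }

    StackDepthGuard(StackDepthGuard const&) = delete;
    StackDepthGuard& operator=(StackDepthGuard const&) = delete;

private:
    std::vector<GraphicsState>& m_stack;
    std::size_t m_saved_depth;
};

// Simulates a nested stream that saves state twice and then fails
// before reaching the matching restore operators.
inline bool render_nested_stream(std::vector<GraphicsState>& stack)
{
    assert(!stack.empty());
    StackDepthGuard guard(stack);
    stack.push_back(stack.back());
    stack.push_back(stack.back());
    return false; // early return; the guard's destructor still runs
}

// Returns the stack depth left behind after the failed nested render.
inline std::size_t depth_after_failed_render(std::size_t initial_depth)
{
    std::vector<GraphicsState> stack(initial_depth, GraphicsState { 3 });
    render_nested_stream(stack);
    return stack.size();
}
```

Because the guard records a depth and not a single saved state, it does not matter how many pushes the nested stream left unbalanced.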
Nico Weber
b835d2bd66 LibPDF: Use a RAII object to restore state in recursive render
Previously, if one operator returned an error, the TRY() would cause
us to return without restoring the outer graphics state, leading to
problems such as handing a 3-tuple to a grayscale color space
(because the inner object set up a grayscale color space that we
failed to dispose of).

Makes us crash later on page 43 of
https://devstreaming-cdn.apple.com/videos/wwdc/2017/821kjtggolzxsv/821/821_get_started_with_display_p3.pdf
2023-10-18 19:43:31 -04:00
Nico Weber
3c2d820391 LibPDF: If softmask has different size than target bitmap, resize it
Size of smask and image aren't guaranteed to be equal by the spec
(...except for /Matte, see page 555 of the PDF 1.7 spec, but we
don't implement that), and in practice they sometimes aren't.

Fixes an assert on page 4 of
https://devstreaming-cdn.apple.com/videos/wwdc/2017/821kjtggolzxsv/821/821_get_started_with_display_p3.pdf
We now make it all the way to page 43 of 64 before crashing.
2023-10-18 20:03:35 +01:00
Nico Weber
3907374621 LibPDF: Implement support for callgsubr in CFF font programs
Font programs are bytecode programs defining glyphs. If several glyphs
share a piece of outline, that opcode sequence can be put in a
subroutine ("subr") table and the definition of those glyphs can then
call that subroutine by number, to reduce file size.

CFF fonts can in theory contain multiple fonts, and so there's a global
subr table shared by all the fonts in one CFF, and a local per-font
subr table.  We used to only implement the local subr table, now we
implement both.

(We only support one font per CFF, and at least in PDF files, that's
all that's ever used. So a global subr table isn't very useful.
But the spec explicitly allows it -- "Global subroutines may be used in
a FontSet even if it only contains one font." -- and it happens in
practice.)
2023-10-18 10:50:32 -04:00
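A sketch of how a `callsubr`/`callgsubr` operand could be mapped to a table index. The bias thresholds are the ones given in the Type 2 charstring spec; `Subr`, `subr_bias()` and `find_subr()` are names made up for this example, and the same lookup would be used for either the local or the global table.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Subr = std::string; // stand-in for a subroutine's charstring bytes

// The bias depends only on the number of subroutines in the table.
inline int subr_bias(std::size_t subr_count)
{
    if (subr_count < 1240)
        return 107;
    if (subr_count < 33900)
        return 1131;
    return 32768;
}

// Hypothetical lookup: returns nullptr if the biased operand is out of range.
inline Subr const* find_subr(std::vector<Subr> const& table, int operand)
{
    assert(table.size() <= 65535); // INDEX counts are 16-bit
    long index = static_cast<long>(operand) + subr_bias(table.size());
    if (index < 0 || static_cast<std::size_t>(index) >= table.size())
        return nullptr;
    return &table[static_cast<std::size_t>(index)];
}
```

With a small table, operand -107 selects entry 0, which is why small negative operands show up in font programs.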
Nico Weber
185573c03f LibPDF: Implement subr_number biasing for CFF font programs 2023-10-18 10:50:32 -04:00
Nico Weber
4dc4de052a LibPDF: Implement opcode 28 for CFF font programs 2023-10-18 10:50:32 -04:00
Nico Weber
44efff81b9 LibPDF: Remove a dbgln() call in CFF subrs decoding
This code is a lot more reliable now than it used to be, and this
dbgln() is quite noisy for some files. So let's remove it.
2023-10-18 10:43:51 -04:00
Nico Weber
02d2d12592 LibPDF: Allow moving Reader::move_to() to end of data stream
CFF::parse_index_data() calls move_to() to put the reader's
current position behind the index data.

In several PDFs, the PrivDictOperator::Subrs case in CFF::create()
sets up a span that contains exactly the Subrs data and nothing
after it, so that final move_to() call in parse_index_data()
would cause an assert.

This is similar to fe3612ebcb, where the caller was also in CFF.
So maybe CFF just has a different view of what valid values to pass
to Reader are, compared to the rest of the code? But having an iterator
point to one past the valid data in a container is common, so maybe
this is the Right Fix after all.

Fixes a crash opening 411_getting_started_with_instruments.pdf
(and a whole bunch of other WWDC slides). Rendering is pretty glitchy
and we still crash on page 14, but at least we can open the file now.

The file is currently available at:
411cbc60y12x68arcof/411/411_getting_started_with_instruments.pdf
2023-10-18 06:32:23 -04:00
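The bounds rule from 02d2d12592, sketched with a toy reader: an offset equal to the data size is accepted (one past the end, like an end iterator), and anything larger is rejected. This `Reader` is a stand-in written for the example; it returns `bool` where the real class asserts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Toy reader, not LibPDF's Reader.
class Reader {
public:
    explicit Reader(std::string bytes)
        : m_bytes(std::move(bytes))
    {
    }

    bool done() const { return m_offset >= m_bytes.size(); }
    std::size_t offset() const { return m_offset; }

    char read()
    {
        assert(!done()); // reading at the end position is still an error
        return m_bytes[m_offset++];
    }

    // Accepts offset == size so callers can park the cursor behind the data.
    bool move_to(std::size_t offset)
    {
        if (offset > m_bytes.size())
            return false;
        m_offset = offset;
        return true;
    }

private:
    std::string m_bytes;
    std::size_t m_offset = 0;
};
```

After `move_to(size)`, `done()` is true, so a caller that checks `done()` before reading stays safe.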
Nico Weber
182639217f LibPDF: Implement GoTo action for outline
Outline items can contain either a /Dest key or an /A key.

The /Dest key points to a "Destination" (various ways to reference a
page in the same document).

The /A key points to an "Action" which can have several types.
One type, the /GoTo type, just also points to a Destination.

Implement GoTo actions. This makes clicking "Contents" in the outline of
https://developer.apple.com/library/archive/documentation/mac/pdf/Text.pdf
work. (Almost all other items in this file's outline use /Dest.
"Contents" could too, but it uses /A /GoTo for some reason.)

(Other action types are things like opening a hyperlink, opening a
different file, playing a sound, submitting a form, etc. Actions
are also used for in-page links, not just in outlines. Many of
these action types we'll likely never want to implement.)
2023-10-18 06:29:02 -04:00
Nico Weber
d9c9510d3c LibPDF: Rename x-macro argument name
I'd like to add a string called `A`, so the argument can't be called
`A` as well.

No behavior change.
2023-10-18 06:29:02 -04:00
Nico Weber
f646e47d46 LibPDF: Extract a create_destination_from_object() function
No big behavior change. The new function now produces an error
if a destination isn't in one of the supported formats.
2023-10-18 06:29:02 -04:00
Nico Weber
46fd6fdfa3 LibPDF: Read Global subr data in CFF reader
This was the last piece of data we didn't read yet.
(We also don't yet support multiple fonts per CFF, but I haven't
found a PDF using that yet.)

We still don't do anything with it, but now we at least print a
warning if this data is there and we ignore it.
2023-10-18 11:02:10 +02:00
Nico Weber
3be5719987 LibPDF: Rename subroutines to local_subroutines in CFF code 2023-10-18 11:02:10 +02:00
Nico Weber
9a0b559932 LibPDF: Tweak formatting of built-in CFF tables
This makes the code look more like the pages in the spec.

No behavior change, whitespace change only.
2023-10-18 11:00:17 +02:00
Nico Weber
f0e7fb7038 LibPDF: Make Subrs optional in PS1FontProgram
https://adobe-type-tools.github.io/font-tech-notes/pdfs/T1_SPEC.pdf :

"Using charstring subroutines is not a requirement of a Type 1
font program."

And some versions of Computer Modern do in fact not contain a Subrs
array.

Together with #21473, makes Problemset.pdf from the pdffiles repo
render ok instead of crashing.
2023-10-18 11:00:02 +02:00
Nico Weber
cb961101c7 LibPDF: Implement CFF built-in Standard and Expert encodings
With this, all tables from the spec appendixes are in CFF.cpp.

This fixes a crash reading page 2 (and onward) of
2ThestructureoftheCIE1997ColourAppearanceModelCIECAM97s.pdf in
the pdffiles repo.
2023-10-17 10:21:38 +02:00
Nico Weber
eeada4678c LibPDF: Postpone CFF encoding processing after Top DICT has been read
The encoding offset defaults to 0, i.e. the Standard Encoding.
That means reading the encoding only if the tag is present causes
us to not read it if a font uses the Standard Encoding.

Now, we always read an encoding, even if it's the (implicit) default
one.
2023-10-17 10:21:38 +02:00
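The defaulting described in eeada4678c, as a small sketch. Per the CFF spec, encoding offset 0 means Standard Encoding and 1 means Expert Encoding; any other value is the offset of a custom encoding table. `TopDict` and the function names here are illustrative only.

```cpp
#include <cassert>

enum class EncodingKind {
    Standard,
    Expert,
    Custom,
};

inline EncodingKind classify_encoding_offset(int offset)
{
    assert(offset >= 0);
    if (offset == 0)
        return EncodingKind::Standard;
    if (offset == 1)
        return EncodingKind::Expert;
    return EncodingKind::Custom; // offset of an encoding table in the CFF data
}

// Hypothetical, heavily reduced Top DICT.
struct TopDict {
    bool has_encoding_entry;
    int encoding_offset;
};

// Decides the encoding only after the whole Top DICT is known, so a missing
// entry falls back to offset 0 and is not silently skipped.
inline EncodingKind encoding_for(TopDict const& top_dict)
{
    int offset = top_dict.has_encoding_entry ? top_dict.encoding_offset : 0;
    return classify_encoding_offset(offset);
}
```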
Nico Weber
1cfe639b6c LibPDF: Implement CFF supplemental encoding
The main encoding data maps glyph ID ("GID") to its codepoint.
If a glyph has several codepoints, then a secondary table mapping
codepoint to string ID ("SID") of the glyph's name is present.

(A separate table associates each glyph with its name already.)

I haven't seen this used in the wild, but the structure of the
supplemental data is also going to be needed for built-in encodings.
2023-10-17 10:21:38 +02:00
Nico Weber
37daeae6fd LibPDF: Add spec comments, dbgln_if()s to CFF's parse_encoding() 2023-10-17 10:21:38 +02:00
Nico Weber
007d7cdd53 LibPDF: Fix sign (and fixed point) in glyph decoding opcode 24
Two bugs:

1. We decoded a u32, not an i32 as the spec wants
2. (minor) Our fixed-point divisor was off by one

Fixes text rendering in Bakke2010a.pdf in pdffiles, and rendering of
other fonts with negative width adjustments from opcode 255.
That PDF was produced by "Apple pstopdf" and uses font SFBX1200,
which is apparently a variant of Computer Modern. So maybe this
helps with lots of PDFs produced from TeX files, but I haven't
checked that.
2023-10-16 08:33:35 +02:00
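To illustrate the two bugs in 007d7cdd53, here is a sketch that assumes the operand is a big-endian 32-bit value in 16.16 fixed-point format, so the divisor is 65536. The commit message does not spell out the format or the old divisor, so both are assumptions, as is the function name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoder: reads four big-endian bytes at `offset` and
// interprets them as a signed 16.16 fixed-point number.
inline double decode_fixed_16_16(std::vector<std::uint8_t> const& bytes, std::size_t offset)
{
    assert(offset + 4 <= bytes.size());
    std::uint32_t raw = (std::uint32_t(bytes[offset]) << 24)
        | (std::uint32_t(bytes[offset + 1]) << 16)
        | (std::uint32_t(bytes[offset + 2]) << 8)
        | std::uint32_t(bytes[offset + 3]);
    // Reinterpret as signed; keeping this as u32 turns small negative
    // values into huge positive ones.
    auto value = static_cast<std::int32_t>(raw);
    // 16 fractional bits means dividing by 2^16 = 65536, not 65535.
    return value / 65536.0;
}
```

With these assumptions, the bytes `FF FF 00 00` decode to -1.0; read as unsigned they would decode to 65535.0.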
Nico Weber
96a4936567 LibPDF: Check for built-in CFF encodings
Only prints a warning for them for now.

Also warn on the not-yet-implemented encoding supplement.
2023-10-16 08:32:18 +02:00
Nico Weber
414a164850 LibPDF: Be louder about unimplemented CFF dict entries 2023-10-16 08:32:18 +02:00
Nico Weber
c825194fb9 LibPDF: Reject CFFs with more than one font
The code assumes that there's just one Top DICT, so let's be loud
when that isn't the case.
2023-10-16 08:32:18 +02:00
Nico Weber
6f783929dd LibPDF: Implement support for CFF charset format 2
I haven't seen this being used in the wild (yet), but it's easy
to implement, and with this we support all charset formats.

So we can now mention if we see a format we don't know about.
2023-10-15 15:27:15 +02:00
Nico Weber
5b915fb15c LibPDF: Add more spec comments to parse_charset() 2023-10-15 15:27:15 +02:00
Nico Weber
49275c4b17 LibPDF: Don't overflow SIDs in type 1 charset parsing
first_sid has type SID (aka u16), so don't store it in a u8.

This fixes (among other things) page 24 of the PDF 1.7 spec.
2023-10-15 15:27:15 +02:00
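A sketch of range-based charset parsing in the spirit of 49275c4b17. It follows CFF charset format 1, where each range is a big-endian 16-bit first SID followed by an 8-bit count of additional SIDs; the function name and the flat byte-vector input are assumptions for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using SID = std::uint16_t;

// Hypothetical parser for the range list of a format 1 charset.
// `glyph_count` includes .notdef, which the charset itself does not list.
inline std::vector<SID> parse_charset_format1(std::vector<std::uint8_t> const& bytes, std::size_t glyph_count)
{
    assert(glyph_count >= 1);
    std::vector<SID> sids;
    std::size_t offset = 0;
    while (sids.size() + 1 < glyph_count && offset + 3 <= bytes.size()) {
        // Keep the full 16 bits: storing this in a u8 would turn SID 300 into 44.
        SID first_sid = static_cast<SID>((bytes[offset] << 8) | bytes[offset + 1]);
        std::uint8_t left = bytes[offset + 2];
        offset += 3;
        for (unsigned i = 0; i <= left && sids.size() + 1 < glyph_count; ++i)
            sids.push_back(static_cast<SID>(first_sid + i));
    }
    return sids;
}
```

Only `left` is a single byte in this format, which makes it easy to give `first_sid` the same type by mistake.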
Nico Weber
23d6e9f577 LibPDF: Implement CFF built-in charsets ISOAdobe, Expert, Expert Subset 2023-10-15 09:33:34 +02:00
Nico Weber
8060957d8d LibPDF: Use Appendix A instead of Appendix C for standard names
From "10 String INDEX":

"Further space saving is obtained by allocating commonly occurring
strings to predefined SIDs. These strings, known as the standard
strings, describe all the names used in the ISOAdobe and Expert
character sets along with a few other strings common to Type 1 fonts. A
complete list of standard strings is given in Appendix A.  The client
program will contain an array of standard strings with nStdStrings
elements. Thus, the standard strings take SIDs in the range 0 to
(nStdStrings-1)."

And "13 Charsets" says that charsets store SIDs.

Fixes all

    "Couldn't find string for SID $n, going with space"

messages when going through the encoding pages (page 1010 and
thereabouts) in the PDF 1.7 spec.
2023-10-15 09:33:34 +02:00
Nico Weber
aba787a441 LibPDF: Implement reading of CFF String Index
Only really useful for reading SIDs in the Top DICT (copyright
text etc), which we currently don't do.

I haven't seen a difference from looking things up in the string
table. The only real effect from the commit that I need is that
it pulls a local resolve() lambda into a real function
resolve_sid(), which I want to call in a future commit.

But it makes things more spec-compliant, and if we ever want to
read SIDs in metadata in the future, now we can.
2023-10-15 09:33:34 +02:00