mirror of https://github.com/RGBCube/serenity synced 2025-05-16 19:35:08 +00:00

622 commits

Author SHA1 Message Date
Nico Weber
d24289eef4 LibPDF: Always log unhandled type 1 and type 2 font program opcodes
This would've made it easy to see that we were missing flex opcodes for
https://developer.apple.com/library/archive/documentation/mac/pdf/Text.pdf
2023-11-01 11:40:16 -04:00
Nico Weber
e1a743f286 LibPDF: Implement type 2 flex, hflex, hflex1, flex1 operators
This is the type 2 equivalent to type 1 othersubr, from what I can tell.

See "4.1 Path Construction Operators" in 5177.Type2.pdf,
"The Type 2 Charstring Format".

Makes text show up alright on
https://developer.apple.com/library/archive/documentation/mac/pdf/Text.pdf
2023-11-01 11:40:16 -04:00
Nico Weber
3e707efdfa LibPDF: Move type1 subr 0 handling into othersubr handler
https://adobe-type-tools.github.io/font-tech-notes/pdfs/T1_SPEC.pdf,
8.4 First Four Subrs Entries:

"""If Flex or hint replacement is used in a Type 1 font program, the
first four entries in the Subrs array in the Private dictionary must be
assigned charstrings that correspond to the following code sequences. If
neither Flex nor hint replacement is used in the font program, then this
requirement is removed, and the first Subrs entry may be a normal
charstring subroutine sequence. The first four Subrs entries contain:

Subrs entry number 0:
3 0 callothersubr pop pop setcurrentpoint return
"""

othersubr handler 0 gets three arguments:
* The flex height (the distance after which the bezier splines
  are replaced with just straight lines)
* The current position after the flex (its x and y coordinates)

It pushes that position on the postscript stack, where predefined subr
handler number 0 then pops it from. It then passes it to
setcurrentpoint.

In theory, we now correctly do that setcurrentpoint call, which we
previously weren't.

In practice, that setcurrentpoint call always receives the last point of
the flex -- and our path api apparently gets confused when move_to() is
called on it when the current point is already at that same location.

So tweak the SetCurrentPoint handler to not set the current point on
the path if it's already the path's current point, with a FIXME to
figure out what exactly is happening in Gfx::Path.

No big behavior change if flex is used, but this is more correct if it
isn't.

(This only works because our `return` handler is empty, else we would
have to make the callothersubr handler start a call frame.)
2023-11-01 11:38:41 -04:00
Nico Weber
0bb8249780 LibPDF: Move type1 subr 1 and 2 handling into othersubr handler
https://adobe-type-tools.github.io/font-tech-notes/pdfs/T1_SPEC.pdf,
8.4 First Four Subrs Entries:

"""If Flex or hint replacement is used in a Type 1 font program, the
first four entries in the Subrs array in the Private dictionary must be
assigned charstrings that correspond to the following code sequences. If
neither Flex nor hint replacement is used in the font program, then this
requirement is removed, and the first Subrs entry may be a normal
charstring subroutine sequence. The first four Subrs entries contain:

[...]

Subrs entry number 1:
0 1 callothersubr return

Subrs entry number 2:
0 2 callothersubr return
"""

So subr entry numbers 1 and 2 just call othersubr 1 and 2, which
means we can just move the handling code over.

No behavior change if flex is used, but more correct if it isn't.

(This only works because our `return` handler is empty, else we would
have to make the callothersubr handler start a call frame.)
2023-11-01 11:38:41 -04:00
Ali Mohammad Pur
78c04cb8b2 AK+LibPDF: Make Format print floats in a roundtrip-safe way by default
Previously we assumed a default precision of 6, which made the printed
values quite odd in some cases.
This commit changes that default to print them with just enough
precision to produce the exact same float when roundtripped.

This commit adds some new tests that assert exact format outputs, which
have to be modified if we decide to change the default behaviour.
2023-10-31 09:12:35 +03:30
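A rough illustration of the difference between fixed precision and round-trip-safe output, written in Python rather than against AK's Format. It assumes Python's `repr()` (the shortest string that parses back to the same double) as a stand-in for the new default; `format_fixed` and `format_roundtrip` are hypothetical names.

```python
# Illustration only: Python floats are doubles, and this is not AK::Format.

def format_fixed(value: float, precision: int = 6) -> str:
    """Old-style behaviour: always print `precision` digits after the point."""
    return f"{value:.{precision}f}"


def format_roundtrip(value: float) -> str:
    """Shortest decimal string that parses back to exactly the same float."""
    return repr(value)


value = 0.1 + 0.2
fixed = format_fixed(value)
shortest = format_roundtrip(value)

# Print each string and whether parsing it recovers the original value.
print(fixed, float(fixed) == value)
print(shortest, float(shortest) == value)
```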
Nico Weber
4cc24548f6 LibPDF: Call dbgln() for unimplemented flex opcodes 2023-10-28 13:28:05 -04:00
Nico Weber
e484fae8e1 LibPDF: Don't do special subr processing for type 2 CFFs
This is a subset of #21484: Type 2 CFFs never use the special subrs,
so stop doing them for type 2 at least for now.

Fixes an assert in 0000064.pdf in 0000.zip in the pdfa dataset
(a stack underflow because a subr is supposed to push a bunch of
stuff, but instead it ran one of the built-in routines instead of
the subr from the font file).

As discussed in #21484, this isn't right for type 1 CFFs either,
but just removing the code there regresses Tests/LibPDF/type1.pdf.
A slightly more involved thing is needed there; I added a FIXME
for that here.
2023-10-28 13:28:05 -04:00
Tim Ledbetter
5c0c55d2c0 LibPDF: Ensure xref stream field widths are within expected range
Previously, an xref stream with a field width larger than 8 would
result in an undefined shift occurring. We now ensure that each field
width is a number and is less than or equal to 8.
2023-10-28 13:17:09 -04:00
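A minimal sketch of the check described in 5c0c55d2c0, in Python rather than LibPDF's C++. It assumes fields are accumulated byte by byte into a 64-bit value, which is why more than 8 bytes cannot be shifted in safely. `validate_field_widths` and `read_field` are hypothetical names, and rejecting negative widths is an extra assumption not stated in the commit.

```python
def validate_field_widths(widths):
    """Return the widths as a list if each is an integer in 0..8, else raise."""
    for width in widths:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(width, bool) or not isinstance(width, int):
            raise ValueError(f"xref field width is not a number: {width!r}")
        if width < 0 or width > 8:
            raise ValueError(f"xref field width out of range: {width}")
    return list(widths)


def read_field(data: bytes, offset: int, width: int) -> int:
    """Read a big-endian field of `width` bytes starting at `offset`."""
    value = 0
    for byte in data[offset:offset + width]:
        # With width <= 8 the result always fits into 64 bits.
        value = (value << 8) | byte
    return value
```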
Nico Weber
6d47fca3bf LibPDF: Don't assert on outline destinations that use null as page
Nothing in PDF 1.7 spec 8.2.1 Destinations mentions the page being
`null`, but it happens in 0000372.pdf (for the root outline element)
and in 0000776.pdf (for every outline element, which looks like a
bug in the generator maybe) of 0000.zip from the pdfa dataset.
2023-10-27 06:38:25 -04:00
Tim Ledbetter
b4296e1c9b LibPDF: Don't use unsanitized values in error messages
Previously, constructing error messages with unsanitized input could
fail because error message strings must be UTF-8.
2023-10-26 11:05:32 +02:00
Nico Weber
f8bf9c6506 LibPDF: Sketch out DeviceN color spaces a bit
Documents using them now show render-time diagnostics instead
of asserting that the number of parameters passed to a color doesn't
match the number of channels of the previously-set color space.

Fixes two asserts on the `-n 500` 0000.zip test set.
2023-10-26 11:05:00 +02:00
Nico Weber
4549d6cf1b LibPDF: Add a FIXME comment to the inline image data skipping path 2023-10-26 10:59:45 +02:00
Nico Weber
2878af5968 LibPDF: Sketch out Lab color space
Same as other recent color spaces: Enough to make us not assert,
but not enough to actually produce color.

Fixes 2 asserts on the `-n 500` 0000.zip pdfa dataset.
2023-10-26 10:59:45 +02:00
Nico Weber
a65d8ff2ea LibPDF: Tolerate page rotation being an indirect object
Needed e.g. for 0000196.pdf in 0000.zip in the pdfa dataset.
2023-10-26 10:58:45 +02:00
Nico Weber
8b806183f6 LibPDF: Tolerate indirect objects in various image dict values
0000101.pdf from 0000.zip from the pdfa dataset has /Height set to
an indirect object that contains an int.

Make that work, and make sure various other similar places getting
values of the image dict also resolve indirect references.
2023-10-26 10:58:45 +02:00
Nico Weber
5dd7639386 LibPDF: Tolerate indirect references in Type0 /W array
Makes e.g. 0000236.pdf in 0000.zip in the pdfa dataset work.
2023-10-26 10:58:45 +02:00
Nico Weber
b928fadba7 LibPDF: Swap int and array branches in outline item reading
No intended behavior change.

It does have the effect that indirect object references now go down
the array path instead of the number path. They still fall over there,
but now that's easy to fix.
2023-10-26 10:58:45 +02:00
Nico Weber
208a058eab LibPDF: Tolerate integer outline item colors
0000296.pdf from 0000.zip from the pdfa dataset contains
`/C [0 0 0]` (as opposed to `/C [0.0 0.0 0.0]`). Make that work.
(It's fine per spec.)
2023-10-26 10:58:45 +02:00
Nico Weber
54cdcd0d06 LibPDF: Reject non-hexdigits in hex string with error
...instead of VERIFY()ing input data.

I haven't seen this in the wild, but since I'm here anyways,
might as well fix this.
2023-10-25 10:44:26 +02:00
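A sketch of the "return an error instead of VERIFY()ing" idea from 54cdcd0d06, in Python and not taken from LibPDF. `parse_hex_string` is a hypothetical helper that receives the bytes between `<` and `>`. Skipping whitespace and padding an odd digit count with `0` follow the PDF spec's hex string rules.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"
PDF_WHITESPACE = " \t\r\n\f\0"


def parse_hex_string(body: bytes) -> bytes:
    """Decode the contents of a PDF hex string, rejecting non-hex digits."""
    digits = []
    for byte in body:
        ch = chr(byte)
        if ch in PDF_WHITESPACE:
            continue  # whitespace between digits is ignored
        if ch not in HEX_DIGITS:
            # Report bad input as an error instead of asserting on it.
            raise ValueError(f"non-hex digit in hex string: {ch!r}")
        digits.append(ch)
    if len(digits) % 2 == 1:
        digits.append("0")  # a missing final digit is treated as 0
    return bytes.fromhex("".join(digits))
```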
Nico Weber
4675700057 LibPDF: Reject unterminated literal strings with an error
0000459.pdf in 0000.zip in the pdfa dataset contains this as the
very first object:

```
1 0 obj
<<
/Creator (Developer 2000)
/CreatorDate (
/Author (Oracle Reports)
/Producer (Oracle PDF driver)
/Title (2021_06_29 Tutoritzacions APTES.PDF)
>>
endobj
```

The `/CreatorDate` value string is unterminated.

Before, we'd assert when trying to check if the first object is
a linearization dict.

Now, we never read the first object (an error during the linearization
dict reading is treated as "file is not linearized") unless we try
to print the document's metadata -- and there we now show an error
instead of asserting.
2023-10-25 10:44:26 +02:00
Nico Weber
c0f3f1674c LibPDF: Make string literal parsing fallible
...and make running out of data after a \ an error instead of silently
returning an empty string.
2023-10-25 10:44:26 +02:00
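The two commits above make literal string parsing report errors. A simplified Python sketch of that behavior follows. It is not LibPDF's parser, `ParseError` and `parse_literal_string` are hypothetical names, and escape sequences are not decoded (the escaped byte is kept verbatim).

```python
class ParseError(Exception):
    pass


def parse_literal_string(data: bytes, start: int = 0):
    """Parse `(...)` at `start`; return (contents, position after the `)`)."""
    if data[start:start + 1] != b"(":
        raise ParseError("expected '('")
    depth = 1
    pos = start + 1
    out = bytearray()
    while pos < len(data):
        byte = data[pos]
        if byte == 0x5C:  # backslash
            if pos + 1 >= len(data):
                raise ParseError("data ends after backslash")
            out.append(data[pos + 1])  # simplified: no escape decoding
            pos += 2
            continue
        if byte == 0x28:  # '(' opens a nested, balanced pair
            depth += 1
        elif byte == 0x29:  # ')'
            depth -= 1
            if depth == 0:
                return bytes(out), pos + 1
        out.append(byte)
        pos += 1
    # Ran out of input before the closing parenthesis.
    raise ParseError("unterminated literal string")
```

With this shape, a caller such as a metadata printer can surface the error to the user instead of asserting.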
Nico Weber
311cc7d9b9 LibPDF: Implement two SeparationColorSpace methods
Actually using separation color spaces still doesn't work, but we
now no longer assert on them when they're used.

Fixes 2 crashes on the `-n 500` 0000.zip pdfa dataset.
2023-10-25 05:52:47 +02:00
Nico Weber
e7f7c434f7 LibPDF: Don't check for startxref after trailer dict
Several files have a comment after the trailer dict and the
`startxref` after it.

We really should add a consume_whitespace_and_comments() function
and call that in most places we currently call consume_whitespace().

But in this case, for non-linearized files, we first jump to the
end of the file, read `startxref`, then jump to `xref` from the
offset there, and then read the trailer after the `xref`,
only to read `startxref` again. So we can just not do that.

(For linearized files, we now completely ignore `startxref`.
But we don't use the data in `startxref` in linearized files
anyways, so it's fine to not read it there too.)

Reduces number of crashes on 300 random PDFs from the web (the first 300
from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 25 (8%) to 23 (7%).
2023-10-24 13:32:01 -04:00
Nico Weber
acf668e234 LibPDF: Make Reader::move_by() parameter more truthful
No behavior change, just simpler and less surprising.
2023-10-24 13:30:25 -04:00
Nico Weber
3fe9f8e48d LibPDF: Don't accidentally form new tokens on pages with contents arrays
A page's /Contents can be an array of streams, and the page's contents
are then as if those streams are concatenated.

Most of the time, a stream ends with whitespace. But in some cases
(e.g. 0000642.pdf from 0000.zip from the pdfa dataset), the first
stream ends with an operator (`Q`) and the next stream starts with
one (`q`), and the concatenation would form a new, unknown operator
(`Qq`). Separate the streams' contents with a space to prevent that.

Reduces the number of PDF files we fail to open in the -n 500 case
from 11 to 10 (in either case, we then crash on 18 of the PDFs
that we do manage to open).
2023-10-23 13:23:54 -04:00
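The token-merging problem from 3fe9f8e48d can be reproduced with a toy tokenizer. This is a Python sketch with made-up stream contents, not LibPDF code. `join_content_streams` and `tokenize` are hypothetical helpers, and the tokenizer only splits on whitespace.

```python
def join_content_streams(streams):
    """Concatenate a page's content streams, separated by one space."""
    return b" ".join(streams)


def tokenize(data: bytes):
    """Toy tokenizer: split on whitespace only."""
    return data.split()


# The first stream ends with `Q` and the second starts with `q`.
streams = [b"q 1 0 0 1 0 0 cm Q", b"q BT ET Q"]

naive = b"".join(streams)            # forms the bogus token `Qq`
fixed = join_content_streams(streams)

print(tokenize(naive))
print(tokenize(fixed))
```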
Nico Weber
11bee7a075 LibPDF: Don't crash on fixed-width type 1 fonts that use /MissingWidth
Type 1 fonts usually have a m_font_program and no m_font -- they only
have m_font if we're using a replacement font for the fonts that
were built-in to PDFs before Acrobat 4.0 (and must still work to
show existing files).

However, SimpleFont::get_glyph_width() used to always return a
float, which in Type1Font was only implemented if m_font was set.

Per spec, we're supposed to just use /MissingWidth for fonts that
are missing an entry in the descriptor's /Width array. However, for
built-in fonts, no explicit /Width array is needed (PDF 1.7 spec,
Appendix H.3, 5.5.1). So if we just always use /MissingWidth,
then PDFs that use a built-in font draw all their text on top
of each other (e.g. 000333.pdf from stillhq.com-pdfdb).

So change get_glyph_width() to return Optional<float>, return
it only in Type1Font if m_font is set, and use MissingWidth
if it isn't set.

That way, replacement fonts still return a width, and real
fonts that are supposed to have /Width and use /MissingWidth
for missing entries do what they're supposed to too, instead
of crashing.

From 20 (6%) to 16 (5%) crashes on the 300 first PDFs, and from
39 (7.8%) to 31 (6.2%) on the 500-random PDFs test.
2023-10-23 09:33:03 -04:00
Nico Weber
52afa936c4 LibPDF: Don't over-read in charset formats 1 and 2
`left` might be a number bigger than there are actually glyphs in the
CFF.

The spec says "The number of ranges is not explicitly specified in the
font. Instead, software utilizing this data simply processes ranges
until all glyphs in the font are covered." Apparently we have to check
for this within each range as well.

Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.

Together with the previous commit:

From 21 (7%) to 20 (6%) crashes on the 300 first PDFs, and from
41 (8.2%) to 39 (7.8%) on the 500-random PDFs test.
2023-10-23 09:31:11 -04:00
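A Python sketch of the range handling described in 52afa936c4, not the LibPDF implementation. `read_charset_ranges` is a hypothetical name. It assumes `data` starts right after the charset format byte and that `glyph_count` includes `.notdef`, which the charset does not list.

```python
def read_charset_ranges(data: bytes, glyph_count: int, charset_format: int):
    """Parse CFF charset format 1 or 2 ranges into a list of SIDs."""
    if charset_format not in (1, 2):
        raise ValueError("only charset formats 1 and 2 use ranges")
    left_size = 1 if charset_format == 1 else 2  # Card8 vs. Card16
    needed = glyph_count - 1                     # .notdef is implicit
    sids = []
    offset = 0
    while len(sids) < needed:
        if offset + 2 + left_size > len(data):
            raise ValueError("charset data is truncated")
        first = int.from_bytes(data[offset:offset + 2], "big")
        left = int.from_bytes(data[offset + 2:offset + 2 + left_size], "big")
        offset += 2 + left_size
        # The range nominally covers first..first+left, but `left` may claim
        # more glyphs than the font has, so re-check inside the range too.
        for sid in range(first, first + left + 1):
            if len(sids) >= needed:
                break
            sids.append(sid)
    return sids
```

For example, a single format 1 range with `first = 1` and `left = 9` in a font with 4 glyphs yields only SIDs 1, 2 and 3.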
Nico Weber
58ff7b5336 LibPDF: Support offset size 3 in CFF index reading
...and replace template instantiations with a loop, to make this
easily possible.

Vaguely nice for code size as well.

Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.
2023-10-23 09:31:11 -04:00
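Reading offsets with a loop makes every offset size from 1 to 4 work with the same code, including size 3. The following is a Python sketch of that idea, not the LibPDF code. `read_offset` and `read_index` are hypothetical names, and bounds checking is omitted for brevity.

```python
def read_offset(data: bytes, position: int, offset_size: int) -> int:
    """Read a big-endian offset of 1 to 4 bytes, one byte at a time."""
    if not 1 <= offset_size <= 4:
        raise ValueError(f"invalid CFF offset size: {offset_size}")
    value = 0
    for i in range(offset_size):
        value = (value << 8) | data[position + i]
    return value


def read_index(data: bytes):
    """Parse a CFF INDEX at the start of `data` into a list of byte strings."""
    count = int.from_bytes(data[0:2], "big")
    if count == 0:
        return []
    offset_size = data[2]
    offsets = [read_offset(data, 3 + i * offset_size, offset_size)
               for i in range(count + 1)]
    # Offsets are relative to the byte before the object data, so the
    # first offset is always 1.
    base = 3 + (count + 1) * offset_size - 1
    return [data[base + offsets[i]:base + offsets[i + 1]]
            for i in range(count)]
```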
Nico Weber
3197f0cab6 LibPDF: Handle CFF fonts with charset format 0 and > 255 glyphs better
We used to use a u8 as the loop counter, which would overflow
if there were more than 255 glyphs, producing hundreds of megabytes
of

    Couldn't find string for SID x, going with space

output in the process, while all data until the end of the CFF
section got interpreted as SIDs, until a try_read() would finally
fail.

We now no longer fail miserably trying to render page 2 of
0000352.pdf of 0000.zip from the pdfa dataset.

Fixes just one crash of the larger 500-document test set, but
when I tweak test_pdf.py to print all stacks instead of just the
top 5, it no longer produces 260 MB of output.
2023-10-23 09:31:11 -04:00
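The wraparound can be emulated in Python. This sketch assumes the old loop had the shape `for (u8 i = 0; i < glyph_count; ++i)`, which the commit message does not spell out. `count_iterations_with_u8` is a hypothetical helper, and `limit` exists only so the demonstration terminates.

```python
def count_iterations_with_u8(glyph_count: int, limit: int = 100000) -> int:
    """Count loop iterations with an 8-bit counter, giving up after `limit`."""
    i = 0
    steps = 0
    while i < glyph_count and steps < limit:
        steps += 1
        i = (i + 1) & 0xFF  # a u8 wraps from 255 back to 0
    return steps


print(count_iterations_with_u8(200))              # terminates normally
print(count_iterations_with_u8(300, limit=1000))  # only stops at the limit
```

With more than 255 glyphs the counter never reaches the bound, so the real loop kept reading until the data ran out.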
Nico Weber
0869ca5615 LibPDF: Add more CFF_DEBUG output 2023-10-23 09:31:11 -04:00
Nico Weber
cf705eb235 LibPDF: Use TRY() to get decompression result
Makes us die with a better error message for some PDFs.
2023-10-23 09:30:41 -04:00
Nico Weber
6153dd7b84 LibPDF: Tolerate comments after dict values
Makes 0000607.pdf from 0000.zip from the pdfa dataset load.
2023-10-23 09:28:00 -04:00
Nico Weber
a1f17bd643 LibPDF: Skip inline image data in operator stream
Inline images can contain arbitrary binary data in the operator stream,
greatly confusing the operator parser.

Just skip them for now. They'll produce a
`Rendering of feature not supported: draw operation: inline_image_begin`
diag as usual, so we won't forget about it.

After #21536, reduces number of crashes on 300 random PDFs from the web
(the first 300 from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 23 (7%) to 22 (7%).

On a larger sample (`Meta/test_pdf.py -n 500 ~/Downloads/0000`),
reduces number of crashes from 53 (10.6%) with 36 distinct crash
stacks to 46 (9.2%) with 33 distinct stacks.
2023-10-23 07:51:08 +02:00
Nico Weber
1a58fee0fd LibPDF: Don't assert on named simple color space
If a PDF uses `/CustomName cs` and `/CustomName` then points at just a
name like `/DeviceGray` instead of an array, that's ok. Just using
`/DeviceGray cs` is simpler, so this extra level of indirection is
somewhat rare in practice, but it's valid and it does happen. So support
it.

We already have a helper that does the right thing that we just need to
call.

Together with #21524 and #21525, reduces number of crashes on 300 random
PDFs from the web (the first 300 from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 29 (9%) to 25 (8%).
2023-10-21 21:04:26 +02:00
Nico Weber
04aec4a032 LibPDF: Don't log CFF Copyright tag as unknown 2023-10-21 21:04:02 +02:00
Nico Weber
8922574133 LibPDF: Fix assertion when destination page is an index
This isn't correct per spec, but it happens in practice, e.g.
0000847.pdf, 0000327.pdf, 0000124.pdf from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
2023-10-21 09:10:30 +02:00
Nico Weber
fbd00d9c8e LibPDF: Use resolve_to on /Dests entry
Fixes an assertion if /Dests is an indirect object (`24 0 R`)
instead of an inline dictionary.
2023-10-21 09:10:30 +02:00
Nico Weber
8c3478a921 LibPDF: Use resolve_to() helper
No behavior change.
2023-10-21 09:10:30 +02:00
Nico Weber
801cfd5ae3 LibPDF: Let parser process filters by default
This fixes a small bug from 39b2eed3f6: That commit tried to disable
filters for the very first object read, for the case covered in
Tests/LibPDF/password-is-sup.pdf.

However, it accidentally also disabled filters by default.

Most of the time, this isn't really a difference: We call
`set_filters_enabled(true);` very early in
`DocumentParser::initialize_linearization_dict()`, which explicitly
enables filters, and `initialize_linearization_dict()` is the very
first thing called in `DocumentParser::initialize()`.

But there's an early exit in `initialize_linearization_dict()`
for if there's nothing looking like an indirect object right
after the header, and in this case we used to not enable
filtering, and would hand compressed streams to the operand parser.

(And due to a 2nd bug, we'd even do this if the header line was
followed by an empty line.)
2023-10-21 09:09:53 +02:00
Nico Weber
cf26fc2393 LibPDF: Make parser skip whitespace after header
0000990.pdf from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
starts like so:

```
%PDF-1.7

4 0 obj
```

parse_header() used to put the cursor at the start of the 2nd,
empty, line. initialize_linearization_dict() would then check
if `m_reader.matches_number()` to see if there could possibly
be a linearization dict.

In this case, there isn't one, but we should detect linearization
dicts even if they're separated by whitespace from the first line.
2023-10-21 09:09:53 +02:00
Nico Weber
34cb506bad LibPDF: Replace another TODO with a message
Like ca1a98ba9f, but for stroke color.
2023-10-21 09:09:06 +02:00
Nico Weber
9442782881 LibPDF: Implement text_next_line_show_string_set_spacing
Not used terribly often, but e.g. used in 000333.pdf page 17 in
stillhq.com-pdfdb.
2023-10-20 14:24:31 -04:00
Nico Weber
78dea9500f LibPDF: Make operator parsing use ReadonlySpan instead of Vector
No behavior change.
2023-10-20 14:24:31 -04:00
Nico Weber
e0268dcc87 LibPDF: Allow /Pattern to be used directly as a color space name
Per spec:

"If the color space is one that can be specified by a name and no
additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, and certain
cases of Pattern), the name may be specified directly."

We still don't implement /Pattern color spaces, but now we no longer
crash trying to look up the potentially-nonexistent /ColorSpace
dictionary on the page object when /Pattern is used directly as color
space name.

On top of #21514, reduces number of crashes on 300 random PDFs from the
web (the first 300 from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 42 (14%) to 34 (11%).
2023-10-20 10:35:54 -06:00
Nico Weber
aea0e2f313 LibPDF: Rename ColorSpaceFamily function to may_be_specified_directly()
It used to be called ColorSpaceFamily::never_needs_parameters().

But in the cpp file, the macro arg was called ever_needs_parameters,
and the spec says

"If the color space is one that can be specified by a name and no
additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, and certain
cases of Pattern), the name may be specified directly."

so let's use that language here.

No behavior change.
2023-10-20 10:35:54 -06:00
Nico Weber
095a2a17ed LibPDF: Replace TODO()s in Type0Font code with Errors
...which causes us to not render these fonts instead of crashing.

Reduces number of crashes on 300 random PDFs from the web (the first 300
from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 64 (21%) to 42 (14%).
2023-10-20 10:33:59 -06:00
Nico Weber
33443f7991 LibPDF: Implement ICCBasedColorSpace::number_of_components()
We now no longer crash on images that use an ICC-based color space.
Reduces number of crashes on 300 random PDFs from the web (the first 300
from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/)
from 81 (27%) to 64 (21%).

Also fixes all remaining crashes in
411_getting_started_with_instruments.pdf and
513_high_efficiency_image_file_format.pdf.
2023-10-20 08:58:52 +02:00
Nico Weber
f5d3f47af3 LibPDF: Add spec comment about color spaces on images 2023-10-20 08:58:52 +02:00
Nico Weber
7c24a89acf LibPDF: Add spec comment about valid bits_per_component values 2023-10-20 08:58:52 +02:00
Nico Weber
64bb9aa8c7 LibPDF: Fix comment typo 2023-10-20 08:58:52 +02:00