This is a step towards AESV3 support for PDF files.
The straightforward way of writing this with our APIs is pretty
allocation-heavy, but this code won't run all that often for the
regular "open PDF, check password" flow.
- `encrypt()` will always fill a multiple of block size,
`decrypt()` might produce less data. But other than that,
the middle span isn't modified even though it's a reference.
So pass the ByteBuffer to assign() (kind of like before 5998072f15,
but pass-by-move())
- In the encryption code path, assign a single buffer for IV and data
instead of awkwardly copying the data around later.
Thanks to CxByte for suggesting most of this!
No intentional behavior change.
Subsections are generally not contiguous, but this logic assumed
that they were, and kept a persistent "entry_index" count while looping
through all subsections. This commit rewrites the logic to be more
straightforward; just loop through all of the subsections and handle
each one separately.
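
A rough sketch of the per-subsection loop described above, written in standard C++ rather than the AK types the real code uses. `Entry`, `Subsection`, and `collect_entries` are hypothetical names for illustration only. The point is that each entry's object number is derived from its own subsection's start number, not from a counter carried across subsections.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-ins for illustration; not the actual LibPDF types.
struct Entry {
    uint32_t byte_offset;
};

struct Subsection {
    uint32_t first_object_number;
    std::vector<Entry> entries;
};

// Handle each subsection on its own: an entry's object number comes from
// the subsection's start number, never from a running count.
std::map<uint32_t, Entry> collect_entries(std::vector<Subsection> const& subsections)
{
    std::map<uint32_t, Entry> table;
    for (auto const& subsection : subsections) {
        for (size_t i = 0; i < subsection.entries.size(); ++i) {
            auto object_number = subsection.first_object_number + static_cast<uint32_t>(i);
            table[object_number] = subsection.entries[i];
        }
    }
    return table;
}
```

With subsections starting at objects 0 and 7, the entry in the second subsection lands at object 7, and objects 2 through 6 stay absent.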
DeprecatedString::substring() makes a copy of the substring.
Instead, use a StringView, which can make substring views in constant
time.
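
As an analogy only, here is the same trade-off with the standard library's `std::string` and `std::string_view` (an assumption for illustration; the commit itself uses the AK string types). `copy_slice` and `view_slice` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Copying variant: std::string::substr() allocates a new string and copies
// the requested bytes into it.
std::string copy_slice(std::string const& data, size_t start, size_t length)
{
    return data.substr(start, length);
}

// View variant: std::string_view::substr() only adjusts a pointer and a
// length, so its cost does not depend on `length`.
std::string_view view_slice(std::string_view data, size_t start, size_t length)
{
    return data.substr(start, length);
}
```

The view points into the original buffer, so it is only valid while that buffer is alive and unmodified.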
Reduces time for `pdf --dump-contents image-based-pdf-sample.pdf` to
2.2s (from not completing for 1+ minutes).
That file contains a 221 kB jpeg.
Find it on the internet here:
https://nlsblog.org/wp-content/uploads/2020/06/image-based-pdf-sample.pdf
This detects AESV3, and copies over the spec comments explaining what
needs to be done, but doesn't actually do it yet.
AESV3 is technically PDF 2.0-only, but
https://cipa.jp/std/documents/download_e.html?CIPA_DC-007-2021_E has a
1.7 PDF that uses it.
Previously we'd claim that we need a password to decrypt it.
Now, we cleanly crash with a TODO() \o/
- No `,` between array or dict elements
- `stream` goes in front of stream data, _after_ the stream dict
Also, print string contents as ASCII if the string data is mostly ASCII.
We now track it in the graphics state. It isn't used for anything yet.
Fixes the one thing that rendering the first 100 pages of
pdf_reference_1-7.pdf complains about.
With this, looking at page 2 of pdf_reference_1-7.pdf no longer crashes.
Why did it crash in the first place? Because due to this bug, CFF.cpp
failed to parse the font program for the font used to render the `®`
character. `Renderer::render()` adds all errors that are encountered
to an `errors` object but continues rendering. That meant that the
previous font was still active, and that didn't have a width for that
symbol in its width table.
SimpleFont::draw_string() falls back to get_glyph_width() if the width
table has no entry for a character. `Type1Font::get_glyph_width()`
always dereferences `m_font`, even if the font has a font program
(and m_font is hence nullptr).
With the off-by-one fixed, the second font is successfully installed
as current font, and the second font has a width entry for that symbol,
so the problem no longer occurs.
There were two problems:
1. parse_compressed_object_with_index() parses indirect objects
without going through Parser::parse_indirect_value(), so
push_reference() / pop_reference() weren't called.
Manually call them, both for the indirect object containing
the object stream and for the indirect object within the
object stream.
2. The indirect object within the object stream got decrypted
twice: Once when the object stream data itself got decrypted,
and then incorrectly a second time when the object data within
the stream was read. To fix, disable encryption while parsing
object stream data (since it's already decrypted).
The test is from http://opf-labs.org/format-corpus/pdfCabinetOfHorrors/
which according to readme.md at the same location is CC0.
PDF files can be linearized. In that case, they start with a
"linearization dict" that stores the key `/Linearized` and the value
`1`. To check if a file is linearized, we just read the first dict and
then checked if it had that key.
If the first object of a PDF was a stream with a compression filter
and the input PDF was encrypted and not linearized, then us trying to
decode the linearization dict could crash due to stream contents being
encrypted, decryption state not yet being initialized, and us trying
to decompress stream data before decrypting it.
To prevent this, disable decompression when parsing the first object
to determine if it's a linearization dictionary.
(A linearization dict never stores string values, so decryption
not yet being initialized is not a problem. Integer values aren't
encrypted in encrypted PDF files.)
This dict contains some metadata in some files.
Newer files also contain XMP metadata, but it's recommended to
still include this dict as well, for compatibility with older readers.
And it's much less complex than XMP, so let's support it.
Two lambdas were capturing locals that were out of scope by the
time the lambdas ran.
With this, `pdf` can successfully load and print the page count of
pdf_reference_1.7.pdf.
Reference used to be clever and stored the index of a ref in 18 bits
and the generation in 14 bits, so that both fit into a single u32.
However:
- It set MAX_REF_INDEX incorrectly (the max value of an 18-bit number
is `(1 << 18) - 1`, not `(1 << 19) - 1`)
- pdf_reference_1-7.pdf has 349223 objects, and that's larger
than `(1 << 18) - 1` (which is 262143)
Since a Reference is stored in Value which is a Variant that also
stores a pointer, the size of Value is already 64-bit. So just don't
be clever here.
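
The arithmetic can be checked with a few compile-time assertions. This is a sketch with hypothetical names (`index_bits`, `max_packed_index`, `fits_in_packed_index`, and this `Reference` layout are illustrative, not the actual declarations); the object count is the one quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants for the old packed layout; names are hypothetical.
constexpr uint32_t index_bits = 18;
constexpr uint32_t generation_bits = 14;
constexpr uint32_t max_packed_index = (1u << index_bits) - 1;

// Object count of pdf_reference_1-7.pdf, as stated in the commit message.
constexpr uint32_t object_count = 349223;

constexpr bool fits_in_packed_index(uint32_t index)
{
    return index <= max_packed_index;
}

static_assert(index_bits + generation_bits == 32);
static_assert(max_packed_index == 262143);
static_assert(!fits_in_packed_index(object_count));

// The non-clever alternative: two full-width fields.
struct Reference {
    uint32_t index;
    uint32_t generation;
};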
Makes pdf_reference_1-7.pdf get a bit further during decryption.
If the font dictionary didn't specify custom glyph widths, we would fall
back to the specified "missing width" (or 0 in most cases!), which meant
that we would draw glyphs on top of each other in a lot of cases, namely
for TrueTypeFonts or standard Type1Fonts with an OpenType fallback.
What we actually want to do in this case is ask the OpenType font for
the correct width.
A limit of 1024 subroutines seemed like a sensible choice, but some
fonts actually do exceed it. We now only assert that the specified
count is positive.
Previously, get_inheritable_object would always try to find the object
and throw an error if it couldn't. The spec tells us that some page
attributes, like CropBox, are optional but also inheritable. Others,
like the media box and resources, are technically required by the spec,
but omitted by some documents.
In both cases, we can now search for an inheritable object and fall
back to a suitable replacement if none is found.
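
A minimal sketch of an optional-returning inheritable lookup, in standard C++. `Node` and `find_inheritable` are hypothetical simplifications (attributes are plain strings here), not the real get_inheritable_object signature. The lookup walks up the parent chain and reports absence instead of an error, so the caller decides on the fallback.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified page-tree node for illustration.
struct Node {
    Node const* parent { nullptr };
    std::map<std::string, std::string> attributes;
};

// Walk from the node towards the root and return the first value found for
// `key`. Absence is not an error: the caller picks a replacement.
std::optional<std::string> find_inheritable(Node const& node, std::string const& key)
{
    for (auto const* current = &node; current; current = current->parent) {
        auto it = current->attributes.find(key);
        if (it != current->attributes.end())
            return it->second;
    }
    return std::nullopt;
}
```

A caller could then write something like `find_inheritable(page, "CropBox").value_or(media_box)`, where `media_box` is whatever replacement fits that attribute.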