1
Fork 0
mirror of https://github.com/RGBCube/serenity synced 2025-05-14 07:04:57 +00:00
Commit graph

57 commits

Author SHA1 Message Date
Nico Weber
87112dcbdc LibPDF: Return null for invalid refs, tolerate null objects as outline
https://llvm.org/devmtg/2022-11/slides/TechTalk5-WhatDoesItTakeToRunLLVMBuildbots.pdf
has an xref table that starts like so:

```
xref
0 214
0000000002 65535 f
0000924663 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
0000000160 00000 n
0000000263 00000 n
```

This is a list of objects in the PDF file. The lines ending with 'f'
mean that this object is "free", that is it's not stored in the file.
In this file, objects 0, 2, 3 are free. For free objects, the first
number is the offset of the next free object: Object 0 refers to object
2, 2 to 3, and 3 back to 0 (since it's the last free object).
The lines ending with "n" are actual objects; here the first number is
a byte offset to where that object is stored in the file.

Furthermore, the file contains

```
/Outlines
2
0
R
```

in its root object, meaning that object 2 stores the page outlines.

Since object 2 is set as free, there is no object 2. But the spec
says that an invalid object reference is just the null object.

This patch makes us return null objects for references to free
objects, and it also makes us treat a null object as /Outlines value
the same as not having /Outlines in the first place.

Fixes #23023 -- we can now open that file. (We don't render it super
well, but only for already-known reasons.)

Since I found it a bit confusing: XRefTable has two related methods
here:

1. has_object() returns if an object was explicitly listed in an
   xref table. The first number right after `xref` is the start
   index. So if an xref table were to start with `10`, we'd implicitly
   create 10 trailing objects for which has_object() would return false
2. is_object_in_use() returns true if an object that was in a table
   (i.e. one where has_object() returns true) was listed with 'n' and
   false if it was listed with 'f'.

DocumentParser::parse_object_with_index() should probably return a null
object for the `!has_object()` case as well instead of VERIFY()ing
that has_object() is true. But I haven't seen this in the wild yet,
so keeping as-is for now.
2024-01-31 12:10:19 -05:00
Nico Weber
0bb0c7dac2 LibPDF: Scan for PDF file start in first 1024 bytes
Other readers do this too, and files depend on this.

Fixes opening these four files from the PDFA 0000.zip dataset:

* 0000015.pdf
  Starts with `C:\web\webeuncet\_cat\_docs\_publics\` before header
* 0000408.pdf
  Starts with UTF-8 BOM
* 0000524.pdf
  Starts with 867 bytes of HTML containing a PHP backtrace
* 0000680.pdf
  Starts with `C:\web\webeuncet\_cat\_docs\_publics\` too
2024-01-03 10:12:35 +01:00
Nico Weber
13641693cb LibPDF: Use make_object<>() to make objects
No behavior change.
2023-12-20 12:19:08 +01:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Nico Weber
57e2b5ef59 LibPDF+Tests: Correctly decode text strings without explicit encoding 2023-11-22 09:08:06 -07:00
Nico Weber
e39a790c82 LibPDF: Stop converting encodings in object parser
Per 1.7 spec 3.8.1, there are multiple logical text string types:
* text strings
* ASCII strings
* byte strings

Text strings can be in UTF-16BE, PDFDocEncoding, or (since PDF 2.0)
UTF-8.

But byte strings shouldn't be converted but treated as binary
data.

This makes us no longer convert strings used for drawing page text.
TABLE 5.6 "Text-showing operators" lists the operands for text-showing
operators as just "string", not "text string" (even though these strings
confusingly are called "text strings" in the body text), so not doing
this there is correct (and matches other viewers).

We also no longer incorrectly convert strings used for cypto data
(such as passwords), if they start with an UTF-16BE or UTF-8 marker.

No behavior change for outlines and info dict entries.

https://pdfa.org/understanding-utf-8-in-pdf-2-0/ has a good overview of
this.

(ASCII strings only contain ASCII characters and behave the same
anyways.)
2023-11-22 09:08:06 -07:00
Nico Weber
6d47fca3bf LibPDF: Don't assert on outline destinations that use null as page
Nothing in PDF 1.7 spec 8.2.1 Destinations mentions the page being
`null`, but it happens in 0000372.pdf (for the root outline element)
and in 0000776.pdf (for every outline element, which looks like a
bug in the generator maybe) of 0000.zip from the pdfa dataset.
2023-10-27 06:38:25 -04:00
Nico Weber
a65d8ff2ea LibPDF: Tolerate page rotation being an indirect object
Needed e.g. for 0000196.pdf in 0000.zip in the pdfa dataset.
2023-10-26 10:58:45 +02:00
Nico Weber
208a058eab LibPDF: Tolerate integer outline item colors
0000296.pdf from 0000.zip from the pdfa dataset contains
`/C [0 0 0]` (as opposed to `/C [0.0 0.0 0.0]`). Make that work.
(It's fine per spec.)
2023-10-26 10:58:45 +02:00
Nico Weber
8922574133 LibPDF: Fix assertion when destination page is an index
This isn't correct per spec, but it happens in practice, e.g.
0000847.pdf, 0000327.pdf, 0000124.pdf from 0000.zip from
https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
2023-10-21 09:10:30 +02:00
Nico Weber
fbd00d9c8e LibPDF: Use resolve_to on /Dests entry
Fixes an assertion if /Dests is an indirect object (`24 0 R`)
instead of an inline dictionary.
2023-10-21 09:10:30 +02:00
Nico Weber
8c3478a921 LibPDF: Use resolve_to() helper
No behavior change.
2023-10-21 09:10:30 +02:00
Nico Weber
182639217f LibPDF: Implement GoTo action for outline
Outline items can contain either a /Dest key or an /A key.

The /Dest key points to a "Destination" (various ways to reference a
page in the same document).

The /A key points to an "Action" which can have several types.
One type, the /GoTo type, just also points to a Destination.

Implement GoTo actions. This makes clicking "Contents" in the outline of
https://developer.apple.com/library/archive/documentation/mac/pdf/Text.pdf
work. (Almost all other items in this file's outline use /Dest.
"Contents" could too, but it uses /A /GoTo for some reason.)

(Other action types are things like opening a hyperlink, opening a
different file, playing a sound, submitting a form, etc. Actions
are also used for in-page links, not just in outlines. Many of
these action types we'll likely never want to implement.)
2023-10-18 06:29:02 -04:00
Nico Weber
f646e47d46 LibPDF: Extract a create_destination_from_object() function
No big behavior change. The new function now produces an error
if a destination isn't in one of the supported formats.
2023-10-18 06:29:02 -04:00
Nico Weber
532230c0e4 LibPDF: Extract a Document::read_filters() method
No behavior change.
2023-07-24 09:50:45 -04:00
Nico Weber
164c132928 LibPDF: Fix dumping of toplevel indirects
An indirect object starts `42 0 obj`, not `obj 42 0`.
2023-07-21 10:44:50 -04:00
Nico Weber
ca433befa0 LibPDF: Add method to Document to dump a Page and all related objects
...except for the /Parent object, else we'd print all pages :)
2023-07-13 20:29:58 +02:00
Nico Weber
ea89053c12 LibPDF: Make PDF version accessible on Document 2023-07-11 13:49:17 -04:00
Nico Weber
c5c940b1c9 LibPDF: Add accessor for the document's info dict
This dict contains some metadata in some files.

Newer files also contain XMP metadata, but it's recommended to
still include this dict as well, for compatibility with older readers.
And it's much less complex than XMP, so let's support it.
2023-07-10 17:49:07 +01:00
Julian Offenhäuser
95a804bc4e LibPDF: Allow the page rotation to be inherited 2023-03-25 16:27:30 -06:00
Julian Offenhäuser
b90a794d78 LibPDF: Allow pages with no specified contents
The contents object may be omitted as per spec, which will just leave
the page blank.
2023-03-25 16:27:30 -06:00
Julian Offenhäuser
fde990ead8 LibPDF: Allow optional inheritable page attributes
Previously, get_inheritable_object would always try to find the object
and throw an error if it couldn't. The spec tells us that some page
attributes, like CropBox, are optional but also inheritable. Others,
like the media box and resources, are technically required by the spec,
but omitted by some documents.

In both cases, we are now able to search for inheritable objects and
find a suitable replacement if there wasn't one.
2023-03-25 16:27:30 -06:00
Andreas Kling
8a48246ed1 Everywhere: Stop using NonnullRefPtrVector
This class had slightly confusing semantics and the added weirdness
doesn't seem worth it just so we can say "." instead of "->" when
iterating over a vector of NNRPs.

This patch replaces NonnullRefPtrVector<T> with Vector<NNRP<T>>.
2023-03-06 23:46:35 +01:00
Timothy Flynn
f3db548a3d AK+Everywhere: Rename FlyString to DeprecatedFlyString
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so
let's rename it to A) match the name of DeprecatedString, B) write a new
FlyString class that is tied to String.
2023-01-09 23:00:24 +00:00
Rodrigo Tobar
a5620fd41f LibPDF: Load destinations from Catalogue -> Names -> Dests name tree
PDF allows for named destinations to be provided as string. These can be
either found in the Dests dictionary in the document catalogue (as
already implemented), or in the Name Tree specified by the Dests key in
the Names dictionary of the document catalogue (missing).

This commit adds this missing case. Once the named destination is found
in the name tree, its value is interpreted just like in the first case,
so a new utility method encapsulates the common behavior.
2023-01-06 18:06:41 +01:00
Rodrigo Tobar
5420261347 LibPDF: Implement name tree lookups
Name Trees are hierarchical, string-keyed, sorted-by-key dictionary
structures in PDF where each node (except the root) specifies the bounds
of the values it holds, and either its kids (more nodes) or the
key/value pairs it contains.

This commit implements a series of lookup calls for finding a key in
such name trees. This implementation follows the tree as needed on each
lookup, but if that becomes inefficient in the long run we can switch to
creating a HashMap with all the contents, which as a drawback will
require more memory.
2023-01-06 18:06:41 +01:00
Rodrigo Tobar
f510b2b180 LibPDF: Support null destination parameters
Destination arrays contain a page number, a mode name, and parameters
specific to that mode. In many cases these parameters can be set to
"null", which our code wasn't taking into consideration.

This commit parses these parameters taking into account whether they are
null or actual numbers, and stores them as Optional<float> instead of
plain floats. The parameters are not yet used anywhere else other than
when formatting a Destination object, so the change is fairly small.
2023-01-06 18:06:41 +01:00
Rodrigo Tobar
6df9aa8f2c LibPDF: Store page number, not Value, in OutlineItem
The Value previously stored corresponded to a Reference to a Page object
in the PDF document. This isn't useful information, since what we want
to display at the end of the day is the page an outline item refers to.

This commit changes the page member on OutlineItem to be a Optional<u32>
(some destinations don't necessarily refer to a Page), which we resolve
while building OutlineItems.
2022-12-17 19:40:52 +01:00
Rodrigo Tobar
3db6af6360 LibPDF: Keep track of OutlineItem parents
While OutlineItem had a parent field, it was never populated nor used.
This commit populates it when possible (no parent means the OutlineItem
is a top-level item).
2022-12-17 19:40:52 +01:00
Rodrigo Tobar
cb1a7cc721 LibPDF: Simplify outline construction
While the Outline Items making up the document's Outline have all sorts
of cross-references (parent, first/last chlid, next/previous sibling,
etc), not all documents out there have fully-consistent references. Our
implementation already discarded some of that information too (e.g.,
/Parent and /Prev were never read), and trusted that /First and /Next
were good enough to traverse the whole hierarchy.

Where the current implementation failed was in assuming that /Last was
also a good source of information. There are documents out there were
/Last also points to dead ends, and were therefore causing a crash when
we verified that the last child found on a chain was the /Last child
declared by the parent. To fix this I'm simply removing the check, and
simplifying the function call to remove any references to /Last. This
way we affirm our commitment to /First and /Next as the main sources of
information.
2022-12-16 01:24:43 -07:00
Linus Groh
57dc179b1f Everywhere: Rename to_{string => deprecated_string}() where applicable
This will make it easier to support both string types at the same time
while we convert code, and tracking down remaining uses.

One big exception is Value::to_string() in LibJS, where the name is
dictated by the ToString AO.
2022-12-06 08:54:33 +01:00
Linus Groh
6e19ab2bbc AK+Everywhere: Rename String to DeprecatedString
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
2022-12-06 08:54:33 +01:00
Julian Offenhäuser
36f83cecab LibPDF: Allow page objects to inherit the MediaBox and Resources entries 2022-10-16 17:44:54 +02:00
Julian Offenhäuser
4887aacec7 LibPDF: Move document-specific parsing functionality into its own class
The Parser class is now a generic PDF object parser, of which the new
DocumentParser class derives. DocumentParser now takes over all
functions relating to linearization, pages, xref and trailer handling.

This allows the use of multiple parsers in the same document's
context, which will be needed in order to handle PDF object streams.
2022-09-17 10:07:14 +01:00
sin-ack
3f3f45580a Everywhere: Add sv suffix to strings relying on StringView(char const*)
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).

No functional changes.
2022-07-12 23:11:35 +02:00
Matthew Olsson
5b316462b2 LibPDF: Add implementation of the Standard security handler
Security handlers manage encryption and decription of PDF files. The
standard security handler uses RC4/MD5 to perform its crypto (AES as
well, but that is not yet implemented).
2022-03-29 02:52:57 +02:00
Matthew Olsson
e9342183f0 LibPDF: Support all Dest types 2022-03-07 10:53:57 +01:00
Matthew Olsson
73cf8205b4 LibPDF: Propagate errors in Parser and Document 2022-03-07 10:53:57 +01:00
Andreas Kling
80d4e830a0 Everywhere: Pass AK::ReadonlyBytes by value 2021-11-11 01:27:46 +01:00
Ben Wiederhake
f84a7e2e22 LibPDF: Replace Value class by AK::Variant
This decreases the memory consumption by LibPDF by 4 bytes per Value,
compensating exactly for the increase in an earlier commit. :^)
2021-09-20 17:39:36 +04:30
Brian Gianforcaro
507effce5b LibPDF: Use move to avoid unnecessary ref/unref of network device RefPtr
Flagged by pvs-studio as a potential perf optimization.
2021-09-16 17:17:13 +02:00
Daniel Bertalan
d7b6cc6421 Everywhere: Prevent risky implicit casts of (Nonnull)RefPtr
Our existing implementation did not check the element type of the other
pointer in the constructors and move assignment operators. This meant
that some operations that would require explicit casting on raw pointers
were done implicitly, such as:
- downcasting a base class to a derived class (e.g. `Kernel::Inode` =>
  `Kernel::ProcFSDirectoryInode` in Kernel/ProcFS.cpp),
- casting to an unrelated type (e.g. `Promise<bool>` => `Promise<Empty>`
  in LibIMAP/Client.cpp)

This, of course, allows gross violations of the type system, and makes
the need to type-check less obvious before downcasting. Luckily, while
adding the `static_ptr_cast`s, only two truly incorrect usages were
found; in the other instances, our casts just needed to be made
explicit.
2021-09-03 23:20:23 +02:00
Matthew Olsson
612b183703 LibPDF: Convert to east-const to comply with the recent style changes 2021-06-12 22:45:01 +04:30
Matthew Olsson
7b4e36bf88 LibPDF: Split ColorSpace into a different class for each color space
While unnecessary at the moment, this will allow for more fine-grained
control when complex color spaces get added.
2021-06-12 22:45:01 +04:30
Matthew Olsson
78bc9d1539 LibPDF: Refine the distinction between the Document and Parser
The Parser should hold information relevant for parsing, whereas the
Document should hold information relevant for displaying pages.
With this in mind, there is no reason for the Document to hold the
xref table and trailer. These objects have been moved to the Parser,
which allows the Parser to expose less public methods (which will be
even more evident once linearized PDFs are supported).
2021-06-12 22:45:01 +04:30
Matthew Olsson
1ef5071d1b LibPDF: Harden the document/parser against errors 2021-06-12 22:45:01 +04:30
Matthew Olsson
78f3bad7e6 LibPDF: Pre-initialize common FlyStrings in CommonNames.h 2021-05-25 00:24:09 +04:30
Matthew Olsson
a08922d2f6 LibPDF: Parse outline structures 2021-05-25 00:24:09 +04:30
Matthew Olsson
be6e4b6f3c LibPDF: Store indirect value refs in Value objects
IndirectValueRef is so simple that it can be stored directly in the
Value class instead of being heap allocated.

As the comment in Value says, however, in theory the max bits needed to
store is 48 (16 for the generation index and 32(?) for the object
index), but 32 should be good enough for now. We can increase it to u64
later if necessary.
2021-05-25 00:24:09 +04:30
Matthew Olsson
d5f94aaa7b LibPDF/PDFViewer: Support rotated pages 2021-05-18 16:35:23 +02:00