serenity

mirror of https://github.com/RGBCube/serenity synced 2025-07-01 13:22:12 +00:00

Author	SHA1	Message	Date
Nico Weber	87112dcbdc	LibPDF: Return null for invalid refs, tolerate null objects as outline https://llvm.org/devmtg/2022-11/slides/TechTalk5-WhatDoesItTakeToRunLLVMBuildbots.pdf has an xref table that starts like so: ``` xref 0 214 0000000002 65535 f 0000924663 00000 n 0000000003 00000 f 0000000000 00000 f 0000000016 00000 n 0000000160 00000 n 0000000263 00000 n ``` This is a list of objects in the PDF file. The lines ending with 'f' mean that this object is "free", that is it's not stored in the file. In this file, objects 0, 2, 3 are free. For free objects, the first number is the offset of the next free object: Object 0 refers to object 2, 2 to 3, and 3 back to 0 (since it's the last free object). The lines ending with "n" are actual objects; here the first number is a byte offset to where that object is stored in the file. Furthermore, the file contains ``` /Outlines 2 0 R ``` in its root object, meaning that object 2 stores the page outlines. Since object 2 is set as free, there is no object 2. But the spec says that an invalid object reference is just the null object. This patch makes us return null objects for references to free objects, and it also makes us treat a null object as /Outlines value the same as not having /Outlines in the first place. Fixes #23023 -- we can now open that file. (We don't render it super well, but only for already-known reasons.) Since I found it a bit confusing: XRefTable has two related methods here: 1. has_object() returns if an object was explicitly listed in an xref table. The first number right after `xref` is the start index. So if an xref table were to start with `10`, we'd implicitly create 10 trailing objects for which has_object() would return false 2. is_object_in_use() returns true if an object that was in a table (i.e. one where has_object() returns true) was listed with 'n' and false if it was listed with 'f'. 
DocumentParser::parse_object_with_index() should probably return a null object for the `!has_object()` case as well instead of VERIFY()ing that has_object() is true. But I haven't seen this in the wild yet, so keeping as-is for now.	2024-01-31 12:10:19 -05:00
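The xref rules described in the commit above can be sketched roughly as follows. This is a minimal, self-contained model, not LibPDF's actual `XRefTable`: the names `SimpleXRefTable`, `XRefEntry`, `parse_xref`, and `offset_of` are hypothetical, and only a single xref subsection is handled. Unlike the code described in the commit, `offset_of` also returns "null" for objects that are not listed at all, which the commit message only mentions as a possible later change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified model of one xref subsection.
struct XRefEntry {
    uint64_t offset_or_next_free; // byte offset for 'n', next free object number for 'f'
    uint16_t generation;
    bool in_use; // true for 'n', false for 'f'
};

struct SimpleXRefTable {
    uint32_t start_index { 0 };
    std::vector<XRefEntry> entries;

    // True if the object was explicitly listed in the table.
    bool has_object(uint32_t index) const
    {
        return index >= start_index && index - start_index < entries.size();
    }

    // True only if the object was listed and marked 'n'.
    bool is_object_in_use(uint32_t index) const
    {
        return has_object(index) && entries[index - start_index].in_use;
    }

    // Byte offset of an in-use object; std::nullopt stands for the null object.
    std::optional<uint64_t> offset_of(uint32_t index) const
    {
        if (!is_object_in_use(index))
            return std::nullopt;
        return entries[index - start_index].offset_or_next_free;
    }
};

// Parses text of the form "xref <start> <count> <entries...>".
SimpleXRefTable parse_xref(std::string const& text)
{
    std::istringstream in(text);
    std::string keyword;
    SimpleXRefTable table;
    size_t count = 0;
    in >> keyword >> table.start_index >> count;
    for (size_t i = 0; i < count; ++i) {
        uint64_t first = 0;
        uint16_t generation = 0;
        char kind = 0;
        if (!(in >> first >> generation >> kind))
            break; // truncated table: keep what was read so far
        table.entries.push_back(XRefEntry { first, generation, kind == 'n' });
    }
    return table;
}
```

With the first entries of the table quoted in the commit, a reference such as `2 0 R` would resolve to `std::nullopt` here because object 2 is marked `f`.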
Nico Weber	0bb0c7dac2	LibPDF: Scan for PDF file start in first 1024 bytes Other readers do this too, and files depend on this. Fixes opening these four files from the PDFA 0000.zip dataset: * 0000015.pdf Starts with `C:\web\webeuncet\_cat\_docs\_publics\` before header * 0000408.pdf Starts with UTF-8 BOM * 0000524.pdf Starts with 867 bytes of HTML containing a PHP backtrace * 0000680.pdf Starts with `C:\web\webeuncet\_cat\_docs\_publics\` too	2024-01-03 10:12:35 +01:00
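A header scan like the one described in the commit above might look like this. It is only a sketch: `find_pdf_header` is a hypothetical name, and the choice to require the whole `%PDF-` marker to lie within the first 1024 bytes is an assumption, not something the commit message states.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Returns the offset of "%PDF-" if it occurs within the first 1024 bytes,
// or std::nullopt if no header is found in that window.
std::optional<size_t> find_pdf_header(std::string_view bytes)
{
    constexpr std::string_view marker = "%PDF-";
    constexpr size_t scan_limit = 1024; // assumed window size, taken from the commit title

    std::string_view window = bytes.substr(0, std::min(bytes.size(), scan_limit));
    size_t position = window.find(marker);
    if (position == std::string_view::npos)
        return std::nullopt;
    return position;
}
```

A caller could then treat the returned offset as the logical start of the file, so junk such as a UTF-8 BOM or a stray path before the header is skipped.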
Nico Weber	13641693cb	LibPDF: Use make_object<>() to make objects No behavior change.	2023-12-20 12:19:08 +01:00
Ali Mohammad Pur	5e1499d104	Everywhere: Rename {Deprecated => Byte}String This commit un-deprecates DeprecatedString, and repurposes it as a byte string. As the null state has already been removed, there are no other particularly hairy blockers in repurposing this type as a byte string (what it _really_ is). This commit is auto-generated: $ xs=$(ack -l \bDeprecatedString\b\\|deprecated_string AK Userland \ Meta Ports Ladybird Tests Kernel) $ perl -pie 's/\bDeprecatedString\b/ByteString/g; s/deprecated_string/byte_string/g' $xs $ clang-format --style=file -i \ $(git diff --name-only \| grep \.cpp\\|\.h) $ gn format $(git ls-files '.gn' '.gni')	2023-12-17 18:25:10 +03:30
Nico Weber	57e2b5ef59	LibPDF+Tests: Correctly decode text strings without explicit encoding	2023-11-22 09:08:06 -07:00
Nico Weber	e39a790c82	LibPDF: Stop converting encodings in object parser Per 1.7 spec 3.8.1, there are multiple logical text string types: * text strings * ASCII strings * byte strings Text strings can be in UTF-16BE, PDFDocEncoding, or (since PDF 2.0) UTF-8. But byte strings shouldn't be converted but treated as binary data. This makes us no longer convert strings used for drawing page text. TABLE 5.6 "Text-showing operators" lists the operands for text-showing operators as just "string", not "text string" (even though these strings confusingly are called "text strings" in the body text), so not doing this there is correct (and matches other viewers). We also no longer incorrectly convert strings used for crypto data (such as passwords), if they start with a UTF-16BE or UTF-8 marker. No behavior change for outlines and info dict entries. https://pdfa.org/understanding-utf-8-in-pdf-2-0/ has a good overview of this. (ASCII strings only contain ASCII characters and behave the same anyways.)	2023-11-22 09:08:06 -07:00
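To illustrate the distinction drawn in the commit above, the encoding of a text string can be chosen from its leading marker bytes, while byte strings are never passed through such a function. The sketch below is an assumption-laden illustration, not LibPDF code: `TextStringEncoding` and `detect_text_string_encoding` are hypothetical names, and the actual conversion step is left out.

```cpp
#include <cassert>
#include <string_view>

enum class TextStringEncoding {
    UTF16BE,        // starts with the byte order mark FE FF
    UTF8,           // starts with EF BB BF (PDF 2.0 and later)
    PDFDocEncoding, // no marker present
};

// Only meant for "text strings" (outline titles, info dict entries, ...).
// Byte strings such as page text operands or passwords must stay untouched.
TextStringEncoding detect_text_string_encoding(std::string_view bytes)
{
    if (bytes.substr(0, 2) == std::string_view("\xFE\xFF", 2))
        return TextStringEncoding::UTF16BE;
    if (bytes.substr(0, 3) == std::string_view("\xEF\xBB\xBF", 3))
        return TextStringEncoding::UTF8;
    return TextStringEncoding::PDFDocEncoding;
}
```

Keeping detection in a separate helper that callers invoke explicitly, rather than in the object parser, mirrors the idea that only the consumer knows whether a string is a text string or binary data.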
Nico Weber	6d47fca3bf	LibPDF: Don't assert on outline destinations that use `null` as page Nothing in PDF 1.7 spec 8.2.1 Destinations mentions the page being `null`, but it happens in 0000372.pdf (for the root outline element) and in 0000776.pdf (for every outline element, which looks like a bug in the generator maybe) of 0000.zip from the pdfa dataset.	2023-10-27 06:38:25 -04:00
Nico Weber	a65d8ff2ea	LibPDF: Tolerate page rotation being an indirect object Needed e.g. for 0000196.pdf in 0000.zip in the pdfa dataset.	2023-10-26 10:58:45 +02:00
Nico Weber	208a058eab	LibPDF: Tolerate integer outline item colors 0000296.pdf from 0000.zip from the pdfa dataset contains `/C [0 0 0]` (as opposed to `/C [0.0 0.0 0.0]`). Make that work. (It's fine per spec.)	2023-10-26 10:58:45 +02:00
Nico Weber	8922574133	LibPDF: Fix assertion when destination page is an index This isn't correct per spec, but it happens in practice, e.g. 0000847.pdf, 0000327.pdf, 0000124.pdf from 0000.zip from https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/	2023-10-21 09:10:30 +02:00
Nico Weber	fbd00d9c8e	LibPDF: Use resolve_to on /Dests entry Fixes an assertion if /Dests is an indirect object (`24 0 R`) instead of an inline dictionary.	2023-10-21 09:10:30 +02:00
Nico Weber	8c3478a921	LibPDF: Use resolve_to() helper No behavior change.	2023-10-21 09:10:30 +02:00
Nico Weber	182639217f	LibPDF: Implement GoTo action for outline Outline items can contain either a /Dest key or an /A key. The /Dest key points to a "Destination" (various ways to reference a page in the same document). The /A key points to an "Action" which can have several types. One type, the /GoTo type, just also points to a Destination. Implement GoTo actions. This makes clicking "Contents" in the outline of https://developer.apple.com/library/archive/documentation/mac/pdf/Text.pdf work. (Almost all other items in this file's outline use /Dest. "Contents" could too, but it uses /A /GoTo for some reason.) (Other action types are things like opening a hyperlink, opening a different file, playing a sound, submitting a form, etc. Actions are also used for in-page links, not just in outlines. Many of these action types we'll likely never want to implement.)	2023-10-18 06:29:02 -04:00
Nico Weber	f646e47d46	LibPDF: Extract a create_destination_from_object() function No big behavior change. The new function now produces an error if a destination isn't in one of the supported formats.	2023-10-18 06:29:02 -04:00
Nico Weber	532230c0e4	LibPDF: Extract a Document::read_filters() method No behavior change.	2023-07-24 09:50:45 -04:00
Nico Weber	164c132928	LibPDF: Fix dumping of toplevel indirects An indirect object starts `42 0 obj`, not `obj 42 0`.	2023-07-21 10:44:50 -04:00
Nico Weber	ca433befa0	LibPDF: Add method to Document to dump a Page and all related objects ...except for the /Parent object, else we'd print all pages :)	2023-07-13 20:29:58 +02:00
Nico Weber	ea89053c12	LibPDF: Make PDF version accessible on Document	2023-07-11 13:49:17 -04:00
Nico Weber	c5c940b1c9	LibPDF: Add accessor for the document's info dict This dict contains some metadata in some files. Newer files also contain XMP metadata, but it's recommended to still include this dict as well, for compatibility with older readers. And it's much less complex than XMP, so let's support it.	2023-07-10 17:49:07 +01:00
Julian Offenhäuser	95a804bc4e	LibPDF: Allow the page rotation to be inherited	2023-03-25 16:27:30 -06:00
Julian Offenhäuser	b90a794d78	LibPDF: Allow pages with no specified contents The contents object may be omitted as per spec, which will just leave the page blank.	2023-03-25 16:27:30 -06:00
Julian Offenhäuser	fde990ead8	LibPDF: Allow optional inheritable page attributes Previously, get_inheritable_object would always try to find the object and throw an error if it couldn't. The spec tells us that some page attributes, like CropBox, are optional but also inheritable. Others, like the media box and resources, are technically required by the spec, but omitted by some documents. In both cases, we are now able to search for inheritable objects and find a suitable replacement if there wasn't one.	2023-03-25 16:27:30 -06:00
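The lookup of an optional inheritable attribute described above can be pictured as a walk up the /Parent chain. This is a simplified sketch with hypothetical types (`PageTreeNode`, `find_inheritable`) and string values standing in for PDF objects; it is not the signature of LibPDF's `get_inheritable_object`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for a page or page tree node dictionary.
struct PageTreeNode {
    std::map<std::string, std::string> attributes;
    PageTreeNode const* parent = nullptr; // /Parent; null for the root
};

// Walks from the page towards the root and returns the first value found.
// An absent attribute yields std::nullopt instead of an error, so the caller
// can decide on a fallback (for example a default media box).
std::optional<std::string> find_inheritable(PageTreeNode const& node, std::string const& key)
{
    for (PageTreeNode const* current = &node; current; current = current->parent) {
        auto it = current->attributes.find(key);
        if (it != current->attributes.end())
            return it->second;
    }
    return std::nullopt;
}
```

A caller might write `find_inheritable(page, "CropBox").value_or(media_box)`, where `media_box` is whatever replacement value the caller considers suitable.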
Andreas Kling	8a48246ed1	Everywhere: Stop using NonnullRefPtrVector This class had slightly confusing semantics and the added weirdness doesn't seem worth it just so we can say "." instead of "->" when iterating over a vector of NNRPs. This patch replaces NonnullRefPtrVector<T> with Vector<NNRP<T>>.	2023-03-06 23:46:35 +01:00
Timothy Flynn	f3db548a3d	AK+Everywhere: Rename FlyString to DeprecatedFlyString DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so let's rename it to A) match the name of DeprecatedString, B) write a new FlyString class that is tied to String.	2023-01-09 23:00:24 +00:00
Rodrigo Tobar	a5620fd41f	LibPDF: Load destinations from Catalogue -> Names -> Dests name tree PDF allows for named destinations to be provided as string. These can be either found in the Dests dictionary in the document catalogue (as already implemented), or in the Name Tree specified by the Dests key in the Names dictionary of the document catalogue (missing). This commit adds this missing case. Once the named destination is found in the name tree, its value is interpreted just like in the first case, so a new utility method encapsulates the common behavior.	2023-01-06 18:06:41 +01:00
Rodrigo Tobar	5420261347	LibPDF: Implement name tree lookups Name Trees are hierarchical, string-keyed, sorted-by-key dictionary structures in PDF where each node (except the root) specifies the bounds of the values it holds, and either its kids (more nodes) or the key/value pairs it contains. This commit implements a series of lookup calls for finding a key in such name trees. This implementation follows the tree as needed on each lookup, but if that becomes inefficient in the long run we can switch to creating a HashMap with all the contents, which as a drawback will require more memory.	2023-01-06 18:06:41 +01:00
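A lookup that follows the tree on each call, as the commit above describes, could be sketched like this. The types are hypothetical and heavily simplified (`NameTreeNode` with integer values instead of PDF objects), and a linear scan of each leaf is used for brevity even though the sorted keys would allow a binary search.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory form of a name tree node.
struct NameTreeNode {
    std::optional<std::pair<std::string, std::string>> limits; // /Limits; absent on the root
    std::vector<std::pair<std::string, int>> names;            // /Names: sorted key/value pairs
    std::vector<NameTreeNode> kids;                            // /Kids: child nodes
};

// Descends only into nodes whose limits can contain the key.
std::optional<int> find_in_name_tree(NameTreeNode const& node, std::string const& key)
{
    if (node.limits && (key < node.limits->first || key > node.limits->second))
        return std::nullopt;

    for (auto const& [name, value] : node.names) {
        if (name == key)
            return value;
    }

    for (auto const& kid : node.kids) {
        if (auto result = find_in_name_tree(kid, key))
            return result;
    }
    return std::nullopt;
}
```

The alternative mentioned in the commit, flattening the whole tree into a hash map once, would trade this per-lookup traversal for higher memory use.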
Rodrigo Tobar	f510b2b180	LibPDF: Support null destination parameters Destination arrays contain a page number, a mode name, and parameters specific to that mode. In many cases these parameters can be set to "null", which our code wasn't taking into consideration. This commit parses these parameters taking into account whether they are null or actual numbers, and stores them as Optional<float> instead of plain floats. The parameters are not yet used anywhere else other than when formatting a Destination object, so the change is fairly small.	2023-01-06 18:06:41 +01:00
Rodrigo Tobar	6df9aa8f2c	LibPDF: Store page number, not Value, in OutlineItem The Value previously stored corresponded to a Reference to a Page object in the PDF document. This isn't useful information, since what we want to display at the end of the day is the page an outline item refers to. This commit changes the page member on OutlineItem to be an Optional<u32> (some destinations don't necessarily refer to a Page), which we resolve while building OutlineItems.	2022-12-17 19:40:52 +01:00
Rodrigo Tobar	3db6af6360	LibPDF: Keep track of OutlineItem parents While OutlineItem had a parent field, it was never populated nor used. This commit populates it when possible (no parent means the OutlineItem is a top-level item).	2022-12-17 19:40:52 +01:00
Rodrigo Tobar	cb1a7cc721	LibPDF: Simplify outline construction While the Outline Items making up the document's Outline have all sorts of cross-references (parent, first/last child, next/previous sibling, etc), not all documents out there have fully-consistent references. Our implementation already discarded some of that information too (e.g., /Parent and /Prev were never read), and trusted that /First and /Next were good enough to traverse the whole hierarchy. Where the current implementation failed was in assuming that /Last was also a good source of information. There are documents out there where /Last also points to dead ends, and were therefore causing a crash when we verified that the last child found on a chain was the /Last child declared by the parent. To fix this I'm simply removing the check, and simplifying the function call to remove any references to /Last. This way we affirm our commitment to /First and /Next as the main sources of information.	2022-12-16 01:24:43 -07:00
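Traversing an outline using only /First and /Next, as the commit above settles on, can be sketched as below. Everything here is hypothetical and simplified: objects are looked up by integer id in a `std::map`, `RawOutlineItem` and `OutlineNode` are made-up types, and the `visited` set is an extra safeguard against cyclic references that the commit message does not mention.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical raw outline dictionary: only the entries the traversal trusts.
struct RawOutlineItem {
    std::string title;
    std::optional<int> first; // /First: first child
    std::optional<int> next;  // /Next: next sibling
};

struct OutlineNode {
    std::string title;
    std::vector<OutlineNode> children;
};

// Follows /First once and then /Next repeatedly; /Last, /Prev and /Parent are ignored.
std::vector<OutlineNode> build_children(std::map<int, RawOutlineItem> const& objects,
    std::optional<int> first, std::set<int>& visited)
{
    std::vector<OutlineNode> result;
    for (std::optional<int> current = first; current;) {
        auto it = objects.find(*current);
        // Stop on dangling references and on items that were already seen.
        if (it == objects.end() || !visited.insert(*current).second)
            break;
        RawOutlineItem const& raw = it->second;
        result.push_back({ raw.title, build_children(objects, raw.first, visited) });
        current = raw.next;
    }
    return result;
}
```

Because the sibling chain simply ends when /Next is missing or invalid, an inconsistent /Last entry can no longer cause a failure in this scheme.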
Linus Groh	57dc179b1f	Everywhere: Rename to_{string => deprecated_string}() where applicable This will make it easier to support both string types at the same time while we convert code, and tracking down remaining uses. One big exception is Value::to_string() in LibJS, where the name is dictated by the ToString AO.	2022-12-06 08:54:33 +01:00
Linus Groh	6e19ab2bbc	AK+Everywhere: Rename String to DeprecatedString We have a new, improved string type coming up in AK (OOM aware, no null state), and while it's going to use UTF-8, the name UTF8String is a mouthful - so let's free up the String name by renaming the existing class. Making the old one have an annoying name will hopefully also help with quick adoption :^)	2022-12-06 08:54:33 +01:00
Julian Offenhäuser	36f83cecab	LibPDF: Allow page objects to inherit the MediaBox and Resources entries	2022-10-16 17:44:54 +02:00
Julian Offenhäuser	4887aacec7	LibPDF: Move document-specific parsing functionality into its own class The Parser class is now a generic PDF object parser, of which the new DocumentParser class derives. DocumentParser now takes over all functions relating to linearization, pages, xref and trailer handling. This allows the use of multiple parsers in the same document's context, which will be needed in order to handle PDF object streams.	2022-09-17 10:07:14 +01:00
sin-ack	3f3f45580a	Everywhere: Add sv suffix to strings relying on StringView(char const) Each of these strings would previously rely on StringView's char const constructor overload, which would call __builtin_strlen on the string. Since we now have operator ""sv, we can replace these with much simpler versions. This opens the door to being able to remove StringView(char const*). No functional changes.	2022-07-12 23:11:35 +02:00
Matthew Olsson	5b316462b2	LibPDF: Add implementation of the Standard security handler Security handlers manage encryption and decryption of PDF files. The standard security handler uses RC4/MD5 to perform its crypto (AES as well, but that is not yet implemented).	2022-03-29 02:52:57 +02:00
Matthew Olsson	e9342183f0	LibPDF: Support all Dest types	2022-03-07 10:53:57 +01:00
Matthew Olsson	73cf8205b4	LibPDF: Propagate errors in Parser and Document	2022-03-07 10:53:57 +01:00
Andreas Kling	80d4e830a0	Everywhere: Pass AK::ReadonlyBytes by value	2021-11-11 01:27:46 +01:00
Ben Wiederhake	f84a7e2e22	LibPDF: Replace Value class by AK::Variant This decreases the memory consumption by LibPDF by 4 bytes per Value, compensating exactly for the increase in an earlier commit. :^)	2021-09-20 17:39:36 +04:30
Brian Gianforcaro	507effce5b	LibPDF: Use move to avoid unnecessary ref/unref of network device RefPtr Flagged by pvs-studio as a potential perf optimization.	2021-09-16 17:17:13 +02:00
Daniel Bertalan	d7b6cc6421	Everywhere: Prevent risky implicit casts of (Nonnull)RefPtr Our existing implementation did not check the element type of the other pointer in the constructors and move assignment operators. This meant that some operations that would require explicit casting on raw pointers were done implicitly, such as: - downcasting a base class to a derived class (e.g. `Kernel::Inode` => `Kernel::ProcFSDirectoryInode` in Kernel/ProcFS.cpp), - casting to an unrelated type (e.g. `Promise<bool>` => `Promise<Empty>` in LibIMAP/Client.cpp) This, of course, allows gross violations of the type system, and makes the need to type-check less obvious before downcasting. Luckily, while adding the `static_ptr_cast`s, only two truly incorrect usages were found; in the other instances, our casts just needed to be made explicit.	2021-09-03 23:20:23 +02:00
Matthew Olsson	612b183703	LibPDF: Convert to east-const to comply with the recent style changes	2021-06-12 22:45:01 +04:30
Matthew Olsson	7b4e36bf88	LibPDF: Split ColorSpace into a different class for each color space While unnecessary at the moment, this will allow for more fine-grained control when complex color spaces get added.	2021-06-12 22:45:01 +04:30
Matthew Olsson	78bc9d1539	LibPDF: Refine the distinction between the Document and Parser The Parser should hold information relevant for parsing, whereas the Document should hold information relevant for displaying pages. With this in mind, there is no reason for the Document to hold the xref table and trailer. These objects have been moved to the Parser, which allows the Parser to expose less public methods (which will be even more evident once linearized PDFs are supported).	2021-06-12 22:45:01 +04:30
Matthew Olsson	1ef5071d1b	LibPDF: Harden the document/parser against errors	2021-06-12 22:45:01 +04:30
Matthew Olsson	78f3bad7e6	LibPDF: Pre-initialize common FlyStrings in CommonNames.h	2021-05-25 00:24:09 +04:30
Matthew Olsson	a08922d2f6	LibPDF: Parse outline structures	2021-05-25 00:24:09 +04:30
Matthew Olsson	be6e4b6f3c	LibPDF: Store indirect value refs in Value objects IndirectValueRef is so simple that it can be stored directly in the Value class instead of being heap allocated. As the comment in Value says, however, in theory the max bits needed to store is 48 (16 for the generation index and 32(?) for the object index), but 32 should be good enough for now. We can increase it to u64 later if necessary.	2021-05-25 00:24:09 +04:30
Matthew Olsson	d5f94aaa7b	LibPDF/PDFViewer: Support rotated pages	2021-05-18 16:35:23 +02:00

57 commits