serenity

mirror of https://github.com/RGBCube/serenity synced 2025-07-10 03:47:34 +00:00

Author	SHA1	Message	Date
Nico Weber	3fe9f8e48d	LibPDF: Don't accidentally form new tokens on pages with contents arrays A page's /Contents can be an array of streams, and the page's contents are then as if those streams are concatenated. Most of the time, a stream ends with whitespace. But in some cases (e.g. 0000642.pdf from 0000.zip from the pdfa dataset), the first stream ends with an operator (`Q`) and the next stream starts with one (`q`), and the concatenation would form a new, unknown operator (`Qq`). Separate the streams' contents with a space to prevent that. Reduces the number of PDF files we fail to open in the -n 500 case from 11 to 10 (in either case, we then crash on 18 of the PDFs that we do manage to open).	2023-10-23 13:23:54 -04:00
Nico Weber	11bee7a075	LibPDF: Don't crash on fixed-width type 1 fonts that use /MissingWidth Type 1 fonts usually have a m_font_program and no m_font -- they only have m_font if we're using a replacement font for the fonts that were built-in to PDFs before Acrobat 4.0 (and must still work to show existing files). However, SimpleFont::get_glyph_width() used to always return a float, which in Type1Font was only implemented if m_font was set. Per spec, we're supposed to just use /MissingWidth for fonts that are missing an entry in the descriptor's /Width array. However, for built-in fonts, no explicit /Width array is needed (PDF 1.7 spec, Appendix H.3, 5.5.1). So if we just always use /MissingWidth, then PDFs that use a built-in font draw all their text on top of each other (e.g. 000333.pdf from stillhq.com-pdfdb). So change get_glyph_width() to return Optional<float>, return it only in Type1Font if m_font is set, and use MissingWidth if it isn't set. That way, replacement fonts still return a width, and real fonts that are supposed to have /Width and use /MissingWidth for missing entries do what they're supposed to too, instead of crashing. From 20 (6%) to 16 (5%) crashes on the 300 first PDFs, and from 39 (7.8%) to 31 (6.2%) on the 500-random PDFs test.	2023-10-23 09:33:03 -04:00
Nico Weber	52afa936c4	LibPDF: Don't over-read in charset formats 1 and 2 `left` might be a number bigger than there are actually glyphs in the CFF. The spec says "The number of ranges is not explicitly specified in the font. Instead, software utilizing this data simply processes ranges until all glyphs in the font are covered." Apparently we have to check for this within each range as well. Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the pdfa dataset. Together with the previous commit: From 21 (7%) to 20 (6%) crashes on the 300 first PDFs, and from 41 (8.2%) to 39 (7.8%) on the 500-random PDFs test.	2023-10-23 09:31:11 -04:00
Nico Weber	58ff7b5336	LibPDF: Support offset size 3 in CFF index reading ...and replace template instantiations with a loop, to make this easily possible. Vaguely nice for code size as well. Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the pdfa dataset.	2023-10-23 09:31:11 -04:00
Nico Weber	3197f0cab6	LibPDF: Handle CFF fonts with charset format 0 and > 255 glyphs better We used to use a u8 as loop counter, which would overflow if there were more than 255 glyphs, producing hundreds of megabytes of "Couldn't find string for SID x, going with space" output in the process, while all data until the end of the CFF section got interpreted as SIDs, until a try_read() would finally fail. We now no longer fail miserably trying to render page 2 of 0000352.pdf of 0000.zip from the pdfa dataset. Fixes just one crash of the larger 500-document test set, but when I tweak test_pdf.py to print all stacks instead of just the top 5, it no longer produces 260 MB of output.	2023-10-23 09:31:11 -04:00
Nico Weber	0869ca5615	LibPDF: Add more CFF_DEBUG output	2023-10-23 09:31:11 -04:00
Nico Weber	cf705eb235	LibPDF: Use TRY() to get decompression result Makes us die with a better error message for some PDFs.	2023-10-23 09:30:41 -04:00
Nico Weber	6153dd7b84	LibPDF: Tolerate comments after dict values Makes 0000607.pdf from 0000.zip from the pdfa dataset load.	2023-10-23 09:28:00 -04:00
Nico Weber	a1f17bd643	LibPDF: Skip inline image data in operator stream Inline images can contain arbitrary binary data in the operator stream, greatly confusing the operator parser. Just skip them for now. They'll produce a `Rendering of feature not supported: draw operation: inline_image_begin` diag as usual, so we won't forget about it. After #21536, reduces number of crashes on 300 random PDFs from the web (the first 300 from 0000.zip from https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/) from 23 (7%) to 22 (7%). On a larger sample (`Meta/test_pdf.py -n 500 ~/Downloads/0000`), reduces number of crashes from 53 (10.6%) with 36 distinct crash stacks to 46 (9.2%) with 33 distinct stacks.	2023-10-23 07:51:08 +02:00
Nico Weber	1a58fee0fd	LibPDF: Don't assert on named simple color space If a PDF uses `/CustomName cs` and `/CustomName` then points at just a name like `/DeviceGray` instead of an array, that's ok. Just using `/DeviceGray cs` is simpler, so this extra level of indirection is somewhat rare in practice, but it's valid and it does happen. So support it. We already have a helper that does the right thing that we just need to call. Together with #21524 and #21525, reduces number of crashes on 300 random PDFs from the web (the first 300 from 0000.zip from https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/) from 29 (9%) to 25 (8%).	2023-10-21 21:04:26 +02:00
Nico Weber	04aec4a032	LibPDF: Don't log CFF Copyright tag as unknown	2023-10-21 21:04:02 +02:00
Nico Weber	8922574133	LibPDF: Fix assertion when destination page is an index This isn't correct per spec, but it happens in practice, e.g. 0000847.pdf, 0000327.pdf, 0000124.pdf from 0000.zip from https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/	2023-10-21 09:10:30 +02:00
Nico Weber	fbd00d9c8e	LibPDF: Use resolve_to on /Dests entry Fixes an assertion if /Dests is an indirect object (`24 0 R`) instead of an inline dictionary.	2023-10-21 09:10:30 +02:00
Nico Weber	8c3478a921	LibPDF: Use resolve_to() helper No behavior change.	2023-10-21 09:10:30 +02:00
Nico Weber	801cfd5ae3	LibPDF: Let parser process filters by default This fixes a small bug from `39b2eed3f6`: That commit tried to disable filters for the very first object read, for the case covered in Tests/LibPDF/password-is-sup.pdf. However, it accidentally also disabled filters by default. Most of the time, this isn't really a difference: We call `set_filters_enabled(true);` very early in `DocumentParser::initialize_linearization_dict()`, which explicitly enables filters, and `initialize_linearization_dict()` is the very first thing called in `DocumentParser::initialize()`. But there's an early exit in `initialize_linearization_dict()` for if there's nothing looking like an indirect object right after the header, and in this case we used to not enable filtering, and would hand compressed streams to the operand parser. (And due to a 2nd bug, we'd even do this if the header line was followed by an empty line.)	2023-10-21 09:09:53 +02:00
Nico Weber	cf26fc2393	LibPDF: Make parser skip whitespace after header 0000990.pdf from 0000.zip from https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/ starts like so: ``` %PDF-1.7 4 0 obj ``` parse_header() used to put the cursor at the start of the 2nd, empty, line. initialize_linearization_dict() would then check if `m_reader.matches_number()` to see if there could possibly be a linearization dict. In this case, there isn't one, but we should detect linearization dicts even if they're separated by whitespace from the first line.	2023-10-21 09:09:53 +02:00
Nico Weber	34cb506bad	LibPDF: Replace another TODO with a message Like `ca1a98ba9f`, but for stroke color.	2023-10-21 09:09:06 +02:00
Nico Weber	9442782881	LibPDF: Implement text_next_line_show_string_set_spacing Not used terribly often, but e.g. used in 000333.pdf page 17 in stillhq.com-pdfdb.	2023-10-20 14:24:31 -04:00
Nico Weber	78dea9500f	LibPDF: Make operator parsing use ReadonlySpan instead of Vector No behavior change.	2023-10-20 14:24:31 -04:00
Nico Weber	e0268dcc87	LibPDF: Allow /Pattern to be used directly as a color space name Per spec: "If the color space is one that can be specified by a name and no additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, and certain cases of Pattern), the name may be specified directly." We still don't implement /Pattern color spaces, but now we no longer crash trying to look up the potentially-nonexistent /ColorSpace dictionary on the page object when /Pattern is used directly as color space name. On top of #21514, reduces number of crashes on 300 random PDFs from the web (the first 300 from 0000.zip from https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/) from 42 (14%) to 34 (11%).	2023-10-20 10:35:54 -06:00
Nico Weber	aea0e2f313	LibPDF: Rename ColorSpaceFamily function to may_be_specified_directly() It used to be called ColorSpaceFamily::never_needs_parameters(). But in the cpp file, the macro arg was called ever_needs_parameters, and the spec says "If the color space is one that can be specified by a name and no additional parameters (DeviceGray, DeviceRGB, DeviceCMYK, and certain cases of Pattern), the name may be specified directly." so let's use that language here. No behavior change.	2023-10-20 10:35:54 -06:00
Nico Weber	095a2a17ed	LibPDF: Replace TODO()s in Type0Font code with Errors ...which causes us to not render these fonts instead of crashing. Reduces number of crashes on 300 random PDFs from the web (the first 300 from 0000.zip from https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/) from 64 (21%) to 42 (14%).	2023-10-20 10:33:59 -06:00
Nico Weber	33443f7991	LibPDF: Implement ICCBasedColorSpace::number_of_components() We now no longer crash on images that use an ICC-based color space. Reduces number of crashes on 300 random PDFs from the web (the first 300 from 0000.zip from https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/) from 81 (27%) to 64 (21%). Also fixes all remaining crashes in 411_getting_started_with_instruments.pdf and 513_high_efficiency_image_file_format.pdf.	2023-10-20 08:58:52 +02:00
Nico Weber	f5d3f47af3	LibPDF: Add spec comment about color spaces on images	2023-10-20 08:58:52 +02:00
Nico Weber	7c24a89acf	LibPDF: Add spec comment about valid bits_per_component values	2023-10-20 08:58:52 +02:00
Nico Weber	64bb9aa8c7	LibPDF: Fix comment typo	2023-10-20 08:58:52 +02:00
Nico Weber	ea6fed627a	LibPDF: Get color rendering intent from image dict Still not used for anything, so no behavior change.	2023-10-20 08:58:52 +02:00
Nico Weber	ebba24b848	LibPDF: Fix lookup of built-in Bold Italic strings Liberation*-BoldItalic.ttf apparently self-identifies as "Bold Italic", not "BoldItalic".	2023-10-19 16:52:49 -04:00
Nico Weber	708d5e2fe6	LibPDF: Implement color_rendering_intent operator Implements the `ri` operator, and the `RI` key in a graphics state dictionary. We don't do anything yet with the color rendering intent except store it. No behavior change except removing a few "not yet implemented" messages.	2023-10-19 16:51:16 -04:00
Nico Weber	609e640530	LibPDF: Try harder to use a RAII object to restore state Follow-up to #21489. There, I made us use a RAII object. That's great, but if the embedded instruction stream pushes its own graphics state, then an early return would cause us to not process graphics state pop instructions in the embedded stream. To fix this, remember the graphics stack depth before entering the nested instruction stream, and explicitly shrink the stack back to that size upon exit. Enables us to render all pages of https://devstreaming-cdn.apple.com/videos/wwdc/2017/821kjtggolzxsv/821/821_get_started_with_display_p3.pdf without crashing.	2023-10-19 16:49:00 -04:00
Nico Weber	b835d2bd66	LibPDF: Use a RAII object to restore state in recursive render Previously, if one operator returned an error, the TRY() would cause us to return without restoring the outer graphics state, leading to problems such as handing a 3-tuple to a grayscale color space (because the inner object set up a grayscale color space that we failed to dispose of). Makes us crash later on page 43 of https://devstreaming-cdn.apple.com/videos/wwdc/2017/821kjtggolzxsv/821/821_get_started_with_display_p3.pdf	2023-10-18 19:43:31 -04:00
Nico Weber	3c2d820391	LibPDF: If softmask has different size than target bitmap, resize it Size of smask and image aren't guaranteed to be equal by the spec (...except for /Matte, see page 555 of the PDF 1.7 spec, but we don't implement that), and in practice they sometimes aren't. Fixes an assert on page 4 of https://devstreaming-cdn.apple.com/videos/wwdc/2017/821kjtggolzxsv/821/821_get_started_with_display_p3.pdf We now make it all the way to page 43 of 64 before crashing.	2023-10-18 20:03:35 +01:00
Nico Weber	3907374621	LibPDF: Implement support for callgsubr in CFF font programs Font programs are bytecode programs defining glyphs. If several glyphs share a piece of outline, that opcode sequence can be put in a subroutine ("subr") table and the definition of those glyphs can then call that subroutine by number, to reduce file size. CFF fonts can in theory contain multiple fonts, and so there's a global subr table shared by all the fonts in one CFF, and a local per-font subr table. We used to only implement the local subr table, now we implement both. (We only support one font per CFF, and at least in PDF files, that's all that's ever used. So a global subr table isn't very useful. But the spec explicitly allows it -- "Global subroutines may be used in a FontSet even if it only contains one font." -- and it happens in practice.)	2023-10-18 10:50:32 -04:00
Nico Weber	185573c03f	LibPDF: Implement subr_number biasing for CFF font programs	2023-10-18 10:50:32 -04:00
Nico Weber	4dc4de052a	LibPDF: Implement opcode 28 for CFF font programs	2023-10-18 10:50:32 -04:00
Nico Weber	44efff81b9	LibPDF: Remove a dbgln() call in CFF subrs decoding This code is a lot more reliable now than it used to be, and this dbgln() is quite noisy for some files. So let's remove it.	2023-10-18 10:43:51 -04:00
Nico Weber	02d2d12592	LibPDF: Allow moving Reader::move_to() to end of data stream CFF::parse_index_data() calls move_to() to put the reader's current position behind the index data. In several PDFs, the PrivDictOperator::Subrs case in CFF::create() sets up a span that contains exactly the Subrs data and nothing after it, so that final move_to() call in parse_index_data() would cause an assert. This is similar to `fe3612ebcb`, where the caller was also in CFF. So maybe CFF just has a different view of what valid values to pass to Reader are, compared to the rest of the code? But having an iterator point to one past the valid data in a container is common, so maybe this is the Right Fix after all. Fixes a crash opening 411_getting_started_with_instruments.pdf (and a whole bunch of other WWDC slides). Rendering is pretty glitchy and we still crash on page 14, but at least we can open the file now. The file is currently available at: https://devstreaming-cdn.apple.com/videos/wwdc/2019/411cbc60y12x68arcof/411/411_getting_started_with_instruments.pdf	2023-10-18 06:32:23 -04:00
Nico Weber	182639217f	LibPDF: Implement GoTo action for outline Outline items can contain either a /Dest key or an /A key. The /Dest key points to a "Destination" (various ways to reference a page in the same document). The /A key points to an "Action" which can have several types. One type, the /GoTo type, just also points to a Destination. Implement GoTo actions. This makes clicking "Contents" in the outline of https://developer.apple.com/library/archive/documentation/mac/pdf/Text.pdf work. (Almost all other items in this file's outline use /Dest. "Contents" could too, but it uses /A /GoTo for some reason.) (Other action types are things like opening a hyperlink, opening a different file, playing a sound, submitting a form, etc. Actions are also used for in-page links, not just in outlines. Many of these action types we'll likely never want to implement.)	2023-10-18 06:29:02 -04:00
Nico Weber	d9c9510d3c	LibPDF: Rename x-macro argument name I'd like to add a string called `A`, so the argument can't be called `A` as well. No behavior change.	2023-10-18 06:29:02 -04:00
Nico Weber	f646e47d46	LibPDF: Extract a create_destination_from_object() function No big behavior change. The new function now produces an error if a destination isn't in one of the supported formats.	2023-10-18 06:29:02 -04:00
Nico Weber	46fd6fdfa3	LibPDF: Read Global subr data in CFF reader This was the last piece of data we didn't read yet. (We also don't yet support multiple fonts per CFF, but I haven't found a PDF using that yet.) We still don't do anything with it, but now we at least print a warning if this data is there and we ignore it.	2023-10-18 11:02:10 +02:00
Nico Weber	3be5719987	LibPDF: Rename `subroutines` to `local_subroutines` in CFF code	2023-10-18 11:02:10 +02:00
Nico Weber	9a0b559932	LibPDF: Tweak formatting of built-in CFF tables This makes the code look more like the pages in the spec. No behavior change, whitespace change only.	2023-10-18 11:00:17 +02:00
Nico Weber	f0e7fb7038	LibPDF: Make Subrs optional in PS1FontProgram https://adobe-type-tools.github.io/font-tech-notes/pdfs/T1_SPEC.pdf : "Using charstring subroutines is not a requirement of a Type 1 font program." And some versions of Computer Modern do in fact not contain a Subrs array. Together with #21473, makes Problemset.pdf from the pdffiles repo render ok instead of crashing.	2023-10-18 11:00:02 +02:00
Nico Weber	cb961101c7	LibPDF: Implement CFF built-in Standard and Expert encodings With this, all tables from the spec appendixes are in CFF.cpp. This fixes a crash reading page 2 (and onward) of 2ThestructureoftheCIE1997ColourAppearanceModelCIECAM97s.pdf in the pdffiles repo.	2023-10-17 10:21:38 +02:00
Nico Weber	eeada4678c	LibPDF: Postpone CFF encoding processing after Top DICT has been read The encoding offset defaults to 0, i.e. the Standard Encoding. That means reading the encoding only if the tag is present causes us to not read it if a font uses the Standard Encoding. Now, we always read an encoding, even if it's the (implicit) default one.	2023-10-17 10:21:38 +02:00
Nico Weber	1cfe639b6c	LibPDF: Implement CFF supplemental encoding The main encoding data maps glyph ID ("GID") to its codepoint. If a glyph has several codepoints, then a secondary table mapping codepoint to string ID ("SID") of the glyph's name is present. (A separate table associates each glyph with its name already.) I haven't seen this used in the wild, but the structure of the supplemental data is also going to be needed for built-in encodings.	2023-10-17 10:21:38 +02:00
Nico Weber	37daeae6fd	LibPDF: Add spec comments, dbgln_if()s to CFF's parse_encoding()	2023-10-17 10:21:38 +02:00
Nico Weber	007d7cdd53	LibPDF: Fix sign (and fixed point) in glyph decoding opcode 24 Two bugs: 1. We decoded a u32, not an i32 as the spec wants 2. (minor) Our fixed-point divisor was off by one Fixes text rendering in Bakke2010a.pdf in pdffiles, and rendering of other fonts with negative width adjustments from opcode 255. That PDF was produced by "Apple pstopdf" and uses font SFBX1200, which is apparently a variant of Computer Modern. So maybe this helps with lots of PDFs produced from TeX files, but I haven't checked that.	2023-10-16 08:33:35 +02:00
Nico Weber	96a4936567	LibPDF: Check for built-in CFF encodings Only prints a warning for them for now. Also warn on the not-yet-implemented encoding supplement.	2023-10-16 08:32:18 +02:00
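
The token-merging problem described in 3fe9f8e48d can be illustrated with a minimal Python sketch. This is not LibPDF's code: `join_content_streams()` and `tokenize()` are hypothetical helpers made up for this example, the two stream bodies are invented, and the tokenizer only splits on ASCII whitespace, which is far simpler than a real PDF operator parser.

```python
def join_content_streams(streams: list[bytes], separator: bytes = b" ") -> bytes:
    """Concatenate the streams of a /Contents array (hypothetical helper).

    Joining with a single space keeps the last token of one stream from
    fusing with the first token of the next one.
    """
    return separator.join(streams)


def tokenize(data: bytes) -> list[bytes]:
    """Toy tokenizer: split on ASCII whitespace only (an assumption)."""
    return data.split()


# Invented stream bodies: the first one ends with `Q` and has no
# trailing whitespace, the second one starts with `q`.
first = b"q 1 0 0 1 0 0 cm Q"
second = b"q 0 0 10 10 re f Q"

# Plain concatenation fuses `Q` and `q` into the unknown operator `Qq`.
naive = tokenize(join_content_streams([first, second], separator=b""))

# A space separator keeps the two operators apart.
fixed = tokenize(join_content_streams([first, second]))
```

With these inputs, `naive` contains the fused token `Qq`, while `fixed` is the token list of the first stream followed by the token list of the second.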

598 commits