mirror of https://github.com/RGBCube/serenity synced 2025-07-15 08:07:38 +00:00
583 commits

Author SHA1 Message Date
Nico Weber
32f601f9a4 LibPDF: Fix small bug from #21452
I implemented CFF charset format 2 in 6f783929dd with the note
"I haven't seen this being used in the wild". Now that I have
seen it (0000658.pdf), I can say that this has never worked,
despite me claiming "it's easy to implement".

But now it works!
2024-02-08 13:48:56 +00:00
Nico Weber
9fc47345ce LibGfx+LibPDF: Make sample() functions take ReadonlySpan<>
...instead of Vector<>.

No behavior (or performance) change.
2024-02-06 08:44:53 +01:00
Nico Weber
92a628c07c LibPDF: Always treat /Subtype /Image as binary data when dumping
Sometimes, the "is mostly text" heuristic fails for images.

Before:

    Build/lagom/bin/pdf --render out.png ~/Downloads/0000/0000521.pdf \
        --page 10 --dump-contents 2>&1 | wc -l
       25709

After:

    Build/lagom/bin/pdf --render out.png ~/Downloads/0000/0000521.pdf \
         --page 10 --dump-contents 2>&1 | wc -l
       11376
2024-02-05 21:18:19 -05:00
Nico Weber
f562c470e2 LibGfx+LibPDF: Simpler and faster N-D linear sampling
Previously, if we wanted to e.g. do linear interpolation in 2-D,
we'd get a sample point like (1.3, 4.4), then get 4 samples around
it at (1, 4), (2, 4), (1, 5), (2, 5), then reduce the 4 samples
to 2 samples by computing the combined samples
`0.7 * f(1, 4) + 0.3 * f(2, 4)` and `0.7 * f(1, 5) + 0.3 * f(2, 5)`,
and then 1-D linearly blending between these two samples with the
factor 0.4. In the end we'd multiply the first value by 0.7 * 0.6,
the second by 0.3 * 0.6, the third by 0.7 * 0.4, and the fourth by
0.3 * 0.4, and then sum them all up.

This requires computing and storing 2**N samples, followed by
another 2**N iterations to combine the 2**N samples to a single value.
(N is in practice either 4 or 3, so 2**N isn't super huge.)

Instead, for every sample we can directly compute the product of
its weights and add the weighted sample to a running sum. This lets
us omit the second loop and the storage for 2**N values, in exchange
for O(N) additional work per sample to compute the product.
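
A minimal Python sketch of the direct-accumulation idea, not the LibGfx/LibPDF
code itself. `sample_linear` and its arguments are hypothetical names; `f` is
assumed to be any function defined on integer grid points.

```python
from itertools import product
from math import floor


def sample_linear(f, point):
    """Multilinear interpolation of f at `point`, accumulating each corner directly."""
    base = [floor(c) for c in point]
    frac = [c - b for c, b in zip(point, base)]
    total = 0.0
    # Visit all 2**N corners of the grid cell that contains the point.
    for corner in product((0, 1), repeat=len(point)):
        weight = 1.0
        # O(N) work per corner: t for the upper neighbour, 1 - t for the lower one.
        for bit, t in zip(corner, frac):
            weight *= t if bit else 1.0 - t
        total += weight * f(*(b + bit for b, bit in zip(base, corner)))
    return total
```

No intermediate list of 2**N samples is kept; each corner contributes to the
running sum as soon as its weight is known.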

Takes

    Build/lagom/bin/image --no-output --invert-cmyk \
        --assign-color-profile \
            Build/lagom/Root/res/icc/Adobe/CMYK/USWebCoatedSWOP.icc \
        --convert-to-color-profile serenity-sRGB.icc \
        cmyk.jpg

from 3.42s to 3.08s on my machine, almost 10% faster (and less code).

Here cmyk.jpg is a 2253x3080 cmyk jpeg, and USWebCoatedSWOP.icc is an
mft2 profile with input tables with 256 samples and a 9x9x9x9 CLUT.

The LibPDF change is covered by TEST_CASE(sampled) in LibPDF.cpp,
and the LibGfx change is basically the same change as the one in
LibPDF (where the test results don't change) and the output
subjectively looks identical. So hopefully this causes indeed no
behavior change :^)
2024-02-04 21:49:23 +01:00
Nico Weber
955d73657e LibPDF: Make pdf --dump-contents dump less binary data
For pages containing images or embedded fonts, --dump-contents
used to dump a ton of binary data. That isn't very useful, so
stop doing it.

Before:

    % time Build/lagom/bin/pdf --render out.png \
        ~/Downloads/0000/0000711.pdf --dump-contents | wc -l
      937972

Now:

    % time Build/lagom/bin/pdf --render out.png \
        ~/Downloads/0000/0000711.pdf --dump-contents | wc -l
        6566

Printing 7k lines is also much faster than printing 940k,
0.15s instead of 2s.
2024-02-03 08:26:29 +00:00
Nico Weber
9c762b9650 LibPDF+Meta: Use a CMYK ICC profile to convert CMYK to RGB
CMYK data describes which inks a printer should use to print a color.
If a screen should display a color that's supposed to look similar
to what the printer produces, it results in a color very different
to what Color::from_cmyk() produces. (It's also printer-dependent.)

There are many ICC profiles describing printing processes. It doesn't
matter too much which one we use -- most of them look somewhat
similar, and they all look dramatically better than Color::from_cmyk().

This patch adds a function to download a zip file that Adobe offers
on their web site. They even have a page for redistribution:
https://www.adobe.com/support/downloads/iccprofiles/icc_eula_win_dist.html

(That one leads to a broken download though, so this downloads the
end-user version.)

In case we have to move off this download at some point, there are also
a whole bunch of profiles at https://www.color.org/registry/index.xalter
that "may be used, embedded, exchanged, and shared without restriction".

The Adobe zip contains a whole bunch of other useful and fun profiles,
so I went with it.

For now, this only unzips the USWebCoatedSWOP.icc file though, and
installs it in ${CMAKE_BINARY_DIR}/Root/res/icc/Adobe/CMYK/. In
Serenity builds, this will make it to /res/icc/Adobe/CMYK in the
disk image. And in lagom builds, after #23016 this is the
lagom res staging directory that tools can install via
Core::ResourceImplementation. `pdf` and `MacPDF` already do that,
`TestPDF` now does it too.

The final piece is that LibPDF then loads the profile from there
and uses it for DeviceCMYK color conversions.

(Doing file access from the bowels of a library is a bit weird,
especially in a system that has sandboxing built in. But LibGfx does
that in FontDatabase too already, and LibPDF uses that, so it's not a
new problem.)
2024-02-01 13:42:04 -07:00
Nico Weber
f840fb6b4e LibPDF: Make DeviceCMYKColorSpace::the() fallible
No behavior change.
2024-02-01 13:42:04 -07:00
Nico Weber
384c6cf0f9 LibPDF: Tweak vertical position of truetype fonts again
See #22821 for a previous attempt. This attempt should settle
things once and for all.

The opentype render path adjusts by `-font_ascender * -y_scale` in
Glyf::Glyph::append_simple_path(), so that's what we need to undo
to draw at the font's baseline.

(OpenType::Font::metrics() returns ascender scaled by y_scale already,
so no need to have the scale here where we undo the shift.)

Previously, we called `baseline()` which just returns the font's
font size, which is pretty meaningless:

https://tonsky.me/blog/font-size/
https://simoncozens.github.io/fonts-and-layout/opentype.html#vertical-metrics-hhea-and-os2

Also, conceptually it makes sense to translate up by the ascender
to get from the upper edge of the glyph to the baseline.
2024-02-01 10:05:40 +01:00
Nico Weber
87112dcbdc LibPDF: Return null for invalid refs, tolerate null objects as outline
https://llvm.org/devmtg/2022-11/slides/TechTalk5-WhatDoesItTakeToRunLLVMBuildbots.pdf
has an xref table that starts like so:

```
xref
0 214
0000000002 65535 f
0000924663 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
0000000160 00000 n
0000000263 00000 n
```

This is a list of objects in the PDF file. The lines ending with 'f'
mean that this object is "free", that is it's not stored in the file.
In this file, objects 0, 2, 3 are free. For free objects, the first
number is the offset of the next free object: Object 0 refers to object
2, 2 to 3, and 3 back to 0 (since it's the last free object).
The lines ending with "n" are actual objects; here the first number is
a byte offset to where that object is stored in the file.

Furthermore, the file contains

```
/Outlines
2
0
R
```

in its root object, meaning that object 2 stores the page outlines.

Since object 2 is set as free, there is no object 2. But the spec
says that an invalid object reference is just the null object.

This patch makes us return null objects for references to free
objects, and it also makes us treat a null object as /Outlines value
the same as not having /Outlines in the first place.
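
A rough Python sketch of that lookup rule, assuming a classic (non-stream)
xref section. `parse_xref_section` and `resolve_offset` are hypothetical
helpers, not LibPDF functions, and `None` stands in for the null object.

```python
def parse_xref_section(text):
    """Parse one classic xref section into {object_number: (first_number, generation, kind)}."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines[0] != "xref":
        raise ValueError("expected 'xref' keyword")
    start, count = (int(n) for n in lines[1].split())
    entries = {}
    # Tolerates excerpts that list fewer entries than the declared count.
    for index, line in enumerate(lines[2:2 + count]):
        first, generation, kind = line.split()
        entries[start + index] = (int(first), int(generation), kind)
    return entries


def resolve_offset(entries, object_number):
    """Return the byte offset of an in-use object, or None for free or unlisted objects."""
    entry = entries.get(object_number)
    if entry is None or entry[2] == "f":
        return None
    return entry[0]
```

With the table excerpt above, object 1 resolves to offset 924663 and object 2
resolves to `None`.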

Fixes #23023 -- we can now open that file. (We don't render it super
well, but only for already-known reasons.)

Since I found it a bit confusing: XRefTable has two related methods
here:

1. has_object() returns if an object was explicitly listed in an
   xref table. The first number right after `xref` is the start
   index. So if an xref table were to start with `10`, we'd implicitly
   create 10 trailing objects for which has_object() would return false
2. is_object_in_use() returns true if an object that was in a table
   (i.e. one where has_object() returns true) was listed with 'n' and
   false if it was listed with 'f'.

DocumentParser::parse_object_with_index() should probably return a null
object for the `!has_object()` case as well instead of VERIFY()ing
that has_object() is true. But I haven't seen this in the wild yet,
so keeping as-is for now.
2024-01-31 12:10:19 -05:00
Timothy Flynn
aa0a6d58b2 Userland: Remove LibCore dependency from libraries that do not use it 2024-01-22 08:48:34 -05:00
Nico Weber
a0462f495c LibPDF+MacPDF: Clip text, and add a debug option for disabling it 2024-01-20 08:56:03 +01:00
Nico Weber
90fdf738a1 LibPDF: Alphabetize clip_ fields in RenderingPreferences
No behavior change.
2024-01-20 08:56:03 +01:00
Nico Weber
66f8259a0b LibPDF: Move ClipRAII to .h file
No behavior change.
2024-01-20 08:56:03 +01:00
Tim Ledbetter
459fa8b840 LibPDF: Ensure that xref subsection numbers are u32
Previously, parsing an xref entry with a floating point subsection
number would cause a crash.
2024-01-18 15:11:42 +01:00
Nico Weber
d2f3288666 LibPDF: Apply text matrix to each glyph's position
We still don't apply it to the glyph itself, so they don't show up
scaled or rotated, but they're at the right spot now.

One big thing this here has going for it is that the final glyph
position is now calculated with just
`ext_rendering_matrix.map(glyph_position)`.

Also, character_spacing and word_spacing are now used unmodified
in the SimpleFont::draw_string() loop. This also means we no longer
have to undo a scale when updating the position in
`Renderer::show_text()`.

Most of the rest stays pretty yucky though. The root cause of many
problems is that ScaledFont has its rendering size baked into the
object. We want to render fonts at size font_size times scale from
text matrix times scale from current transformation matrix (but
not size from horizontal_scaling). So we have to make that the
font_size, but then we have to undo that in a bunch of places to
get the actual font size.

This will eventually get better when LibPDF moves off ScaledFont.
2024-01-18 14:01:30 +01:00
Nico Weber
f54b0e7c22 LibPDF: Don't accidentally put horizontal_scaling in places
Fonts should have size font_size times total scaling. We tried to
get that by computing text_rendering_matrix.x_scale() * font_size,
but text_rendering_matrix.x_scale() also includes
horizontal_scaling, which shouldn't be part of font size.

Same for character_spacing and word_spacing.

This is all a big mess that's caused by LibPDF using ScaledFont,
which requires scaling to be part of the text type. I have an
in-progress local branch that moves LibPDF to directly use VectorFont,
which will hopefully make this (and other things) nicer. But first,
let's get this right, and then make sure we don't regress it when
things change :^)
2024-01-18 14:01:30 +01:00
Nico Weber
abda5e66f6 LibPDF: Scale delta_x by horizontal_scaling in Renderer::show_text()
While PDFFont::draw_string() already returns a position scaled by
horizontal_scaling, the division by text_rendering_matrix.x_scale()
(which also contains the scaling factor) undid it. Reapply it.

Fixes the horizontal layout of the line
"should be the same on all lines: super" in Tests/LibPDF/text.pdf.
2024-01-18 14:01:30 +01:00
Nico Weber
470d1d8dcf LibPDF: Fix order of parameter, text, and current transform matrix
PDF spec 1.7 5.3.3 Text Space Details gives the correct multiplication
order: parameters * textmatrix * ctm.

We used to do text * ctm * parameters
(AffineTransform::multiply() does left-multiplication).

This only matters if `text_state().rise` is non-zero. In practice,
it's almost always zero, in which case the parameter matrix is a
diagonal matrix that commutes.
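
A small Python sketch of why the order matters once rise is non-zero. It uses
the spec's row-vector convention, where `[a b c d e f]` is the 3x3 matrix with
rows `(a, b, 0)`, `(c, d, 0)`, `(e, f, 1)`. All helper names are hypothetical
and this is not the LibPDF code.

```python
def matmul(a, b):
    """Plain 3x3 matrix product a * b."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def pdf_matrix(a, b, c, d, e, f):
    """PDF's six-number matrix notation as a 3x3 matrix."""
    return [[a, b, 0], [c, d, 0], [e, f, 1]]


def parameter_matrix(font_size, horizontal_scaling, rise):
    """Text-state parameter matrix from 5.3.3 Text Space Details."""
    return [[font_size * horizontal_scaling, 0, 0], [0, font_size, 0], [0, rise, 1]]


def text_rendering_matrix(params, text_matrix, ctm):
    """Spec order: parameters * text matrix * CTM."""
    return matmul(matmul(params, text_matrix), ctm)
```

With a rise of 5 and a text matrix that scales by 2, the spec order yields a
vertical offset of 10, while `text * ctm * parameters` yields 5. With font
size 1, horizontal scaling 1 and rise 0 the parameter matrix is the identity,
so both orders agree.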

Fixes the horizontal offset of "super" in Tests/LibPDF/text.pdf.
2024-01-18 14:01:30 +01:00
Nico Weber
6c65c18c40 LibPDF: Add spec ref to Renderer::calculate_text_rendering_matrix() 2024-01-18 14:01:30 +01:00
Nico Weber
13f007aadb LibPDF: Tweak vertical position of truetype fonts
The vertical coordinates for truetype fonts are different somehow.
We compensated a bit for that; now we compensate some more.

This is still not 100% perfect, but much better than before.
2024-01-17 08:44:07 +00:00
Nico Weber
1845a406ea LibPDF: Add debug settings for clipping paths and images 2024-01-17 08:42:56 +00:00
Nico Weber
2d8a22f4b4 LibPDF: Clip images too
Since we can't clip against a general path yet, this clips images
against the bounding box of the current clip path as well.

Clips for images are often rectangular, so this works out well.

(We wastefully still decode and color-convert the entire image.
In a follow-up, we could consider only converting the unclipped
part.)
2024-01-17 08:42:56 +00:00
Nico Weber
5615a2691a LibPDF: Extract activate_clip() / deactivate_clip() functions
No behavior change.
2024-01-17 08:42:56 +00:00
MacDue
d55867e563 LibPDF: Fix paths with negatively sized re (rect) commands
Turns out the width/height in a `re` command can be negative. This
results in rectangles with different winding orders. For example, a
negative width results in a reversed winding order.

Previously, this was lost by passing the rect through an
`AffineTransform` before constructing the path. So instead, this
constructs the rect path, and then transforms the resulting path.
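
A short Python sketch of the winding-order effect, using the corner order from
the spec's expansion of `re` (move to `x y`, then lines to `x+w y`, `x+w y+h`,
`x y+h`). `rect_path` and `signed_area` are hypothetical names, not LibPDF or
LibGfx functions.

```python
def rect_path(x, y, width, height):
    """Corner points of a `re` rectangle, in the order the operator visits them."""
    return [(x, y), (x + width, y), (x + width, y + height), (x, y + height)]


def signed_area(points):
    """Shoelace formula: the sign flips when the winding order flips."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        area += x0 * y1 - x1 * y0
    return area / 2
```

`0 0 10 5 re` and `10 0 -10 5 re` cover the same area but have opposite signed
areas, which is the information a normalized rectangle type cannot carry.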
2024-01-16 21:31:20 +00:00
Nico Weber
0e91682283 LibPDF: Be more forgiving about trailing image data
The predictor code assumed that all stream data is image data
(...which would make sense: trailing data there is wasted space).

But some PDFs have trailing data there, e.g. 0000257.pdf, so be
forgiving about it.
2024-01-16 09:55:11 -05:00
Nico Weber
b34509edd2 LibPDF: Make pdf --dump-contents handle \r line endings better
Previously, all page contents ended up overprinting a single line
over and over for PDFs that used only `\r` as line ending.

This is for example useful for 0000364.pdf.
2024-01-15 23:16:45 -07:00
Nico Weber
9f9dbb325b LibPDF: Make prediction filters error on user-controlled alloc OOM 2024-01-15 23:06:06 -07:00
Nico Weber
93f5420282 LibPDF: Start implementing the TIFF predictor
This codepath is separate from the predictor in the TIFF decoder.
The TIFF decoder currently does bits->Color conversion before
processing the predictor. That doesn't fit the PDF model where
filters are processed before converting streams into bitmaps.

If this code here ever grows to handle all cases, maybe we can move
it over to the TIFF decoder and then make it do predictions before
decoding to colors, to share this code.

(TIFF prediction is pretty messy since it's bits-per-pixel-dependent.
PNG prediction is always byte-based, which makes things easier.)
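
For the simplest case, 8 bits per component, undoing TIFF horizontal
differencing can be sketched in Python as below. This is only the byte-aligned
case; other bit depths need bit-level unpacking first. The function name is
hypothetical and this is not the LibPDF implementation.

```python
def undo_tiff_predictor_8bit(row, colors):
    """Undo horizontal differencing on one row of 8-bit samples.

    Each stored sample is the difference to the same component of the
    pixel on its left; the first pixel of the row is stored as-is.
    """
    out = bytearray(row)
    for i in range(colors, len(out)):
        out[i] = (out[i] + out[i - colors]) & 0xFF
    return bytes(out)
```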
2024-01-15 23:06:06 -07:00
Nico Weber
9a93f677f4 LibPDF: Mark text rendering matrix as dirty after TJ numbers
Mostly because I audited all places that assigned to `m_text_matrix`
after #22760.

This one is very difficult to trigger in practice.

`show_text()` marks the text rendering matrix dirty already,
so this only has an effect if the `TJ` array starts with a
number, and the matrix isn't marked dirty going in.

`Tm` caches the text rendering matrix, so I changed text.pdf
to contain:

```
1 0 0 1 45 130 Tm
[ 200 (Hello) -2000 (World) ] TJ T*
```

This first sets an x offset of 5 (on top of the normal 40), and
then undoes it (`200` is multiplied by font size (25) / -1000,
and `200 * 25 / -1000` is -5). Before this change, the topmost
"Hello World" ended up slightly indented.

Likely no behavior change in practice, but makes the code easier
to understand, and maybe it helps in the wild somewhere.
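
The `TJ` arithmetic above, as a tiny Python sketch. `tj_displacement` is a
hypothetical helper; it ignores character and word spacing and assumes
horizontal writing.

```python
def tj_displacement(adjustment, font_size, horizontal_scaling=1.0):
    """Horizontal shift caused by a number inside a TJ array, in text space units."""
    # Numbers are in thousandths of text space units and are subtracted.
    return -adjustment / 1000 * font_size * horizontal_scaling
```

With font size 25, `200` moves the pen 5 units to the left, and `-2000` moves
it 50 units to the right.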
2024-01-15 08:39:04 +00:00
Nico Weber
f23f5dcd62 LibPDF: Mark text rendering matrix dirty for Td operator
0000342.pdf page 5 contains this snippet:

```
/T1_1 10.976 Tf
0 -31.643 TD
(This)Tj

1 0 0 1 54 745.563 Tm
22.181 -31.643 Td
[(vehicle)-270.926(uses)...
```

The `Tm` marked the text rendering matrix as dirty at the start,
but it then calls calculate_text_rendering_matrix() almost in the
next line, which recalculates the text rendering matrix and caches
the new matrix. The `Td` used to not mark it as dirty, and we'd
draw "vehicle" with an incorrect matrix.
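
The caching pattern behind this bug, as a simplified Python sketch. The class
and method names are hypothetical, the text matrix and text line matrix are
folded into one, and the "rendering matrix" is a stand-in for the real
computation.

```python
class TextState:
    """Caches a derived matrix and recomputes it only when marked dirty."""

    def __init__(self):
        self.text_matrix = (1, 0, 0, 1, 0, 0)
        self._cached = None  # None means "dirty"

    def set_text_matrix(self, a, b, c, d, e, f):
        """Tm operator: replaces the text matrix."""
        self.text_matrix = (a, b, c, d, e, f)
        self._cached = None

    def translate(self, tx, ty):
        """Td operator: pre-multiplies the text matrix by a translation."""
        a, b, c, d, e, f = self.text_matrix
        self.text_matrix = (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)
        self._cached = None  # leaving this out reproduces the stale-matrix bug

    def rendering_matrix(self):
        if self._cached is None:
            self._cached = self.text_matrix  # stand-in for the real calculation
        return self._cached
```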
2024-01-15 08:37:55 +00:00
Nico Weber
f4ee9a2333 LibPDF: Support drawing images with 16 bits per channel
This uses the tried-and-true "throw away the lower 8 bits" technique
for now. This lets us render Tests/LibPDF/wide-gamut-only.pdf.
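
"Throw away the lower 8 bits" can be sketched in Python as follows, assuming
samples are stored high-order byte first. The function name is hypothetical.

```python
def downsample_16_to_8(samples):
    """Keep only the high byte of each big-endian 16-bit sample."""
    if len(samples) % 2:
        raise ValueError("expected an even number of bytes")
    return samples[0::2]
```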
2024-01-12 16:20:46 -07:00
Nico Weber
5f85aff036 LibPDF: Move ColorSpace::style() to take ReadonlySpan<float>
All ColorSpace subclasses converted to float anyways, and this
allows us to save lots of float->Value->float conversions during
image color space processing.

A bit faster:

```
    N           Min           Max        Median         Avg       Stddev
x  50    0.99054313     1.0412271    0.99933481   1.0052408  0.012931916
+  50    0.97073889     1.0075941    0.97849107  0.98184034 0.0090329046
Difference at 95.0% confidence
	-0.0234004 +/- 0.00442595
	-2.32785% +/- 0.440287%
	(Student's t, pooled s = 0.0111541)
```
2024-01-12 12:37:56 +00:00
Nico Weber
56a4af8d03 LibPDF: Don't reallocate Vectors in ICCBasedColorSpace all the time
Microoptimization; according to ministat a bit faster:

```
    N           Min           Max        Median         Avg       Stddev
x  50     1.0179932     1.0561159     1.0315337   1.0333617 0.0094757426
+  50      1.000875     1.0427601     1.0208509   1.0201902   0.01066116
Difference at 95.0% confidence
	-0.0131715 +/- 0.00400208
	-1.27463% +/- 0.387287%
	(Student's t, pooled s = 0.0100859)
```
2024-01-12 12:37:56 +00:00
Nico Weber
cfd05b1a55 LibPDF: Use MatrixMatrixConversion when possible
Reduces time spent rendering page 3 of 0000849.pdf from 1.32s to 1.13s
on my machine.

Also reduces the time to run Meta/test_pdf.py on 0000.zip
(without 0000849.pdf) from 56s to 54s.
2024-01-12 09:09:56 +01:00
Nico Weber
c161b2d2f9 LibPDF: Extract ICCBasedColorSpace::sRGB() helper 2024-01-12 09:09:56 +01:00
Nico Weber
f7fc2df8ac LibPDF: Simplify load_image() a tiny bit
Images can't use Pattern color spaces, so we'll always have a Color.

No behavior (or perf) change.
2024-01-10 23:26:57 +01:00
Nico Weber
df5451a889 LibPDF: Mark text rendering matrix dirty after changing it in text_begin
A certain PDF that was drawing some text used `9 0 0 9 474.54 700.6801 Tm`
to set the text matrix to a matrix that scaled by 9 in one text object.

Then, after ending that text object, it had the following new text
object which contained nothing that invalidated the text matrix:

```
BT
/F1 7 Tf
/DeviceRGB CS
0 0 0 SC
10 TL
86.37849 21.908 Td
(Authorized licensed use limited to: ...) Tj
ET
```

`BT` did reset it as required, but since we didn't mark the matrix
as dirty, we never recomputed it and drew the additional text scaled
up 9x.
2024-01-10 19:42:08 +01:00
Nico Weber
4fd5d450be LibPDF: Add support for image masks
An image mask is a 1-bit-per-pixel bitmap that's black where the
current color should be painted, and white where it should be
transparent (think: like ink).

load_image() already converts images like this into 8-bit-per-pixel
images that have 0xff, 0xff, 0xff in rgb for opaque (originally 0 bit)
pixels and 0, 0, 0 in rgb for transparent pixels.

So we just copy the image mask's image data into the alpha
channel and replace rgb with the current color, and then draw
it like a regular bitmap.
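
A minimal Python sketch of that last step, with pixels modelled as plain
tuples. `apply_image_mask` is a hypothetical name, and the input is assumed to
be the already-converted 8-bit image described above (0xff in rgb for opaque,
0 for transparent).

```python
def apply_image_mask(mask_pixels, current_color):
    """Turn converted mask pixels (r, g, b) into (r, g, b, a) pixels.

    The mask value moves into the alpha channel and rgb is replaced
    by the current color.
    """
    r, g, b = current_color
    return [(r, g, b, mask_value) for (mask_value, _g, _b) in mask_pixels]
```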
2024-01-10 09:10:11 +00:00
Nico Weber
e770cf06b0 LibPDF: Send jpeg data down the same path as all other data
JPEG images now honor decode arrays and color spaces.
2024-01-10 09:39:00 +01:00
Nico Weber
f157cd50a1 LibPDF: Use mix() in SampledFunction::evaluate()
No behavior change.
2024-01-04 21:12:23 +01:00
Nico Weber
e16345555b LibPDF: Port 59b50fa43f8c2 to xref and object streams
0000440.pdf contains an xref stream object (at offset 3643676) starting:

```
294 0 obj <<
/Type /XRef
/Index [0 295]
/Size 295
```

and an object stream object (at offset 3640121) starting:

```
230 0 obj <<
/Type /ObjStm
/N 73
/First 614
```

In both cases, the `obj` and the `<<` are separated by non-newline
whitespace.

633e1632d0 made parse_indirect_value() tolerate this, but it updated
neither parse_xref_stream() (which parses xref streams) nor
parse_compressed_object_with_index() (which parses object streams),
despite all three changes being part of #14873.

Make parse_xref_stream() and parse_compressed_object_with_index()
call parse_indirect_value() to pick up the fix over there. It's a bit
less code too.

(0000440.pdf is the only PDF in my 1000 test PDFs that this helps,
somewhat surprisingly.)
2024-01-04 11:27:24 +01:00
Nico Weber
9d69c5d434 LibPDF: Tolerate trailing whitespace after %%EOF marker
At first I tried implementing the quirk from PDF 1.7 Appendix H,
3.4.4, "File Trailer": """Acrobat viewers require only that the %%EOF
marker appear somewhere within the last 1024 bytes of the file."""
This would've been like #22548 but at end-of-file instead of at
start-of-file.

This helped a bunch of files, but also broke a bunch of files that
had more than 1024 bytes of stuff at the end, and it wouldn't have
helped 0000059.pdf, which has over 40k of \0 bytes after the %%EOF.
So just tolerate whitespace after the %%EOF line, and keep ignoring
an arbitrary amount of other stuff after that like before.
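
The "tolerate whitespace" half of this can be sketched in Python as below. It
only illustrates the idea and is not the LibPDF check; the helper name is
hypothetical, and files with non-whitespace trailing data are assumed to be
handled by a separate fallback that is not shown.

```python
# PDF white-space characters: NUL, tab, line feed, form feed, carriage return, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def ends_with_eof_marker(data):
    """True if only PDF whitespace follows the final %%EOF marker."""
    return data.rstrip(PDF_WHITESPACE).endswith(b"%%EOF")
```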

This helps:
* 0000599.pdf
  One trailing \0 byte after %%EOF. Due to that byte, the
  is_linearized() check fails and we go down the non-linearized
  codepath. But with this fix, that code path succeeds.
* 0000937.pdf
  Same.
* 0000055.pdf
  Has one space followed by a \n after %%EOF
* 0000059.pdf
  Has over 40kB of trailing \0 bytes

The following files keep working with it:
* 0000242.pdf
  5586 bytes of trailing HTML
* 0000336.pdf
  5586 bytes of trailing HTML fragment
* 0000136.pdf
  2054 bytes of trailing space characters
  This one kind of only worked by accident before since it found
  the %%EOF block before the final %%EOF block. Maybe this is
  even an intentional XRefStm compat hack? Anyways, now it
  finds the final block instead.
* 0000327.pdf
  11044 bytes of trailing HTML
2024-01-04 11:19:15 +01:00
Nico Weber
2d12647e29 LibPDF: Add FIXME for "was linearized PDF incrementally updated" check
It's pretty tricky to do, and also tricky with respect to skipping
trailing bytes after %%EOF: The check requires knowing the full size of
the PDF (which means web servers not sending content lengths are out),
but that size has to be after stripping trailing bytes, which normal
static file servers won't do. So PDF viewers would have to download the
last couple bytes of the PDF unconditionally, then strip trailing bytes
and use the count to figure out the final actual PDF size.

Luckily, we don't incrementally download PDFs from the net but
instead require all data to be available in one chunk, so it's
not currently a problem.
2024-01-04 11:19:15 +01:00
Nico Weber
1b45c3e127 LibPDF: Tolerate whitespace after xref and startxref
The spec isn't super clear on if this is allowed:

"""Each cross-reference section shall begin with a line containing the
keyword xref. Following this line..."""

"""The two preceding lines shall contain, one per line and in order, the
keyword startxref and..."""

It kind of sounds like anything goes on both lines as long as they
contain `xref` and `startxref`.

In practice, both seem to always occur at the start of their line,
but in 0000780.pdf (and nowhere else), there's one space after each
keyword before the following linebreak, and this makes that file load.
2024-01-04 10:14:30 +01:00
Nico Weber
efb37f7252 LibPDF: Add Reader::consume_non_eol_whitespace() 2024-01-04 10:14:30 +01:00
Nico Weber
c59e08123b LibPDF: Add a FIXME and a spec comment to Encoding::from_object() 2024-01-04 10:12:11 +01:00
Nico Weber
ad5fc0eda1 LibPDF: An Encoding's /Differences entry is optional
Per "TABLE 5.11 Entries in an encoding dictionary", /Differences is
optional.

(Per "Encodings for TrueType Fonts" in 5.5.5 Character Encoding,
nonsymbolic truetype fonts are even recommended to have "no Differences
array." But in practice, most seem to have it.)

Fixes crashes on:
* 0000001.pdf
* 0000574.pdf
* 0000337.pdf

All three don't render super great, but at least they no longer crash.
2024-01-04 10:12:11 +01:00
Nico Weber
0bb0c7dac2 LibPDF: Scan for PDF file start in first 1024 bytes
Other readers do this too, and files depend on this.

Fixes opening these four files from the PDFA 0000.zip dataset:

* 0000015.pdf
  Starts with `C:\web\webeuncet\_cat\_docs\_publics\` before header
* 0000408.pdf
  Starts with UTF-8 BOM
* 0000524.pdf
  Starts with 867 bytes of HTML containing a PHP backtrace
* 0000680.pdf
  Starts with `C:\web\webeuncet\_cat\_docs\_publics\` too
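
A minimal Python sketch of such a scan. The function name and the `window`
parameter are hypothetical; the sketch requires the whole `%PDF-` marker to
lie inside the window, which may differ from what LibPDF does.

```python
def find_pdf_header(data, window=1024):
    """Return the offset of `%PDF-` within the first `window` bytes, or None."""
    offset = data[:window].find(b"%PDF-")
    return None if offset < 0 else offset
```

A file that starts with a UTF-8 BOM then reports offset 3 instead of failing.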
2024-01-03 10:12:35 +01:00
Nico Weber
9495f64f91 LibPDF: Improve hex string parsing
A local (non-public) PDF I have lying around contains this in
a page's operator stream:

```
[<00b4003e> 3 <002600480051> 3 <005700550044004f0003> -29
<00330044> 3 <0055> -3 <004e0040> 4 <0003> -29 <004c00560003> -31
<0057004b> 4 <00480003> -37 <0050
>] TJ
```

That is, there's a newline in a hexstring after a character.

This led to `Parser error at offset 5184: Unexpected character`.

The spec says in 3.2.3 String Objects, Hexadecimal Strings:
"""Each pair of hexadecimal digits defines one byte of the string.
White-space characters (such as space, tab, carriage return, line feed,
and form feed) are ignored."""

But we didn't ignore whitespace before or after a character, only
in between the bytes.

The spec also says:
"""If the final digit of a hexadecimal string is missing—that is, if
there is an odd number of digits—the final digit is assumed to be 0."""

In that case, we were skipping the closing `>` twice -- or, more
accurately, we ignored the character after it too. This has been
wrong all the way back in #6974.
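
Both spec rules together, as a Python sketch rather than the actual LibPDF
parser. `parse_hex_string` is a hypothetical helper that returns the decoded
bytes and the index just past the closing `>`.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def parse_hex_string(data, start=0):
    """Parse a `<...>` hex string beginning at data[start]."""
    if data[start:start + 1] != b"<":
        raise ValueError("expected '<'")
    digits = bytearray()
    pos = start + 1
    while True:
        if pos >= len(data):
            raise ValueError("unterminated hex string")
        ch = data[pos]
        pos += 1
        if ch == ord(">"):
            break  # consume the closing '>' exactly once
        if ch in PDF_WHITESPACE:
            continue  # whitespace is ignored anywhere, not just between bytes
        if ch not in HEX_DIGITS:
            raise ValueError(f"unexpected character at offset {pos - 1}")
        digits.append(ch)
    if len(digits) % 2:
        digits.append(ord("0"))  # a missing final digit is assumed to be 0
    return bytes.fromhex(digits.decode("ascii")), pos
```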

Add a test that fails if either of the two changes isn't present.
2024-01-02 22:13:21 +01:00
Lucas CHOLLET
f389c1cdba LibGfx+LibPDF: Use LibCompress' implementation of the PackBits decoder
No need to have these three copies :^)
2023-12-27 17:40:11 +01:00