1
Fork 0
mirror of https://github.com/RGBCube/serenity synced 2025-05-17 11:35:07 +00:00
Commit graph

37 commits

Author SHA1 Message Date
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Nico Weber
52afa936c4 LibPDF: Don't over-read in charset formats 1 and 2
`left` might be a number bigger than there are actually glyphs in the
CFF.

The spec says "The number of ranges is not explicitly specified in the
font. Instead, software utilizing this data simply processes ranges
until all glyphs in the font are covered." Apparently we have to check
for this within each range as well.

Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.

Together with the previous commit:

From 21 (7%) to 20 (6%) crashes on the 300 first PDFs, and from
41 (8.2%) to 39 (7.8%) on the 500-random PDFs test.
2023-10-23 09:31:11 -04:00
Nico Weber
58ff7b5336 LibPDF: Support offset size 3 in CFF index reading
...and replace template instantiations with a loop, to make this
easily possible.

Vaguely nice for code size as well.

Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.
2023-10-23 09:31:11 -04:00
Nico Weber
3197f0cab6 LibPDF: Handle CFF fonts with charset format 0 and > 255 glyphs better
We used to use an u8 as loop counter, which would overflow
if there were more than 255 glyphs, producing hundreds of megabytes
of

    Couldn't find string for SID x, going with space

output in the process, while all data until the end of the CFF
section got interpreted as SIDs, until a try_read() would finally
fail.

We now no longer fail miserably trying to render page 2 of
0000352.pdf of 0000.zip from the pdfa dataset.

Fixes just one crash of the larger 500-document test set, but
when I tweak test_pdf.py to print all stacks instead of just the
top 5, it no longer produces 260 MB of output.
2023-10-23 09:31:11 -04:00
Nico Weber
0869ca5615 LibPDF: Add more CFF_DEBUG output 2023-10-23 09:31:11 -04:00
Nico Weber
04aec4a032 LibPDF: Don't log CFF Copyright tag as unknown 2023-10-21 21:04:02 +02:00
Nico Weber
3907374621 LibPDF: Implement support for callgsubr in CFF font programs
Font programs are bytecode programs defining glyphs. If several glyphs
share a piece of outline, that opcode sequence can be put in a
subroutine ("subr") table and the definition of those glyphs can then
call that subroutine by number, to reduce file size.

CFF fonts can in theory contain multiple fonts, and so there's a global
subr table shared by all the fonts in one CFF, and a local per-fornt
subr table.  We used to only implement the local subr table, now we
implement both.

(We only support one font per CFF, and at least in PDF files, that's
all that's ever used. So a global subr table isn't very useful.
But the spec explicitly allows it -- "Global subroutines may be used in
a FontSet even if it only contains one font." -- and it happens in
practice.)
2023-10-18 10:50:32 -04:00
Nico Weber
44efff81b9 LibPDF: Remove a dbgln() call in CFF subrs decoding
This code is a lot more reliable now than it used to be, and this
dbgln() is quite noisy for some files. So let's remove it.
2023-10-18 10:43:51 -04:00
Nico Weber
46fd6fdfa3 LibPDF: Read Global subr data in CFF reader
This was the last piece of data we didn't read yet.
(We also don't yet support multiple fonts per CFF, but I haven't
found a PDF using that yet.)

We still don't do anything with it, but now we at least print a
warning if this data is there and we ignore it.
2023-10-18 11:02:10 +02:00
Nico Weber
3be5719987 LibPDF: Rename subroutines to local_subroutines in CFF code 2023-10-18 11:02:10 +02:00
Nico Weber
9a0b559932 LibPDF: Tweak formatting of built-in CFF tables
This makes the code look more like the pages in the spec.

No behavior change, whitespace change only.
2023-10-18 11:00:17 +02:00
Nico Weber
cb961101c7 LibPDF: Implement CFF built-in Standard and Expert encodings
With this, all tables from the spec appendixes are in CFF.cpp.

This fixes a crash reading page 2 (and onward) of
2ThestructureoftheCIE1997ColourAppearanceModelCIECAM97s.pdf in
the pdffiles repo.
2023-10-17 10:21:38 +02:00
Nico Weber
eeada4678c LibPDF: Postpone CFF encoding processing after Top DICT has been read
The encoding offset defaults to 0, i.e. the Standard Encoding.
That means reading the encoding only if the tag is present causes
us to not read it if a font uses the Standard Encoding.

Now, we always read an encoding, even if it's the (implicit) default
one.
2023-10-17 10:21:38 +02:00
Nico Weber
1cfe639b6c LibPDF: Implement CFF supplemental encoding
The main encoding data maps glyph ID ("GID") to its codepoint.
If a glyph has several codepoints, then a secondary table mapping
codepoint to string ID ("SID") of the glyph's name is present.

(A separate table associates each glyph with its name already.)

I haven't seen this used in the wild, but the structure of the
supplemental data is also going to be needed for built-in encodings.
2023-10-17 10:21:38 +02:00
Nico Weber
37daeae6fd LibPDF: Add spec comments, dbgln_if()s to CFF's parse_encoding() 2023-10-17 10:21:38 +02:00
Nico Weber
96a4936567 LibPDF: Checking for built-in CFF encodings
Only prints a warning for them for now.

Also warn on the not-yet-implemented encoding supplement.
2023-10-16 08:32:18 +02:00
Nico Weber
414a164850 LibPDF: Be louder about unimplemented CFF dict entries 2023-10-16 08:32:18 +02:00
Nico Weber
c825194fb9 LibPDF: Reject CFFs with more than one font
The code assumes that there's just one Top DICT, so let's be loud
when that isn't the case.
2023-10-16 08:32:18 +02:00
Nico Weber
6f783929dd LibPDF: Implement support for CFF charset format 2
I haven't seen this being used in the wild (yet), but it's easy
to implement, and with this we support all charset formats.

So we can now mention if we see a format we don't know about.
2023-10-15 15:27:15 +02:00
Nico Weber
5b915fb15c LibPDF: Add more spec comments to parse_charset() 2023-10-15 15:27:15 +02:00
Nico Weber
49275c4b17 LibPDF: Don't overflow SIDs in type 1 charset parsing
first_sid has type SID (aka u16), so don't store it in an u8.

This fixes (among other things) page 24 on the PDF 1.7 spec.
2023-10-15 15:27:15 +02:00
Nico Weber
23d6e9f577 LibPDF: Implement CFF built-in charsets ISOAdobe, Expert, Expert Subset 2023-10-15 09:33:34 +02:00
Nico Weber
8060957d8d LibPDF: Use Appendix A instead of Appendix C for standard names
From "10 String INDEX":

"Further space saving is obtained by allocating commonly occurring
strings to predefined SIDs. These strings, known as the standard
strings, describe all the names used in the ISOAdobe and Expert
character sets along with a few other strings common to Type 1 fonts. A
complete list of standard strings is given in Appendix A.  The client
program will contain an array of standard strings with nStoStrings
elements. Thus, the standard strings take SIDs in the range 0 to
(nStaStrings-1)."

And "13 Charsets" says that charsets store SIDs.

Fixes all

    "Couldn't find string for SID $n, going with space"

messages when going through the encoding pages (page 1010 and
thereabouts) in the PDF 1.7 spec.
2023-10-15 09:33:34 +02:00
Nico Weber
aba787a441 LibPDF: Implement reading of CFF String Index
Only really useful for reading SIDs in the Top DICT (copyright
text etc), which we currently don't do.

I haven't seen a difference from looking things up in the string
table. The only real effect from the commit that I need is that
it pulls a local resolve() labmda into a real function
resolve_sid(), which I want to call in a future commit.

But it makes things more spec-compliant, and if we ever want to
read SIDs in metadata in the future, now we can.
2023-10-15 09:33:34 +02:00
Nico Weber
3c49d0dad3 LibPDF: Add a CFF_DEBUG toggle
I'd like to put some debug prints behind this soon.

No behavior change.
2023-10-15 07:14:29 +02:00
Nico Weber
2249e79630 LibPDF: Add two FIXMEs 2023-10-13 07:53:27 +02:00
Nico Weber
d451197d3d LibPDF: Add spec comments to CFF 2023-10-13 07:53:27 +02:00
Nico Weber
349996f7f2 LibPDF: Don't crash on files with float CFF defaultWidthX
We'd unconditionally get the int from a Variant<int, float> here,
but PDFs often have a float for defaultWidthX and nominalWidthX.

Fixes crash opening Bakke2010a.pdf from pdffiles (but while the
file loads ok, it looks completely busted).
2023-10-12 19:43:57 +02:00
Nico Weber
f56b897622 Everywhere: Fix a few typos
Some even user-visible!
2023-04-12 19:37:35 +02:00
Rodrigo Tobar
4a20751ff6 LibPDF: Detect CFF encodings with supplements
These are not yet actually parsed, but detecting them means we at least
don't fail to understand the *actual* format value, which was causing
some CFF fonts to fail to load.
2023-03-02 12:18:53 +01:00
Rodrigo Tobar
c4507bb56e LibPDF: Add more built-in SIDs
The first iteration has enough SIDs to display simple documents, but
when trying more and more documents we started to need more of these
SIDs to be properly defined. This is a copy/paste exercise from the CFF
document, which is tedious, so it will continue in small drops.

This commit fills all the gaps until SID 228, which covers all the
ISOAdobe space, and should be enough for most use cases. Since this is a
continuous space starting at 0, we now use an Array instead of a Map to
store these names, which should be more performant. Also to simplify
things I've moved the Array out of the CFF class, making it a simpler
static variable, which allows us to use template type deduction.
2023-02-13 00:23:17 +00:00
Rodrigo Tobar
3eaa27f53a LibPDF: Add infrastructure for accented character glyphs
Type1 accented character glyphs are composed of two other glyphs in the
same font: a base glyph and an accent glyph, given as char codes in the
standard encoding. These two glyphs are then composed together to form
the accented character.

This commit adds the data structures to hold the information for
accented characters, and also the routine that composes the final glyph
path out of the two individual components. All glyphs must have been
loaded by the time this composition takes place, and thus a new
protected consolidate_glyphs() routine has been added to perform this
calculation.
2023-02-08 19:47:15 +01:00
Rodrigo Tobar
11a9bfd4b6 LibPDF: Turn Glyph into a class
Glyph was a simple structure, but even now it's become more complex that
it was initially. Turning it into a class hides some of that complexity,
and make sit easier to understand to external eyes.

While doing this I also decided to remove the float + bool combo for
keeping track of the glyph's width, and replaced it with an Optional
instead.
2023-02-08 19:47:15 +01:00
Rodrigo Tobar
c084943457 LibPDF: Index Type1 glyphs by name, not char code
Storing glyphs indexed by char code in a Type1 Font Program binds a Font
Program instance to the particular Encoding that was used at Font
Program construction time. This makes it difficult to reuse Font Program
instances against different Encodings, which would be otherwise
possible.

This commit changes how we store the glyphs on Type1 Font Programs.
Instead of storing them on a map indexed by char code, the map is now
indexed by glyph name. In turn, when rendering a glyph we use the
Encoding object to turn the char code into a glyph name, which in turn
is used to index into the map of glyphs.

This is the first step towards reusability of Type1 Font Programs. It
also unlocks the ability to render glyphs that are described via the
"seac" command (standard encoding accented character), which requires
accessing the base and accent glyphs by name.
2023-02-08 19:47:15 +01:00
Rodrigo Tobar
286e3e6872 LibPDF: Simplify Encoding to align with simple font requirements
All "Simple Fonts" in PDF (all but Type0 fonts) have the property that
glyphs are selected with single byte character codes. This means that
the Encoding objects should use u8 for representing these character
codes. Moreover, and as mentioned in a previous commit, there is no need
to store the unicode code point associated with a character (which was
in turn wrongly associated to a glyph).

This commit greatly simplifies the Encoding class. Namely it:

 * Removes the unnecessary CharDescriptor class.
 * Changes the internal maps to be u8 -> FlyString and vice-versa,
   effectively providing two-way lookups.
 * Adds a new method to set a two-way u8 -> FlyString mapping and uses
   it in all possible places.
 * Simplified the creation of Encoding objects.
 * Changes how the WinAnsi special treatment for bullet points is
   implemented.
2023-02-02 14:50:38 +01:00
Tim Schumacher
ae64b68717 AK: Deprecate the old AK::Stream
This also removes a few cases where the respective header wasn't
actually required to be included.
2023-01-29 19:16:44 -07:00
Rodrigo Tobar
c4b45a82cd LibPDF: Add initial CFF parsing
The Compat Font Format specification (Adobe's Technical Note #5176) is
used by PDF's Type1C fonts to store their data. While being similar in
spirit to PS1 Type 1 Font Programs, it was designed for a more compact
representation and thus space reduction (but an increment on
complexity). It also shares most of the charstring encoding logic, which
is why the CFF class also inherits from Type1FontProgram.

This initial implementation is still lacking many details, e.g.:

 * It doesn't include all the built-in CFF SIDs
 * It doesn't support CFF-provided SIDs (defaults those glyphs to the
   space character)
 * More checks in general
2023-01-25 15:40:11 +01:00