/BaseFont is a required key for type 0, type 1, and truetype
font dictionaries, but not for type 3 font dictionaries.
This is mechanical; type 0 fonts don't even use this yet
(but probably should).
PDFFont::initialize() is now empty and could be removed,
but maybe we'll put stuff there again later, so I'm leaving
it around for a bit longer.
Per 5177.Type2.pdf 3.1 "Type 2 Charstring Organization",
a glyph's charstring looks like:
w? {hs* vs* cm* hm* mt subpath}? {mt subpath}* endchar
The `w?` is the width of the glyph, but it's optional. So all
possible commands after it (hstem* vstem* cntrmask hintmask
moveto endchar) check if there's an extra number at the start
and interpret it as a width, for the very first command we read.
This was done by having an `is_first_command` local bool that
got set to false after the first command. That didn't work with
subrs: If the first command was a call to a subr that just pushed
a bunch of numbers, then the second command after it is the actual
first command.
Instead, move that bool into the state. Set it to false the
first time we try to read a width, since that means we just read
a command that could've been prefixed by a width.
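A rough sketch of the idea in plain C++. All names here (State, maybe_read_width, subr_pushing_numbers, rmoveto) are made up for illustration and are not the actual LibPDF code; stem operators would check for an odd operand count instead of a fixed one.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>

// Hypothetical interpreter state; only the parts needed for width handling.
struct State {
    std::deque<float> stack;
    bool is_first_command = true; // lives in the state, so it survives subr calls
    std::optional<float> width;
};

// Called by every operator that may be prefixed by a width.
// `expected_args` is the operand count the operator takes without a width.
void maybe_read_width(State& state, std::size_t expected_args)
{
    if (!state.is_first_command)
        return;
    // We just saw a command that could have carried a width, so no later one can.
    state.is_first_command = false;
    if (state.stack.size() > expected_args) {
        state.width = state.stack.front();
        state.stack.pop_front();
    }
}

// Stand-in for a subr that only pushes numbers. It never tries to read a
// width, so is_first_command stays true across the call.
void subr_pushing_numbers(State& state)
{
    for (float value : { 500.0f, 10.0f, 20.0f })
        state.stack.push_back(value);
}

// rmoveto takes two operands (dx, dy).
void rmoveto(State& state)
{
    maybe_read_width(state, 2);
    assert(state.stack.size() == 2);
    state.stack.clear(); // a real interpreter would move the pen here
}
```

With this, calling subr_pushing_numbers() and then rmoveto() still treats the leading 500 as the width, even though rmoveto is the second command that ran.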
https://adobe-type-tools.github.io/font-tech-notes/pdfs/T1_SPEC.pdf,
8.4 First Four Subrs Entries:
"""If Flex or hint replacement is used in a Type 1 font program, the
first four entries in the Subrs array in the Private dictionary must be
assigned charstrings that correspond to the following code sequences. If
neither Flex nor hint replacement is used in the font program, then this
requirement is removed, and the first Subrs entry may be a normal
charstring subroutine sequence. The first four Subrs entries contain:
Subrs entry number 0:
3 0 callothersubr pop pop setcurrentpoint return
"""
othersubr handler 0 gets three arguments:
* The flex height (the distance after which the bezier splines
are replaced with just straight lines)
* The current position after the flex (an x and a y coordinate)
It pushes that position on the postscript stack, where predefined subr
handler number 0 then pops it from. It then passes it to
setcurrentpoint.
In theory, we now correctly do that setcurrentpoint call, which we
previously weren't.
In practice, that setcurrentpoint call always receives the last point of
the flex -- and our path api apparently gets confused when move_to() is
called on it when the current point is already at that same location.
So tweak the SetCurrentPoint handler to not set the current point on
the path if it's already the path's current point, with a FIXME to
figure out what exactly is happening in Gfx::Path.
No big behavior change if flex is used, but this is more correct if it
isn't.
(This only works because our `return` handler is empty, else we would
have to make the callothersubr handler start a call frame.)
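For illustration, here is roughly how `3 0 callothersubr pop pop setcurrentpoint` flows through the two stacks. This is a sketch with made-up names (State, PathSketch, call_othersubr_0, pop_op, set_current_point), not the real handlers; the push order (y first, then x) is my assumption so that the two pops hand back x and then y.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Point = std::pair<float, float>;

// Hypothetical stand-in for the path API; counts move_to() calls.
struct PathSketch {
    Point current { 0.0f, 0.0f };
    int move_count = 0;
    void move_to(Point point)
    {
        current = point;
        ++move_count;
    }
};

struct State {
    std::vector<float> stack;            // charstring operand stack
    std::vector<float> postscript_stack; // filled by othersubrs, drained by `pop`
    PathSketch path;
};

float take(std::vector<float>& values)
{
    assert(!values.empty());
    float value = values.back();
    values.pop_back();
    return value;
}

// othersubr 0: consumes flex height, x, y; leaves the position on the postscript stack.
void call_othersubr_0(State& state)
{
    float y = take(state.stack);
    float x = take(state.stack);
    float flex_height = take(state.stack);
    (void)flex_height; // unused in this sketch
    state.postscript_stack.push_back(y);
    state.postscript_stack.push_back(x);
}

// `pop`: moves one number from the postscript stack to the charstring stack.
void pop_op(State& state)
{
    state.stack.push_back(take(state.postscript_stack));
}

// setcurrentpoint: skips move_to() if the path is already at that point.
void set_current_point(State& state)
{
    float y = take(state.stack);
    float x = take(state.stack);
    Point point { x, y };
    if (state.path.current != point)
        state.path.move_to(point);
}
```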
https://adobe-type-tools.github.io/font-tech-notes/pdfs/T1_SPEC.pdf,
8.4 First Four Subrs Entries:
"""If Flex or hint replacement is used in a Type 1 font program, the
first four entries in the Subrs array in the Private dictionary must be
assigned charstrings that correspond to the following code sequences. If
neither Flex nor hint replacement is used in the font program, then this
requirement is removed, and the first Subrs entry may be a normal
charstring subroutine sequence. The first four Subrs entries contain:
[...]
Subrs entry number 1:
0 1 callothersubr return
Subrs entry number 2:
0 2 callothersubr return
"""
So subr entry numbers 1 and 2 just call othersubr 1 and 2, which
means we can just move the handling code over.
No behavior change if flex is used, but more correct if it isn't.
(This only works because our `return` handler is empty, else we would
have to make the callothersubr handler start a call frame.)
This is a subset of #21484: Type 2 CFFs never use the special subrs,
so stop doing them for type 2 at least for now.
Fixes an assert in 0000064.pdf in 0000.zip in the pdfa dataset
(a stack underflow because a subr is supposed to push a bunch of
stuff, but we ran one of the built-in routines instead of
the subr from the font file).
As discussed in #21484, this isn't right for type 1 CFFs either,
but just removing the code there regresses Tests/LibPDF/type1.pdf.
A slightly more involved thing is needed there; I added a FIXME
for that here.
No intended behavior change.
It does have the effect that indirect object references now go down
the array path instead of the number path. They still fall over there,
but now that's easy to fix.
Type 1 fonts usually have a m_font_program and no m_font -- they only
have m_font if we're using a replacement font for the fonts that
were built-in to PDFs before Acrobat 4.0 (and must still work to
show existing files).
However, SimpleFont::get_glyph_width() used to always return a
float, which in Type1Font was only implemented if m_font was set.
Per spec, we're supposed to just use /MissingWidth for fonts that
are missing an entry in the descriptor's /Width array. However, for
built-in fonts, no explicit /Width array is needed (PDF 1.7 spec,
Appendix H.3, 5.5.1). So if we just always use /MissingWidth,
then PDFs that use a built-in font draw all their text on top
of each other (e.g. 000333.pdf from stillhq.com-pdfdb).
So change get_glyph_width() to return Optional<float>, return
it only in Type1Font if m_font is set, and use MissingWidth
if it isn't set.
That way, replacement fonts still return a width, and real
fonts that are supposed to have /Width and use /MissingWidth
for missing entries do what they're supposed to too, instead
of crashing.
From 20 (6%) to 16 (5%) crashes on the 300 first PDFs, and from
39 (7.8%) to 31 (6.2%) on the 500-random PDFs test.
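A minimal sketch of the resulting lookup, using std::optional as a stand-in for AK's Optional. Type1FontSketch, glyph_width and the 600 width are made up, and the lookup order (explicit widths entry, then the font's own width, then the missing width) is an assumption for illustration.

```cpp
#include <cassert>
#include <map>
#include <optional>

// Hypothetical stand-in for Type1Font: it only knows a width when a
// replacement font for one of the built-in fonts is in use.
struct Type1FontSketch {
    bool has_replacement_font = false;

    std::optional<float> get_glyph_width(int /*char_code*/) const
    {
        if (!has_replacement_font)
            return std::nullopt;
        return 600.0f; // made-up width; a real font would look this up
    }
};

// Explicit widths entry first, then the font's own width, then the missing width.
float glyph_width(std::map<int, float> const& widths, float missing_width,
    Type1FontSketch const& font, int char_code)
{
    assert(char_code >= 0 && char_code <= 255); // simple fonts use one-byte codes
    if (auto it = widths.find(char_code); it != widths.end())
        return it->second;
    if (auto width = font.get_glyph_width(char_code))
        return *width;
    return missing_width;
}
```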
`left` might be a number bigger than there are actually glyphs in the
CFF.
The spec says "The number of ranges is not explicitly specified in the
font. Instead, software utilizing this data simply processes ranges
until all glyphs in the font are covered." Apparently we have to check
for this within each range as well.
Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.
Together with the previous commit:
From 21 (7%) to 20 (6%) crashes on the 300 first PDFs, and from
41 (8.2%) to 39 (7.8%) on the 500-random PDFs test.
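The per-range check could look something like this. It is a sketch with hypothetical names (Range, expand_charset_ranges), assuming the ranges were already read from the file; glyph 0 (.notdef) is not stored in the charset, so only glyph_count - 1 SIDs are wanted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One charset range: covers `left + 1` glyphs with consecutive SIDs.
struct Range {
    uint16_t first_sid;
    uint16_t left;
};

std::vector<uint16_t> expand_charset_ranges(std::vector<Range> const& ranges, std::size_t glyph_count)
{
    assert(glyph_count > 0); // there is always at least .notdef
    std::size_t const wanted = glyph_count - 1;
    std::vector<uint16_t> sids;
    for (auto const& range : ranges) {
        // 32-bit counter, so `i <= left` terminates even for left == 65535.
        for (uint32_t i = 0; i <= range.left; ++i) {
            // Stop inside the range too: `left` may overshoot the glyph count.
            if (sids.size() >= wanted)
                return sids;
            sids.push_back(static_cast<uint16_t>(range.first_sid + i));
        }
    }
    return sids;
}
```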
...and replace template instantiations with a loop, to make this
easily possible.
Vaguely nice for code size as well.
Needed for example in 0000054.pdf and 0000354.pdf in 0000.zip in the
pdfa dataset.
We used to use a u8 as loop counter, which would overflow
if there were more than 255 glyphs, producing hundreds of megabytes
of
Couldn't find string for SID x, going with space
output in the process, while all data until the end of the CFF
section got interpreted as SIDs, until a try_read() would finally
fail.
We now no longer fail miserably trying to render page 2 of
0000352.pdf of 0000.zip from the pdfa dataset.
Fixes just one crash of the larger 500-document test set, but
when I tweak test_pdf.py to print all stacks instead of just the
top 5, it no longer produces 260 MB of output.
Font programs are bytecode programs defining glyphs. If several glyphs
share a piece of outline, that opcode sequence can be put in a
subroutine ("subr") table and the definition of those glyphs can then
call that subroutine by number, to reduce file size.
CFF fonts can in theory contain multiple fonts, and so there's a global
subr table shared by all the fonts in one CFF, and a local per-font
subr table. We used to only implement the local subr table, now we
implement both.
(We only support one font per CFF, and at least in PDF files, that's
all that's ever used. So a global subr table isn't very useful.
But the spec explicitly allows it -- "Global subroutines may be used in
a FontSet even if it only contains one font." -- and it happens in
practice.)
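As a sketch of how the two tables might be consulted: the names (Subr, Subrs, Call, find_subr, resolve) are invented here, and the bias values are the ones the Type 2 charstring spec gives for subr numbers, which are stored relative to a bias that depends on the table size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Subr = std::vector<uint8_t>; // raw charstring bytes of one subroutine

// Bias as given in the Type 2 charstring spec.
int subr_bias(std::size_t count)
{
    if (count < 1240)
        return 107;
    if (count < 33900)
        return 1131;
    return 32768;
}

// Returns nullptr when the biased number is outside the table.
Subr const* find_subr(std::vector<Subr> const& table, int operand)
{
    long index = static_cast<long>(operand) + subr_bias(table.size());
    if (index < 0 || static_cast<std::size_t>(index) >= table.size())
        return nullptr;
    return &table[static_cast<std::size_t>(index)];
}

struct Subrs {
    std::vector<Subr> global; // shared by all fonts in the CFF
    std::vector<Subr> local;  // per font
};

enum class Call {
    Local,  // callsubr
    Global, // callgsubr
};

Subr const* resolve(Subrs const& subrs, Call call, int operand)
{
    return find_subr(call == Call::Global ? subrs.global : subrs.local, operand);
}
```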
This was the last piece of data we didn't read yet.
(We also don't yet support multiple fonts per CFF, but I haven't
found a PDF using that yet.)
We still don't do anything with it, but now we at least print a
warning if this data is there and we ignore it.
https://adobe-type-tools.github.io/font-tech-notes/pdfs/T1_SPEC.pdf :
"Using charstring subroutines is not a requirement of a Type 1
font program."
And some versions of Computer Modern do in fact not contain a Subrs
array.
Together with #21473, makes Problemset.pdf from the pdffiles repo
render ok instead of crashing.
With this, all tables from the spec appendixes are in CFF.cpp.
This fixes a crash reading page 2 (and onward) of
2ThestructureoftheCIE1997ColourAppearanceModelCIECAM97s.pdf in
the pdffiles repo.
The encoding offset defaults to 0, i.e. the Standard Encoding.
That means reading the encoding only if the tag is present causes
us to not read it if a font uses the Standard Encoding.
Now, we always read an encoding, even if it's the (implicit) default
one.
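In sketch form, with made-up names (EncodingKind, encoding_kind): the offset is optional in the Top DICT, and a missing one behaves like 0. Per the CFF spec, offsets 0 and 1 select the predefined Standard and Expert encodings; anything else points at encoding data in the file.

```cpp
#include <cassert>
#include <optional>

enum class EncodingKind {
    Standard, // offset 0, also the default when the tag is absent
    Expert,   // offset 1
    Custom,   // any other value is an offset to encoding data
};

// `top_dict_encoding_offset` is empty when the Top DICT has no encoding tag.
EncodingKind encoding_kind(std::optional<int> top_dict_encoding_offset)
{
    int offset = top_dict_encoding_offset.value_or(0);
    assert(offset >= 0);
    if (offset == 0)
        return EncodingKind::Standard;
    if (offset == 1)
        return EncodingKind::Expert;
    return EncodingKind::Custom;
}
```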
The main encoding data maps glyph ID ("GID") to its codepoint.
If a glyph has several codepoints, then a secondary table mapping
codepoint to string ID ("SID") of the glyph's name is present.
(A separate table associates each glyph with its name already.)
I haven't seen this used in the wild, but the structure of the
supplemental data is also going to be needed for built-in encodings.
Two bugs:
1. We decoded a u32, not an i32 as the spec wants
2. (minor) Our fixed-point divisor was off by one
Fixes text rendering in Bakke2010a.pdf in pdffiles, and rendering of
other fonts with negative width adjustments from opcode 255.
That PDF was produced by "Apple pstopdf" and uses font SFBX1200,
which is apparently a variant of Computer Modern. So maybe this
helps with lots of PDFs produced from TeX files, but I haven't
checked that.
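A sketch of the corrected decoding, assuming this is the Type 2 charstring operand where byte 255 is followed by four big-endian bytes holding a signed 16.16 fixed-point number (so the divisor is 65536). The function name is made up.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Decodes the four bytes following a 255 byte as a signed 16.16 number.
float decode_fixed_16_16(std::array<uint8_t, 4> const& bytes)
{
    uint32_t raw = (static_cast<uint32_t>(bytes[0]) << 24)
        | (static_cast<uint32_t>(bytes[1]) << 16)
        | (static_cast<uint32_t>(bytes[2]) << 8)
        | static_cast<uint32_t>(bytes[3]);
    // Reinterpret the bits as signed; stopping at the u32 loses the sign.
    int32_t value;
    std::memcpy(&value, &raw, sizeof(value));
    // 16 fractional bits means dividing by 2^16 = 65536, not 65535.
    return static_cast<float>(value) / 65536.0f;
}
```

For example, the bytes FF FF 80 00 decode to -0.5; read as a u32 they would come out as a huge positive number instead.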
I haven't seen this being used in the wild (yet), but it's easy
to implement, and with this we support all charset formats.
So we can now mention if we see a format we don't know about.
From "10 String INDEX":
"Further space saving is obtained by allocating commonly occurring
strings to predefined SIDs. These strings, known as the standard
strings, describe all the names used in the ISOAdobe and Expert
character sets along with a few other strings common to Type 1 fonts. A
complete list of standard strings is given in Appendix A. The client
program will contain an array of standard strings with nStdStrings
elements. Thus, the standard strings take SIDs in the range 0 to
(nStdStrings-1)."
And "13 Charsets" says that charsets store SIDs.
Fixes all
"Couldn't find string for SID $n, going with space"
messages when going through the encoding pages (page 1010 and
thereabouts) in the PDF 1.7 spec.
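The lookup boils down to something like the following sketch. The table here is a three-entry stand-in (the real one from Appendix A has 391 entries), and the std::optional return and the exact signature are assumptions for illustration, not the real resolve_sid().

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Stand-in for the standard strings: only the first three entries.
static const std::array<std::string, 3> standard_strings = { ".notdef", "space", "exclam" };

// SIDs below the standard string count index the predefined table;
// everything above indexes the font's own String INDEX, shifted down.
std::optional<std::string> resolve_sid(std::size_t sid, std::vector<std::string> const& string_index)
{
    if (sid < standard_strings.size())
        return standard_strings[sid];
    std::size_t index = sid - standard_strings.size();
    if (index < string_index.size())
        return string_index[index];
    return std::nullopt; // unknown SID; the caller decides on a fallback
}
```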
Only really useful for reading SIDs in the Top DICT (copyright
text etc), which we currently don't do.
I haven't seen a difference from looking things up in the string
table. The only real effect from the commit that I need is that
it pulls a local resolve() lambda into a real function
resolve_sid(), which I want to call in a future commit.
But it makes things more spec-compliant, and if we ever want to
read SIDs in metadata in the future, now we can.
We'd unconditionally get the int from a Variant<int, float> here,
but PDFs often have a float for defaultWidthX and nominalWidthX.
Fixes crash opening Bakke2010a.pdf from pdffiles (but while the
file loads ok, it looks completely busted).
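The gist of the fix, sketched with std::variant standing in for AK's Variant (DictOperand and to_float are made-up names): visit the operand and convert whichever alternative is there, instead of assuming it holds an int.

```cpp
#include <cassert>
#include <variant>

// A DICT operand can be an integer or a real number.
using DictOperand = std::variant<int, float>;

// Works for both alternatives; std::get<int> would throw on a float.
float to_float(DictOperand const& operand)
{
    return std::visit([](auto number) { return static_cast<float>(number); }, operand);
}
```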