Introduces CIDIterator, an iterator type for iterating over CIDs.
Also introduces Type0CMap which can return a CIDIterator given some
bytes.
The existing code of treating the bytes as an identity map of
big-endian u16s is now implemented in IdentityType0CMap.
No behavior change.
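As a rough illustration of the identity mapping described above (a sketch only; `identity_cids_from_bytes` is a made-up name and not the actual IdentityType0CMap code), each pair of bytes is read as one big-endian u16 CID:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not LibPDF's real API: treats the input as a
// sequence of big-endian 16-bit values and returns them as CIDs.
std::vector<uint16_t> identity_cids_from_bytes(std::vector<uint8_t> const& bytes)
{
    std::vector<uint16_t> cids;
    // Assumption: a trailing odd byte cannot form a full CID and is ignored.
    for (size_t i = 0; i + 1 < bytes.size(); i += 2)
        cids.push_back(static_cast<uint16_t>((bytes[i] << 8) | bytes[i + 1]));
    return cids;
}
```

An iterator type would hand these out one at a time instead of building a vector, but the byte arithmetic would be the same.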
This will allow us to get at the font's glyphs as paths, which will
eventually enable us to implement glyph rotation. We'll have to do our
own caching then, but we can then hopefully share the caching across the
Type0 / Type1 / TrueType codepaths.
It also gives us access to a font's glyphs by glyph id, which will help
us implement looking up glyph ids by postscript name. (Else we'd
have to plumb through a whole Painter::draw_glyph_by_postscript_name()
API just for LibPDF.)
No behavior change.
Liberation Sans still doesn't have the vast majority of the
Zapf Dingbats glyphs, but now we map the Zapf Dingbats names to good
unicode values. So we only need to use a different font and all should
work. (And Liberation Sans has _some_ of the glyphs, like 13 of the
223.) And we now render empty squares instead of wrong glyphs for the
ones we don't have.
I haven't seen any PDFs using ZapfDingbats in the wild, but they
probably exist somewhere.
(Tests/LibPDF/standard-14-fonts.pdf is a synthetic PDF using it.)
Turns out there's a spec that goes with the table.
The big change here is that we can now map `uni1234` to 0x1234 and
`u123456` to 0x123456.
The parts where we split a name on `_` and map each component
and the part where we're supposed to allow multiple groups of 4
after `uni` aren't implemented yet.
The ZapfDingbats lookup is also still missing.
I haven't seen this have an effect in practice, but it's easy to
construct a PDF with a custom encoding where it would make a
difference.
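A minimal sketch of the two name forms handled so far. The function names are made up for illustration, and I'm assuming upper-case hex digits only and 4 to 6 digits after `u`, which is how I read the glyph name spec:

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Parses upper-case hex digits only; any other character rejects the name.
static std::optional<uint32_t> parse_upper_hex(std::string_view digits)
{
    uint32_t value = 0;
    for (char c : digits) {
        if (c >= '0' && c <= '9')
            value = value * 16 + static_cast<uint32_t>(c - '0');
        else if (c >= 'A' && c <= 'F')
            value = value * 16 + static_cast<uint32_t>(c - 'A' + 10);
        else
            return std::nullopt;
    }
    return value;
}

// Hypothetical function. Handles only "uni" + exactly 4 hex digits and
// "u" + 4 to 6 hex digits. Splitting on `_` and multiple groups of 4
// after "uni" are deliberately left out, like in the change itself.
std::optional<uint32_t> code_point_from_glyph_name(std::string_view name)
{
    if (name.size() == 7 && name.substr(0, 3) == "uni")
        return parse_upper_hex(name.substr(3));
    if (name.size() >= 5 && name.size() <= 7 && name[0] == 'u')
        return parse_upper_hex(name.substr(1));
    return std::nullopt;
}
```

Ordinary names that merely start with `u` (say, `union`) fall through to the second branch and are rejected there because `n` is not a hex digit.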
We use Liberation Sans for the actual glyph for these, and that's
missing some (Symbol) / all (ZapfDingbats) of the glyphs we need
for these two standard fonts (...or at least the mapping from
name to glyph, not sure). But still, better rendering squares than
completely incorrect glyphs.
Our code deciding what to do when a value isn't found in an encoding,
or when the name doesn't map to a glyph, also needs work, but that's
mostly independent of this change. I think this is a nice small
standalone progression.
Makes text show up on 0000646.pdf pages 87-92, which for some reason
renders all text using 2x2 images with huge masks that contain
rendered text outlines.
This will need further thought once we implement support for the
truetype 'post' table, but for now it's correct most of the time,
and better than not doing it.
...and for fallback fonts too.
We use Liberation Sans (a truetype font) for standard and fallback
fonts. So we should use the standard PDF algorithm for mapping bytes
to truetype glyphs. TrueTypePainter knows how to do this.
Makes the "fi" ligature in the title on page 1 of 5014.CIDFont_Spec.pdf
or the dotless-i in the title of page 2 of ThinkingInPostScript.pdf
show up. They use Helvetica and Times, and Helvetica and Symbol
respectively (with -Bold variants).
Since ScaledFont bakes the size of the font into the font type, we
do the same for Type1 fonts, and then have to divide by the font height
when figuring out what to scale by. For a target width of 0, chances are
the source width is also 0, and we end up with NaN due to dividing
0 by 0. This then triggered the `VERIFY(isfinite(error))` in
can_approximate_bezier_curve() in Painter.cpp.
Check for this case and scale by 0 instead of dividing.
It could happen that the denominator is 0 without the numerator being 0,
but it's not clear what that's supposed to mean. In this case we'd end
up with +inf/-inf, which would also trigger the assert. I haven't seen
this case in practice, so let's not worry about that for now.
(A nicer longer-term fix is probably to make LibPDF use VectorFont
instead of ScaledFont, so that we don't have to bake the font size into
the font type. Then we won't need this division at all. In the meantime,
this fixes the crash.)
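The check boils down to something like the following sketch (`horizontal_scale` and its parameters are illustrative names, not the actual code):

```cpp
// Hypothetical helper: returns the factor to scale a glyph by.
// A target width of 0 yields 0 instead of evaluating 0 / 0, which is NaN.
float horizontal_scale(float target_width, float source_width)
{
    if (target_width == 0)
        return 0;
    // If only source_width is 0 this still produces +inf/-inf.
    // That case is intentionally not handled, as discussed above.
    return target_width / source_width;
}
```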
Fixes a crash on page 66 of
https://developer.apple.com/library/archive/documentation/mac/pdf/Text.pdf
Fixes a crash on page 37 of
https://open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf
Fixes crashes in `0000310.pdf`, `0000430.pdf`, `0000229.pdf`.
Brings down the number of crashes on my 1000 file test set from
5 with 3 distinct stacks to 2 with 1 distinct stack.
(The number went up from 3 crashes with 2 distinct stacks to 5/3 when we
started rendering much more text when Type0 font support was added.
This fixes the crashes we had before Type0 support.)
Non-CID-keyed fonts in PDFs have 8-bit codepoints which are mapped from
bytes to character names via encoding.
TrueType fonts don't index glyphs by name (Type1 fonts do), so the fix
(codified in the spec) was to make a list of all possible glyph names
and map those to (16-bit) unicode values, and then pass those into the
truetype cmap.
(As a fallback, we're supposed to look at the optional names in the
font's "post" table. That part isn't implemented here yet.)
(Note that this affects the behavior of fallback fonts for TrueType
fonts, but not yet fallback fonts for Type1 fonts, and neither the
behavior of the 14 built-in Type1 fonts (which we implement as
fallback fonts), since the TrueType fallback in Type1Font.cpp does
not use this algorithm yet. This will be fixed in a future patch.)
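Roughly, the lookup chain looks like this. This is a sketch with made-up types; the real tables live in the PDF's encoding, the glyph list, and the font's cmap, and the "post" table fallback is omitted:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-ins for the three tables involved.
struct SimpleTrueTypeTables {
    std::map<uint8_t, std::string> encoding;         // byte -> character name (from the PDF)
    std::map<std::string, uint16_t> name_to_unicode; // character name -> unicode (glyph list)
    std::map<uint16_t, uint16_t> cmap;               // unicode -> glyph id (from the font)
};

// Maps one byte from a text string to a glyph id, or nothing if any
// step of the chain has no entry.
std::optional<uint16_t> glyph_id_for_byte(SimpleTrueTypeTables const& tables, uint8_t byte)
{
    auto name = tables.encoding.find(byte);
    if (name == tables.encoding.end())
        return std::nullopt;
    auto code_point = tables.name_to_unicode.find(name->second);
    if (code_point == tables.name_to_unicode.end())
        return std::nullopt;
    auto glyph_id = tables.cmap.find(code_point->second);
    if (glyph_id == tables.cmap.end())
        return std::nullopt;
    return glyph_id->second;
}
```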
For `#xx` in names, we now also handle lower-case hex digits.
The spec is silent on the case of these hex digits.
Our previous check (isxdigit(), and now is_ascii_hex_digit()) lets
through lower-case hex digits, so it seems better to handle them
rather than computing e.g. `'a' - 'A' + 10` (== 42 -- off by 32!).
I don't know if this has any visible effect on any files, but it's
more correct, and less code, and the code looks more like the code
in Filter::decode_ascii_hex().
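For illustration, a standalone sketch of case-insensitive `#xx` decoding (hypothetical helper names, not the parser's actual code):

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Precondition: c is an ASCII hex digit, in either case.
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return c - 'A' + 10;
}

// Hypothetical helper: replaces each `#xx` escape with the byte it encodes.
// A `#` that is not followed by two hex digits is copied through unchanged
// (an assumption made to keep the sketch simple).
std::string decode_name_escapes(std::string_view raw)
{
    std::string decoded;
    for (size_t i = 0; i < raw.size(); ++i) {
        bool is_escape = raw[i] == '#' && i + 2 < raw.size()
            && std::isxdigit(static_cast<unsigned char>(raw[i + 1]))
            && std::isxdigit(static_cast<unsigned char>(raw[i + 2]));
        if (!is_escape) {
            decoded += raw[i];
            continue;
        }
        decoded += static_cast<char>(hex_digit_value(raw[i + 1]) * 16 + hex_digit_value(raw[i + 2]));
        i += 2;
    }
    return decoded;
}
```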
Both the type 1 and type 2 specs tell us to do this.
I haven't observed a difference from this, but I noticed it in the
spec while I was touching this code. Probably good to do what the
spec tells us to do.
With this, a character can be defined that uses two existing glyphs.
This is useful for umlauts and the like, which then just need to
reference e.g. the glyphs named "a" and "dieresis" and provide a
translation.
Makes umlauts appear on some PDFs using CFF type2 data in Type 1
fonts.
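Conceptually the composition is just "base outline plus translated accent outline". A heavily simplified sketch, where an outline is a flat list of points rather than a real path and all names are made up:

```cpp
#include <vector>

struct Point {
    float x;
    float y;
};

// Simplification: a real glyph is a path with contours and curves.
using Outline = std::vector<Point>;

// Hypothetical helper: returns the base glyph's points followed by the
// accent glyph's points shifted by accent_offset.
Outline compose_accented_glyph(Outline const& base, Outline const& accent, Point accent_offset)
{
    Outline composed = base;
    for (auto const& point : accent)
        composed.push_back({ point.x + accent_offset.x, point.y + accent_offset.y });
    return composed;
}
```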
It is sometimes truncated in fonts embedded in PDFs, and the data
is not needed to render PDFs. 2 of my 1000 test PDFs used to
complain "Could not load OS2 v1: Not enough data" and 1
"Could not load OS2 v2: Not enough data" before.
Increases number of PDFs that render without diagnostics from
764 to 765 (and decreases the number of distinct error messages
from 27 to 25).
It is sometimes truncated in fonts embedded in PDFs, and the data
is not needed to render PDFs. 26 of my 1000 test files complained
"Could not load Hmtx: Not enough data" before.
Increases number of PDFs that render without diagnostics from
743 to 764.
It is often missing in fonts embedded in PDFs. 75 of my 1000 test
files complained "Font is missing Name" when trying to read fonts
before.
Increases number of PDFs that render without diagnostics from
682 to 743.
This is required by the CFF spec, and is consistent with what we do for
the encoding 24 lines down.
As far as I can tell, nothing in `Type1FontProgram::rasterize_glyph()`
or in Type1Font.cpp implements the "If an encoding maps to a character
name that does not exist in the Type 1 font program, the .notdef glyph
is substituted." line from the PDF 1.7 spec (in 5.5.5 Character
Encoding, Encodings for Type 1 Fonts) yet, so this does not yet have an
effect.
Of my 1000 test files, 73 have Type0 truetype fonts with stream
CIDToGIDMaps. This makes that work.
(With this patch, the number of files in my 1000 test files complaining
"Font is missing Name" increases from 41 to 75, so a bit under half of
the fonts using stream CIDToGIDMaps also have no 'name' table. So that's
next.)
Increases files without issues from 652 to 681.
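For reference, a stream CIDToGIDMap stores one big-endian 2-byte glyph index per CID, at byte offsets 2 * cid and 2 * cid + 1. A sketch of the lookup (`gid_for_cid` is a made-up name, and returning nothing for out-of-range CIDs is my assumption):

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: looks up the glyph id for a CID in the decoded
// bytes of a stream CIDToGIDMap.
std::optional<uint16_t> gid_for_cid(std::vector<uint8_t> const& map_bytes, uint16_t cid)
{
    size_t offset = static_cast<size_t>(cid) * 2;
    // Both bytes of the entry must be inside the stream.
    if (offset + 1 >= map_bytes.size())
        return std::nullopt;
    return static_cast<uint16_t>((map_bytes[offset] << 8) | map_bytes[offset + 1]);
}
```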
https://adobe-type-tools.github.io/font-tech-notes/pdfs/5177.Type2.pdf
says "The behavior of undefined operators is unspecified." but
https://learn.microsoft.com/en-us/typography/opentype/spec/cff2
says "When an unrecognized operator is encountered, it is ignored and
the stack is cleared."
Some type 0 CIDFontType0C fonts (i.e. CID-keyed non-OpenType CFF fonts)
depend on the latter, even though they're governed by the former spec.
Fixes rendering of text in 0000521.pdf (e.g. page 10 or 5). The font
there has a bunch of 0 opcodes for some reason.
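A toy interpreter showing the "ignore and clear the stack" behavior. Everything here is made up for illustration, including the single known operator and its number; it is not the real charstring interpreter:

```cpp
#include <vector>

struct Token {
    bool is_operator;
    int value; // operand value, or operator number if is_operator is set
};

// Toy operator number, chosen to not look like a real Type 2 opcode.
constexpr int toy_sum_operator = 1000;

// Runs the tokens and returns one result per executed toy_sum_operator.
std::vector<int> run_toy_charstring(std::vector<Token> const& tokens)
{
    std::vector<int> stack;
    std::vector<int> results;
    for (auto const& token : tokens) {
        if (!token.is_operator) {
            stack.push_back(token.value);
            continue;
        }
        if (token.value == toy_sum_operator) {
            int sum = 0;
            for (int operand : stack)
                sum += operand;
            results.push_back(sum);
        }
        // Unrecognized operators do nothing, but their operands are still
        // dropped so that they don't leak into the next operator.
        stack.clear();
    }
    return results;
}
```

With operands `1 2`, an unknown operator 0, then `3 4` and the sum operator, the result is 7 rather than 10, and the font is not rejected.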
Disclaimers, similar to what's on #23202 (and most of the
prerequisites mentioned there are needed for this too):
* Only supports the `Identity-H` type0 cmap at the moment
* Doesn't support vertical text yet
* Only supports the `Identity` CIDToGIDMap at the moment
(this one is a truetype-only thing)
Together with the already-merged #23122, #23128, #23135, #23136, #23162,
#23167, #23179, #23190, and #23194 this adds initial support for
rendering some CFF-based Type0 fonts :^)
There's a long list of things that still need improving after this:
* A small number of CFF programs contain the charstring command 0,
which is invalid. Currently, this makes us reject the whole font.
* Type1FontProgram::rasterize_glyph() is name-based. For CID-based
fonts, we want a version that takes CIDs (character IDs) instead.
For now, I'm printing the CID to a string and using that, yuck.
(I looked into doing this nicely. I do want to do that, but I
need to read up on how the `seac` type1 charstring command uses
character names to identify parts of an accented character.
Also, it looks like `seac`'s accented character handling moved
over to `endchar` in type2 charstring commands (i.e. in CFF data),
and it looks like we don't implement that at all. So I need to do
more reading first, and I didn't want to block this on that.)
* The name for the first string in name-based CFF fonts looks wrong;
added a FIXME for that for now.
* This supports the named Identity-H cmap only for now. Identity-H
    maps UTF16-BE values to glyph IDs with the identity function, and
assumes it's horizontal text. Other named cmaps in my test files are
UniJIS-UCS2-H, UniCNS-UCS2-H, Identity-V, UniGB-UCS2-H, UniKS-UCS2-H.
(There are also 2 files using the stream-based cmaps instead of the
name-based ones.)
* In particular, we can't draw vertical text (`-V`) yet
* Passing in the encoding to CFF::create() is awkward (it's nullptr
for CID-keyed fonts), and it's also not necessary since
`Type1Font::draw_glyph()` already does the "take encoding from PDF,
and only from font if the PDF doesn't store one" dance.
* This doesn't cache glyphs but re-rasterizes them each time. Easy
to add, but maybe I want to look at rotation first. And things
don't feel glacial as-is.
  * Type0Font::draw_glyph() is pretty similar to the second half of
Type1Font::draw_glyph()
Make TopDict's defaultWidthX and nominalWidthX Optional<>s so that
we can check if they're set per fdselect-selected font dict, and
if so use the value from there in CID-keyed fonts. Otherwise, keep
using the value in the top dict.
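The fallback order can be sketched like this. The struct and function names are illustrative; the final fallback of 0 when neither dict sets a value is my assumption, based on the CFF defaults for these two entries:

```cpp
#include <optional>

// Hypothetical stand-in for the width-related entries of a dict.
struct WidthDefaults {
    std::optional<float> default_width_x;
    std::optional<float> nominal_width_x;
};

// Prefers the fdselect-selected font dict's value, then the top dict's.
float resolve_default_width(WidthDefaults const& font_dict, WidthDefaults const& top_dict)
{
    if (font_dict.default_width_x.has_value())
        return *font_dict.default_width_x;
    return top_dict.default_width_x.value_or(0.0f);
}

float resolve_nominal_width(WidthDefaults const& font_dict, WidthDefaults const& top_dict)
{
    if (font_dict.nominal_width_x.has_value())
        return *font_dict.nominal_width_x;
    return top_dict.nominal_width_x.value_or(0.0f);
}
```

Using Optional<> instead of a plain 0 default is what makes "not set" distinguishable from "set to 0" in the per-font dict.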