Both the Type 1 and Type 2 specs tell us to do this.
I haven't observed a difference from this, but I noticed it in the
spec while I was touching this code. Probably good to do what the
spec tells us to do.
With this, a character can be defined that uses two existing glyphs.
This is useful for umlauts and the like, which then just need to
reference e.g. the glyphs named "a" and "dieresis" and provide a
translation.
Makes umlauts appear on some PDFs using CFF type2 data in Type 1
fonts.
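A minimal sketch of the idea, assuming a path-based glyph
representation (the function and parameter names are made up for
illustration, not the actual LibPDF API):
```c++
#include <LibGfx/AffineTransform.h>
#include <LibGfx/Path.h>

// "adieresis" can be built from the existing glyphs named "a" and
// "dieresis": overlay the accent on the base, shifted by the
// translation stored in the font.
static Gfx::Path compose_accented_glyph(Gfx::Path const& base_glyph,
    Gfx::Path const& accent_glyph, float translation_x, float translation_y)
{
    Gfx::Path composite = base_glyph;
    composite.append_path(accent_glyph.copy_transformed(
        Gfx::AffineTransform {}.translate(translation_x, translation_y)));
    return composite;
}
```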
It is sometimes truncated in fonts embedded in PDFs, and the data
is not needed to render PDFs. 2 of my 1000 test PDFs used to
complain "Could not load OS2 v1: Not enough data" and 1
"Could not load OS2 v2: Not enough data" before.
Increases number of PDFs that render without diagnostics from
764 to 765 (and decreases the number of distinct error messages
from 27 to 25).
It is sometimes truncated in fonts embedded in PDFs, and the data
is not needed to render PDFs. 26 of my 1000 test files complained
"Could not load Hmtx: Not enough data" before.
Increases number of PDFs that render without diagnostics from
743 to 764.
It is often missing in fonts embedded in PDFs. 75 of my 1000 test
files complained "Font is missing Name" when trying to read fonts
before.
Increases number of PDFs that render without diagnostics from
682 to 743.
This is required by the CFF spec, and is consistent with what we do for
the encoding 24 lines down.
As far as I can tell, nothing in `Type1FontProgram::rasterize_glyph()`
or in Type1Font.cpp implements the "If an encoding maps to a character
name that does not exist in the Type 1 font program, the .notdef glyph
is substituted." line from the PDF 1.7 spec (in 5.5.5 Character
Encoding, Encodings for Type 1 Fonts) yet, so this doesn't have an
effect yet.
Of my 1000 test files, 73 have Type0 truetype fonts with stream
CIDToGIDMaps. This makes those work.
(With this patch, the number of files in my 1000 test files complaining
"Font is missing Name" increases from 41 to 75, so a bit under half of
the fonts using stream CIDToGIDMaps also have no 'name' table. So that's
next.)
Increases files without issues from 652 to 681.
https://adobe-type-tools.github.io/font-tech-notes/pdfs/5177.Type2.pdf
says "The behavior of undefined operators is unspecified." but
https://learn.microsoft.com/en-us/typography/opentype/spec/cff2
says "When an unrecognized operator is encountered, it is ignored and
the stack is cleared."
Some type 0 CIDFontType0C fonts (i.e. CID-keyed non-OpenType CFF fonts)
depend on the latter, even though they're governed by the former spec.
Fixes rendering of text in 0000521.pdf (e.g. page 10 or 5). The font
there has a bunch of 0 opcodes for some reason.
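A sketch of the lenient behavior in the charstring interpreter
(structure heavily simplified; not the exact code):
```c++
switch (op) {
// ... all the known charstring operators ...
default:
    // Unrecognized operator (e.g. the stray 0 opcodes in 0000521.pdf):
    // following the CFF2 wording, ignore it and clear the argument
    // stack instead of rejecting the whole font.
    state.stack.clear();
    break;
}
```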
Disclaimers, similar to what's on #23202 (and most of the
prerequisites mentioned there are needed for this too):
* Only supports the `Identity-H` type0 cmap at the moment
* Doesn't support vertical text yet
* Only supports the `Identity` CIDToGIDMap at the moment
(this one is a truetype-only thing)
Together with the already-merged #23122, #23128, #23135, #23136,
#23162, #23167, #23179, #23190, and #23194, this adds initial support
for rendering some CFF-based Type0 fonts :^)
There's a long list of things that still need improving after this:
* A small number of CFF programs contain the charstring command 0,
which is invalid. Currently, this makes us reject the whole font.
* Type1FontProgram::rasterize_glyph() is name-based. For CID-based
fonts, we want a version that takes CIDs (character IDs) instead.
For now, I'm printing the CID to a string and using that, yuck.
(I looked into doing this nicely. I do want to do that, but I
need to read up on how the `seac` type1 charstring command uses
character names to identify parts of an accented character.
Also, it looks like `seac`'s accented character handling moved
over to `endchar` in type2 charstring commands (i.e. in CFF data),
and it looks like we don't implement that at all. So I need to do
more reading first, and I didn't want to block this on that.)
* The name for the first string in name-based CFF fonts looks wrong;
added a FIXME for that for now.
* This supports the named Identity-H cmap only for now. Identity-H
maps UTF16-BE values to glyph IDs with the identity function and
assumes horizontal text; see the sketch after this list. Other named
cmaps in my test files are UniJIS-UCS2-H, UniCNS-UCS2-H, Identity-V,
UniGB-UCS2-H, UniKS-UCS2-H.
(There are also 2 files using the stream-based cmaps instead of the
name-based ones.)
* In particular, we can't draw vertical text (`-V`) yet
* Passing in the encoding to CFF::create() is awkward (it's nullptr
for CID-keyed fonts), and it's also not necessary since
`Type1Font::draw_glyph()` already does the "take encoding from PDF,
and only from font if the PDF doesn't store one" dance.
* This doesn't cache glyphs but re-rasterizes them each time. Easy
to add, but maybe I want to look at rotation first. And things
don't feel glacial as-is.
* Type0Font::draw_glyph() is pretty similar to the second half of
Type1Font::draw_glyph()
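For the Identity-H bullet above, here's a minimal sketch of what that
cmap amounts to (loop shape and variable names are hypothetical):
```c++
// Identity-H: the string data is a sequence of big-endian 16-bit
// codes, and each code is used as the glyph ID unchanged; the text
// is laid out horizontally.
for (size_t i = 0; i + 1 < bytes.size(); i += 2) {
    u16 code = (static_cast<u16>(bytes[i]) << 8) | bytes[i + 1];
    u32 glyph_id = code; // identity mapping
    // ... rasterize glyph_id and advance the pen in x ...
}
```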
Make TopDict's defaultWidthX and nominalWidthX Optional<>s so that
we can check if they're set per fdselect-selected font dict, and
if so use the value from there in CID-keyed fonts. Otherwise, keep
using the value in the top dict.
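A rough sketch of the intended lookup, with hypothetical field names:
```c++
// Prefer a defaultWidthX set in the fdselect-selected font dict;
// otherwise fall back to the top dict (0 is the spec's default).
float default_width_for_glyph(TopDict const& top_dict, u32 glyph_id)
{
    u8 fd_index = top_dict.fdselect[glyph_id];
    auto const& font_dict = top_dict.fdarray[fd_index];
    if (font_dict.default_width_x.has_value())
        return font_dict.default_width_x.value();
    return top_dict.default_width_x.value_or(0);
}
```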
* FDArray, FDSelect must be present
* Encoding must not be present
* Charset maps from GID (Glyph ID) to CID (Character ID),
instead of to character name
The fdselect array (that we already read) maps each glyph ID
to an fdarray index. The font dict at that index then stores
information for that glyph.
In practice, this is used to assign different defaultWidthX /
nominalWidthX values to blocks of glyphs in CID-keyed fonts.
We don't do anything with the data yet, and we don't send data of
CID-keyed CFFs into this parser yet either, so no behavior change.
This happens for CFFs that contain multiple fonts. This doesn't
happen in practice, but the same code will be used for fdarray
parsing, which will contain several dicts.
No behavior change.
This is very similar to SimpleFont::draw_string() for now, but
it'll become a bit different when we add support for vertical
text.
CIDFontType now only needs to draw single glyphs. Neither of the
subclasses can do that yet, so no behavior change yet.
...instead of reading them in Filter::decode() for all filters and
then passing them around to only the LZW and flate filters.
(EarlyChange is LZWDecode-only, so that's read there instead.)
No behavior change.
Previously, we'd loop over the index of the output coordinate;
for example, for a CMYK->RGB function we'd loop over RGB. For
every output index, we'd then sample the function at the CMYK
input point.
Now, we sample at CMYK once and return a span for all outputs,
since they're stored in contiguous memory. And we then loop
over the outputs only to do weighting and mapping to the target
range at the end.
Reduces the runtime of
(cd Tests/LibPDF; \
../../Build/lagom/bin/BenchmarkPDF --benchmark_repetitions 5)
from 235.6±2.3ms to 103.2±3.3ms on my system, and makes
SampledFunction::evaluate() more similar to lerp_nd() in TagTypes.h.
This fixes rendering of commas in 0000941.pdf page 1. The commas
use the default width, and without this they show up very large,
covering the page.
Also, it's nice that the code now looks like the regular case 4 lines
further up.
This is one of the two top dict entries we need for CID-keyed fonts.
We don't send any CID-keyed font data into the CFF parser yet,
so no behavior change.
No real behavior change. We don't actually load the CFF data yet
(blocked on #23136 and some more), and we don't have drawing code
yet, and Type0Font::draw_string() doesn't do any drawing yet.
But it's a step in the right direction.
No behavior change, except that we now dbgln() if we see a
PrivDictOperator we don't know about. (I haven't seen this in
practice, but I found this useful while debugging things.)
...and do string expansion at the call site.
CID-keyed fonts treat the charset as CIDs instead of as SIDs,
so having access to the SIDs in numeric form will be useful
when we implement support for CID-keyed CFF fonts.
No behavior change.
I implemented CFF charset format 2 in 6f783929dd with the note
"I haven't seen this being used in the wild". Now that I have
seen it (0000658.pdf), I can say that this has never worked,
despite me claiming "it's easy to implement".
But now it works!
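For reference, format 2 is just ranges of consecutive SIDs with a
16-bit nLeft (format 1 uses an 8-bit one). A sketch with a
hypothetical reader:
```c++
// Each Range2 entry covers first_sid, first_sid + 1, ..., first_sid + n_left.
// Glyph 0 (.notdef) is not stored in the charset.
while (sids.size() < glyph_count - 1) {
    u16 first_sid = read_be_u16();
    u16 n_left = read_be_u16();
    for (u32 sid = first_sid; sid <= static_cast<u32>(first_sid) + n_left; ++sid)
        sids.append(sid);
}
```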
Previously, if we wanted to e.g. do linear interpolation in 2-D,
we'd get a sample point like (1.3, 4.4), then get 4 samples around
it at (1, 4), (2, 4), (1, 5), (2, 5), then reduce the 4 samples
to 2 samples by computing the combined samples
`0.7 * f(1, 4) + 0.3 * f(2, 4)` and `0.7 * f(1, 5) + 0.3 * f(2, 5)`,
and then 1-D linearly blending between these two samples with the
factor 0.4. In the end we'd multiply the first value by 0.7 * 0.6,
the second by 0.3 * 0.6, the third by 0.7 * 0.4, and the fourth by
0.3 * 0.4, and then sum them all up.
This requires computing and storing 2**N samples, followed by
another 2**N iterations to combine the 2**N samples into a single
value. (N is in practice either 4 or 3, so 2**N isn't super huge.)
Instead, for every sample we can compute the product of its weights
directly and accumulate the weighted samples as we go. This lets us
omit the second loop and the 2**N values of storage, in exchange for
O(N) additional work per sample to compute the product.
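A sketch of the new approach, using the same 2-D example (names
hypothetical; the real code generalizes over N axes):
```c++
// For each of the 2**N corner samples, multiply the N per-axis
// weights together and accumulate, instead of materializing 2**N
// intermediate samples and reducing them in a second pass.
float result = 0;
for (u32 corner = 0; corner < (1u << N); ++corner) {
    float weight = 1;
    for (size_t axis = 0; axis < N; ++axis) {
        bool is_high = (corner >> axis) & 1;
        // fraction[axis] is 0.3 for x = 1.3 and 0.4 for y = 4.4.
        weight *= is_high ? fraction[axis] : 1 - fraction[axis];
    }
    result += weight * sample_at(corner);
}
```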
Takes
Build/lagom/bin/image --no-output --invert-cmyk \
--assign-color-profile \
Build/lagom/Root/res/icc/Adobe/CMYK/USWebCoatedSWOP.icc \
--convert-to-color-profile serenity-sRGB.icc \
cmyk.jpg
from 3.42s to 3.08s on my machine, almost 10% faster (and less code).
Here cmyk.jpg is a 2253x3080 cmyk jpeg, and USWebCoatedSWOP.icc is an
mft2 profile with input tables with 256 samples and a 9x9x9x9 CLUT.
The LibPDF change is covered by TEST_CASE(sampled) in LibPDF.cpp,
and the LibGfx change is basically the same change as the one in
LibPDF (where the test results don't change) and the output
subjectively looks identical. So hopefully this indeed causes no
behavior change :^)
For pages containing images or embedded fonts, --dump-contents
used to dump a ton of binary data. That isn't very useful, so
stop doing it.
Before:
% time Build/lagom/bin/pdf --render out.png \
~/Downloads/0000/0000711.pdf --dump-contents | wc -l
937972
Now:
% time Build/lagom/bin/pdf --render out.png \
~/Downloads/0000/0000711.pdf --dump-contents | wc -l
6566
Printing 7k lines is also much faster than printing 940k,
0.15s instead of 2s.
CMYK data describes which inks a printer should use to print a color.
If a screen wants to display a color that looks similar to what the
printer will produce, the right color is very different from what
Color::from_cmyk() produces. (It's also printer-dependent.)
There are many ICC profiles describing printing processes. It doesn't
matter too much which one we use -- most of them look somewhat
similar, and they all look dramatically better than Color::from_cmyk().
This patch adds a function to download a zip file that Adobe offers
on their web site. They even have a page for redistribution:
https://www.adobe.com/support/downloads/iccprofiles/icc_eula_win_dist.html
(That one leads to a broken download though, so this downloads the
end-user version.)
In case we have to move off this download at some point, there are also
a whole bunch of profiles at https://www.color.org/registry/index.xalter
that "may be used, embedded, exchanged, and shared without restriction".
The Adobe zip contains a whole bunch of other useful and fun profiles,
so I went with it.
For now, this only unzips the USWebCoatedSWOP.icc file though, and
installs it in ${CMAKE_BINARY_DIR}/Root/res/icc/Adobe/CMYK/. In
Serenity builds, this will make it to /res/icc/Adobe/CMYK in the
disk image. And in Lagom builds, after #23016, this is the
Lagom res staging directory that tools can install via
Core::ResourceImplementation. `pdf` and `MacPDF` already do that,
`TestPDF` now does it too.
The final piece is that LibPDF then loads the profile from there
and uses it for DeviceCMYK color conversions.
(Doing file access from the bowels of a library is a bit weird,
especially in a system that has sandboxing built in. But LibGfx does
that in FontDatabase too already, and LibPDF uses that, so it's not a
new problem.)
See #22821 for a previous attempt. This attempt should settle
things once and for all.
The opentype render path adjusts by `-font_ascender * -y_scale` in
Glyf::Glyph::append_simple_path(), so that's what we need to undo
to draw at the font's baseline.
(OpenType::Font::metrics() returns ascender scaled by y_scale already,
so no need to have the scale here where we undo the shift.)
Previously, we called `baseline()` which just returns the font's
font size, which is pretty meaningless:
https://tonsky.me/blog/font-size/
https://simoncozens.github.io/fonts-and-layout/opentype.html#vertical-metrics-hhea-and-os2
Also, conceptually it makes sense to translate up by the ascender
to get from the upper edge of the glyph to the baseline.
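A sketch of the resulting adjustment (simplified; the exact call
sites and names may differ):
```c++
// Glyf::Glyph::append_simple_path() shifts outlines by
// -font_ascender * -y_scale, so subtract the scaled ascender to
// position glyphs relative to the baseline. metrics() already
// returns the ascender scaled by y_scale.
float scaled_ascender = font.metrics(x_scale, y_scale).ascender;
auto baseline_position = glyph_position.translated(0, -scaled_ascender);
```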
https://llvm.org/devmtg/2022-11/slides/TechTalk5-WhatDoesItTakeToRunLLVMBuildbots.pdf
has an xref table that starts like so:
```
xref
0 214
0000000002 65535 f
0000924663 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
0000000160 00000 n
0000000263 00000 n
```
This is a list of objects in the PDF file. The lines ending with 'f'
mean that this object is "free", that is, it's not stored in the file.
In this file, objects 0, 2, 3 are free. For free objects, the first
number is the offset of the next free object: Object 0 refers to object
2, 2 to 3, and 3 back to 0 (since it's the last free object).
The lines ending with "n" are actual objects; here the first number is
a byte offset to where that object is stored in the file.
Furthermore, the file contains
```
/Outlines
2
0
R
```
in its root object, meaning that object 2 stores the page outlines.
Since object 2 is set as free, there is no object 2. But the spec
says that an invalid object reference is just the null object.
This patch makes us return null objects for references to free
objects, and it also makes us treat a null object as /Outlines value
the same as not having /Outlines in the first place.
Fixes #23023 -- we can now open that file. (We don't render it super
well, but only for already-known reasons.)
Since I found it a bit confusing: XRefTable has two related methods
here:
1. has_object() returns if an object was explicitly listed in an
xref table. The first number right after `xref` is the start
index. So if an xref table were to start with `10`, we'd implicitly
create 10 leading objects for which has_object() would return false
2. is_object_in_use() returns true if an object that was in a table
(i.e. one where has_object() returns true) was listed with 'n' and
false if it was listed with 'f'.
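A sketch of how the two methods combine when resolving a reference
(simplified, hypothetical shape):
```c++
// A reference to a free object resolves to the PDF null object,
// per the spec's "an invalid reference is the null object".
PDFErrorOr<Value> Document::resolve_reference(u32 index)
{
    if (m_xref_table->has_object(index) && !m_xref_table->is_object_in_use(index))
        return nullptr; // the null object
    return TRY(m_parser->parse_object_with_index(index));
}
```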
DocumentParser::parse_object_with_index() should probably return a null
object for the `!has_object()` case as well instead of VERIFY()ing
that has_object() is true. But I haven't seen this in the wild yet,
so keeping as-is for now.