SMasks are greyscale images that are used as the alpha channel of a
different image.
JPEGs in PDFs are stored as streams with /DCTDecode filters, and
we have a separate code path for loading those in the PDF renderer.
That code path just calls our JPEG decoder, which creates bitmaps
with format BGRx8888.
So when we process an SMask for such a bitmap, we have to change
the bitmap's format to BGRA8888 in addition to setting alpha values
on all pixels.
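
A minimal sketch of the idea, assuming a simplified bitmap type. `Bitmap`,
`PixelFormat`, and `apply_smask()` here are illustrative stand-ins, not the
actual LibGfx or LibPDF API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real bitmap types.
enum class PixelFormat { BGRx8888, BGRA8888 };

struct Bitmap {
    PixelFormat format;
    std::vector<uint32_t> pixels; // 0xAARRGGBB, one entry per pixel
};

// Copies each greyscale mask value into the alpha byte of the matching pixel.
// Assumes mask.size() == bitmap.pixels.size().
void apply_smask(Bitmap& bitmap, std::vector<uint8_t> const& mask)
{
    for (size_t i = 0; i < bitmap.pixels.size(); ++i) {
        uint32_t color = bitmap.pixels[i] & 0x00ffffffu;
        bitmap.pixels[i] = color | (static_cast<uint32_t>(mask[i]) << 24);
    }
    // Without this, a BGRx8888 bitmap would keep ignoring the alpha byte
    // that was just written.
    bitmap.format = PixelFormat::BGRA8888;
}
```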
Per 1.7 spec 3.8.1, there are multiple logical string types:
* text strings
* ASCII strings
* byte strings
Text strings can be in UTF-16BE, PDFDocEncoding, or (since PDF 2.0)
UTF-8.
Byte strings, however, shouldn't be converted; they should be treated
as binary data.
This makes us no longer convert strings used for drawing page text.
TABLE 5.6 "Text-showing operators" lists the operands for text-showing
operators as just "string", not "text string" (even though these strings
confusingly are called "text strings" in the body text), so not doing
this there is correct (and matches other viewers).
We also no longer incorrectly convert strings used for crypto data
(such as passwords), if they start with a UTF-16BE or UTF-8 marker.
No behavior change for outlines and info dict entries.
https://pdfa.org/understanding-utf-8-in-pdf-2-0/ has a good overview of
this.
(ASCII strings only contain ASCII characters and behave the same
anyways.)
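
For illustration, a sketch of how the encoding of a text string can be
detected from its leading marker. The function and enum names are made up
for this example. The byte values are the standard UTF-16BE and UTF-8 byte
order marks. This must only ever be called for text strings, never for byte
strings:

```cpp
#include <cassert>
#include <string_view>

enum class TextStringEncoding { UTF16BE, UTF8, PDFDocEncoding };

// Hypothetical helper: picks the encoding of a *text string* from its prefix.
// Byte strings (page text operands, crypto data) must not go through this.
TextStringEncoding detect_text_string_encoding(std::string_view bytes)
{
    if (bytes.substr(0, 2) == "\xFE\xFF")
        return TextStringEncoding::UTF16BE;
    if (bytes.substr(0, 3) == "\xEF\xBB\xBF")
        return TextStringEncoding::UTF8;
    // No marker: fall back to PDFDocEncoding.
    return TextStringEncoding::PDFDocEncoding;
}
```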
This is a hack: Ideally we'd have a CMYK Bitmap pixel format,
and we'd convert to rgb at blit time. Then we could also apply color
profiles (which for CMYK images are CMYK-based).
Also, the colors for our CMYK->RGB conversion are off for PDFs,
and we have distinct codepaths for this in Gfx::Color (for paths)
and JPEGs. So when we fix that, we'll have to fix it in two places.
But this doesn't require a lot of code and it's a huge visual
progression, so let's go with it for now.
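
For reference, this is the widely used naive CMYK->RGB formula, written as
a standalone sketch. It is an assumption that it resembles what the decoder
does at load time; it is not the actual conversion code, and it ignores
color profiles entirely:

```cpp
#include <cassert>
#include <cstdint>

struct RGB {
    uint8_t r;
    uint8_t g;
    uint8_t b;
};

// Naive, profile-less conversion: each channel is scaled by the amount of
// "not ink" and "not black". All inputs are in the range 0..255.
RGB naive_cmyk_to_rgb(uint8_t c, uint8_t m, uint8_t y, uint8_t k)
{
    auto channel = [k](uint8_t ink) {
        return static_cast<uint8_t>((255 - ink) * (255 - k) / 255);
    };
    return { channel(c), channel(m), channel(y) };
}
```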
The file wasn't quite decided if it wanted to sort by ascii value
or by case folding. Now it uses ascii value, thanks to vim's
`:'<,'>sort`.
No behavior change.
This is a very inefficient implementation: Every time a type 3 font
glyph is drawn, we parse its operator stream and execute all the
operators therein.
We'll want to instead cache the glyphs in bitmaps (at least in most
cases), like we do for other fonts. But it's a good first step, and
all the coordinate math seems to work in the files I've tested.
Good test files from pdfa dataset 0000.zip:
- 0000559.pdf page 1 (and 2): Has a non-default font matrix;
text appears mirrored if the font matrix isn't handled correctly
- 0000425.pdf, page 1: Draws several glyphs in a single run;
glyphs overlap if Renderer::render_type3_glyph() ignores the
passed-in point
- 0000211.pdf, any page: Uses type 3 glyphs for all text.
Good perf test (already "reasonably fast"):
- 0000521.pdf, page 5 (or 7 or 16): The little red flag in the
purple box is a type 3 font glyph, and it's colored (which in part
means the first operator is `d0`, while all the other documents above
use `d1`)
Type 3 font glyphs begin with either `d0` or `d1`. If we bail out
with an "unsupported" error on the very first operator in a glyph,
we'll never paint the glyph.
Just stub these out for now. We probably want to do more in here in
the future (see "TABLE 5.10 Type 3 font operators" in the 1.7 spec).
They are the first operator in a type 3 charproc.
Operator.h already knew about them, but we didn't manage to parse
them, since they're the only two operators that contain a digit.
It's a bit unfortunate that fonts need to know about the renderer,
but type 3 fonts contain PDF drawing operators, so it's necessary.
On the bright side, it makes it possible to pass fewer parameters
around and compute things locally as needed.
(As we implement more fonts, we'll probably want to create some
functions to do these computations in a central place, eventually.)
No behavior change.
/BaseFont is a required key for type 0, type 1, and truetype
font dictionaries, but not for type 3 font dictionaries.
This is mechanical; type 0 fonts don't even use this yet
(but probably should).
PDFFont::initialize() is now empty and could be removed,
but maybe we'll put stuff there again later, so I'm leaving
it around for a bit longer.
In the main page contents, /T0 might refer to a different font than
it might refer to in an XObject. So don't use the `Tf` argument as
font cache key. Instead, use the address of the font dictionary object.
Fixes false cache sharing, and also allows us to share cache entries
if the same font dict is referred to by two different names.
Fixes a regression from 2340e834cd (but keeps the speed-up intact).
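
A sketch of the keying scheme, with made-up `FontDict`, `Font`, and
`FontCache` types standing in for the real ones. It assumes the dictionary
objects outlive the cache, so their addresses stay unique and stable:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for the real dictionary and font classes.
struct FontDict {
    std::string base_font;
};
struct Font {
    std::string description;
};

class FontCache {
public:
    // Keyed by the dictionary's address, not by the resource name from `Tf`.
    std::shared_ptr<Font> get_or_load(FontDict const& dict)
    {
        auto it = m_fonts.find(&dict);
        if (it != m_fonts.end())
            return it->second;
        auto font = std::make_shared<Font>(Font { dict.base_font });
        m_fonts.emplace(&dict, font);
        return font;
    }

private:
    std::map<FontDict const*, std::shared_ptr<Font>> m_fonts;
};
```

Two resource dictionaries that both map /T0 to different font dicts now get
distinct entries, and two names pointing at the same dict share one.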
It's less code, but it also fixes a bug: The implementation in
Filter.cpp used to use the previous byte as reference value, while
we're supposed to use the value of the previous channel as reference
(at least when a pixel is larger than one byte).
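
To illustrate with the PNG "Sub" filter (an assumption; the same reference
distance applies to the other PNG filters that look left): the PNG
specification defines the reference as the byte `bytes_per_pixel` positions
back, not the immediately preceding byte. A standalone sketch, not the code
in Filter.cpp or PNGLoader:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Undoes the PNG "Sub" filter on one scanline, in place.
// The first pixel has no left neighbor, so it is left as is.
void unfilter_sub(std::vector<uint8_t>& row, size_t bytes_per_pixel)
{
    for (size_t i = bytes_per_pixel; i < row.size(); ++i) {
        // Arithmetic is modulo 256, as required by the PNG spec.
        row[i] = static_cast<uint8_t>(row[i] + row[i - bytes_per_pixel]);
    }
}
```

With `bytes_per_pixel == 1` this degenerates to "previous byte", which is
why the old behavior was only wrong for pixels larger than one byte.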
No behavior change.
Ideally, the PDF code would just call a function in PNGLoader to do the
PNG unfiltering, but let's first try to make the implementations look
more similar.
These two static members are now used to implement the respective
`matches_` methods, but will also be useful for providing a global
implementation of the spec's concept of whitespace.
TJ acts on a list of either strings or numbers.
The strings are drawn, and the numbers are treated as offsets.
Previously, we'd only apply the last-seen number as offset when
we saw a string. That had the effect of us ignoring all but the
last number in front of a string, and ignoring numbers at the
end of the list.
Now, we apply all numbers as offsets.
Our rendering of Tests/LibPDF/text.pdf now matches other PDF viewers.
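
A simplified sketch of the fixed loop. It assumes a font size of 1, a fixed
per-character advance, and horizontal writing; `TJElement`, `PlacedString`,
and `show_tj()` are names made up for this example. Per the spec, each
number is in thousandths of a text space unit and is subtracted from the
current position:

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

using TJElement = std::variant<std::string, double>;

struct PlacedString {
    std::string text;
    double x;
};

// Walks a TJ array, applying *every* number as an offset, including
// consecutive numbers and numbers at the end of the array.
// Returns the final x position.
double show_tj(std::vector<TJElement> const& elements, double glyph_advance,
    std::vector<PlacedString>& out)
{
    double x = 0;
    for (auto const& element : elements) {
        if (auto const* offset = std::get_if<double>(&element)) {
            x -= *offset / 1000.0;
            continue;
        }
        auto const& text = std::get<std::string>(element);
        out.push_back({ text, x });
        x += glyph_advance * static_cast<double>(text.size());
    }
    return x;
}
```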
Per 5177.Type2.pdf 3.1 "Type 2 Charstring Organization",
a glyph's charstring looks like:
w? {hs* vs* cm* hm* mt subpath}? {mt subpath}* endchar
The `w?` is the width of the glyph, but it's optional. So all
possible commands after it (hstem* vstem* cntrmask hintmask
moveto endchar) check if there's an extra number at the start
and interpret it as a width, for the very first command we read.
This was done by having an `is_first_command` local bool that
got set to false after the first command. That didn't work with
subrs: If the first command was a call to a subr that just pushed
a bunch of numbers, then the second command after it is the actual
first command.
Instead, move that bool into the state. Set it to false the
first time we try to read a width, since that means we just read
a command that could've been prefixed by a width.
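
A heavily reduced sketch of that state handling. All names are hypothetical,
only `rmoveto` is modeled, and a subr is modeled as something that just
pushes numbers; the real interpreter is considerably more involved:

```cpp
#include <cassert>
#include <optional>
#include <vector>

struct CharstringState {
    std::vector<double> stack;
    bool is_first_command = true; // lives in the state, so it survives subr calls
    std::optional<double> width;
};

// Called by every command that may be prefixed by a width.
// `has_extra_operand` is true if the stack holds one more value than the
// command itself needs.
void maybe_read_width(CharstringState& state, bool has_extra_operand)
{
    if (!state.is_first_command)
        return;
    state.is_first_command = false;
    if (has_extra_operand) {
        state.width = state.stack.front();
        state.stack.erase(state.stack.begin());
    }
}

// A subr that only pushes numbers never tries to read a width,
// so it leaves is_first_command untouched.
void call_subr_pushing(CharstringState& state, std::vector<double> const& values)
{
    state.stack.insert(state.stack.end(), values.begin(), values.end());
}

// rmoveto takes two operands; a third one in front is the width.
void rmoveto(CharstringState& state)
{
    maybe_read_width(state, state.stack.size() == 3);
    state.stack.clear(); // the actual move is omitted in this sketch
}
```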
Images can have multiple filters, each one of them is processed
sequentially. Only the last one will be relevant for the image format
(DCT or JPXDecode), so use the last filter instead of the first one to
detect that property.
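
In sketch form, with filter names as plain strings and a made-up
`ImageEncoding` enum (the real code works on PDF name objects):

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class ImageEncoding { Raw, JPEG, JPEG2000 };

// Filters are applied in order, so only the last one says what format the
// data is in once all earlier filters have been undone.
ImageEncoding image_encoding_for(std::vector<std::string> const& filters)
{
    if (filters.empty())
        return ImageEncoding::Raw;
    auto const& last = filters.back();
    if (last == "DCTDecode")
        return ImageEncoding::JPEG;
    if (last == "JPXDecode")
        return ImageEncoding::JPEG2000;
    return ImageEncoding::Raw;
}
```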
For valid PDFs, this makes no difference.
For invalid PDFs, we now assert during the cast in resolve_to() instead
of returning a PDFError. However, most PDFs are valid, and even for
invalid PDFs, we'd previously keep the old color space around when
getting the PDF error and then usually assert later when the old
color space got passed a color with an unexpected number of components
(since the components were for the new color space).
Doesn't affect any of the > 2000 PDFs I use for testing locally,
is less code, and should make for less surprising asserts when it
does happen.
Namely, for CalGrayColorSpace, CalRGBColorSpace, LabColorSpace.
Fixes a crash rendering any page of Adobe's 5014.CIDFont_Spec.pdf
(which uses CalRGBColorSpace with an indirect dict: The dict is
object `92 0`, and many color spaces are inline objects referring
to it).
I didn't find example code for this and the AI assistant did very
poorly on this as well. So I had to write it all by myself!
It could probably be much more efficient, but I think the overall
shape is maybe roughly fine.
* SampledFunction now keeps the StreamObject it gets data from alive
(doesn't matter too much in practice, but does matter in the test,
where nothing else keeps the stream alive).
* If a sample position is an integer, we would previously sample that
value twice and then divide by zero when interpolating. Make sure to
sample 1 unit apart.
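
A one-dimensional sketch of the second fix. `sample_linear()` is a made-up
name, and the real function handles multiple dimensions and encoded sample
data. Using floor/ceil for the two sample indices gives the same index twice
at integer positions; using `i0` and `i0 + 1` keeps them one unit apart:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Linearly interpolates in a table of samples. Assumes samples is non-empty.
double sample_linear(std::vector<double> const& samples, double x)
{
    size_t last = samples.size() - 1;
    x = std::max(0.0, std::min(x, static_cast<double>(last)));

    size_t i0 = static_cast<size_t>(std::floor(x));
    if (i0 >= last)
        return samples[last]; // no neighbor to the right
    size_t i1 = i0 + 1;       // always exactly one unit apart from i0

    double t = x - static_cast<double>(i0); // the denominator is i1 - i0 == 1
    return samples[i0] + t * (samples[i1] - samples[i0]);
}
```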