Type0 fonts can be either CFF-based or TrueType-based.
Create a subclass for each, put in some spec text, and
give each case a dedicated error code, so that `--debugging-stats`
can tell me which branch is more common.
LibGfx's ScaledFont doesn't do this, but in ScaledFont m_x_scale and
m_y_scale are immutable once the class is created, so it can get away
with not doing it.
In Type1Font, `width` changes in different calls to
Type1Font::draw_glyph(), so we need to make it part of the cache key.
Fixes rendering of the word "Version" on the first page of
pdf_reference_1-7.pdf.
If the font dictionary didn't specify custom glyph widths, we would fall
back to the specified "missing width" (or 0 in most cases!), which meant
that we would draw glyphs on top of each other in a lot of cases, namely
for TrueTypeFonts or standard Type1Fonts with an OpenType fallback.
What we actually want to do in this case is ask the OpenType font for
the correct width.
A limit of 1024 subroutines seemed like a sensible choice, but some
fonts actually do exceed it. We will now only assert that the specified
amount is positive.
These are not yet actually parsed, but detecting them means we at least
don't fail to understand the *actual* format value, which was causing
some CFF fonts to fail to load.
Type1 imposes a stack limit of 24 elements, but Type2 has a limit of 48.
We are better off relaxing the limit of the former in favour of properly
supporting the latter.
There were two issues with how we counted hints with Type2 CharString
commands: the first was that we assumed a single hint per command, even
though there are commands that accept multiple hints thanks to taking a
variable number of operands; and secondly, the hintmask/ctrlmask
commands can also take operands (i.e., hints) themselves in certain
situations.
This commit fixes these two issues by correctly counting hints in both
cases. This in turn fixes cases when there were more than 8 hints in
total, therefore a hintmask/ctrlmask command needed to read more than
one byte past the operator itself.
The PDFFont class hierarchy was very simple (a top-level PDFFont class,
followed by all the children classes that derived directly from it).
While this design was good enough for some things, it didn't correctly
model the actual organization of font types:
* PDF fonts are first divided between "simple" and "composite" fonts.
The latter is the Type0 font, while the rest are all simple.
* PDF fonts yield a glyph per "character code". Simple fonts char codes
are always 1 byte long, while Type0 char codes are of variable size.
To this effect, this commit changes the hierarchy of Font classes,
introducing a new SimpleFont class, deriving from PDFFont, and acting as
the parent of Type1Font and TrueTypeFont, while Type0 still derives from
PDFFont directly. This distinction allows us now to:
* Model string rendering differently from simple and composite fonts:
PDFFont now offers a generic draw_string method that takes a whole
string to be rendered instead of a single char code. SimpleFont
implements this as a loop over individual bytes of the string, with
T1 and TT implementing draw_glyph for drawing a single char code.
* Some common fields between T1 and TT fonts now live under SimpleFont
instead of under PDFfont, where they previously resided.
* Some other interfaces specific to SimpleFont have been cleaned up,
with u16/u32 not appearing on these classes (or in PDFFont) anymore.
* Type0Font's rendering still remains unimplemented.
As part of this exercise I also took the chance to perform the following
cleanups and restructurings:
* Refactored the creation and initialisation of fonts. They are all
centrally created at PDFFont::create, with a virtual "initialize"
method that allows them to initialise their inner members in the
correct order (parent first, child later) after creation.
* Removed duplicated code.
* Cleaned up some public interfaces: receive const refs, removed
unnecessary ctro/dtors, etc.
* Slightly changed how Type1 and TrueType fonts are implemented: if
there's an embedded font that takes priority, otherwise we always
look for a replacement.
* This means we don't do anything special for the standard fonts. The
only behavior previously associated to standard fonts was choosing an
encoding, and even that was under questioning.
The first iteration has enough SIDs to display simple documents, but
when trying more and more documents we started to need more of these
SIDs to be properly defined. This is a copy/paste exercise from the CFF
document, which is tedious, so it will continue in small drops.
This commit fills all the gaps until SID 228, which covers all the
ISOAdobe space, and should be enough for most use cases. Since this is a
continuous space starting at 0, we now use an Array instead of a Map to
store these names, which should be more performant. Also to simplify
things I've moved the Array out of the CFF class, making it a simpler
static variable, which allows us to use template type deduction.
When loading OpenType fonts, either as a replacement for the standard
14 fonts or an embedded one, we previously passed the font size as the
_point_ size to the loader class. The difference is quite subtle, being
that Gfx::ScaledFont uses the optional dpi parameter to convert the
input from inches to pixels.
This meant that our glyphs were exactly 1.333% too large, causing them
to overlap in places.
The mapping of standard font to replacement now looks like this:
Times New Roman -> Liberation Serif
Courier -> Liberation Mono
Helvetica, Arial -> Liberation Sans
The seac command provides the base and accented character that are
needed to create an accented character glyph. Storing these values is
all that was left to properly support these composed glyphs.
Type1 accented character glyphs are composed of two other glyphs in the
same font: a base glyph and an accent glyph, given as char codes in the
standard encoding. These two glyphs are then composed together to form
the accented character.
This commit adds the data structures to hold the information for
accented characters, and also the routine that composes the final glyph
path out of the two individual components. All glyphs must have been
loaded by the time this composition takes place, and thus a new
protected consolidate_glyphs() routine has been added to perform this
calculation.
Glyph was a simple structure, but even now it's become more complex that
it was initially. Turning it into a class hides some of that complexity,
and make sit easier to understand to external eyes.
While doing this I also decided to remove the float + bool combo for
keeping track of the glyph's width, and replaced it with an Optional
instead.
Storing glyphs indexed by char code in a Type1 Font Program binds a Font
Program instance to the particular Encoding that was used at Font
Program construction time. This makes it difficult to reuse Font Program
instances against different Encodings, which would be otherwise
possible.
This commit changes how we store the glyphs on Type1 Font Programs.
Instead of storing them on a map indexed by char code, the map is now
indexed by glyph name. In turn, when rendering a glyph we use the
Encoding object to turn the char code into a glyph name, which in turn
is used to index into the map of glyphs.
This is the first step towards reusability of Type1 Font Programs. It
also unlocks the ability to render glyphs that are described via the
"seac" command (standard encoding accented character), which requires
accessing the base and accent glyphs by name.
All "Simple Fonts" in PDF (all but Type0 fonts) have the property that
glyphs are selected with single byte character codes. This means that
the Encoding objects should use u8 for representing these character
codes. Moreover, and as mentioned in a previous commit, there is no need
to store the unicode code point associated with a character (which was
in turn wrongly associated to a glyph).
This commit greatly simplifies the Encoding class. Namely it:
* Removes the unnecessary CharDescriptor class.
* Changes the internal maps to be u8 -> FlyString and vice-versa,
effectively providing two-way lookups.
* Adds a new method to set a two-way u8 -> FlyString mapping and uses
it in all possible places.
* Simplified the creation of Encoding objects.
* Changes how the WinAnsi special treatment for bullet points is
implemented.
When rendering text, a sequence of bytes corresponds to a glyph, but not
necessarily to a character. This misunderstanding permeated through the
Encoding through to the Font classes, which were all trying to calculate
such values. Moreover, this was done only to identify "space"
characters/glyphs, which were getting a special treatment (e.g., avoid
rendering). Spaces are not special though -- there might be fonts that
render something for them -- and thus should not be skipped
The Compat Font Format specification (Adobe's Technical Note #5176) is
used by PDF's Type1C fonts to store their data. While being similar in
spirit to PS1 Type 1 Font Programs, it was designed for a more compact
representation and thus space reduction (but an increment on
complexity). It also shares most of the charstring encoding logic, which
is why the CFF class also inherits from Type1FontProgram.
This initial implementation is still lacking many details, e.g.:
* It doesn't include all the built-in CFF SIDs
* It doesn't support CFF-provided SIDs (defaults those glyphs to the
space character)
* More checks in general
The Type1FontProgram logic was based on the Adobe Type 1 Font Format; in
particular, it implemented the CharStrings Dictionary section
(charstring decoding, and most commands). In the case of Type1, these
charstrings are read from a PS1 diciontary, with one entry per character
in the font's charset. This has served us well for Type1 font rendering.
When implementing Type1C font rendering, this wasn't enough. Type1C PDF
fonts are specified in embedded CFF (Compact Font File) streams, which
also contain a charstring dictionary with an entry for each character in
the font's charset. These entries can be slightly different from those
in a PS1 Font Program though: depending on a flag in the CFF, the
entries will be encoded either in the original charstring format from
the Adobe Type 1 Font Format, or in the "Type 2 Charstring Format"
(Adobe's Technical Note #1577). This new format is for the most part a
super-set of the original, with small differences, all in the name of
making the representation as compact as possible:
* The glyph's width is not specified via a separate command; instead
it's an optional additional argument to the first command of the
charstring stream (and even then, it's only the *difference* to a
nominal character width specified in the CFF).
* The interpretation of a 4-byte number is different from Type 1: in
Type 1 this is a 4-byte unsigned integer, whereas in Type 1 it's a
fixed decimal with 16 bits of fractional part.
* Many commands accept a variable set of arguments, so they can draw
more than one line/curve on a single go. These are all
retro-compatible with Type 1's commands.
All these changes are implemented in this patch in a
backwards-compatible way. To ensure Type 1/2 behavior is accessed, a new
parameter indicates which behavior is desired when decoding the
charstring stream.
I also took the chance to centralise some logic that was previously
duplicated across the parse_glyph function. Common lambdas capture the
logic for moving to, or drawing a line/curve to a given point and
updating the glyph state. Similarly, some command logic, including
reading parameters, are shared by several commands. Finally, I've
re-organised the cases in the main switch to group together related
commands.
We are planning to add support for CFF fonts to read Type1 fonts, and
therefore much of the logic already found in PS1FontProgram will be
useful for representing the Type1 fonts read from CFF.
This commit moves the PS1-independent bits of PS1FontProgram into a new
Type1FontProgram base class that can be used as the base for CFF-based
Type1 fonts in the future. The Type1Font class uses this new type now
instead of storing a PS1FontProgram pointer. While doing this
refactoring I also took care of making some minor adjustments to the
PS1FontProgram API, namely:
* Its create() method is static and returns a
NonnullRefPtr<Type1FontProgram>.
* Many (all?) of the parse_* methods are now static.
* Added const where possible.
Notably, the Type1FontProgram also contains at the moment the code that
parses the CharString data from the PS1 program. This logic is very
similar in CFF files, so after some minor adjustments later on it should
be possible to reuse most of it.
This might not be an issue at the moment, but moved-from objects are
usually in a unspecifed but valid state, meaning that we shouldn't read
from them.
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl, so
let's rename it to A) match the name of DeprecatedString, B) write a new
FlyString class that is tied to String.
Previously, we would assume that all standard 14 fonts use a
TrueTypeFont dictionary. Now we render them in Type1Font as well,
given that it doesn't contain a PostScript font program.
This command is meant to print an Standard Encoding Accented Character.
It's not critical to implement it yet, but if we want to render more
documents we need to handle the instruction, even if simply ignore it.
Fonts with the encoding name "WinAnsiEncoding" should render missing
characters above character code 040 (octal) as a "bullet" character.
This patch adds Encoding::should_map_to_bullet(char_code) which is then
called by char_code_to_code_point() to check if the given char code
should be displayed as a bullet instead.
I didn't have a good way to test this, so I've only verified that it
works by manually overriding inputs to the function during the rendering
stage.
This takes care of a FIXME in the Annex D part of the PDF specification.