When parsing streams we rely on a /Length item being defined in the
stream's dictionary to know how much data comprises the stream. Its
value is usually direct, but it can also be an indirect reference.
There was, however, a contradiction in the code: the condition that
allowed the parser to read and use the /Length value required it to be
a direct value, yet the code actually using the value would have worked
with indirect values too. This meant that indirect /Length values
triggered the fallback, "manual" stream parsing code.
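For illustration, a stream whose /Length is stored in a separate
(indirect) object looks roughly like this (object numbers invented):

    4 0 obj
    << /Length 5 0 R >>   % indirect reference instead of a number
    stream
    ...42 bytes of data...
    endstream
    endobj
    5 0 obj
    42
    endobj

Before the xref table is available, the reference "5 0 R" cannot be
resolved, which forces the fallback "manual" parsing.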
On the other hand, this latter code was also buggy, because it relied on
the "endstream" keyword to appear on a separate line, which isn't always
the case.
This commit fixes the bug in the manual stream parsing scenario, and
also allows indirect /Length values to be used to parse streams more
directly, avoiding the manual approach. The main caveat to this second
change is that for a brief period of time the Document is not able to
resolve references (i.e., before the xref table itself has been
parsed). Any parsing happening before that (e.g., the linearization
dictionary) must therefore use the manual stream parsing approach.
While the clipping logic was correct (current vs. new clipping path),
the clipping path contents weren't. This commit fixes that.
We calculate the clipping path in two places: when we set it to be the
whole page at graphics state creation time, and when we perform clipping
path intersection to calculate a new clipping path. The clipping path is
then used to limit painting by passing it to the painter (more
precisely, by passing its bounding box to the painter, as the latter
doesn't support clipping against arbitrary paths). For this last step
the clipping path must be in device coordinates.
There was however a mix of coordinate systems involved in the creation,
update and usage of the clipping path:
* The initial values of the path (i.e., the whole page) were in user
coordinates.
* Clipping path intersection was performed against m_current_path,
which is in device coordinates.
* To perform the clipping operation, the current clipping path was
assumed to be in user coordinates.
This mix resulted in the clipping not working correctly depending on the
zoom level at which one visualised a page.
This commit fixes the issue by always keeping track of the clipping path
in device coordinates. This means that the initial full-page contents
are now converted to device coordinates before putting them in the
graphics state, and that no mapping is performed when applying the
clipping to the painter.
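A rough sketch of the invariant, with partly invented names (the
intersection helper in particular is hypothetical, since the painter
only ever sees the bounding box):

    // At graphics state creation: map the full page rectangle into
    // device coordinates before storing it.
    Gfx::Path clipping_path = full_page_path.copy_transformed(ctm);

    // On a clipping operator: both paths are already in device space,
    // so they can be intersected directly.
    clipping_path = intersect(clipping_path, m_current_path);

    // When painting: no mapping needed; hand the bounding box over.
    painter.add_clip_rect(clipping_path.bounding_box().to_rounded<int>());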
All "Simple Fonts" in PDF (all but Type0 fonts) have the property that
glyphs are selected with single byte character codes. This means that
the Encoding objects should use u8 for representing these character
codes. Moreover, as mentioned in a previous commit, there is no need
to store the Unicode code point associated with a character (which was
in turn wrongly associated with a glyph).
This commit greatly simplifies the Encoding class. Namely it:
* Removes the unnecessary CharDescriptor class.
* Changes the internal maps to be u8 -> FlyString and vice versa,
  effectively providing two-way lookups (see the sketch after this
  list).
* Adds a new method to set a two-way u8 -> FlyString mapping and uses
it in all possible places.
* Simplifies the creation of Encoding objects.
* Changes how the WinAnsi special treatment for bullet points is
implemented.
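A condensed sketch of the resulting shape (member names approximate):

    class Encoding : public RefCounted<Encoding> {
    public:
        // Sets a two-way mapping between character code and glyph name.
        void set(u8 char_code, FlyString const& glyph_name)
        {
            m_descriptors.set(char_code, glyph_name);
            m_name_mapping.set(glyph_name, char_code);
        }

    private:
        HashMap<u8, FlyString> m_descriptors;  // char code -> glyph name
        HashMap<FlyString, u8> m_name_mapping; // glyph name -> char code
    };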
When rendering text, a sequence of bytes corresponds to a glyph, but
not necessarily to a character. This misunderstanding permeated from
the Encoding class through to the Font classes, which were all trying
to calculate such values. Moreover, this was done only to identify
"space" characters/glyphs, which were getting special treatment (e.g.,
being skipped during rendering). Spaces are not special though -- there
might be fonts that render something for them -- and thus they should
not be skipped.
The initial values were fine, but those starting at 100 were wrong:
they are all octal values, but since they were missing the leading 0
they were interpreted as decimal.
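For example, in C++ source (where these tables live):

    int a = 100;  // decimal one hundred
    int b = 0100; // leading 0 makes it octal: decimal 64

so the spec's octal columns must keep their leading zero when
transcribed.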
In PDF fonts, encoding objects are used to translate bytes into
fonts' glyphs. The fonts we currently support organise their glyphs in
such a way that they are accessed by name, and thus encodings
translate between a byte sequence and a glyph name.
Note that at no point does this translation involve a Unicode
character, and therefore assigning a character to a glyph in the
Encoding object is the wrong thing to do. Moreover, using the code
point of such a character during the byte-sequence-to-glyph
translation is doubly wrong.
This commit removes the characters associated with each translation
in the built-in Encoding objects. In order to keep commits short and
sweet, I'm currently only removing the character from the enumeration,
leaving intact the old structure this information was held in.
Instead, I'm filling the "code_point" member with a zero, and filling
both mappings (which will be changed later on too) with the glyph name
and the associated char code.
The Compact Font Format specification (Adobe's Technical Note #5176)
is used by PDF's Type1C fonts to store their data. While similar in
spirit to PS1 Type 1 Font Programs, it was designed for a more compact
representation, and thus saves space at the cost of added complexity.
It also shares most of the charstring encoding logic, which is why the
CFF class also inherits from Type1FontProgram.
This initial implementation is still lacking many details, e.g.:
* It doesn't include all the built-in CFF SIDs
* It doesn't support CFF-provided SIDs (defaults those glyphs to the
space character)
* More checks in general
The Type1FontProgram logic was based on the Adobe Type 1 Font Format; in
particular, it implemented the CharStrings Dictionary section
(charstring decoding, and most commands). In the case of Type1, these
charstrings are read from a PS1 dictionary, with one entry per character
in the font's charset. This has served us well for Type1 font rendering.
When implementing Type1C font rendering, this wasn't enough. Type1C
PDF fonts are specified in embedded CFF (Compact Font Format) streams,
which also contain a charstring dictionary with an entry for each
character in the font's charset. These entries can be slightly
different from those in a PS1 Font Program though: depending on a flag
in the CFF, the entries will be encoded either in the original
charstring format from the Adobe Type 1 Font Format, or in the "Type 2
Charstring Format" (Adobe's Technical Note #5177). This new format is
for the most part a superset of the original, with small differences,
all in the name of making the representation as compact as possible:
* The glyph's width is not specified via a separate command; instead
  it's an optional additional argument to the first command of the
  charstring stream (and even then, it's only the *difference* to a
  nominal character width specified in the CFF).
* The interpretation of a 4-byte number is different: in Type 1 it is
  an integer, whereas in Type 2 it's a fixed-point decimal with 16
  bits of fractional part (see the sketch after this list).
* Many commands accept a variable number of arguments, so they can
  draw more than one line/curve in a single go. These are all
  backwards compatible with Type 1's commands.
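A sketch of the 4-byte case, assuming b0..b3 are the bytes following
the 255 prefix:

    u32 bits = (u32(b0) << 24) | (u32(b1) << 16) | (u32(b2) << 8) | u32(b3);
    i32 raw = static_cast<i32>(bits);
    float type1_value = float(raw);            // Type 1: integer
    float type2_value = float(raw) / 65536.0f; // Type 2: 16.16 fixed-point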
All these changes are implemented in this patch in a
backwards-compatible way. To select between Type 1 and Type 2
behavior, a new parameter indicates which one is desired when decoding
the charstring stream.
I also took the chance to centralise some logic that was previously
duplicated across the parse_glyph function. Common lambdas capture the
logic for moving to, or drawing a line/curve to a given point and
updating the glyph state. Similarly, some command logic, including
parameter reading, is shared by several commands. Finally, I've
re-organised the cases in the main switch to group together related
commands.
We are planning to add support for reading Type1 fonts from CFF
streams, and therefore much of the logic already found in
PS1FontProgram will be useful for representing the Type1 fonts read
from CFF.
This commit moves the PS1-independent bits of PS1FontProgram into a new
Type1FontProgram base class that can be used as the base for CFF-based
Type1 fonts in the future. The Type1Font class uses this new type now
instead of storing a PS1FontProgram pointer. While doing this
refactoring I also took care of making some minor adjustments to the
PS1FontProgram API, namely:
* Its create() method is static and returns a
NonnullRefPtr<Type1FontProgram>.
* Many (all?) of the parse_* methods are now static.
* Added const where possible.
Notably, the Type1FontProgram currently also contains the code that
parses the CharString data from the PS1 program. This logic is very
similar in CFF files, so after some minor adjustments later on it should
be possible to reuse most of it.
This might not be an issue at the moment, but moved-from objects are
usually in an unspecified but valid state, meaning that we shouldn't
read from them.
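For example:

    Vector<int> a { 1, 2, 3 };
    auto b = move(a);
    // `a` is now moved-from: valid but unspecified. a.size() might be
    // 0, but nothing guarantees it -- don't read from `a` anymore.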
When trying to figure out the correct implementation, we now have a
clear distinction between plugins that are well suited for sniffing
and plugins that need a MIME type to be chosen.
Instead of having multiple calls to non-static virtual sniff methods
for each image decoding plugin, we have two static methods for each
implementation:
1. The sniff method, which, in contrast to the old method, gets a
   ReadonlyBytes parameter and ensures we can figure out the result
   with zero heap allocations for most implementations.
2. The create method, which just creates a new instance so we don't
   expose the constructor to everyone anymore.
In addition to that, we have a new virtual method called initialize,
which follows a per-implementation initialization pattern: it ensures
each implementation can construct a decoder object, and then applies a
correct context to it for the actual decoding.
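Per implementation, the surface now looks roughly like this (a sketch;
the exact signatures in LibGfx may differ):

    class PNGImageDecoderPlugin final : public ImageDecoderPlugin {
    public:
        // Static: no instance or heap allocation needed to sniff.
        static bool sniff(ReadonlyBytes);

        // Static factory, keeping the constructor private.
        static ErrorOr<NonnullOwnPtr<ImageDecoderPlugin>> create(ReadonlyBytes);

        // Per-implementation setup of the decoding context.
        virtual ErrorOr<void> initialize() override;

    private:
        explicit PNGImageDecoderPlugin(ReadonlyBytes);
    };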
DeprecatedFlyString relies heavily on DeprecatedString's StringImpl,
so let's rename it to A) match the name of DeprecatedString, and B)
make room for a new FlyString class that is tied to String.
Previously, we would assume that all standard 14 fonts use a
TrueTypeFont dictionary. Now we render them with Type1Font as well,
provided the font doesn't contain a PostScript font program.
PDF allows for named destinations to be provided as strings. These can
be found either in the Dests dictionary in the document catalogue (as
already implemented), or in the Name Tree specified by the Dests key
in the Names dictionary of the document catalogue (which was missing).
This commit adds this missing case. Once the named destination is found
in the name tree, its value is interpreted just like in the first case,
so a new utility method encapsulates the common behavior.
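In PDF terms, the newly supported lookup path goes through a structure
like this (object numbers invented):

    << /Type /Catalog
       /Names << /Dests 12 0 R >>  % 12 0 R is the root of a name tree
    >>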
Name Trees are hierarchical, string-keyed, sorted-by-key dictionary
structures in PDF where each node (except the root) specifies the bounds
of the values it holds, and either its kids (more nodes) or the
key/value pairs it contains.
This commit implements a series of lookup calls for finding a key in
such name trees. This implementation follows the tree as needed on each
lookup, but if that becomes inefficient in the long run we can switch to
creating a HashMap with all the contents, which as a drawback will
require more memory.
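A small name tree might look like this (names and object numbers
invented):

    2 0 obj  % root: kids only
    << /Kids [3 0 R 4 0 R] >>
    endobj
    3 0 obj  % leaf: bounds plus sorted key/value pairs
    << /Limits [(chapter1) (chapter4)]
       /Names [(chapter1) 10 0 R (chapter4) 11 0 R]
    >>
    endobj

A lookup compares the key against each kid's /Limits to pick the
subtree to descend into, and finally scans the /Names array of a leaf.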
Both being containers, these classes already offered a set of methods
to retrieve an inner element by key or index, respectively, with
different methods for the different subtypes of the PDF::Object type
returning the element cast to the correct type pointer. On top of
that, DictObject offered an additional method to obtain an element as
an Object pointer.
While these methods were useful, they have some shortcomings:
* They always take a Document pointer to first perform an object
resolution, in case the element is a Reference. This is not always
necessary though, as there are values that are always meant to be
immediate, and hence the resolution lookup adds overhead.
* There was no easy way to get an individual Object element from an
  ArrayObject like there is in DictObject. This made it difficult to
  obtain such values, as one first needed to call the generic getter
  to obtain a Value, then manually cast it to a NonnullRefPtr<Object>.
This commit fixes these two issues by:
* Adding a new method that returns an Object for a given index.
* Adding overloads for this new method, and all the existing methods
described above, that do *not* take a Document, and therefore do
*not* perform an object resolution lookup.
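Sketching the resulting pairs of signatures for the array case
(illustrative only):

    class ArrayObject final : public Object {
    public:
        // Resolves the element through the Document if it's a Reference.
        NonnullRefPtr<Object> get_object_at(Document*, size_t index) const;
        // No resolution: for elements known to be immediate objects.
        NonnullRefPtr<Object> get_object_at(size_t index) const;
    };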
This functionality was previously part of the resolve_to() Document
method, and thus available only when resolving objects through the
Document class. There are many use cases where this casting can be
used, but no resolution is needed.
This commit moves this functionality into a new cast_to function, and
makes the resolve_to function call it internally. With this new function
in place we can now offer new versions of DictObject::get_* and
ArrayObject::get_*_at that don't perform Document resolution
unnecessarily when not required.
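A possible shape for the split, with error propagation omitted:

    template<typename T>
    static NonnullRefPtr<T> cast_to(Value const& value); // cast only

    template<typename T>
    NonnullRefPtr<T> Document::resolve_to(Value const& value)
    {
        // Follow the Reference (if any), then delegate the cast.
        return cast_to<T>(resolve(value));
    }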
Destination arrays contain a page number, a mode name, and parameters
specific to that mode. In many cases these parameters can be set to
"null", which our code wasn't taking into consideration.
This commit parses these parameters taking into account whether they are
null or actual numbers, and stores them as Optional<float> instead of
plain floats. The parameters are not yet used anywhere else other than
when formatting a Destination object, so the change is fairly small.
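For example, both of these are valid destination arrays for the /XYZ
mode:

    [3 0 R /XYZ 100 700 1.5]      % explicit left/top/zoom
    [3 0 R /XYZ null null null]   % keep the viewer's current values

With Optional<float>, the nulls simply become empty optionals.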
Before this patch, the generation of the encryption key was not
working correctly: since the lifetime of the underlying data was too
short, the same inputs would give random encryption keys.
Fixes #16668
These instances were detected by searching for files that include
AK/StdLibExtras.h, but don't match the regex:
\\b(abs|AK_REPLACED_STD_NAMESPACE|array_size|ceil_div|clamp|exchange|for
ward|is_constant_evaluated|is_power_of_two|max|min|mix|move|_RawPtr|RawP
tr|round_up_to_power_of_two|swap|to_underlying)\\b
(Without the linebreaks.)
This regex is pessimistic, so there might be more files that don't
actually use any "extra stdlib" functions.
In theory, one might use LibCPP to detect things like this
automatically, but let's take this one step at a time.
In 7c5e30daaa, the focus was "only" on
Userland/Libraries/, whereas this commit cleans up the remaining
headers in the repo, and any new badly-formatted include.
When an attempt is made to provide the user password to a
SecurityHandler, the caller gets back a boolean result indicating
success or failure of the attempt. However, the SecurityHandler is
left in a state where it thinks it has a user password, regardless of
the outcome of the attempt. This confuses the rest of the system,
which continues as if the provided password were correct, resulting in
garbled content.
This commit fixes the situation by resetting the internal fields holding
the encryption key (which is used to determine whether a user password
has been successfully provided) in case of a failed attempt.
With the StandardSecurityHandler the Length item in the Encryption
dictionary is optional, and needs to be given only if the encryption
algorithm (V) is other than 1; otherwise we can assume a length of 40
bits for the encryption key.
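In code, this amounts to something like (hypothetical sketch):

    // V == 1 implies 40-bit keys; otherwise /Length must be present.
    int length_in_bits = 40;
    if (encryption_dict->contains(CommonNames::Length))
        length_in_bits = encryption_dict->get_value(CommonNames::Length).get<int>();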
The Value previously stored corresponded to a Reference to a Page object
in the PDF document. This isn't useful information, since what we want
to display at the end of the day is the page an outline item refers to.
This commit changes the page member on OutlineItem to be an
Optional<u32> (some destinations don't necessarily refer to a page),
which we resolve while building OutlineItems.
While OutlineItem had a parent field, it was never populated nor used.
This commit populates it when possible (no parent means the OutlineItem
is a top-level item).
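After both changes, the relevant parts of OutlineItem look roughly
like:

    struct OutlineItem : public RefCounted<OutlineItem> {
        // Resolved while building the outline; empty when the
        // destination doesn't refer to a page.
        Optional<u32> page;
        // Now populated; null means this is a top-level item.
        RefPtr<OutlineItem> parent;
    };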
Instead of calling TODO(), which will abort the program, we now return
an Error specifying that we haven't implemented the drawing operation
yet. This will now nicely trickle up all the way through to the
PDFViewer, which will then notify its clients about the problem.
The current rendering routine aborts as soon as an error is found during
rendering, which potentially severely limits the contents we show on
screen. Moreover, whenever an error happens the PDFViewer widget shows
an error dialog, and doesn't display the bitmap that has been painted so
far.
This commit improves the situation on both fronts, implementing
rendering now with a best-effort approach. Firstly, execution of
operations isn't halted after an operation results in an error;
instead execution of all operations is always attempted, and all
collected errors are returned in bulk. Secondly, PDFViewer now always
displays the resulting bitmap, regardless of whether errors were
produced. To communicate errors, an on_render_errors callback has been
added so clients can subscribe to these events and handle them as
appropriate.
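The execution loop thus becomes, schematically (invented names):

    Vector<Error> errors;
    for (auto& op : operations) {
        // Keep going after a failure instead of bailing out.
        if (auto result = execute(op); result.is_error())
            errors.append(result.release_error());
    }
    if (!errors.is_empty())
        m_on_render_errors(errors); // reported in bulk to the client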
This will be used to perform a best-effort rendering, where an error in
rendering won't abort the whole rendering operation, but instead will be
stored for later reference while rendering continues.
The code parsing comments parsed only a single line of comments, but
callers assumed it parsed all comments appearing contiguously in a
block. The latter is an easier-to-understand API, so this commit
changes the parse_comment function to parse entire blocks of comments
instead of single lines.
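Schematically, the new behavior wraps the old single-line logic in a
loop (names approximate):

    void Parser::parse_comment()
    {
        // Consume every contiguous comment line, not just the first.
        while (m_reader.peek() == '%') {
            skip_comment_line();          // old single-line behavior
            skip_whitespace_and_newline();
        }
    }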