Extending the borders on reference frames, so that motion vectors that
point outside the reference frame still land on valid samples, allows
`predict_inter_block()` to avoid some of the branches that clamp the
sample coordinates in its loops.
This results in about a 25% improvement in decode time of a motion-
heavy YouTube video (~20.8s -> ~15.6s).
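As a rough illustration of the idea (not the decoder's actual code), the
sketch below pads a plane by replicating its edge samples. `PaddedPlane`,
`extend_borders()` and the `border` parameter are hypothetical names, and
the border is assumed to be wide enough for any coordinate the prediction
loop can produce.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical plane that stores `border` extra samples on every side.
struct PaddedPlane {
    int width;
    int height;
    int border;
    std::vector<uint16_t> samples;

    int stride() const { return width + 2 * border; }

    // Valid for x in [-border, width + border) and y in [-border, height + border).
    uint16_t& at(int x, int y)
    {
        return samples[static_cast<size_t>(y + border) * stride() + (x + border)];
    }
};

// Copies a row-major width * height plane into a padded plane, replicating
// the nearest edge sample into the border. The clamping happens once here,
// so readers of the padded plane do not need to clamp.
PaddedPlane extend_borders(std::vector<uint16_t> const& source, int width, int height, int border)
{
    PaddedPlane plane { width, height, border, {} };
    plane.samples.resize(static_cast<size_t>(plane.stride()) * (height + 2 * border));
    for (int y = -border; y < height + border; y++) {
        int source_y = std::max(0, std::min(y, height - 1));
        for (int x = -border; x < width + border; x++) {
            int source_x = std::max(0, std::min(x, width - 1));
            plane.at(x, y) = source[static_cast<size_t>(source_y) * width + source_x];
        }
    }
    return plane;
}
```

A prediction loop reading through `at()` can then use unclamped
coordinates as long as they stay within the border.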
Moving the clamping of the coordinates of the reference frame samples
as well as some bounds checks outside of the loop significantly reduces
the branches needed in `predict_inter_block()`.
This results in a whopping ~41% improvement in decode performance
of an inter-prediction-heavy YouTube video (~35.4s -> ~20.8s).
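One way to hoist clamping out of a sample loop, shown here only as a
sketch under assumptions: compute where the row leaves the frame once,
then run three loops that need no per-sample clamp. `copy_row_clamped()`
and its parameters are hypothetical and this is not necessarily how
`predict_inter_block()` is structured.

```cpp
#include <algorithm>
#include <cstdint>

// Reads `count` samples starting at column `start_x` from a row that is
// `width` samples wide (width >= 1), repeating the first or last sample
// for columns that fall outside of the row.
void copy_row_clamped(uint16_t const* row, int width, int start_x, int count, uint16_t* out)
{
    // Output indices in [0, left_end) lie left of the row.
    int left_end = std::max(0, std::min(-start_x, count));
    // Output indices in [left_end, interior_end) lie inside the row.
    int interior_end = std::max(left_end, std::min(width - start_x, count));

    int i = 0;
    for (; i < left_end; i++)
        out[i] = row[0];
    for (; i < interior_end; i++)
        out[i] = row[start_x + i];
    for (; i < count; i++)
        out[i] = row[width - 1];
}
```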
Changing the calculation of reference frame scale factors to be done on
a per-frame basis reduces the amount of work done in
`predict_inter_block()`, which is a big hotspot in most videos.
This reduces decode times in a test video from YouTube by about 5%
(~37.2s -> ~35.4s).
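For illustration, a per-frame calculation could look like the sketch
below. The 14-bit fixed-point shift and the unrounded division are
assumptions about the scale factor definition, and `ScaleFactors` and
`compute_scale_factors()` are hypothetical names.

```cpp
#include <cstdint>

// Assumed fixed-point precision of the reference scale factors.
constexpr int reference_scale_shift = 14;

struct ScaleFactors {
    int32_t x_scale;
    int32_t y_scale;
};

// Computes how much larger the reference frame is than the current frame,
// as a fixed-point ratio. Intended to run once per frame for each
// reference, so the block prediction code only reads the stored result.
ScaleFactors compute_scale_factors(int32_t reference_width, int32_t reference_height,
    int32_t frame_width, int32_t frame_height)
{
    return {
        (reference_width << reference_scale_shift) / frame_width,
        (reference_height << reference_scale_shift) / frame_height,
    };
}
```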
This changes the order of the loop copying data to a reference frame
store so that it copies each row in a contiguous line rather than
copying a column at a time, which caused unnecessary branches.
This reduces the decode time on a fairly long 720p YouTube video by
about 14.5% (~43.5s to ~37.2s).
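The loop order described above can be sketched as follows, with rows in
the outer loop so that the inner loop walks contiguous memory.
`copy_plane_by_rows()` and its stride parameters are hypothetical and
only meant to show the shape of the loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies a width * height region between two row-major buffers that may
// have different strides. Each row is copied as one contiguous run.
void copy_plane_by_rows(std::vector<uint16_t> const& source, size_t source_stride,
    std::vector<uint16_t>& destination, size_t destination_stride,
    size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        uint16_t const* source_row = source.data() + y * source_stride;
        uint16_t* destination_row = destination.data() + y * destination_stride;
        for (size_t x = 0; x < width; x++)
            destination_row[x] = source_row[x];
    }
}
```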
This doesn't appear to have had a measurable impact on performance,
and behavior is the same.
With the tiles using independent BooleanDecoders with their own
backing BitStreams, we're even one step closer to threaded tiles!
Checking the bounds of the intermediate values was only implemented to
help debug the decoder. However, it is non-fatal to have the values
exceed the spec-defined bounds, and the checks cause a measurable
performance reduction.
Additionally, the checks were implemented as an assertion, which is
easily triggered by bad input files.
I see about a 4-5% decrease in decoding times in the `webm_in_vp9` test
in TestVP9Decode.
That matches the terminology used in ITU-T Rec. H.273,
PNG's cICP chunk, and the ICC cicpTag.
Also change the enum values to match the values in the spec --
0 means "not full range" and 1 means "full range".
(For now, keep the "Unspecified" entry around, and give it value 2.
This value is not in the spec.)
No intended behavior change.
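A sketch of an enum with these values is shown below. The type name
follows the H.273 term `VideoFullRangeFlag`, while the enumerator names
and the helper function are hypothetical.

```cpp
#include <cstdint>

enum class VideoFullRangeFlag : uint8_t {
    Studio = 0,      // "not full range" in the spec
    Full = 1,        // "full range" in the spec
    Unspecified = 2, // not a spec value, kept around for now
};

// Hypothetical helper: maps a flag bit read from a bitstream to the enum.
constexpr VideoFullRangeFlag full_range_flag_from_bit(bool bit)
{
    return bit ? VideoFullRangeFlag::Full : VideoFullRangeFlag::Studio;
}
```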
I previously changed it to use the absolute inter-prediction mode
values instead of the ones relative to NearestMv. That caused the
probability adaptation to take invalid indices from the counts and broke
certain videos.
Now it will just convert to the PredictionMode enum when returning from
parse_inter_mode, which allows us to still use it the same as before.
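To illustrate why the relative value matters, here is a minimal sketch.
The numeric values of `PredictionMode`, the four-entry counts array and
the function names are assumptions made for the example and are not
taken from the parser.

```cpp
#include <array>
#include <cstdint>

// Assumed absolute values of the inter modes within PredictionMode.
enum class PredictionMode : uint8_t {
    NearestMv = 13,
    NearMv = 14,
    ZeroMv = 15,
    NewMv = 16,
};

constexpr uint8_t inter_mode_count = 4;

// Converts a mode relative to NearestMv (0..3) to the absolute enum value.
constexpr PredictionMode to_prediction_mode(uint8_t relative_inter_mode)
{
    return static_cast<PredictionMode>(
        static_cast<uint8_t>(PredictionMode::NearestMv) + relative_inter_mode);
}

struct InterModeCounts {
    std::array<uint32_t, inter_mode_count> counts {};
};

// The counts are indexed by the relative value. Indexing them with an
// absolute value such as 13 would be out of bounds, so the conversion
// only happens on the way out.
PredictionMode record_and_convert(InterModeCounts& counts, uint8_t relative_inter_mode)
{
    counts.counts[relative_inter_mode]++;
    return to_prediction_mode(relative_inter_mode);
}
```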
There were rare cases in which u8 was not large enough for the total
count of values read, and increasing this to u32 should have no real
effect on performance (hopefully).
We have a new, improved string type coming up in AK (OOM aware, no null
state), and while it's going to use UTF-8, the name UTF8String is a
mouthful - so let's free up the String name by renaming the existing
class.
Making the old one have an annoying name will hopefully also help with
quick adoption :^)
Like the non-zero tokens and segmentation IDs, these can be moved into
the tile decoding loop for above context and allocated by TileContext
for left context.
We can store this context in the stack of Parser::decode_tiles and use
spans to give access to the sections of the context for each tile and
subsequently each block.
The array containing the vertical line of bools indicating whether non-
zero tokens were decoded in each sub-block is moved to TileContext, and
a span of the valid range for a block to read and write to is created
when we construct a BlockContext.
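The range-restricted access can be sketched roughly as below. `View` is
a hypothetical stand-in for a span type, `uint8_t` stands in for the
bools, and `above_context_for_tile()` is an invented helper, so this
only shows the general shape and not the real TileContext or
BlockContext.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal non-owning view over a contiguous range.
template<typename T>
struct View {
    T* data;
    size_t size;

    T& operator[](size_t index) const { return data[index]; }

    // Returns a sub-view, shrinking it so it never extends past this view.
    View slice(size_t offset, size_t length) const
    {
        if (offset > size)
            offset = size;
        if (length > size - offset)
            length = size - offset;
        return { data + offset, length };
    }
};

// Gives a tile access to only its own columns of a frame-wide context array.
View<uint8_t> above_context_for_tile(std::vector<uint8_t>& frame_context,
    size_t tile_start, size_t tile_end)
{
    View<uint8_t> whole { frame_context.data(), frame_context.size() };
    return whole.slice(tile_start, tile_end - tile_start);
}
```

A block would then take a further `slice()` of the tile's view, so its
reads and writes are limited to the sub-blocks it covers.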
Since the context information for parsing residual tokens changes based
on whether we're parsing the first coefficient or subsequent ones, the
TreeParser::get_tokens_context function was split into two new ones to
allow them to read more cleanly. All variables now have meaningful
names to aid in readability as well.
The math used in the function for the first token was changed to
be more friendly to tile- or block-specific coordinates to facilitate
range-restricted Spans of the above and left context arrays.
Only the residual tokens array needs to be kept for the transforms to
use after all the tokens have been parsed. The token cache is able to
be kept in the stack only for the duration of the token parsing loop.
Since the enum is used as an index to arrays, it unfortunately can't
be converted to an enum class, but at least we can make sure to use it
with the qualified enum name to make things a bit clearer.
Previously, the variables were named similarly to the names in the
spec, which aren't very human-readable. This adds some utility functions
for dimensional unit conversions and names the variables in residual()
based on their units.
References to 4x4 blocks were also renamed to call them sub-blocks
instead, since unit conversion functions would not be able to begin
with "4x4_blocks".
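Unit conversion helpers of this kind might look like the sketch below.
The function names are hypothetical, and the sizes are assumptions:
sub-blocks are 4x4 pixels as described above, and blocks are taken to be
8x8 pixels.

```cpp
#include <cstdint>

// Assumed units: 1 sub-block = 4 pixels, 1 block = 2 sub-blocks = 8 pixels.
constexpr uint32_t sub_blocks_to_pixels(uint32_t sub_blocks) { return sub_blocks << 2; }
constexpr uint32_t pixels_to_sub_blocks(uint32_t pixels) { return pixels >> 2; }
constexpr uint32_t blocks_to_sub_blocks(uint32_t blocks) { return blocks << 1; }
constexpr uint32_t sub_blocks_to_blocks(uint32_t sub_blocks) { return sub_blocks >> 1; }
constexpr uint32_t blocks_to_pixels(uint32_t blocks) { return blocks << 3; }

// Rounds up, so a partially covered block still counts as a whole block.
constexpr uint32_t pixels_to_blocks_ceiled(uint32_t pixels) { return (pixels + 7) >> 3; }
```

Naming a variable after its unit, for example `width_in_sub_blocks`,
then makes it clearer which conversion applies at each call site.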
Moving these to another header allows Parser.h to include fewer context
structs/classes that were previously in Context.h.
This change will also allow consolidating some common calculations into
Context.h, since we won't be polluting the VP9 namespace as much. There
are quite a few duplicate calculations for block size, transform size,
number of horizontal and vertical sub-blocks per block, all of which
could be moved to Context.h to allow for code deduplication and more
semantic code where those calculations are needed.
Those previous constants were only set and used to select the first and
second transforms done by the Decoder class. By turning it into a
struct, we can make the code a bit more legible while keeping those
transform modes the same size as before or smaller.
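As a sketch of the shape such a struct could take, with all names being
hypothetical: each of the two transforms gets its own field instead of
being packed into a single combined constant.

```cpp
#include <cstdint>

enum class TransformType : uint8_t {
    DCT = 0,
    ADST = 1,
};

// Names the two transforms explicitly rather than encoding the pair in
// one integer constant.
struct TransformSet {
    TransformType first_transform;
    TransformType second_transform;

    constexpr bool operator==(TransformSet const& other) const
    {
        return first_transform == other.first_transform
            && second_transform == other.second_transform;
    }
};

// Two one-byte enums keep the struct small.
static_assert(sizeof(TransformSet) == 2, "TransformSet should stay two bytes");
```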
The sub-block transform type is set and then used in a very small scope,
so now it is just stored in a variable and passed to the two functions
that need it, Parser::tokens() and Decoder::reconstruct().
Note that some of the previous segmentation feature settings must be
preserved when a frame is decoded that doesn't use segmentation.
This change also allowed a few functions in Decoder to be made static.
The motion vector joints enum is set up so that the first bit indicates
that a vector should have a non-zero value in the column, and the
second bit indicates a non-zero value for the row. Taking advantage of
this makes the code a bit more legible.
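The bit layout described above can be sketched like this. The enumerator
and helper names are hypothetical, but the values follow the description:
bit 0 marks a non-zero column and bit 1 a non-zero row.

```cpp
#include <cstdint>

enum class MvJoint : uint8_t {
    Zero = 0,          // both components are zero
    NonZeroColumn = 1, // only the column component is non-zero
    NonZeroRow = 2,    // only the row component is non-zero
    NonZeroBoth = 3,   // both components are non-zero
};

constexpr bool has_non_zero_column(MvJoint joint)
{
    return (static_cast<uint8_t>(joint) & 1) != 0;
}

constexpr bool has_non_zero_row(MvJoint joint)
{
    return (static_cast<uint8_t>(joint) & 2) != 0;
}
```

Testing the bits replaces a comparison against each enumerator when
deciding whether to read a row or column component.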
Previously, we were using size_t, often coerced from bool or u8, to
index reference pairs. Now, they must either be taken directly from
named fields or indexed using the `ReferenceIndex` enum with options
`primary` and `secondary`. With a more explicit method of indexing
these, the compiler can aid in using reference pairs correctly, and
fuzzers may be able to detect undefined behavior more easily.
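A minimal sketch of such a pair type follows. `ReferenceIndex` with
`primary` and `secondary` comes from the description above, while the
`ReferencePair` layout and its operators are assumptions made for the
example.

```cpp
#include <cstdint>

enum class ReferenceIndex : uint8_t {
    primary = 0,
    secondary = 1,
};

template<typename T>
struct ReferencePair {
    T primary;
    T secondary;

    // Only a ReferenceIndex is accepted here. A scoped enum does not
    // convert implicitly from bool or integers, so pair[0] or pair[true]
    // fails to compile.
    T& operator[](ReferenceIndex index)
    {
        return index == ReferenceIndex::primary ? primary : secondary;
    }

    T const& operator[](ReferenceIndex index) const
    {
        return index == ReferenceIndex::primary ? primary : secondary;
    }
};
```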