Quantizers are a constant for the whole frame, except when segment
features override them, in which case they are a constant per segment
ID. We take advantage of this by pre-calculating the quantizer values
after reading the quantization parameters and segmentation features for
a frame.
This results in a small 1.5% improvement (~12.9s -> ~12.7s).
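As a rough sketch of the idea (names like FrameQuantizerCache and
derive_quantizers below are illustrative, not the decoder's actual types):

    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <cstdint>

    constexpr size_t MAX_SEGMENTS = 8;

    struct Quantizers {
        uint16_t y_dc { 0 };
        uint16_t y_ac { 0 };
        uint16_t uv_dc { 0 };
        uint16_t uv_ac { 0 };
    };

    // Hypothetical per-frame cache, filled once after the frame's quantization
    // parameters and segmentation features have been parsed, instead of
    // re-deriving the quantizers for every block.
    struct FrameQuantizerCache {
        std::array<Quantizers, MAX_SEGMENTS> per_segment {};

        void precompute(int base_q_index, std::array<int, MAX_SEGMENTS> const& segment_deltas)
        {
            for (size_t segment = 0; segment < MAX_SEGMENTS; segment++) {
                int q_index = std::clamp(base_q_index + segment_deltas[segment], 0, 255);
                per_segment[segment] = derive_quantizers(q_index);
            }
        }

    private:
        static Quantizers derive_quantizers(int q_index)
        {
            // Placeholder: the real decoder maps q_index through the spec's
            // quantizer lookup tables here.
            auto q = static_cast<uint16_t>(q_index);
            return { q, q, q, q };
        }
    };
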
Previously, the block sizes would be checked at runtime to
determine the transform size to apply for residuals. Making the block
sizes into constant expressions allows all the loops to be unrolled
and reduces branching significantly.
This results in about a 26% improvement (~18s -> ~13.2s) in speed in an
intra-heavy test video.
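For illustration, this is the general shape of the change (the residual
function below is a simplified stand-in, not the decoder's actual code):

    #include <algorithm>
    #include <cstdint>

    // With the block size as a template parameter, every loop has a
    // compile-time trip count, so the compiler can fully unroll them and
    // remove the per-iteration size checks.
    template<int block_size>
    void add_residuals(uint8_t* dest, int stride, int16_t const* residuals)
    {
        for (int y = 0; y < block_size; y++)
            for (int x = 0; x < block_size; x++)
                dest[y * stride + x] = static_cast<uint8_t>(
                    std::clamp(dest[y * stride + x] + residuals[y * block_size + x], 0, 255));
    }

    // The runtime size is checked exactly once, at dispatch, instead of
    // inside the hot loops.
    void add_residuals_for_size(int block_size, uint8_t* dest, int stride, int16_t const* residuals)
    {
        switch (block_size) {
        case 4:  add_residuals<4>(dest, stride, residuals); break;
        case 8:  add_residuals<8>(dest, stride, residuals); break;
        case 16: add_residuals<16>(dest, stride, residuals); break;
        case 32: add_residuals<32>(dest, stride, residuals); break;
        }
    }
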
Inter-prediction convolution filters are selected based on the
subpixel position determined for the motion vector relative to the
block being predicted. Subpixel position 0 uses only a single sample at
the center of the convolution, without averaging in any other samples.
Let's call this a copy.
Reference frames can also be a different size relative to the frame
being predicted, but in almost every case, that scale will be 1:1
for every single frame in a video.
Taking into account these facts, we can create multiple fast paths for
inter prediction. These fast paths are only active when scaling is 1:1.
If we are doing a copy in both dimensions, then we can do a straight
memcpy from the reference frame to the output block buffer. In videos
where there is no motion, this is a dramatic speedup.
If we are doing a copy in one dimension, we can just do one convolution
and average directly into the output block buffer.
If we aren't doing a copy in either dimension, we can still cut out a
few operations from the convolution loops, since we only need to
advance our samples by whole pixels instead of subpixels.
These fast paths result in about a 34% improvement (~31.2s -> ~20.6s)
in a video which relies heavily on intra-predicted blocks due to high
motion. In videos with less motion, the improvement will be even
greater.
Also, note that the accumulators in these faster loops are only 16-bit.
High bit-depth videos will overflow those, so for now the fast path is
only used for 8-bit videos.
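Put together, the selection logic looks roughly like this (the convolve_*
helpers stand in for the real filter routines, and the function shape is
an assumption):

    #include <cstdint>
    #include <cstring>

    // Placeholders for the real convolution routines.
    void convolve_horizontal(uint8_t* dest, int dest_stride, uint8_t const* src, int src_stride,
        int width, int height, int subpixel_x);
    void convolve_vertical(uint8_t* dest, int dest_stride, uint8_t const* src, int src_stride,
        int width, int height, int subpixel_y);
    void convolve_both(uint8_t* dest, int dest_stride, uint8_t const* src, int src_stride,
        int width, int height, int subpixel_x, int subpixel_y);

    // Fast paths for unscaled (1:1), 8-bit inter prediction. A "copy" means
    // the subpixel offset is 0 in that dimension, so the filter reduces to
    // taking the single centered sample.
    void predict_block_unscaled_8bit(uint8_t* dest, int dest_stride,
        uint8_t const* reference, int ref_stride,
        int width, int height, int subpixel_x, int subpixel_y)
    {
        if (subpixel_x == 0 && subpixel_y == 0) {
            // Copy in both dimensions: a straight row-by-row memcpy.
            for (int y = 0; y < height; y++)
                memcpy(dest + y * dest_stride, reference + y * ref_stride, width);
            return;
        }
        if (subpixel_y == 0) {
            // Copy in one dimension: only one convolution pass, writing
            // directly into the output block buffer.
            convolve_horizontal(dest, dest_stride, reference, ref_stride, width, height, subpixel_x);
            return;
        }
        if (subpixel_x == 0) {
            convolve_vertical(dest, dest_stride, reference, ref_stride, width, height, subpixel_y);
            return;
        }
        // No copy in either dimension, but with 1:1 scaling the sample
        // position advances by whole pixels, so the position math inside the
        // loops stays cheap. The intermediate accumulators in the fast paths
        // are 16-bit, which is why they are limited to 8-bit video.
        convolve_both(dest, dest_stride, reference, ref_stride, width, height, subpixel_x, subpixel_y);
    }
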
Changing the calculation of reference frame scale factors to be done on
a per-frame basis reduces the amount of work done in
`predict_inter_block()`, which is a big hotspot in most videos.
This reduces decode times in a test video from YouTube by about 5%
(~37.2s -> ~35.4s).
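In effect, the computation moves from the per-block hot path to a
once-per-frame step, along these lines (the names and the fixed-point
convention here are illustrative):

    #include <cstdint>

    // Fixed-point scale factors between a reference frame and the frame
    // being decoded; SCALE_UNIT represents 1:1 scaling.
    constexpr int32_t SCALE_SHIFT = 14;
    constexpr int32_t SCALE_UNIT = 1 << SCALE_SHIFT;

    struct ScaleFactors {
        int32_t x { SCALE_UNIT };
        int32_t y { SCALE_UNIT };
        bool is_unscaled() const { return x == SCALE_UNIT && y == SCALE_UNIT; }
    };

    // Computed once per frame for each active reference frame, instead of
    // being re-derived inside predict_inter_block() for every block.
    ScaleFactors compute_scale_factors(uint32_t ref_width, uint32_t ref_height,
        uint32_t frame_width, uint32_t frame_height)
    {
        return {
            static_cast<int32_t>((static_cast<uint64_t>(ref_width) << SCALE_SHIFT) / frame_width),
            static_cast<int32_t>((static_cast<uint64_t>(ref_height) << SCALE_SHIFT) / frame_height),
        };
    }
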
Checking the bounds of the intermediate values was only implemented to
help debug the decoder. However, it is non-fatal for the values to
exceed the spec-defined bounds, and the checks cause a measurable
performance reduction.
Additionally, the checks were implemented as an assertion, which is
easily broken by bad input files.
I see about a 4-5% decrease in decoding times in the `webm_in_vp9` test
in TestVP9Decode.
There were rare cases in which u8 was not large enough for the total
count of values read, and increasing this to u32 should have no real
effect on performance (hopefully).
Those previous constants were only set and used to select the first and
second transforms done by the Decoder class. By turning them into a
struct, we can make the code a bit more legible while keeping those
transform modes the same size as before or smaller.
The sub-block transform types are set and then used in a very small
scope, so the type is now just stored in a variable and passed to the
two functions that need it, Parser::tokens() and Decoder::reconstruct().
Note that some of the previous segmentation feature settings must be
preserved when a frame is decoded that doesn't use segmentation.
This change also allowed a few functions in Decoder to be made static.
Previously, we were using size_t, often coerced from bool or u8, to
index reference pairs. Now, they must either be taken directly from
named fields or indexed using the `ReferenceIndex` enum with options
`primary` and `secondary`. With a more explicit method of indexing
these, the compiler can aid in using reference pairs correctly, and
fuzzers may be able to detect undefined behavior more easily.
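A simplified sketch of this kind of indexing (the ReferencePair wrapper
shown here is illustrative):

    #include <cstdint>

    enum class ReferenceIndex : uint8_t {
        primary = 0,
        secondary = 1,
    };

    // A pair that can only be indexed with the enum (or accessed through the
    // named fields), so a stray bool or u8 can no longer silently select the
    // wrong reference.
    template<typename T>
    struct ReferencePair {
        T primary {};
        T secondary {};

        T& operator[](ReferenceIndex index)
        {
            return index == ReferenceIndex::primary ? primary : secondary;
        }
        T const& operator[](ReferenceIndex index) const
        {
            return index == ReferenceIndex::primary ? primary : secondary;
        }
    };
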
This also renames most of the related quantizer functions and
variables to make more sense. The AC/DC naming follows transform coding
convention: DC refers to a block's first (zero-frequency) coefficient
and AC to the remaining coefficients, which use separate quantizer
values when dequantizing the residuals.
The color config is reused for most inter-predicted frames, so we store
the config from intra frames in a ColorConfig struct kept in a field of
Parser, and copy from it whenever an inter frame without color config
is encountered.
These are used to pass context needed for decoding, with mutability
scoped only to the sections that the function receiving the contexts
needs to modify. This allows lifetimes of data to be more explicit
rather than being stored in fields, as well as preventing tile threads
from modifying outside their allowed bounds.
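The shape of the pattern, heavily simplified (the context members below
are placeholders for the real decoder state):

    #include <cstdint>
    #include <vector>

    // Read-only for the duration of a frame.
    struct FrameContext {
        uint32_t frame_width { 0 };
        uint32_t frame_height { 0 };
    };

    // Owned by a single tile; only the code decoding that tile may mutate it.
    struct TileContext {
        std::vector<uint8_t> left_partition_context;
        std::vector<uint8_t> above_partition_context;
    };

    // The parameter list documents the mutability: the frame-wide state is
    // const, and a tile thread can only write through its own TileContext.
    void decode_tile(FrameContext const& frame_context, TileContext& tile_context)
    {
        (void)frame_context;
        (void)tile_context;
        // ... per-block decoding would go here ...
    }
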
These are now passed as parameters to each function that uses them.
These will later be moved to a struct to further reduce the number of
parameters that get passed around.
Above and left per-frame block contexts are now also parameters passed
to the functions that use them instead of being retrieved when needed
from a field. This will allow them to be more easily moved to a tile-
specific context later.
The function serves no purpose now; any debug information we want to
pull from the decoder should instead be accessed through some other,
yet-to-be-created interface.
This has two benefits:
- I observed a ~34% decrease in decoding time running TestVP9Decode.
- Removing all of these silly Vector fields helps simplify the code
relationships between all the functions in Decoder.cpp. It'll also be
much easier to make these static with template specializations, if
that turns out to be a worthwhile performance improvement.
Frames will now be queued for retrieval by the user of the decoder.
When the end of the current queue is reached, a DecoderError of
category NeedsMoreInput will be emitted, allowing the caller to react
by displaying what was previously retrieved and sending more samples.
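From the caller's side, the control flow would look roughly like this
(stub types below; the real decoder API differs in its details):

    #include <optional>
    #include <queue>

    enum class DecoderErrorCategory {
        Unknown,
        NeedsMoreInput,
        Corrupted,
    };

    struct VideoFrame { };

    // Stand-in for the decoder: frames are queued as samples are decoded and
    // retrieved one at a time by the caller.
    struct StubDecoder {
        std::queue<VideoFrame> frame_queue;

        std::optional<VideoFrame> get_decoded_frame(DecoderErrorCategory& error_out)
        {
            if (frame_queue.empty()) {
                error_out = DecoderErrorCategory::NeedsMoreInput;
                return std::nullopt;
            }
            VideoFrame frame = frame_queue.front();
            frame_queue.pop();
            return frame;
        }
    };

    // Caller loop: drain the queue, then react to NeedsMoreInput by showing
    // what was retrieved and feeding the next sample buffer.
    void drain_frames(StubDecoder& decoder)
    {
        for (;;) {
            DecoderErrorCategory error = DecoderErrorCategory::Unknown;
            auto frame = decoder.get_decoded_frame(error);
            if (!frame.has_value()) {
                if (error == DecoderErrorCategory::NeedsMoreInput) {
                    // Display the frames retrieved so far and submit the next
                    // sample buffer before calling again.
                }
                return;
            }
            // display(*frame);
        }
    }
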
The class is abstract and has one subclass, SubsampledYUVFrame, which
is used by the VP9 decoder to return a single frame. The
output_to_bitmap(Bitmap&) function can be used to set pixels on an
existing bitmap of the correct size to the RGB values that
should be displayed. The to_bitmap() function will allocate a new bitmap
and fill it using output_to_bitmap.
This new class also implements bilinear scaling of the subsampled U and
V planes so that subsampled videos' colors will appear smoother.
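An outline of the interface shape described above (the Bitmap type is a
placeholder for the real graphics bitmap class, and the plane storage and
return types are simplified):

    #include <cstdint>
    #include <memory>
    #include <vector>

    struct Bitmap { /* placeholder for the real bitmap class */ };

    class VideoFrame {
    public:
        virtual ~VideoFrame() = default;

        // Writes RGB values into an existing bitmap of the correct size.
        virtual void output_to_bitmap(Bitmap& bitmap) = 0;

        // Convenience: allocate a new bitmap and fill it.
        std::unique_ptr<Bitmap> to_bitmap()
        {
            auto bitmap = std::make_unique<Bitmap>();
            output_to_bitmap(*bitmap);
            return bitmap;
        }
    };

    // The VP9 decoder's frame type: it bilinearly upscales the subsampled
    // U and V planes while converting to RGB, so colors appear smoother.
    class SubsampledYUVFrame final : public VideoFrame {
    public:
        void output_to_bitmap(Bitmap& bitmap) override
        {
            (void)bitmap;
            // ... bilinear scaling of U/V, then YUV -> RGB conversion ...
        }

    private:
        std::vector<uint16_t> m_y_plane, m_u_plane, m_v_plane;
    };
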
This adds a struct called CodingIndependentCodePoints and related enums
that are used by video codecs to define the color space that frames
must be converted from when displaying a video.
Pre-multiplied matrices and lookup tables are stored to avoid most of
the floating point division and exponentiation in the conversion.
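As an example of the lookup-table part (the sRGB transfer function is
used here purely for illustration; the real code builds tables for
whatever transfer characteristics the CICP declares):

    #include <array>
    #include <cmath>

    // Precompute the inverse transfer function for every possible 8-bit code
    // value, so per-pixel conversion becomes a table lookup instead of a
    // pow() call.
    std::array<float, 256> build_to_linear_lut()
    {
        std::array<float, 256> lut {};
        for (int i = 0; i < 256; i++) {
            float v = static_cast<float>(i) / 255.0f;
            // sRGB electro-optical transfer function, as an example.
            lut[i] = v <= 0.04045f ? v / 12.92f : std::pow((v + 0.055f) / 1.055f, 2.4f);
        }
        return lut;
    }
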
This allows the second shown frame of the VP9 test video to be decoded,
as the second chunk uses a superframe to encode a reference frame and a
second frame that is inter-predicted between the keyframe and the
reference frame.
This enables the second frame of the test video to be decoded.
It appears that the test video uses a superframe (group of multiple
frames) for the first chunk of the file, but we haven't implemented
superframe parsing.
We also ignore the show_frame flag, so for now, this
means that the second frame read out is shown when it should not be. To
fix this, another error type needs to be implemented that is "thrown" to
the decoder's client so they know to send another sample buffer.
The first keyframe of the test video can be decoded with these changes.
Raw memory allocations in the Parser have been replaced with Vector or
Array to avoid memory leaks and out-of-bounds accesses.
This allows runtime strings, so we can format the errors to make them
more helpful. Errors in the VP9 decoder will now print out a function,
filename and line number for where a read or bitstream requirement
has failed.
The DecoderErrorCategory enum will classify the errors so library users
can show general user-friendly error messages, while providing the
debug information separately.
Any non-DecoderErrorOr<> results can be wrapped by DECODER_TRY to
return from decoder functions. This will also add the extra information
mentioned above to the error message.
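A sketch of what such a macro can look like (the types below are
simplified stand-ins, and the real macro operates on the project's own
error types):

    #include <string>
    #include <utility>

    enum class DecoderErrorCategory {
        Unknown,
        Corrupted,
    };

    struct DecoderError {
        DecoderErrorCategory category;
        std::string message;
    };

    // Stand-in for a fallible result type with an ErrorOr-like interface.
    template<typename T>
    struct FallibleResult {
        bool failed { false };
        T value {};

        bool is_error() const { return failed; }
        T release_value() { return std::move(value); }
    };

    // Evaluates a fallible expression; on failure, returns a DecoderError
    // from the enclosing function that records the function, file, and line
    // where the read or bitstream requirement failed. (Statement expressions
    // are a GCC/Clang extension.)
    #define DECODER_TRY(category, expression)                                      \
        ({                                                                         \
            auto _result = (expression);                                           \
            if (_result.is_error())                                                \
                return DecoderError { (category),                                  \
                    std::string { __func__ } + " at " + __FILE__ + ":" + std::to_string(__LINE__) }; \
            _result.release_value();                                               \
        })
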
This was required to parse more than one frame's height/width data
correctly. Additionally, failure is now handled a little more
gracefully. Since we don't fully parse a tile before starting to parse
the next tile, we will no longer make it past the first tile marker,
which is a scenario we don't handle well yet.
The class that was previously named Decoder handled section 6.X.X of
the spec, which actually deals with parsing out the syntax of the data,
not the actual decoding logic which is specified in section 8.X.X.
The new Decoder class will be in charge of owning and running the
Parser, as well as implementing all of the decoding behavior.
Additionally, this uncovered a couple of bugs in existing code, so
those have been fixed. Currently, parsing a whole video still fails,
because we are now using a new calculation for frame width that hasn't
been fully implemented yet.
Now TreeParser has mostly complete probability calculation
implementations for all currently used syntax elements. Some of these
calculation methods aren't actually finished because they use data we
have yet to parse in the Decoder, but they're close.
This patch brings all of LibVideo up to the east-const style in the
project. Additionally, it applies a few fixes from the reviews in #8170
that referred to older LibVideo code.
The TreeParser requires information about a lot of the decoder's
current state in order to parse syntax tree elements correctly, so
there has to be some communication between the Decoder and the
TreeParser. Previously, the Decoder would copy its state to the
TreeParser when it changed; however, this was a poor choice. Now,
the TreeParser simply has a reference to its owning Decoder, and
accesses its state directly.