Ignoring Size for a second, we currently have:
Rect::scale_by
Rect::scaled
Point::scale_by
Point::scaled
In Size, before this patch, we have:
Size::scale_by
Size::scaled_by
This aligns Size to use the same method name as Rect and Point. While
subjectively providing API symmetry, this is mostly to allow using this
method in templated helpers without caring what the exact underlying
type is.
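To illustrate, here is a minimal sketch with stand-in types (not the
actual Gfx classes, whose exact signatures may differ) of the kind of
templated helper this rename enables:

#include <cstdio>

struct Point {
    int x, y;
    Point scaled(int sx, int sy) const { return { x * sx, y * sy }; }
};

struct Size {
    int width, height;
    Size scaled(int sx, int sy) const { return { width * sx, height * sy }; }
};

template<typename T>
T doubled(T value)
{
    // Compiles for any geometry type that spells the method "scaled".
    return value.scaled(2, 2);
}

int main()
{
    auto point = doubled(Point { 1, 2 });
    auto size = doubled(Size { 3, 4 });
    printf("%d,%d %dx%d\n", point.x, point.y, size.width, size.height);
}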
These methods are slightly more convenient than storing the Bytes
separately. However, it feels unsanitary to reach in and access this
data directly. Both of the users of these already have the
[Readonly]Bytes available in their constructors, and can easily avoid
using these methods, so let's remove them entirely.
Due to overload resolution rules, this simple code provokes a crash:
ReadonlyBytes readonly_bytes{};
FixedMemoryStream stream{readonly_bytes};
ReadonlyBytes give_them_back{stream.bytes()};
// -> Panics on VERIFY(m_writing_enabled);
// but this is fine:
auto bytes = static_cast<FixedMemoryStream const&>(stream).bytes();
If we need to be explicit about it, let's rename the overload instead of
adding that `static_cast`.
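What bites us here is that overload resolution on a non-const object
always prefers the non-const overload, even when the const one would
do. A self-contained sketch of the mechanism (with a stand-in type,
not the real FixedMemoryStream, whose non-const bytes() is the one
that VERIFYs m_writing_enabled):

#include <cstdio>

struct Stream {
    // Stand-in for the overload pair on FixedMemoryStream.
    void bytes() { puts("non-const overload"); }
    void bytes() const { puts("const overload"); }
};

int main()
{
    Stream stream;
    stream.bytes();                             // picks the non-const overload
    static_cast<Stream const&>(stream).bytes(); // forces the const one
}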
Errors are now deferred until `finish_decode()` is called, meaning
branches to return errors only need to occur at the end of a ranged
decode. If VPX_DEBUG is enabled, a debug message will be printed
immediately when an overread occurs.
Average decoding times for `Tests/LibGfx/test-inputs/4.webp` improve
by about 4.7% with this change, absolute decode times changing from
27.4ms±1.1ms down to 26.1ms±1.0ms.
This does a few things:
- The decoder uses a 32- or 64-bit integer as a reservoir of the data
being decoded, rather than one single byte as it was previously.
- `read_bool()` only refills the reservoir (value) when the size drops
  below one byte (see the sketch after this list). Previously, it
  would read out a bit-sized range from the data to completely refill
  the 8-bit value, doing much more work than necessary for each
  individual read.
- VP9-specific code for reading the marker bit was moved to its own
function in Context.h.
- A debug flag `VPX_DEBUG` was added to optionally enable checking of
the final bits in a VPX ranged arithmetic decode and ensure that it
contains all zeroes. These zeroes are a bitstream requirement for
VP9, and are also present for all our lossy WebP test inputs
currently. This can be useful to test whether all the data present in
the range has been consumed.
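A minimal sketch of the reservoir/refill idea, with made-up names (the
real `BooleanDecoder` also performs the arithmetic-decoding math, which
is omitted here). Reads past the end of the data simply shift in zero
bits, and the overread is only flagged later, mirroring the deferred
error handling described above:

#include <cstddef>
#include <cstdint>
#include <cstdio>

struct BitReservoir {
    uint64_t value { 0 };   // up to 8 bytes of upcoming bitstream data
    size_t bits_left { 0 }; // number of valid bits at the top of `value`
    uint8_t const* data { nullptr };
    uint8_t const* end { nullptr };

    void refill()
    {
        // Top the reservoir up one whole byte at a time.
        while (bits_left <= 56 && data < end) {
            value |= static_cast<uint64_t>(*data++) << (56 - bits_left);
            bits_left += 8;
        }
    }

    bool read_bit()
    {
        // Refill only when fewer than 8 valid bits remain, instead of
        // topping `value` up on every single read.
        if (bits_left < 8)
            refill();
        bool bit = (value >> 63) & 1;
        value <<= 1;
        if (bits_left > 0)
            bits_left--;
        return bit;
    }
};

int main()
{
    uint8_t input[] = { 0b10110000 };
    BitReservoir reservoir { .data = input, .end = input + sizeof(input) };
    for (int i = 0; i < 3; i++)
        printf("%d", reservoir.read_bit()); // prints 101
    printf("\n");
}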
A lot of the size of this diff comes from the removal of error handling
from all the range decoder reads in LibVideo/VP9 and LibGfx/WebP (VP8),
since it is now checked only at the end of the range.
In a benchmark decoding `Tests/LibGfx/test-inputs/4.webp`, decode times
are improved by about 22.8%, reducing average runtime from 35.5ms±1.1ms
down to 27.4ms±1.1ms.
This should cause no behavioral changes.
...and keep a forwarding header around in VP9, so we don't have to
update all references to the class there.
In time, we probably want to merge LibGfx/ImageDecoders and LibVideo
into LibMedia, but for now I need just this class for the lossy
webp decoder. So move just it over.
No behavior change.
...from "bytes" to "size_in_bytes".
I thought at first that `bytes` meant the variable contained
bytes that the decoder would read from.
Also, this variable is called `sz` (for `size`) in the spec.
No behavior change.
The framebuffer size was reduced in f2c0cee, but this caused some niche
block layouts to write outside of the frame.
This could be fixed by adding checks to see if a block being predicted/
reconstructed is within the frame, but the branches introduced by that
reduce performance slightly. Therefore, it's better to keep the
framebuffer sized according to the decoded frame size in 8x8 blocks so
that any block can be decoded without bounds checking.
A test was added to ensure that this continues to work.
Some occasional cases could cause the accumulator to overflow and have
an incorrect result. It would be nice to use a smaller accumulator, but
it seems not to be correct. :^(
We now cast to i16 to allow 128-bit vectorization to make use of one
whole register instead of having to split the loop into multiple
passes.
This results in about a 5% reduction in performance in my testing.
Clang was reluctant to inline these for some reason. However, inlining
them seems to be quite beneficial, reducing decoding time in an intra-
heavy video by about 21% (~12.7s -> ~10.0s).
Quantizers are a constant for the whole frame, except when segment
features override them, in which case they are a constant per segment
ID. We take advantage of this by pre-calculating those after reading
the quantization parameters and segmentation features for a frame.
This results in a small 1.5% improvement (~12.9s -> ~12.7s).
This throws out some ugly `#define`s we had that were taking the role
of an enum anyway. We now have some nice getters in the contexts that
take the place of the combo of `seg_feature_active()` and then doing a
lookup in `FrameContext::m_segmentation_features` directly.
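A rough sketch of the shape of this change (hypothetical names, not
the actual FrameContext API):

#include <array>
#include <cstddef>
#include <cstdint>

constexpr size_t max_segments = 8;

struct FrameQuantizers {
    std::array<uint16_t, max_segments> y_ac_quantizer {};

    // Run once per frame, after the quantization parameters and the
    // segmentation features have been parsed.
    void precalculate(int base_quantizer,
                      std::array<int, max_segments> const& segment_deltas,
                      bool segmentation_enabled)
    {
        for (size_t segment_id = 0; segment_id < max_segments; segment_id++) {
            int quantizer = base_quantizer;
            if (segmentation_enabled)
                quantizer += segment_deltas[segment_id];
            y_ac_quantizer[segment_id] = static_cast<uint16_t>(quantizer < 0 ? 0 : quantizer);
        }
    }

    // Per-block lookups are then a single array access instead of a
    // seg_feature_active() check plus a feature-table lookup.
    uint16_t quantizer_for_segment(uint8_t segment_id) const
    {
        return y_ac_quantizer[segment_id];
    }
};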
Bit reversals are used very often in intra-predicted frames. Turning
these into a constexpr lookup table reduces the branching needed for
block transforms significantly. This reduces the times spent decoding
an intra-heavy 1080p video by about 9% (~14.3s -> ~12.9s).
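The technique, sketched here for 8-bit values (the decoder's tables
cover the bit widths it actually needs):

#include <array>
#include <cstddef>
#include <cstdint>

// Built once at compile time; a reversal then costs one branchless
// table lookup at runtime.
static constexpr auto bit_reversal_table = [] {
    std::array<uint8_t, 256> table {};
    for (size_t i = 0; i < table.size(); i++) {
        uint8_t reversed = 0;
        for (int bit = 0; bit < 8; bit++)
            reversed |= ((i >> bit) & 1) << (7 - bit);
        table[i] = reversed;
    }
    return table;
}();

constexpr uint8_t reverse_bits(uint8_t value) { return bit_reversal_table[value]; }

static_assert(reverse_bits(0b00000001) == 0b10000000);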
Previously, the block sizes would be checked at runtime to
determine the transform size to apply for residuals. Making the block
sizes into constant expressions allows all the loops to be unrolled
and reduces branching significantly.
This results in about a 26% improvement (~18s -> ~13.2s) in speed in an
intra-heavy test video.
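The shape of the change, as a hypothetical example (not the decoder's
actual transform code):

#include <cstdint>

// With the size as a template parameter, the trip count is a constant
// expression, so the compiler can fully unroll the inner loops.
template<int block_size>
void add_residuals(int16_t* destination, int16_t const* residuals)
{
    for (int i = 0; i < block_size * block_size; i++)
        destination[i] += residuals[i];
}

// The runtime size is dispatched once, instead of being re-checked
// inside the hot loops.
void add_residuals(int16_t* destination, int16_t const* residuals, int block_size)
{
    switch (block_size) {
    case 4:  add_residuals<4>(destination, residuals);  break;
    case 8:  add_residuals<8>(destination, residuals);  break;
    case 16: add_residuals<16>(destination, residuals); break;
    case 32: add_residuals<32>(destination, residuals); break;
    }
}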
Inter-prediction convolution filters are selected based on the
subpixel position determined for the motion vector relative to the
block being predicted. The subpixel position 0 only uses one single
sample in the center of the convolution, not averaging any other
samples. Let's call this a copy.
Reference frames can also be a different size relative to the frame
being predicted, but in almost every case, that scale will be 1:1
for every single frame in a video.
Taking into account these facts, we can create multiple fast paths for
inter prediction. These fast paths are only active when scaling is 1:1.
If we are doing a copy in both dimensions, then we can do a straight
memcpy from the reference frame to the output block buffer. In videos
where there is no motion, this is a dramatic speedup.
If we are doing a copy in one dimension, we can just do one convolution
and average directly into the output block buffer.
If we aren't doing a copy in either dimension, we can still cut out a
few operations from the convolution loops, since we only need to
advance our samples by whole pixels instead of subpixels.
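In outline, the dispatch looks something like this (hypothetical
names, 8-bit unscaled path only, convolution bodies elided):

#include <cstdint>
#include <cstring>

// "Copy" means subpixel position 0 in that dimension.
void predict_unscaled(uint8_t* block, int block_stride,
                      uint8_t const* reference, int reference_stride,
                      int width, int height, bool copy_x, bool copy_y)
{
    if (copy_x && copy_y) {
        // No filtering at all: straight row-by-row copy.
        for (int row = 0; row < height; row++)
            memcpy(block + row * block_stride,
                   reference + row * reference_stride, width);
        return;
    }
    if (copy_x != copy_y) {
        // convolve_one_dimension(...): a single pass, averaging
        // directly into the output block buffer.
        return;
    }
    // convolve_horizontal_then_vertical(...): the full two-pass path,
    // but stepping by whole pixels rather than subpixels.
}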
These fast paths result in about a 34% improvement (~31.2s -> ~20.6s)
in a video which relies heavily on intra-predicted blocks due to high
motion. In videos with less motion, the improvement will be even
greater.
Also, note that the accumulators in these faster loops are only 16-bit.
High bit-depth videos will overflow those, so for now the fast path is
only used for 8-bit videos.
A typo caused the Y scale value to never be used, so if a reference
frame's aspect ratio didn't match up with the current frame's, it would
decode incorrectly.
Some comments have been added to clarify the frame-constants used in
the function as well.
This moves all the frame size calculation to `FrameContext`, where the
subsampling is easily accessible to determine the size for each plane.
The internal framebuffer size has also been reduced to the exact frame
size that is output.
The division in the `round_mv_...()` functions contained in the motion
vector selection process was done by bit shifting right. However, since
bit shifting negative values will truncate towards the negative end, it
was flooring instead of rounding.
This changes it to match the spec and rely on the compiler to simplify
the division down to a bit shift.
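The spec-style rounding biases away from zero by half the divisor and
then divides, with C++ division truncating toward zero; for a
power-of-two divisor the compiler lowers this to shifts. A standalone
sketch (not the exact LibVideo code):

#include <cassert>

int round_mv_comp_q4(int value)
{
    // Rounded division by 4, per the spec's definition.
    return (value < 0 ? value - 2 : value + 2) / 4;
}

int main()
{
    assert(round_mv_comp_q4(-5) == -1); // -1.25 rounds to -1
    assert((-5 >> 2) == -2);            // the old shift floored it to -2
}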
Previously, the `Parser::decode_tiles()` function wouldn't wait for the
tile-decoding workers to finish before exiting the function, meaning
that the data the threads were working with could become invalid if
the decoder is deleted after an error is encountered.
Using malloc does not invoke T's constructor, nor were we invoking T's
constructor ourselves. Accessing T without invoking its constructor is
undefined behavior.
This adds a new WorkerThread class to run one task asynchronously,
and allow waiting for that thread to finish its work.
TileContexts are placed into multiple tile column vectors with their
streams to read from pre-created. Once those are ready, the threads can
start their work on each vector separately. The main thread waits for
those tasks to finish, then sums up the syntax element counts for each
tile that was decoded.
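A minimal stand-in for the concept (the WorkerThread class in the tree
is its own implementation; this sketch just wraps std::thread):

#include <functional>
#include <thread>
#include <utility>

class WorkerThread {
public:
    explicit WorkerThread(std::function<void()> task)
        : m_thread(std::move(task))
    {
    }

    WorkerThread(WorkerThread const&) = delete;
    WorkerThread& operator=(WorkerThread const&) = delete;

    // Block until the task has run to completion.
    void wait_until_task_is_finished()
    {
        if (m_thread.joinable())
            m_thread.join();
    }

    ~WorkerThread() { wait_until_task_is_finished(); }

private:
    std::thread m_thread;
};

With something of this shape, `decode_tiles()` can spawn one worker per
tile column, wait on each, and then merge the counts on the main
thread.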
Previously, we were incorrectly wrapping an error from `BooleanDecoder`
initialization in a `DecoderErrorCategory::Memory` error. This caused
an incorrect error message in VideoPlayer. Now it will instead return
`DecoderErrorCategory::Corrupted`.
Extending the borders on reference frames so that motion vectors which
point outside the reference frame still land on valid samples allows
`predict_inter_block()` to avoid some branches to clamp the sample
coordinates in its loops.
This results in about a 25% improvement in decode time of a motion-
heavy YouTube video (~20.8s -> ~15.6s).
Moving the clamping of the coordinates of the reference frame samples
as well as some bounds checks outside of the loop reduces the branches
needed in `predict_inter_block()` significantly.
This results in a whopping ~41% improvement in decode performance
of an inter-prediction-heavy YouTube video (~35.4s -> ~20.8s).
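Roughly the flavor of the change, with made-up names (the real code
also replicates edge pixels when the window crosses the frame border,
which is elided here):

#include <algorithm>
#include <cstdint>

int64_t sum_reference_window(uint8_t const* reference, int stride,
                             int frame_width, int frame_height,
                             int base_x, int base_y, int width, int height)
{
    // Hoisted: two clamps per block instead of two clamps per sample,
    // leaving the hot loops branch-free.
    int x0 = std::clamp(base_x, 0, frame_width - width);
    int y0 = std::clamp(base_y, 0, frame_height - height);
    int64_t sum = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            sum += reference[(y0 + y) * stride + (x0 + x)];
    return sum;
}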
Changing the calculation of reference frame scale factors to be done on
a per-frame basis reduces the amount of work done in
`predict_inter_block()`, which is a big hotspot in most videos.
This reduces decode times in a test video from YouTube by about 5%
(~37.2s -> ~35.4s).
This changes the order of the loop copying data to a reference frame
store so that it copies each row in a contiguous line rather than
copying a column at a time, which caused unnecessary branches.
This reduces the decode time on a fairly long 720p YouTube video by
about 14.5% (~43.5s to ~37.2s).
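Illustratively (not the exact loop from the decoder), the copy becomes
one contiguous memcpy per row instead of a strided per-sample walk
down each column:

#include <cstdint>
#include <cstring>

void copy_to_reference_store(uint8_t* destination, int destination_stride,
                             uint8_t const* source, int source_stride,
                             int width, int height)
{
    for (int row = 0; row < height; row++)
        memcpy(destination + row * destination_stride,
               source + row * source_stride, width);
}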
This doesn't appear to have had a measurable impact on performance,
and behavior is the same.
With the tiles using independent BooleanDecoders with their own
backing BitStreams, we're even one step closer to threaded tiles!
Checking the bounds of the intermediate values was only implemented to
help debug the decoder. However, it is non-fatal to have the values
exceed the spec-defined bounds, and causes a measurable performance
reduction.
Additionally, the checks were implemented as an assertion, which is
easily broken by bad input files.
I see about a 4-5% decrease in decoding times in the `webm_in_vp9` test
in TestVP9Decode.
That matches the terminology used in ITU-T Rec. H.273,
PNG's cICP chunk, and the ICC cicpTag.
Also change the enum values to match the values in the spec --
0 means "not full range" and 1 means "full range".
(For now, keep the "Unspecified" entry around, and give it value 2.
This value is not in the spec.)
No intended behavior change.
I previously changed it to use the absolute inter-prediction mode
values instead of the ones relative to NearestMv. That caused the
probability adaption to take invalid indices from the counts and broke
certain videos.
Now it will just convert to the PredictionMode enum when returning from
parse_inter_mode, which allows us to use it the same way as before.
There were rare cases in which u8 was not large enough for the total
count of values read, and increasing this to u32 should have no real
effect on performance (hopefully).