LibVideo/VP9: Implement unscaled fast paths in inter prediction

Inter-prediction convolution filters are selected based on the subpixel position determined for the motion vector relative to the block being predicted. The subpixel position 0 only uses one single sample in the center of the convolution, not averaging any other samples. Let's call this a copy. Reference frames can also be a different size relative to the frame being predicted, but in almost every case, that scale will be 1:1 for every single frame in a video. Taking into account these facts, we can create multiple fast paths for inter prediction. These fast paths are only active when scaling is 1:1. If we are doing a copy in both dimensions, then we can do a straight memcpy from the reference frame to the output block buffer. In videos where there is no motion, this is a dramatic speedup. If we are doing a copy in one dimension, we can just do one convolution and average directly into the output block buffer. If we aren't doing a copy in either dimension, we can still cut out a few operations from the convolution loops, since we only need to advance our samples by whole pixels instead of subpixels. These fast paths result in about a 34% improvement (~31.2s -> ~20.6s) in a video which relies heavily on intra-predicted blocks due to high motion. In videos with less motion, the improvement will be even greater. Also, note that the accumulators in these faster loops are only 16-bit. High bit-depth videos will overflow those, so for now the fast path is only used for 8-bit videos.
2025-09-17 05:46:17 +00:00 · 2023-04-16 08:39:05 -05:00 · 2023-04-16 08:39:05 -05:00 · 8ad0dff5c2
commit 8ad0dff5c2
parent 8cd72ad1ed
3 changed files with 157 additions and 52 deletions
--- a/Userland/Libraries/LibVideo/VP9/Decoder.h
+++ b/Userland/Libraries/LibVideo/VP9/Decoder.h
@ -103,9 +103,6 @@ private:
    template<typename S, typename D>
    inline void hadamard_rotation(Span<S> source, Span<D> destination, size_t index_a, size_t index_b);

-    template<typename T>
-    inline i32 rounded_right_shift(T value, u8 bits);
-
    // (8.7.1.10) This process does an in-place Walsh-Hadamard transform of the array T (of length 4).
    inline DecoderErrorOr<void> inverse_walsh_hadamard_transform(Span<Intermediate> data, u8 log2_of_block_size, u8 shift);