
Why bring up assumptions/suppositions about Netflix's encoding process?

Their tech blog and tech presentations discuss many of the requirements and steps involved for encoding source media to stream to all the devices that Netflix supports.

The Netflix tech blog: https://netflixtechblog.com/ or https://netflixtechblog.medium.com/

Netflix seems to use AWS CPU+GPU for encoding, whereas YouTube has gone to the expense of producing an ASIC to do much of their encoding.

2015 blog entry about their video encoding pipeline: https://netflixtechblog.com/high-quality-video-encoding-at-s...

2021 presentation of their media encoding pipeline: https://www.infoq.com/presentations/video-encoding-netflix/

An example of their FFmpeg usage - a neural-net video frame downscaler: https://netflixtechblog.com/for-your-eyes-only-improving-net...

Their dynamic optimization encoding framework - allocating more bits for complex scenes and fewer bits for simpler, quieter scenes: https://netflixtechblog.com/dynamic-optimizer-a-perceptual-v... and https://netflixtechblog.com/optimized-shot-based-encodes-now...

Netflix developed an algorithm for determining video quality - VMAF, which helps determine their encoding decisions: https://netflixtechblog.com/toward-a-practical-perceptual-vi..., https://netflixtechblog.com/vmaf-the-journey-continues-44b51..., https://netflixtechblog.com/toward-a-better-quality-metric-f...



> Their dynamic optimization encoding framework - allocating more bits for complex scenes and fewer bits for simpler, quieter scenes: https://netflixtechblog.com/dynamic-optimizer-a-perceptual-v... and https://netflixtechblog.com/optimized-shot-based-encodes-now...

This is overrated - of course that's how you do it, what else would you do?

> Mean-squared-error (MSE), typically used for encoder decisions, is a number that doesn’t always correlate very nicely with human perception.

Academics, the reference MPEG encoder, and old proprietary encoder vendors like On2 VP9 did make decisions this way because their customers didn't know what they wanted. But people who care about quality, i.e. anime and movie pirate college students with a lot of free time, didn't.

It looks like they've run x264 in an unnatural mode to get an improvement here, because the default "constant ratefactor" and "psy-rd" always behaved like this.


You're letting the video codec make all the decisions for bitrate allocation.

Netflix tries to optimize the encoding parameters per shot/scene.

from the dynamic optimization article:

- A long video sequence is split in shots ("Shots are portions of video with a relatively short duration, coming from the same camera under fairly constant lighting and environment conditions.")

- Each shot is encoded multiple times with different encoding parameters, such as resolutions and qualities (QPs)

- Each encode is evaluated using VMAF, which together with its bitrate produces an (R,D) point. VMAF quality can be converted to distortion using different mappings; the article tests both linearly and inversely proportional mappings, which give rise to different temporal aggregation strategies

- The convex hull of (R,D) points for each shot is calculated, with distortion taken as the inverse of (VMAF+1)

- Points from the convex hull, one from each shot, are combined to create an encode for the entire video sequence by following the constant-slope principle and building end-to-end paths in a Trellis

- As many aggregate encodes (final operating points) as needed are produced by varying the slope parameter of the R-D curve, in order to cover a desired bitrate/quality range

- Final result is a complete R-D or rate-quality (R-Q) curve for the entire video sequence
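The hull-and-slope selection in the steps above can be sketched in a few lines of Python. All rate/distortion numbers below are invented for illustration; the real pipeline derives them from measured VMAF on actual encodes, and the trellis handles constraints this toy version ignores:

```python
# Toy sketch of per-shot convex-hull selection under the constant-slope
# principle, loosely following the dynamic-optimizer article. Numbers are
# invented; a real pipeline measures (rate, distortion) on real encodes.

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_convex_hull(points):
    """Lower convex hull of (rate, distortion) points, rate ascending."""
    hull = []
    for p in sorted(set(points)):
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()  # drop points above the hull (dominated encodes)
        hull.append(p)
    return hull

def pick_at_slope(hull, lam):
    """Constant-slope principle: minimize D + lam * R on each shot's hull."""
    return min(hull, key=lambda p: p[1] + lam * p[0])

# Hypothetical (kbps, distortion) points from each shot's trial encodes:
shots = {
    "shot_a": [(100, 50), (200, 30), (300, 25), (400, 10)],
    "shot_b": [(100, 20), (150, 18), (200, 5)],
}
hulls = {s: lower_convex_hull(p) for s, p in shots.items()}

# Sweeping lam trades total bitrate against total distortion, producing
# one aggregate operating point per slope value:
for lam in (0.12, 0.04):
    picks = {s: pick_at_slope(h, lam) for s, h in hulls.items()}
    total_rate = sum(r for r, _ in picks.values())
```

A smaller lam penalizes rate less, so every shot moves to a higher-bitrate, lower-distortion hull point, which is exactly the "constant slope across shots" idea.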


> You're letting the video codec make all the decisions for bitrate allocation.

> Netflix tries to optimize the encoding parameters per shot/scene.

That's the problem - if the encoding parameters need to be varied per scene, it means you've defined the wrong parameters. Encoding at a fixed H.264 QP is not on the rate-distortion frontier, so don't encode at constant QP. That's why x264 has a different fixed-quality setting called "ratefactor".


What about VP9? And any of the other codecs that Netflix uses (I'll assume AV1 is one they currently use)?


It's not a codec-specific concept, so it should be portable to any encoder. x265 and AV1 should have similar things, not sure about VP9 as I think it's too old and On2 were, as I said, not that competent.


Isn't two-pass encoding similar? In the first pass you collect statistics that you use in the second pass for bitrate allocation?

Possibly Netflix statistics are way better.
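A toy model of that two-pass idea (hypothetical; real two-pass rate control in encoders like x264 also shapes the allocation with qcomp and enforces VBV buffer constraints):

```python
# Toy two-pass bit allocation: pass 1 measures per-frame complexity,
# pass 2 splits the bit budget proportionally. Real encoders get the
# complexity from a fast trial encode and apply many more constraints.

def second_pass_allocation(frame_complexities, total_bit_budget):
    total = sum(frame_complexities)  # the "pass 1" statistics
    return [total_bit_budget * c / total for c in frame_complexities]

# A hard frame (complexity 5) gets five times the bits of an easy one:
bits = second_pass_allocation([1, 3, 1, 5], 1000)
# → [100.0, 300.0, 100.0, 500.0]
```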


> This is overrated - of course that's how you do it, what else would you do?

That's not what has been done previously for adaptive streaming. I guess you are referring to what encoding modes like CRF do for an individual, entire file? Or where else has this kind of approach been shown before?

In the early days of streaming you would've done constant bitrate for MPEG-TS, even adding zero bytes to pad "easy" scenes. Later you'd have selected 2-pass ABR with some VBV bitrate constraints to not mess up the decoding buffer. At the time, YouTube did something where they tried to predict the CRF they'd need to achieve a certain (average) bitrate target (can't find the reference anymore). With per-title encoding (which was also popularized by Netflix) you could change the target bitrates for an entire title based on a previous complexity analysis. It took quite some time for other players in the field to also hop on the per-title encoding train.

Going to a per-scene/per-shot level is the novelty here, along with exhaustively finding the best possible combination of QP/resolution pairs for an entire encoding ladder that also optimizes subjective quality – and not just MSE.


> exhaustively finding the best possible combination of QP/resolution pairs for an entire encoding ladder that also optimizes subjective quality – and not just MSE.

This is unnecessary if the encoder is well-written. It's like how some people used to run multipass encoders 3 or 4 times just in case the result got better. You only need one analysis pass to find the optimal quality at a bitrate.


Sure, the whole point of CRF is to set a quality target and forget about it, or, with ABR, to be as good as you can with an average bitrate target (under constraints). But you can't do that across resolutions, e.g. do you pick the higher bitrate 360p version, or the lower bitrate 480p one, considering both coding artifacts and upscaling degradation?
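To make that choice concrete, a sketch with invented numbers (in practice the quality score is VMAF computed on frames decoded and upscaled to the display resolution):

```python
# Hypothetical cross-resolution comparison. Each candidate's quality is
# scored after upscaling to the display resolution, so coding artifacts
# and upscaling degradation are both reflected in one number.
candidates = [
    {"res": "480p", "kbps": 700, "quality_at_1080p": 62.0},  # invented scores
    {"res": "360p", "kbps": 800, "quality_at_1080p": 55.0},
]

def best_under(candidates, budget_kbps):
    """Best-quality candidate fitting the bitrate budget, or None."""
    ok = [c for c in candidates if c["kbps"] <= budget_kbps]
    return max(ok, key=lambda c: c["quality_at_1080p"]) if ok else None

best_under(candidates, 750)  # here the 480p encode wins despite lower bitrate
```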


At those two resolutions you'd pick the higher-resolution one. I agree that that generation of codec doesn't scale all the way up to 4K, and at that point you might need to make some smart decisions.

I think it should be possible to decide in one shot in the codec though. My memory is that codecs (image and video) have tried implementing scalable resolutions before, but it didn't catch on simply because dropping resolution is almost never better than dropping bitrate.



