
Why bring up assumptions/suppositions about Netflix's encoding process?

Their tech blog and tech presentations discuss many of the requirements and steps involved for encoding source media to stream to all the devices that Netflix supports.

The Netflix tech blog: https://netflixtechblog.com/ or https://netflixtechblog.medium.com/

Netflix seems to use AWS CPU+GPU for encoding, whereas YouTube has gone to the expense of producing an ASIC to do much of their encoding.

2015 blog entry about their video encoding pipeline: https://netflixtechblog.com/high-quality-video-encoding-at-s...

2021 presentation of their media encoding pipeline: https://www.infoq.com/presentations/video-encoding-netflix/

An example of their FFmpeg usage - a neural-net video frame downscaler: https://netflixtechblog.com/for-your-eyes-only-improving-net...

Their dynamic optimization encoding framework - allocating more bits for complex scenes and fewer bits for simpler, quieter scenes: https://netflixtechblog.com/dynamic-optimizer-a-perceptual-v... and https://netflixtechblog.com/optimized-shot-based-encodes-now...

Netflix developed an algorithm for determining video quality - VMAF, which helps determine their encoding decisions: https://netflixtechblog.com/toward-a-practical-perceptual-vi..., https://netflixtechblog.com/vmaf-the-journey-continues-44b51..., https://netflixtechblog.com/toward-a-better-quality-metric-f...



> Their dynamic optimization encoding framework - allocating more bits for complex scenes and fewer bits for simpler, quieter scenes: https://netflixtechblog.com/dynamic-optimizer-a-perceptual-v... and https://netflixtechblog.com/optimized-shot-based-encodes-now...

This is overrated - of course that's how you do it, what else would you do?

> Mean-squared-error (MSE), typically used for encoder decisions, is a number that doesn’t always correlate very nicely with human perception.

Academics, the reference MPEG encoder, and old proprietary encoder vendors like On2 VP9 did make decisions this way because their customers didn't know what they wanted. But people who care about quality, i.e. anime and movie pirate college students with a lot of free time, didn't.

It looks like they've run x264 in an unnatural mode to get an improvement here, because the default "constant ratefactor" and "psy-rd" always behaved like this.


You're letting the video codec make all the decisions for bitrate allocation.

Netflix tries to optimize the encoding parameters per shot/scene.

from the dynamic optimization article:

- A long video sequence is split in shots ("Shots are portions of video with a relatively short duration, coming from the same camera under fairly constant lighting and environment conditions.")

- Each shot is encoded multiple times with different encoding parameters, such as resolutions and qualities (QPs)

- Each encode is evaluated using VMAF, which together with its bitrate produces an (R,D) point. VMAF quality can be converted to distortion using different mappings; the article tests both linearly and inversely proportional mappings, which give rise to different temporal aggregation strategies

- The convex hull of (R,D) points for each shot is calculated, with distortion taken as the inverse of (VMAF+1)

- Points from the convex hull, one from each shot, are combined to create an encode for the entire video sequence by following the constant-slope principle and building end-to-end paths in a Trellis

- As many aggregate encodes (final operating points) as needed are produced by varying the slope parameter of the R-D curve, in order to cover a desired bitrate/quality range

- Final result is a complete R-D or rate-quality (R-Q) curve for the entire video sequence
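The hull-and-slope selection in the steps above can be sketched in a few lines of Python. All rate/distortion numbers below are invented for illustration; the real pipeline derives them from measured VMAF on actual encodes, and the trellis handles constraints this toy version ignores:

```python
# Toy sketch of per-shot convex-hull selection under the constant-slope
# principle, loosely following the dynamic-optimizer article. Numbers are
# invented; a real pipeline measures (rate, distortion) on real encodes.

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_convex_hull(points):
    """Lower convex hull of (rate, distortion) points, rate ascending."""
    hull = []
    for p in sorted(set(points)):
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()  # drop points above the hull (dominated encodes)
        hull.append(p)
    return hull

def pick_at_slope(hull, lam):
    """Constant-slope principle: minimize D + lam * R on each shot's hull."""
    return min(hull, key=lambda p: p[1] + lam * p[0])

# Hypothetical (kbps, distortion) points from each shot's trial encodes:
shots = {
    "shot_a": [(100, 50), (200, 30), (300, 25), (400, 10)],
    "shot_b": [(100, 20), (150, 18), (200, 5)],
}
hulls = {s: lower_convex_hull(p) for s, p in shots.items()}

# Sweeping lam trades total bitrate against total distortion, producing
# one aggregate operating point per slope value:
for lam in (0.12, 0.04):
    picks = {s: pick_at_slope(h, lam) for s, h in hulls.items()}
    total_rate = sum(r for r, _ in picks.values())
```

A smaller lam penalizes rate less, so every shot moves to a higher-bitrate, lower-distortion hull point, which is exactly the "constant slope across shots" idea.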


> You're letting the video codec make all the decisions for bitrate allocation.

> Netflix tries to optimize the encoding parameters per shot/scene.

That's the problem - if the encoding parameters need to be varied per scene, it means you've defined the wrong parameters. Encoding at a fixed H.264 QP is not on the rate-distortion frontier, so don't encode at constant QP. That's why x264 has a different fixed-quality setting called "ratefactor".


What about VP9? And any of the other codecs that Netflix uses (I'll assume AV1 is one they currently use)?


It's not a codec-specific concept, so it should be portable to any encoder. x265 and AV1 should have similar things, not sure about VP9 as I think it's too old and On2 were, as I said, not that competent.


Isn't two-pass encoding similar? In the first pass you collect statistics that you use in the second pass for bitrate allocation?

Possibly Netflix statistics are way better.
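A toy model of that two-pass idea (hypothetical; real two-pass rate control in encoders like x264 also shapes the allocation with qcomp and enforces VBV buffer constraints):

```python
# Toy two-pass bit allocation: pass 1 measures per-frame complexity,
# pass 2 splits the bit budget proportionally. Real encoders get the
# complexity from a fast trial encode and apply many more constraints.

def second_pass_allocation(frame_complexities, total_bit_budget):
    total = sum(frame_complexities)  # the "pass 1" statistics
    return [total_bit_budget * c / total for c in frame_complexities]

# A hard frame (complexity 5) gets five times the bits of an easy one:
bits = second_pass_allocation([1, 3, 1, 5], 1000)
# → [100.0, 300.0, 100.0, 500.0]
```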


> This is overrated - of course that's how you do it, what else would you do?

That's not what has been done previously for adaptive streaming. I guess you are referring to what encoding modes like CRF do for an individual, entire file? Or where else has this kind of approach been shown before?

In the early days of streaming you would've done constant bitrate for MPEG-TS, even adding zero bytes to pad "easy" scenes. Later you'd have selected 2-pass ABR with some VBV bitrate constraints to not mess up the decoding buffer. At the time, YouTube did something where they tried to predict the CRF they'd need to achieve a certain (average) bitrate target (can't find the reference anymore). With per-title encoding (which was also popularized by Netflix) you could change the target bitrates for an entire title based on a previous complexity analysis. It took quite some time for other players in the field to also hop on the per-title encoding train.

Going to a per-scene/per-shot level is the novelty here, along with exhaustively finding the best possible combination of QP/resolution pairs for an entire encoding ladder that also optimizes subjective quality – and not just MSE.


> exhaustively finding the best possible combination of QP/resolution pairs for an entire encoding ladder that also optimizes subjective quality – and not just MSE.

This is unnecessary if the encoder is well-written. It's like how some people used to run multipass encoders 3 or 4 times just in case the result got better. You only need one analysis pass to find the optimal quality at a bitrate.


Sure, the whole point of CRF is to set a quality target and forget about it, or, with ABR, to be as good as you can with an average bitrate target (under constraints). But you can't do that across resolutions, e.g. do you pick the higher bitrate 360p version, or the lower bitrate 480p one, considering both coding artifacts and upscaling degradation?
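To make that choice concrete, a sketch with invented numbers (in practice the quality score is VMAF computed on frames decoded and upscaled to the display resolution):

```python
# Hypothetical cross-resolution comparison. Each candidate's quality is
# scored after upscaling to the display resolution, so coding artifacts
# and upscaling degradation are both reflected in one number.
candidates = [
    {"res": "480p", "kbps": 700, "quality_at_1080p": 62.0},  # invented scores
    {"res": "360p", "kbps": 800, "quality_at_1080p": 55.0},
]

def best_under(candidates, budget_kbps):
    """Best-quality candidate fitting the bitrate budget, or None."""
    ok = [c for c in candidates if c["kbps"] <= budget_kbps]
    return max(ok, key=lambda c: c["quality_at_1080p"]) if ok else None

best_under(candidates, 750)  # here the 480p encode wins despite lower bitrate
```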


At those two resolutions you'd pick the higher-resolution one. I agree that that generation of codec doesn't scale all the way up to 4K, and at that point you might need to make some smart decisions.

I think it should be possible to decide in one shot in the codec though. My memory is that codecs (image and video) have tried implementing scalable resolutions before, but it didn't catch on simply because dropping resolution is almost never better than dropping bitrate.



