This edition is sponsored by Udacity

Roughly four out of every five hours watched on Netflix come from a recommendation. The "Continue Watching" row, the "Because you watched" carousel, the cover artwork that quietly changes depending on who is looking — all of it is the recommendation engine doing its job.

So when Netflix starts experimenting with how large language models could improve search, recommendations, and personalisation, it is not poking at a side feature. It is rebuilding the part of the product that drives most of the business.

The trouble is that the most capable language models on the market do not know anything about Netflix. A frontier model can write essays about cinema, but it cannot recommend the right show for you on a Tuesday evening. It does not know which titles are trending this week, which actors appear in which series, or whether you finish thrillers but abandon documentaries halfway through. Trained on the open internet, it has read a lot about Netflix without ever having seen the inside of it.

Closing that gap is what post-training is for, and Netflix has built an internal framework for doing it at scale. The framework is what would let them build conversational search ("something light and funny for a Tuesday evening"), recommendations that can explain themselves in plain language, and show summaries, descriptions, and mood tags drafted by AI and reviewed by humans rather than the other way round. None of that has shipped yet. The framework is the bet that lets them build it.

What "Post-Training" Actually Means

To understand post-training, it helps to know what a language model actually is. Underneath the conversational interface, a model is a very large pile of numbers — billions of them — arranged in layers. These numbers are called weights. When you type a question, the model converts your text into numbers, runs those numbers through its layers, and the weights nudge the result one way or another at each step. After billions of small nudges, an answer comes out the other side.
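To make "numbers running through layers" concrete, here is a toy sketch in Python. The two layers and every weight in them are invented for illustration; a real model has billions of weights and more elaborate layers, but the flow is the same: numbers go in, each layer's weights reshape them, numbers come out.

```python
# Toy "model": two layers of made-up weights applied to a numeric input.

def layer(inputs, weights):
    """One layer: each output is a weighted sum of the inputs, floored at zero."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(inputs, layers):
    """Run the input through every layer in order."""
    for weights in layers:
        inputs = layer(inputs, weights)
    return inputs

layers = [
    [[0.5, -1.0], [1.0, 1.0]],  # layer 1: 2 inputs -> 2 outputs
    [[1.0, 2.0]],               # layer 2: 2 inputs -> 1 output
]
output = forward([2.0, 1.0], layers)
print(output)  # [6.0]
```

Change any single weight and the output shifts. Training is the process of finding which shifts make the output better.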

Training a model means adjusting those weights. The original training, the work AI labs such as OpenAI do before they release a new model, runs a huge volume of text, much of it scraped from the public internet, through the system and tweaks the weights until the answers come out coherent. Post-training is what happens after that. You take a model whose weights already produce sensible English and you nudge those weights further so the answers fit a narrower job.

It helps to think of the model as a chef who has read every cookbook ever published but has never worked in your restaurant. They understand technique, ingredients, and cuisine. They do not know your menu or what your regulars order on a Friday. Post-training is the apprenticeship that teaches them.

Netflix's framework supports four ways of running that apprenticeship. Each adjusts the weights in a different way and suits a different kind of problem.

Supervised fine-tuning

This is the most straightforward. You collect a dataset of input-output examples — for Netflix, perhaps a viewing history paired with the title that user actually watched and enjoyed next. You feed the input into the model, compare its prediction to the correct answer, and shift the weights in the direction that would have made the prediction less wrong. Repeat across thousands of examples and the model learns to imitate the pattern.
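A minimal sketch of that loop, assuming a tiny linear model with a softmax on top rather than a real language model. The features, the three candidate titles, and the learning rate are all hypothetical; only the shape of the step (predict, compare, shift the weights) mirrors the description above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_step(weights, features, target, lr=0.5):
    """One supervised step: cross-entropy loss and gradient for a linear softmax model."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[target])
    for k, row in enumerate(weights):
        error = probs[k] - (1.0 if k == target else 0.0)  # d(loss)/d(logit k)
        for j, x in enumerate(features):
            row[j] -= lr * error * x  # move against the gradient
    return loss

# Hypothetical data: features = [watched_thriller, watched_comedy],
# target = index of the title the user actually watched next (3 made-up titles).
examples = [([1.0, 0.0], 0), ([0.0, 1.0], 2), ([1.0, 0.0], 0), ([0.0, 1.0], 2)]
weights = [[0.0, 0.0] for _ in range(3)]

losses = []
for epoch in range(20):
    losses.append(sum(sft_step(weights, f, t) for f, t in examples))

print(round(losses[0], 3), "->", round(losses[-1], 3))  # loss falls as the model imitates the data
```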

Preference optimisation

This is for problems where there is no single right answer, only better and worse ones. You show the model two candidate answers and tell it which one was preferred. There is no objectively correct recommendation for "something light and funny on a Tuesday," but a human can usually look at two suggestions and pick the better one. The model learns to produce answers closer to the preferred ones and further from the rejected ones.
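One common way to turn "this one was preferred" into a weight update is a pairwise loss of the Bradley-Terry kind, the same shape that sits underneath methods such as DPO. The source does not say which method Netflix uses, so treat this as a generic sketch: each candidate answer gets a single learnable score, and the candidate names are invented.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_step(scores, chosen, rejected, lr=0.5):
    """Widen the gap between the preferred answer and the rejected one."""
    margin = scores[chosen] - scores[rejected]
    loss = -math.log(sigmoid(margin))
    grad = sigmoid(margin) - 1.0   # d(loss)/d(margin), always negative
    scores[chosen] -= lr * grad    # preferred answer moves up
    scores[rejected] += lr * grad  # rejected answer moves down
    return loss

# Hypothetical candidates for "something light and funny on a Tuesday".
scores = {"sitcom": 0.0, "war documentary": 0.0}
losses = [preference_step(scores, "sitcom", "war documentary") for _ in range(10)]

print(scores)
```

Note that nothing in the loss says what the right answer is. It only says which of the two was better, which is exactly the information a human rater provides.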

Reinforcement learning

This is where things get more interesting. Instead of telling the model what is right or what is preferred, you let it try things and reward it when the result is good. The model generates an answer, the system scores that answer, and the weights are nudged toward higher scores. This is how you teach a model not just to imitate but to reason — to take multiple steps before arriving at an answer, because the reward only comes at the end.
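The core update can be sketched with a policy-gradient step on a three-option toy problem. Two simplifications are assumptions of this sketch, not of Netflix's framework: the reward values are made up, and the gradient is computed in expectation over all options instead of from sampled attempts, which keeps the example deterministic. It also covers a single decision, not multi-step reasoning.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, lr=0.4):
    """Nudge the policy toward options that score above its current average reward."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    # Gradient of expected reward w.r.t. logit k is p_k * (r_k - expected).
    new_logits = [z + lr * p * (r - expected) for z, p, r in zip(logits, probs, rewards)]
    return new_logits, expected

rewards = [0.1, 0.9, 0.3]   # hypothetical scores for three candidate answers
logits = [0.0, 0.0, 0.0]    # start with no preference
history = []
for _ in range(500):
    logits, expected = policy_gradient_step(logits, rewards)
    history.append(expected)

probs = softmax(logits)
print([round(p, 3) for p in probs])
```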

Knowledge distillation

This is the practical one. You take a large, expensive model that performs well and use it as a teacher for a smaller, cheaper one trained to produce the same answers. You end up with a model that runs faster and costs less in production while keeping most of the quality of its bigger sibling. For a service that has to respond to millions of users in milliseconds, that gap between expensive-and-slow and cheap-and-fast is often the difference between a feature that ships and one that doesn't.
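A rough sketch of the idea, assuming the student is trained to match the teacher's full probability distribution over answers (one common form of distillation; the source does not specify which variant Netflix uses). The teacher's logits here are invented, and the "student" is just three learnable numbers.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL divergence KL(p || q): how far q is from p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_step(student_logits, teacher_probs, lr=0.5):
    student_probs = softmax(student_logits)
    # Gradient of KL(teacher || student) w.r.t. student logit k is student_k - teacher_k.
    return [z - lr * (s - t) for z, s, t in zip(student_logits, student_probs, teacher_probs)]

teacher_probs = softmax([2.0, 0.5, -1.0])  # hypothetical teacher output
student_logits = [0.0, 0.0, 0.0]

before = kl(teacher_probs, softmax(student_logits))
for _ in range(200):
    student_logits = distill_step(student_logits, teacher_probs)
after = kl(teacher_probs, softmax(student_logits))

print(round(before, 4), "->", round(after, 6))
```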

Netflix needed all four, which is why they built their own framework instead of using off-the-shelf tools.

What Standard Tools Can't Handle

Most teams start with an off-the-shelf library that handles the standard case: give the model question-and-answer pairs, run a fine-tuning loop, get a more polite chatbot out the other end. Netflix's requirements break that case in two important ways.

The first is the shape of the answer. A standard language model produces text, one word at a time. The model's final layer — called the output head — looks at everything the model has computed and assigns a probability to every word it knows; the most likely word becomes the next word in the answer. That works for chatbots, but not when the answer you want is a specific Netflix title rather than a sentence. You need a different output head — one that assigns probabilities to titles instead of words. Standard tools assume the answer is always text and make swapping that out hard.
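In code terms, the output head is just the last set of weights, with one row per possible answer. The sketch below puts two different heads on the same internal state: one scoring words, one scoring titles. The hidden state, vocabulary, titles, and weights are all placeholders.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def head(hidden, weights):
    """Output head: one weight row per possible answer, returns a probability for each."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(logits)

hidden = [0.2, -0.4, 1.0]  # what the body of the model computed (made-up numbers)

# Standard head: probabilities over words.
vocab = ["the", "show", "funny"]
word_head = [[0.1, 0.0, 0.2], [0.3, 0.1, 0.0], [0.0, 0.5, 0.4]]

# Swapped-in head: probabilities over catalogue titles.
titles = ["Title A", "Title B"]
title_head = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]

word_probs = head(hidden, word_head)
title_probs = head(hidden, title_head)
best_title = titles[title_probs.index(max(title_probs))]
print(best_title)  # Title A
```

The body of the model is untouched. Only the final mapping from "what the model computed" to "what counts as an answer" changes, which is why tooling that hard-codes a text vocabulary gets in the way.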

The second is what the model is allowed to read. A model can only understand inputs that have been broken into chunks it knows. The component that does the breaking is called a tokeniser, and it has a fixed dictionary — maybe fifty thousand chunks of common words, sub-words, and punctuation — learned during the original training. When you type "Stranger Things," the tokeniser breaks it into a couple of those chunks. Netflix sometimes wants to teach the model about things that are not in any standard dictionary: a single chunk that is the show "Stranger Things," or one that means "this user binged the entire season in a sitting."
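Here is a toy illustration of the difference. The greedy longest-match splitter below is a stand-in for a real tokeniser, and the vocabulary and the two special tokens are invented for the example; the point is only that adding an entry to the dictionary turns a whole concept into a single chunk.

```python
def tokenize(text, vocab):
    """Greedy longest-match split of text into chunks from vocab (toy tokeniser)."""
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append(text[i])  # unknown character falls back to itself
            i += 1
    return tokens

base_vocab = {"watched", " ", "Str", "anger", " Things"}
plain = tokenize("watched Stranger Things", base_vocab)
print(plain)  # ['watched', ' ', 'Str', 'anger', ' Things']

# Hypothetical custom tokens: one chunk per title, one per behaviour.
extended_vocab = base_vocab | {"<title:stranger_things>", "<binged_full_season>"}
custom = tokenize("watched <title:stranger_things><binged_full_season>", extended_vocab)
print(custom)  # ['watched', ' ', '<title:stranger_things>', '<binged_full_season>']
```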

Even more ambitiously, Netflix sometimes trains a model where the input is not text at all, but a sequence of shows watched in order — fed into the model the way a sentence is fed into a language model. The model learns to predict the next show the way a language model predicts the next word. Standard tools were not built for any of that.
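To show the framing without a neural network, the sketch below treats each viewing history as a "sentence" of show IDs and predicts the next show by counting what usually follows the last one. A counting model is a deliberate simplification standing in for the kind of model Netflix would actually train, and the show IDs are placeholders.

```python
from collections import Counter, defaultdict

def fit_next_item(histories):
    """Count, for each show, which show tends to be watched next."""
    counts = defaultdict(Counter)
    for history in histories:
        for current, following in zip(history, history[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, last_watched):
    """Return the most frequent follow-up, or None if the show was never seen."""
    if last_watched not in counts:
        return None
    return counts[last_watched].most_common(1)[0][0]

histories = [
    ["show_a", "show_b", "show_c"],
    ["show_a", "show_b", "show_d"],
    ["show_e", "show_b", "show_c"],
]
counts = fit_next_item(histories)
print(predict_next(counts, "show_b"))  # show_c
```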

So Netflix built their own: flexible enough for the unusual cases, but clean enough that a researcher can still run a standard fine-tuning job without endless plumbing.

The Bigger Shift: From Fine-Tuning to Learning by Doing

If there is one architectural lesson buried in Netflix's framework, this is it.

Through 2024, post-training mostly meant supervised fine-tuning. The mechanics were simple: feed an input in, compare the output to the correct answer, calculate how wrong it was, shift the weights to be slightly less wrong. Run that same operation across thousands of specialised AI chips in parallel — each chip handling a different slice of the data — and you have an efficient training loop. Almost every standard AI training tool is built around this symmetric pattern: every chip doing the same thing on different data, all at once.
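The symmetric pattern can be sketched in a few lines. Each "chip" below is just a list slice, the model is a single weight fitting y = w * x, and the averaging line stands in for the all-reduce operation a real cluster would perform; all of that is an assumption made to keep the example small.

```python
def shard_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one shard of data."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.05):
    grads = [shard_gradient(w, shard) for shard in shards]  # same operation on every "chip"
    avg = sum(grads) / len(grads)                           # stand-in for an all-reduce
    return w - lr * avg                                     # every chip applies the same update

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated from y = 2x
shards = [data[0:2], data[2:4]]                            # one equal slice per chip

w = 0.0
for _ in range(100):
    w = data_parallel_step(w, shards)
print(round(w, 6))  # 2.0
```

Because the shards are the same size, averaging their gradients gives the same result as one chip processing all the data. That equivalence is what makes the pattern so easy to scale.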

Now picture reinforcement learning for a Netflix-style problem. Suppose you are training a model to recommend a good follow-up show. The step looks more like this:

  1. The model is given a viewing history and asked to recommend three follow-ups.

  2. The model generates its three suggestions.

  3. A separate scoring system — another AI trained for the job — evaluates how good those suggestions are, perhaps a score from zero to one based on how well they match the user's taste.

  4. A second helper AI checks that the recommendations have not drifted too far from what the original model would have produced. This is a guardrail against the model learning to game the score by recommending strange edge-case shows.

  5. The system collects all of that and uses it to compute how the weights should be adjusted.

  6. Weights update, and you start again with a new viewing history.

The training step is no longer a single operation. It is a chain of distinct stages — generation, scoring, guardrail-checking, weight-updating — and each one has to finish before the next starts. The clean symmetry of "every chip doing the same thing at the same time" is broken. You need a controller that decides which stage runs when and how to move data between them.
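The stages above can be sketched as a small controller that runs each one in order and passes shared state between them. Everything concrete here is hypothetical: the three-title catalogue, the lookup table standing in for the scoring model, the penalty weight, and the update rule. For brevity the model picks one title per step instead of three.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

CATALOGUE = ["show_a", "show_b", "show_c"]                     # placeholder titles
TASTE_SCORE = {"show_a": 0.2, "show_b": 0.9, "show_c": 0.4}  # stand-in for the scoring AI

def generate(state):
    """Stage 1-2: the current model samples a recommendation."""
    state["probs"] = softmax(state["logits"])
    state["action"] = state["rng"].choices(range(len(CATALOGUE)), weights=state["probs"])[0]

def score(state):
    """Stage 3: a separate scorer rates the recommendation."""
    state["reward"] = TASTE_SCORE[CATALOGUE[state["action"]]]

def guardrail(state, beta=0.1):
    """Stage 4: penalise picks the model favours far more than the frozen reference does."""
    ref = softmax(state["ref_logits"])
    a = state["action"]
    state["penalty"] = beta * math.log(state["probs"][a] / ref[a])

def update(state, lr=0.5):
    """Stage 5-6: combine reward and penalty, then adjust the weights."""
    advantage = state["reward"] - state["penalty"]
    a = state["action"]
    state["logits"] = [
        z + lr * advantage * ((1.0 if k == a else 0.0) - p)
        for k, (z, p) in enumerate(zip(state["logits"], state["probs"]))
    ]

STAGES = [generate, score, guardrail, update]  # each must finish before the next starts

def training_step(state, log):
    for stage in STAGES:
        stage(state)
        log.append(stage.__name__)

state = {"logits": [0.0] * 3, "ref_logits": [0.0] * 3, "rng": random.Random(0)}
log = []
for _ in range(50):
    training_step(state, log)
print(log[:4])
```

In a real system each stage may run on different hardware with a different model loaded, so the loop over `STAGES` becomes a scheduling and data-movement problem, not a four-line `for` loop.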

Netflix's original framework was built for the symmetric case. To support reinforcement learning they evolved it into an active controller that orchestrates the multi-stage workflow. The trigger was the release of DeepSeek-R1 in early 2025, which showed that reinforcement learning at scale could push a model into genuinely new behaviour, particularly the ability to reason through a problem in multiple steps. After that, many AI labs shifted part of their post-training work toward reinforcement learning. The plumbing had to change to keep up.

Don't Fight the Open-Source Ecosystem

One design choice runs through the framework: stay close to the open-source ecosystem instead of building a private standard.

The most-used hub for open AI tools is a company called Hugging Face, and Netflix uses Hugging Face's standard model format and tokeniser. Early on the team had bypassed the standard tokeniser for more control, and it cost them. Tiny differences in how text was split during training versus production — the same word becoming two chunks in one place and three in another — meant the model behaved slightly differently in the live product than in testing. The differences were small enough that nobody noticed at first and large enough that quality eventually degraded for reasons no one could explain. Switching to the standard tokeniser made the problem go away.
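A cheap defence against this class of bug is a parity check that runs the training-time and production tokenisers over the same sample text and flags any disagreement. The two splitters below are invented stand-ins, not Netflix's or Hugging Face's code, chosen so that one phrase splits into two chunks in one and three in the other.

```python
import re

def training_tokenize(text):
    """Hypothetical custom splitter used at training time: whitespace only."""
    return text.split()

def serving_tokenize(text):
    """Hypothetical production splitter: also breaks on hyphens."""
    return [t for t in re.split(r"[\s-]+", text) if t]

def find_mismatches(texts, tok_a, tok_b):
    """Return every text the two tokenisers split differently."""
    return [t for t in texts if tok_a(t) != tok_b(t)]

samples = ["light comedy", "feel-good drama", "true crime"]
mismatches = find_mismatches(samples, training_tokenize, serving_tokenize)
print(mismatches)  # ['feel-good drama']
```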

The takeaway: departing from ecosystem conventions creates debt that gets repaid in mysterious bugs months later.

Why the Plumbing Is the Point

By Netflix's own published estimate, its existing recommendation algorithms already save the company over a billion dollars a year by reducing churn.

Adding language model understanding on top of that — the ability to reason about your taste, explain picks back to you, and search the catalogue in plain language — is the next big jump. You cannot make that jump by bolting an off-the-shelf model onto your stack. You have to build the apprenticeship.

Netflix has built the workshop where a generic model gets taught the menu of one specific kitchen, designed it so researchers can move between techniques without rebuilding their mental model, and kept it close to the open-source ecosystem so they do not trap themselves in a private standard. The infrastructure is not the headline. But the company that gets the infrastructure right is the one that ships the headline a year before everyone else.

That’s it

See you soon
—Sid
