The model that writes your product copy, answers your customers, and helps your engineers debug code did not learn to do any of those things during pre-training.
In pre-training, a model learns language. It reads trillions of tokens of text from the internet and learns to predict the next word.
That is where its raw intelligence comes from. Because it now knows so much about the world, it can complete sentences, answer trivia, and generate plausible text on almost any subject.
But it cannot follow instructions. It cannot have a conversation. It does not know when to refuse a harmful request or when to say "I don't know."
Ask a raw pre-trained model, "What is the capital of France?" and it might respond with "What is the capital of Germany? What is the capital of Spain?"
That's because completing a list of similar questions is a valid next-token prediction.
Post-training is the set of techniques that turns an unhelpful model into a useful assistant. The model learns to follow instructions, prefer helpful answers, and behave the way users expect.
If pre-training builds the engine, post-training teaches it to drive.
Let’s dig in!
For PMs building AI products, post-training determines model behaviour.

Why does Claude refuse certain requests? Why does GPT-5 format its answers differently from Gemini? Why does fine-tuning sometimes make a model worse?
The answers all live in post-training.
Stage 1: Supervised Fine-Tuning
Learning by Example
The first stage of post-training is supervised fine-tuning (SFT).
The concept is straightforward: show the model thousands of examples of correct behaviour and train it to imitate them. These examples are prompt-response pairs.
A human writes a question ("Explain quantum computing to a 10-year-old") and a high-quality response ("Imagine you have a magic coin that can be heads and tails at the same time...").
The model trains on thousands of these pairs, learning the pattern of instruction → helpful response. Think of it like training a new employee.
You don't explain customer service theory from first principles; you show them ten great customer service interactions and say, "Do it like this."
After enough examples, they pick up the tone, structure, and boundaries of what a good response looks like. SFT is remarkably effective.
A small number of high-quality examples, often just a few thousand, can transform a raw language model into something that follows instructions, formats its answers clearly, and handles multi-turn conversations.
Most of the behavioural difference you see between a base model and an instruction-tuned model comes from this stage.
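To make this concrete, here is a minimal sketch of what SFT data can look like. The field names and chat formatting below are illustrative assumptions, not any specific vendor's format, and real pipelines typically mask the loss so only the response tokens are learned.

```python
# Illustrative prompt-response pairs; the schema is an assumption for this sketch.
sft_examples = [
    {
        "prompt": "Explain quantum computing to a 10-year-old.",
        "response": "Imagine you have a magic coin that can be heads and tails at the same time...",
    },
    {
        "prompt": "Summarise this support ticket in one sentence.",
        "response": "The customer cannot log in because their password-reset link has expired.",
    },
]

def to_training_text(example):
    # Each pair becomes one training sequence; the model is trained to imitate the response.
    return f"User: {example['prompt']}\nAssistant: {example['response']}"
```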
But SFT has a ceiling. It teaches the model what good looks like, but it does not teach it to differentiate between good and great. It does not teach it how to handle edge cases that the training examples did not cover.
And it does not teach it to refuse harmful requests unless someone specifically wrote a harmful request and a refusal as a training example.
Stage 2: Preference Training
Learning What Humans Prefer
The second stage teaches the model to rank responses: some answers are better than others, even when both are correct.
Human feedback enters the game here. The original approach, RLHF (Reinforcement Learning from Human Feedback), works in two steps.

First, you build a reward model.
Human labelers get a prompt and two possible responses.
They choose which one they prefer. These preferences, thousands of them, will train a separate model whose only job is to score responses.
Given any prompt and response, the reward model outputs a number. Higher means more aligned with human preferences.
Second, you use that reward model to train the language model through reinforcement learning. The language model generates responses. The reward model scores them.
The language model is updated to produce responses that score higher. Over time, the model learns to generate the kind of answers that humans consistently prefer.
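Concretely, the reward model is usually trained with a simple pairwise loss over those human judgements. A minimal PyTorch sketch, assuming the reward model has already scored both responses for the same prompt:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    # Pairwise (Bradley-Terry style) loss: push the preferred response's score
    # above the rejected response's score for the same prompt.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example: scores for a batch of three preference pairs.
chosen = torch.tensor([2.1, 0.3, 1.7])
rejected = torch.tensor([1.4, 0.9, -0.2])
loss = reward_model_loss(chosen, rejected)
```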
The analogy here is that of a student receiving graded papers back.
SFT is like studying model answers. RLHF is like submitting your own work, getting a score, and improving based on that feedback. The second process is slower and more expensive, but it produces deeper learning. RLHF is powerful but complex.
You need to manage three separate models (the language model, the reward model, and a reference model to prevent the language model from drifting too far).
The training process is unstable and expensive.
Small changes in hyperparameters can cause the model to collapse.
The model can learn to produce nonsensical text that scores high on the reward model but is useless to humans. This failure mode is called reward hacking.
The DPO Revolution: Skipping the Reward Model
Direct Preference Optimization (DPO) gives nearly the same results as RLHF but without reinforcement learning and the reward model. The key insight is mathematical.
DPO showed that the RLHF objective can be reformulated as a simple classification problem.

Instead of training a reward model and then using RL to optimise against it, you directly train the language model on preference pairs.
For each preferred response and rejected response pair, the model learns to increase the probability of the preferred one and decrease the probability of the rejected one.
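The objective itself is compact. A minimal sketch, assuming you have already computed the summed log-probability of each full response under the model being trained and under a frozen reference copy of its starting point:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more (or less) likely each response is under the trained model
    # than under the frozen reference model.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Widen the gap between preferred and rejected responses; beta controls
    # how far the model is allowed to drift from the reference.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```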
The result is one trained model instead of three (the frozen reference copy is only used for scoring, never updated), and standard supervised training instead of unstable RL. As of 2026, DPO is the default approach for most teams doing alignment fine-tuning. It is simpler to implement, cheaper to run, and produces competitive results on most benchmarks.
Frontier labs like OpenAI and Anthropic still rely on RL-based methods for their most capable models, where the extra complexity yields measurable gains.
For almost every team, DPO has made preference training accessible.
Constitutional AI: Feedback Without Humans
One bottleneck in both RLHF and DPO is data. You need thousands of human preference judgements, and high-quality labelling is slow and expensive.
Anthropic's Constitutional AI (CAI) addresses this by replacing most of the human labelling with the model itself.
Instead of asking humans, you give the model written principles, a "constitution," and ask it to evaluate its own responses against those principles.
The process works like this. The model generates a response, critiques it against the constitution, and then revises the response. The original and revised responses become the preference pair that feeds into training. The advantage is scale.
You can generate millions of preference comparisons without any human annotators. This approach is sometimes called RLAIF (Reinforcement Learning from AI Feedback).
It doesn't replace human feedback entirely, but it extends it: a small set of human-written principles can yield a large, consistent training dataset.
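A rough sketch of that critique-and-revise loop, where `model.generate` is a hypothetical text-completion helper rather than any specific library's API:

```python
def constitutional_pair(model, prompt, constitution):
    # 1. Generate an initial answer.
    original = model.generate(prompt)
    # 2. Ask the model to critique its own answer against the written principles.
    critique = model.generate(
        f"Constitution:\n{constitution}\n\nResponse:\n{original}\n\n"
        "Critique this response against the constitution."
    )
    # 3. Ask it to revise the answer using that critique.
    revised = model.generate(
        f"Response:\n{original}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response to address the critique."
    )
    # The revision is treated as preferred over the original in the preference data.
    return {"prompt": prompt, "chosen": revised, "rejected": original}
```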
Why This Matters When You Choose a Model
Post-training makes models different. Two models pre-trained on similar data can behave very differently because of how they were post-trained.
When Claude refuses to help with something and ChatGPT does not, that difference comes from post-training.
When one model is better at following complex multi-step instructions, that comes from the quality of SFT examples. When a model handles edge cases gracefully instead of producing confident nonsense, that comes from preference training.
Understanding this pipeline, pre-training → SFT → preference training, gives you a practical framework for several common PM decisions.
When you fine-tune a model for your product, you are adding another round of SFT.
If the fine-tuning data is of low quality, it can override the model's post-training and make it worse. That is why fine-tuning sometimes degrades model behaviour.
The new examples conflict with the alignment training. When you evaluate model vendors, you are comparing post-training strategies.
A model that excels at creative writing but struggles with structured data extraction was likely post-trained with different priorities than one that does the reverse.
When you build feedback loops into your product, you are generating the raw material for preference training. The quality and structure of that feedback directly determines whether future model versions will be better or worse at serving your users.
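As a sketch of what that raw material can look like, a simple side-by-side feedback event reduces to the same preference-pair shape that RLHF and DPO consume; the field names here are hypothetical:

```python
def feedback_to_preference(prompt, shown_response, alternative_response, user_picked_shown):
    # The user saw two candidate answers and indicated which one they preferred.
    chosen, rejected = (
        (shown_response, alternative_response)
        if user_picked_shown
        else (alternative_response, shown_response)
    )
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```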
The models get smarter at every stage of this pipeline.
Pre-training data is getting more curated. SFT examples are getting more diverse. Preference training is getting more efficient.
But the structure itself (teach, then refine, then align) is likely to remain the foundation of how models learn to be useful for a long time.
Post-training is the least visible and most important stage of building an LLM.
The next time you notice a model doing something unexpectedly well or poorly, the answer is almost certainly somewhere in this pipeline.
