DoorDash Used LLMs to Rebuild Its Homepage

Most personalization systems look smart on the surface. They show “Recommended for you,” rearrange a few rows, and adjust ranking weights.

And teams celebrate that as AI.

But here is the uncomfortable truth: reordering fixed categories is not real personalization. It is controlled reshuffling.

If you are building AI products, you need to think beyond ranking tweaks. The real opportunity isn’t in sorting better. It’s in redefining what gets shown in the first place.

DoorDash understood this. Instead of squeezing more performance from an old structure, they questioned the structure itself.

What happens when your content inventory is fixed, but user preferences are not?

That is what we are exploring today.

Let's go!

The Original Homepage System

DoorDash treats the homepage as its front door.

It is the first screen most users see when they open the app.

If the homepage works well, users find food faster, explore more restaurants, place orders quickly, and return more often.

If it fails, they leave, or worse, they open another app.

DoorDash originally powered the homepage using a Food Knowledge Graph (FKG).

It was a structured system that stored cuisines, dishes, and categories.

The team also created over 300 themed carousels, such as breakfast burritos, salads, and baked goods.

Here is how the system worked:

The algorithm studied a user’s past orders and browsing history.
It pulled out food-related tags from the knowledge graph.
It matched those tags to the 300 carousel tags.
It selected the most relevant rows for that user.
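The four steps above can be sketched in a few lines. This is a toy illustration, not DoorDash's code: the tag tables, carousel names, and the `select_carousels` function are all hypothetical, and the scoring (counting tag overlaps) is an assumption about how "most relevant" might be decided.

```python
from collections import Counter

# Hypothetical knowledge-graph lookup: dish name -> food tags.
DISH_TAGS = {
    "pho": {"vietnamese", "noodles", "soup"},
    "banh mi": {"vietnamese", "sandwiches"},
    "caesar salad": {"salads"},
    "breakfast burrito": {"breakfast", "burritos"},
}

# Hypothetical predefined carousels, each with a fixed tag set.
CAROUSEL_TAGS = {
    "Breakfast Burritos": {"breakfast", "burritos"},
    "Salads": {"salads"},
    "Noodle Soups": {"noodles", "soup"},
    "Baked Goods": {"bakery"},
}


def select_carousels(order_history, top_n=2):
    """Score each fixed carousel by how often its tags appear in the history."""
    tag_counts = Counter()
    for dish in order_history:
        # Dishes missing from the graph contribute nothing at all.
        tag_counts.update(DISH_TAGS.get(dish, set()))
    scores = {
        title: sum(tag_counts[tag] for tag in tags)
        for title, tags in CAROUSEL_TAGS.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda title: (-scores[title], title))
    return [title for title in ranked if scores[title] > 0][:top_n]
```

Notice what happens to a dish like "boba" in this sketch: it is not in the lookup table, so it simply disappears from the user's profile.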

The system was clean and organized. But there was one problem: three hundred predefined rows cannot represent the real preferences of millions of users.

And that was not all. The design had three specific limitations.

Three Big Limitations

1. The vocabulary was fixed.

The knowledge graph only knew the categories that engineers had already defined.

If a concept wasn't in the system, it did not exist.

For example, if a user often ordered pho, bánh mì, and boba, the system would label them as Vietnamese or Asian Food.

That is correct, but it is too broad.

It does not reflect the specific mix of what that person likes.

2. The carousels were too general.

A row called salads does not really personalize anything.

It doesn't distinguish someone who orders Caesar salads from Italian places from someone who prefers quinoa bowls from health-focused restaurants.

The category exists. The intent does not.

3. The system depended on tagging quality.

If the restaurant had an incorrect or incomplete tag, the system either missed it or matched it to the wrong carousel.

The homepage quality depended heavily on perfect tagging.

The root issue was structural. A knowledge graph works with a fixed set of categories. If something is not defined, the system cannot show it. If tagging is weak, the matching is weak.

DoorDash realized it could not keep building on this system forever. The ceiling was too low.

So they changed the approach completely.

The Shift: From Selecting Rows to Generating Them

Instead of choosing from 300 fixed carousels, DoorDash built a system in which an LLM generates new carousel themes.

The carousels are generated per user and per time of day. That means breakfast, lunch, dinner, and late night can all look different for the same person.

Note that the system runs offline. DoorDash does not generate these carousels in real time when a user opens the app. They precompute them in advance.

This design choice matters a lot because if they called an LLM every time someone opened the homepage:

Costs would increase quickly
Latency would hurt the user experience

By generating content offline, they decoupled the cost of generation from the cost of serving. The homepage can load fast, even if generation is expensive.
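Here is a minimal sketch of that split, assuming a simple key-value layout keyed by user and time of day. The function names, the `DAYPARTS` tuple, the fallback row, and the in-memory dictionary standing in for a real store are all hypothetical.

```python
DAYPARTS = ("breakfast", "lunch", "dinner", "late_night")


def generate_carousels(user_id, daypart):
    """Stand-in for the expensive LLM pipeline (hypothetical placeholder)."""
    label = daypart.replace("_", " ").title()
    return [f"{label} picks for {user_id}"]


def precompute(user_ids, store):
    """Offline batch job: run generation for every (user, daypart) pair."""
    for user_id in user_ids:
        for daypart in DAYPARTS:
            store[(user_id, daypart)] = generate_carousels(user_id, daypart)


def serve_homepage(user_id, daypart, store, fallback=("Popular near you",)):
    """Online path: a plain lookup, with no model call in the request."""
    return store.get((user_id, daypart), list(fallback))
```

The serving function never touches the model. If a user has no precomputed entry yet, this sketch falls back to a generic row rather than generating on the fly.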

So, the team designed a five-stage pipeline to make this work.

The Five-Stage Pipeline

Source: DoorDash engineering blog

1. Consumer Profiles → Carousel Generation

The system starts with a structured consumer profile.

It builds this from past orders, browsing behavior, cuisine patterns, dish-level preferences, and time of day. It then feeds this profile into an LLM.

The LLM generates carousel titles and structured metadata. Now, instead of picking from 300 fixed rows, the model can create new themes like:

Late-Night Noodle Cravings
Bangkok Street Heat
Cozy Italian Comfort Food

No one manually created these rows. The model uses its knowledge about food and cuisine to connect user behavior to meaningful themes.
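To make the profile-to-prompt step concrete, here is a rough sketch. DoorDash has not published its prompt or profile schema in this article, so the `ConsumerProfile` fields, the `build_prompt` function, and the prompt wording are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConsumerProfile:
    """Hypothetical structured profile; the real schema is not public here."""
    top_cuisines: List[str]
    top_dishes: List[str]
    daypart: str


def build_prompt(profile, n_titles=3):
    """Turn a structured profile into a text prompt for a title-generating LLM."""
    return (
        f"Suggest {n_titles} short carousel titles for a food delivery homepage.\n"
        f"Time of day: {profile.daypart}\n"
        f"Favorite cuisines: {', '.join(profile.top_cuisines)}\n"
        f"Frequently ordered dishes: {', '.join(profile.top_dishes)}\n"
        "Return one title per line."
    )
```

The point of the structure is that the model sees a compact summary of behavior, not raw order logs.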

However, DoorDash does not let the model generate whatever it likes. They:

Enforce diversity so similar rows do not repeat
Exclude brand names
Block unwanted food types
Continuously refine prompts based on evaluation data

The LLM expands the vocabulary, but the team keeps it controlled.
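Some of these guardrails can also be applied as a post-filter on whatever the model returns. The sketch below is one possible approach, not DoorDash's method: the blocklists contain made-up names, and using word overlap (Jaccard similarity) as the diversity check is an assumption.

```python
import re

BLOCKED_BRANDS = {"acmeburger", "megacoffee"}   # made-up brand names
BLOCKED_FOOD_TERMS = {"cocktails"}              # hypothetical blocked food type


def _words(title):
    """Lowercase word set for a title, splitting on spaces and hyphens."""
    return set(re.findall(r"[a-z']+", title.lower()))


def filter_titles(candidates, max_overlap=0.5):
    """Drop blocked titles and near-duplicates of titles already kept."""
    kept = []
    for title in candidates:
        words = _words(title)
        if not words:
            continue
        if words & BLOCKED_BRANDS or words & BLOCKED_FOOD_TERMS:
            continue
        # Jaccard overlap against every title kept so far.
        too_similar = any(
            len(words & _words(other)) / len(words | _words(other)) > max_overlap
            for other in kept
        )
        if not too_similar:
            kept.append(title)
    return kept
```

A filter like this is cheap and deterministic, so it works well as a safety net behind the prompt-level instructions.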

2. Embedding Generation

After the system generates carousel titles, it converts them into embeddings. An embedding turns text into numbers. It captures meaning in vector form.

For example, Hearty Weekend Brunch becomes a numerical vector. The system then compares that vector to restaurant-level embeddings.

That is important. The system no longer depends on exact tags.

It works in semantic space. That means it can match restaurants that feel like brunch places even if no one labeled them as “brunch.”

That makes the system more flexible and more accurate.
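A common way to compare two embeddings is cosine similarity. The article does not say which similarity measure DoorDash uses, so treat this as an assumption. The three-dimensional vectors below are invented toy values; real embeddings have hundreds of dimensions or more.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
carousel_vec = [0.9, 0.1, 0.3]   # stands in for "Hearty Weekend Brunch"
diner_vec = [0.8, 0.2, 0.4]      # a diner with brunch-style items
sushi_vec = [0.1, 0.9, 0.2]      # a sushi bar
```

With these toy values, the diner scores much closer to the carousel than the sushi bar does, even though neither carries a "brunch" tag.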

3. Content Moderation with LLM-as-Jury

You can't manually review millions of carousel titles. So, DoorDash built an automated moderation system using three separate LLMs.

Each model reviews every title independently. If even one model flags a title as inappropriate, the system blocks it. It is a single-veto system.

The team prioritizes safety over letting borderline content through. The system achieves about 95% recall in catching problematic titles.

They use three models because each model has its own blind spots. One model might miss something that another catches.
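The single-veto rule itself is tiny. In this sketch, each judge is a function that returns True when it flags a title. The three stub judges are hypothetical placeholders standing in for real LLM calls, and their rules are deliberately silly.

```python
def review_title(title, judges):
    """Ask every judge independently; one flag is enough to block the title."""
    votes = [judge(title) for judge in judges]  # True means "flagged"
    approved = not any(votes)
    return approved, votes


# Hypothetical stand-ins for three separate moderation models.
def judge_a(title):
    return "forbidden" in title.lower()


def judge_b(title):
    return title.isupper()


def judge_c(title):
    return len(title) > 40


JUDGES = [judge_a, judge_b, judge_c]
```

Returning the individual votes alongside the decision makes it easy to audit which model blocked a title and how often the models disagree.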

If you are building AI systems that generate user-facing content, you need a moderation architecture like this from day one.

4. Store and Item Retrieval

Once the system approves a carousel, it retrieves relevant stores.

It compares the carousel embedding with store embeddings and selects the closest matches. That removes the dependency on manual tagging.

For example, a carousel called “Hearty Weekend Brunch” can show diners that serve brunch-style items even without a brunch tag.
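Retrieval then becomes a nearest-neighbor lookup. The sketch below is a brute-force version over a handful of toy vectors, with hypothetical store names. A production system at this scale would more likely use an approximate nearest-neighbor index, which the article does not describe.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_stores(carousel_vec, store_vecs, k=2):
    """Return the k store ids whose embeddings are closest to the carousel."""
    query = normalize(carousel_vec)
    scored = (
        (sum(q * s for q, s in zip(query, normalize(vec))), store_id)
        for store_id, vec in store_vecs.items()
    )
    return [store_id for _, store_id in heapq.nlargest(k, scored)]


# Toy store embeddings, invented for illustration only.
STORE_VECS = {
    "sunny_diner": [0.8, 0.2, 0.4],
    "sushi_bar": [0.1, 0.9, 0.2],
    "pancake_house": [0.9, 0.0, 0.2],
}
```

No tag lookup happens anywhere in this path. A store qualifies because its vector sits near the carousel's vector.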

The system also selects the most relevant dish image from each restaurant’s menu. The same restaurant can show up with different images in different carousels:

Pasta in an Italian comfort row
Dessert in a sweets-focused row

The homepage becomes context-aware.

5. Store Ranking

The final step ranks stores within each carousel. The ranking model balances:

How well the store matches the theme
How likely the user is to click and order

The system does not just show the most similar store first. It shows the store that is both relevant and engaging. Personalization meets performance here.
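One simple way to picture that balance is a weighted blend of the two signals. This is an assumption made for illustration: the article does not say how DoorDash's ranking model combines them, and a real system would likely use a learned model instead of a fixed weight. The field names and `relevance_weight` are hypothetical.

```python
def rank_stores(candidates, relevance_weight=0.5):
    """Sort stores by a blend of theme match and predicted engagement.

    Each candidate is a dict with "theme_similarity" and "p_order",
    both assumed to be on a 0-to-1 scale.
    """
    def score(store):
        return (
            relevance_weight * store["theme_similarity"]
            + (1 - relevance_weight) * store["p_order"]
        )

    return sorted(candidates, key=score, reverse=True)
```

With an even weight, a store that matches the theme slightly less but is far more likely to be ordered from can outrank the closest semantic match.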

Evaluation: Real Metrics, Real Impact

DoorDash did not rely on intuition. They used:

Human labelers
LLM-as-judge evaluation

They measured precision@10, the share of the top 10 results that are relevant to the carousel theme. Precision@10 improved from 68% to 85%. That is a big jump.
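Precision@k is easy to compute once you have relevance labels. Here is a small sketch using the standard definition. The function name and the idea of passing labels as a set of relevant store ids are illustrative choices, not DoorDash's evaluation code.

```python
def precision_at_k(ranked_items, relevant_items, k=10):
    """Fraction of the top-k ranked items that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_items[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant_items)
    # Dividing by k (not len(top_k)) penalizes carousels with too few results.
    return hits / k
```

In practice the relevance labels would come from the human labelers or the LLM judge, and the score would be averaged over many carousels.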

At 85%, roughly 8 or 9 out of every 10 stores in a carousel match its theme.

They then ran A/B tests in San Francisco and Manhattan. The results showed:

Double-digit improvement in click-through rates
Higher conversion rates
Better homepage relevance scores
More merchant exploration

The system also helped smaller and niche restaurants get discovered more often. That improved the supply side of the marketplace.

It was not just a UX improvement. It changed marketplace dynamics.

What the System Still Does Not Know

Off-the-shelf LLMs understand food.

They know that pad thai is Thai, poke bowls are Hawaiian, and tikka masala ties back to British-Indian cuisine. But they do not know DoorDash’s internal marketplace data.

They do not know which restaurants deliver the fastest, which dishes get reordered often, local co-purchase behavior, or city-level preference patterns.

DoorDash plans to fine-tune models with proprietary data, such as:

Co-purchase patterns
Regional preferences
Store performance metrics

When you combine world knowledge with platform-specific data, you build a strong competitive advantage.

In a Nutshell

DoorDash did not just improve its ranking. It changed the content layer.

They:

Moved from fixed taxonomy to generative concepts.
Separated generation, retrieval, moderation, and ranking clearly.
Ran expensive AI steps offline to control cost and latency.
Built strong moderation with a three-model jury.
Measured the gain: precision@10 improved from 68% to 85%.
Validated the impact through A/B tests in major markets.

If you are an AI Product Manager, ask yourself one simple question: Are you only rearranging what already exists, or are you creating new surface area for discovery?

If you design your system the way DoorDash did, you will build something that actually changes how your product works.