Airbnb Replaced Press 1 With Four ML Models That Listen

The traditional phone support experience is a pain.

You call a company, and they give you a tree of numbered options.

"Press 1 for billing, press 2 for technical support, press 3 for...." By the time you reach a human, you have already repeated your problem three times.

The menu system was designed for the company's routing needs, not for the caller.

Airbnb decided to replace this entirely. Instead of forcing callers into predefined menu trees, their Interactive Voice Response (IVR) system now asks a single open-ended question: "In a few sentences, please tell us why you're calling today."

IVR Core Service architecture (Source)

The caller speaks naturally. The system listens, transcribes, understands the intent, finds the right help article, summarises it back to the caller, and either resolves the issue automatically or routes to a human agent with full context attached.

But behind that experience are four ML models working in sequence, each handling a different stage of the pipeline. All of them run in real time during a live phone call.

The new system reduced word error rate from 33% to roughly 10%, detects intent in under 50 milliseconds, and achieves precision above 90% on paraphrased responses.

Let’s dig in!

Stage 1: Hearing Correctly (Domain-Specific ASR)

The first challenge is transcription. Converting speech to text during a phone call is harder than it sounds. Phone audio is compressed, noisy, and often unclear.

Callers speak with accents, use incomplete sentences, and mumble through the exact words that matter most. General-purpose speech recognition models handle conversational English well but fail on domain-specific vocabulary.

Airbnb found that off-the-shelf models misinterpreted terms: "listing" became "lifting," "help with my stay" became "happy Christmas Day."

These transcription errors propagated through the entire pipeline. If the system cannot hear the words correctly, it cannot understand the intent. Airbnb's solution was twofold.

First, they switched from a generic pretrained model to one specifically adapted for noisy phone audio. The acoustic conditions of a phone call are fundamentally different from podcast-quality speech.
Second, they introduced a domain-specific phrase list that biases the model toward Airbnb terminology like reservation, listing, refund, check-in, host, superhost, and hundreds of other terms that show up frequently in support calls.
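To make the phrase-list idea concrete, here is a minimal sketch, assuming the recogniser returns several scored candidate transcripts. Production ASR systems typically apply this kind of bias during decoding rather than afterwards, so this shows the idea, not Airbnb's implementation. The phrase set, the scores, and the `boost` value are all hypothetical.

```python
# Hypothetical phrase list; the article names only a handful of the real terms.
DOMAIN_PHRASES = {"reservation", "listing", "refund", "check-in", "host", "superhost"}

def rescore(hypotheses, boost=0.5):
    """Return the best transcript from (text, score) pairs after adding a
    bonus for every domain phrase the transcript contains."""
    def biased(item):
        text, score = item
        words = [w.strip(".,!?") for w in text.lower().split()]
        hits = sum(1 for w in words if w in DOMAIN_PHRASES)
        return score + boost * hits
    return max(hypotheses, key=biased)[0]

# Made-up log-probability-style scores: the wrong transcript starts slightly ahead.
n_best = [
    ("i need help with my lifting", -1.0),
    ("i need help with my listing", -1.2),
]
best = rescore(n_best)  # the domain bonus lifts the "listing" hypothesis to the top
```

The point of the design is that a small nudge toward known vocabulary is enough to flip close calls, without retraining the acoustic model.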

The result was dramatic. Word error rate dropped from 33% to approximately 10%. That improvement translated directly into better downstream performance: more accurate intent detection, better help article matches, improved customer NPS among users who interacted with the ASR menu, reduced reliance on human agents, and lower overall customer service handling time.

A 23 percentage point reduction in word error rate rippled through every stage of the pipeline.
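For readers unfamiliar with the metric: word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognised one, divided by the number of reference words. The sketch below computes it in plain Python. The example sentences reuse the mishearings quoted earlier, and the function name is just an illustrative choice.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Three substitutions plus one deletion against four reference words.
wer_bad = word_error_rate("help with my stay", "happy christmas day")
# One substitution against three reference words.
wer_mild = word_error_rate("update my listing", "update my lifting")
```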

Stage 2: Understanding Why They Called (Intent Detection)

Once the caller's words are accurately transcribed, the system needs to understand what they actually need. Airbnb built a Contact Reason Detection model trained on a detailed taxonomy of every type of inquiry the company receives.

The model classifies each transcribed statement into the right category.

When a caller says, "I haven't received my refund yet," the model predicts the reason as "Missing Refund" and forwards it to the relevant downstream components.

The intent detection service runs in parallel across multiple instances, achieving average latency under 50 milliseconds. That speed matters.

Intent detection architecture (source)

At that latency, the delay is imperceptible to the caller. There is no awkward pause while the system thinks. The caller finishes speaking, and the system responds almost immediately.

A separate model handles a specific edge case: callers who do not want to describe their issue at all. Some people say, "Let me talk to someone."

A dedicated escalation intent detector recognises these requests and routes the call directly to a human, respecting the caller's preference.
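The two decisions described in this stage can be sketched as follows. This is a toy stand-in, not Airbnb's models: the real system uses trained classifiers, while this sketch uses keyword overlap and a regular expression. The taxonomy entries other than "Missing Refund", the keywords, the return labels, and the choice to check for escalation first are all assumptions.

```python
import re

# Hypothetical slice of a contact-reason taxonomy with hand-picked keywords.
TAXONOMY = {
    "Missing Refund": {"refund", "received", "money"},
    "Cancel Reservation": {"cancel", "reservation", "booking"},
    "Check-in Problem": {"check-in", "key", "door", "locked"},
}

# Hypothetical pattern standing in for the escalation intent detector.
ESCALATION = re.compile(
    r"\b(?:talk|speak) (?:to|with) (?:someone|a person|an agent|a human)\b"
)

def route(transcript):
    """Return (action, contact_reason) for one transcribed utterance."""
    text = transcript.lower()
    if ESCALATION.search(text):
        return ("escalate", None)  # respect the caller's preference
    words = set(re.findall(r"[a-z\-]+", text))
    scores = {reason: len(words & kws) for reason, kws in TAXONOMY.items()}
    reason = max(scores, key=scores.get)
    if scores[reason] == 0:
        return ("unknown", None)   # nothing in the taxonomy matched
    return ("self_serve", reason)

refund_case = route("I haven't received my refund yet")
human_case = route("Let me talk to someone.")
```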


Stage 3: Finding the Right Answer (Semantic Retrieval)

Pointing the caller to the right information resolves most support issues.

Airbnb's Help Centre contains hundreds of articles covering booking changes, refund policies, safety procedures, and host guidelines.

The challenge is matching a caller's natural language description to the single most relevant article. Airbnb built a two-stage retrieval and ranking system.

The first stage uses semantic search. The caller's transcribed query is converted into an embedding (a numerical representation of its meaning) and compared against pre-indexed embeddings of every help article. This retrieves up to 30 candidate articles in around 60 milliseconds using cosine similarity matching.
The second stage applies an LLM-based ranking model to re-rank those candidates. Semantic similarity gets you in the right neighbourhood, but ranking determines the exact address. The top-ranked article is the one sent to the caller via SMS or app notification.
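The retrieve-then-rerank pattern can be sketched in a few lines. Everything here is a simplification under stated assumptions: `embed` is a toy word-count vector standing in for a learned embedding model, `overlap_score` stands in for the LLM-based ranker, and the article titles are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would use a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, index, k=30):
    """Stage 1: cheap similarity search over pre-indexed article embeddings."""
    q = embed(query)
    ranked = sorted(index.items(), key=lambda kv: cosine(q, kv[1]), reverse=True)
    return [title for title, _ in ranked[:k]]

def rerank(query, candidates, score_fn):
    """Stage 2: a slower, more precise scorer runs only on the shortlist."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)

def overlap_score(query, title):
    # Hypothetical stand-in for the LLM-based ranking model.
    return len(set(query.lower().split()) & set(title.lower().split()))

ARTICLES = [  # invented titles
    "how to cancel a reservation",
    "when will i get my refund",
    "how to change a reservation date",
]
INDEX = {title: embed(title) for title in ARTICLES}  # built ahead of time

query = "where is my refund"
candidates = retrieve(query, INDEX, k=2)
top_article = rerank(query, candidates, overlap_score)[0]
```

Precomputing `INDEX` is what keeps the first stage fast: at call time only the query needs embedding, and the expensive ranker sees at most `k` candidates.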

This dual-stage approach, fast retrieval followed by precise ranking, is a common pattern in search systems, but applying it within a live phone call adds real-time constraints that most search applications never face.

The entire retrieval-and-rank process must complete within the natural pause of a conversation, before the caller notices any delay.

The same retrieval system also powers Airbnb's customer support chatbot and Help Centre search, making it a shared infrastructure.

Its effectiveness is continuously measured using Precision@N, the proportion of top N recommended articles that are actually relevant.

Help Article Retrieval and Ranking (Source)

This allows teams to track quality and iterate based on real usage data.
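Precision@N is simple to compute once relevance labels exist. The sketch below divides by N, which is one common convention; the article identifiers and labels are made up for illustration.

```python
def precision_at_n(recommended, relevant, n):
    """Fraction of the top-n recommended articles that are labelled relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = recommended[:n]
    return sum(1 for article in top if article in relevant) / n

recommended = ["refund-timeline", "cancel-booking", "refund-methods", "host-payouts"]
relevant = {"refund-timeline", "refund-methods"}

p_at_1 = precision_at_n(recommended, relevant, 1)  # the single top result is relevant
p_at_3 = precision_at_n(recommended, relevant, 3)  # two of the top three are relevant
```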

Stage 4: Explaining What Was Found (Paraphrasing)

There is a UX problem that most automated phone systems ignore. When they send a help article link via text message, the caller can't see a title or preview while on the call.

If the system says, "I've sent you a link," the caller has no reason to trust that the link will actually solve their problem. Airbnb addressed this with a paraphrasing model.

Before sending the article link, the system generates a natural language summary of what it understood and what it is going to recommend.

If the caller said, "I need to cancel my reservation and request a refund," the system responds: "I understand your issue is about a refund request. We have sent you a link to resources about this topic." The implementation is elegant in its simplicity.

Rather than generating paraphrases from scratch, which would introduce latency and risk of hallucination, Airbnb's UX writers created a curated set of standardised summaries for common scenarios.

During live serving, the caller's transcribed query is matched to the nearest curated summary using text embedding similarity. A calibrated similarity threshold ensures only high-quality matches are used. Manual evaluation confirmed precision exceeding 90%.
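A minimal sketch of threshold-gated matching, assuming a toy word-count embedding in place of a real text embedding model. The first summary is taken from the example above. The second summary and the threshold value are hypothetical, and a real threshold would be calibrated on labelled data.

```python
import math
import re
from collections import Counter

CURATED_SUMMARIES = [
    "I understand your issue is about a refund request.",
    "I understand your issue is about changing a reservation.",  # hypothetical
]
THRESHOLD = 0.35  # hypothetical value

def embed(text):
    """Toy embedding: lowercase word counts with punctuation stripped."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def pick_summary(query, summaries=CURATED_SUMMARIES, threshold=THRESHOLD):
    """Return the nearest curated summary, or None if nothing is close enough."""
    q = embed(query)
    best = max(summaries, key=lambda s: cosine(q, embed(s)))
    return best if cosine(q, embed(best)) >= threshold else None

matched = pick_summary("I need to cancel my reservation and request a refund")
unmatched = pick_summary("the wifi is broken")  # falls below the threshold
```

Returning `None` instead of the nearest summary is the safety valve: when nothing clears the threshold, the system can skip the paraphrase rather than confirm something the caller never said.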

This paraphrasing step serves a psychological function as much as a practical one. It confirms to the caller that the system understood their problem correctly.

In experiments, presenting the paraphrased summary before sending the article link increased user engagement with the article content and improved self-resolution rates, reducing the need for human agent assistance.


In a Nutshell

What makes this system remarkable is that all four models run in sequence during a live phone call without perceptible delay.

The caller speaks.
ASR transcribes in real time.
Intent detection classifies in under 50 milliseconds.
Semantic retrieval finds candidate articles in around 60 milliseconds.
The ranking model selects the best one.
The paraphrasing model picks the closest curated summary.
The IVR speaks the summary aloud and sends the article link.

All before the caller has time to wonder what is happening. When the system cannot resolve the issue, whether because the intent is ambiguous, the caller explicitly requests a human, or the issue needs investigation, the call is routed to a support agent.

The agent receives the transcription, the detected intent, and the context.

The caller does not need to repeat themselves. The conversation picks up where the automation left off. This handoff design is where the PM decisions live.
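One way to picture the handoff is as a single context object that every stage fills in and that travels with the call. The sketch below is purely illustrative: the field names, the decision rules, and the action labels are assumptions, not Airbnb's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallContext:
    """Hypothetical record accumulated as the call moves through the pipeline."""
    transcript: str
    intent: Optional[str] = None
    article: Optional[str] = None
    summary: Optional[str] = None
    wants_human: bool = False

def next_step(ctx):
    """Decide between self-service and a human agent."""
    if ctx.wants_human or ctx.intent is None:
        return "route_to_agent"
    if ctx.article is None or ctx.summary is None:
        return "route_to_agent"
    return "speak_summary_and_send_link"

def agent_briefing(ctx):
    """What the agent sees, so the caller does not have to repeat anything."""
    return {
        "transcript": ctx.transcript,
        "detected_intent": ctx.intent,
        "article_sent": ctx.article,
    }

resolved = CallContext("where is my refund", intent="Missing Refund",
                       article="refund-timeline", summary="About a refund request.")
ambiguous = CallContext("it is complicated")
```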

The system is not trying to replace human agents.

It is trying to ensure that when a human is needed, they come with full context. And when one isn't needed, the caller gets an answer faster than any agent could.
