How AI Chatbots Actually Work: A Plain-English Architecture Guide (2026)

What's actually happening when you type into an AI chatbot? Here's the full architecture — ingestion, embeddings, retrieval, generation, and escalation — explained without the jargon, with real examples from production systems.

By Manasth Soni · April 27, 2026 · 11 min read

Short answer: A modern AI chatbot is six components stitched together: an ingestion pipeline that pulls in your content, an embedding model that turns each chunk of text into a vector, a vector database that stores them, a retriever that finds the most relevant chunks for each question, an LLM (GPT-4, Claude, Gemini) that writes the answer using those chunks as evidence, and an escalation layer that hands off to a human when the AI gets stuck. The magic isn't in any one piece — it's in how they're wired together.

This post walks through the entire pipeline with no math, no code (mostly), and no marketing hype. By the end you'll know what's happening when you click "send," why some chatbots feel sharp and others feel useless, and which architectural decisions actually matter when you're picking a vendor.

If you've already read the RAG explainer, this post is the broader system view. RAG is one component; this is everything around it.

The 30-second version

When you type a question into a well-built AI chatbot, here's what happens in roughly two seconds:

  1. The chatbot embeds your question — converts your text into a list of numbers that represents its meaning.
  2. It searches a vector database for the chunks of your business's content most similar to that meaning.
  3. It builds a prompt that includes those chunks plus your question plus instructions like "answer in 3 sentences, refuse to discuss competitors, capture an email if intent is high."
  4. It calls an LLM (the "brain" — GPT-4 / Claude / Gemini), which writes a draft answer.
  5. It applies post-processing — checks for hallucinations, strips PII, decides if it should escalate to a human.
  6. It streams the answer back to your screen, often token-by-token.

That's the user-facing flow. Behind the scenes, an entirely separate process — ingestion — runs whenever your content changes, refreshing the vector database so the chatbot stays current.
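To make the wiring concrete, here is that query-time loop as a minimal Python sketch. Every helper in it (embed_query, search_chunks, build_prompt, call_llm, post_process) is a hypothetical stand-in for a component described later in this post, not any vendor's actual API.

# Minimal sketch of the query-time loop. Every helper below is a
# hypothetical placeholder for a component covered later in this post.
def answer(question: str, history: list[dict]) -> dict:
    query_vector = embed_query(question)              # step 1: embed the question
    chunks = search_chunks(query_vector, top_k=5)     # step 2: vector search
    prompt = build_prompt(chunks, history, question)  # step 3: assemble the prompt
    draft = call_llm(prompt)                          # step 4: the LLM writes a draft
    checked = post_process(draft, chunks)             # step 5: hallucination/PII checks
    if checked["should_escalate"]:                    # step 6: hand off if needed
        return {"answer": checked["text"], "escalate": True}
    return {"answer": checked["text"], "escalate": False, "sources": checked["sources"]}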

Let's go through each piece.

1. Ingestion: getting your content into the chatbot

Before the chatbot can answer anything about your business, it needs to know about your business. The ingestion pipeline is what makes that happen.

It runs in three steps:

Crawl

The chatbot platform's crawler follows links from a starting URL (usually your domain root) and downloads each page. Modern crawlers handle JavaScript-rendered pages — they actually wait for client-side React to render before reading the DOM, because if your content lives in useEffect calls, a static fetch will see a blank page.
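As a rough illustration, the "wait for the page to render" step can be done with a headless browser. This sketch uses Playwright's Python API; it's an assumption for illustration, since each platform runs its own crawler with politeness rules, retries, and link queues.

# Sketch: fetching a JavaScript-rendered page with a headless browser
# (Playwright). Real crawlers add politeness, retries, and a link queue.
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for client-side rendering
        html = page.content()                     # read the rendered DOM
        browser.close()
    return html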

You can also bypass the crawler entirely and upload PDFs, paste text, connect a Notion workspace, sync from a Google Drive folder, or pull from a GitHub repo. The form doesn't matter — the goal is to get the words into the platform.

Clean

Raw HTML is full of junk: navigation menus, footers, cookie banners, ads. Useful content is maybe 20% of the bytes. The cleaner strips boilerplate and leaves the meaningful prose: headings, paragraphs, lists, tables.

This step matters more than people realize. Sloppy cleaning means your chatbot answers questions about your menu navigation instead of your product. Good cleaning preserves semantic structure (h2 stays an h2) so retrieval can use it.
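Here's a toy version of that cleaning step using BeautifulSoup: drop obvious non-content tags, keep headings, paragraphs, lists, and table cells. Production cleaners do far more (cookie banners, ads, duplicated footers), so treat this as a sketch of the idea.

# Sketch: stripping obvious boilerplate from raw HTML while keeping the
# semantic elements retrieval cares about. Production cleaners go further.
from bs4 import BeautifulSoup

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer", "header", "aside"]):
        tag.decompose()  # remove non-content elements entirely
    blocks = []
    for el in soup.find_all(["h1", "h2", "h3", "p", "li", "td"]):
        text = el.get_text(" ", strip=True)
        if text:
            # keep a lightweight marker so headings survive into chunking
            prefix = "# " if el.name in ("h1", "h2", "h3") else ""
            blocks.append(prefix + text)
    return "\n".join(blocks)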

Chunk

Even clean documents are too long to use whole. The chunker splits each document into smaller pieces — typically 500-4,000 characters each, anywhere from a paragraph to a short section.

Chunking is where good chatbots and bad ones diverge. A bad chunker splits sentences in half. A good one respects paragraph boundaries, keeps headings with the content under them, and overlaps slightly between chunks so a sentence that lives at a chunk boundary still gets context.

Each chunk also gets metadata stamped on: its source URL, its position in the document, the timestamp it was crawled. This metadata lets the chatbot show citations later.
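A stripped-down chunker that respects paragraph boundaries, overlaps neighbours, and stamps metadata might look like the sketch below. The 1,500-character budget and one-paragraph overlap are illustrative values, not any platform's actual settings.

# Sketch: paragraph-aware chunking with overlap and metadata.
# The size budget and overlap here are illustrative, not tuned.
from datetime import datetime, timezone

def chunk_document(text: str, source_url: str, max_chars: int = 1500) -> list[dict]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        if current and sum(len(p) for p in current) + len(para) > max_chars:
            chunks.append(current)
            current = current[-1:]  # overlap: carry the last paragraph forward
        current.append(para)
    if current:
        chunks.append(current)
    return [
        {
            "text": "\n\n".join(parts),
            "source_url": source_url,  # lets the chatbot cite its sources later
            "position": i,             # where this chunk sat in the document
            "crawled_at": datetime.now(timezone.utc).isoformat(),
        }
        for i, parts in enumerate(chunks)
    ]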

2. Embedding: turning chunks into vectors

Each chunk gets passed through an embedding model — a separate AI whose only job is to convert text into a vector of numbers (typically 768 or 1,536 dimensions).

The vector is a mathematical fingerprint of the chunk's meaning. Two chunks with similar meanings produce similar vectors, even if they share no actual words.

For example:

  • "Our refund window is 30 days from delivery" → some vector V1.
  • "You can return items within a month of receiving them" → vector V2.

V1 and V2 will be very close to each other in vector space, because the embedding model has learned what those phrases mean, not just which words appear in them. This is what makes vector search beat keyword search for question-answering: the user can ask "how long do I have to send something back?" and still match a paragraph that says "refund window is 30 days."
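You can see this behaviour with a few lines of code. The sketch below assumes OpenAI's Python SDK and one of its embedding models; any of the models mentioned next would work the same way.

# Sketch: embedding two phrasings of the same policy and comparing them.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v1 = embed("Our refund window is 30 days from delivery")
v2 = embed("You can return items within a month of receiving them")
v3 = embed("Shipping is free on orders over $50")

print(cosine_similarity(v1, v2))  # high: same meaning, different words
print(cosine_similarity(v1, v3))  # lower: unrelated topics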

The embedding model is itself an AI (a transformer, like the LLM, but smaller and specialized). Common choices: OpenAI's text-embedding-3-large, Cohere's embed-english-v3, open-source models like bge-large or nomic-embed. The choice of embedding model affects retrieval quality more than most teams realize.

3. Vector database: storing the embeddings

The vectors get saved to a vector database — a system optimized for "find the N vectors closest to this one" queries.

Popular options: Pinecone, Weaviate, Qdrant, Chroma, pgvector (Postgres extension), or the chatbot platform's proprietary store. They all do roughly the same thing: index millions of high-dimensional vectors so that nearest-neighbor search runs in milliseconds instead of minutes.

The database also stores each vector's metadata (source URL, document title, crawl timestamp), which gets returned alongside the vector when retrieval happens.
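For a concrete picture, here's a minimal sketch using Chroma, one of the stores listed above. The add-vectors-with-metadata, then query-by-nearest-neighbour pattern is the same across all of them. It assumes the chunk dicts and embed helper from the earlier sketches.

# Sketch: storing chunks and running a nearest-neighbour query with Chroma.
# Assumes `chunks` (from the chunking sketch) and `embed` (from the
# embedding sketch); other vector databases follow the same add/query shape.
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="site_content")

collection.add(
    ids=[f"chunk-{c['position']}" for c in chunks],
    embeddings=[embed(c["text"]) for c in chunks],
    documents=[c["text"] for c in chunks],
    metadatas=[{"source_url": c["source_url"]} for c in chunks],
)

results = collection.query(
    query_embeddings=[embed("how long do I have to send something back?")],
    n_results=5,  # the "top N" the retriever hands to the LLM
)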

4. Retrieval: finding the right chunks at query time

Now the user asks something. Here's what happens:

  1. The user's question gets embedded through the same embedding model. You get a query vector.
  2. The chatbot searches the vector database for the chunks whose vectors are most similar to the query vector. Similarity is usually measured by cosine distance — a fancy way of saying "how close are these two arrows pointing?"
  3. The top N matches are returned — typically 3 to 8.

A good chatbot doesn't stop at pure vector search. It often layers in:

  • Hybrid search — combines vector similarity with traditional keyword matching (BM25). Vector search is great for paraphrases; keyword search catches exact phrases (product names, error codes) where the embedding model under-weights specificity.
  • Re-ranking — a second model takes the top 20 candidates and re-orders them by a finer-grained relevance score. Slower but more accurate.
  • Query rewriting — for ambiguous questions, the chatbot may rewrite the user's input to be more retrieval-friendly before searching. (E.g., "how much?" gets expanded to "how much does the product cost?" with context from the conversation.)

These layers are why a good chatbot can answer accurately while a cheap one keeps surfacing irrelevant chunks.
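To show what the hybrid layer amounts to, here's a toy blend of a keyword score (BM25, via the rank_bm25 package) with a vector similarity score. The 50/50 weighting is illustrative; real systems tune it or lean on the database's built-in hybrid mode. It assumes the embed and cosine_similarity helpers from earlier, plus chunks that carry precomputed embeddings.

# Sketch: blending BM25 keyword scores with vector similarity.
# Assumes `embed` and `cosine_similarity` from earlier, and that each
# chunk dict carries a precomputed "embedding". Weights are illustrative.
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, chunks: list[dict], top_k: int = 5) -> list[dict]:
    corpus = [c["text"].lower().split() for c in chunks]
    bm25 = BM25Okapi(corpus)
    keyword_scores = bm25.get_scores(query.lower().split())

    query_vec = embed(query)
    vector_scores = [cosine_similarity(query_vec, c["embedding"]) for c in chunks]

    # normalize keyword scores to 0-1 so the two signals are comparable
    max_kw = max(keyword_scores) or 1.0
    combined = [
        0.5 * (kw / max_kw) + 0.5 * vs
        for kw, vs in zip(keyword_scores, vector_scores)
    ]
    ranked = sorted(zip(combined, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]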

5. Generation: the LLM writes the answer

Now we have the user's question and 3-8 relevant chunks. The chatbot constructs a prompt that looks roughly like this:

You are Acme Co.'s support assistant. Use the context below to answer
the user's question. If the answer isn't in the context, say you
don't know and offer to connect them to a human.
 
Context:
[chunk 1: "Returns must be requested within 30 days..."]
[chunk 2: "Items must be in original packaging..."]
[chunk 3: "Refunds are processed within 5 business days..."]
 
Conversation so far:
User: hey
Assistant: Hi! What can I help you with today?
User: what's your refund policy?
 
User question: what's your refund policy?

This prompt — the system instructions plus the retrieved chunks plus the conversation history plus the new question — gets sent to an LLM. The LLM (GPT-4, Claude, Gemini, or an open-source model like Llama) reads it all and writes a response token by token.
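In code, that assemble-and-call step is a single API request. Here's a sketch using the OpenAI SDK; the Anthropic and Gemini SDKs follow the same shape, and the model name and temperature are illustrative choices.

# Sketch: sending the assembled prompt to an LLM. Assumes the OpenAI
# Python SDK; Anthropic and Gemini calls follow the same pattern.
from openai import OpenAI

client = OpenAI()

def generate_answer(system_prompt: str, context_chunks: list[str],
                    history: list[dict], question: str) -> str:
    context = "\n\n".join(context_chunks)
    messages = [
        {"role": "system", "content": f"{system_prompt}\n\nContext:\n{context}"},
        *history,  # prior turns: [{"role": "user" / "assistant", "content": ...}]
        {"role": "user", "content": question},
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.2,  # keep answers close to the provided context
    )
    return response.choices[0].message.content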

The choice of LLM affects answer quality. Our GPT-4 vs Claude vs Gemini comparison covers the tradeoffs in detail — Claude tends to follow instructions most reliably, GPT-4 has the broadest tool ecosystem, Gemini wins on cost at scale.

The LLM is the "brain," but the brain is only as good as the chunks fed to it. Bad retrieval → bad context → bad answer, even with the world's best LLM. This is why teams who switch from GPT-3.5 to GPT-4 sometimes see no improvement: the bottleneck wasn't the model, it was the retrieval.

6. Post-processing: the safety net

Before the answer reaches the user, several checks usually run:

  • Citation extraction — pulling out which chunks the LLM actually used, so the chatbot can show "sources" links.
  • PII redaction — masking any phone numbers / credit card numbers / SSNs the LLM might have leaked from context.
  • Hallucination detection — comparing the answer to the source chunks; flagging if the LLM said something not supported by retrieved content. This is hard to do perfectly but even basic checks catch the worst cases.
  • Toxicity / safety filters — making sure the model's output won't embarrass the company.
  • Intent scoring — was this conversation high-intent? Should we capture an email or escalate?

The post-processing layer is what turns a generic LLM call into a production chatbot. It's also where most cheap chatbots cut corners and where most expensive ones earn their price.
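Even basic versions of these checks are cheap to build. Below is a sketch of two of them, regex-based PII masking and a crude word-overlap grounding check; production systems use dedicated PII and grounding models, so this only catches the obvious cases.

# Sketch: two simple post-processing checks. Real pipelines use dedicated
# PII and grounding models; this only catches obvious cases.
import re

PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_pii(answer: str) -> str:
    answer = PHONE.sub("[redacted phone]", answer)
    return SSN.sub("[redacted SSN]", answer)

def looks_grounded(answer: str, chunks: list[str], threshold: float = 0.5) -> bool:
    # Crude check: what fraction of the answer's words appear in the
    # retrieved chunks? Low overlap is a hallucination warning sign.
    answer_words = set(re.findall(r"[a-z']+", answer.lower()))
    chunk_words = set(re.findall(r"[a-z']+", " ".join(chunks).lower()))
    if not answer_words:
        return True
    overlap = len(answer_words & chunk_words) / len(answer_words)
    return overlap >= threshold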

7. Escalation: handing off to a human

The final piece: knowing when not to answer.

A well-built chatbot recognizes signals like:

  • The user asked the same question twice (the answer wasn't satisfying).
  • The user is using frustrated language ("this is ridiculous", "speak to a human", lots of exclamation marks).
  • The retrieved chunks don't actually contain the answer (the LLM had to hedge).
  • The user crossed a high-intent threshold (asked about pricing 3 times, requested a demo, mentioned a budget).

When any of these fire, the chatbot offers a handoff: "Want me to get a human on this?" If the user accepts, the conversation is pushed to the operator dashboard and a real person picks it up with the full context already loaded. No re-introduction.
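A simplified version of that signal detection might look like the sketch below. The frustration phrases and thresholds are illustrative; production systems combine rules like these with intent-scoring models.

# Sketch: simple escalation heuristics. The phrases and thresholds are
# illustrative; production systems combine these with intent models.
FRUSTRATION = ("speak to a human", "this is ridiculous", "talk to someone", "agent")

def should_escalate(history: list[dict], retrieval_score: float,
                    pricing_questions: int) -> bool:
    user_turns = [m["content"].lower() for m in history if m["role"] == "user"]

    repeated_question = len(user_turns) >= 2 and user_turns[-1] == user_turns[-2]
    frustrated = any(p in user_turns[-1] for p in FRUSTRATION) if user_turns else False
    weak_retrieval = retrieval_score < 0.3   # the chunks probably lack the answer
    high_intent = pricing_questions >= 3     # hand high-intent leads to a human

    return repeated_question or frustrated or weak_retrieval or high_intent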

The handoff is the difference between a chatbot that helps and a chatbot that traps. We've covered the design pattern in detail — the short version is: AI for volume, humans for what matters, and the seam between them should be invisible to the user.

What this means for picking a chatbot

Most vendor pitches focus on the LLM ("powered by GPT-4!"). But the LLM is the most commoditized part of the stack: every vendor calls the same handful of models. The actual differentiation lives in the parts they don't talk about:

  1. How does ingestion handle JS-heavy sites? Ask for a live crawl of your site in the demo. Look at the chunks.
  2. What embedding model? Can it be swapped? Budget tiers often use cheap embeddings; quality tiers don't.
  3. Hybrid search or pure vector? Pure vector misses exact-match queries. Hybrid is the modern default.
  4. How do they handle hallucinations? Look for an explicit answer, not "the LLM is good now." Good answers mention grounding, refusal prompts, source citations.
  5. What does escalation look like? Is there a real operator dashboard, or is it "the chatbot emails you the conversation"?

The vendors winning right now are the ones who've built every layer well and given you visibility into each. The ones who say "just plug in your URL, we handle the rest" are usually hiding shortcuts at one of the layers.

What's coming next in chatbot architecture

Three architectural shifts worth tracking:

  • Agentic chatbots — chatbots that don't just answer questions, they take actions. Booking a meeting, processing a refund, updating a ticket. The retrieval-and-respond loop becomes a retrieve-plan-act-verify loop. Early but real.
  • Multi-modal context — chatbots that understand images, audio, video. A user uploads a screenshot of an error; the chatbot recognizes the error and walks through a fix. GPT-4o and Gemini already enable this; chatbot platforms are still catching up.
  • Long-context models — models with million-token context windows let some teams skip retrieval entirely and just stuff the whole knowledge base into every prompt. Expensive, but for narrow domains it's becoming viable.

The fundamental architecture won't change much in the next 12-18 months. The pieces will get better — embeddings sharper, retrieval more clever, LLMs cheaper — but the six-component shape (ingest → embed → store → retrieve → generate → escalate) is going to stay.

The bottom line

An AI chatbot is not magic. It's a deterministic pipeline with a probabilistic generator at the end. Every step has design decisions that affect the final user experience. Vendors who treat any one step as a black box are vendors you'll outgrow.

If you understand the architecture, you can ask sharper questions when shopping, debug bad answers when they happen, and build a roadmap of what to fix when conversion lags. That's the whole reason this post exists.

If you want to see an end-to-end implementation that gets each layer right — visible chunks, hybrid retrieval, Claude-by-default with model swaps available, native operator handover — Chatmount has a free tier. Setup takes about five minutes and you'll see every layer at work on your own content.

About the author
Manasth Soni
Founder, Chatmount

Building Chatmount — the AI chatbot for lead generation with native human handover. Writing about what teams actually ship vs what AI chatbot vendors say in marketing.
