Standard large language models hallucinate at rates between 3% and 10%. For a chatbot handling 10,000 monthly conversations, that’s up to a thousand confidently wrong answers per month – refunds your bot promised that don’t exist, return windows it invented, prices it made up. If you want to build an AI chatbot for your website that doesn’t do this, the technical choices you make in the first hour matter more than which platform’s free trial you sign up for.
Most tutorials skip those choices. They walk you through clicking buttons on Tidio or Landbot, copy the embed code, and call it done. This isn’t that guide.
What you’re actually deciding
You have a website. You want a bot that answers customer questions from your own content – pricing, policies, product specs, hours – and hands off to a human when it can’t. Three real paths exist:
- No-code widget (Tidio, Landbot, Elfsight, Chatbase): paste a URL, get a bot in minutes. Good for FAQs, weak at anything custom.
- RAG on top of an LLM API (OpenAI, Anthropic, Gemini + a vector store): more work, more control, lower hallucination if done right.
- Fully managed (Intercom Fin, Ada, Zendesk AI): expensive, fast, designed for support teams already on those platforms.
The decision isn’t really about features. It’s about who owns the failure when the bot says something wrong – you, the platform, or a vendor with an SLA. That question should drive the choice, not which free plan has a prettier widget.
Under the hood: what separates a useful bot from a confident liar
Two architectures. Two very different failure modes. A pure LLM bot generates answers from training data – no access to your pricing page, your return policy, your actual stock. Wrong question, wrong answer, stated with total confidence. RAG – Retrieval-Augmented Generation – changes the mechanic: the model pulls verified passages from your content first, then phrases the answer from what it found. Closed-book exam versus open-book. The LLM’s job shrinks from “know everything” to “write clearly.”
For a website chatbot, RAG is the only sensible default. A 2026 IJERT case study on a 100-page engineering knowledge site found that domain-constrained RAG with canonical URL filtering, token-based chunking, and a Pinecone-based vector index produced a verifiable knowledge base – the same site fed to an unconstrained LLM hallucinated freely. Every serious platform uses this architecture, whether they call it that or not.
The catch: if a no-code platform won’t tell you how it chunks your content or which embedding model it uses, assume it’s a black box. That’s fine for an FAQ bot. Not fine if a wrong answer costs you a customer.
RAG in 6 decisions, not 6 clicks
Every platform handles “paste your URL and deploy.” Here’s what actually determines whether your chatbot works.
1. Pick your knowledge source carefully
Scraping your live site sounds easy – and it pulls navigation menus, cookie banners, and footer junk straight into your knowledge base. Turns out traditional knowledge bases are written for humans: long documents, mixed topics, unclear instructions. When AI retrieves that kind of content, it struggles to find the most relevant part, which is exactly how hallucinations start. Curate first. Strip nav. Split product pages from policy pages before you upload anything.
2. Get chunking right
On OpenAI’s File Search (as of 2025), the default is 800-token chunks with 400-token overlap between consecutive chunks. That works fine for support docs. For product catalogs with short specs – part numbers, dimensions, prices – smaller chunks (300-500 tokens) retrieve better because a single 800-token chunk can blur multiple products together.
3. Force grounding in the prompt
One line. “Answer only from the provided context. If the answer is not in the context, say you don’t know and offer to connect a human.” Without it, the model fills gaps with confident fiction. This is the single most effective hallucination guard, and it costs nothing to add.
4. Configure escalation triggers
This is the part tutorials skip entirely. Decide which user phrases hand off to a human – “refund,” “speak to agent,” “cancel,” anger keywords. The California Management Review (2026) documents what happens when you don’t: even when customers explicitly ask to speak to a human, chatbots often respond by asking for more details first. Some users have resorted to repeating “speak to a human” – or typing “chicken nuggets” – just to break out of the loop. That’s the kind of UX failure that ends up on Reddit and costs you more than the chatbot saves.
5. Test with messy inputs
Real users don’t type clean queries. Test with typos, fragments, rage-typed all-caps. If your bot only handles the demo questions, it’ll fail on day one.
6. Add source citations to responses
Show which doc or page the answer came from. It builds user trust – and when the bot is wrong, you know exactly which doc to fix.
Here’s an honest question worth sitting with before you build: if your bot is wrong 5% of the time, and you’re fielding 10,000 chats a month, are 500 wrong answers acceptable? The answer depends entirely on what those wrong answers are about – and that’s a product decision, not a technical one.
The hidden cost layer
Sticker prices lie. The advertised $20/month plan is almost never what you actually pay. Here’s what gets added on (figures as of 2026):
| Hidden cost | Typical amount | Source |
|---|---|---|
| Human agent seats (for the 35-50% AI can’t resolve) | $29-$169/agent/month | Elfsight 2026 |
| Per-resolution AI fees (Intercom Fin) | ~$0.99 per AI chat | Docuyond, chatty.net |
| Knowledge base upkeep | 5-15 staff hours/month | Elfsight 2026 |
| Vector storage (OpenAI direct) | $0.10/GB/day after 1 GB free | OpenAI docs |
The per-resolution model is the one that catches teams off-guard. A Black Friday traffic spike – say, 10,000 resolutions in a day – is suddenly a $9,900 line item. Startups and seasonal businesses should calculate their worst-case monthly chat volume before committing to usage-based pricing like Intercom Fin’s (per Docuyond’s 2026 pricing comparison).
Building directly on the API? GPT-4o-mini runs roughly $0.004-$0.005 per conversation – meaning 5,000 monthly chats costs $20-$75 in API fees (Elfsight, 2026). The model itself is cheap. The platform layer on top is what costs money.
Limitations buried in the OpenAI docs
Building on OpenAI’s Assistants or Responses API with File Search? A few hard constraints that catch teams late.
OpenAI’s own Assistants v2 FAQ is direct about what File Search can’t do: no support for parsing images within documents, no support for retrievals over structured file formats like CSV or JSONL, and no way to modify chunking or embedding settings beyond what’s documented. Docs say it handles your content – but if your product catalog lives in a CSV, you have to convert it to text first. If your PDFs contain pricing tables as images, those tables are invisible to the retriever entirely. Worth knowing before you’re three days into integration.
# Minimal Python skeleton for a grounded chatbot
from openai import OpenAI
client = OpenAI()
vector_store = client.vector_stores.create(name="site_kb")
# Upload pre-cleaned .md or .txt files only - skip CSVs and image-heavy PDFs
client.vector_stores.files.upload(
vector_store_id=vector_store.id,
file=open("policies.md", "rb")
)
assistant = client.beta.assistants.create(
model="gpt-4o-mini",
instructions=(
"Answer ONLY from retrieved context. "
"If unsure, say so and offer human handoff."
),
tools=[{"type": "file_search"}],
tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}
)
Honest limitations
Well-built RAG chatbots still hit ceilings. Five failure modes come up repeatedly in the research: failure to understand requests, inability to solve complex problems, poor integration with human agents, lack of humanization, and lack of personalization (California Management Review, 2026). Only one of those is really an AI problem. The other four are design problems your team owns.
And the resolution ceiling is real. A Gartner survey found only 14% of customer service issues are fully resolved in self-service. Your bot will not handle everything. Planning for that – staffing handoffs, training your team on escalation context – determines whether the bot is a win or a complaint generator. This is the gap most project plans don’t have a line item for.
Fine-tuning is almost never the fix for gaps. Fine-tuning can cost $50,000-$200,000 per iteration (Magic Suite). If your bot is wrong, the fix is almost always cleaner data, not a more expensive model – a much cheaper fix than most teams assume.
FAQ
Do I need to code to build a website chatbot?
No. No-code platforms cover most use cases.
Will my chatbot work on WordPress, Shopify, and custom-built sites?
Every major platform – Tidio, Landbot, Elfsight, Chatbase, Intercom – ships a JavaScript snippet you paste before the closing </body> tag. WordPress and Shopify also have native plugins so you don’t touch theme files. The one case to watch: single-page apps (React, Vue, Next.js) sometimes need the snippet loaded after the router initializes, otherwise the widget vanishes on route changes. Check the platform’s SPA documentation before assuming the default install works.
How long until my chatbot is actually useful?
You’ll have something live in an afternoon. Actually useful? That takes 2-4 weeks of watching real conversations, fixing the questions it got wrong, and adding the gaps to your knowledge base. Teams that treat it as a launch-and-forget project almost always regret it – the bot doesn’t improve on its own.
Next step: open your three most-visited support pages, paste them into a single Markdown file, and use that as the entire knowledge base for your first prototype. Don’t scrape your whole site. Start with the 20% of content that answers 80% of questions, and grow from there.