
Build a ChatGPT API Chatbot: The Gaps Nobody Tells You

Most tutorials show you the basics. This one walks through conversation memory traps, rate limit quirks, and pricing gotchas learned from real deployments.

8 min read · Advanced

Why does every chatbot tutorial I follow work perfectly for one message, then the bot completely forgets our conversation by the second question?

The ChatGPT API has no memory.

Most tutorials skip this. They show you how to send a message and get a response. Great. But try to build an actual conversation – one where the bot remembers what you just said – and you get stuck. Your code needs to track state, store history, and resend everything on every call. Token costs triple. Nobody mentions that in the five-minute quickstart.

This tutorial walks through building a chatbot that actually works beyond the demo. We start with the memory problem, because that’s where everyone gets stuck. Then rate limits, cost traps, the stuff that only breaks when real users touch your bot.

The Conversation Memory Problem (Start Here)

The GPT-4 Technical Report won’t tell you this outright: the API is stateless. Each request is isolated. The model doesn’t remember your last message. It doesn’t even know there was a last message.

ChatGPT on the web feels like it has memory. It doesn’t. OpenAI’s interface stores your conversation history and secretly resends it with every new message. You have to do the same thing in your chatbot.

Send this:

{
  "model": "gpt-4",
  "messages": [
    {"role": "user", "content": "My name is Alex"}
  ]
}

Then send this:

{
  "model": "gpt-4",
  "messages": [
    {"role": "user", "content": "What's my name?"}
  ]
}

The bot has no idea. The second request is a blank slate.

Fix it like this – resend the entire conversation every time:

{
  "model": "gpt-4",
  "messages": [
    {"role": "user", "content": "My name is Alex"},
    {"role": "assistant", "content": "Nice to meet you, Alex!"},
    {"role": "user", "content": "What's my name?"}
  ]
}

Now it works. But look – your token count tripled. You’re paying for the same context over and over. After 10 exchanges, you’re sending 20+ messages with every call. API costs spiral for chatbots because of this.

Think of it like carrying every receipt from every purchase in your wallet. Eventually you can’t close it. Same with conversation history – you hit a token ceiling, the API chokes, or your bill explodes. (As of March 2026, GPT-4.1 costs around $2/1M input tokens according to OpenAI’s pricing page.)

Pro tip: Use a rolling window. Store only the last 3-5 exchanges instead of the entire conversation. Most chatbots don’t need infinite memory – just enough context to feel natural. Cuts token usage by 50-70% without breaking conversational flow.

Build the Chatbot (With Memory from the Start)

You’ll need an API key and a way to handle requests. Let’s build this in Python – the OpenAI library makes the HTTP calls easier.

Install the library:

pip install openai

Grab your API key from platform.openai.com. Gotcha: even if you have free credits, the API won’t work until you add a payment method to your account. Community reports confirm this – not in the quickstart guide.

Core loop with memory built in:

from openai import OpenAI

client = OpenAI(api_key="your-api-key-here")  # better: read it from the OPENAI_API_KEY env var

conversation_history = [
    {"role": "system", "content": "You are a helpful assistant."}
]

def chat(user_message):
    conversation_history.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-4",
        messages=conversation_history
    )

    assistant_message = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": assistant_message})

    return assistant_message

print(chat("My favorite color is blue"))
print(chat("What's my favorite color?"))

This works. The bot remembers. But there’s a trap.

After 20 turns? Your conversation_history array has 40+ entries. Thousands of tokens on every request. Tier 1 accounts (new as of early 2026) have a 500,000 tokens-per-minute limit per OpenAI’s rate limits docs. Sounds generous. A busy chatbot with 10 concurrent users hits that in seconds.

Limit memory to the last 3 exchanges:

MAX_HISTORY = 7  # system message + 3 user/assistant pairs

def chat(user_message):
    global conversation_history  # we reassign the list below
    conversation_history.append({"role": "user", "content": user_message})

    # Keep the system message, trim everything but the most recent exchanges
    if len(conversation_history) > MAX_HISTORY:
        conversation_history = [conversation_history[0]] + conversation_history[-(MAX_HISTORY - 1):]

    response = client.chat.completions.create(
        model="gpt-4",
        messages=conversation_history
    )

    assistant_message = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": assistant_message})

    return assistant_message

Token usage stays flat instead of growing forever.
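To watch that happen, a rough token count is enough for monitoring. Here's a stdlib-only sketch using the common heuristic of roughly four characters per token for English text – `estimate_tokens` is an illustrative name, and OpenAI's tiktoken library gives exact counts if you need them:

```python
def estimate_tokens(messages):
    """Rough token estimate: ~4 characters per token for English text,
    plus a small per-message overhead for role/formatting tokens."""
    chars = sum(len(m["content"]) for m in messages)
    return chars // 4 + 4 * len(messages)
```

Log this before each API call; if the number keeps climbing instead of plateauing, your trimming logic isn't firing.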

Rate Limits Hit Faster Than You Think

You launch your bot. Five users try it at the same time. It breaks.

You check usage: 80,000 tokens used, well under the 500,000 TPM limit. Why are you getting 429 errors?

Rate limits are quantized. Documentation from Azure and third-party developers shows OpenAI enforces limits over 1-10 second windows, not just per-minute averages. Five users send requests in the same second? You might burn through your per-second quota even though your per-minute usage looks fine.

The other limit people forget: requests per minute (RPM). You can have tons of TPM headroom but still hit RPM caps if your requests are small. A chatbot sending 100 short questions in 60 seconds will hit RPM before TPM.

Two ways to handle this:

  • Implement exponential backoff: 429 error? Wait 1 second, then 2, then 4, then retry.
  • Queue requests on your server instead of sending them all to OpenAI at once. Process them in controlled batches.

The error response includes a Retry-After header telling you how long to wait. Use it.
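Here's a minimal sketch of that retry loop. It's a generic stand-in, not the OpenAI SDK's own error handling: `make_request` is assumed to return a dict with a `status` field and an optional `retry_after` value pulled from the Retry-After header (the real SDK raises a `RateLimitError` exception instead):

```python
import random
import time

def call_with_backoff(make_request, max_retries=5):
    """Retry a request with exponential backoff, honoring Retry-After when present."""
    delay = 1.0
    for attempt in range(max_retries):
        response = make_request()
        if response.get("status") != 429:
            return response
        # Prefer the server's Retry-After value; fall back to exponential delay
        wait = float(response.get("retry_after", delay))
        time.sleep(wait + random.uniform(0, 0.1))  # jitter avoids synchronized retries
        delay *= 2
    raise RuntimeError("Rate limited after all retries")
```

The jitter matters: if five workers all back off for exactly 2 seconds, they all retry in the same instant and hit the burst limit again.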

The $50/Day Token Trap

You build a customer service bot. Works great in testing. You launch it to your user base.

10,000 users. Each sends 5 messages per day. Each conversation averages 500 tokens per API call (question + history + answer). That’s 10,000 × 5 × 500 = 25 million tokens per day.

GPT-4.1 pricing (around $2 per million input tokens as of March 2026 per MetaCTO’s cost analysis)? $50/day just for input. Add output tokens, you’re closer to $75/day, or $2,250/month.

Tutorials tell you the API is cheap. It is – until it scales.

Cut costs:

  1. Use gpt-3.5-turbo for simple queries. 10x cheaper, fast enough for most customer service questions.
  2. Set max_tokens to the smallest value that works. Responses usually 200 tokens? Set max_tokens: 250. The default is way higher and inflates your bill.
  3. Implement prompt caching if your system message or instructions are reused. Some models now support cached input pricing at 90% off (as of March 2026 – this may have changed).
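The first two tactics can live in one small routing function. This is a hypothetical sketch – the keyword list and the 250-token cap are placeholders you'd tune for your own traffic, not values from any OpenAI docs:

```python
# Hypothetical router: cheap model for routine questions, GPT-4 for the rest
SIMPLE_KEYWORDS = ("hours", "price", "shipping", "refund")

def pick_model(user_message):
    if any(word in user_message.lower() for word in SIMPLE_KEYWORDS):
        return "gpt-3.5-turbo"  # ~10x cheaper, fine for routine queries
    return "gpt-4"

def request_params(user_message):
    return {
        "model": pick_model(user_message),
        "max_tokens": 250,  # cap output; responses here average ~200 tokens
    }
```

Pass the resulting dict into `client.chat.completions.create(**request_params(msg), messages=...)` and the cheap path handles the bulk of your volume.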

Real question: do you even need the API, or can a scripted bot handle 80% of your queries?

Common Pitfalls (Production Only)

Errors you’ll see:

429 Too Many Requests – You hit a rate limit. Could be RPM, TPM, or a burst limit. Check response headers for x-ratelimit-remaining-tokens to see what you’re running out of.

“Exceeded your current quota” – Two usual culprits. Either you actually ran out of credits, or – more common – you forgot to add a payment method. Even free-tier users need a credit card on file.

“Error in body stream” – Response got cut off mid-generation. Prompt too long or server overloaded. Retry with a shorter context window.

System messages lose influence over time – One developer tested a system message instructing the bot to always answer incorrectly. After a few turns? The bot started self-correcting because user messages began to dominate the context. Don’t rely on system instructions for hard rules – validate outputs instead.

One more thing nobody mentions: conversation history stored in memory is gone when your server restarts. Need persistence? Save it to a database (PostgreSQL, Redis, whatever). Otherwise, every deployment wipes your users’ chat history.
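A minimal persistence sketch, assuming one JSON file per user under a `chat_history/` directory (a placeholder layout – in production you'd swap these two functions for PostgreSQL or Redis calls, as above):

```python
import json
from pathlib import Path

HISTORY_DIR = Path("chat_history")  # hypothetical storage location
HISTORY_DIR.mkdir(exist_ok=True)

def save_history(user_id, conversation_history):
    """Persist one user's conversation so it survives a restart."""
    (HISTORY_DIR / f"{user_id}.json").write_text(json.dumps(conversation_history))

def load_history(user_id):
    """Load a saved conversation, or start fresh with just the system message."""
    path = HISTORY_DIR / f"{user_id}.json"
    if path.exists():
        return json.loads(path.read_text())
    return [{"role": "system", "content": "You are a helpful assistant."}]
```

Call `save_history` after each exchange and `load_history` when a user reconnects; the interface stays the same when you move to a real database.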

Don’t Use This If

Questions have known answers? Don’t use the API. Build a decision tree or use keyword matching. Faster, cheaper, won’t hallucinate.

Need guaranteed accuracy – medical advice, legal guidance, financial calculations? Don’t use the API. The model can be confidently wrong. GPT-4 report noted the model “can suffer from hallucinations.”

Building a public bot with no rate limiting? Don’t use the API. Someone will spam it and burn through your quota in minutes.

Users expect instant responses under 100ms? Don’t use the API. Typical latency is 1-3 seconds. Fine for chat, terrible for autocomplete.

You’ll End Up With

A chatbot that:

  • Remembers the last few exchanges without infinite token growth
  • Handles rate limits with retries instead of crashing
  • Costs $0.02 per conversation instead of $0.50 because you tuned max_tokens and trimmed history
  • Saves conversation state to a database so users don’t lose context on restart

That’s what works in production. Everything else is a demo.

The code above gets you 80% there. The last 20% – database integration, user sessions, error logging – is standard web dev. Add Flask or FastAPI for the HTTP layer. Use PostgreSQL to store conversation_history per user_id. Deploy to Railway or Render.

Next step: build the basic bot, then load-test it with 10 simulated users hitting it at once. Watch where it breaks. That’s where you learn.

Frequently Asked Questions

Can I use my ChatGPT Plus subscription for API access?

No. ChatGPT Plus ($20/month as of 2026) gives you web access. API usage is billed separately per token. You need to add a payment method to platform.openai.com even if you’re a Plus subscriber.

How do I stop users from draining my API quota?

Set spending limits in your OpenAI account dashboard (under “Usage limits”). Implement per-user rate limiting in your application code – track requests per user per hour and block those who exceed a threshold. Most quota drain? One bug or malicious user, not organic traffic. Monitor your usage daily for the first week after launch.
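A per-user limiter can be as small as a sliding window of timestamps. This sketch keeps the window in process memory (the 50-per-hour threshold is a made-up example – for multiple server processes you'd back it with Redis instead):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600
MAX_REQUESTS_PER_HOUR = 50  # hypothetical threshold; tune for your traffic

_request_log = defaultdict(deque)  # user_id -> timestamps of recent requests

def allow_request(user_id, now=None):
    """Return True if this user is under their hourly quota, recording the request."""
    now = now if now is not None else time.time()
    log = _request_log[user_id]
    # Drop timestamps that have aged out of the window
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    if len(log) >= MAX_REQUESTS_PER_HOUR:
        return False
    log.append(now)
    return True
```

Check `allow_request(user_id)` before every API call and return a friendly "slow down" message when it says no – far cheaper than letting the request reach OpenAI.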

Why does my chatbot sometimes return nonsense or refuse to answer?

Conversation history might be too long, causing the model to lose coherence – trim it to the last 3-5 exchanges. Or content moderation filters are triggering on ambiguous input – rephrase or add context to the system message. Or the model hit a hallucination – GPT-4’s own technical report acknowledges this happens. For critical applications, validate outputs against a knowledge base before showing them to users.