Here’s the mistake: you find a ChatGPT API tutorial, copy the code, run it, and get AttributeError: module 'openai' has no attribute 'ChatCompletion'. The tutorial looks recent. The code looks clean. What happened?
The OpenAI Python library changed completely in November 2023. Every tutorial written before that date – and plenty written after – teaches syntax that doesn’t work anymore. You’re not doing anything wrong. The code itself is outdated.
This isn’t about building a toy chatbot. It’s about understanding why most Python developers hit rate limits they didn’t know existed, why token counting breaks their budget estimates, and why the “simple” API call pattern everyone teaches won’t scale past 100 users.
What Changed Between v0.28 and v1.x (And Why Your Tutorial Failed)
Every beginner tutorial still floating around uses this pattern:
```python
import openai

openai.api_key = "sk-..."

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
```
That stopped working on November 6, 2023. According to Microsoft’s migration documentation, when you run pip install openai today, you get version 1.x – a breaking change.
The new syntax requires instantiating a client:
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
```
Three things changed: you import the OpenAI class instead of the openai module, you create a client instance, and the method path switched from openai.ChatCompletion.create() to client.chat.completions.create().
Why does this matter beyond syntax? The old approach used global state. Your API key lived in a module-level variable that every part of your code touched. The new client pattern isolates configuration – you can run multiple clients with different keys, timeouts, or retry logic in the same program.
Method A: Hardcode Your Key (Fast, Fragile)
The fastest way to test is dropping your API key directly in the code:
```python
from openai import OpenAI

client = OpenAI(api_key="sk-proj-...your-key-here...")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain async in one sentence"}]
)
print(response.choices[0].message.content)
```
This works. It’s also how API keys leak into GitHub repos and end up on Reddit three hours later. Use this for throwaway experiments, never for anything you’ll commit.
Method B: Environment Variables (Production Standard)
The official OpenAI Python SDK reads your API key from the OPENAI_API_KEY environment variable by default. Set it once, never hardcode again:
```bash
# In your terminal (macOS/Linux):
export OPENAI_API_KEY="sk-proj-..."
```

```powershell
# In your terminal (Windows PowerShell):
$env:OPENAI_API_KEY="sk-proj-..."
```

```python
from openai import OpenAI

client = OpenAI()  # Automatically reads OPENAI_API_KEY

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the capital of France?"}]
)
print(response.choices[0].message.content)
```
For local development, drop your key in a .env file and load it with python-dotenv:
```bash
pip install python-dotenv
```

```
# .env file:
OPENAI_API_KEY=sk-proj-...
```

```python
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()          # Loads OPENAI_API_KEY from .env into the environment
client = OpenAI()      # Picks it up automatically
```
Add `.env` to your `.gitignore` immediately. This file should never touch version control.
Method B wins. It separates secrets from code, works across environments, and doesn’t require changing your script when you rotate keys.
Installing the SDK (And Checking You Got v1.x)
The OpenAI Python library requires Python 3.9 or higher. Install it:
```bash
pip install openai
```
Verify you’re on version 1.x or later (as of March 2026, the latest is v2.29.0):
```bash
pip show openai
```
If you see 0.28.1, something pinned an old version. Upgrade explicitly:
```bash
pip install --upgrade openai
```
Still stuck on 0.28? Check for a requirements.txt or pyproject.toml locking the version. Remove the pin and reinstall.
Your First API Call (With Real Error Handling)
Here’s the minimal script that actually handles failure:
```python
import os

from openai import OpenAI, AuthenticationError, RateLimitError, APIError

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

try:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain Python decorators in 20 words."}
        ],
        max_tokens=50,
        temperature=0.7
    )
    print(response.choices[0].message.content)
except AuthenticationError:
    print("Invalid API key. Check your OPENAI_API_KEY environment variable.")
except RateLimitError:
    print("Rate limit hit. Wait a moment and try again.")
except APIError as e:
    print(f"API error: {e}")
```
The model parameter matters. gpt-4o-mini costs significantly less than gpt-4o. According to OpenAI’s pricing page, GPT-4o-mini runs at roughly $0.15 per 1M input tokens versus $2.50+ for larger models as of March 2026.
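To see what those per-token numbers mean for a real workload, here's a back-of-the-envelope sketch. It reuses the $0.15 input figure above; the output price and the request counts are assumptions for illustration, so check the pricing page for current rates:

```python
# Rough cost estimate for a batch job. Prices are assumptions
# based on the figures quoted above, not live pricing.
INPUT_PRICE_PER_M = 0.15   # USD per 1M input tokens (gpt-4o-mini)
OUTPUT_PRICE_PER_M = 0.60  # USD per 1M output tokens (assumed)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a single request."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# 10,000 requests, each with a 2,000-token prompt and a 300-token reply:
total = 10_000 * estimate_cost(2_000, 300)
print(f"${total:.2f}")  # → $4.80
```

Note how the input side dominates: the prompts account for over 60% of the bill even at a 4x higher output price.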
The max_tokens parameter caps output length. Set it too high and you burn through your token budget on responses you don’t need. Set it too low and you get cut-off answers.
The Three Rate Limit Traps No One Warns You About
Rate limits aren’t just a number you read in the docs and forget. They break in non-obvious ways.
Trap 1: Quantization – Your Per-Minute Limit Is Actually Enforced Per Second
The OpenAI Help Center explains that a 60,000 requests-per-minute limit is enforced as 1,000 requests per second. If you send 5,000 requests in the first 5 seconds of a minute, you hit a 429 Too Many Requests error – even though you’re technically under 60,000 RPM.
Most batch processing scripts fail here. They queue up requests, fire them all at once, and immediately hit the wall.
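One defensive pattern is to pace requests client-side so a burst never exceeds the per-second slice of your per-minute limit. This is a minimal sketch, not an SDK feature; the Tier 1 RPM figure is taken from the table below:

```python
import time

class RequestPacer:
    """Spread requests evenly so bursts never exceed the
    per-second slice of a per-minute (RPM) limit."""

    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm  # seconds between requests
        self.last_request = 0.0

    def wait(self):
        """Sleep just long enough to respect the pacing interval."""
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

pacer = RequestPacer(rpm=500)  # Tier 1 limit
# for prompt in prompts:
#     pacer.wait()
#     ...send the API request...
```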
Trap 2: Token Limits Count Input AND Output
Your TPM (tokens per minute) budget gets charged for both your prompt and the model’s response. Community testing shows that a 5,000-token prompt with a 100-token answer consumes 5,100 tokens – not 100.
Developers optimize output size and forget that the real cost is in the context they’re sending. If you’re passing entire documents into the API on every request, your input tokens dominate your bill and your rate limit.
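You can sanity-check this before sending. The sketch below uses a rough four-characters-per-token heuristic instead of a real tokenizer (use tiktoken for exact counts); it only illustrates how badly input dominates when you paste documents into prompts:

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: ~4 characters per token for English text.
    Use tiktoken for exact counts before relying on this."""
    return max(1, len(text) // 4)

document = "word " * 4000                  # a large pasted document
prompt = f"Summarize this:\n{document}"
expected_output = 100                      # tokens you asked for

input_tokens = rough_token_count(prompt)
share = input_tokens / (input_tokens + expected_output)
print(f"Input ~{input_tokens} tokens, output ~{expected_output}")
print(f"Input share of TPM budget: {share:.0%}")
```

Against a 40,000 TPM limit, a prompt like this burns through your minute's budget in a handful of requests regardless of how short the answers are.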
Trap 3: The Free Tier Is a Mirage
As of early 2026, free tier API accounts are capped at 3 RPM and 40,000 TPM. That’s three requests per minute. You can’t build anything real on that. You can barely run a demo.
You need to add at least $5 to your account to enable the Pay-As-You-Go tier. This isn’t documented loudly – it’s buried in community posts and support threads.
| Tier | RPM | TPM | Cost to enable |
|---|---|---|---|
| Free | 3 | 40,000 | $0 |
| Pay-As-You-Go (Tier 1) | 500 | 200,000 | $5 minimum deposit |
| Tier 2 | 5,000 | 2,000,000 | $50+ spent |
Hitting a rate limit returns a RateLimitError. The correct response is exponential backoff: wait, then retry with increasing delays.
```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def call_with_backoff(prompt, retries=3):
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt < retries - 1:
                wait_time = 2 ** attempt  # 1s, 2s, 4s
                print(f"Rate limit hit. Retrying in {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

result = call_with_backoff("What is Python?")
print(result)
```
This is production-ready error handling. The first retry waits 1 second. The second waits 2. The third waits 4. Most transient rate limit spikes clear within this window.
Managing Conversation History (Without Blowing Your Token Budget)
The ChatGPT API has no memory. If you want a multi-turn conversation, you send the full history on every request:
```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("You: ")
    if user_input.lower() == "quit":
        break
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    assistant_reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_reply})
    print(f"Assistant: {assistant_reply}")
```
This works until your conversation history grows. After 20 turns, you’re sending thousands of tokens on every request – most of which the model doesn’t need.
Sliding window fix: keep only the last N messages.
```python
MAX_HISTORY = 10  # Keep only the last 10 messages
messages = messages[-MAX_HISTORY:]
```
This truncates history. The model loses context from earlier in the conversation, but you stop wasting tokens.
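One caveat: a naive slice also drops the system message once the list outgrows the window. Here's one way to keep it (a sketch, not the only approach):

```python
def trim_history(messages, max_history=10):
    """Keep the system message plus the last max_history other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_history:]

# Simulated long conversation:
messages = [{"role": "system", "content": "You are a helpful assistant."}]
messages += [{"role": "user", "content": f"question {i}"} for i in range(30)]

messages = trim_history(messages)
print(len(messages))  # → 11: the system message plus the last 10
```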
Streaming Responses (When You Need Real-Time Output)
By default, the API waits until the entire response is generated, then returns it. For long answers, this feels slow. Streaming sends tokens as they’re generated:
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a 200-word essay on AI."}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
The stream=True parameter changes the return type. Instead of a single response object, you get an iterator. Each chunk contains a delta – the next piece of text. You print it immediately.
This is how ChatGPT’s web interface works. It feels faster because you see output before the model finishes.
Why Pricing Tiers Matter More Than Token Costs
Everyone obsesses over per-token pricing. GPT-4o costs $2.50 per million input tokens. GPT-4o-mini costs $0.15. That’s a 16x difference.
But pricing tiers determine your rate limits, and rate limits determine whether your app stays online under load. If you’re on Tier 1 (the $5 minimum deposit tier), you’re capped at 500 RPM. A single user refreshing a page 10 times in a minute eats 2% of your capacity.
Tier 2 unlocks at $50 spent. You get 5,000 RPM – 10x more headroom. Tier 3 requires $500 spent and gives you 10,000 RPM.
The math changes: paying for a higher tier isn’t about token discounts. It’s about buying the ability to handle traffic.
Common Error Patterns (And What They Actually Mean)
- 401 Unauthorized: Your API key is wrong, revoked, or not set. Check `echo $OPENAI_API_KEY` in your terminal.
- 429 Rate Limit: You hit RPM, TPM, or the quantized per-second limit. Implement backoff.
- APIConnectionError: Network issue. Check your firewall, proxy settings, or internet connection.
- APITimeoutError: The request took too long. The default timeout is 600 seconds. Retry or increase the timeout on the client.
- 500 Internal Server Error: OpenAI’s side. Retry with exponential backoff. These are usually transient.
According to the OpenAI error codes documentation, 500 errors should always trigger a retry. 401 errors should not – your key won’t become valid by retrying.
Next Step: Write a Function That Handles All of This
Here’s the production-ready wrapper that handles errors, retries, and token limits:
```python
import os
import time

from openai import OpenAI, AuthenticationError, RateLimitError, APIError

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def chat_completion(prompt, model="gpt-4o-mini", max_tokens=500, retries=3):
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=max_tokens
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt < retries - 1:
                wait_time = 2 ** attempt
                print(f"Rate limit. Retrying in {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
        except AuthenticationError:
            print("Authentication failed. Check your API key.")
            return None
        except APIError as e:
            print(f"API error: {e}. Retrying...")
            if attempt < retries - 1:
                time.sleep(2)
            else:
                raise

# Use it:
result = chat_completion("Explain recursion in 30 words")
if result:
    print(result)
```
Copy this. Modify it. Build on it. This is the baseline every ChatGPT API integration should start from.
FAQ
Can I use the old openai.ChatCompletion.create() syntax if I install version 0.28.1?
Yes, but you shouldn’t. Run pip install openai==0.28.1 and the old syntax works, but you lose access to new models, features, and bug fixes. OpenAI’s v1.x SDK has been the standard since November 2023. Pinning to 0.28.1 is technical debt you’ll regret six months from now when you need a feature that only exists in v1.x.
How do I know how many tokens my prompt uses before I send it?
Use the tiktoken library – OpenAI’s official tokenizer. Install it with `pip install tiktoken`, then count tokens before you send:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Your prompt here")
print(len(tokens))
```

This gives you the exact token count before you call the API. You can use this to trim prompts that exceed your budget or to estimate costs in advance. The response object also includes response.usage.total_tokens after the call, but by then you’ve already been charged.
What happens if I hit my rate limit in the middle of a user request?
Your code raises a RateLimitError, and if you don’t catch it, your app crashes. The API doesn’t queue your request – it just rejects it with a 429 status code. This is why exponential backoff is critical. Without it, every burst of traffic kills your app. In production, you’d wrap API calls in a retry decorator or use a task queue (like Celery) that automatically retries with backoff. Some developers also implement client-side rate limiting – tracking how many requests they’ve sent and sleeping before hitting the server’s limit. That prevents 429 errors entirely, but it requires you to know your tier’s RPM and TPM caps in advance and manage state across requests.