
Build a Data Pipeline with AI: What They Don’t Tell You

AI coding tools can generate ETL code in seconds - but they'll also hardcode secrets, skip authentication, and miss edge cases that break production. Here's how to actually use them.

12 min read · Advanced

You’ve probably tried asking ChatGPT or Copilot to “write a Python script that pulls data from an API and loads it into Postgres.” And it probably worked – once. Then production happened.

The date formats broke because your CRM uses YYYY-MM-DD while your payment processor sends MM/DD/YY. Half your records have missing email fields. Your fraud model learned that null emails mean legitimate transactions, except that’s not true – it’s just broken instrumentation.

Here’s what nobody mentions in those “build a data pipeline with AI in 10 minutes” tutorials: AI tools will generate working code, but they’ll also hardcode your database password, skip input validation entirely, and assume your data is clean when it never is.

This isn’t a tutorial about tools. It’s about what breaks when you hand pipeline work to AI, and how to use these tools without creating a security disaster or a production failure three weeks from now.

What AI Gets Wrong About Data Pipelines (And Why It Matters)

The problem isn’t that AI can’t write code. Studies show AI reduces coding time by 40-55% for repetitive tasks. The problem is what it omits.

When you prompt GitHub Copilot or Claude with “create a pipeline to sync customer data,” the generated code will connect to your database, extract records, maybe transform a column or two, and write the output somewhere. It looks fine. But:

  • No authentication. Research on LLM-generated web code found that prompts lacking security language result in endpoints with no auth checks, hardcoded API keys, or unrestricted database access.
  • Zero input validation. Missing sanitization is the #1 flaw in AI-generated code across every model and language, per academic security analysis. Your pipeline will happily ingest malformed dates, SQL injection payloads, or null values that crash downstream models.
  • Brittle error handling. AI defaults to basic try-catch blocks that log errors but don’t recover. When your data provider changes their file format at 2 AM, your pipeline dies silently.
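The difference is concrete. Here is a minimal sketch of the retry logic AI rarely emits unprompted (the backoff parameters and exception type are illustrative, not canonical):

```python
import time
import random

def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0):
    """Retry a flaky call with exponential backoff instead of
    crashing on the first transient failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error, don't swallow it
            # Exponential backoff with jitter to avoid hammering the API
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Simulated flaky API: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "ok"

result = fetch_with_retry(flaky, base_delay=0.01)  # → "ok" after two retries
```

A basic try-catch logs the first failure and dies; this version recovers from transient errors and only raises once retries are exhausted.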

Real pipelines don’t fail from bad models. They fail from schema drift, inconsistent formats, and missing fields that AI-generated code never anticipates.

That’s the gap. AI writes the happy path. You have to build everything else.

Choosing Your AI Tool: Cursor vs. Claude vs. Copilot for Pipeline Work

Not all AI coding tools handle data work the same way. Here’s what actually matters when you’re building ETL pipelines, not web apps.

GitHub Copilot: Best for dbt and Inline SQL

Copilot shines when you’re working inside dbt projects, Jupyter notebooks, and data pipeline code. It picks up your project’s patterns – column naming conventions, common aggregations, join styles – and suggestions improve the longer you use it in a codebase.

Pricing: $10/month for Pro (unlimited completions, 300 premium requests), $19/user/month for Business (adds IP indemnity and audit logs). As of March 2026, Copilot supports multi-model routing – you can back it with Claude or GPT-4o.

Where it fails: Large file context. Copilot works with up to 128K tokens. If you paste a 2,800-line API file and ask it to add rate limiting that matches your existing middleware patterns, it’ll drift from your conventions. Claude handles this better.

Claude (Anthropic): Best for Large Context and Refactoring

Claude Pro supports 200K tokens (1M in beta for Sonnet 4.6). You can paste your entire authentication module, data models, utility functions, and still have room for a detailed prompt. For refactoring sprawling dbt models or writing pandas pipelines, Claude’s explanations are genuinely educational.

Pricing: $20/month for Pro. The catch: usage limits. Developers report hitting Claude’s message caps faster than ChatGPT’s, especially when working with large contexts. You’ll find yourself switching to ChatGPT mid-project to avoid the limit.

Where it fails: No code execution. Unlike ChatGPT’s Advanced Data Analysis, Claude can’t run Python in-session to test transformations on your CSV.

Cursor: Best for Multi-File Refactoring

Cursor is VS Code with AI baked in at a deeper level. The “Composer” feature lets you describe a transformation in plain English and have Cursor write the entire dbt model, test, and schema.yml across multiple files.

Pricing: $20/month (individual), with enterprise plans available. Cursor is used by over 50% of Fortune 500 tech companies and hit $500M annual revenue within two years of launch.

Where it fails: Data science workflows. Cursor isn’t a notebook. If your team is addicted to Jupyter (most are), the lack of cell-by-cell execution and inline visualization is a dealbreaker.

| Tool | Best For | Context Window | Price | Key Limitation |
| --- | --- | --- | --- | --- |
| GitHub Copilot | dbt, SQL, inline coding | 128K tokens | $10/mo Pro | Smaller context, drifts on large files |
| Claude Pro | Large context, refactoring | 200K (1M beta) | $20/mo | Usage limits, no code execution |
| Cursor | Multi-file edits, pipelines | Varies by model | $20/mo | Not a notebook, steeper learning curve |

A Real Pipeline: What AI Writes vs. What You Actually Need

Let’s build a pipeline that syncs payment data from Stripe to BigQuery for a fraud detection model. This is the scenario competitors never show: messy real-world data with edge cases AI won’t anticipate.

The Prompt (What You Ask)

Write a Python script that:
1. Pulls yesterday's payment transactions from Stripe API
2. Transforms the data: convert amounts to USD, parse timestamps to YYYY-MM-DD HH:MM:SS
3. Load into BigQuery table 'payments_raw'
4. Log any errors to a file

What AI Generates (GitHub Copilot, Claude, or ChatGPT)

import stripe
from google.cloud import bigquery
from datetime import datetime, timedelta

stripe.api_key = "sk_live_YOUR_KEY_HERE"  # Hardcoded secret
client = bigquery.Client()

def fetch_payments():
    yesterday = datetime.now() - timedelta(days=1)
    charges = stripe.Charge.list(created={'gte': int(yesterday.timestamp())})

    rows = []
    for charge in charges.data:
        rows.append({
            'id': charge.id,
            'amount': charge.amount / 100,  # Assumes USD
            'created': datetime.fromtimestamp(charge.created).strftime('%Y-%m-%d %H:%M:%S')
        })

    table_id = "project.dataset.payments_raw"
    client.insert_rows_json(table_id, rows)
    print(f"Loaded {len(rows)} rows")

fetch_payments()

This code runs. It will even work in a demo. Here’s what it’s missing:

  • Hardcoded API key. Production secret in source code.
  • No error handling for API failures. If Stripe rate-limits you or the network drops, the script crashes.
  • Assumes all amounts are USD. Stripe returns amounts in the currency of the charge. A €50 charge becomes 5000 cents, which your code divides by 100 and stores as 50 USD. Your fraud model now thinks European transactions are 10x cheaper.
  • Missing field validation. What if charge.created is null? What if the BigQuery insert fails halfway through?

What You Actually Need

import os
import stripe
from google.cloud import bigquery
from datetime import datetime, timedelta
import logging

# WARNING level so the validation warnings below actually get written
logging.basicConfig(filename='pipeline_errors.log', level=logging.WARNING)
stripe.api_key = os.getenv('STRIPE_API_KEY')  # Secret from env
client = bigquery.Client()

def fetch_payments():
    try:
        yesterday = datetime.now() - timedelta(days=1)
        charges = stripe.Charge.list(created={'gte': int(yesterday.timestamp())})
    except stripe.error.RateLimitError as e:
        logging.error(f"Stripe rate limit: {e}")
        return
    except Exception as e:
        logging.error(f"Stripe API error: {e}")
        return

    rows = []
    for charge in charges.data:
        # Validate required fields (amount can legitimately be 0, so check for None)
        if not charge.id or charge.amount is None or not charge.created:
            logging.warning(f"Missing data in charge {charge.id}")
            continue

        # Convert to USD if needed
        amount_usd = charge.amount / 100 if charge.currency == 'usd' else convert_currency(charge.amount, charge.currency)

        rows.append({
            'id': charge.id,
            'amount_usd': amount_usd,
            'currency': charge.currency,
            'created': datetime.fromtimestamp(charge.created).strftime('%Y-%m-%d %H:%M:%S')
        })

    if rows:
        table_id = "project.dataset.payments_raw"
        errors = client.insert_rows_json(table_id, rows)
        if errors:
            logging.error(f"BigQuery insert errors: {errors}")
        else:
            print(f"Loaded {len(rows)} rows")

def convert_currency(amount, currency):
    # Placeholder: call exchange rate API
    return amount / 100  # Simplified

fetch_payments()

Notice what changed: secrets from environment variables, explicit error handling for API failures, currency conversion logic, field validation before insert, and structured logging.

AI won’t write this version unless you prompt for every single edge case. Even then, it might skip the currency conversion because that requires domain knowledge AI doesn’t have.
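Currency handling is a good example of that missing domain knowledge. Here is a hedged sketch of what convert_currency actually has to account for: Stripe reports amounts in each currency's smallest unit, and zero-decimal currencies such as JPY are not divided by 100. The rates dict stands in for a real exchange-rate API.

```python
# Zero-decimal currencies (per Stripe's docs) arrive as whole units, not cents
ZERO_DECIMAL = {"jpy", "krw", "vnd", "clp"}

def convert_currency(amount, currency, rates):
    """Convert a Stripe amount (smallest currency unit) to USD.
    `rates` maps currency code -> USD per major unit; in production
    it would come from an exchange-rate service, not a literal dict."""
    units = amount if currency in ZERO_DECIMAL else amount / 100
    return round(units * rates[currency], 2)

convert_currency(5000, "eur", {"eur": 1.08})    # €50.00 → 54.0 USD
convert_currency(5000, "jpy", {"jpy": 0.0067})  # ¥5000  → 33.5 USD
```

Note the failure mode this prevents: without the zero-decimal check, a ¥5000 charge would be recorded as ¥50.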

When AI Breaks: Schema Drift, Missing Fields, and Self-Healing Pipelines

Here’s where AI assistance gets interesting – and expensive.

The Problem: Pipelines Break in Production, Not in Code

Your data provider changes their CSV delimiter from commas to pipes. Overnight. Your pandas script crashes because read_csv() defaults to commas. In a traditional setup, you get paged at 3 AM.

Self-healing pipelines use a small LLM to diagnose and fix errors automatically. One data scientist built this and reported it “shielded me from at least three late-night wake-up calls this month.”

How it works: the script catches the error, sends the exception and first 4 lines of the file to an LLM (GPT-4o or Llama-3 via Ollama), and retries with the LLM’s suggested fix. It worked for delimiter errors, date format issues, and missing columns.
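That loop can be sketched in a few lines. Here, diagnose stands in for the LLM call (GPT-4o, or a local Llama-3 via Ollama); it is stubbed so the retry mechanics are visible and testable:

```python
import io
import pandas as pd

def self_healing_read(text, diagnose, max_attempts=2):
    """Try to parse a CSV; on failure, ask an LLM (via `diagnose`)
    for corrected read_csv kwargs and retry."""
    kwargs = {}
    for _ in range(max_attempts):
        try:
            df = pd.read_csv(io.StringIO(text), **kwargs)
            if df.shape[1] == 1:  # one column usually means a wrong delimiter
                raise ValueError("only one column parsed; delimiter likely wrong")
            return df
        except Exception as exc:
            # Send only the error and the first 4 lines: never whole rows or PII
            head = "\n".join(text.splitlines()[:4])
            kwargs = diagnose(str(exc), head)
    raise RuntimeError("could not self-heal; paging a human")

# Stub standing in for the LLM call
def fake_diagnose(error, head):
    return {"sep": "|"} if "|" in head else {}

df = self_healing_read("id|amount\n1|9.99\n2|4.50", fake_diagnose)  # parses on retry
```

Swapping fake_diagnose for a real model call is the easy part; the hard part is what follows below, capping what those calls cost.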

The Cost Trap

Every error triggers an API call. Fine for a few files. Not fine when a bad deployment breaks 100,000 files at once. You wake up to a “very nasty surprise on your OpenAI bill.”

The fix: implement a circuit breaker (stop after N consecutive failures) or switch to a local model like Llama-3 to avoid API costs. And never send sensitive data to the LLM – one study only sent the first 4 lines of each file, nothing with PII.
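A circuit breaker is a few lines of state (a sketch; the threshold is arbitrary):

```python
class CircuitBreaker:
    """Stop calling the LLM after N consecutive failures, so a bad
    deployment breaking 100,000 files costs N API calls, not 100,000."""
    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def record(self, success):
        # Any success resets the count; failures must be consecutive to trip
        self.failures = 0 if success else self.failures + 1

breaker = CircuitBreaker(max_failures=3)
skipped = 0
for outcome in [False, False, False, False, False]:  # five failing files
    if breaker.open:
        skipped += 1   # stop paying for LLM calls; queue for human review
        continue
    breaker.record(outcome)  # would wrap the LLM repair attempt
```

With the threshold at 3, the last two files are skipped instead of triggering two more API calls.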

Using dbt Copilot for Data Transformation (The Right Way)

dbt Copilot auto-generates documentation, tests, and semantic models using your project’s metadata – not your actual data. It never accesses row-level warehouse data.

Where it works: clean, well-documented schemas. If your dbt models already follow naming conventions and have column descriptions, Copilot can draft tests and docs in seconds.

Where it fails: legacy systems with inconsistent schemas, missing metadata, or tables named temp_final_v2_backup. Garbage in, garbage out.

Pricing: Available on Starter, Enterprise, and Enterprise+ dbt Cloud plans. You can bring your own OpenAI API key to control costs.

What Breaks When You Scale: Security, Governance, and Audit Trails

Small projects tolerate hardcoded secrets and missing tests. Production data pipelines feeding ML models or financial reports do not.

  • IP indemnity. GitHub Copilot Business ($19/user/month) includes legal protection if AI-generated code accidentally reproduces training data. Free tools don’t.
  • Audit logs. Enterprise plans (Copilot Business, Cursor Enterprise) track who generated what code and when. Required for SOC 2 compliance.
  • Data governance. Tools like Airbyte (600+ connectors) and Snowflake Cortex Code CLI support dbt and Airflow workflows with built-in lineage tracking and schema validation.

If you’re building pipelines for a regulated industry (finance, healthcare), free-tier AI tools are a non-starter. You need enterprise controls, data residency guarantees, and the ability to disable telemetry.

Common Mistakes (And How to Avoid Them)

Trusting AI-generated code without testing it on real data. Run it on a sample of production data – not the 10-row CSV from the docs.

Skipping security prompts. Always add “use environment variables for secrets, validate all inputs, implement error handling” to your prompt. AI won’t do this by default.
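One way to make that habit stick is a reusable prompt prefix (the wording here is a suggestion, not a magic incantation):

```python
# Hypothetical prompt preamble prepended to every code-generation request
SECURITY_PREAMBLE = (
    "Use environment variables for all secrets. "
    "Validate every input field before processing. "
    "Handle API errors with retries and structured logging. "
)

def build_prompt(task: str) -> str:
    return SECURITY_PREAMBLE + task

prompt = build_prompt("Write a script that syncs Stripe charges to BigQuery.")
```

Keeping the preamble in one place means every pipeline prompt on the team starts from the same security baseline.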

Ignoring usage limits. Claude caps messages per hour. If you’re iterating on a complex pipeline, you’ll hit the limit. Have a backup tool (ChatGPT or Copilot) ready.

When NOT to Use AI for Data Pipelines

Sometimes the old way is faster.

  • Highly regulated data. If your pipeline touches PII, PHI, or financial records, you need human review of every line. AI speeds up drafting, but a human must validate.
  • Critical infrastructure. Pipelines feeding real-time fraud detection or payment processing should not rely on AI-generated error handling. Write that yourself.
  • Learning data engineering. If you don’t understand how ETL works, AI will generate code you can’t debug. Learn the fundamentals first, then use AI to speed up repetitive tasks.

Next Step: Build One Small Thing

Don’t try to rebuild your entire data stack with AI. Start with one annoying task: documenting an existing dbt model, writing tests for a brittle Python script, or refactoring a 500-line SQL query into modular CTEs.

Use Cursor if you’re refactoring multi-file projects. Use Claude if you need to paste 2,000 lines of context. Use Copilot if you live in VS Code and work with dbt daily.

Then check the output: Did it validate inputs? Did it handle errors? Did it hardcode any secrets? Fix what’s broken. Ship it. Repeat.

The goal isn’t to let AI write your pipelines. The goal is to use AI to handle the boring parts – boilerplate, documentation, repetitive transformations – so you can focus on the edge cases, the business logic, and the stuff that breaks at 2 AM when data formats change.

FAQ

Can AI fully automate data pipeline creation?

No. AI generates working code for the happy path, but it skips input validation, error handling, and edge cases by default. You’ll spend as much time fixing AI output as you would writing it yourself unless you structure your prompts to explicitly require security, validation, and error handling. Think of AI as a very fast junior engineer who needs supervision.

Which AI tool is best for dbt and SQL work?

GitHub Copilot ($10/month Pro) works best for dbt projects because it learns your column naming conventions, join patterns, and aggregation styles the longer you use it in a codebase. The Copilot Chat panel lets you ask questions scoped to your open files. If you need larger context (entire dbt projects), use Claude Pro ($20/month) with its 200K token window, but watch for usage limits. For multi-file refactoring across a dbt project, Cursor ($20/month) with its Composer feature is the strongest option, though it’s not a notebook environment.

How do I prevent AI from hardcoding secrets in pipeline code?

Always include “use environment variables for all API keys and database credentials” in your initial prompt. AI defaults to hardcoding secrets unless told otherwise. After code generation, grep for telltale fragments such as api_key = "sk_, password = ", and token = " (a credential name assigned a string literal). If you find any, reject the code and re-prompt with explicit security requirements. For production pipelines, use a secrets manager (AWS Secrets Manager, Google Secret Manager) and reference secrets at runtime, never in source code. GitHub Copilot Business ($19/user/month) offers better security defaults than free tools, but manual review is still required.
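That grep step can be automated. Here is a minimal scan along those lines; the regexes are illustrative and far from exhaustive (dedicated scanners like gitleaks or truffleHog cover far more):

```python
import re

# Illustrative patterns only; real secret scanners cover many more formats
SECRET_PATTERNS = [
    re.compile(r'api_key\s*=\s*["\']sk_'),                    # Stripe-style live keys
    re.compile(r'(password|token)\s*=\s*["\'][^"\']+["\']'),  # literal credentials
]

def find_hardcoded_secrets(source):
    """Return (line number, line) pairs that look like hardcoded secrets."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(pat.search(line) for pat in SECRET_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits

code = 'stripe.api_key = "sk_live_abc"\ntable = "payments_raw"'
find_hardcoded_secrets(code)  # flags line 1 only
```

Wired into CI, a check like this rejects AI-generated code before it reaches review.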