You want to build a command-line tool. Not a web app, not a GUI – a CLI tool that does one thing well, takes arguments, returns output, and doesn’t break when someone passes it an empty string.
The old way: spend hours reading argparse docs, write boilerplate, debug flag parsing, test edge cases manually.
The new way? Describe what you want, let an AI coding assistant scaffold it, then fix the three things it always gets wrong.
Why CLI tools? Why now?
CLI tools are composable. You pipe them together, script them, run them in CI, automate them with cron. A well-designed CLI beats a GUI for speed, repeatability, and integration.
Python is still the best language for this. Readable syntax, rich ecosystem, and three solid CLI frameworks: argparse (built-in, verbose), Click (decorator-based, mature), and Typer (type-hint-driven, modern).
2026: AI coding tools can now generate the CLI structure faster than you can read the docs. But they can’t think through your edge cases, error messages, or business logic. That gap? That’s what you’re here to figure out.
Pick your AI coding tool
Three tools dominate CLI generation: Claude Code, Cursor, and GitHub Copilot. They’re not interchangeable.
Claude Code runs in your terminal. $20/month (as of April 2026). Achieves 80.8% on SWE-bench Verified – highest among these three. Agentic mode: you describe a task, it plans, executes, iterates. The catch? Each agentic task burns 3-8 API calls. Simple CLI feature: $0.12 in tokens. Complex one with debugging? $0.48-0.72. Building 10 CLI tools a month? Do the math.
Cursor is a VS Code fork, also $20/month. Not terminal-native – it’s an IDE with AI baked in. Composer mode handles multi-file changes, supports Claude/GPT/Gemini models. Like visual editing and inline suggestions? Cursor feels familiar. Live in the terminal? Friction.
GitHub Copilot costs $10/month (half the price), works in VS Code, JetBrains, Neovim. Free tier: 2000 completions + 50 chat messages (as of April 2026). Most accessible for students and hobbyists. Tradeoff? Trails Claude and Cursor on complex reasoning. For straightforward CLI code, fine. For multi-step logic with error handling, you’ll be fixing more.
Free options exist. Gemini CLI is free with 6000 code requests daily, but community feedback: “free, but you may waste time and gain little value.” Aider is open-source, git-centric, supports any LLM – best choice if you want model flexibility without a subscription.
For this guide: Claude Code (terminal) and Cursor (IDE) examples. The prompts work with any tool.
The workflow: scaffolding a CLI tool with AI
Scenario: you want a CLI tool that reads a CSV file, filters rows by a column value, outputs the result. Handles missing files, invalid columns, empty inputs gracefully.
Don’t ask AI to “build a CLI tool.” Too vague. AI needs constraints.
Build a Python CLI tool using Click that:
- Takes a CSV file path as a required argument
- Takes a column name and filter value as options
- Reads the CSV, filters rows where column == value
- Outputs filtered CSV to stdout
- Handles: file not found, invalid column name, empty CSV
- Includes --help text
- Uses proper exit codes (0 for success, 1 for errors)
This prompt specifies the framework (Click, not argparse), the exact behavior, and the error cases. AI will generate something close to working code.
Claude Code produces:
```python
import click
import csv
import sys


@click.command()
@click.argument('csv_file', type=click.Path(exists=True))
@click.option('--column', required=True, help='Column name to filter on')
@click.option('--value', required=True, help='Value to match')
def filter_csv(csv_file, column, value):
    """Filter CSV rows by column value."""
    try:
        with open(csv_file, 'r') as f:
            reader = csv.DictReader(f)
            rows = list(reader)

            if column not in reader.fieldnames:
                click.echo(f"Error: Column '{column}' not found", err=True)
                sys.exit(1)

            filtered = [row for row in rows if row[column] == value]

            if not filtered:
                click.echo("No matching rows", err=True)
                sys.exit(0)

            writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
            writer.writeheader()
            writer.writerows(filtered)
    except Exception as e:
        click.echo(f"Error: {e}", err=True)
        sys.exit(1)


if __name__ == '__main__':
    filter_csv()
```
Looks good. Click decorators, checks if file exists, validates column name, writes to stdout.
Three bugs AI didn’t catch.
The three things AI always gets wrong
1. Argument validation happens too late.
Code checks if the column exists after reading the entire CSV into memory. 10GB CSV? Problem. Validation should happen as soon as you read the header.
Fix:
```python
reader = csv.DictReader(f)
if column not in reader.fieldnames:
    click.echo(f"Error: Column '{column}' not found", err=True)
    sys.exit(1)

writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
writer.writeheader()
for row in reader:  # stream, don't load all rows into memory
    if row[column] == value:
        writer.writerow(row)
```
2. Error handling is too broad.
except Exception as e catches everything that inherits from Exception – a permissions error, a decoding failure, even a bug in your own code – and flattens them all into the same generic "Error: ..." message, hiding the real cause and the traceback.
Fix: catch the specific exceptions you expect (OSError, csv.Error), report them clearly, and let anything unexpected propagate so you get a full traceback.
3. Edge case: empty CSV file.
What if the CSV file is completely empty – not even a header row? reader.fieldnames comes back as None, and the column check crashes with a TypeError instead of a readable error.
AI-generated code has 3x more edge case bugs than human code. Most common category? Control-flow errors – exactly this kind of missing branch.
Fix: add a check after creating the reader:
```python
if not reader.fieldnames:
    click.echo("Error: CSV file is empty or malformed", err=True)
    sys.exit(1)
```
These aren’t obscure bugs. They’re predictable AI failures. Once you know the pattern, you catch them in 2 minutes.
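With all three fixes applied, the tool might look roughly like this – a sketch, not the only correct version:

```python
import csv
import sys

import click


@click.command()
@click.argument('csv_file', type=click.Path(exists=True))
@click.option('--column', required=True, help='Column name to filter on')
@click.option('--value', required=True, help='Value to match')
def filter_csv(csv_file, column, value):
    """Filter CSV rows by column value."""
    try:
        with open(csv_file, 'r', newline='') as f:
            reader = csv.DictReader(f)
            # Fix 3: an empty file has no header row, so fieldnames is None.
            if not reader.fieldnames:
                click.echo("Error: CSV file is empty or malformed", err=True)
                sys.exit(1)
            # Fix 1: validate the column before touching any data rows.
            if column not in reader.fieldnames:
                click.echo(f"Error: Column '{column}' not found", err=True)
                sys.exit(1)
            writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
            writer.writeheader()
            matched = False
            for row in reader:  # stream instead of loading everything into memory
                if row[column] == value:
                    writer.writerow(row)
                    matched = True
            if not matched:
                click.echo("No matching rows", err=True)
    # Fix 2: handle only the failures we expect; anything else gets a traceback.
    except (OSError, csv.Error) as e:
        click.echo(f"Error: {e}", err=True)
        sys.exit(1)


if __name__ == '__main__':
    filter_csv()
```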
Test it before you ship it
AI can generate tests, but it won’t know your edge cases. Use AI to scaffold, then add the cases it missed.
Pro tip: Ask AI to generate tests with this prompt: “Write pytest tests for this CLI tool. Include: valid inputs, missing file, invalid column, empty CSV, malformed CSV, and a case where no rows match the filter. Use Click’s CliRunner.”
AI will produce something like:
```python
from click.testing import CliRunner
import pytest
from filter_csv import filter_csv


def test_valid_filter(tmp_path):
    csv_file = tmp_path / "data.csv"
    csv_file.write_text("name,age\nAlice,30\nBob,25")
    runner = CliRunner()
    result = runner.invoke(filter_csv, [str(csv_file), '--column', 'name', '--value', 'Alice'])
    assert result.exit_code == 0
    assert 'Alice' in result.output
```
Good start. But 48% of AI-generated code contains security vulnerabilities (as of 2026 research). Add these tests manually:
- CSV with special characters (commas, quotes, newlines in values)
- Unicode column names
- CSV with 1 million rows (does it stream or crash?)
- Exit codes explicitly (0 vs 1)
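The first of these might look like the sketch below, assuming the same filter_csv module as the earlier tests:

```python
from click.testing import CliRunner
from filter_csv import filter_csv


def test_special_characters(tmp_path):
    # Values containing commas and embedded quotes must survive CSV quoting intact.
    csv_file = tmp_path / "data.csv"
    csv_file.write_text('name,quote\n"Smith, Jane","She said ""hi"""\nBob,plain\n')
    runner = CliRunner()
    result = runner.invoke(filter_csv, [str(csv_file), '--column', 'name', '--value', 'Smith, Jane'])
    assert result.exit_code == 0
    assert 'Smith, Jane' in result.output
```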
Property-based testing helps. Install hypothesis, let it generate random CSV inputs. Finds cases you didn’t think of.
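A minimal Hypothesis sketch, again assuming the filter_csv module from above – the property here is simply that random input never produces an unhandled crash:

```python
import csv

from click.testing import CliRunner
from hypothesis import given, strategies as st

from filter_csv import filter_csv  # assumed module name from the earlier example

# Random printable-ASCII cell values, including commas and quotes.
cell = st.text(alphabet=st.characters(min_codepoint=32, max_codepoint=126), max_size=20)


@given(rows=st.lists(st.tuples(cell, cell), max_size=50), value=cell)
def test_filter_never_crashes(rows, value):
    runner = CliRunner()
    with runner.isolated_filesystem():
        with open('data.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['name', 'age'])  # header
            writer.writerows(rows)            # random data rows
        result = runner.invoke(filter_csv, ['data.csv', '--column', 'name', '--value', value])
        # Clean exits (SystemExit) are fine; an unhandled exception is a bug.
        assert result.exception is None or isinstance(result.exception, SystemExit), result.output
```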
Why Click (or Typer) instead of argparse?
AI defaults to argparse because it’s in the standard library and appears in more training data. But Click has 38.7% adoption in Python CLI projects as of 2025, and Typer is built on Click for Python 3.8+.
| Framework | When to use it | Why AI suggests it |
|---|---|---|
| argparse | Zero dependencies, simple script | Standard library, lots of training examples |
| Click | Subcommands, file I/O, testing | Mature, well-documented, decorator syntax |
| Typer | Type hints, modern Python 3.8+ | Newer, less training data but cleaner code |
Don’t specify? AI gives you argparse. Want Click or Typer? Say so in the prompt.
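For comparison, here is roughly what the same command looks like in Typer – a sketch driven by type hints rather than decorator options:

```python
from pathlib import Path

import typer

app = typer.Typer()


@app.command()
def filter_csv(
    csv_file: Path = typer.Argument(..., exists=True, help="CSV file to filter"),
    column: str = typer.Option(..., help="Column name to filter on"),
    value: str = typer.Option(..., help="Value to match"),
):
    """Filter CSV rows by column value."""
    ...  # same streaming filter logic as the Click version


if __name__ == "__main__":
    app()
```

Typer builds the --help output from the signature and docstring, and since it runs on Click under the hood, the same testing approach carries over (typer.testing provides its own CliRunner).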
Deploy it
You’ve built the tool. You’ve tested it. Now what?
Option 1: Install it locally with pip install -e . (editable mode). Add a pyproject.toml:
```toml
[project]
name = "csv-filter"
version = "0.1.0"
dependencies = ["click"]

[project.scripts]
csv-filter = "filter_csv:filter_csv"
```
Now csv-filter is a command on your system.
Option 2: Package it for PyPI. AI can generate the full pyproject.toml and README. Prompt: “Generate a pyproject.toml for this CLI tool, ready for PyPI upload. Include license, author, and entry point.”
Option 3: Distribute as a single executable with PyInstaller. Useful for non-Python users.
What about non-Python CLI tools?
This guide focused on Python because it’s the most common AI-generated CLI language. But the workflow applies to any language:
- Rust: Use clap (specify it in your prompt)
- Go: Use cobra or urfave/cli
- Node.js: Use commander or yargs
Same three bugs appear: late validation, overly broad error handling, missing edge cases. Same fix: catch them manually and add tests.
The honest limitations
A March 2026 arXiv study analyzing Claude Code, Codex, and Gemini CLI found that API errors (18.3%), terminal problems (14%), and command failures (12.7%) are the most common bug symptoms.
What AI can’t do yet:
- Understand your business logic (you specify it)
- Predict your users’ weird inputs (you test them)
- Know which error messages are helpful vs cryptic (you review them)
- Decide between performance and simplicity (you choose)
AI accelerates the boilerplate. You still own the design.
What happens when your CLI tool grows?
You’ll add subcommands. You’ll need configuration files. You’ll want colorized output and progress bars.
At that point, AI becomes less useful for structure and more useful for iteration. Prompt: “Add a subcommand ‘validate’ that checks if the CSV is well-formed without filtering.” AI will scaffold it. You’ll fix the edge cases again.
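The structure that prompt produces is a command group – roughly the shape below. This is a sketch; the specific checks inside validate are assumptions for illustration, not what any particular tool will emit:

```python
import csv
import sys

import click


@click.group()
def cli():
    """CSV utilities."""


@cli.command(name='filter')
@click.argument('csv_file', type=click.Path(exists=True))
@click.option('--column', required=True, help='Column name to filter on')
@click.option('--value', required=True, help='Value to match')
def filter_rows(csv_file, column, value):
    """Filter CSV rows by column value."""
    ...  # the existing filter logic moves here


@cli.command()
@click.argument('csv_file', type=click.Path(exists=True))
def validate(csv_file):
    """Check that the CSV is well-formed without filtering it."""
    with open(csv_file, newline='') as f:
        reader = csv.DictReader(f)
        if not reader.fieldnames:
            click.echo("Invalid: no header row", err=True)
            sys.exit(1)
        for line_no, row in enumerate(reader, start=2):
            # DictReader files extra fields under the key None and fills missing fields with None.
            if None in row or None in row.values():
                click.echo(f"Invalid: row {line_no} has the wrong number of fields", err=True)
                sys.exit(1)
    click.echo("OK")


if __name__ == '__main__':
    cli()
```

Note that the pyproject entry point from earlier would then point at cli instead of filter_csv.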
The cycle repeats: generate, fix, test, ship.
FAQ
Can I use AI to generate the entire CLI tool from start to finish without writing code?
Technically yes. But you’ll ship bugs. AI generates code that looks correct and passes basic tests, but misses error paths, edge cases, and performance issues. Plan to review and test every line. If you’re learning, fine – AI shows you patterns faster than reading docs. Shipping to production? Budget 30-40% of your time for fixes and tests after AI generates the initial code.
Which AI tool is best for CLI development: Claude Code, Cursor, or Copilot?
Claude Code: you work in the terminal and need autonomous multi-step tasks. Cursor: you want an IDE experience with inline edits and visual feedback. Copilot: you’re budget-conscious ($10/month vs $20, as of April 2026) and writing straightforward CLI tools without complex logic. For experimentation, try Aider (open-source, supports any LLM) or Gemini CLI (free but limited). All three will make the same core mistakes – missing edge cases, overly broad error handling, late validation – so your testing strategy matters more than your tool choice.
How do I test AI-generated CLI code without writing every test case manually?
Use AI to scaffold the test suite (“Write pytest tests for this CLI tool covering valid inputs, errors, and edge cases”), then add the cases AI missed: Unicode inputs, very large files, malformed data, concurrent usage if applicable. Use property-based testing (Hypothesis for Python) to auto-generate random inputs – this finds bugs you didn’t anticipate. Set a higher coverage target for AI code (85-90% vs 70-80% for human code). Use Click’s CliRunner or Typer’s testing utilities to invoke your CLI in tests without subprocess overhead. Never trust AI to both generate and validate – generate tests with AI, execute them yourself, and manually add security and performance tests. Running the tests yourself also saves money: one AI debugging session burns through a free-tier quota fast.