AI won’t save your code migration. But it can cut the grind by half – if you know what it can’t do.
Reality check: Google’s AI toolkit wrote 80% of migration code, but as of January 2025 engineers still spent 50% of their time babysitting the output – fixing context window failures and debugging hallucinations that looked perfect until tests ran.
Tutorials treat AI migration tools like magic wands. They’re power tools – effective when you know the failure modes, dangerous otherwise.
The context window trap nobody mentions
File size. That’s what breaks first.
AWS notes that files over 700 lines need “careful human review.” Translation: the AI probably got it wrong. Even models with 200K token context windows choke on large files – not because they can’t fit them (they can), but because cross-file dependencies create context webs no single prompt captures. The model sees your code. Doesn’t see the six other files that make it work.
Google’s internal toolkit (January 2025) automatically expands the file list to include tests, interfaces, dependencies. Even then? Engineers reported context window limitations as a recurring blocker. One team migrating Delphi to C# had to manually chunk files – Claude’s context window couldn’t process them whole.
The workaround: break migrations into 50-100 file batches, not thousands. This isn’t a limitation you optimize away. It’s a constraint you design around.
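Designing around that constraint can be as simple as slicing the file list up front. A minimal sketch (the path-prefix grouping is a hypothetical heuristic – real batching should follow the dependency graph):

```python
def batch_files(files, batch_size=100):
    """Split a migration file list into reviewable batches.

    Sorting keeps files from the same directory adjacent, so each
    batch tends to share context. Batches stay small enough that a
    reviewer can actually read them.
    """
    ordered = sorted(files)  # group siblings by path prefix
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# 250 hypothetical files become three batches, not one giant prompt.
files = [f"src/module_{i}.py" for i in range(250)]
batches = batch_files(files, batch_size=100)
print(len(batches))  # 3
```

The point isn’t the ten lines of code – it’s that batch size becomes an explicit, tunable parameter of the migration instead of something the model silently truncates for you.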
Why hallucinations multiply in migrations
Code generation has no ground truth to deviate from. Code transformation does: the original behavior is the correct answer. That’s where hallucinations turn dangerous.
LLMs train on public code and docs. They understand patterns (“Python 2 print statements become functions in Python 3”). What they miss: tribal knowledge. That weird workaround your team added in 2019? Now lives in 47 files. The AI sees syntax. Doesn’t see intent.
Turns out AI models pattern-match but don’t understand WHY code behaves a certain way. That strange conditional buried in your legacy code might handle a regulatory requirement, a customer edge case, hardware quirks. Google reported their fine-tuned Gemini occasionally produced “irrelevant changes, unnecessary comments, or reformatted code without meaningful changes.” Syntactically perfect. Semantically useless.
Run characterization tests BEFORE migration. These capture current behavior as a safety net. If migrated code passes different tests than the original, you’ve changed functionality – whether the AI admits it or not.
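A characterization test looks like this – `legacy_tax` is a hypothetical stand-in for any function being migrated, quirks and all:

```python
# Characterization tests pin down what the legacy code does TODAY,
# right or wrong, before any migration touches it.

def legacy_tax(amount):
    # Quirk preserved on purpose: negative amounts return 0, not an error.
    # Maybe a 2019 workaround, maybe a regulatory requirement. Doesn't matter:
    # the migrated code must reproduce it until someone decides otherwise.
    if amount < 0:
        return 0
    return round(amount * 0.0825, 2)

def test_characterization():
    # Assert current behavior, including the quirk.
    assert legacy_tax(100) == 8.25
    assert legacy_tax(-5) == 0
    assert legacy_tax(19.99) == 1.65

test_characterization()
print("characterization baseline captured")
```

Point the same assertions at the migrated function. Any failure is a behavior change the AI made – whether or not it admits to one.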
The architecture debt problem
Spaghetti code in legacy? AI gives you spaghetti code in a new language.
An engineer who led a 20-year-old Delphi migration: “If the current system is spaghetti code, AI recreates spaghetti code in another language.” Teams think they’ll migrate first, refactor later. Wrong order. The migration IS the refactoring opportunity. Miss it? You’ve spent six months and $500K moving technical debt to a newer platform where it’s harder to fix.
Salesforce’s Apex-to-Java migration (completed in 4 months, as of 2024) worked because they didn’t just translate syntax – they defined transformation rules converting static classes into object-oriented service layers with dependency injection. The AI didn’t invent that architecture. Engineers specified it.
What Google’s assisted approach actually does
Google didn’t build a fully automated system. They built an assisted one.
Workflow: Expert engineer identifies what needs migrating using Code Search and custom scripts. AI toolkit runs autonomously, produces verified changes that pass unit tests. Engineer reviews, fixes where the model failed, ships. Key word: verified. The toolkit validates through five gates – completion check, whitespace-only filter, AST parsing, “punt check” where the LLM flags its own uncertainty, then build and test validation. Five gates before human review.
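The gate names come from the article; the implementation below is an illustrative sketch, not Google’s code. The `# LLM-PUNT` marker is a hypothetical convention for the model flagging its own uncertainty:

```python
import ast

PUNT_MARKER = "# LLM-PUNT"  # hypothetical marker the model emits when unsure

def validate_change(original: str, modified: str) -> str:
    """Run an AI-generated change through five gates before any human sees it."""
    # Gate 1: completion check -- the model actually produced output.
    if not modified.strip():
        return "rejected: empty output"
    # Gate 2: whitespace-only filter -- drop changes that alter nothing.
    if "".join(original.split()) == "".join(modified.split()):
        return "rejected: whitespace-only change"
    # Gate 3: AST parsing -- the result must at least be valid syntax.
    try:
        ast.parse(modified)
    except SyntaxError:
        return "rejected: does not parse"
    # Gate 4: punt check -- the model flagged its own uncertainty.
    if PUNT_MARKER in modified:
        return "rejected: model punted"
    # Gate 5: build and test validation would run here (omitted in this sketch).
    return "passed: ready for human review"

print(validate_change("x=1\n", "x = compute_total()\n"))
```

Each gate is cheap, and each one removes a class of garbage before it burns reviewer attention – which, as Google found, is the scarce resource.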
Result? Three developers performed 39 migrations over 12 months. AI wrote 80% of code modifications. Engineers reported 50% time savings vs. manual migration. Not 100% automation. Fifty percent – and they considered it a massive win.
Tools worth your time
Context awareness, validation hooks, failure modes. That’s what matters.
GitHub Copilot + app modernization extensions handle .NET and Java version upgrades with framework-specific transformations (as of 2024). Tightly integrated with Microsoft’s ecosystem: less configuration, more lock-in. Works best migrating to Azure.
OpenAI Codex (powered by a version of o3 optimized for software engineering, as of 2025) works autonomously for 7+ hours on complex refactors. The catch: cloud-based, runs in isolated containers, requires careful scoping. Best for well-defined tasks – not exploratory rewrites.
Anthropic Claude Code (especially Opus 4.5, 2025) maps dependencies across 2000+ lines, handles long-horizon coding. One team used it for COBOL modernization – documented workflows, spotted risks manual analysis missed. Smaller context window than competitors, but higher accuracy. According to one migration engineer, Claude had fewer hallucinations than GPT models but required file splitting.
Moderne combines deterministic automation with agent-assisted workflows. Built for enterprise-scale migrations across thousands of repositories (as of 2024). Higher upfront cost, but you’re paying for repeatability and governance – critical when code handles money or healthcare data.
Gitar mixes static analysis with LLMs. Generated 4,881 PRs in six months at Uber for API migrations, annotation processors, feature flag cleanup. The hybrid approach offsets each tool’s limitations: static analysis for precision, LLMs for adaptability.
Language support varies. All tools handle Java, Python, C++. Most struggle with Dart, proprietary DSLs, custom build tooling. Exotic stack? Budget for manual fallback.
Choosing tools feels like picking a framework. It’s not. You’re picking failure modes you can work around vs. ones that block you.
The real workflow: human-in-the-loop by design
Automated migration is a myth. Assisted migration is reality.
Think of AI tools like junior engineers. They need documentation, examples, feedback. Salesforce’s approach: break the codebase into dependency layers, migrate bottom-up (utilities first, workflows last), have AI generate transformation candidates that senior engineers validate before applying across the codebase.
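Bottom-up ordering falls straight out of a topological sort over the module dependency graph. A minimal sketch using Python’s standard library (the module names are hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each module lists what it depends on.
deps = {
    "utils": set(),
    "models": {"utils"},
    "services": {"utils", "models"},
    "workflows": {"services"},
}

# TopologicalSorter yields a module only after everything it depends on,
# which is exactly the bottom-up migration order: utilities first,
# workflows last.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['utils', 'models', 'services', 'workflows']
```

Migrating in this order means every batch the AI touches depends only on code that has already been migrated and validated – never on code still in the old language.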
This works because you’re encoding tribal knowledge into transformation rules. First pass: AI translates 100 files, you review 10. Find three patterns it got wrong. Update the rules. Second pass: AI translates 500 files using corrected rules, you spot-check 20. Accuracy improves each iteration.
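What a transformation rule looks like in practice: a reviewed correction, encoded once, applied everywhere. A toy sketch using regex rules for two real Python 2→3 changes (production rule engines like OpenRewrite work on syntax trees, not regexes – this is deliberately minimal):

```python
import re

# Each rule is a (pattern, replacement) pair encoding one reviewed fix.
# Wrong translations found during review become new or corrected rules.
RULES = [
    # Py2 print statement -> Py3 print function (string literals only)
    (re.compile(r'\bprint\s+"([^"]*)"'), r'print("\1")'),
    # Py2 xrange -> Py3 range
    (re.compile(r"\bxrange\("), "range("),
]

def apply_rules(source: str) -> str:
    """Apply every known rule to one file's source text."""
    for pattern, replacement in RULES:
        source = pattern.sub(replacement, source)
    return source

print(apply_rules('print "hello"'))  # print("hello")
```

The rule list is the artifact that accumulates tribal knowledge: pass one fixes the rules, pass two reuses them across ten times as many files.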
The bottleneck isn’t generation. It’s review. Google’s team purposely throttled output to avoid overwhelming reviewers. Too many AI-generated PRs? Humans stop reading carefully. Bugs ship.
When AI migration fails (and what to do instead)
Some migrations shouldn’t use AI.
Codebase lacks tests? Stop. Generate tests first (AI is actually good at this), then migrate. Architecture is tangled? Refactor locally before migrating globally. Team doesn’t understand the legacy system? Have AI document it before transforming it.
Anthropic published a code modernization playbook breaking COBOL migrations into phases: inventory and diagrams first, then technical report, target design, finally migration. Teams skip straight to migration and wonder why it fails.
Another failure mode: treating AI output as final. One logistics company migrating Java to Node.js used AI-driven test generation to simulate warehouse load scenarios. Migrated code passed functional tests but failed performance tests under realistic load. AI translated logic correctly – didn’t preserve performance characteristics.
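Catching that failure mode means gating on a performance budget, not just functional assertions. A minimal sketch – the 2x threshold is a hypothetical budget, and both implementations here are stand-ins:

```python
import time

def measure(fn, *args, runs=5):
    """Return the best-of-N wall-clock time for fn(*args)."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical legacy baseline and migrated candidate doing the same job.
data = list(range(50_000))

def legacy(xs):
    return sorted(xs)

def migrated(xs):
    return sorted(xs)  # candidate under test

baseline = measure(legacy, data)
candidate = measure(migrated, data)

# Performance gate: the migrated code may not be more than 2x slower
# than the legacy baseline (plus slack for timer noise).
assert candidate <= 2 * baseline + 1e-3, "performance regression"
print("within performance budget")
```

Record the legacy baseline before decommissioning the old system – once it’s gone, you have nothing to compare against.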
What the case studies actually show
Salesforce: 275 Apex classes, 3,537 total files, managed package to Core. Four months using AI-assisted refactoring (2024). Original estimate: two years manual. But they spent the first month defining transformation rules and dependency maps. AI didn’t do that. Engineers did.
Uber via Gitar: 4,881 PRs in six months, tens of millions in annual savings (as of 2024). Automated tools reduced time by 80% vs. manual rewrites. The work? API deprecations, annotation processors, feature flag cleanup. Well-scoped, repetitive tasks – not complex architectural changes.
Google: 5,359 files modified, 149K+ lines changed, three months (2025). Bottleneck: review speed, not generation. Engineers throttled AI output to keep reviews manageable. Context window limitations hit large files. Hallucinations produced irrelevant changes in ~10% of attempts.
AI shines on repetitive, well-defined transformations with clear correctness criteria (tests). Struggles with architectural decisions, performance optimization, business logic lacking explicit specification.
FAQ
Can AI tools migrate my entire codebase automatically?
No. Budget for 50% time savings, not full automation.
Which AI model is best for code migrations – GPT, Claude, or Gemini?
Depends on your codebase. Claude produces fewer hallucinations but has smaller context windows requiring file chunking (as of 2024). GPT-4 handles common languages well; GPT-5 hallucinates more. Google fine-tuned Gemini on internal code for 50% time savings. The model matters less than the validation pipeline. Example: Google’s five-gate checks (completion, whitespace, AST, punt, build/test) catch errors before human review. One team found Claude Opus 4.5 best for COBOL – it mapped dependencies across legacy systems and documented workflows manual analysis missed.
My legacy system has zero documentation and failing tests – can AI help or should I refactor first?
Generate tests and documentation BEFORE migrating. AI excels at writing characterization tests that capture current behavior, even for undocumented code. One team used Claude to document 20-year-old Delphi workflows, then migrated. Skipping documentation means AI replicates architectural debt – spaghetti code in a new language. If tests are broken, fix them first or you have no validation baseline. The migration isn’t the starting point; it’s the middle step.

A common misconception: teams think “we’ll document as we go.” That compounds risk. You’re debugging AI hallucinations AND discovering what the legacy code does simultaneously. Separate those problems. Document first (AI can help – give it code, ask for workflow diagrams and edge case lists). Then migrate with AI assistance. Then refine architecture. Three separate phases with distinct validation criteria at each gate.