Most teams ship code faster than they write tests for it. That’s not a discipline problem – writing tests is genuinely tedious, especially the boring boundary cases nobody wants to think about at 4pm on a Friday. So the question isn’t whether to use Copilot to write unit tests. It’s which Copilot entry point to use, and where it quietly fails. This guide covers both – with the trade-offs the official tutorials don’t touch.
The takeaway, upfront
C# project in Visual Studio 2026? Use the @Test agent. Everything else? Use the /tests slash command in Copilot Chat with your file pinned via #. The inline right-click “Generate Tests” is fine for a single pure function and nearly useless for anything more complex – it carries the least context and produces the shallowest output.
Now the why.
Three entry points, not one
Copilot exposes test generation through three interfaces, and they are not equivalent. Same underlying model. Completely different scaffolding around it, which changes the output more than you’d expect.
- Slash command (/tests) – type it in Copilot Chat and it implicitly pulls in the @workspace participant, giving the model a broader view of your project structure (as of mid-2025, per the GitHub blog).
- Inline action – right-click → Generate Code → Generate Tests. Drops tests into an existing test file, or creates a new one. No conversation, no corrections.
- @Test agent (.NET only) – generally available in Visual Studio 2026 v18.3, grounded in the C# compiler rather than just language pattern-matching. It builds the project. It fixes errors. It re-runs.
That last point matters more than any feature list makes it sound. The agent doesn’t just write tests – it closes the loop.
/tests vs @Test: the side-by-side nobody draws
| Capability | /tests (Chat) | @Test (.NET agent) |
|---|---|---|
| Languages | Any Copilot-supported language | C# only |
| Frameworks | Anything you ask for | MSTest, NUnit, xUnit |
| Determinism | Non-deterministic | Grounded in C# compiler semantics |
| Auto-build & fix | No | Yes |
| Test project creation | Manual | Automatic, separate project |
The “determinism” row is doing real work here. Two developers running the same /tests prompt on the same file will get different test suites – the official cookbook explicitly warns that Copilot Chat responses are non-deterministic. The @Test agent sidesteps this because the C# compiler is doing the semantic grounding – assertions come from type contracts, not from whatever the model felt like generating that morning.
For non-.NET projects, /tests wins by default. Embrace the non-determinism as a feature: run it twice, compare the two outputs, keep the tests that cover angles the other missed.
Getting useful output from /tests
Default /tests output is mediocre. Here’s the three-step fix.
1. Pin the file explicitly
Don’t assume Copilot is reading what you think it’s reading. Type /tests, press #, select the file. Sounds obvious, but it turns out workspace context is less reliable than advertised. A community-reported case (GitHub Discussions #160959) showed Copilot confidently reporting PrimeNG 13.4 as the installed version when package.json clearly stated 17.18. It had the file. It ignored the file. If it can do that with a version number, it can do it with the function signature you actually want tested.
2. State the framework, pattern, and what to skip
Vague prompts produce vague tests. Compare these two:
Bad: /tests #webhook.py
Better:
/tests #webhook.py
Generate unit tests for parse_webhook_payload using pytest.
Follow AAA. Cover: missing required fields, malformed JSON,
a valid payload with all optional fields present, and a
payload where the signature header is absent.
Do not write integration tests. Do not mock the HTTP layer.
The exclusions matter as much as the inclusions. Without “do not mock the HTTP layer,” Copilot will happily invent a requests mock for a function that never touches HTTP – because mocking looks like thoroughness to a language model.
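To calibrate expectations, here is the shape of test that prompt is asking for – a minimal pytest sketch, assuming a hypothetical parse_webhook_payload that raises ValueError on missing fields or malformed JSON:

```python
import json

import pytest


# Hypothetical stand-in for the function in webhook.py.
def parse_webhook_payload(raw: str) -> dict:
    payload = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if "event" not in payload or "id" not in payload:
        raise ValueError("missing required field")
    return payload


def test_missing_required_field_raises():
    # Arrange
    raw = json.dumps({"event": "push"})  # "id" absent
    # Act / Assert
    with pytest.raises(ValueError):
        parse_webhook_payload(raw)


def test_malformed_json_raises():
    with pytest.raises(ValueError):
        parse_webhook_payload("{not valid json")


def test_valid_payload_with_optional_fields():
    # Arrange
    raw = json.dumps({"event": "push", "id": 7, "signature": "abc"})
    # Act
    result = parse_webhook_payload(raw)
    # Assert
    assert result["id"] == 7
    assert result["signature"] == "abc"
```

If the generated suite skips any of the cases you listed, ask for them by name in a follow-up message – the back-and-forth is the advantage Chat has over the inline action.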
3. Save it as a prompt file
Most teams type the same testing instructions dozens of times before it clicks: this belongs in a file. Store a generate-unit-tests.prompt.md in .github/prompts/, invoke it with /generate-unit-tests in Chat, and optionally pass parameters like function_name=parse_webhook_payload and framework=pytest. Took us embarrassingly long to realize prompt files existed. They’ve been in the docs since before most people started asking Copilot to write tests.
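A minimal sketch of what .github/prompts/generate-unit-tests.prompt.md could contain – the front-matter keys and the ${input:...} variable syntax vary by editor and Copilot version, so check your setup before copying this verbatim:

```markdown
---
description: Generate unit tests following team conventions
---
Generate unit tests for ${input:function_name} using ${input:framework}.
Follow the Arrange-Act-Assert pattern.
Cover missing required fields, malformed input, and a fully valid input.
Do not write integration tests. Do not mock the HTTP layer.
```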
> The cheapest quality check: After accepting generated tests, make the function obviously wrong – flip a comparison operator, return None instead of the result. Re-run. If the tests still pass, the assertions are mirroring the implementation rather than checking a contract. Costs 30 seconds. Has caught more bugs in practice than any coverage report.
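A sketch of that check with a hypothetical apply_discount – the comment marks the one-character sabotage, and both tests below should fail once you make it:

```python
import pytest


def apply_discount(price: float, percent: float) -> float:
    if percent > 100:  # sabotage step: flip `>` to `<`, then re-run the suite
        raise ValueError("discount cannot exceed 100%")
    return price * (1 - percent / 100)


# Both tests should fail after the flip. Any test that keeps passing
# never checked the contract it claims to cover – rewrite it.
def test_rejects_discount_over_100():
    with pytest.raises(ValueError):
        apply_discount(50.0, 150.0)


def test_applies_percentage_discount():
    assert apply_discount(200.0, 25.0) == 150.0
```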
The model trade-off nobody mentions
GPT-4.1 or Sonnet? It’s not a preference question. There’s a measurable difference in what you get.
Schallermayer and Schnappinger’s 2026 study (published in Springer’s PROFES proceedings) tested Copilot across five real-world Java projects. GPT-4.1: highest compilation rate – tests that build on the first try, predictably. Sonnet variants (3.7, 4): more compile errors, but higher mutation coverage. Meaning they catch bugs that GPT-4.1’s safer tests walk right past.
So: GPT-4.1 for a fast baseline on a new feature. Sonnet when you’re hardening a payment flow and a failed build is cheaper than a missed edge case. The model picker is right there in the chat interface. Most developers never touch it.
Which raises a question worth sitting with: if the model changes what edge cases get tested, are we really testing the code – or are we testing whichever model happened to be selected? That’s not a knock on the tools. It’s just a different kind of variance than you get from a human writing tests, and it’s worth accounting for.
Three failure modes the docs don’t cover
Green tests, wrong logic
The one that should actually concern you. Copilot reads your function and writes assertions that match what your function does – not what it should do. Buggy function gets a buggy test. Both pass. The TestPilot empirical study (arXiv:2302.06527) showed this concretely: generated tests passed for a function that processed 3D point data when the spec required 2D-only handling. Green. Wrong.
The mutation-testing check from the blockquote above is the cheapest defense available.
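The pattern in miniature (a Python sketch of the same failure mode, not the study’s actual code): the spec says 2D points only, the implementation never checks, and a test generated from the implementation blesses the bug:

```python
import pytest


# Spec: accept 2D points only. Implementation: accepts anything (the bug).
def normalize_point(point):
    return tuple(float(c) for c in point)


# What a test derived from the implementation tends to assert – green, and wrong.
def test_normalize_accepts_3d_point():
    assert normalize_point((1, 2, 3)) == (1.0, 2.0, 3.0)


# The test the spec actually wanted. It fails against the code above,
# which is exactly the signal you need.
def test_normalize_rejects_3d_point():
    with pytest.raises(ValueError):
        normalize_point((1, 2, 3))
```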
The silent @Test fallback
Mixed-language solution in Visual Studio. You right-click a Python or TypeScript file. “Generate Tests” still appears in the menu. It just silently downgrades – the docs confirm that for non-C# projects, the same menu options route to a generic Copilot prompt with no compiler grounding, no auto-build, none of the @Test behavior. Same button. Completely different feature. The IDE gives you no indication this happened.
Coverage that lies
TestPilot hits 60-80% statement coverage on npm packages. Sounds good. Statement coverage just means a line executed – not that any assertion checked it. A test calling a function and asserting it doesn’t throw reaches 100% line coverage and catches zero bugs. Don’t use coverage as the quality bar for AI-generated tests. Mutation testing is the honest metric, painful as the runtime is.
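The gap fits in ten lines. This test executes every statement in a hypothetical median function – 100% statement coverage – and would never notice that the function is wrong for even-length inputs:

```python
def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]  # wrong for even-length lists


# 100% statement coverage, zero meaningful assertions: the only claim
# made here is "it didn't throw", so the bug above survives untouched.
def test_median_does_not_throw():
    median([3, 1, 2, 4])
```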
FAQ
Should I trust Copilot to test code it also wrote?
No. Same model, same blind spots – the thing that wrote the bug will write a test that doesn’t catch it. Either flip to TDD-style (generate tests before the implementation) or deliberately use a different model for the tests than you used for the code. This is the only structural defense that actually works.
Why do generated tests sometimes import functions that don’t exist?
Hallucinated imports almost always trace back to context. Copilot didn’t see the actual module, guessed based on naming patterns, and invented a plausible-sounding path. Fix: pin files with #filename instead of relying on @workspace inference. Monorepos with similarly-named modules are especially prone to this. If a specific symbol keeps getting hallucinated, open that file in a tab before invoking /tests – forces it into the active context window. Works about 90% of the time. The other 10% is a project structure that the workspace indexer genuinely can’t resolve, and the answer there is a more explicit prompt.
Integration tests – same workflow?
Mostly yes, but Copilot defaults hard to unit-test patterns. Without explicit instruction, mocks won’t appear. The GitHub docs suggest a prompt structure like: “Write integration tests for the deposit function. Use mocks to simulate NotificationSystem and verify it’s called correctly after a deposit.” The words “integration,” “mock,” and the named collaborator each carry weight. Leave one out and you get something shallower than you wanted.
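A sketch of what that prompt is steering toward, using unittest.mock – Account, deposit, and NotificationSystem are hypothetical names standing in for whatever your codebase actually calls them:

```python
from unittest.mock import Mock


# Hypothetical production code: deposit() notifies a collaborator on success.
class Account:
    def __init__(self, notifier):
        self.balance = 0
        self.notifier = notifier

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.balance += amount
        self.notifier.send(f"Deposited {amount}")


def test_deposit_notifies_notification_system():
    # Arrange: stand in for NotificationSystem with a mock
    notifier = Mock()
    account = Account(notifier)
    # Act
    account.deposit(100)
    # Assert: state changed and the collaborator was called correctly
    assert account.balance == 100
    notifier.send.assert_called_once_with("Deposited 100")
```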
Try this now
Find one function in your codebase with zero tests. Run /tests on it with a specific prompt – framework named, AAA stated, edge cases listed. Then sabotage the function: flip a comparison, return the wrong type. Re-run the tests. If any generated test misses your sabotage, rewrite it by hand. That one exercise tells you more about how Copilot-generated tests actually behave than any walkthrough can.