Production deployments produce false positives at rates between 2% and 15%. Each one burns analyst hours. A single mis-flagged batch in a high-volume pipeline can cost more than the validation itself.
Here’s the part no one talks about upfront.
The Real Question: Rule-Based or LLM Validation?
Two paths. Traditional rule-based validation (Great Expectations, Soda, statistical profiling) works by defining constraints – “this field can’t be null,” “this value must be between 0 and 100.” Fast, cheap, deterministic. Catches only what you explicitly coded.
LLM-based validation flips the model. Describe data quality requirements in plain language. Let the model judge whether records meet the criteria. Research on LLM-as-judge setups suggests this works because verifying a response for correctness is an easier task than generating a correct response in the first place.
The catch: you’re doubling API calls. One to generate or process, one to validate. Smaller models like GPT-4o-mini ($0.75 input / $4.50 output per million tokens as of April 2026, per OpenAI’s pricing) handle validation well enough that cost stays reasonable.
Most production systems use both. Rules filter structure (schema, types, ranges). LLMs handle semantic quality (does this address look real, is this product description coherent).
Method A: Statistical Profiling
Statistical profiling scans your data, calculates distributions (mean, median, standard deviation for numbers; value frequencies for categories), then flags deviations. Qualytics’ guide documents this approach – it auto-defines validation checks based on observed patterns.
Your order_total column historically averages $47 with a standard deviation of $12. Profiling sets a threshold at 3 standard deviations. Tomorrow’s average hits $85? Alert.
```python
# Note: PandasDataset is the pre-1.0 Great Expectations API;
# GX 1.0+ uses a context/batch interface instead.
from great_expectations.dataset import PandasDataset
import pandas as pd

df = pd.read_csv('orders.csv')
ge_df = PandasDataset(df)

# Auto-generated expectation from profiling
ge_df.expect_column_mean_to_be_between(
    column='order_total',
    min_value=35,
    max_value=59,
)

result = ge_df.validate()
print(result)
```
Works until your data changes for legitimate reasons. Black Friday sale? Rules fire false positives all day. New product line? You’re manually updating thresholds.
Think of it like a smoke detector. Too sensitive: you’re silencing alarms every time you cook toast. Not sensitive enough: the kitchen’s on fire before it beeps. That middle ground shifts with your business.
Pro tip: Use percentile-based thresholds (P5 to P95) instead of standard deviations for skewed distributions. They adapt better to real business patterns.
Method B: LLM Semantic Validation
Describe what good data looks like. Let the model decide. The Instructor library makes this pattern clean with Pydantic.
```python
from typing import Annotated

import instructor
from instructor import llm_validator
from openai import OpenAI
from pydantic import BaseModel, BeforeValidator

# API shape as of instructor ~1.x; check your installed version
client = instructor.from_openai(OpenAI())

class ProductReview(BaseModel):
    product_name: str
    review_text: Annotated[
        str,
        BeforeValidator(
            llm_validator(
                "Review must be relevant to the product, "
                "not contain spam or promotional language, "
                "and be written in coherent English",
                client=client,
                model="gpt-4o-mini",
            )
        ),
    ]

# This validates semantically, not just structurally
review = ProductReview(
    product_name="Laptop Stand",
    review_text="Great stand! Improved my posture."
)
```
Validation fails? Instructor automatically retries with error context. Structured failure reasons, not just a boolean.
Cost reality: 1,000 validation calls at 200 tokens input + 50 tokens output runs about $0.38 with GPT-4o-mini at the April 2026 prices above. High-volume pipelines: adds up. Critical data where manual review costs $50/hour per analyst: cheap insurance.
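The arithmetic behind that estimate is simple enough to keep as a helper. This is a sketch; `validation_cost` is a hypothetical function, and the default prices are the per-million-token GPT-4o-mini figures quoted earlier, which you should replace with your provider's current rates.

```python
def validation_cost(calls: int, in_tokens: int, out_tokens: int,
                    in_price_per_m: float = 0.75,
                    out_price_per_m: float = 4.50) -> float:
    """Estimate LLM validation cost in dollars for a batch of calls.
    Prices are per 1M tokens (defaults: the GPT-4o-mini figures above)."""
    input_cost = calls * in_tokens / 1_000_000 * in_price_per_m
    output_cost = calls * out_tokens / 1_000_000 * out_price_per_m
    return input_cost + output_cost

# The example from the text: 1,000 calls, 200 in / 50 out tokens each
cost = validation_cost(1_000, 200, 50)
```

Plugging your real token counts in up front tells you whether LLM validation belongs on every record or only on the failures rules already flagged.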
Building the Pipeline
Great Expectations remains the production standard for structured validation. 11,398 GitHub stars and roughly 28.35 million monthly PyPI downloads as of 2024, with v1.0 GA released in August 2024. The open-source version is free; GX Cloud adds managed infrastructure and ExpectAI for auto-generated expectations.
| Framework | Best For | Pricing (2024-2026) | Limitation |
|---|---|---|---|
| Great Expectations | Schema, types, ranges (CSV, SQL, Parquet) | Free (OSS) / Custom (Cloud) | Weak on unstructured data, streaming |
| Anomalo | Anomaly detection at petabyte scale | Enterprise only | No pricing transparency |
| LLM + Pydantic | Semantic validation, text quality | $0.75-$2.50 per 1M input tokens | Latency, cost at high volume |
Standard pattern: run Great Expectations for structural checks, then pipe failures or edge cases through LLM validation for semantic review.
```python
# Combined approach (illustrative sketch: `validator`, `batch_request`,
# `expectation_suite`, and `log_quality_issue` are project-local
# stand-ins, not Great Expectations APIs)
import great_expectations as gx
from validator import llm_semantic_check

context = gx.get_context()
batch = context.get_batch(batch_request)

# Structural validation first (cheap)
result = batch.validate(expectation_suite)

if not result.success:
    # LLM validation for semantic issues (expensive, targeted)
    for failed_row in result.failed_rows:
        semantic_result = llm_semantic_check(
            row=failed_row,
            criteria="Check if data is plausible given business context",
        )
        if not semantic_result.valid:
            log_quality_issue(failed_row, semantic_result.reason)
```
Keeps LLM costs contained by only invoking on suspicious records.
The False Positive Trap (And How to Tune It)
AI validation produces false positives. Research on AI detection accuracy (2026 data) shows rates from 2.1% (Originality.ai) to 14.7% (ZeroGPT). In data quality contexts, false positives mean legitimate records flagged as bad.
Why it happens:
- Statistical models flag outliers that are real but rare (legitimate $10,000 orders during enterprise sales)
- LLMs misinterpret domain-specific terminology as errors
- Thresholds set too tight produce noise; too loose miss real issues
Tuning strategy from production deployments:
Start with high precision. Low false positives even if you miss some true errors. Set statistical thresholds at 4-5 standard deviations, not 2-3.
Log everything flagged for 2 weeks. Manually review a sample. Calculate your actual false positive rate.
Adjust thresholds based on business impact. A false positive that blocks a $50,000 order costs more than missing a $12 data quality issue.
Use confidence scores. Most tools return a probability, not a binary flag. Set different alert levels (0.7 = warning, 0.9 = block).
For LLM validators, include examples of edge cases in your validation prompt. “High-value enterprise orders may exceed $5,000 and are valid.”
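The confidence-tier idea above can be sketched in a few lines. Everything here is hypothetical naming; the 0.7/0.9 thresholds are the example values from the text and should be tuned to your business impact:

```python
from enum import Enum

class Action(Enum):
    PASS = "pass"
    WARN = "warn"    # route to a review queue, don't block
    BLOCK = "block"  # hold the record for manual review

def triage(confidence: float,
           warn_at: float = 0.7,
           block_at: float = 0.9) -> Action:
    """Map a validator's issue-confidence score to an alert level
    instead of treating every flag as a hard failure."""
    if confidence >= block_at:
        return Action.BLOCK
    if confidence >= warn_at:
        return Action.WARN
    return Action.PASS
```

Splitting warnings from blocks is what keeps a 2-5% false positive rate from stalling the pipeline: most flags land in the review queue, and only high-confidence issues stop records.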
Streaming vs Batch
Batch validation (scan everything once, flag issues, move on) works for nightly ETL jobs. Real-time streams? Different architecture.
Problem: tools like Informatica DQ were designed for batch processing (as of 2024-2026). They’re disconnected from the ETL flow and can’t validate streaming data from IoT sensors or social media feeds in real-time.
Streaming validation requires:
- Sliding window checks: validate the last N records, not the entire dataset
- Watermarks and latency thresholds: detect late-arriving or out-of-sequence events
- Stateful validation: track running statistics (moving averages) without re-scanning history
Apache Flink or Kafka Streams handle this better than traditional data quality tools. You embed validation logic directly in the stream processor.
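The three requirements above can be illustrated with a toy stateful validator. This is plain Python for readability, a stand-in for logic you would embed in a Flink or Kafka Streams operator; the class name, window size, and sigma threshold are all assumptions:

```python
from collections import deque

class SlidingWindowValidator:
    """Track running statistics over the last N records and flag
    outliers without re-scanning history (the stateful, sliding-window
    pattern described above)."""

    def __init__(self, window: int = 1000, k: float = 4.0):
        self.values = deque(maxlen=window)  # bounded state, old records fall off
        self.k = k  # threshold in standard deviations (start loose, per the tuning advice)

    def check(self, value: float) -> bool:
        """Return True if the value looks valid against the current window."""
        ok = True
        if len(self.values) >= 30:  # wait for enough state before judging
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = var ** 0.5
            ok = std == 0 or abs(value - mean) <= self.k * std
        self.values.append(value)  # update state regardless of verdict
        return ok
```

A production version would also carry event-time watermarks for late arrivals; the point here is that the state is bounded and updated per record, never recomputed from the full history.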
Ataccama’s Agentic AI cuts rule creation time from 9 minutes to 1 minute (saves roughly 25 workdays per year when automating 1,500 rules, per official docs). But even automated rule generation doesn’t solve the streaming architecture gap.
When Static Rules Fail
Most validation rules are static. Define them once based on current data patterns. Data evolves – new product categories, seasonal shifts, schema changes – rules break.
Your product_category field historically contained 12 values. You add a new category. Static validation flags every new record as invalid.
ML-powered auto-rule generation (DataBuck, Ataccama as of 2024-2026) solves this by learning from data drift. Patterns shift? System auto-generates updated rules instead of requiring manual intervention.
Turns out real errors show up more often than you’d think. Google’s research (MLSys 2019, NeurIPS 2021) found their data validation system automatically detected errors in 6% of training runs across 700+ ML pipelines. Even “curated” benchmark datasets contain an average 3.4% label error rate. Static rules can’t adapt to this.
Stuck with static tools? Implement rule review cycles every quarter. Better: profiling tools that recalculate baselines weekly and alert when drift exceeds thresholds.
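That weekly-recalculation fallback is a few lines of stdlib Python. A minimal sketch, assuming a scheduled job hands it last quarter's baseline sample and this week's batch; the function name and the 2-sigma default are illustrative:

```python
import statistics

def drift_exceeded(baseline: list[float], current: list[float],
                   max_shift_sigmas: float = 2.0) -> bool:
    """Report whether this batch's mean has drifted from the stored
    baseline by more than the allowed number of baseline standard
    deviations -- a stand-in for a scheduled profiling job's alert."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    if base_std == 0:
        return statistics.mean(current) != base_mean
    shift = abs(statistics.mean(current) - base_mean) / base_std
    return shift > max_shift_sigmas
```

When this fires, the job should both alert and recompute the stored baseline, so legitimate shifts (a new product line, a seasonal pattern) stop alarming after one review.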
Production Deployment: Start Small, Measure Everything
Don’t validate every field on day one. High false positive rates kill trust fast.
Deployment pattern that actually works:
Pick 3-5 critical fields (primary keys, revenue-impacting columns).
Implement structural validation only (non-null, type checks, basic ranges).
Run in shadow mode for 2 weeks – log failures, don’t block pipelines.
Review logs. Calculate precision: (true issues) / (total flags).
If precision > 80%, promote to production with alerts. If < 80%, tune thresholds and repeat.
Add semantic validation for 1-2 text fields using LLM. Monitor cost and latency.
Expand coverage incrementally based on ROI.
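The precision gate in the steps above reduces to one calculation over the shadow-mode log. A sketch with hypothetical names; the 80% bar is the threshold from the text:

```python
def shadow_mode_precision(reviewed_flags: list[dict]) -> float:
    """Precision from a manually reviewed shadow-mode log: each flag is
    marked with whether it turned out to be a real issue.
    precision = (true issues) / (total flags)."""
    if not reviewed_flags:
        return 0.0
    true_issues = sum(1 for f in reviewed_flags if f["is_real_issue"])
    return true_issues / len(reviewed_flags)

# Example: 20 reviewed flags, 17 were real issues
reviewed = [{"is_real_issue": True}] * 17 + [{"is_real_issue": False}] * 3
precision = shadow_mode_precision(reviewed)  # 0.85 -> above the 80% bar
```

The denominator is everything flagged, not everything scanned, which is why two weeks of honest manual review is the unavoidable cost of this step.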
Track these metrics weekly:
- False positive rate (analyst time wasted on non-issues)
- False negative rate (real errors that slipped through – harder to measure; estimate it by periodically sampling records that passed)
- API cost per 1,000 records validated
- Latency added to pipeline
Organizations lose an average of $15 million annually to poor data quality (Gartner research). IBM found poor data quality strips $3.1 trillion from the U.S. economy each year. Automated checks help, but only if they don’t create new problems.
How do I choose between rule-based and LLM validation?
Rules for structure. LLMs for semantics. Done.
Great Expectations, Soda – schema conformance, type checks, range constraints, null detection. Fast. Cheap. LLM validation – text coherence, plausibility checks, domain-specific correctness that’s hard to encode in rules. Slower. Costs more. Production systems use both. Rules filter obvious structural issues, then LLMs handle edge cases that need context.
What’s a realistic false positive rate for automated data quality checks?
Statistical profiling: 2-5% with careful threshold tuning (as of 2024-2026 deployments). LLM validation: depends heavily on prompt quality and model choice; expect 3-10% initially, tunable to 2-4% with examples and refinement. High-volume pipelines? Even 2% produces significant analyst overhead. One debugging session on a false positive burns through hours. Always measure your actual rate in production and adjust thresholds based on business impact, not arbitrary precision targets. A false positive that blocks a $50,000 order hurts more than missing a $12 data quality issue.
Can Great Expectations handle streaming data validation?
No. Great Expectations was designed for batch data (CSV, Parquet, SQL tables). Doesn’t support real-time streaming validation for IoT, social media, or event-driven architectures. For streaming, you need tools that integrate with Apache Flink, Kafka Streams, or similar stream processors, or use platforms explicitly built for real-time validation. This is a documented limitation as of v1.0 GA (August 2024) – GX works brilliantly for batch ETL pipelines but requires different tooling for streaming use cases.