A claim blew up this week: “ChatGPT is just a program trained on YouTube comments.” It’s everywhere – Reddit threads, X posts, TikTok explainers. The problem? It’s wrong. But the truth is messier, weirder, and way more important for how you actually use AI.
Here’s what really happened, how to verify what training data means for your results, and three hidden biases this creates that nobody’s talking about.
The Myth: ChatGPT Trained on YouTube Comments
The viral claim conflates two separate things: YouTube video transcripts (which were scraped) and YouTube comments (which weren’t part of disclosed training datasets). Proof News and WIRED revealed in July 2024 that companies like Apple, Nvidia, and Anthropic used a dataset called “YouTube Subtitles” – transcripts from 173,536 videos – to train AI models without creator consent.
Not comments. Transcripts.
Why does this matter? Because the myth makes it sound like ChatGPT learned sarcasm and trolling from comment sections. The reality is it absorbed scripted, produced video content – educational channels, tech reviews, news broadcasts – which has completely different implications for bias and reliability.
What ChatGPT Is Actually Trained On
Let’s walk backwards from the confusion. GPT-3, the base model for ChatGPT, was trained on roughly 300 billion tokens of text (per the original GPT-3 paper); the filtered Common Crawl component alone weighed in at 570GB. The sources:
- Common Crawl: Filtered web scrapes from 2016-2019 (45TB of compressed text, filtered down to 570GB after cleanup)
- WebText2: Text from web pages linked in Reddit posts with 3+ upvotes
- Books1 & Books2: Two internet-based book corpora
- Wikipedia: English-language pages
GPT-4 scaled up to approximately 13 trillion tokens, according to leaked reports. OpenAI didn’t disclose specifics, citing “competitive landscape and safety implications.” Training reportedly cost $63 million and used 25,000 NVIDIA A100 GPUs over 90-100 days.
Pro tip: When someone says “ChatGPT was trained on X,” ask: Pre-training or fine-tuning? GPT-3 era or GPT-4? The answer changes everything. Most viral claims mix up base training data (books, web text) with post-training RLHF (human feedback) or with datasets used by other companies who built on top of OpenAI’s architecture.
The YouTube Transcript Scandal: What Actually Happened
Here’s the timeline of what went down:
- 2020: EleutherAI, a non-profit AI research lab, created “The Pile” – an 800GB dataset meant to “democratize” AI training data. One component: “YouTube Subtitles,” containing transcripts from 173,536 videos across 48,000 channels.
- 2024: Proof News investigated and found that Apple (OpenELM model), Nvidia, Anthropic, and Salesforce all trained models using The Pile – including YouTube Subtitles – without creator permission and in violation of YouTube’s Terms of Service.
- March 2026: YouTube started asking users “Does this feel like AI slop?” to detect low-quality AI content. Users immediately theorized this wasn’t to filter slop out, but to train AI to make slop that’s undetectable.
Affected creators include MrBeast (2 videos), PewDiePie (337 videos), Marques Brownlee (7 videos), and educational channels like Khan Academy and MIT. The dataset also contains 12,000+ videos that have been deleted since 2020 – meaning the AI can reference content you can’t verify.
Apple later clarified OpenELM was “research only.” But Salesforce’s model trained on The Pile has been downloaded 86,000+ times. Research-only doesn’t stay research-only.
How to Verify What Training Data Means for Your Queries
You can’t audit ChatGPT’s training data directly, but you can test whether its answers reflect YouTube-style content patterns. Here’s how:
Step 1: Prompt for Verbatim Recall
Ask ChatGPT to recite the opening of a well-known video transcript. Example:
"Recite the first 100 words of Marques Brownlee's iPhone 13 review."
If it generates something suspiciously close to the actual video, that’s evidence of memorization (not proof, but a signal). ChatGPT is designed NOT to regurgitate training data, but researchers behind a 2023 extraction attack pulled what they estimated was on the order of a gigabyte of verbatim training content using specific prompt patterns. In some of their tests, over 5% of the output was a direct 50-token copy of training data.
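You can approximate that 50-token test yourself. The sketch below finds the longest verbatim token run shared between a model answer and a transcript you supply; whitespace splitting stands in for the model’s real BPE tokenizer, so treat the counts as rough, not exact:

```python
# Sketch: flag long verbatim overlaps between a model answer and a
# reference transcript. Whitespace tokenization is an illustrative
# simplification -- the model's actual tokenizer is BPE-based.

def longest_verbatim_run(answer: str, reference: str) -> int:
    """Return the longest run of consecutive tokens shared by both texts."""
    a = answer.lower().split()
    b = reference.lower().split()
    best = 0
    # Longest-common-substring DP over tokens, O(len(a) * len(b)).
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def looks_memorized(answer: str, reference: str, threshold: int = 50) -> bool:
    """True if the answer contains a 50-token verbatim copy of the reference."""
    return longest_verbatim_run(answer, reference) >= threshold
```

With a real transcript pasted in as the reference, a run at or above 50 tokens is the kind of signal the extraction research flagged; shorter overlaps prove nothing, since common phrases repeat everywhere.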
Step 2: Ask for Source Attribution
When ChatGPT gives you technical info, ask: “Where would this information typically come from?” If it says “tech reviews” or “tutorial videos” without you mentioning video, that’s a tell.
Step 3: Check for Video-Specific Phrasing
YouTube transcripts have patterns: “Hey guys,” “Before we get started,” “Link in the description.” If ChatGPT’s answers include casual video-style intros when you asked for formal analysis, it’s leaking transcript style.
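One way to operationalize this check is a simple phrase scan over the model’s answer. The phrase list below is illustrative, not an exhaustive fingerprint of transcript style:

```python
import re

# Sketch: scan a response for YouTube-transcript-style phrasing.
# The phrase list is a hand-picked sample, not a validated detector.
TRANSCRIPT_TELLS = [
    r"\bhey guys\b",
    r"\bbefore we get started\b",
    r"\blink in the description\b",
    r"\bdon't forget to (like|subscribe)\b",
    r"\bin today's video\b",
]

def transcript_style_hits(text: str) -> list[str]:
    """Return the transcript-style phrase patterns found in the text."""
    lowered = text.lower()
    return [p for p in TRANSCRIPT_TELLS if re.search(p, lowered)]
```

A hit or two in an answer you asked to be formal is exactly the “leaking transcript style” tell described above; zero hits, of course, proves nothing either way.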
Step 4: Cross-Check Knowledge Cutoff
GPT-4’s training data cutoff is late 2023. If you ask about the YouTube scraping controversy itself, ChatGPT won’t know it happened (the investigation broke in July 2024). This creates a weird loop: the AI can’t tell you if it was trained on YouTube because it doesn’t know the investigation occurred.
Three Hidden Biases YouTube Transcripts Create
Most tutorials stop at “it was trained on YouTube.” Here’s what that actually means for your results:
| Bias Type | How It Shows Up | When It Matters |
|---|---|---|
| Explainer Bias | ChatGPT defaults to tutorial-style answers even when you need analysis. It “explains” instead of “evaluates.” | Research, critical thinking tasks, comparing trade-offs |
| Engagement Optimization | Answers mirror YouTube’s need to retain attention – longer than necessary, repetitive summaries, rhetorical questions. | When you need concise answers or are on a token budget |
| Deleted Content Ghosts | Training data includes 12,000+ removed videos. ChatGPT can reference info that no longer exists publicly and you can’t fact-check. | Fact-checking, citing sources, legal/medical queries |
Explainer Bias is subtle. Ask ChatGPT “What’s the best programming language?” and it’ll give you a beginner-friendly breakdown (“Python is great for beginners…”) even if you’re an experienced dev looking for performance benchmarks. That’s YouTube pedagogy baked in.
What This Means for ChatGPT Plus vs. Free
If you’re using the free tier (GPT-3.5), you’re working with training data that cuts off around late 2021 – pre-YouTube Subtitles scandal but post-EleutherAI dataset creation. OpenAI may use your conversations to train future models unless you opt out via Settings → Data Controls.
ChatGPT Plus ($20/month) uses GPT-4, which has an undisclosed training dataset but a late 2023 knowledge cutoff. The YouTube Subtitles dataset was created in 2020, so GPT-4 could include it – OpenAI won’t confirm. Enterprise and API users have data excluded from training by default.
The catch: Even if you pay, you don’t get transparency. You’re trusting OpenAI’s filtering, not verifying it.
When YouTube Training Data Actually Breaks Things
Here are three scenarios where YouTube-sourced training causes real problems:
- Medical/legal advice: YouTube health videos are often wrong or oversimplified. If ChatGPT absorbed “Top 5 Cancer Cures Big Pharma Doesn’t Want You to Know” transcripts, that’s now in the model’s probability distribution. It won’t cite the video, but the misinformation influences token prediction.
- Code tutorials: YouTube coding videos prioritize “works in the demo” over “production-ready.” ChatGPT’s code suggestions may mirror this – functional but not secure, optimized, or scalable.
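To see that “works in the demo” pattern concretely, here’s a hypothetical config-parsing example in both styles – the happy-path version tutorials tend to show, and a version that actually validates its inputs (both functions and the config shape are invented for illustration):

```python
import json

# Demo-style code (the pattern tutorial transcripts teach):
# happy path only -- crashes with an opaque error on bad input.
def load_port_demo(raw: str) -> int:
    return json.loads(raw)["port"]

# Production-minded version of the same task: validates the JSON,
# applies a default, and checks the type and range before returning.
def load_port(raw: str, default: int = 8080) -> int:
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"config is not valid JSON: {exc}") from exc
    port = config.get("port", default)
    if not isinstance(port, int) or not (1 <= port <= 65535):
        raise ValueError(f"invalid port: {port!r}")
    return port
```

Both “work in the demo.” Only one survives a malformed config file – and a model trained on tutorial transcripts has seen far more of the first style than the second.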
- Citing sources: If you ask ChatGPT for references and it suggests a YouTube video, you have no way to know if that video was in training data (memorized) or hallucinated (invented because it sounds plausible).
The deleted content problem is the wildest edge case. Imagine ChatGPT confidently describing a study mentioned in a now-deleted video. You can’t verify it. The original creator can’t defend it. The info just… floats in the model, untethered.
How to Opt Out (Sort Of)
You can’t remove YouTube transcripts from ChatGPT’s existing training data. But you can control what OpenAI does with your data going forward:
- Open ChatGPT → Click your profile → Settings → Data Controls
- Toggle off “Improve the model for everyone”
- Use Temporary Chat (icon in top-right) for sensitive queries – these aren’t saved or used for training
If you’re a YouTube creator and want to check if your videos were scraped, Proof News built a search tool that cross-references the leaked dataset. Fair warning: finding your content there means it’s already in deployed models. There’s no “undo.”
When NOT to Trust ChatGPT on YouTube-Style Queries
If your question is the kind of thing that gets 10M views on YouTube, be skeptical. That means:
- “How to get rich quick” (high engagement, low accuracy)
- “Secret tricks the experts don’t tell you” (clickbait optimized)
- “Why [controversial topic] is actually [hot take]” (engagement bait)
ChatGPT doesn’t “know” it’s parroting YouTube. It just learned that certain phrase patterns co-occur with certain topics. If YouTube’s incentive is watch time, and ChatGPT trained on transcripts optimized for watch time, the model inherits that incentive – even though it has no concept of “views” or “ads.”
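The “phrase patterns co-occur” point is easy to demonstrate with a toy bigram counter – no semantics, just frequency. The three-line corpus is invented for illustration, and real LLMs use vastly richer statistics, but the mechanism is the same in spirit:

```python
from collections import Counter

# Toy corpus of transcript-style openings (invented for illustration).
corpus = [
    "hey guys welcome back to the channel",
    "hey guys today we review the new phone",
    "hey guys before we get started hit subscribe",
]

# Count word bigrams across the corpus.
bigrams = Counter()
for line in corpus:
    words = line.split()
    for a, b in zip(words, words[1:]):
        bigrams[(a, b)] += 1

def most_likely_next(word: str) -> str:
    """Pick the most frequent follower of `word` in the toy corpus."""
    candidates = {b: n for (a, b), n in bigrams.items() if a == word}
    return max(candidates, key=candidates.get)
```

Ask it what follows “hey” and it answers “guys” – not because it understands greetings, but because that’s what the counts say. Scale that up and you get a model that reproduces watch-time-optimized phrasing without any concept of watch time.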
For research, cross-check answers against academic sources. For code, test in a sandbox. For medical stuff, don’t use ChatGPT at all.
The Future: LLMs Now Cite YouTube More Than Reddit
As of January 2026, LLMs cite YouTube in 16% of responses compared to Reddit’s 10%, according to data from AI marketing platform Bluefish. This isn’t just training data – it’s live retrieval. When ChatGPT uses search (if you’re on Plus with browsing enabled), it’s pulling from YouTube’s current library, not just the 2020 snapshot.
The problem? There’s no disclosure. You can’t tell if an answer came from 2020 training data (potentially deleted, unverified) or a 2026 live search (current but algorithmically recommended). It’s a black box wrapped in a black box.
Why This Matters More Than the Comment Myth
The viral claim – “ChatGPT trained on YouTube comments” – is wrong but useful. It forces the question: What is this trained on, and how does that shape what I get back?
The real scandal isn’t comments. It’s that 173,536 videos were scraped without consent, 12,000 are now deleted, and you have no way to audit which parts of ChatGPT’s “knowledge” come from verified sources vs. algorithmic junk vs. literal ghosts of removed content.
Next time ChatGPT gives you an answer, ask: Does this sound like a YouTube tutorial? If yes, that’s not necessarily wrong – but it’s a signal to verify before you trust it.
Frequently Asked Questions
Was ChatGPT directly trained on YouTube comments?
No. The disclosed training datasets (GPT-3’s 570GB corpus, GPT-4’s undisclosed dataset) include YouTube video transcripts (via the YouTube Subtitles dataset), not comments. Comments weren’t part of known scraping efforts, though OpenAI hasn’t released full GPT-4 data sources. The myth likely conflates “YouTube data” with “comment data.”
If I delete a YouTube video, is it still in ChatGPT’s training data?
Yes, if it was scraped before deletion. The YouTube Subtitles dataset (created in 2020) contains 12,000+ videos that have since been removed. Once content is in a model’s training data, deleting the original doesn’t remove it from the AI. The model learned patterns from that text – it’s baked in. Future models might not include it if they’re trained on fresh data, but deployed models don’t get “unlearned.”
How can I tell if ChatGPT’s answer came from YouTube transcripts vs. other sources?
You can’t definitively tell, but watch for: (1) Tutorial-style phrasing (“First, you’ll want to…”), (2) casual intros (“So here’s the thing…”), (3) step-by-step structures even when you asked for analysis. Cross-check by asking for sources – if ChatGPT says “this is commonly explained in tutorial videos” without you mentioning video, that’s a tell. For critical info, verify against academic papers or official docs, not just AI output.