Hot take: most people who share Horace He’s “Making deep learning go brrrr from first principles” on Twitter haven’t actually applied it. They quote the factory analogy, nod at operator fusion, and then go right back to randomly toggling mixed_precision=True and praying. The post is trending again – it hit the Hacker News front page in May 2026, four years after it was written – and the reason it keeps resurfacing is that the framework still works. But only if you treat it as a decision procedure, not a vibe.
This guide skips the re-explanation of the factory metaphor (every other tutorial does that). Instead: how to actually use the deep learning go brrrr framework on your own code this week, what the post quietly leaves out, and the places where its advice has aged.
The problem: “my model is slow” is not a diagnosis
You write a training loop. It runs at 40% GPU utilization. You Google “PyTorch slow training” and get fifteen suggestions: bigger batch size, num_workers=8, mixed precision, gradient accumulation, pin memory, channels-last. You try three of them. Nothing changes. Or worse, things get slower.
This is the cargo-cult trap Horace called out in the original Twitter thread: researchers often cargo cult performance without a solid understanding of the underlying principles. The optimizations on every “speed up your PyTorch” listicle target different bottlenecks. Applying the wrong one to your situation does literally nothing – sometimes worse than nothing.
Why the standard “speed up PyTorch” advice falls short
Generic checklists assume your bottleneck is whatever the author’s bottleneck was. The framework’s actual claim is sharper: for single-GPU performance, there are 3 main areas your model might be bottlenecked by – compute, memory-bandwidth, and overhead. The fix that matters depends entirely on which regime you’re in.
Concrete example of the asymmetry: if you’re spending all your time doing memory transfers (memory-bandwidth-bound), then upgrading to a GPU with more FLOPS won’t help. If you’re spending all your time on large matmuls (compute-bound), then rewriting model logic in C++ to reduce overhead won’t help. Pick the wrong fix, get zero return.
The recommended approach: profile, classify, then fix
Three steps. In order. No skipping.
Step 1 – Measure where the time actually goes
Before changing anything, profile a single forward+backward pass. PyTorch ships a profiler that’s good enough for the first cut:
import torch
from torch.profiler import profile, ProfilerActivity
model = model.cuda()  # `model` is your existing nn.Module
x = torch.randn(32, 3, 224, 224, device="cuda")
# Warm up - first call always lies
for _ in range(3):
    model(x)
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    out = model(x)
    out.sum().backward()
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=15))
Top of that table tells the story. Matmul/conv kernels eating ≥70% of CUDA time? Compute-bound. Pointwise ops (add, mul, gelu, layer_norm) and memory copies dominating? Memory-bandwidth-bound. Total CUDA time tiny next to wall-clock time, with CPU-side time dominating instead? That’s overhead.
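One rough way to make that reading of the table repeatable is sketched below. It is a heuristic, not something from the original post: the name `classify_regime`, the substring list, and the 50% GPU-busy cutoff are assumptions for illustration. Only the 70% matmul/conv threshold comes from the rule of thumb above. It expects per-op self CUDA times (in microseconds) copied out of the profiler table.

```python
# Hypothetical helper: bucket per-op self CUDA times into the three regimes.
# MATMUL_HINTS and busy_cutoff are illustrative assumptions, not profiler API.
MATMUL_HINTS = ("mm", "matmul", "gemm", "conv")


def classify_regime(cuda_time_us, wall_clock_us, matmul_share=0.70, busy_cutoff=0.50):
    """Return 'overhead', 'compute', or 'memory-bandwidth' for one profiled step."""
    if wall_clock_us <= 0:
        raise ValueError("wall_clock_us must be positive")
    total_cuda = sum(cuda_time_us.values())
    # GPU idle for most of the step: the host side is the bottleneck.
    if total_cuda / wall_clock_us < busy_cutoff:
        return "overhead"
    matmul_time = sum(
        t for name, t in cuda_time_us.items()
        if any(hint in name.lower() for hint in MATMUL_HINTS)
    )
    # Matmul/conv kernels dominate GPU time: treat the step as compute-bound.
    if matmul_time / total_cuda >= matmul_share:
        return "compute"
    return "memory-bandwidth"


example = {"aten::addmm": 800.0, "aten::gelu": 120.0, "aten::native_layer_norm": 80.0}
print(classify_regime(example, wall_clock_us=1100.0))
```

Treat the result as a prompt for a closer look at the table, not a verdict: real steps are often a mix of regimes.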
Step 2 – Match the bottleneck to the fix
| Regime | What helps | What’s a waste of time |
|---|---|---|
| Compute-bound | Tensor cores (FP16/BF16), bigger batches, better matmul shapes, FlashAttention | Reducing Python overhead, kernel fusion of pointwise ops |
| Memory-bandwidth-bound | Operator fusion, activation checkpointing, lower precision, channels-last | Switching from FP32 to TF32 on matmuls you’re not running |
| Overhead-bound | torch.compile, CUDA graphs, larger batches, removing Python loops in the hot path | Buying a faster GPU |
Two A100 numbers get confused all the time (per Horace’s 2022 post): 312 TFLOPS – tensor core throughput, only reachable on matrix multiplications – and 19.5 TFLOPS, what the chip delivers for general-purpose FP32 compute. Most GPU bragging cites the first number. Most training code never hits it.
Step 3 – Apply the fix and re-measure
Don’t apply three changes at once. You won’t know which one helped. Apply one. Re-profile. Compare.
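A minimal timing harness for the "apply one, re-measure, compare" loop might look like the sketch below. The names `benchmark` and `speedup` and the default counts are assumptions, not part of any library. The `sync` argument is there because CUDA kernels launch asynchronously: on a GPU you would pass `torch.cuda.synchronize` so the timer waits for queued work to finish.

```python
import statistics
import time


def benchmark(step, warmup=3, iters=10, sync=None):
    """Return the median wall-clock seconds per call of `step`."""
    for _ in range(warmup):
        step()  # warm-up calls are not timed
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()  # wait for queued GPU work before stopping the clock
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Stand-in workload so the sketch runs anywhere; swap in your training step.
baseline = benchmark(lambda: sum(range(20_000)))
candidate = benchmark(lambda: sum(range(10_000)))
print(f"speedup: {speedup(baseline, candidate):.2f}x")
```

Using the median keeps one slow outlier iteration from skewing the comparison.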
A real example: why fusion is the single biggest lever
Here’s the BERT data point that made the original post famous. In a standard BERT forward pass, tensor contractions (matmuls) account for 99.8% of total FLOPs, yet normalization layers achieve roughly 250x lower FLOPS (throughput) than the matmuls, and pointwise operations roughly 700x lower. So matmuls dominate the FLOP budget – but they do not dominate the wall-clock budget proportionally, because the cheap-looking pointwise ops are memory-bound and spend their time shuffling tensors between HBM and SRAM.
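This is the entire argument for operator fusion. Run x → layer_norm → gelu → dropout as three separate kernels and every op reads its input from HBM and writes its output back. Fuse them into one kernel and you load once, write once, and skip the two intermediate round trips to HBM. The math doesn’t change. The wall-clock time drops.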
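The traffic saving can be estimated with back-of-the-envelope arithmetic. The sketch below is a simplified model under stated assumptions: every op in the chain reads one tensor and writes one tensor of the same size, which ignores layer_norm's reduction statistics and any dropout mask. The function name `hbm_bytes` and the activation shape are hypothetical.

```python
def hbm_bytes(num_elements, bytes_per_element, num_ops, fused):
    """Approximate bytes moved to and from HBM for a chain of elementwise-style ops."""
    tensor_bytes = num_elements * bytes_per_element
    if fused:
        return 2 * tensor_bytes  # one read of the input, one write of the output
    return 2 * tensor_bytes * num_ops  # every op reads and writes a full tensor


# Hypothetical activation: batch 32, 512 tokens, hidden size 768, 2-byte elements.
n = 32 * 512 * 768
unfused = hbm_bytes(n, 2, num_ops=3, fused=False)
fused = hbm_bytes(n, 2, num_ops=3, fused=True)
print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB, "
      f"ratio: {unfused / fused:.1f}x")
```

For a memory-bound chain, runtime tracks bytes moved, so this ratio is a rough upper bound on what fusing that chain can buy.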
The most cited application of this principle is FlashAttention (Dao et al., 2022). As the paper puts it: attention algorithms should be IO-aware – accounting for reads and writes between levels of GPU memory. FlashAttention uses tiling to reduce HBM reads/writes, keeping intermediate results in on-chip SRAM. Same math as standard attention. Just fused. Yet it’s the speedup that unlocked long-context transformers.
Most readers don’t need to write fused kernels by hand anymore. PyTorch 2.0 introduced torch.compile with TorchInductor as the default backend (using OpenAI Triton for NVIDIA/AMD GPUs), which automatically fuses eligible pointwise and reduction operations that eager mode would launch as separate kernels. One line:
model = torch.compile(model)
# That's it. Now train as usual.
What the original post quietly leaves out
The 2022 piece is still right about principles. It’s missing some things that matter in 2026.
1. torch.compile’s first call lies to you. If you benchmark torch.compile(model) on a single forward pass, you’ll conclude it’s slower than eager. It is – the first time. One community benchmark (RTX 3090, torch 2.1.0) recorded eager mode at 0.020s vs compiled first run at 0.056s, then 0.004s from the second run onward – a 4.8x speedup once the compilation overhead is paid. Always warm up at least 3 iterations before timing anything.
2. Compile doesn’t work on every model. The same 2023 benchmark found that convnext-base breaks torch.compile outright. If your model has dynamic shapes, complex control flow, or unusual custom ops, compile may silently fall back to eager or crash. Test before you depend on it.
3. Fusion has a ceiling. On newer hardware, even hand-fused kernels leave performance on the table. The FlashAttention-3 paper (2024) reports that FlashAttention-2 achieves only ~35% utilization on H100, vs 80-90% for optimized GEMM kernels. The first-principles framework points you in the right direction, but the actual ceiling shifts every hardware generation.
The counterintuitive one: when you’re memory-bandwidth-bound, sometimes the highest-leverage move is to recompute values instead of saving them. Activation checkpointing trades extra FLOPs for fewer HBM round trips. The recompute itself is memory-bound and often fuses – so in practice, the wall-clock cost is almost nothing. You get a large chunk of memory back for nearly free.
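For readers who have not used it, a minimal sketch of activation checkpointing with `torch.utils.checkpoint.checkpoint` follows. The tiny block and its layer sizes are hypothetical, and the snippet runs on CPU. It only demonstrates that outputs and gradients match the plain forward pass; it does not measure memory or speed, which you would need to profile on your own model.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# Hypothetical block; the sizes are placeholders for illustration.
block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
x = torch.randn(8, 64, requires_grad=True)

# Plain forward: intermediate activations are kept around for backward.
out_plain = block(x)
out_plain.sum().backward()
grad_plain = x.grad.clone()

# Reset gradients before the second run.
x.grad = None
block.zero_grad()

# Checkpointed forward: intermediates inside `block` are not stored;
# they are recomputed from `x` when backward needs them.
out_ckpt = checkpoint(block, x, use_reentrant=False)
out_ckpt.sum().backward()
grad_ckpt = x.grad.clone()

print(torch.allclose(out_plain, out_ckpt), torch.allclose(grad_plain, grad_ckpt))
```

In a real model you would typically wrap each transformer layer or residual block this way rather than the whole network.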
Pro tips you won’t find in the post itself
- Always check arithmetic intensity (FLOPs per byte moved). It’s the missing tie-breaker between compute-bound and memory-bound. Low intensity = memory-bound, no matter how many FLOPs the op has on paper.
- Don’t trust nvidia-smi utilization. It reports “GPU is doing something” – not “GPU is doing useful something.” A memory-bound model can show 95% utilization and still be wasting 90% of its potential.
- Profile with realistic shapes. A 32×3×224×224 toy input won’t reveal the same bottlenecks as your actual 8192-token sequences.
- If overhead dominates and you can’t compile, the older trick still works: increase batch size until per-step work overwhelms per-step overhead.
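The arithmetic-intensity check from the first tip can be sketched as a small roofline-style calculation. The 312 TFLOPS figure is the A100 tensor core number quoted earlier. The 1.5 TB/s memory bandwidth is an assumption, so check your GPU's spec sheet. The function names and the idealized byte counts (each operand moved exactly once) are also illustrative assumptions.

```python
def ridge_point(peak_flops_per_s, bandwidth_bytes_per_s):
    """FLOPs per byte above which an op can, in principle, be compute-bound."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def matmul_intensity(m, n, k, bytes_per_element=2):
    """Idealized intensity of an (m x k) @ (k x n) matmul: each operand moved once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def pointwise_intensity(flops_per_element=1, bytes_per_element=2):
    """One read and one write per element for a simple elementwise op."""
    return flops_per_element / (2 * bytes_per_element)


def regime(intensity, ridge):
    return "compute-bound" if intensity >= ridge else "memory-bound"


PEAK_FLOPS = 312e12  # A100 tensor core throughput quoted in this article
BANDWIDTH = 1.5e12   # ASSUMPTION: roughly 1.5 TB/s of HBM bandwidth

ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)
for name, ai in [("4096^3 matmul", matmul_intensity(4096, 4096, 4096)),
                 ("pointwise op", pointwise_intensity())]:
    print(f"{name}: {ai:.2f} FLOPs/byte -> {regime(ai, ridge)} (ridge {ridge:.0f})")
```

Under these assumptions the large matmul sits well above the ridge and the pointwise op far below it, which is the asymmetry the three-regime framework is built on.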
FAQ
Is the 2022 post still relevant in 2026?
Yes. The hardware got faster, the compilers got better, but the three-regime framework hasn’t been replaced. It’s still the right first question to ask.
Should beginners just use torch.compile and skip all this?
Honestly, yes – try torch.compile first. One line, no code changes, and on modern GPU hardware (as of 2025, commonly reported on A100/H100 workloads) it often gives you a real speedup. But when it doesn’t help – or when it crashes, as it does on some architectures – you need the framework. Compile handles overhead and does some fusion. It won’t fix a tiny batch size or a matmul shape that misses tensor cores.
How do I know if I’m compute-bound vs memory-bound without a fancy profiler?
Quick proxy: temporarily drop your model’s precision (FP32 → BF16) and time it again. If wall-clock time drops by well over 2x, you were largely compute-bound – tensor cores kicked in. If it drops by up to roughly 2x, you were probably memory-bound and the win came from moving half as many bytes, not from faster math. If it barely changes, suspect overhead. Either way you learn something. The PyTorch profiler gives you the precise answer, but this two-minute test gets you 80% of the way there.
Next action: open your current training script, add the 15-line profiler block from Step 1, and run it once. Don’t change anything else yet. Just look at the table and answer one question – which of the three regimes are you in? That answer determines everything you should do next.