Bible as RAG Database: What That HN Post Actually Teaches You

A trending Show HN turns the Bible into a RAG database. Here's how it works, where it breaks, and how to build your own version this weekend.

7 min read · Beginner

The most interesting thing about the Bible as RAG Database post that blew up on Hacker News – hitting ~119 points and 48 comments at time of posting – isn’t that someone built a Bible search tool. People have been building those for two years. It’s that the author was honest enough to say the demo takes 15 seconds and runs on a 4GB index they don’t fully understand. That confession is the most useful tutorial in the thread.

Most RAG tutorials hide the parts that don’t work. This one didn’t. So instead of writing another “what is RAG” walkthrough, let’s read the actual post like a code review and build a leaner version that avoids the traps the author hit.

What actually shipped on crosscanon.com

The site does one thing: you type a modern phrase and get back related Scripture. No commentary, no LLM summary, just verses. The viral example in the HN thread was typing “more money more problems” and getting Ecclesiastes back. That’s semantic search working exactly as intended – the embedding model maps Notorious B.I.G. lyrics and Solomon’s vanity-of-wealth verses into the same neighborhood of vector space.

The underlying text is the WEB (World English Bible), a public-domain translation, per community discussion in the thread. That detail matters more than it sounds. We’ll come back to it.

Why the 15-second query is the real lesson

The author’s own words from the thread: the demo is “slow and I vibe coded it” and takes about 15 seconds to vector search against a 4GB index. Every comment in the thread that says “cool idea, but…” is pointing at the same thing without naming it.

Here’s what’s almost certainly happening: a flat (brute-force) vector search over the full Bible. The Bible has roughly 31,000 verses. With a decent embedding model that’s about 31,000 vectors – either 768 or 1536 dimensions depending on your model choice – and retrieval should return in well under a second, not 15 seconds. Raw float32 vectors at that scale come to roughly 100 to 200 MB, so a 4GB index suggests over-chunking, duplicated or verbosely stored embeddings, or both, and probably no approximate nearest neighbor (ANN) index on top.
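A quick back-of-envelope calculation shows how far 4GB is from what verse-level embeddings need. This sketch assumes float32 storage (4 bytes per dimension) and one vector per verse. Both are assumptions, since the thread does not say how the index was built.

```python
# Back-of-envelope size of raw, unquantized embeddings for a verse-level index.
# Assumptions (not from the HN thread): float32 storage, one vector per verse.
VERSES = 31_000
BYTES_PER_FLOAT32 = 4


def raw_size_mb(num_vectors: int, dims: int) -> float:
    """Size of the raw vectors in megabytes (1 MB = 10**6 bytes)."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1e6


for dims in (768, 1536):
    print(f"{dims} dims: {raw_size_mb(VERSES, dims):.0f} MB")
# 768 dims: 95 MB
# 1536 dims: 190 MB
```

Even at 1536 dimensions the raw vectors are about a twentieth of 4GB, so storage precision alone would not account for the reported size.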

Pro tip: If your RAG demo is slow on under a million documents, the problem is almost never the embedding model. More often you’re scoring every vector one at a time in Python instead of letting pgvector, FAISS, or Qdrant do the work with an HNSW or IVF index. According to the pgvector docs, IVFFlat and HNSW both trade a small amount of recall for speed, and HNSW offers the better speed-recall tradeoff at the cost of slower builds and more memory. At Bible scale, the difference between a naive Python scan and an indexed search is seconds versus milliseconds.
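For reference, here is a sketch of what the two pgvector index types look like as SQL, kept as Python strings so they can be run with any Postgres client. It assumes a populated table named `bible` with an `emb` vector column, as in this article's script. The `lists`, `probes`, and `ef_search` values are illustrative starting points, not tuned recommendations.

```python
# pgvector index options as SQL strings. Table and column names (bible, emb)
# follow this article's script; the numeric settings are illustrative only.

# IVFFlat: build it after the rows are loaded. The pgvector README suggests
# rows / 1000 as a starting point for `lists` on tables up to 1M rows.
IVFFLAT_INDEX = (
    "CREATE INDEX ON bible USING ivfflat (emb vector_cosine_ops) WITH (lists = 31)"
)
# Probing more lists per query raises recall and lowers speed (default is 1).
IVFFLAT_TUNING = "SET ivfflat.probes = 5"

# HNSW: slower to build and heavier on memory, better speed-recall tradeoff.
HNSW_INDEX = "CREATE INDEX ON bible USING hnsw (emb vector_cosine_ops)"
# Larger ef_search raises recall and lowers speed (40 is the pgvector default).
HNSW_TUNING = "SET hnsw.ef_search = 40"


def index_statements(kind: str) -> list[str]:
    """Return the DDL plus the matching query-time setting for one index type."""
    options = {
        "ivfflat": [IVFFLAT_INDEX, IVFFLAT_TUNING],
        "hnsw": [HNSW_INDEX, HNSW_TUNING],
    }
    return options[kind]
```

Pick one index type per column; the `SET` statements apply to the current session only.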

The recommended build path

Here’s the version I’d build in a weekend. It deliberately diverges from the standard LangChain+PDF tutorial that every Medium post regurgitates.

1. Skip the PDF. Use a verse-level JSON.

The typical tutorial uses PyPDFLoader on a Bible PDF, then runs RecursiveCharacterTextSplitter with 1000-character chunks and 200 overlap. This is wrong for the Bible. A verse is already a self-contained semantic unit – short, internally coherent, and naturally bounded. Splitting by character count cuts mid-verse and merges unrelated verses across chapter boundaries.

Grab the WEB translation as a verse-keyed JSON (it’s public domain, available on multiple GitHub mirrors) and you skip an entire stage of preprocessing.
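A verse-keyed file also gives you structured metadata almost for free. The sketch below assumes keys shaped like `"Gen 1:1"` or `"1 John 3:16"`. The `Verse` type and `parse_verses` helper are hypothetical names for illustration, and real mirrors may use a different key format.

```python
import re
from typing import NamedTuple


class Verse(NamedTuple):
    book: str
    chapter: int
    verse: int
    text: str


# Matches refs such as "Gen 1:1" or "1 John 3:16": book name, chapter, verse.
REF_PATTERN = re.compile(r"^(?P<book>.+?)\s+(?P<chapter>\d+):(?P<verse>\d+)$")


def parse_verses(raw: dict[str, str]) -> list[Verse]:
    """Turn a {"Book C:V": "text"} mapping into structured verse records."""
    records = []
    for ref, text in raw.items():
        match = REF_PATTERN.match(ref.strip())
        if match is None:
            # Fail loudly rather than silently dropping a verse.
            raise ValueError(f"Unrecognized verse reference: {ref!r}")
        records.append(
            Verse(match["book"], int(match["chapter"]), int(match["verse"]), text.strip())
        )
    return records
```

Storing book, chapter, and verse as separate fields makes later filtering ("only the Psalms") a plain SQL `WHERE` clause instead of string parsing at query time.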

2. Embed once, store with an index from day one

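import json

import numpy as np
import openai
import psycopg
from pgvector.psycopg import register_vector

# Verse-keyed JSON, e.g. {"Gen 1:1": "In the beginning..."}
with open("web_bible.json", encoding="utf-8") as f:
    verses = json.load(f)

# autocommit=True so the inserts and the index persist without an explicit commit
conn = psycopg.connect("...", autocommit=True)
# Create the extension first: register_vector() needs the vector type to exist.
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)
conn.execute("""
    CREATE TABLE bible (ref TEXT PRIMARY KEY, text TEXT, emb vector(1536))
""")

for ref, text in verses.items():
    e = openai.embeddings.create(
        model="text-embedding-3-small", input=text  # as of writing; check OpenAI docs for current models
    ).data[0].embedding
    # pgvector's psycopg adapter accepts NumPy arrays for vector columns
    conn.execute("INSERT INTO bible VALUES (%s, %s, %s)", (ref, text, np.array(e)))

# This is the line every Bible-RAG tutorial forgets (run it after the rows are loaded):
conn.execute("CREATE INDEX ON bible USING ivfflat (emb vector_cosine_ops)")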

That last line is the difference between a 15-second demo and a fast one. Embedding all 31,000 verses takes roughly 15-30 minutes per the calebyhan/bible-rag README – you only do it once, then the index carries you.
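One way to shorten the one-time embedding pass is to send verses in batches rather than one request per verse. The sketch below is provider-agnostic: `embed_batch` is a hypothetical callable you supply, and the batch size of 256 is an arbitrary assumption, so check your provider's current limits.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 256,
) -> list[list[float]]:
    """Embed every text, one request per batch, preserving input order."""
    vectors: list[list[float]] = []
    for chunk in batched(texts, batch_size):
        result = embed_batch(chunk)
        if len(result) != len(chunk):
            raise ValueError("embed_batch returned a different number of vectors")
        vectors.extend(result)
    return vectors


# Hypothetical adapter for the OpenAI Python SDK (v1+), which accepts a list
# of strings as `input`; left as a comment because it needs an API key:
#
# def openai_embed(chunk):
#     resp = openai.embeddings.create(model="text-embedding-3-small", input=list(chunk))
#     return [d.embedding for d in sorted(resp.data, key=lambda d: d.index)]
```

Keeping the provider call behind a plain function also makes the pipeline easy to test with a fake embedder before spending any API credits.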

3. Query with a refusal path

Borrow the pattern from Ikhimwin Emmanuel’s Bible Q&A writeup: if no result clears a similarity threshold, return nothing. Don’t pass weak matches to an LLM and pray. That’s how you get a chatbot confidently inventing a verse that doesn’t exist.
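A minimal sketch of that refusal path, assuming the `bible` table from the script above and a psycopg 3 connection passed in as `conn`. The function name `search` and the 0.4 floor are illustrative assumptions, and the floor is a starting heuristic that needs tuning per embedding model.

```python
from typing import Any, Sequence

MIN_SIMILARITY = 0.4  # assumed starting point; tune for your embedding model

# `<=>` is pgvector's cosine distance operator, so similarity = 1 - distance.
# The explicit ::vector cast lets the query embedding be passed as a plain list.
SEARCH_SQL = """
    SELECT ref, text, 1 - (emb <=> %s::vector) AS similarity
    FROM bible
    ORDER BY emb <=> %s::vector
    LIMIT %s
"""


def search(
    conn: Any,
    query_emb: Sequence[float],
    k: int = 5,
    min_similarity: float = MIN_SIMILARITY,
) -> list[tuple[str, str, float]]:
    """Return up to k verses above the similarity floor, or [] to refuse."""
    emb = list(query_emb)
    rows = conn.execute(SEARCH_SQL, (emb, emb, k)).fetchall()
    # Refusal path: drop weak matches instead of handing them downstream.
    return [(ref, text, sim) for ref, text, sim in rows if sim >= min_similarity]
```

An empty list is the signal to show "no close match" in the UI rather than the nearest verse regardless of distance.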

The translation choice nobody talks about

When the original Biblos Show HN dropped in November 2023, a commenter asked “which translation it uses?” Two years later, on the crosscanon thread, the same question came up. Nobody has written this up properly, so here it is.

Your translation IS a retrieval variable. Embedding the KJV (“thou shalt not”) and embedding the NIV (“do not”) can produce different nearest neighbors for the same query. A user typing “how do I forgive someone who hurt me” may get different verses depending on whether your index was built on archaic or modern English, most likely because the embedding model saw far more modern text in training.

For a semantic-search use case targeting modern users, WEB or another recent translation should outperform KJV. That’s not a theological claim – it’s a vocabulary-overlap one.
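If you want to check this on your own data rather than take it on faith, one simple measure is how much the top results overlap when the same query runs against indexes built from two translations. The sketch below assumes you already have both result lists as verse references in the same key format; `top_k_overlap` is a hypothetical helper, and Jaccard overlap is just one reasonable choice of metric.

```python
def top_k_overlap(refs_a: list[str], refs_b: list[str]) -> float:
    """Jaccard overlap of two result lists: 1.0 = same verses, 0.0 = disjoint."""
    set_a, set_b = set(refs_a), set(refs_b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists agree trivially
    return len(set_a & set_b) / len(set_a | set_b)
```

Running it over a few dozen representative queries and averaging gives a rough sense of how much the translation choice moves your results.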

A gotcha that will bite you: verse numbering

This is the kind of thing that makes a production deploy embarrassing. Per the calebyhan/bible-rag README, about 0.17% of Hebrew verses have numbering that doesn’t match the English system. Joel 3 in English is Joel 4 in Hebrew. Daniel 3:31-33 in the Hebrew numbering is Daniel 4:1-3 in English.

0.17% sounds tiny until a user clicks a citation and lands on the wrong verse. Decide upfront which versification you’re standardizing on, store it as metadata, and surface a footnote in the UI when there’s a known mismatch.
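One way to store that metadata is a small lookup from the versification you standardize on to the other system. The sketch below assumes English numbering as the canonical key and includes only the Daniel example mentioned above, so a real table would need a complete mapping from a vetted source. The names `Ref`, `ENGLISH_TO_HEBREW`, and `hebrew_footnote` are hypothetical.

```python
from typing import NamedTuple, Optional


class Ref(NamedTuple):
    book: str
    chapter: int
    verse: int


# Partial, illustrative mapping keyed by English numbering.
# English Daniel 4:1-3 corresponds to Daniel 3:31-33 in Hebrew numbering.
ENGLISH_TO_HEBREW = {
    Ref("Daniel", 4, 1): Ref("Daniel", 3, 31),
    Ref("Daniel", 4, 2): Ref("Daniel", 3, 32),
    Ref("Daniel", 4, 3): Ref("Daniel", 3, 33),
}


def hebrew_footnote(ref: Ref) -> Optional[str]:
    """Return footnote text when the two numbering systems disagree, else None."""
    other = ENGLISH_TO_HEBREW.get(ref)
    if other is None:
        return None
    return (
        f"{ref.book} {ref.chapter}:{ref.verse} is "
        f"{other.book} {other.chapter}:{other.verse} in Hebrew numbering"
    )
```

Because most verses have no entry, the lookup stays tiny and the UI only renders a footnote when one exists.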

Why this matters beyond the Bible

Strip the Scripture and you have a 31,000-row dataset with strong structural metadata (book, chapter, verse), public domain text, and a community that will instantly notice when retrieval is wrong. It’s a near-perfect benchmark corpus for testing RAG pipelines – and that’s the angle the HN thread is gesturing at without saying it outright.

If you’re learning RAG and want a corpus where you can feel when retrieval drifts, this beats yet another company-docs demo. RAG was introduced by Lewis et al. in 2020 as a way to ground generation in retrievable evidence; the Bible is one of the few corpora where readers immediately know whether the evidence is right.

FAQ

Do I need an LLM at all for this?

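No – crosscanon proves it. Pure semantic search, no generated text, no generation cost. You still have to embed each query, which is a small per-request charge if you use a hosted embedding model. Add an LLM only if you need explanation on top of the verses.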

Why does my query return weird matches like Leviticus when I ask about modern emotions?

Most likely your similarity threshold is set too low, so the system returns the closest match even when it’s not actually close. Set a floor – cosine similarity around 0.4 is a common starting heuristic, though your mileage varies by model – and return empty when nothing clears it. If that doesn’t fix it, check your chunking. Embedding whole chapters instead of individual verses drowns the signal; switching to verse-level chunks usually cleans up the noise fast.

Can I just use ChatGPT with a Bible PDF attached instead of building this?

For one-off questions, sure. But there’s a real limitation worth knowing: ChatGPT will sometimes paraphrase verses from its training data instead of quoting your PDF, and you can’t enforce “only answer from this source.” That’s the misconception behind “just attach the PDF” – you’re not actually grounding the model, you’re hoping it prefers your document. A retrieval-only pipeline returns the stored verses verbatim along with the exact verse reference, which is the whole point of the crosscanon approach.

Next step: grab the WEB translation JSON from any public GitHub mirror, run the script above against a free pgvector instance (Supabase offers one as of writing – check their current pricing page), and time your first query. If it’s under 100ms, you’ve already beaten the trending demo.
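To time that first query, a small helper like the one below is enough. It is a sketch using only the standard library. `time_query_ms` is a hypothetical name, and `run_query` is whatever zero-argument callable wraps your search.

```python
import statistics
import time
from typing import Callable


def time_query_ms(run_query: Callable[[], object], repeats: int = 5) -> float:
    """Median wall-clock latency of run_query, in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000)
    # The median is less sensitive than the mean to one slow, cold-cache run.
    return statistics.median(samples)
```

Embed the query once outside the timed callable if you want to measure the database search on its own, since a hosted embedding call adds network latency that has nothing to do with your index.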