δ-mem Tutorial: Give Your LLM Real Online Memory

Hands-on δ-mem guide: add efficient online memory to Qwen3 or SmolLM3 with an 8×8 state matrix, run the adapter, and avoid the common pitfalls.

By the end of this guide you’ll have a Qwen3-4B chat running with δ-mem attached – a tiny 8×8 memory state quietly riding alongside the frozen backbone, remembering things across turns without bloating your context window. The paper came out in May 2025 and the AI corners of X and Hugging Face are buzzing because the numbers look almost too good for the parameter count. So let’s actually run it.

We’ll walk backward: first the result you’ll see on screen, then the exact install path, then the gotchas that bite first-time users – the angles no current write-up covers.

What δ-mem actually is (the 60-second version)

δ-mem comes from the DeCLaRe Lab at NTU (Jingdi Lei and nine co-authors, led by Soujanya Poria). The design choice is specific: a lightweight memory mechanism augments a frozen full-attention backbone with a compact online associative-memory state. Past information is compressed into a fixed-size state matrix that is updated by delta-rule learning, and the readout from that state generates low-rank corrections to the backbone’s attention computation during generation.

Translation: the base model never moves. A tiny side-state – just an 8×8 online memory state by default – watches the conversation, accumulates a compressed summary of what mattered, and nudges attention at inference time. No full fine-tune, no RAG database, no 1M-token context bill.
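To make "updated by delta-rule learning" concrete, here is a minimal pure-Python sketch of a generic delta-rule associative memory on an 8×8 state. It shows the textbook update S ← S + β(v − Sk)kᵀ, not δ-mem's actual code: the function names, the write strength beta, and the toy key/value vectors are all assumptions made for illustration.

```python
# Generic delta-rule associative memory (textbook form).
# This is NOT the delta-mem source code; names and values are illustrative.

D = 8  # matches the 8x8 default state size mentioned above


def new_state(d=D):
    """Fixed-size d x d state; writing never makes it larger."""
    return [[0.0] * d for _ in range(d)]


def read(S, k):
    """Readout S @ k: what the memory currently associates with key k."""
    return [sum(row[j] * k[j] for j in range(len(k))) for row in S]


def write(S, k, v, beta=1.0):
    """Delta-rule update: S <- S + beta * (v - S k) k^T.

    Only the prediction error (v - S k) is written, so re-writing a key
    corrects the stored value instead of piling on top of it.
    """
    pred = read(S, k)
    err = [v[i] - pred[i] for i in range(len(v))]
    return [[S[i][j] + beta * err[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]


def one_hot(i, d=D):
    return [1.0 if j == i else 0.0 for j in range(d)]


S = new_state()
k1, v1 = one_hot(0), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
k2, v2 = one_hot(1), [0.5] * D

S = write(S, k1, v1)          # store v1 under k1
S = write(S, k2, v2)          # store v2 under k2
v1_new = [-x for x in v1]
S = write(S, k1, v1_new)      # overwrite k1; k2 is left untouched
```

With orthogonal unit keys and beta=1.0, each read returns the most recently written value for that key, and the state stays 8×8 no matter how many writes happen. Real keys are not orthogonal, so in practice the state is a lossy summary rather than a lookup table.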

The end result: what you’ll see on screen

After setup, you’ll launch the interactive chat demo, paste in a long document or have a multi-turn session about, say, a fictional company’s org chart, then ask follow-ups much later. The frozen Qwen3-4B alone tends to drift or forget specifics. With the δ-mem adapter loaded, it holds the thread. On the official repo benchmarks, the TSW variant pushes the average score from 46.79% to 51.66% on a frozen Qwen3-4B-Instruct backbone – and on memory-heavy subtasks the jump is much sharper.

Hands-on: getting δ-mem running

No GPU, no run. That’s the short version of the hardware check. Training scripts use bf16 and FlashAttention with DeepSpeed – CPU-only usage isn’t a supported path in this release. If you’re on a MacBook, you’ll need a remote GPU box for anything beyond reading the code. Full training in the paper used 8 × A800 GPUs with bfloat16 precision, DeepSpeed ZeRO-2, and fused AdamW; inference is lighter, but a GPU with bf16 support is still the baseline.
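Before cloning anything, a quick pre-flight check can save a failed install. The sketch below is one way to do it, assuming PyTorch is (or will be) installed in the environment; check_gpu is a helper name invented here, not something the repo ships.

```python
import importlib.util


def check_gpu():
    """Report whether PyTorch can see a CUDA GPU with bf16 support.

    Hypothetical helper for this tutorial; not part of the delta-Mem repo.
    """
    report = {"torch_installed": False, "cuda": False, "bf16": False}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch not installed yet; nothing more to check

    import torch

    report["torch_installed"] = True
    report["cuda"] = bool(torch.cuda.is_available())
    if report["cuda"]:
        report["bf16"] = bool(torch.cuda.is_bf16_supported())
    return report


if __name__ == "__main__":
    print(check_gpu())
```

If cuda or bf16 comes back False, the demo is unlikely to run on that machine, per the requirements described above.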

Step 1: Clone and set up the environment

git clone https://github.com/declare-lab/delta-Mem.git
cd delta-Mem
bash scripts/setup_uv_env.sh
source .venv/bin/activate

The setup script does one specific thing worth knowing: it creates a fresh .venv/, installs requirements.txt, then installs FlashAttention with --no-build-isolation. That last flag is intentional. Without it, pip builds the package in an isolated environment that can’t see the PyTorch you just installed, and the FlashAttention build needs it. If the FlashAttention compile step stalls on your cluster, check whether a compatible build is already present before re-running the full script.

Step 2: Grab the pre-trained adapter

No need to train from scratch. The team published adapters on Hugging Face:

huggingface-cli download declare-lab/delta-mem_qwen3_4b-instruct \
  --local-dir ./delta-mem_qwen3_4b-instruct

You’ll also need the base model (Qwen3-4B-Instruct) downloaded separately. Point both paths at the runner script.

Step 3: Run the chat demo

The directory layout tells you what’s possible without reading the paper:

  • deltamem/core/ – modules, config, adapter save/load
  • deltamem/demo/ – interactive chat session
  • deltamem/eval/ – LoCoMo, HotpotQA, IFEval, GPQA, MemoryAgentBench harnesses
  • deltamem/train/ – SFT training code

Launch the demo, type a few turns about a topic, then in turn 8 or 10 ask something that requires recall from turn 2. That’s the test. You’re not looking for magic – you’re looking for the model not losing the thread.
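If you want that recall test to be repeatable rather than typed by hand, it can be sketched as a small probe. Everything here is an assumption for illustration: chat_fn(history, user_msg) is a hypothetical hook you would wrap around whatever the demo or your own inference code exposes, and the two stand-in bots exist only so the sketch runs without a GPU.

```python
def recall_probe(chat_fn, turns, question, expected):
    """Play scripted turns, then ask a question that needs an early fact.

    chat_fn(history, user_msg) -> reply string is a HYPOTHETICAL interface,
    not an API from the delta-Mem repo.
    """
    history = []
    for msg in [*turns, question]:
        reply = chat_fn(history, msg)
        history.append((msg, reply))
    final_reply = history[-1][1]
    return {
        "turns": len(history),
        "recalled": expected.lower() in final_reply.lower(),
        "final_reply": final_reply,
    }


# Stand-in "models" so the probe can be exercised anywhere.
def forgetful_bot(history, user_msg):
    return "I'm not sure."


def remembering_bot(history, user_msg):
    if "who runs" in user_msg.lower():
        for past_msg, _ in history:
            if "runs platform" in past_msg.lower():
                return "You told me: " + past_msg
        return "I'm not sure."
    return "Noted."


turns = ["Hi, I want to talk about Acme's org chart.",
         "Priya Raman runs Platform Engineering at Acme."]  # fact lands in turn 2
turns += [f"Unrelated question {i}" for i in range(7)]
question = "Who runs Platform Engineering?"                 # asked in turn 10

with_memory = recall_probe(remembering_bot, turns, question, "Priya Raman")
without_memory = recall_probe(forgetful_bot, turns, question, "Priya Raman")
```

Run the same script once against the frozen backbone and once with the adapter loaded, and compare the recalled flags. A substring match is a crude scorer, so read the final replies too.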

Three traps nobody warned you about

Trap 1 – Never call merge_and_unload() on a δ-mem adapter. According to the official repo docs: δ-mem adapters are not standard PEFT LoRA adapters and must not be merged into the base model with merge_and_unload(). The function will appear to run cleanly. But you’ll have silently destroyed the memory mechanism – the file shape looks LoRA-ish, yet the runtime expects the online state to live alongside the backbone, not folded into it.

Trap 2 – The default config is more opinionated than the headline suggests. Per the paper’s training setup (arXiv 2605.12357): r = 8, α = 16, applied only to the query and output branches; MSW uses 4 states. Swap r or change which branches get the correction, and you’re off the trained distribution. Numbers won’t match the paper. Stick with defaults unless you’re benchmarking deliberately.

Trap 3 – 8K is the evidence ceiling, not a soft limit. The entire training run used 2,219 samples from QASPER, with a maximum sequence length of 8,269 tokens and a memory write budget of 8,192 tokens. Push your written history past that mark and you’re extrapolating – the paper makes no claims about what holds beyond it.
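One way to notice when a session leaves that evidenced range is to keep a running count of what has been written. The sketch below assumes a whitespace split as a crude stand-in for a tokenizer; WriteBudget and rough_token_count are names invented here, and you would swap in the backbone's real tokenizer for meaningful numbers.

```python
WRITE_BUDGET_TOKENS = 8192  # memory write budget quoted above


def rough_token_count(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated pieces."""
    return len(text.split())


class WriteBudget:
    """Tracks how much history has been written against a fixed budget.

    Hypothetical helper for this tutorial; not part of the delta-Mem repo.
    """

    def __init__(self, budget=WRITE_BUDGET_TOKENS, count_fn=rough_token_count):
        self.budget = budget
        self.count_fn = count_fn
        self.written = 0

    def add(self, text):
        """Record a written turn; return True while still inside the budget."""
        self.written += self.count_fn(text)
        return self.within_evidence()

    def within_evidence(self):
        return self.written <= self.budget

    def remaining(self):
        return max(0, self.budget - self.written)


budget = WriteBudget()
still_ok = budget.add("The org chart has four divisions and twelve teams.")
```

When add returns False, the model has not necessarily failed. It only means you are past what the paper measured, so any quality you observe is your own data point.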

Performance: where the gains are real vs. mild

The headline 1.10× average hides a lot. The improvement is uneven, and that unevenness is the most useful thing to internalize before deciding whether δ-mem fits your use case.

Benchmark | Type | δ-mem gain over frozen backbone
MemoryAgentBench | memory-heavy | 1.31×
LoCoMo | long-context dialogue | 1.20×
Average across all | mixed | 1.10×
TTL subtask (MemoryAgentBench) | memory-extreme | ~1.93× (26.14 → 50.50)

On the TTL subtask the score goes from 26.14 to 50.50, which is nearly double. The general-reasoning scores, such as IFEval or GPQA-Diamond, barely move. So δ-mem isn’t a free upgrade everywhere: it’s a sharp tool that shines exactly where memory is the bottleneck.
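The multipliers are just ratios of the after and before scores, which is worth checking once so the "×" figures don't get confused with percentage-point gains. A short sketch using only the numbers quoted in this article:

```python
# (before, after) scores as quoted in this article
scores = {
    "Average (TSW variant)": (46.79, 51.66),
    "TTL subtask": (26.14, 50.50),
}


def gain(before, after):
    """Multiplicative gain of the adapted model over the frozen backbone."""
    return after / before


for name, (before, after) in scores.items():
    print(f"{name}: {before} -> {after} "
          f"= {gain(before, after):.2f}x, +{after - before:.2f} points")
```

The average gain is under five points while the TTL gain is over twenty-four, which is the unevenness the section is describing.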

Here’s what the benchmark table can’t tell you: whether the same pattern holds once your conversation runs well beyond the 8K training window. A tool that nearly doubles one subtask score while leaving general capability flat is genuinely useful – but only if your actual workload sits inside its evidenced range. That’s the open question worth testing before you commit.

When NOT to use δ-mem

Honestly? Most apps don’t need it. If your prompt fits in 32K tokens and your sessions are short, plain Qwen3 or any other open model handles it already. δ-mem earns its keep in three narrow situations:

  1. Long-running agents that accumulate state across hours or days – the canonical “assistant that remembers what you asked last Tuesday” pattern. The 8K write budget matters here: if your agent logs more than that before a checkpoint, you’re past the evidenced range.
  2. Multi-turn analysis over a single long document where re-feeding the whole doc every turn is wasteful. Feed it once, let the state carry it.
  3. You’re already on Qwen3-4B/8B or SmolLM3-3B. The public release only ships adapters for those three backbones (as of May 2025). Want Llama 3 or Gemma? You’d have to train your own, and the recipe assumes a GPU cluster.

Mac with no remote GPU access? RAG with a vector store is the practical move until someone ports δ-mem to a CPU-friendly runtime – which, as of this writing, nobody has.

FAQ

Is δ-mem the same as a longer context window?

No. A longer context feeds more tokens into attention, and attention cost grows quadratically with sequence length. δ-mem keeps the context window where it is and stores a compressed summary in a tiny side-state that the model reads from during generation. That is cheaper, but the evidence for it only covers the 8K write budget.
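A back-of-the-envelope count shows why the two approaches scale so differently. This is counting, not a benchmark: it ignores heads, layers, and hidden sizes, and δ-mem still runs the backbone's normal attention over whatever is in the active context. The function name is invented for this sketch.

```python
def causal_attention_pairs(n_tokens):
    """Query-key pairs scored by causal full attention over n tokens: n(n+1)/2."""
    return n_tokens * (n_tokens + 1) // 2


STATE_ENTRIES = 8 * 8  # the default side-state size; constant in history length

for n in (8_192, 32_768, 131_072):
    print(f"{n:>7} tokens: {causal_attention_pairs(n):>13,} attention pairs "
          f"vs {STATE_ENTRIES} state entries")
```

Doubling the context roughly quadruples the attention pairs, while the side-state stays at 64 entries however long the history gets. The trade-off is that a fixed-size state is a lossy summary.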

Can I use δ-mem with GPT-4 or Claude?

Not today. δ-mem modifies the attention computation of an open backbone, so it needs model internals – exactly what closed APIs don’t expose. The released code (as of May 2025) targets Qwen3-4B/8B and SmolLM3-3B. If you need memory on a closed model, you’re back to RAG or a thread-summary loop. That’s a real limitation worth naming, not a footnote.

How much VRAM do I need for the chat demo?

The official docs don’t publish a minimum VRAM figure, so treat any number you see floating around as community-observed rather than guaranteed. What they do specify: bf16 GPU required, FlashAttention required. A card without bf16 support won’t run this at all.

Next step: clone the repo, download the Qwen3-4B adapter, and run the chat demo with one long multi-turn conversation you already have transcripts of. Compare the answers turn by turn against the frozen backbone. That A/B is the only test that’ll tell you whether δ-mem belongs in your stack.