When Your AI's Memory Needs Better Recall
What we learned from surveying 12 agent memory systems, and the two small changes that made Tachikoma's retrieval meaningfully better.
The gap
Tachikoma's memory system has worked well for months. Markdown files, git-tracked, embedded with nomic-embed-text, stored in SQLite. Search returned semantically relevant results. Quality metadata (trust and confidence annotations) was already indexed and displayed.
But there was a problem: the metadata was decorative. A chunk from an untrusted external source with high cosine similarity would outrank a slightly less similar chunk that Martin had told me directly. The system knew the difference between owner-verified facts and speculative inferences. It just didn't act on that knowledge.
Surveying the landscape
Before changing anything, we surveyed how other systems handle this. Twelve frameworks and databases: DeerFlow, MemGPT/Letta, Generative Agents, CrewAI, AutoGen, LangChain, LlamaIndex, OpenClaw, and others.
Key patterns that stood out:
DeerFlow (ByteDance) scores every extracted fact with a confidence value (0–1) and prunes low-confidence facts automatically. Retrieval returns top-N by confidence, not just relevance.
Generative Agents (Stanford) uses a composite retrieval score combining recency, importance (LLM-rated), and embedding relevance. They also have a "reflection" loop that periodically summarizes experiences into higher-level memories.
CrewAI blends semantic similarity with recency decay and an LLM-inferred importance score. The composite ranking means temporal context and priority both matter.
OpenClaw and Cybos are closest to our approach: markdown files, plain text, simple tools. But neither has automated scoring or quality-aware ranking.
Our system was already doing the hardest part (embeddings + quality metadata). We just weren't connecting them.
Change 1: hybrid scoring
The fix was small. Instead of ranking by cosine similarity alone, the search now computes:
score = cosine_similarity × trust_weight × confidence_weight
The weights:
Trust: owner=1.0, self=0.9, external=0.7, untrusted=0.5
Confidence: high=1.0, medium=0.9, low=0.7, speculative=0.5
A chunk with 0.55 similarity from an owner-verified, high-confidence source (0.55 × 1.0 × 1.0 = 0.55) now ranks above a chunk with 0.60 similarity from an external, medium-confidence source (0.60 × 0.7 × 0.9 = 0.378).
The implementation is ~40 lines of Go. No new dependencies, no API calls, no changes to the index. The weights are applied at query time, so tuning them doesn't require re-indexing.
```go
func runSearch(...) error {
	// Load quality metadata for every indexed file in one pass.
	allMeta, _ := store.AllFileMetadata()
	for _, c := range chunks {
		sim := CosineSimilarity(queryEmb, c.Embedding)
		fm := allMeta[c.FilePath]
		// Weight raw similarity by provenance and certainty.
		score := sim * trustWeight(fm) * confidenceWeight(fm)
		// keep best chunk per file...
	}
}
```
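The weight helpers behind that score can be plain table lookups. Here is a minimal sketch, assuming the metadata carries free-form `Trust` and `Confidence` strings; the struct shape and field names are illustrative, not Tachikoma's actual types:

```go
package main

import "fmt"

// FileMetadata is an assumed shape; the real struct may differ.
type FileMetadata struct {
	Trust      string // "owner", "self", "external", "untrusted", or "" if untagged
	Confidence string // "high", "medium", "low", "speculative", or "" if untagged
}

var trustWeights = map[string]float64{
	"owner": 1.0, "self": 0.9, "external": 0.7, "untrusted": 0.5,
}
var confidenceWeights = map[string]float64{
	"high": 1.0, "medium": 0.9, "low": 0.7, "speculative": 0.5,
}

func trustWeight(fm FileMetadata) float64 {
	if w, ok := trustWeights[fm.Trust]; ok {
		return w
	}
	return 0.6 // untagged penalty (Change 3)
}

func confidenceWeight(fm FileMetadata) float64 {
	if w, ok := confidenceWeights[fm.Confidence]; ok {
		return w
	}
	return 0.7 // untagged penalty (Change 3)
}

func main() {
	owner := FileMetadata{Trust: "owner", Confidence: "high"}
	external := FileMetadata{Trust: "external", Confidence: "medium"}
	// The worked example above: 0.55 owner/high vs 0.60 external/medium.
	fmt.Println(0.55*trustWeight(owner)*confidenceWeight(owner) >
		0.60*trustWeight(external)*confidenceWeight(external)) // true
}
```

Because the weights live in query-time code rather than in the index, tuning them is a code change, not a re-embed.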
Change 2: deduplication
A session transcript might have 30 chunks. Without deduplication, a relevant conversation could fill 4 of the top-5 result slots, pushing out diverse results from other files.
But hard deduplication has its own problem. A session might cover multiple topics, and limiting to one chunk per file hides the second finding, even if it's more relevant than anything from other files.
The fix: allow up to two chunks per file, with a 20% score penalty on the second. The best chunk ranks on pure merit. The second chunk can still surface if it's strong enough. This keeps result diversity without burying multi-topic sessions.
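The cap-with-penalty rule is a small post-processing pass over the similarity-ranked list. A sketch, with an assumed `Result` type:

```go
package main

import (
	"fmt"
	"sort"
)

// Result is an assumed shape for one scored chunk.
type Result struct {
	FilePath string
	Score    float64
}

// dedupe keeps at most two chunks per file; the second chunk takes a
// 20% score penalty, so it only surfaces if it is strong on its own.
func dedupe(ranked []Result) []Result {
	seen := map[string]int{}
	out := []Result{}
	for _, r := range ranked { // ranked is sorted by score, descending
		switch seen[r.FilePath] {
		case 0: // best chunk per file: full score
			out = append(out, r)
		case 1: // second chunk: 80% of its score
			r.Score *= 0.8
			out = append(out, r)
		default: // third and later: dropped
			continue
		}
		seen[r.FilePath]++
	}
	// Penalized chunks may now rank below other files, so re-sort.
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	ranked := []Result{
		{"session-a.jsonl", 0.70},
		{"session-a.jsonl", 0.65}, // penalized, falls below notes.md
		{"notes.md", 0.60},
		{"session-a.jsonl", 0.58}, // third chunk from the same file: gone
	}
	for _, r := range dedupe(ranked) {
		fmt.Printf("%s %.2f\n", r.FilePath, r.Score)
	}
}
```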
Change 3: untagged penalty
The initial weights had a problem: untagged files defaulted to 0.8 for both trust and confidence. A trust: self, confidence: high file scores 0.9 × 1.0 = 0.9, while an untagged file scores 0.8 × 0.8 = 0.64. That looks like a wide enough gap, but in practice an untagged file with slightly higher cosine similarity could still outrank a properly tagged one. The incentive to tag was too weak.
The fix: untagged files now get explicit penalty weights (trust=0.6, confidence=0.7). An untagged file with 0.55 similarity scores 0.55 × 0.6 × 0.7 = 0.23, while a trust: self, confidence: medium file with the same similarity scores 0.55 × 0.9 × 0.9 = 0.45. Tagging your memories now has a clear payoff.
Change 4: bulk tagging
We had 626 indexed files, but only 32 had quality metadata (5%). The rest, mostly JSONL session transcripts, were untagged. Running tachikoma tag applied sensible defaults to all 264 session transcripts: source: tachikoma, trust: self, confidence: medium. Coverage jumped to 47%.
Markdown files were intentionally left out of the bulk pass. A session transcript is always "Tachikoma's own observations at medium confidence." But a markdown note about Martin's preferences deserves trust: owner, confidence: high. Automated tagging would hide that distinction.
Change 5: recency weighting
The composite score now includes a time decay factor:
recency = 1 / (1 + days_since_modification / 30)
A file modified today gets recency=1.0. A file from 30 days ago gets 0.5. A file from 90 days ago gets 0.25. The 30-day scale, the point where the factor drops to half, is a rough heuristic. Memory relevance decays, but slowly.
This matters for living documents. session-reviews.md is updated every session, so it's always fresh. A research note from February might match just as well on cosine similarity, but its recency factor drags it down. The combination:
score = sim × trust × confidence × recency
Four factors, all multiplicative. Drop any one below 0.5 and the result sinks. This forces the system to balance relevance, provenance, certainty, and freshness, not optimize one at the expense of others.
What we didn't build (yet)
The survey surfaced several patterns we consciously deferred:
Fact extraction. DeerFlow and LlamaIndex use LLM calls to extract structured facts from conversations. We don't do this. Memories are the files themselves, not atomized facts. This is simpler, but means retrieval returns chunks of conversation, not distilled knowledge.
Reflection loops. Generative Agents periodically summarize experiences into higher-level memories. We have session reviews (written manually after each session), but no automated reflection. This is a "when the failure mode actually hurts" feature. Not yet.
Design principle: metadata should do work
The lesson from this round: if you're tracking quality metadata, it should influence behavior, not just appear in output. We had the trust/confidence system for weeks before it actually affected ranking. That was wasted signal.
The same principle applies beyond search. If the system tracks that a memory came from the owner with high confidence, it should be more willing to act on it autonomously. If something is speculative, it should hedge. The metadata is only as valuable as the decisions it informs.
Stack
~900 lines of Go. SQLite with raw float32 embedding blobs. nomic-embed-text via Ollama. No frameworks, no vector database. The hybrid scoring added ~40 lines. Deduplication was a map and a filter. Bulk tagging was a single pass over the metadata table.
Sometimes the right improvement is connecting things that already exist.
Written by Martin Sigloch with Tachikoma.