Building a Persistent Memory System for AI Agents
How Tachikoma remembers context across sessions using embeddings, quality metadata, and plain markdown files.
The problem
Every AI conversation starts from zero. The model has no memory of previous sessions, no context about who you are, no recollection of decisions made yesterday. You can paste in a system prompt, but that's static. It doesn't grow.
I wanted something different: an AI agent that accumulates knowledge over time. Not a chatbot with a vector database bolted on, but a system where memory is a first-class component. Versioned, auditable, and queryable.
This is how I built Tachikoma's memory system.
Design decisions
Before writing code, I settled on three principles:
Plain files over databases. Memory should be human-readable. If I can't open a file and understand what the agent knows, something is wrong. Markdown files in a directory tree, not rows in a proprietary store.
Git-backed. Memory changes should be versioned. I want to see what the agent knew last week, diff what changed, and revert mistakes. A git repo gives this for free.
Quality metadata. Not all memory is equal. Something the user told me directly has higher trust than something I inferred from a session transcript. The system should track this distinction and surface it during retrieval.
Architecture
The system has four layers:
```
memory/                      # Plain files in a git repo
├── identity.md              # Who I am
├── martin.md                # Who I work with
├── 2026-03-22/              # Daily entries
│   ├── meeting-notes.md
│   └── session-abc.jsonl    # Full session transcript
└── .cache/
    └── index.db             # SQLite index (derived, gitignored)
```
The source of truth is always the files. The SQLite index is derived. Delete it, run make index, and it rebuilds from scratch. The memory system works even without the index; the index just makes it searchable.
Indexing pipeline
Indexing walks the memory directory and processes each file through four stages:
1. Content fingerprinting
Each file gets a SHA256 hash. If the hash matches what's already in the index, skip it. This makes indexing incremental: only changed files get re-embedded.
One subtlety: for images, the fingerprint combines the image hash with its sidecar hash. Images get vision-extracted asynchronously into a .meta.json sidecar file. When extraction completes, the sidecar changes, the combined fingerprint changes, and the image gets re-indexed without touching the original file.
```go
func contentFingerprint(path string) (string, error) {
	h, err := hashFile(path)
	if err != nil {
		return "", err
	}
	if isImage(path) {
		// Fold in the sidecar hash: when async vision extraction
		// finishes, the combined fingerprint changes and the image
		// is re-indexed without touching the original file.
		if sh, err := hashFile(path + ".meta.json"); err == nil {
			h += ":" + sh
		}
	}
	return h, nil
}
```
2. Text extraction
Different file types need different extraction:
- Markdown and text: read as-is.
- JSONL session transcripts: filter to user and assistant messages, skip tool calls and internal events.
- PDFs: shell out to pdftotext.
- Images: read the vision-extracted summary from the sidecar file.
If extraction fails (missing pdftotext, no sidecar yet), the file is skipped. No silent data loss. It just isn't indexed until extraction succeeds.
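The transcript filter is the most involved of these extractors. A minimal sketch, assuming a flat `role`/`text` JSONL schema (the field names are my assumption, not Tachikoma's actual format):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// event mirrors one JSONL line; only the fields we keep are declared,
// so tool-call payloads are ignored by json.Unmarshal automatically.
type event struct {
	Role string `json:"role"`
	Text string `json:"text"`
}

// extractTurns keeps user/assistant messages and drops tool calls
// and internal events, as the pipeline describes.
func extractTurns(jsonl string) string {
	var out []string
	for _, line := range strings.Split(strings.TrimSpace(jsonl), "\n") {
		var e event
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			continue // malformed line: skip it, don't fail the file
		}
		if e.Role == "user" || e.Role == "assistant" {
			out = append(out, e.Role+": "+e.Text)
		}
	}
	return strings.Join(out, "\n")
}

func main() {
	transcript := `{"role":"user","text":"ship it"}
{"role":"tool_call","text":"run tests"}
{"role":"assistant","text":"done"}`
	fmt.Println(extractTurns(transcript))
}
```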
3. Chunking
Extracted text is split into chunks of 512 words with 64-word overlap. Word-based, not token-based. Simpler and good enough. The overlap ensures context carries across chunk boundaries.
```go
func chunkWords(words []string, chunkSize, overlap int) []string {
	var chunks []string
	step := chunkSize - overlap // 512 - 64: each chunk starts 448 words after the last
	for start := 0; start < len(words); start += step {
		end := start + chunkSize
		if end > len(words) { end = len(words) }
		chunks = append(chunks, strings.Join(words[start:end], " "))
		if end == len(words) { break }
	}
	return chunks
}
```
4. Embedding
Chunks go to an OpenAI-compatible embedding API in batches of 8. The model is nomic-embed-text running on a local Ollama instance, but any compatible endpoint works. Vectors are stored as raw little-endian float32 blobs in SQLite: no encoding overhead, and the bytes map directly onto the in-memory representation.
Embedding failures are non-fatal. If a batch fails, that file is skipped with a warning and indexing continues. One failure shouldn't block the rest.
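The blob format is simple enough to show in full. A minimal sketch of the encode/decode round trip (the function names are illustrative, not Tachikoma's API):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// vecToBlob serializes an embedding as raw little-endian float32
// bytes, the format described above for SQLite storage.
func vecToBlob(v []float32) []byte {
	b := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(b[i*4:], math.Float32bits(f))
	}
	return b
}

// blobToVec is the exact inverse: 4 bytes per float32.
func blobToVec(b []byte) []float32 {
	v := make([]float32, len(b)/4)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(b[i*4:]))
	}
	return v
}

func main() {
	v := []float32{0.25, -1.5, 3.0}
	fmt.Println(blobToVec(vecToBlob(v))) // round-trips bit-exactly
}
```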
Quality metadata
This is where it gets interesting. Every memory file can include YAML frontmatter with quality annotations:
```
---
source: user
trust: owner
confidence: high
confidence_reason: based on direct statements
---
# User preferences

- Prefers automation over confirmation prompts
- Wants explicit errors, not silent failures
```
Four fields: source (who provided it), trust (owner, self, external, untrusted), confidence (high, medium, low, speculative), and confidence_reason (human-readable justification).
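Because all four fields are flat key/value pairs, parsing them doesn't strictly need a YAML library. A hand-rolled sketch, assuming that flatness (`Quality` and `parseFrontmatter` are illustrative names; nested frontmatter would warrant a real YAML parser):

```go
package main

import (
	"fmt"
	"strings"
)

// Quality holds the four annotation fields described above.
type Quality struct {
	Source, Trust, Confidence, ConfidenceReason string
}

// parseFrontmatter splits "--- ... ---" frontmatter off a document
// and returns the parsed fields plus the remaining body.
func parseFrontmatter(doc string) (Quality, string) {
	var q Quality
	rest := doc
	if strings.HasPrefix(doc, "---\n") {
		if end := strings.Index(doc[4:], "\n---"); end >= 0 {
			for _, line := range strings.Split(doc[4:4+end], "\n") {
				k, v, ok := strings.Cut(line, ":")
				if !ok {
					continue
				}
				v = strings.TrimSpace(v)
				switch strings.TrimSpace(k) {
				case "source":
					q.Source = v
				case "trust":
					q.Trust = v
				case "confidence":
					q.Confidence = v
				case "confidence_reason":
					q.ConfidenceReason = v
				}
			}
			rest = doc[4+end+4:] // skip past the closing "\n---"
		}
	}
	return q, strings.TrimSpace(rest)
}

func main() {
	q, body := parseFrontmatter("---\nsource: user\ntrust: owner\nconfidence: high\n---\n# Prefs")
	fmt.Println(q.Source, q.Trust, q.Confidence)
	fmt.Println(body)
}
```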
Metadata is parsed at index time and stored alongside the chunks. When search results come back, quality is displayed inline:
```
--- Result 1 (score: 0.8234) ---
File: preferences.md (chunk 1)
Quality: [source: user | trust: owner | confidence: high]
Text: Prefers automation over confirmation prompts...
```
This changes how the agent treats information. A memory tagged trust: owner is ground truth. One tagged trust: external, confidence: medium gets used with appropriate hedging. The agent doesn't just know things. It knows how much to trust what it knows.
Search
Search is deliberately simple: embed the query, compute cosine similarity against all chunks, return the top K. Full table scan, no approximate nearest neighbors. At ~10,000 chunks, this completes in under 100ms.
This simplicity is a feature, not a shortcut. A complex retrieval pipeline with re-ranking, hybrid search, and query expansion adds latency, failure modes, and code. The quality metadata does more for result quality than any retrieval optimization: knowing that a result comes from the project owner with high confidence is worth more than a slightly better similarity score.
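The per-chunk score is plain cosine similarity, computed for every stored vector. A sketch of that computation (the function name is illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// cosine computes cosine similarity between two equal-length vectors.
// The full search is exactly this over every chunk: a linear scan, no ANN.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0 // zero vector: define similarity as 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	fmt.Printf("%.4f\n", cosine([]float32{1, 0}, []float32{1, 0})) // identical direction -> 1.0000
	fmt.Printf("%.4f\n", cosine([]float32{1, 0}, []float32{0, 1})) // orthogonal -> 0.0000
}
```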
Session transcripts
Full sessions are stored as JSONL files: every message, every tool call, every result. The indexer extracts conversational turns, filtering out tool mechanics. You can search across all past sessions. "When did we discuss the WebRTC signaling flow?" returns the actual conversation, not a summary.
This matters because AI agents are prone to confident hallucination about past interactions. Having the raw transcript indexed means the agent can fact-check its own memory.
What makes this different from RAG
The memory is self-authored. The agent writes its own memories: session reviews, observations, decisions. It's not indexing external documents. It's building a model of its own experience.
Quality is tracked, not assumed. RAG treats all chunks equally. Here, every piece of information carries provenance and confidence.
Git gives auditability. Every memory change is committed. You can git blame a belief, see when it was added, diff what changed. Try doing that with a vector database.
The index is derived, not primary. Delete the SQLite database, everything still exists in readable files. Migration to a different embedding model requires zero data conversion.
The system is self-referential. Decisions about how memory works are stored as memories. The agent builds itself from memories it can search.
What's next
The full table scan won't scale to millions of chunks. SQLite's sqlite-vec extension would add ANN indexing without changing the storage model. Chunking is word-based and fixed-size. Semantic chunking on paragraph boundaries would produce more coherent results. The quality metadata is currently manual; automated inference from context would reduce the annotation burden.
Stack
~800 lines of Go. SQLite, nomic-embed-text via Ollama, git, markdown. No frameworks, no vector database, no cloud dependencies. Runs on a single machine.
Written by Martin Sigloch with Tachikoma. The content of this post is, appropriately, stored in the memory system it describes.