🧠 Persistent Memory — Long-Term Memory for AI Agents via MCP

Pranesh Nikhar Apr 4, 2026 04/04/26 3 min read

 󰈤

MCP server that gives AI agents long-term memory with semantic search (vector embeddings) and keyword search (BM25). Zero infrastructure — just Python and SQLite.

🎯 What It Does

Persistent Memory is an MCP (Model Context Protocol) server that gives AI agents long-term memory. Agents can store facts, recall them by meaning (semantic search via vector embeddings), search by keyword (BM25 via SQLite FTS5), and filter by metadata tags — all with zero infrastructure requirements.

Agent: "Remember that the user prefers concise responses"
Server: ✓ Stored with 3 tags

Agent: "What do you know about the user's preferences?"
Server: [semantic match] "User prefers concise responses"
         [keyword match] "User likes Python over JavaScript"

🧱 Tech Stack

Component	Technology
Framework	fastmcp (MCP server framework)
Embeddings	sentence-transformers (`all-MiniLM-L6-v2`, 384-dim)
Storage	SQLite with FTS5 (full-text search)
Similarity	NumPy cosine similarity

Zero external infrastructure — no PostgreSQL, no Pinecone, no Redis, no GPU needed. The entire system runs in-process with a single SQLite database file.

🏗️ Architecture

server.py          → FastMCP entry, registers 7 @mcp.tool() functions
memory_store.py    → Business logic: orchestrates store/recall/search/forget
embeddings.py      → sentence-transformers singleton, cosine_similarity()
storage.py         → SQLite CRUD with FTS5 auto-sync via triggers

🛠️ MCP Tools

Tool	Description
`store`	Save a fact with optional tags and source
`recall`	Semantic search: query → embed → cosine sim → top-K
`search`	Keyword search: BM25 via SQLite FTS5
`forget`	Delete a specific memory by ID
`list`	List all memories, optionally filtered by tags
`stats`	Return count of memories
`clear`	Delete all memories

💾 Storage Design

The storage.py module uses a clever SQLite schema:

CREATE TABLE memories (
    id TEXT PRIMARY KEY,
    content TEXT NOT NULL,
    embedding BLOB,           -- pickle of numpy array
    tags TEXT,                -- JSON array
    source TEXT,
    created_at TEXT,
    updated_at TEXT
);

CREATE VIRTUAL TABLE memories_fts USING fts5(
    content, tags,
    content=memories, content_rowid=rowid,
    tokenize='porter unicode61'
);

-- Triggers auto-sync the FTS index on INSERT/UPDATE/DELETE

Key design choices:

Embeddings stored as pickled numpy arrays in a BLOB column (no vector extension needed)
FTS5 with porter unicode61 tokenizer for English stemming + Unicode support
Triggers keep the full-text index in sync automatically
WAL mode for concurrent reads
Cosine similarity computed in Python — loads all vectors, compares, returns top-K

🔍 Semantic vs Keyword Search

Feature	Semantic (`recall`)	Keyword (`search`)
How it works	Embed query → cosine similarity with all stored embeddings	BM25 via SQLite FTS5
Good for	Finding by meaning/context	Finding by exact terms
Trade-off	O(n) scan over all embeddings	O(log n) via FTS index
Model	`all-MiniLM-L6-v2` (384-dim)	Porter stemmer (English)

Both search methods return results ranked by relevance, with the top-K results returned to the agent.

🤖 Integration with Any MCP Client

Any MCP-compatible client can connect:

Claude Desktop: Add to mcpServers in config
Cursor: Register as an MCP server in settings
Any MCP host: Connect via stdio transport

🚀 Quick Start

pip install persistent-memory

# Start the MCP server (stdio transport)
python -m persistent_memory server

# Or configure in your MCP client:

{
  "mcpServers": {
    "persistent-memory": {
      "command": "python",
      "args": ["-m", "persistent_memory", "server"]
    }
  }
}

📊 Resource Usage

Resource	Estimate
RAM	~500MB (sentence-transformers model loaded once)
Disk	~80MB (model download on first use)
CPU	Minimal (inference is ~10ms on modern CPUs)
Model	`all-MiniLM-L6-v2` (80MB, 384-dim embeddings)

💡 Why It’s Interesting

The biggest limitation of current AI agents is statelessness — every conversation starts from scratch. Persistent Memory solves this with an elegant zero-infrastructure design. It uses SQLite FTS5 for keyword search (not a separate search engine), in-process sentence embeddings for semantic search (no external API), and the MCP protocol for universal compatibility. The entire system fits in a single Python package with no external services, no GPU, and no API keys. Drop it into any MCP client and your agent suddenly has a persistent brain.

← [b]ack

posts/ 🔐 Zero-Knowledge Proofs Explained Simply [n]ext → posts/ ⚡ HTTP/3 and QUIC — Why They Matter