# Clawdbot Memory System Analysis
This document details the architecture and implementation of the Clawdbot memory system.
## Overview
Clawdbot implements a Hybrid Retrieval Augmented Generation (RAG) system. It combines Vector Search (semantic understanding) with Full-Text Search (keyword precision) to retrieve relevant context from a local knowledge base.
Key Technologies:

- Storage: `node:sqlite` (local SQLite database)
- Vector Search: `sqlite-vec` extension (`vec0` virtual tables)
- Keyword Search: SQLite FTS5 extension (`fts5` virtual tables)
- File Watcher: `chokidar` (real-time updates)
## Core Architecture
### 1. Data Sources
The memory system indexes content from two primary sources:
- Memory Files: Markdown files located in `<workspace>/memory/` and `<workspace>/MEMORY.md`. These are watched for real-time updates.
- Session Transcripts: Active or archived conversation logs (if configured).
### 2. Storage Schema
The system uses a unified SQLite database (`memory.sqlite`) with the following key tables:
| Table | Purpose |
|---|---|
| `files` | Tracks indexed files, modification times, and hashes to support incremental updates. |
| `chunks` | Stores the actual text chunks derived from files. Maps 1:N from `files`. |
| `chunks_vec` | Virtual table (`sqlite-vec`). Stores the vector embedding for each chunk. Used for semantic search. |
| `chunks_fts` | Virtual table (FTS5). Stores the full text of chunks. Used for BM25/keyword search. |
| `embedding_cache` | Caches embeddings by (provider, model, text hash) to prevent expensive re-computation. |
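To make the table roles concrete, the schema above could be expressed as DDL roughly like the following. This is an illustrative reconstruction, not the actual contents of `memory-schema.ts`; column names, the embedding dimension, and other details beyond the table list above are assumptions.

```typescript
// Illustrative DDL for the tables described above (column details are assumptions).
const schema: Record<string, string> = {
  files: `CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE,
    mtime INTEGER NOT NULL,
    hash TEXT NOT NULL
  )`,
  chunks: `CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    file_id INTEGER NOT NULL REFERENCES files(id),
    start_line INTEGER NOT NULL,
    text TEXT NOT NULL
  )`,
  // sqlite-vec virtual table: one embedding row per chunk
  chunks_vec: `CREATE VIRTUAL TABLE IF NOT EXISTS chunks_vec
    USING vec0(embedding float[768])`,
  // FTS5 virtual table for BM25 keyword search
  chunks_fts: `CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts
    USING fts5(text)`,
  embedding_cache: `CREATE TABLE IF NOT EXISTS embedding_cache (
    provider TEXT NOT NULL,
    model TEXT NOT NULL,
    text_hash TEXT NOT NULL,
    embedding BLOB NOT NULL,
    PRIMARY KEY (provider, model, text_hash)
  )`,
};
```

The composite primary key on `embedding_cache` mirrors the (provider, model, text hash) cache key described above.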
### 3. Indexing Pipeline (Code Snapshot)
The system uses a token-aware chunking strategy optimized for Markdown. It handles overlaps to maintain context across chunk boundaries.
`src/memory/internal.ts`:

```typescript
export function chunkMarkdown(
  content: string,
  chunking: { tokens: number; overlap: number },
): MemoryChunk[] {
  // ... (setup code) ...
  const lines = content.split("\n");
  // Iterate line by line to respect markdown structure
  for (let i = 0; i < lines.length; i += 1) {
    const line = lines[i] ?? "";
    // ... (splitting long lines) ...
    for (const segment of segments) {
      const lineSize = segment.length + 1;
      // Flush chunk if size limit reached
      if (currentChars + lineSize > maxChars && current.length > 0) {
        flush();
        carryOverlap(); // Keep the last N chars for the next chunk
      }
      current.push({ line: segment, lineNo });
      currentChars += lineSize;
    }
  }
  flush(); // Final flush
  return chunks;
}
```
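Since the snippet above elides its setup, here is a minimal, self-contained version of the same size-budget-with-overlap idea. It is a sketch, not the actual implementation: character counts stand in for token counts, long-line splitting is omitted, and the helper names are invented.

```typescript
interface Chunk {
  text: string;
  startLine: number;
}

// Simplified sketch: accumulate lines until maxChars is exceeded, then
// flush a chunk and carry the last overlapChars characters into the next one.
function chunkText(content: string, maxChars: number, overlapChars: number): Chunk[] {
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let currentChars = 0;
  let startLine = 1;

  const flush = (nextLine: number) => {
    if (current.length === 0) return;
    const text = current.join("\n");
    chunks.push({ text, startLine });
    // Carry overlap: the tail of the flushed chunk seeds the next chunk,
    // preserving context across the boundary.
    const tail = text.slice(-overlapChars);
    current = tail.length > 0 ? [tail] : [];
    currentChars = tail.length;
    startLine = nextLine;
  };

  const lines = content.split("\n");
  for (let i = 0; i < lines.length; i += 1) {
    const line = lines[i] ?? "";
    const lineSize = line.length + 1; // +1 for the newline
    if (currentChars + lineSize > maxChars && current.length > 0) {
      flush(i + 1);
    }
    current.push(line);
    currentChars += lineSize;
  }
  if (current.length > 0) {
    chunks.push({ text: current.join("\n"), startLine });
  }
  return chunks;
}
```

The real implementation is token-aware (it budgets by `chunking.tokens` rather than characters), but the flush-and-carry structure is the same.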
### 4. Retrieval Pipeline (Hybrid Search Logic)
Clawdbot explicitly merges vector results (semantic) with keyword results (lexical).
`src/memory/manager.ts`:

```typescript
async search(query: string, opts?: { ... }): Promise<MemorySearchResult[]> {
  // ...
  // 1. Keyword search (FTS5)
  const keywordResults = hybrid.enabled
    ? await this.searchKeyword(cleaned, candidates).catch(() => [])
    : [];

  // 2. Vector search (sqlite-vec)
  const queryVec = await this.embedQueryWithTimeout(cleaned);
  const vectorResults = hasVector
    ? await this.searchVector(queryVec, candidates).catch(() => [])
    : [];

  if (!hybrid.enabled) {
    return vectorResults.filter((entry) => entry.score >= minScore);
  }

  // 3. Merge & rank
  const merged = this.mergeHybridResults({
    vector: vectorResults,
    keyword: keywordResults,
    vectorWeight: hybrid.vectorWeight,
    textWeight: hybrid.textWeight,
  });
  return merged.filter((entry) => entry.score >= minScore).slice(0, maxResults);
}
```
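The body of `mergeHybridResults` is not shown above. A plausible weighted-union implementation, sketched here under the assumption that results are keyed by chunk id and scores are already normalized to a comparable range, might look like:

```typescript
interface SearchResult {
  chunkId: string;
  score: number;
}

// Sketch of a hybrid merge: union the two result sets by chunk id and
// combine scores with configurable weights, then rank descending.
function mergeHybridResults(args: {
  vector: SearchResult[];
  keyword: SearchResult[];
  vectorWeight: number;
  textWeight: number;
}): SearchResult[] {
  const combined = new Map<string, number>();
  for (const r of args.vector) {
    combined.set(r.chunkId, (combined.get(r.chunkId) ?? 0) + r.score * args.vectorWeight);
  }
  for (const r of args.keyword) {
    combined.set(r.chunkId, (combined.get(r.chunkId) ?? 0) + r.score * args.textWeight);
  }
  return [...combined.entries()]
    .map(([chunkId, score]) => ({ chunkId, score }))
    .sort((a, b) => b.score - a.score);
}
```

Chunks that appear in both result sets accumulate both weighted scores, which is what lets a keyword hit boost a semantically similar chunk above a vector-only match.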
### 5. Session Memory Hook

The system includes a specialized hook (`session-memory`) that bridges the gap between ephemeral conversation and long-term memory.
- Trigger: `/new` command (start of a new session).
- Action:
  1. Extracts the summary/context of the ending session.
  2. Uses an LLM to generate a descriptive filename (slug).
  3. Writes the context to a new Markdown file in `memory/`.
- Result: Past conversations automatically become part of the searchable knowledge base for future sessions.
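The slug-and-write step can be sketched as follows. The LLM call is elided (it just produces a title string), and the exact slug format and `memory/` file layout are assumptions for illustration:

```typescript
// Turn an LLM-generated session title into a filesystem-safe slug.
function toSlug(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // runs of non-alphanumerics become one hyphen
    .replace(/^-+|-+$/g, "")     // trim leading/trailing hyphens
    .slice(0, 60);               // keep filenames short
}

// Build the destination path for the archived session context.
function memoryFilePath(title: string, date: Date): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `memory/${day}-${toSlug(title)}.md`;
}
```

Once written, the file watcher picks the new Markdown file up and it is indexed like any other memory file.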
## File Structure Reference
- `src/memory/manager.ts`: The "brain" of the system. Orchestrates indexing, searching, and provider management.
- `src/memory/internal.ts`: Handles file reading, hashing, and the critical chunking logic.
- `src/memory/memory-schema.ts`: Defines the SQLite tables and virtual tables.
- `src/memory/embeddings-*.ts`: Adapters for different AI providers (OpenAI, Gemini).
## Memory for Coding Agents: Analysis & Comparison
Clawdbot uses a General Text Memory approach. While effective for documentation and conversation history, it differs significantly from the specialized Code-Aware Memory systems (e.g., LSP-style symbol indexes) often used in advanced coding agents.
### 1. The Disconnect: Text vs. Structure
- Clawdbot (Text-Based): Treats code as text chunks.
  - Pros: Simple, universal, good for finding comments and high-level logic in documentation.
  - Cons: "Blind" to code structure. It might split a function in half if it crosses a chunk boundary (though overlap helps). It doesn't understand that `class Dog` is related to `dog.bark()` unless they are textually close.
- LSP / Code-Aware (Structure-Based): Treats code as a graph or tree (AST).
  - Pros: Precise navigation (Go to Definition, Find References). Understands "symbols" independent of file position.
  - Cons: Language-specific, complex to implement and maintain.
### 2. Is Clawdbot Good at Coding? (Analysis)
Despite lacking an LSP index, Clawdbot is a highly effective Coding Assistant (Pair Programmer), but with a different strength profile than an IDE-native agent.
#### Strengths
- Active Coding: It can execute shell commands, run tests, lint files, and read massive documentation files into its context.
- Correction Loop: It uses `bash` and `node` to verify its own work. If it generates broken code, it can run it, see the error, and fix it, mimicking a human developer's workflow.
- Explanation: Its generalist memory is excellent for explaining high-level architecture decisions found in design docs or READMEs.
#### Limitations (The "LSP Gap")
- Refactoring at Scale: Because it "sees" code as documents, it struggles with tasks like "rename this method and update all 50 files that use it". It might miss usages that don't match the text search exactly.
- Implicit Graphs: It doesn't inherently know that modifying `A.ts` breaks `Z.ts` unless they are explicitly linked in the documentation or textually close.
### 3. LSP-Like Memory Design
For a highly effective *Coding Agent*, memory should ideally be layered:
#### Layer 1: The Index (What Clawdbot Lacks)
An LSP-like system indexes Symbols rather than just text chunks.
- Query: "Where is `User` defined?"
- Text Search: Returns every file containing the word "User" (thousands of results).
- Symbol Search: Returns `src/models/user.ts: class User`.
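The contrast can be made concrete with a toy symbol table. All names and shapes here are invented for illustration; no such index exists in Clawdbot today:

```typescript
interface SymbolEntry {
  name: string;
  kind: string; // "class" | "function" | "interface" | ...
  file: string;
  line: number;
}

// Toy symbol index: exact lookup by symbol name instead of a full-text scan.
const symbolIndex: SymbolEntry[] = [
  { name: "User", kind: "class", file: "src/models/user.ts", line: 10 },
  { name: "login", kind: "function", file: "src/auth.ts", line: 42 },
];

// "Where is X defined?" becomes a single exact match, not thousands of hits.
function findDefinition(name: string): SymbolEntry | undefined {
  return symbolIndex.find((s) => s.name === name);
}
```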
#### Layer 2: The Graph (Dependency Awareness)
Code is a graph of dependencies. A strong memory system maps these relationships.
- Node: `auth.ts`
- Edge: `imports` -> `user.ts`
- Edge: `calls` -> `login()`
- Agent Utility: When modifying `user.ts`, the agent automatically knows to check `auth.ts` because of the graph connection.
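A minimal sketch of that graph lookup, assuming a forward map from file to its imports (the data structure and file names are illustrative):

```typescript
// Toy dependency graph: each file maps to the files it imports.
const imports = new Map<string, string[]>([
  ["auth.ts", ["user.ts"]],
  ["routes.ts", ["auth.ts", "user.ts"]],
]);

// Reverse lookup: which files might break if `file` changes?
function dependentsOf(file: string): string[] {
  return [...imports.entries()]
    .filter(([, deps]) => deps.includes(file))
    .map(([source]) => source);
}
```

With this in place, an agent editing `user.ts` can enumerate `dependentsOf("user.ts")` and re-check those files, rather than relying on textual proximity.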
### 4. Recommendation for Evolution
To evolve Clawdbot into a stronger coding assistant, the memory system could be augmented with a Symbol Index:
- Parser: Use `tree-sitter` to parse code files when indexing.
- Extraction: Extract top-level symbols (classes, functions, interfaces).
- Storage: Store these as metadata in the `chunks` table or a separate `symbols` table.
- Retrieval: Allow the agent to query "definition of X", which maps to the specific chunk containing that symbol.
Code snapshot concept for symbol indexing (a conceptual sketch: the grammar setup, traversal, and `db.storeSymbols` call are illustrative, not existing APIs):

```typescript
// Conceptual future implementation
import Parser from "tree-sitter";

const parser = new Parser();
// parser.setLanguage(...) would select a grammar, e.g. tree-sitter-typescript

interface CodeSymbol {
  name: string;
  type: string;
  lineNumber: number;
}

function extractSymbols(code: string): CodeSymbol[] {
  const tree = parser.parse(code);
  const symbols: CodeSymbol[] = [];
  // Traverse the tree to find function/class declarations and
  // collect { name, type, lineNumber } for each.
  return symbols;
}

// In the indexing pipeline:
const symbols = extractSymbols(fileContent);
await db.storeSymbols(fileId, symbols);
```