This document details the architecture and implementation of the Clawdbot memory system.

Overview

Clawdbot implements a hybrid Retrieval-Augmented Generation (RAG) system. It combines Vector Search (semantic understanding) with Full-Text Search (keyword precision) to retrieve relevant context from a local knowledge base.

Key Technologies:

  • Storage: node:sqlite (Local SQLite database)
  • Vector Search: sqlite-vec extension (vec0 virtual tables)
  • Keyword Search: SQLite FTS5 extension (fts5 virtual tables)
  • File Watcher: chokidar (Real-time updates)
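
These pieces are wired together roughly as follows. This is a minimal sketch, assuming a Node version where node:sqlite supports extension loading and the sqlite-vec npm package; the database filename matches the schema section below.

import { DatabaseSync } from "node:sqlite";
import * as sqliteVec from "sqlite-vec";

// Open the local database with extension loading enabled, then register vec0
const db = new DatabaseSync("memory.sqlite", { allowExtension: true });
db.loadExtension(sqliteVec.getLoadablePath());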

Core Architecture

1. Data Sources

The memory system indexes content from two primary sources:

  1. Memory Files: Markdown files located in <workspace>/memory/ and <workspace>/MEMORY.md. These are watched for real-time updates.
  2. Session Transcripts: Active or archived conversation logs (if configured).

2. Storage Schema

The system uses a unified SQLite database (memory.sqlite) with the following key tables:

  • files: Tracks indexed files, modification times, and content hashes to support incremental updates.
  • chunks: Stores the text chunks derived from files (maps 1:N from files).
  • chunks_vec: Virtual table (sqlite-vec). Stores the vector embedding for each chunk; used for semantic search.
  • chunks_fts: Virtual table (FTS5). Stores the full text of each chunk; used for BM25/keyword search.
  • embedding_cache: Caches embeddings keyed by (provider, model, text hash) to prevent expensive re-computation.
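
The exact DDL is defined in src/memory/memory-schema.ts. As a rough sketch of the shape, continuing the db handle from the sketch above (column names and the embedding dimension are illustrative assumptions, not the actual schema):

db.exec(`
  CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE,
    mtime INTEGER NOT NULL,
    hash TEXT NOT NULL
  );
  CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    file_id INTEGER NOT NULL REFERENCES files(id),
    start_line INTEGER,
    text TEXT NOT NULL
  );
  -- sqlite-vec virtual table; the dimension must match the embedding model
  CREATE VIRTUAL TABLE IF NOT EXISTS chunks_vec USING vec0(embedding float[768]);
  -- FTS5 virtual table for BM25 keyword search
  CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts USING fts5(text);
  CREATE TABLE IF NOT EXISTS embedding_cache (
    provider TEXT NOT NULL,
    model TEXT NOT NULL,
    text_hash TEXT NOT NULL,
    embedding BLOB NOT NULL,
    PRIMARY KEY (provider, model, text_hash)
  );
`);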

3. Indexing Pipeline (Code Snapshot)

The system uses a token-aware chunking strategy optimized for Markdown. It handles overlaps to maintain context across chunk boundaries.

src/memory/internal.ts:

export function chunkMarkdown(
  content: string,
  chunking: { tokens: number; overlap: number },
): MemoryChunk[] {
  // ... (setup: chunks, current, currentChars, maxChars, flush(), carryOverlap()) ...
  const lines = content.split("\n");
  
  // Iterate line by line to respect markdown structure
  for (let i = 0; i < lines.length; i += 1) {
    const line = lines[i] ?? "";
    // ... (split over-long lines into segments; lineNo tracks the source line) ...
    for (const segment of segments) {
      const lineSize = segment.length + 1;
      // Flush chunk if size limit reached
      if (currentChars + lineSize > maxChars && current.length > 0) {
        flush();
        carryOverlap(); // Function to keep the last N chars for the next chunk
      }
      current.push({ line: segment, lineNo });
      currentChars += lineSize;
    }
  }
  flush(); // Final flush
  return chunks;
}
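
For illustration, a caller might look like this (the token budget and overlap values are placeholders, not Clawdbot's actual defaults):

const chunks = chunkMarkdown(fileContent, { tokens: 400, overlap: 50 });
// Each chunk retains the source line numbers tracked via lineNo above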

4. Retrieval Pipeline (Hybrid Search Logic)

Clawdbot explicitly merges vector results (semantic) with keyword results (lexical).

src/memory/manager.ts:

  async search(query: string, opts?: { ... }): Promise<MemorySearchResult[]> {
    // ...
    // 1. Keyword Search (FTS5)
    const keywordResults = hybrid.enabled
      ? await this.searchKeyword(cleaned, candidates).catch(() => [])
      : [];

    // 2. Vector Search (sqlite-vec)
    const queryVec = await this.embedQueryWithTimeout(cleaned);
    const vectorResults = hasVector
      ? await this.searchVector(queryVec, candidates).catch(() => [])
      : [];

    if (!hybrid.enabled) {
      return vectorResults.filter((entry) => entry.score >= minScore);
    }

    // 3. Merge & Rank
    const merged = this.mergeHybridResults({
      vector: vectorResults,
      keyword: keywordResults,
      vectorWeight: hybrid.vectorWeight,
      textWeight: hybrid.textWeight,
    });

    return merged.filter((entry) => entry.score >= minScore).slice(0, maxResults);
  }
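
Conceptually, the merge is a weighted combination of the two ranked lists. The sketch below shows one plausible shape; the Scored type, the assumption that both score sets are already normalized to [0, 1], and the name mergeByWeight are illustrative, not the actual mergeHybridResults implementation.

type Scored = { id: string; score: number };

// Weight each list, then sum scores for chunks surfaced by both retrievers
function mergeByWeight(
  vector: Scored[],
  keyword: Scored[],
  vectorWeight: number,
  textWeight: number,
): Scored[] {
  const byId = new Map<string, Scored>();
  for (const r of vector) {
    byId.set(r.id, { id: r.id, score: r.score * vectorWeight });
  }
  for (const r of keyword) {
    const prev = byId.get(r.id);
    if (prev) {
      prev.score += r.score * textWeight;
    } else {
      byId.set(r.id, { id: r.id, score: r.score * textWeight });
    }
  }
  return [...byId.values()].sort((a, b) => b.score - a.score);
}

One practical wrinkle: FTS5's BM25 rank is an unbounded "lower is better" value, so it must be inverted and normalized before it can share a scale with vector similarity; chunks found by both retrievers then naturally float to the top.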

5. Session Memory Hook

The system includes a specialized hook (session-memory) that bridges the gap between ephemeral conversation and long-term memory.

  • Trigger: the /new command (start of a new session).
  • Action:
    1. Extracts the summary/context of the ending session.
    2. Uses an LLM to generate a descriptive filename (slug).
    3. Writes the context to a new Markdown file in memory/.
  • Result: Past conversations automatically become part of the searchable knowledge base for future sessions (a sketch of this flow follows).
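
The flow can be pictured as follows. Everything in this sketch is hypothetical: onSessionEnd, summarize, and generateSlug are illustrative stand-ins, not the hook's real API.

import fs from "node:fs/promises";
import path from "node:path";

declare function summarize(transcript: string): Promise<string>;  // LLM call (assumed)
declare function generateSlug(summary: string): Promise<string>;  // LLM call (assumed)

async function onSessionEnd(workspace: string, transcript: string) {
  const summary = await summarize(transcript);              // 1. extract context
  const slug = await generateSlug(summary);                 // 2. descriptive filename
  const file = path.join(workspace, "memory", `${slug}.md`);
  await fs.writeFile(file, summary, "utf8");                // 3. persist for future search
}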

File Structure Reference

  • src/memory/manager.ts: The "brain" of the system. Orchestrates indexing, searching, and provider management.
  • src/memory/internal.ts: Handles file reading, hashing, and the critical chunking logic.
  • src/memory/memory-schema.ts: Defines the SQLite tables and virtual tables.
  • src/memory/embeddings-*.ts: Adapters for different AI providers (OpenAI, Gemini).

Memory for Coding Agents: Analysis & Comparison

Clawdbot uses a General Text Memory approach. While effective for documentation and conversation history, this differs significantly from the specialized Code-Aware Memory systems used by advanced coding agents, such as the symbol indexes that back the Language Server Protocol (LSP).

1. The Disconnect: Text vs. Structure

  • Clawdbot (Text-Based): Treats code as text chunks.

    • Pros: Simple, universal, good for finding comments and high-level logic in documentation.
    • Cons: "Blind" to code structure. It might split a function in half if it crosses a chunk boundary (though overlap helps). It doesn't understand that class Dog is related to dog.bark() unless they are textually close.
  • LSP / Code-Aware (Structure-Based): Treats code as a graph or tree (AST).

    • Pros: Precise navigation (Go to Definition, Find References). Understands "symbols" independent of file position.
    • Cons: Language-specific, complex to implement and maintain.

2. Is Clawdbot Good at Coding? (Analysis)

Despite lacking an LSP index, Clawdbot is a highly effective Coding Assistant (Pair Programmer), but with a different strength profile than an IDE-native agent.

Strengths

  • Active Coding: It can execute shell commands, run tests, lint files, and read massive documentation files into its context.
  • Correction Loop: It uses bash and node to verify its own work. If it generates broken code, it can run it, see the error, and fix it—mimicking a human developer's workflow.
  • Explanation: Its generalist memory is excellent for explaining high-level architecture decisions found in design docs or READMEs.

Limitations (The "LSP Gap")

  • Refactoring at Scale: Because it "sees" code as documents, it struggles with tasks like "Rename this method and update all 50 files that use it". It might miss usages that don't match the text search exactly.
  • Implicit Graphs: It doesn't inherently know that modifying A.ts breaks Z.ts unless they are explicitly linked in the documentation or textually close.

3. LSP-Like Memory Design

For a highly effective *Coding Agent*, memory should ideally be layered:

Layer 1: The Index (What Clawdbot lacks)

An LSP-like system indexes Symbols rather than just text chunks.

  • Query: "Where is User defined?"
  • Text Search: Returns every file containing the word "User" (thousands of results).
  • Symbol Search: Returns src/models/user.ts:class User (see the sketch below).
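
If such an index existed, the lookup becomes a single indexed query. A sketch, reusing the db handle from the Key Technologies section and a hypothetical symbols table that is not part of Clawdbot today:

// Hypothetical symbols table; names and columns are assumptions
const def = db
  .prepare("SELECT path, line FROM symbols WHERE name = ? AND kind = 'class'")
  .get("User");
// -> e.g. { path: "src/models/user.ts", line: 12 }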

Layer 2: The Graph (Dependency Awareness)

Code is a graph of dependencies. A strong memory system maps these relationships.

  • node: auth.ts
  • edge: imports -> user.ts
  • edge: calls -> login()
  • Agent Utility: When modifying user.ts, the agent automatically knows to check auth.ts because of the graph connection (see the query sketch below).
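
That reverse-dependency check could be a single query over a hypothetical edges table (again reusing the db handle from above; the schema is an assumption for illustration):

// Hypothetical edges table (src, kind, dst); not part of Clawdbot today
const dependents = db
  .prepare("SELECT src FROM edges WHERE kind = 'imports' AND dst = ?")
  .all("user.ts");
// -> e.g. [{ src: "auth.ts" }]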

4. Recommendation for Evolution

To evolve Clawdbot into a stronger coding assistant, the memory system could be augmented with a Symbol Index:

  1. Parser: Use tree-sitter to parse code files when indexing.
  2. Extraction: Extract top-level symbols (Classes, Functions, Interfaces).
  3. Storage: Store these as metadata in the chunks table or a separate symbols table.
  4. Retrieval: Allow the agent to query "definition of X" which maps to the specific chunk containing that symbol.

Code Snapshot Concept for Symbol Indexing:

// Conceptual future implementation: extractSymbols and db.storeSymbols are
// illustrative, not part of the current codebase
import Parser from "tree-sitter";
import TS from "tree-sitter-typescript";

const parser = new Parser();
parser.setLanguage(TS.typescript);

function extractSymbols(code: string) {
  const tree = parser.parse(code);
  // Walk top-level nodes and collect function/class declarations
  return tree.rootNode.namedChildren
    .filter((n) => n.type === "function_declaration" || n.type === "class_declaration")
    .map((n) => ({
      name: n.childForFieldName("name")?.text ?? "<anonymous>",
      type: n.type,
      lineNumber: n.startPosition.row + 1,
    }));
}

// In the indexing pipeline:
const symbols = extractSymbols(fileContent);
await db.storeSymbols(fileId, symbols);