Cogtrix Memory Modes

Cogtrix manages what the LLM “remembers” during a session. Different tasks benefit from different memory strategies — a quick Q&A session doesn’t need error tracking, and a planning session benefits from decision logging. Memory modes let you pick the right strategy for the job.

Not sure which mode to use? Start with conversation (the default). Switch to code when you start writing or debugging code, and to reasoning when you need to plan, compare options, or make decisions.

Overview
Hybrid Memory System
Message Timestamps
Mode Comparison
Conversation Mode
Code Development Mode
Reasoning Mode
Configuration
Switching Modes

Overview

Cogtrix uses a pluggable memory system that optimizes context management for different use cases. Each mode manages:

Working Memory — Recent messages sent to the LLM (sliding window)
Hybrid Memory — Automatic summarization and optional semantic recall of older messages
Context Tracking — Mode-specific information (files, decisions, etc.)
System Prompt Additions — Mode-specific instructions for the LLM
Token-Aware Trimming — Ensures the context always fits the model’s context window

graph TD
    FACTORY("Memory Factory<br/>create(mode, store, …)")
    CONV("Conversation<br/>25 msgs")
    CODE("Code<br/>30 msgs")
    REAS("Reasoning<br/>30 msgs")
    HYBRID("Hybrid Memory<br/>all modes<br/>Summary · Vector Recall")
    FACTORY --> CONV
    FACTORY --> CODE
    FACTORY --> REAS
    CONV --> HYBRID
    CODE --> HYBRID
    REAS --> HYBRID

Hybrid Memory System

All three memory modes share a hybrid memory layer that prevents long-term context loss. When messages fall outside the sliding window, they are not simply discarded — they are processed in two ways:

Incremental summarization — An LLM generates a concise rolling summary of older messages, preserving key facts, decisions, and user preferences.
Vector recall (optional) — Older message pairs are embedded and stored in a per-session FAISS index. On each turn, the user’s input is used to retrieve the most semantically relevant past exchanges.

Both layers are injected at the top of the context, giving the LLM a sense of the full conversation history without consuming the entire context window.

How It Works

Consider an 80-message conversation. The full history is split into three buckets:

Messages 1–44 — covered by the rolling summary (compressed text) and stored in the vector index for semantic recall.
Messages 45–55 — pending batch; will be summarized once ≥ 10 messages accumulate.
Messages 56–80 — sliding window; sent verbatim to the LLM.
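The three buckets above come down to simple index arithmetic. The following is a minimal sketch for illustration only, not Cogtrix's implementation; `split_history` and its parameter names are hypothetical.

```python
def split_history(total_messages: int, window_size: int, summary_index: int):
    """Partition 1-based message numbers into summarized, pending, and window buckets.

    summary_index is the number of messages already covered by the rolling
    summary (a hypothetical name used only for this illustration).
    """
    # Messages numbered up to window_start have fallen out of the sliding window.
    window_start = max(total_messages - window_size, 0)
    summary_index = min(summary_index, window_start)
    summarized = range(1, summary_index + 1)
    pending = range(summary_index + 1, window_start + 1)
    window = range(window_start + 1, total_messages + 1)
    return summarized, pending, window


summarized, pending, window = split_history(total_messages=80, window_size=25, summary_index=44)
print(summarized[0], summarized[-1])  # 1 44
print(pending[0], pending[-1])        # 45 55
print(window[0], window[-1])          # 56 80
```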

What the LLM actually sees on each turn:

graph TD
    SYS("System prompt + mode-specific additions")
    SUM("Conversation summary · older context<br/>The user asked about Python web frameworks…<br/>They decided to use FastAPI with PostgreSQL…")
    REC("Related past exchanges · vector recall<br/>User: How should I structure the database schema?<br/>Assistant: For your e-commerce project, I recommend…")
    WIN("Sliding window · last 25/30 messages verbatim<br/>2026-02-14 15:23:05 UTC Human: …<br/>2026-02-14 15:23:12 UTC AI: …<br/>Human: … ← Current input")
    SYS --> SUM --> REC --> WIN

Summarization

Summarization is triggered after each response, not during the user’s wait for a reply. Specifically:

After the agent replies, the memory manager checks how many messages have fallen outside the sliding window since the last summary was generated.
Summarization is skipped unless at least 10 messages have fallen out of the window since the last summary (_SUMMARY_BATCH_SIZE = 10). This prevents premature summarization after every single turn.
Once the 10-message threshold is crossed, a meaningful-content gate runs before anything is sent to the LLM. It checks two thresholds:
- At least 4 meaningful messages, i.e. 2 full human+assistant turns (_MIN_MEANINGFUL_MSGS_FOR_SUMMARY = 4)
- At least 5,000 characters of meaningful content (_MIN_MEANINGFUL_CHARS_FOR_SUMMARY = 5000)
- Summarization is skipped only when both thresholds are missed; if either is met, the batch proceeds. This keeps summarization from firing on short or tool-heavy exchanges that contain no real conversational substance.
The LLM produces an updated rolling summary that merges the new batch into the existing summary.
The summary index is advanced so those messages aren’t re-summarized.
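The trigger checks above can be sketched roughly as follows. The three constants are the ones named in this section; everything else (`is_meaningful`, `should_summarize`, the message dict shape, and the definition of "meaningful") is an assumption made for illustration, not the actual Cogtrix code.

```python
_SUMMARY_BATCH_SIZE = 10
_MIN_MEANINGFUL_MSGS_FOR_SUMMARY = 4
_MIN_MEANINGFUL_CHARS_FOR_SUMMARY = 5000


def is_meaningful(message: dict) -> bool:
    # Assumption: only human/AI messages with non-empty text count as meaningful;
    # tool output and empty messages do not.
    return message["role"] in ("human", "ai") and bool(message["content"].strip())


def should_summarize(pending: list) -> bool:
    """Return True when the pending (out-of-window) batch should be summarized."""
    if len(pending) < _SUMMARY_BATCH_SIZE:
        return False  # batch threshold not reached yet
    meaningful = [m for m in pending if is_meaningful(m)]
    enough_msgs = len(meaningful) >= _MIN_MEANINGFUL_MSGS_FOR_SUMMARY
    enough_chars = sum(len(m["content"]) for m in meaningful) >= _MIN_MEANINGFUL_CHARS_FOR_SUMMARY
    # Skipped only when both thresholds are missed; either one lets the batch proceed.
    return enough_msgs or enough_chars
```

With this shape, a batch of ten tool-only messages is skipped, while ten short chat messages proceed because the message-count threshold is met.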

The summarization prompt instructs the LLM to:

Preserve key facts, data, decisions, user preferences, and action items
Drop small-talk, greetings, and verbose tool-call details
Write in third person present tense
Keep the summary under 400 words
Use bullet points for clarity

Graceful degradation: If the LLM call fails or returns an empty result, the previous summary is retained unchanged. Summarization never blocks or crashes the conversation.
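A minimal sketch of this fallback behavior, assuming the summarizer is passed in as a plain callable. The function name `update_summary` and its signature are hypothetical.

```python
from typing import Callable


def update_summary(previous: str, batch_text: str, summarize: Callable[[str, str], str]) -> str:
    """Merge a batch into the rolling summary, keeping the old summary on failure."""
    try:
        updated = summarize(previous, batch_text)
    except Exception:
        # Any failure in the LLM call leaves the previous summary untouched.
        return previous
    if not updated or not updated.strip():
        # An empty result is treated the same way as a failed call.
        return previous
    return updated
```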

Vector Recall

When an embedding provider is available (Ollama with nomic-embed-text, OpenAI, etc.), Cogtrix automatically:

Embeds older conversation exchanges (human + AI pairs) into a per-session FAISS index.
On each new user input, queries the index for the top-k most similar past exchanges.
Injects the recalled exchanges into the context as “Related past exchanges.”

This allows the agent to recall specific details from much earlier in the conversation — even details that the rolling summary may have compressed away.

Graceful degradation: If no embedding provider is available, vector recall is simply skipped. The sliding window and rolling summary still function normally.
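To make the top-k retrieval step concrete, here is a small sketch that ranks stored exchanges by cosine similarity. It uses a brute-force pure-Python search as a stand-in for the FAISS index, and the vectors and the second and third exchange texts are toy data. The `recall` function and its arguments are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(query_vec, index, k=3):
    """Return the k stored exchanges most similar to query_vec.

    index is a list of (vector, exchange_text) pairs; k plays the role of
    vector_recall_k, so k=0 returns nothing.
    """
    if k <= 0:
        return []
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


# Toy 3-dimensional "embeddings"; real embedding vectors are much larger.
index = [
    ([1.0, 0.0, 0.0], "User: How should I structure the database schema? ..."),
    ([0.0, 1.0, 0.0], "User: Which web framework should I use? ..."),
    ([0.9, 0.1, 0.0], "User: Should orders and customers be separate tables? ..."),
]
top = recall([1.0, 0.05, 0.0], index, k=2)
```

Here `top` holds the two schema-related exchanges, with the unrelated framework question left out.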

Configuring embeddings: Cogtrix auto-detects an embedding provider at startup (tries Ollama first, then OpenAI). To explicitly control which embedding model is used for hybrid memory, define a model entry in the models registry and configure rag.model to reference it — the same model is used for both RAG ingestion and memory vector recall.

Embedding model tracking: The embedding model name is stored alongside the FAISS index. If you switch embedding models between sessions, the stale index is automatically discarded and rebuilt from scratch.
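One way such a staleness check could look is sketched below. The metadata file name, the `embedding_model` key, and the choice to treat a missing metadata file as stale are all assumptions for illustration; the actual on-disk layout is not described here.

```python
import json
import tempfile
from pathlib import Path

META_FILE = "meta.json"  # hypothetical file name, not taken from Cogtrix


def index_is_stale(index_dir: Path, current_model: str) -> bool:
    """Return True when the stored index was built with a different embedding model."""
    meta_path = index_dir / META_FILE
    if not meta_path.exists():
        return True  # assumption: nothing recorded means the index must be rebuilt
    stored = json.loads(meta_path.read_text()).get("embedding_model")
    return stored != current_model


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp)
    (index_dir / META_FILE).write_text(json.dumps({"embedding_model": "nomic-embed-text"}))
    same_model = index_is_stale(index_dir, "nomic-embed-text")      # index can be reused
    switched_model = index_is_stale(index_dir, "some-other-model")  # index must be rebuilt
```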

Configuration

Hybrid memory is enabled by default. You can tune it per mode:

memory:
  modes:
    conversation:
      working_memory_size: 25
      summarization: true        # Enable/disable LLM summarization (default: true)
      vector_recall_k: 3         # Number of past exchanges to recall (default: 3)
    code:
      working_memory_size: 30
      summarization: true
      vector_recall_k: 3
    reasoning:
      working_memory_size: 30
      summarization: true
      vector_recall_k: 3

Option	Type	Default	Description
`summarization`	bool	`true`	Enable incremental LLM summarization of older messages
`vector_recall_k`	int	`3`	Number of semantically similar past exchanges to retrieve

Setting summarization: false disables the rolling summary (useful if you want to save LLM calls on a metered API). Setting vector_recall_k: 0 effectively disables vector recall.

Persistence

Hybrid memory state is persisted alongside session history:

Summary text + coverage index → data/history/{session_id}_hybrid.json
Vector index → data/vectordb/sessions/{session_id}/ (FAISS files + metadata)

When you resume a session, both the summary and vector index are restored. If the session history was sanitized (e.g., corrupted messages removed), the summary index is automatically clamped to stay within bounds.
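The clamping step amounts to bounding the coverage index by the number of messages that survived sanitization. A one-function sketch, with a hypothetical name:

```python
def clamp_summary_index(summary_index: int, message_count: int) -> int:
    """Keep the summary coverage index within [0, message_count]."""
    return max(0, min(summary_index, message_count))


print(clamp_summary_index(44, 80))  # 44 (history intact, index unchanged)
print(clamp_summary_index(44, 38))  # 38 (history shrank, index pulled back)
```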

Message Timestamps

Every message in the conversation history is automatically stamped with a UTC timestamp at the moment it is created:

User messages are stamped when the input is submitted (at prepare_context() time).
AI responses are stamped when the LLM finishes generating its reply (at update() time).

When messages are sent to the LLM, each one is prefixed with a human-readable timestamp:

[2026-02-14 15:23:05 UTC] What are the top news affecting the stock market?
[2026-02-14 15:23:47 UTC] Here are the top stories...

This gives the model a sense of time: it can see how long a response took, how much time passed between turns, and whether a session spans minutes or days. Timestamps are stored in UTC for unambiguous cross-timezone comparison.

Persistence: Timestamps are saved alongside each message in the session JSON file (as an ISO 8601 string, e.g. "2026-02-14T15:23:05Z"). Old session files without timestamps load normally — those messages simply appear without a time prefix.
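The round trip between the stored ISO 8601 string and the prefix shown to the LLM can be sketched like this. The `timestamp` key and both function names are assumptions; only the two string formats come from the text above.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def stamp_now() -> str:
    """Return the current UTC time as an ISO 8601 string, e.g. 2026-02-14T15:23:05Z."""
    return datetime.now(timezone.utc).strftime(ISO_FORMAT)


def with_time_prefix(message: dict) -> str:
    """Render a message the way it is sent to the LLM, with a readable UTC prefix."""
    ts = message.get("timestamp")
    if not ts:
        # Old session files without timestamps: no prefix is added.
        return message["content"]
    parsed = datetime.strptime(ts, ISO_FORMAT).replace(tzinfo=timezone.utc)
    return f"[{parsed:%Y-%m-%d %H:%M:%S} UTC] {message['content']}"


msg = {"content": "What are the top news affecting the stock market?",
       "timestamp": "2026-02-14T15:23:05Z"}
print(with_time_prefix(msg))
# [2026-02-14 15:23:05 UTC] What are the top news affecting the stock market?
```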

Mode Comparison

Aspect	Conversation	Code	Reasoning
Working Memory	25 messages	30 messages	30 messages
Best For	General chat, Q&A	Programming, debugging	Planning, decisions
Tracks	Topics, entities	Files, errors, changes	Goals, decisions, constraints
Context Focus	Conversation flow	Current code + task	Problem + objectives
Hybrid Memory	Summary + vector recall	Summary + vector recall	Summary + vector recall

About tools: All tools are on-demand regardless of memory mode — the agent requests only the tools it needs for the current task. See Tool Loading for details.

Conversation Mode

CLI: python cogtrix.py -M conversation (default)

Best for: General chat, Q&A, research, information lookup

How It Works

Maintains a sliding window of recent messages with entity tracking:

graph TD
    subgraph CONV["Conversation Memory"]
        direction TB
        HYBRID("Hybrid Prefix · injected when available<br/>Summary: The user discussed Python frameworks…<br/>Related: vector-recalled past exchanges")
        WIN("Working Memory · last 25 messages, timestamped<br/>2026-02-14 15:23:05 UTC Human: What is Python?<br/>2026-02-14 15:23:12 UTC AI: Python is a …<br/>… up to 25 messages")
        ENT("Entity Tracking<br/>Topics: Python, installation, programming<br/>Key Facts: user wants to learn Python")
        HYBRID --- WIN --- ENT
    end

Context Composition

What gets sent to the LLM:

graph TD
    SP("System Prompt<br/>You are a helpful AI assistant…")
    HP("Hybrid Prefix · summary + recalled exchanges")
    WM("Working Memory · last 25 messages, timestamped<br/>2026-02-14 15:23:05 UTC Human: …<br/>2026-02-14 15:23:12 UTC AI: …<br/>Human: … ← Current input")
    SP --> HP --> WM

Configuration

memory:
  mode: conversation
  modes:
    conversation:
      working_memory_size: 25
      summarization: true
      vector_recall_k: 3

Option	Default	Description
`working_memory_size`	25	Number of messages to keep in context
`summarization`	`true`	Enable rolling summary of older messages
`vector_recall_k`	3	Semantically similar past exchanges to retrieve

Code Development Mode

CLI: python cogtrix.py -M code

Best for: Programming, debugging, code review, software development

How It Works

Optimized for coding with task and file tracking:

graph TD
    subgraph CODE["Code Development Memory"]
        direction TB
        HYBRID("Hybrid Prefix · summary + vector recall<br/>Summary: Working on auth module refactor…<br/>Related: past exchanges about auth.py")
        WIN("Working Memory · last 30 messages, timestamped<br/>10:05:30 UTC Human: Fix the bug in auth.py<br/>10:05:47 UTC AI: I see the issue…<br/>… up to 30 messages")
        TASK("Task Context<br/>Current Task: Fix authentication bug<br/>Progress: Identified issue, Modified auth.py<br/>Files Touched: auth.py, tests/test_auth.py")
        ERR("Error Tracking<br/>Recent Errors:<br/>TypeError at auth.py:45<br/>ImportError in test_auth.py")
        HYBRID --- WIN --- TASK --- ERR
    end

Context Composition

What gets sent to the LLM:

graph TD
    SP("System Prompt<br/>You are an expert programmer…")
    HP("Hybrid Prefix · summary + recall")
    TC("Task Context<br/>Current task: Fix authentication bug<br/>Files: auth.py, test_auth.py<br/>Recent errors: TypeError at line 45")
    WM("Working Memory · last 30 messages<br/>10:05:30 UTC Human: …<br/>10:05:47 UTC AI: …<br/>Human: … ← Current input")
    SP --> HP --> TC --> WM

Special Features

File Tracking — Automatically tracks mentioned files
Error Memory — Retains error messages for debugging context
Task Progress — Tracks what’s been accomplished
Structured Context — Task, files, and errors injected alongside messages

Configuration

memory:
  mode: code
  modes:
    code:
      working_memory_size: 30
      max_files: 20
      max_errors: 5
      summarization: true
      vector_recall_k: 3

Option	Default	Description
`working_memory_size`	30	Number of messages to keep
`max_files`	20	Maximum files to track
`max_errors`	5	Maximum errors to remember
`summarization`	`true`	Enable rolling summary of older messages
`vector_recall_k`	3	Semantically similar past exchanges to retrieve
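The `max_files` and `max_errors` limits suggest bounded tracking structures. The sketch below shows one way to enforce them with fixed-size queues, assuming the oldest entry is evicted first; the eviction policy, the `CodeContext` class, and the file-name pattern are all illustrative assumptions rather than Cogtrix behavior.

```python
import re
from collections import deque

# Illustrative pattern only: matches paths ending in a few common extensions.
FILE_PATTERN = re.compile(r"\b[\w./-]+\.(?:py|js|ts|json|yaml|yml|md)\b")


class CodeContext:
    """Bounded tracker for mentioned files and recent errors (hypothetical)."""

    def __init__(self, max_files: int = 20, max_errors: int = 5):
        self.files = deque(maxlen=max_files)    # oldest entry drops off when full
        self.errors = deque(maxlen=max_errors)

    def observe(self, text: str) -> None:
        """Record every file name mentioned in a message."""
        for name in FILE_PATTERN.findall(text):
            if name in self.files:
                self.files.remove(name)  # re-mentioned files move to the newest slot
            self.files.append(name)

    def record_error(self, error: str) -> None:
        self.errors.append(error)


ctx = CodeContext(max_files=2)
ctx.observe("Fix the bug in auth.py and tests/test_auth.py")
ctx.observe("Also check utils.py")
print(list(ctx.files))  # ['tests/test_auth.py', 'utils.py']
```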

Reasoning Mode

CLI: python cogtrix.py -M reasoning

Best for: Strategic planning, architecture decisions, complex problem-solving

How It Works

Designed for deep thinking with goal and decision tracking:

graph TD
    subgraph REAS["Reasoning Memory"]
        direction TB
        HYBRID("Hybrid Prefix · summary + vector recall<br/>Summary: Evaluating microservices architecture…<br/>Related: recalled constraint discussion")
        WIN("Working Memory · last 30 messages, timestamped<br/>09:00:15 UTC Human: Should we use microservices?<br/>09:01:03 UTC AI: Let me analyze the trade-offs…<br/>… up to 30 messages")
        GOAL("Goal Hierarchy<br/>Primary Objective: Design scalable architecture<br/>Sub-goals: Evaluate patterns · Consider team capabilities · Plan migration")
        DEC("Decision Log<br/>#1 Use event-driven · Rationale: Better decoupling, async<br/>Alternatives rejected: Direct API calls<br/>#2 Start with monolith, extract services later · Team size, time")
        CONS("Constraints<br/>Budget: $50k<br/>Timeline: 3 months<br/>Team: 4 developers")
        HYBRID --- WIN --- GOAL --- DEC --- CONS
    end

Context Composition

What gets sent to the LLM:

graph TD
    SP("System Prompt<br/>You are a strategic advisor…")
    HP("Hybrid Prefix · summary + recall")
    GH("Goal Hierarchy<br/>Objective: Design scalable architecture<br/>Sub-goals: list<br/>Current phase: Evaluation")
    CN("Constraints<br/>Budget: $50k, Timeline: 3 months…")
    RD("Recent Decisions<br/>#1 Use event-driven · Rationale…")
    WM("Working Memory · last 30 messages<br/>09:00:15 UTC Human: …<br/>09:01:03 UTC AI: …")
    SP --> HP --> GH --> CN --> RD --> WM

Special Features

Goal Tracking — Maintains objective hierarchy
Decision Audit — Logs decisions with rationale
Constraint Awareness — Keeps boundaries visible
Alternative Tracking — Records rejected options
Assumption Logging — Explicit assumption tracking

Configuration

memory:
  mode: reasoning
  modes:
    reasoning:
      working_memory_size: 30
      max_decisions: 20
      max_alternatives: 10
      summarization: true
      vector_recall_k: 3
      prefix_max_stale_turns: 3  # Turns before a stale section is omitted from prefix

Option	Default	Description
`working_memory_size`	30	Number of messages to keep
`max_decisions`	20	Maximum decisions to track
`max_alternatives`	10	Maximum alternatives to track
`summarization`	`true`	Enable rolling summary of older messages
`vector_recall_k`	3	Semantically similar past exchanges to retrieve
`prefix_max_stale_turns`	3	Turns a prefix section can go unmodified before being omitted from the context prefix (section-freshness gating)
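Section-freshness gating can be pictured as a filter over the turn at which each prefix section last changed. This is a sketch under assumptions: the `fresh_sections` function and its data shape are hypothetical, and it assumes a section is still included when it has been unmodified for exactly `prefix_max_stale_turns` turns.

```python
def fresh_sections(sections: dict, current_turn: int, max_stale_turns: int = 3) -> list:
    """Return the names of prefix sections recent enough to be included.

    sections maps a section name to the turn number at which it was last modified.
    """
    return [
        name
        for name, last_modified_turn in sections.items()
        if current_turn - last_modified_turn <= max_stale_turns
    ]


last_modified = {"goals": 10, "decisions": 7, "constraints": 2}
print(fresh_sections(last_modified, current_turn=10))  # ['goals', 'decisions']
```

In this example the constraints section has gone 8 turns without a change, so it is left out of the prefix until it is modified again.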

Configuration

Via Config File

memory:
  mode: code
  modes:
    conversation:
      working_memory_size: 25
      summarization: true
      vector_recall_k: 3
    code:
      working_memory_size: 30
      summarization: true
      vector_recall_k: 3
    reasoning:
      working_memory_size: 30
      summarization: true
      vector_recall_k: 3

Via Environment Variable

export COGTRIX_MEMORY_MODE=code
python cogtrix.py

Via Command Line

python cogtrix.py -M code
python cogtrix.py --memory-mode reasoning

Switching Modes

At Runtime (Live Switching)

Switch modes during an interactive session using the /mode or /M command:

You: /mode code
Switched to code mode

You: /M reasoning
Switched to reasoning mode

Switching preserves the current session but rebuilds the system prompt, memory context, and tool presets for the new mode. The agent is re-initialized immediately.

At Startup

Specify a mode when starting:

# Morning: Planning session
python cogtrix.py -M reasoning -s project-planning

# Afternoon: Coding session
python cogtrix.py -M code -s project-dev

# Evening: Research session
python cogtrix.py -M conversation -s research

Mode Selection Guide

If you’re doing…	Use mode	Why
General questions, research	`conversation`	Lightweight, fast — no extra overhead
Summarizing articles, brainstorming	`conversation`	Focus on the flow of ideas
Writing or reviewing code	`code`	Tracks files you mention and errors you hit
Debugging errors	`code`	Error memory prevents the LLM from losing context on the bug
Refactoring a codebase	`code`	Larger working memory (30 msgs) keeps more context visible
Architecture decisions	`reasoning`	Decision log records choices and rationale
Project planning	`reasoning`	Goal hierarchy keeps objectives structured
Comparing options with trade-offs	`reasoning`	Constraint tracking + deep think integration

Rule of thumb: tracking overhead grows from conversation to code to reasoning. Working memory is 25 messages for conversation and 30 for both code and reasoning. Pick the lightest mode that fits your task.

Memory Persistence

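All modes persist to the same set of files. History, summary, and mode state are stored as JSON; the vector index is stored as FAISS files: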

Path	Contents
`data/history/{session_id}.json`	Message history + session metadata
`data/history/{session_id}_hybrid.json`	Summary text + coverage index
`data/history/{session_id}_mode_state.json`	Mode-specific state (goals, decisions, etc.)
`data/vectordb/sessions/{session_id}/`	FAISS vector index (if embeddings available)

The history file contains:

Full message history (each message includes a UTC timestamp field)
Session metadata

The mode state file contains mode-specific tracking data (goals, decisions, reasoning chains, code tasks, conversation entities, turn counters, and section timestamps) persisted via _save_mode_meta() and restored on session restart via _restore_mode_state().

Memory is automatically loaded when resuming a session:

# First session
python cogtrix.py -M code -s my-project
# ... work on code ...
# Exit

# Resume later (memory restored — including summary and vector index)
python cogtrix.py -M code -s my-project

Token-Aware Context Management

Regardless of the memory mode, Cogtrix ensures the prepared context never exceeds the model’s context window. Before messages are sent to the LLM:

The total token count is estimated using a character-based heuristic (~4 characters per token).
If the total exceeds the available budget, the oldest history messages are dropped first.
If individual messages are still too large after trimming, they are truncated with a […truncated…] marker.
The system prompt and the current user input are never removed.

The max_tokens parameter sent to the LLM is also dynamically calculated to avoid requesting more tokens than the remaining context window allows, preventing “max_tokens must be at least 1” errors from the API.
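The trimming steps and the `max_tokens` calculation can be sketched as follows. Only the 4-characters-per-token heuristic, the drop-oldest-first order, and the truncation marker come from the text above. The function names, the choice to keep at least the newest history message before truncating it, and the floor of 1 on `max_tokens` are assumptions for illustration.

```python
CHARS_PER_TOKEN = 4
TRUNCATION_MARKER = "[…truncated…]"


def estimate_tokens(text: str) -> int:
    """Character-based estimate: roughly 4 characters per token."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def truncate(text: str, max_tokens: int) -> str:
    """Cut text to fit max_tokens, ending with the truncation marker."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    if max_chars <= len(TRUNCATION_MARKER):
        return ""  # no room even for the marker
    return text[: max_chars - len(TRUNCATION_MARKER)] + TRUNCATION_MARKER


def fit_context(system_prompt: str, history: list, user_input: str, budget: int) -> list:
    """Return the history messages that fit; system prompt and user input are never removed."""
    fixed = estimate_tokens(system_prompt) + estimate_tokens(user_input)
    kept = list(history)
    # Drop the oldest messages first, keeping at least the newest one.
    while len(kept) > 1 and fixed + sum(map(estimate_tokens, kept)) > budget:
        kept.pop(0)
    # If the remaining message is still too large, truncate it to the leftover budget.
    if kept and fixed + estimate_tokens(kept[0]) > budget:
        shortened = truncate(kept[0], budget - fixed)
        kept = [shortened] if shortened else []
    return kept


def response_max_tokens(context_window: int, prompt_tokens: int, requested: int) -> int:
    """Cap the requested completion size to what is left in the context window."""
    return max(1, min(requested, context_window - prompt_tokens))
```

For example, `response_max_tokens(8192, 8000, 1024)` gives 192, since only 192 tokens remain in the window after the prompt.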
