Cogtrix Memory Modes
Cogtrix manages what the LLM “remembers” during a session. Different tasks benefit from different memory strategies — a quick Q&A session doesn’t need error tracking, and a planning session benefits from decision logging. Memory modes let you pick the right strategy for the job.
Not sure which mode to use? Start with conversation (the default). Switch to code when you start writing or debugging code, and to reasoning when you need to plan, compare options, or make decisions.
Table of Contents
- Overview
- Hybrid Memory System
- Message Timestamps
- Mode Comparison
- Conversation Mode
- Code Development Mode
- Reasoning Mode
- Configuration
- Switching Modes
Overview
Cogtrix uses a pluggable memory system that optimizes context management for different use cases. Each mode manages:
- Working Memory — Recent messages sent to the LLM (sliding window)
- Hybrid Memory — Automatic summarization and optional semantic recall of older messages
- Context Tracking — Mode-specific information (files, decisions, etc.)
- System Prompt Additions — Mode-specific instructions for the LLM
- Token-Aware Trimming — Ensures the context always fits the model’s context window
Hybrid Memory System
All three memory modes share a hybrid memory layer that prevents long-term context loss. When messages fall outside the sliding window, they are not simply discarded — they are processed in two ways:
- Incremental summarization — An LLM generates a concise rolling summary of older messages, preserving key facts, decisions, and user preferences.
- Vector recall (optional) — Older message pairs are embedded and stored in a per-session FAISS index. On each turn, the user’s input is used to retrieve the most semantically relevant past exchanges.
Both layers are injected at the top of the context, giving the LLM a sense of the full conversation history without consuming the entire context window.
How It Works
Consider an 80-message conversation. The full history is split into three buckets:
- Messages 1–44 — covered by the rolling summary (compressed text) and stored in the vector index for semantic recall.
- Messages 45–55 — pending batch; will be summarized once ≥ 10 messages accumulate.
- Messages 56–80 — sliding window; sent verbatim to the LLM.
What the LLM actually sees on each turn:
Summarization
Summarization is triggered after each response, not during the user’s wait for a reply. Specifically:
- After the agent replies, the memory manager checks how many messages have fallen outside the sliding window since the last summary was generated.
- Summarization is skipped unless at least 10 messages have fallen out of the window since the last summary (
_SUMMARY_BATCH_SIZE = 10). This prevents premature summarization after every single turn. - Once the 10-message threshold is crossed, a meaningful-content gate runs before sending anything to the LLM:
- At least 4 meaningful messages (2 full human+assistant turns) must be present (
_MIN_MEANINGFUL_MSGS_FOR_SUMMARY = 4) - At least 5,000 characters of meaningful content must exist (
_MIN_MEANINGFUL_CHARS_FOR_SUMMARY = 5000) - Both thresholds must be missed simultaneously to skip summarization — if either is met, the batch proceeds. This prevents summarization from firing on short or tool-heavy exchanges that contain no real conversational substance.
- At least 4 meaningful messages (2 full human+assistant turns) must be present (
- The LLM produces an updated rolling summary that merges the new batch into the existing summary.
- The summary index is advanced so those messages aren’t re-summarized.
The summarization prompt instructs the LLM to:
- Preserve key facts, data, decisions, user preferences, and action items
- Drop small-talk, greetings, and verbose tool-call details
- Write in third person present tense
- Keep the summary under 400 words
- Use bullet points for clarity
Graceful degradation: If the LLM call fails or returns an empty result, the previous summary is retained unchanged. Summarization never blocks or crashes the conversation.
Vector Recall
When an embedding provider is available (Ollama with nomic-embed-text, OpenAI, etc.), Cogtrix automatically:
- Embeds older conversation exchanges (human + AI pairs) into a per-session FAISS index.
- On each new user input, queries the index for the top-k most similar past exchanges.
- Injects the recalled exchanges into the context as “Related past exchanges.”
This allows the agent to recall specific details from much earlier in the conversation — even details that the rolling summary may have compressed away.
Graceful degradation: If no embedding provider is available, vector recall is simply skipped. The sliding window and rolling summary still function normally.
Configuring embeddings: Cogtrix auto-detects an embedding provider at startup (tries Ollama first, then OpenAI). To explicitly control which embedding model is used for hybrid memory, define a model entry in the models registry and configure rag.model to reference it — the same model is used for both RAG ingestion and memory vector recall.
Embedding model tracking: The embedding model name is stored alongside the FAISS index. If you switch embedding models between sessions, the stale index is automatically discarded and rebuilt from scratch.
Configuration
Hybrid memory is enabled by default. You can tune it per mode:
memory:
modes:
conversation:
working_memory_size: 25
summarization: true # Enable/disable LLM summarization (default: true)
vector_recall_k: 3 # Number of past exchanges to recall (default: 3)
code:
working_memory_size: 30
summarization: true
vector_recall_k: 3
reasoning:
working_memory_size: 30
summarization: true
vector_recall_k: 3
| Option | Type | Default | Description |
|---|---|---|---|
summarization | bool | true | Enable incremental LLM summarization of older messages |
vector_recall_k | int | 3 | Number of semantically similar past exchanges to retrieve |
Setting summarization: false disables the rolling summary (useful if you want to save LLM calls on a metered API). Setting vector_recall_k: 0 effectively disables vector recall.
Persistence
Hybrid memory state is persisted alongside session history:
- Summary text + coverage index →
data/history/{session_id}_hybrid.json - Vector index →
data/vectordb/sessions/{session_id}/(FAISS files + metadata)
When you resume a session, both the summary and vector index are restored. If the session history was sanitized (e.g., corrupted messages removed), the summary index is automatically clamped to stay within bounds.
Message Timestamps
Every message in the conversation history is automatically stamped with a UTC timestamp at the moment it is created:
- User messages are stamped when the input is submitted (at
prepare_context()time). - AI responses are stamped when the LLM finishes generating its reply (at
update()time).
When messages are sent to the LLM, each one is prefixed with a human-readable timestamp:
[2026-02-14 15:23:05 UTC] What are the top news affecting the stock market?
[2026-02-14 15:23:47 UTC] Here are the top stories...
This gives the model a sense of time: it can see how long a response took, how much time passed between turns, and whether a session spans minutes or days. Timestamps are stored in UTC for unambiguous cross-timezone comparison.
Persistence: Timestamps are saved alongside each message in the session JSON file (as an ISO 8601 string, e.g. "2026-02-14T15:23:05Z"). Old session files without timestamps load normally — those messages simply appear without a time prefix.
Mode Comparison
| Aspect | Conversation | Code | Reasoning |
|---|---|---|---|
| Working Memory | 25 messages | 30 messages | 30 messages |
| Best For | General chat, Q&A | Programming, debugging | Planning, decisions |
| Tracks | Topics, entities | Files, errors, changes | Goals, decisions, constraints |
| Context Focus | Conversation flow | Current code + task | Problem + objectives |
| Hybrid Memory | Summary + vector recall | Summary + vector recall | Summary + vector recall |
About tools: All tools are on-demand regardless of memory mode — the agent requests only the tools it needs for the current task. See Tool Loading for details.
Conversation Mode
CLI: python cogtrix.py -M conversation (default)
Best for: General chat, Q&A, research, information lookup
How It Works
Maintains a sliding window of recent messages with entity tracking:
Context Composition
What gets sent to the LLM:
Configuration
memory:
mode: conversation
modes:
conversation:
working_memory_size: 25
summarization: true
vector_recall_k: 3
| Option | Default | Description |
|---|---|---|
working_memory_size | 25 | Number of messages to keep in context |
summarization | true | Enable rolling summary of older messages |
vector_recall_k | 3 | Semantically similar past exchanges to retrieve |
Code Development Mode
CLI: python cogtrix.py -M code
Best for: Programming, debugging, code review, software development
How It Works
Optimized for coding with task and file tracking:
Context Composition
What gets sent to the LLM:
Special Features
- File Tracking — Automatically tracks mentioned files
- Error Memory — Retains error messages for debugging context
- Task Progress — Tracks what’s been accomplished
- Structured Context — Task, files, and errors injected alongside messages
Configuration
memory:
mode: code
modes:
code:
working_memory_size: 30
max_files: 20
max_errors: 5
summarization: true
vector_recall_k: 3
| Option | Default | Description |
|---|---|---|
working_memory_size | 30 | Number of messages to keep |
max_files | 20 | Maximum files to track |
max_errors | 5 | Maximum errors to remember |
summarization | true | Enable rolling summary of older messages |
vector_recall_k | 3 | Semantically similar past exchanges to retrieve |
Reasoning Mode
CLI: python cogtrix.py -M reasoning
Best for: Strategic planning, architecture decisions, complex problem-solving
How It Works
Designed for deep thinking with goal and decision tracking:
Context Composition
What gets sent to the LLM:
Special Features
- Goal Tracking — Maintains objective hierarchy
- Decision Audit — Logs decisions with rationale
- Constraint Awareness — Keeps boundaries visible
- Alternative Tracking — Records rejected options
- Assumption Logging — Explicit assumption tracking
Configuration
memory:
mode: reasoning
modes:
reasoning:
working_memory_size: 30
max_decisions: 20
max_alternatives: 10
summarization: true
vector_recall_k: 3
prefix_max_stale_turns: 3 # Turns before a stale section is omitted from prefix
| Option | Default | Description |
|---|---|---|
working_memory_size | 30 | Number of messages to keep |
max_decisions | 20 | Maximum decisions to track |
max_alternatives | 10 | Maximum alternatives to track |
summarization | true | Enable rolling summary of older messages |
vector_recall_k | 3 | Semantically similar past exchanges to retrieve |
prefix_max_stale_turns | 3 | Turns a prefix section can go unmodified before being omitted from the context prefix (section-freshness gating) |
Configuration
Via Config File
memory:
mode: code
modes:
conversation:
working_memory_size: 25
summarization: true
vector_recall_k: 3
code:
working_memory_size: 30
summarization: true
vector_recall_k: 3
reasoning:
working_memory_size: 30
summarization: true
vector_recall_k: 3
Via Environment Variable
export COGTRIX_MEMORY_MODE=code
python cogtrix.py
Via Command Line
python cogtrix.py -M code
python cogtrix.py --memory-mode reasoning
Switching Modes
At Runtime (Live Switching)
Switch modes during an interactive session using the /mode or /M command:
You: /mode code
Switched to code mode
You: /M reasoning
Switched to reasoning mode
Switching preserves the current session but rebuilds the system prompt, memory context, and tool presets for the new mode. The agent is re-initialized immediately.
At Startup
Specify a mode when starting:
# Morning: Planning session
python cogtrix.py -M reasoning -s project-planning
# Afternoon: Coding session
python cogtrix.py -M code -s project-dev
# Evening: Research session
python cogtrix.py -M conversation -s research
Mode Selection Guide
| If you’re doing… | Use mode | Why |
|---|---|---|
| General questions, research | conversation | Lightweight, fast — no extra overhead |
| Summarizing articles, brainstorming | conversation | Focus on the flow of ideas |
| Writing or reviewing code | code | Tracks files you mention and errors you hit |
| Debugging errors | code | Error memory prevents the LLM from losing context on the bug |
| Refactoring a codebase | code | Larger working memory (30 msgs) keeps more context visible |
| Architecture decisions | reasoning | Decision log records choices and rationale |
| Project planning | reasoning | Goal hierarchy keeps objectives structured |
| Comparing options with trade-offs | reasoning | Constraint tracking + deep think integration |
Rule of thumb: conversation < code < reasoning in terms of working memory size and tracking overhead. Pick the lightest mode that fits your task.
Memory Persistence
All modes save to the same JSON format:
| Path | Contents |
|---|---|
data/history/{session_id}.json | Message history + session metadata |
data/history/{session_id}_hybrid.json | Summary text + coverage index |
data/history/{session_id}_mode_state.json | Mode-specific state (goals, decisions, etc.) |
data/vectordb/sessions/{session_id}/ | FAISS vector index (if embeddings available) |
The history file contains:
- Full message history (each message includes a UTC
timestampfield) - Session metadata
The mode state file contains mode-specific tracking data (goals, decisions, reasoning chains, code tasks, conversation entities, turn counters, and section timestamps) persisted via _save_mode_meta() and restored on session restart via _restore_mode_state().
Memory is automatically loaded when resuming a session:
# First session
python cogtrix.py -M code -s my-project
# ... work on code ...
# Exit
# Resume later (memory restored — including summary and vector index)
python cogtrix.py -M code -s my-project
Token-Aware Context Management
Regardless of the memory mode, Cogtrix ensures the prepared context never exceeds the model’s context window. Before messages are sent to the LLM:
- The total token count is estimated using a character-based heuristic (~4 characters per token).
- If the total exceeds the available budget, the oldest history messages are dropped first.
- If individual messages are still too large after trimming, they are truncated with a
[…truncated…]marker. - The system prompt and the current user input are never removed.
The max_tokens parameter sent to the LLM is also dynamically calculated to avoid requesting more tokens than the remaining context window allows, preventing “max_tokens must be at least 1” errors from the API.
See Also
- Configuration Reference — memory and summarization settings
- Architecture Overview — how memory fits in the execution pipeline