Performance Tuning

Guide to optimizing CortexPrism performance for different workloads and deployment scenarios.

LLM Provider Selection

Provider choice has the biggest impact on response time:

Provider | Avg. Latency | Best For
Groq | ~500ms | Fast inference, prototyping
OpenAI GPT-4o mini | ~1s | General purpose, cost-sensitive
Anthropic Claude Sonnet | ~2s | Code, analysis
Ollama (local) | ~5-30s | Offline, privacy-sensitive
DeepSeek | ~1.5s | Code generation, cost-effective

Cascade Router

Optimize cost and latency with cascading providers:

{
  "router": {
    "enabled": true,
    "confidenceThreshold": 0.7,
    "cascade": [
      { "provider": "groq", "model": "llama-3.3-70b-versatile" },
      { "provider": "openai", "model": "gpt-4o-mini" },
      { "provider": "anthropic", "model": "claude-sonnet-4-20250514" }
    ]
  }
}

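The router tries the cheaper, faster models first, in the order listed under cascade, and escalates to the next provider when the confidence of a response falls below confidenceThreshold.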
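The escalation logic can be sketched roughly as follows. This is an illustration only, not CortexPrism's router: the Step type, the run_cascade function, and the call signature are hypothetical, and how a confidence score is actually produced is not described in this guide, so the stub simply returns fixed values.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One cascade entry (hypothetical type mirroring the config above)."""
    provider: str
    model: str


# A caller returns (answer, confidence in [0, 1]) for a step and a prompt.
Caller = Callable[[Step, str], Tuple[str, float]]


def run_cascade(steps: List[Step], prompt: str, call: Caller,
                threshold: float = 0.7) -> Tuple[str, str]:
    """Try each step in order; stop at the first answer that meets the threshold."""
    if not steps:
        raise ValueError("cascade must contain at least one step")
    used, answer = steps[0], ""
    for step in steps:
        answer, confidence = call(step, prompt)
        used = step
        if confidence >= threshold:
            break  # confident enough, stop escalating
    # If no step was confident, the last step's answer is returned.
    return used.provider, answer


CASCADE = [
    Step("groq", "llama-3.3-70b-versatile"),
    Step("openai", "gpt-4o-mini"),
    Step("anthropic", "claude-sonnet-4-20250514"),
]


def fake_call(step: Step, prompt: str) -> Tuple[str, float]:
    # Stub confidences for illustration; real scores would come from the router.
    scores = {"groq": 0.55, "openai": 0.82, "anthropic": 0.95}
    return f"{step.provider} answer", scores[step.provider]


result = run_cascade(CASCADE, "Explain WAL mode", fake_call)
print(result)
```

With these stub scores the first provider falls below 0.7, so the request escalates once and stops at the second entry. Raising the threshold sends more traffic to the slower, more expensive models; lowering it keeps more traffic on the fast tier.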

Memory Performance

Hybrid Retrieval

Memory search uses hybrid FTS5 + vector embedding retrieval:

{
  "memory": {
    "retrieval": {
      "ftsWeight": 0.4,
      "vectorWeight": 0.4,
      "recencyWeight": 0.2,
      "limit": 5
    }
  }
}
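  • Lower limit to return fewer results and speed up queries
  • Raise ftsWeight for keyword-heavy data (identifiers, error messages); raise vectorWeight for free-form text where meaning matters more than exact wording
  • Use specific keywords so the FTS5 index can narrow the search quickly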
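To see how the three weights interact, here is a minimal sketch that assumes the final score is a weighted sum of per-signal scores already normalized to the range 0 to 1. The guide does not state the actual combination formula, so treat hybrid_score, top_k, and the candidate fields as hypothetical.

```python
WEIGHTS = {"ftsWeight": 0.4, "vectorWeight": 0.4, "recencyWeight": 0.2}


def hybrid_score(candidate: dict, weights: dict) -> float:
    """Weighted sum of normalized signals (assumed formula, not confirmed)."""
    return (weights["ftsWeight"] * candidate["fts"]
            + weights["vectorWeight"] * candidate["vector"]
            + weights["recencyWeight"] * candidate["recency"])


def top_k(candidates: list, weights: dict, limit: int) -> list:
    """Return the `limit` highest-scoring candidates, best first."""
    ranked = sorted(candidates, key=lambda c: hybrid_score(c, weights), reverse=True)
    return ranked[:limit]


# Made-up candidates: "a" is a strong keyword match, "b" a strong semantic match.
CANDIDATES = [
    {"id": "a", "fts": 0.9, "vector": 0.2, "recency": 0.5},
    {"id": "b", "fts": 0.3, "vector": 0.9, "recency": 0.9},
    {"id": "c", "fts": 0.1, "vector": 0.1, "recency": 1.0},
]

ranked = top_k(CANDIDATES, WEIGHTS, limit=2)

# Shifting weight toward FTS favors the keyword match instead.
keyword_heavy = {"ftsWeight": 0.7, "vectorWeight": 0.1, "recencyWeight": 0.2}
ranked_keyword = top_k(CANDIDATES, keyword_heavy, limit=2)

print([c["id"] for c in ranked], [c["id"] for c in ranked_keyword])
```

Under the default weights the semantic match ranks first; under the keyword-heavy weights the exact keyword match overtakes it, which is the trade-off the weights are meant to control.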

Memory Tiers

Limit which tiers are searched:

# Search only fast tiers
cortex memory search "query" --tiers episodic,semantic

# Exclude slow reflection tier
cortex memory search "query" --tiers working,episodic,semantic

Pruning

Regularly prune old or low-value memories:

# Auto-prune on startup (config)
{
  "memory": {
    "pruning": {
      "enabled": true,
      "maxEntries": 10000,
      "retentionDays": 90
    }
  }
}

# Manual prune
cortex memory prune --keep 5000
cortex memory prune --older-than 30d
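The two pruning settings can be pictured as an age filter followed by a size cap. The sketch below assumes that order, and assumes the newest entries are the ones kept when the cap applies; the guide does not specify either detail, and the entry shape is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def prune(entries: list, now: datetime,
          max_entries: int = 10000, retention_days: int = 90) -> list:
    """Drop entries older than the retention window, then keep the newest max_entries."""
    cutoff = now - timedelta(days=retention_days)
    fresh = [e for e in entries if e["created_at"] >= cutoff]
    fresh.sort(key=lambda e: e["created_at"], reverse=True)  # newest first
    return fresh[:max_entries]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
entries = [
    {"id": f"d{age}", "created_at": now - timedelta(days=age)}
    for age in (200, 1, 100, 50, 10)
]

# A 90-day window removes d100 and d200; the cap of 2 then removes d50.
kept = prune(entries, now, max_entries=2, retention_days=90)
print([e["id"] for e in kept])
```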

Database Performance

SQLite WAL Mode

CortexPrism uses SQLite with WAL (Write-Ahead Logging) mode by default for concurrent read/write performance.

Vacuum

Periodically vacuum databases to reclaim space:

# Vacuum all databases
cortex setup --vacuum

# Manual vacuum for specific db
sqlite3 ~/.cortex/data/cortex.db "VACUUM;"
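If you want to confirm the journal mode and measure how much space a vacuum reclaims, Python's standard sqlite3 module is enough. The sketch below works on a throwaway database in a temporary directory rather than on ~/.cortex/data/cortex.db; the memories table and the helper names are made up for the demonstration.

```python
import os
import sqlite3
import tempfile


def make_demo_db(path: str) -> None:
    """Create a WAL-mode database, fill it, then delete the rows to leave free pages."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL;")
    conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, body BLOB)")
    conn.executemany("INSERT INTO memories (body) VALUES (?)",
                     [(b"x" * 1000,) for _ in range(2000)])
    conn.commit()
    conn.execute("DELETE FROM memories")
    conn.commit()
    conn.close()


def vacuum_report(path: str):
    """Return (journal_mode, size_before, size_after) around a VACUUM."""
    # isolation_level=None selects autocommit; VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        mode = conn.execute("PRAGMA journal_mode;").fetchone()[0]
        before = os.path.getsize(path)
        conn.execute("VACUUM;")
        # In WAL mode the rewritten pages sit in the WAL until a checkpoint runs.
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE);")
        after = os.path.getsize(path)
    finally:
        conn.close()
    return mode, before, after


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.db")
    make_demo_db(db_path)
    mode, before, after = vacuum_report(db_path)

print(mode, before, after)
```

Run the vacuum while the server is idle: VACUUM rewrites the whole database file and needs exclusive access while it does so.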

Connection Pooling

The Prisma client handles connection pooling automatically. For high-throughput deployments:

# Set connection limits
export DATABASE_POOL_MIN=2
export DATABASE_POOL_MAX=10

Sandbox Performance

Docker vs Subprocess

Mode | Startup | Execution | Isolation
Docker | ~2s | Fast | Strong
Subprocess | ~10ms | Fast | Weak
WASM | ~1ms | Fastest | Strong

For development: use subprocess mode for speed.
For production: use WASM or Docker for security.

Sandbox Resource Limits

{
  "sandbox": {
    "timeout": 30,
    "memory": 256,
    "maxOutput": 64,
    "cpuQuota": 0.5
  }
}
  • Increase timeout for long-running scripts
  • Increase memory for data-heavy operations
  • Decrease cpuQuota to share resources across sessions
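As a rough picture of what timeout and maxOutput do in a subprocess-style sandbox, the sketch below enforces both with Python's subprocess module. It assumes timeout is in seconds and maxOutput in kilobytes, which this guide does not state, and it does not model the memory or cpuQuota limits. The run_limited helper is hypothetical and is not how CortexPrism implements its sandbox.

```python
import subprocess
import sys


def run_limited(code: str, timeout_s: float, max_output_kb: int) -> dict:
    """Run a Python snippet in a child process with a time limit and an output cap."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "output": b"", "truncated": False}
    output = proc.stdout[: max_output_kb * 1024]  # keep at most max_output_kb KB
    return {
        "status": "ok" if proc.returncode == 0 else "error",
        "output": output,
        "truncated": len(proc.stdout) > len(output),
    }


# A script that prints more than the 1 KB cap, and one that outlives its timeout.
big = run_limited("print('x' * 5000)", timeout_s=10, max_output_kb=1)
slow = run_limited("import time; time.sleep(5)", timeout_s=0.5, max_output_kb=1)

print(big["status"], len(big["output"]), big["truncated"], slow["status"])
```

The same trade-off applies to the real settings: a timeout that is too short kills legitimate long-running scripts, while a generous one lets a stuck script hold a session slot for longer.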

Server Performance

Concurrent Sessions

# Increase max concurrent sessions
cortex serve --max-sessions 50

# Or in config:
{
  "server": {
    "maxSessions": 50,
    "sessionTimeout": 3600000
  }
}

WebSocket vs REST

Use WebSocket for streaming chat (lower latency, fewer connections):

# WebSocket endpoint
ws://localhost:3000/api/chat

# REST endpoint
POST http://localhost:3000/api/chat

WebSocket is recommended for interactive sessions; REST for batch processing.

Profiling

Enable performance profiling:

cortex chat --profile

Output includes:

  • LLM call latency per provider
  • Tool execution time
  • Memory retrieval time
  • Total agent loop time
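If you drive CortexPrism from your own scripts and want comparable per-stage numbers, a small timer like the one below can collect them. This is a generic sketch and is unrelated to how --profile is implemented; the StageTimer class and the stage names are hypothetical, and time.sleep stands in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.totals[name] += time.perf_counter() - start


timer = StageTimer()

with timer.stage("memory_retrieval"):
    time.sleep(0.01)  # placeholder for a memory search
with timer.stage("tool_execution"):
    time.sleep(0.02)  # placeholder for a sandboxed tool call

# Print stages from slowest to fastest.
for name, seconds in sorted(timer.totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {seconds * 1000:.1f} ms")
```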

Benchmarking

# Run built-in benchmarks
cortex benchmark chat --rounds 10
cortex benchmark memory --queries 100
cortex benchmark sandbox --iterations 50

# Compare providers
cortex benchmark providers --rounds 5

Production Tuning Checklist

  • Cascade router enabled with fast fallback providers
  • Memory pruning configured and scheduled
  • Database vacuum scheduled (weekly)
  • Sandbox timeout and memory limits tuned for workload
  • Server max sessions adjusted for expected load
  • Docker sandbox enabled (or WASM for performance)
  • Profiling data collected to identify bottlenecks
  • Benchmarks run to establish baseline