Performance Tuning
Guide to optimizing CortexPrism performance for different workloads and deployment scenarios.
LLM Provider Selection
Provider choice has the biggest impact on response time:
| Provider | Avg. Latency | Best For |
|---|---|---|
| Groq | ~500ms | Fast inference, prototyping |
| OpenAI GPT-4o mini | ~1s | General purpose, cost-sensitive |
| Anthropic Claude Sonnet | ~2s | Code, analysis |
| Ollama (local) | ~5-30s | Offline, privacy-sensitive |
| DeepSeek | ~1.5s | Code generation, cost-effective |
Cascade Router
Optimize cost and latency with cascading providers:
{
"router": {
"enabled": true,
"confidenceThreshold": 0.7,
"cascade": [
{ "provider": "groq", "model": "llama-3.3-70b-versatile" },
{ "provider": "openai", "model": "gpt-4o-mini" },
{ "provider": "anthropic", "model": "claude-sonnet-4-20250514" }
]
}
}
The router tries cheaper, faster models first and escalates to the next entry in `cascade` when a response's confidence falls below `confidenceThreshold`.
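The escalation behaviour described above can be sketched in a few lines. This is a minimal illustration, not CortexPrism's router: the `Reply` type, the `Provider` signature, and the rule of falling back to the last reply are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Reply:
    text: str
    confidence: float  # assumed to lie in [0, 1]


# Hypothetical provider signature: prompt in, Reply out.
Provider = Callable[[str], Reply]


def cascade(prompt: str, providers: Sequence[Provider], threshold: float = 0.7) -> Reply:
    """Return the first reply whose confidence meets the threshold.

    If no provider qualifies, the last provider's reply is returned.
    """
    if not providers:
        raise ValueError("cascade needs at least one provider")
    reply = None
    for call in providers:
        reply = call(prompt)
        if reply.confidence >= threshold:
            break  # good enough, skip the slower and costlier providers
    return reply


# Stub providers standing in for a fast model and a stronger one.
fast = lambda p: Reply("draft", 0.55)
strong = lambda p: Reply("final", 0.92)
print(cascade("Explain WAL mode", [fast, strong]).text)  # prints "final"
```

Raising the threshold sends more requests to the later, slower providers; lowering it keeps more traffic on the first one.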
Memory Performance
Hybrid Retrieval
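Memory search combines FTS5 full-text matching with vector-embedding similarity. The weights control how much each signal, plus recency, contributes to ranking: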
{
"memory": {
"retrieval": {
"ftsWeight": 0.4,
"vectorWeight": 0.4,
"recencyWeight": 0.2,
"limit": 5
}
}
}
- Lower `limit` for faster queries
- Adjust `ftsWeight` vs `vectorWeight` based on your data type
- Indexed queries are faster; use specific keywords
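To see how the three weights interact, here is a small sketch of a weighted-sum ranker. It assumes each signal is already normalized to the range 0 to 1 and that the final score is a plain weighted sum; the actual scoring inside CortexPrism may differ, and the `Candidate` type and `rank` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    id: str
    fts: float      # full-text match score, assumed normalized to [0, 1]
    vector: float   # embedding similarity, assumed normalized to [0, 1]
    recency: float  # 1.0 = newest, 0.0 = oldest (assumed)


def rank(candidates, fts_weight=0.4, vector_weight=0.4, recency_weight=0.2, limit=5):
    """Sort by weighted score, highest first, and keep at most `limit` results."""
    def score(c):
        return fts_weight * c.fts + vector_weight * c.vector + recency_weight * c.recency
    return sorted(candidates, key=score, reverse=True)[:limit]


docs = [
    Candidate("exact-keyword-hit", fts=0.9, vector=0.3, recency=0.1),
    Candidate("semantic-match", fts=0.2, vector=0.9, recency=0.5),
]
default_order = [c.id for c in rank(docs)]
keyword_order = [c.id for c in rank(docs, fts_weight=0.7, vector_weight=0.1)]
print(default_order)  # prints ['semantic-match', 'exact-keyword-hit']
print(keyword_order)  # prints ['exact-keyword-hit', 'semantic-match']
```

With the default weights the semantically similar entry wins (0.54 vs 0.50); shifting weight toward `ftsWeight` favours the exact keyword hit, which suits data such as identifiers or error codes.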
Memory Tiers
Limit which tiers are searched:
# Search only fast tiers
cortex memory search "query" --tiers episodic,semantic
# Exclude slow reflection tier
cortex memory search "query" --tiers working,episodic,semantic
Pruning
Regularly prune old or low-value memories:
# Auto-prune on startup (config)
{
"memory": {
"pruning": {
"enabled": true,
"maxEntries": 10000,
"retentionDays": 90
}
}
}
# Manual prune
cortex memory prune --keep 5000
cortex memory prune --older-than 30d
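The config defines two limits, an age cutoff and an entry cap. How CortexPrism combines them is not stated here; the sketch below shows one plausible reading, assuming the age cutoff is applied first and the newest `maxEntries` survivors are kept. The `prune` function and the tuple layout are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def prune(entries, max_entries=10000, retention_days=90, now=None):
    """entries: list of (created_at, payload) tuples.

    Returns the surviving entries, newest first.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    # Step 1 (assumed): drop anything older than the retention window.
    fresh = [e for e in entries if e[0] >= cutoff]
    # Step 2 (assumed): keep only the newest max_entries of what remains.
    fresh.sort(key=lambda e: e[0], reverse=True)
    return fresh[:max_entries]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
entries = [(now - timedelta(days=d), f"m{d}") for d in (1, 10, 45, 120)]
kept = prune(entries, max_entries=2, retention_days=90, now=now)
print([payload for _, payload in kept])  # prints ['m1', 'm10']
```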
Database Performance
SQLite WAL Mode
CortexPrism uses SQLite with WAL (Write-Ahead Logging) mode by default for concurrent read/write performance.
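If you want to confirm that a database file is actually in WAL mode, SQLite reports it through `PRAGMA journal_mode`. The sketch below uses Python's standard `sqlite3` module against a throwaway database; the `journal_mode` helper is just a name chosen for this example, and you would point it at your own database file instead.

```python
import os
import sqlite3
import tempfile


def journal_mode(db_path: str) -> str:
    """Return the journal mode SQLite reports for the file, e.g. 'wal' or 'delete'."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("PRAGMA journal_mode;").fetchone()[0]
    finally:
        conn.close()


# Demo against a temporary database so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL;")  # WAL mode is stored in the file
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.commit()
    conn.close()
    mode = journal_mode(db_path)

print(mode)  # prints "wal"
```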
Vacuum
Periodically vacuum databases to reclaim space:
# Vacuum all databases
cortex setup --vacuum
# Manual vacuum for specific db
sqlite3 ~/.cortex/data/cortex.db "VACUUM;"
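Because `VACUUM` rewrites the whole database file, it can help to check how much space it would reclaim before scheduling it. SQLite exposes the number of unused pages via `PRAGMA freelist_count`. The following is a sketch using a temporary database; `reclaimable_bytes` is a helper name invented for this example.

```python
import os
import sqlite3
import tempfile


def reclaimable_bytes(conn: sqlite3.Connection) -> int:
    """Estimate the space VACUUM would free: unused pages times page size."""
    free_pages = conn.execute("PRAGMA freelist_count;").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size;").fetchone()[0]
    return free_pages * page_size


with tempfile.TemporaryDirectory() as tmp:
    # isolation_level=None gives autocommit, so VACUUM never runs inside a transaction.
    conn = sqlite3.connect(os.path.join(tmp, "demo.db"), isolation_level=None)
    conn.execute("CREATE TABLE blobs (data BLOB)")
    conn.execute("BEGIN")
    conn.executemany("INSERT INTO blobs VALUES (?)", [(b"x" * 4096,)] * 200)
    conn.execute("COMMIT")
    conn.execute("DELETE FROM blobs")  # freed pages move to the freelist
    before = reclaimable_bytes(conn)
    conn.execute("VACUUM")
    after = reclaimable_bytes(conn)
    conn.close()

print(before > 0, after)  # prints "True 0"
```

If the estimate is small relative to the file size, the vacuum can usually wait.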
Connection Pooling
The Prisma client handles connection pooling automatically. For high-throughput deployments:
# Set connection limits
export DATABASE_POOL_MIN=2
export DATABASE_POOL_MAX=10
Sandbox Performance
Docker vs Subprocess vs WASM
| Mode | Startup | Execution | Isolation |
|---|---|---|---|
| Docker | ~2s | Fast | Strong |
| Subprocess | ~10ms | Fast | Weak |
| WASM | ~1ms | Fastest | Strong |
- For development: use subprocess mode for speed
- For production: use WASM or Docker for security
Sandbox Resource Limits
{
"sandbox": {
"timeout": 30,
"memory": 256,
"maxOutput": 64,
"cpuQuota": 0.5
}
}
- Increase `timeout` for long-running scripts
- Increase `memory` for data-heavy operations
- Decrease `cpuQuota` to share resources across sessions
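To make the effect of `timeout` and `maxOutput` concrete, here is a rough sketch of how a subprocess-style runner could enforce them. It is not CortexPrism's sandbox: the `run_limited` function is hypothetical, the units (seconds and kilobytes) are assumptions, and memory and CPU limits are left out because they need OS-level controls.

```python
import subprocess
import sys


def run_limited(cmd, timeout_s=30, max_output_kb=64):
    """Run cmd, kill it after timeout_s seconds, and truncate captured stdout."""
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return {"status": "timeout", "output": b""}
    return {
        "status": "ok" if done.returncode == 0 else "error",
        "output": done.stdout[: max_output_kb * 1024],  # assumed unit: KB
    }


result = run_limited([sys.executable, "-c", "print('x' * 200000)"], max_output_kb=1)
print(result["status"], len(result["output"]))  # prints "ok 1024"
```

A script that regularly hits the timeout or has its output truncated is a signal to raise the corresponding limit rather than retry.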
Server Performance
Concurrent Sessions
# Increase max concurrent sessions
cortex serve --max-sessions 50
# Or in config:
{
"server": {
"maxSessions": 50,
"sessionTimeout": 3600000
}
}
WebSocket vs REST
Use WebSocket for streaming chat (lower latency, fewer connections):
# WebSocket endpoint
ws://localhost:3000/api/chat
# REST endpoint
POST http://localhost:3000/api/chat
WebSocket is recommended for interactive sessions; REST for batch processing.
Profiling
Enable performance profiling:
cortex chat --profile
Output includes:
- LLM call latency per provider
- Tool execution time
- Memory retrieval time
- Total agent loop time
Benchmarking
# Run built-in benchmarks
cortex benchmark chat --rounds 10
cortex benchmark memory --queries 100
cortex benchmark sandbox --iterations 50
# Compare providers
cortex benchmark providers --rounds 5
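When comparing runs against a baseline, the mean alone can mislead because a single slow request skews it. The sketch below summarizes a list of latencies with mean, median, and a nearest-rank 95th percentile. The `summarize` helper is hypothetical, and the input numbers are placeholders for illustration, not measurements from CortexPrism.

```python
import math
import statistics


def summarize(latencies_ms):
    """Return mean, median, and nearest-rank p95 for a list of latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank method, 1-based
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


# Placeholder numbers for illustration only, not real measurements.
baseline = summarize([480, 510, 495, 530, 2100, 505, 490, 515, 500, 520])
print(baseline)
```

In this sample one outlier pulls the mean well above the median, which is why tracking the median and p95 side by side gives a clearer baseline.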
Production Tuning Checklist
- Cascade router enabled with fast fallback providers
- Memory pruning configured and scheduled
- Database vacuum scheduled (weekly)
- Sandbox timeout and memory limits tuned for workload
- Server max sessions adjusted for expected load
- Docker sandbox enabled (or WASM for performance)
- Profiling data collected to identify bottlenecks
- Benchmarks run to establish baseline