Performance

This guide explains Cicada's performance characteristics, optimization strategies, and best practices for large codebases.

Quick Reference

Operation	Small Project (<1,000 files)	Large Project (1,000-10,000 files)	Very Large (>10,000 files)
Initial Index	5-10 seconds	30-60 seconds	2-5 minutes
Incremental Index	<5 seconds	5-15 seconds	15-30 seconds
Query Response	<1 second	<1 second	<2 seconds
Disk Space	5-10 MB	10-50 MB	50-200 MB

Indexing Performance

Initial Indexing

First-time indexing parses all files in your project:

cd /path/to/project
cicada index

Performance by Language:

Language	Approach	Speed	Notes
Elixir	Tree-sitter	Fast	~200-300 files/sec
Erlang	Tree-sitter	Fast	~200-300 files/sec
Python	SCIP	Medium	Requires scip-python external tool
TypeScript	SCIP	Medium	Requires scip-typescript
Go	SCIP	Medium	Requires scip-go
Rust	SCIP	Medium	Requires rust-analyzer
Other SCIP	SCIP	Medium	External indexer required

Factors Affecting Speed:

File count and size
Language backend (tree-sitter vs SCIP)
External tool availability
CPU and disk I/O

Incremental Indexing

Subsequent runs process only the files that changed:

cicada index  # Auto-detects changes via SHA-256 hashing

How it Works:

Hash each file (SHA-256)
Compare against hashes.json
Re-index only modified files
Merge with existing index
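
The change-detection steps above can be sketched in a few lines of Python. This is a minimal illustration, not Cicada's implementation: it assumes hashes.json is a flat mapping of file path to hex digest, and the names `sha256_of` and `detect_changes` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def detect_changes(files: list[Path], manifest_path: Path) -> tuple[list[Path], dict[str, str]]:
    """Return (files whose hash differs from the manifest, updated hash mapping).

    Assumption: the manifest is a JSON object of {path: sha256 hex digest}.
    """
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {str(path): sha256_of(path) for path in files}
    changed = [path for path in files if old.get(str(path)) != new[str(path)]]
    return changed, new
```

Only the files in `changed` would need re-parsing; the returned mapping would then be written back as the new manifest.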

Typical Performance: <5 seconds for small changes

Watch Mode: Automatically reindex on file changes:

cicada watch

Query Performance

Response Times

Cicada queries are optimized for AI agent workflows:

Query Type	Typical Response	Token Count
`search_module`	<500ms	200-600 tokens
`search_function`	<500ms	300-800 tokens
`query` (keyword)	<1s	150-500 tokens
`git_history`	1-2s	300-1200 tokens

Index Lookup

All queries use an in-memory index for fast access:

Module lookups: O(1) hash table
Function searches: O(n) with early termination
Keyword matching: O(n) with pre-computed scores
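
To make the complexity figures above concrete, here is a small sketch of a hash-table module lookup next to a linear function scan that stops early. The index layout and the names `lookup_module` and `search_functions` are assumptions for illustration; Cicada's real data structures may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory index: module name -> list of function names.
index = {
    "MyApp.User": ["create/1", "create_session/2", "delete/1"],
    "MyApp.Auth": ["authenticate/2", "create_token/1"],
}


def lookup_module(name):
    """O(1) on average: a single dictionary (hash table) access."""
    return index.get(name)


def search_functions(pattern, limit):
    """O(n) scan over all functions that stops once `limit` matches are found."""
    if limit <= 0:
        return []
    matches = []
    for module, functions in index.items():
        for function in functions:
            if fnmatchcase(function, pattern):
                matches.append(f"{module}.{function}")
                if len(matches) >= limit:
                    return matches  # early termination
    return matches
```

The scan is still linear in the worst case, but a small limit lets common queries return after touching only part of the index.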

Token Optimization

Understanding Token Usage

MCP tool responses consume tokens in AI conversations. Cicada provides controls to optimize token usage:

Output Modes

Compact Mode (Default):

# Minimal output - identifiers and file:line only
search_module("MyApp.User", verbose=False)
# ~200-400 tokens

Verbose Mode:

# Full documentation, specs, examples
search_module("MyApp.User", verbose=True)
# ~600-1200 tokens

Git History Optimization

Brief Format (Recommended for AI agents):

git_history("lib/auth.ex", include_pr_description=False)

Output:

## History for lib/auth.ex (5 commits)

- 589000d (2025-11-25) Optimize password hashing (#175)
- 0885638 (2025-11-24) Add 2FA support (#178)
- e68df27 (2025-11-23) Refactor session management (#160)

Token count: ~120 tokens

Standard Format:

git_history("lib/auth.ex", include_pr_description=True)

Includes:

PR titles and URLs
Authors
PR descriptions (truncated)

Token count: ~400-800 tokens

Verbose Format:

git_history(
    "lib/auth.ex",
    include_pr_description=True,
    include_review_comments=True
)

Includes everything in the standard format, plus review comments.

Token count: ~1200-2000 tokens

Search Result Pagination

For large result sets, use pagination to control token usage:

# First 10 results
search_function("create*", head_limit=10)

# Next 10 results
search_function("create*", head_limit=10, offset=10)
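
A caller can walk result pages one at a time and stop as soon as it has enough, which keeps token usage bounded. The sketch below assumes a hypothetical `fetch_page(head_limit, offset)` callable that wraps a paginated tool call and returns a list; it is not part of Cicada's API.

```python
def iter_pages(fetch_page, head_limit):
    """Yield successive pages until an empty or short page signals the end."""
    offset = 0
    while True:
        page = fetch_page(head_limit, offset)
        if not page:
            return
        yield page
        if len(page) < head_limit:
            return  # a short page means there is nothing left to fetch
        offset += head_limit


# Stand-in for a real paginated search over 25 matching functions.
all_results = [f"create_{i}/1" for i in range(25)]


def fake_fetch(head_limit, offset):
    return all_results[offset:offset + head_limit]


for page in iter_pages(fake_fetch, head_limit=10):
    if "create_12/1" in page:
        break  # found what we needed; later pages are never requested
```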

Best Practices for Token Efficiency

Start with compact output

# Query first
query("authentication")  # 150-300 tokens

# Then get details for specific matches
search_function("authenticate/2")  # 300-500 tokens

Use filters to narrow results

# Instead of
search_function("*")  # Returns hundreds of functions

# Do
search_function("*", glob="lib/auth/**")  # Focused results

Disable unused features

# If you don't need usage examples
search_function("authenticate", include_usage_examples=False)

Request specific information

# Instead of
git_history("lib/auth.ex")  # Full history

# Do
git_history("lib/auth.ex", recent=True, max_results=5)  # Last 5 recent changes

Disk Usage

Index Storage

All indexes are stored in ~/.cicada/projects/<repo_hash>/:

~/.cicada/projects/<hash>/
├── index.json         # 2-50 MB (main code index)
├── config.yaml        # <1 KB (project config)
├── hashes.json        # 100-500 KB (file tracking)
├── pr_index.json      # 500 KB-5 MB (PR attribution)
└── index.scip         # 5-50 MB (SCIP languages only)

Total Disk Usage per Project:

Small project (<1,000 files): 5-10 MB
Large project (1,000-10,000 files): 10-50 MB
Very large project (>10,000 files): 50-200 MB
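
To see how your own indexes compare with these ranges, a short script can total the files under each project directory. This is a sketch using only the Python standard library; the helper names are hypothetical, and it assumes the directory layout shown above.

```python
from pathlib import Path


def directory_size_mb(directory: Path) -> float:
    """Sum the sizes of all regular files under `directory`, in MiB."""
    total = sum(p.stat().st_size for p in directory.rglob("*") if p.is_file())
    return total / (1024 * 1024)


def project_sizes(projects_root: Path) -> dict[str, float]:
    """Map each project directory name (the repo hash) to its size in MiB."""
    if not projects_root.is_dir():
        return {}
    return {
        child.name: directory_size_mb(child)
        for child in sorted(projects_root.iterdir())
        if child.is_dir()
    }


if __name__ == "__main__":
    root = Path.home() / ".cicada" / "projects"
    for name, size in project_sizes(root).items():
        print(f"{name}: {size:.1f} MB")
```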

Cleanup

Remove stale indexes:

# Remove the current project's index (run from the project directory)
cicada clean

# Or manually
rm -rf ~/.cicada/projects/<hash>/

Optimization Tips

For Large Codebases (>10,000 files)

Exclude unnecessary directories

Edit ~/.cicada/projects/<hash>/config.yaml:

exclude_patterns:
  - "deps/*"
  - "build/*"
  - "_build/*"
  - "node_modules/*"
  - "vendor/*"
  - "*.gen.ex"  # Generated files
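
Before re-indexing, it can help to check which paths a pattern list would actually exclude. The sketch below uses Python's `fnmatch` glob rules as a stand-in; whether Cicada matches patterns the same way is an assumption, and `is_excluded` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

exclude_patterns = [
    "deps/*",
    "build/*",
    "_build/*",
    "node_modules/*",
    "vendor/*",
    "*.gen.ex",
]


def is_excluded(relative_path, patterns=exclude_patterns):
    """Return True if the project-relative path matches any exclude pattern.

    Under fnmatch rules, `*` also matches `/`, so "deps/*" covers nested files.
    """
    return any(fnmatchcase(relative_path, pattern) for pattern in patterns)


for path in ["lib/auth.ex", "deps/phoenix/lib/phoenix.ex", "lib/schema.gen.ex"]:
    print(path, "excluded" if is_excluded(path) else "indexed")
```

Under these rules a pattern such as "node_modules/*" is anchored at the project root, so a nested apps/web/node_modules directory would not match it; verify the behavior against your own index before relying on it.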

Use file extension filters

file_extensions:
  - ".ex"
  # Remove ".exs" if you don't need test and script files indexed
  - ".exs"

Disable PR indexing (if not needed)

PR indexing requires the gh CLI and adds time to cicada index. Skip it if you don't need PR attribution.

For Slow SCIP Languages

Python Optimization: The SCIP converter was optimized from O(n²) to O(n):

Before: 60 seconds for large projects
After: <5 seconds for same projects
71x faster with recent optimizations
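
The source does not describe the actual change, so the following is only a generic illustration of how an O(n²) pass typically becomes O(n): a nested linear search is replaced by a dictionary built once. The data shapes and function names are hypothetical and are not taken from Cicada's converter.

```python
def resolve_quadratic(definitions, references):
    """For each reference, scan every definition: O(n * m)."""
    resolved = []
    for ref in references:
        for name, location in definitions:
            if name == ref:
                resolved.append((ref, location))
                break
    return resolved


def resolve_linear(definitions, references):
    """Build a lookup table once, then resolve each reference in O(1): O(n + m)."""
    by_name = {}
    for name, location in definitions:
        by_name.setdefault(name, location)  # keep the first definition, as the scan does
    return [(ref, by_name[ref]) for ref in references if ref in by_name]


definitions = [("User.create", "lib/user.ex:10"), ("Auth.login", "lib/auth.ex:4")]
references = ["Auth.login", "User.create", "Missing.fn", "Auth.login"]
assert resolve_quadratic(definitions, references) == resolve_linear(definitions, references)
```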

TypeScript Optimization: Use project-specific tsconfig.json to exclude unnecessary files:

{
  "exclude": [
    "node_modules",
    "dist",
    "build"
  ]
}

Performance Monitoring

Index Statistics

Check index health:

cicada stats

Output:

Project: /path/to/project
Language: elixir
Total modules: 234
Total functions: 1,842
Index size: 12.4 MB
Last indexed: 2025-01-05 10:23:14
Files tracked: 547

Benchmark Tool Usage

Use the benchmarking script to measure tool call frequency:

# Test specific prompts
python tests/benchmark/benchmark_mcp_tool_calls.py \
  --prompt "Show me the User module"

# Run full test suite
python tests/benchmark/benchmark_mcp_tool_calls.py --test-suite

Metrics Tracked:

Total tool calls per query
Tool call breakdown
Response duration
Token usage estimation

Architecture-Specific Performance

Tree-sitter Backend (Elixir, Erlang)

Pros:

Fast incremental parsing
No external dependencies
Real-time indexing (<5s)

Cons:

Syntactic analysis only
Manual AST traversal
Per-language custom extractors

Best for: Languages without SCIP indexers, rapid prototyping

SCIP Backend (Python, TypeScript, Go, Rust, etc.)

Pros:

Compiler-accurate results
Shared converter infrastructure (3x faster to add new languages)
Type-aware analysis
Cross-repository references

Cons:

Requires external indexer
Slower initial indexing
Must rebuild on changes (no incremental indexing)

Best for: Mainstream languages with mature tooling

Troubleshooting Performance Issues

Issue: Slow Initial Indexing

Symptoms: First index takes >5 minutes

Solutions:

Check excluded patterns - may be indexing unnecessary files
For SCIP languages, ensure external indexer is optimized
Check disk I/O - slow drive can impact performance
Reduce file count by excluding test directories

Issue: Large Index Files

Symptoms: index.json >100 MB

Solutions:

Review exclude patterns - likely indexing generated code
Check for large docstrings or comments
Consider splitting into multiple smaller projects
Use cicada clean and reindex with tighter exclusions

Issue: Slow Query Responses

Symptoms: Queries take >2 seconds

Solutions:

Check index file size - may need cleanup
Use more specific queries (avoid wildcards like *)
Add filters: glob, path, recent
Reduce head_limit for initial exploration

Issue: Out of Memory During Indexing

Symptoms: Indexing crashes with memory error

Solutions:

Exclude large generated files
Process in batches (not currently supported; file an issue if you need this)
Increase available RAM
Use SCIP for large codebases (more memory-efficient)

Benchmarks & Comparisons

Token Usage Comparison

Task	Cicada (Compact)	Cicada (Verbose)	Raw Git Commands
File history (5 PRs)	300 tokens	1200 tokens	800 tokens
Module search	250 tokens	800 tokens	N/A (no equivalent)
Function calls	400 tokens	1000 tokens	N/A (no equivalent)

Key Insight: Compact mode optimizes for AI agents while verbose mode provides human-readable context.

Development Effort

Adding new language support:

Approach	Effort	Accuracy	Maintenance
Tree-sitter	12-17 days	Syntactic	Per-language custom code
SCIP	4-6 days	Semantic	Shared converter

Adding a language through SCIP takes roughly a third of the development effort (4-6 days versus 12-17 days) when an indexer already exists.
