Cortex Design Doc
Cortex.cpp is a local AI API platform designed to run and customize machine learning models on user hardware. It provides a lightweight, high-performance runtime for inference tasks, supporting multiple models and custom inference backends.
AI has taken the world by storm since the release of GPT-3, Stable Diffusion, and other models, both in consumer products and in enterprise-level applications, but it has not yet made it into the mainstream of robotics. There is still no software, open source or proprietary, that can serve as the brain of a robot (no pip install a-robots-brain), and that is what we will shape Cortex to be:
The open-source brain for robots: vision, speech, language, tabular, and action -- the cloud is optional.
- Users can download models from external sources (Hugging Face, custom repositories).
- Users can store models in a structured, efficient format.
- Multiple models can be loaded and switched dynamically.
- Users can execute inference requests via CLI or REST API.
- Support for different model formats and quantization levels (GGUF, FP16, INT8, etc.).
- Performance optimizations using CPU and GPU acceleration.
- Expose REST endpoints similar to OpenAI’s API (see the example request after this list):
  - /v1/chat/completions
  - /v1/embeddings
  - /v1/fine_tuning
- Support structured outputs and function calling.
- Users can check model load status and resource usage.
- Provide telemetry on memory and GPU utilization.
- Cross-platform installation via standalone executables.
- Prebuilt binaries for Linux (Debian, Arch), macOS, and Windows.
- Must be able to run 7B-parameter models on systems with at least 8 GB of RAM.
- Inference response times should be under 500 ms for small queries.
- Local execution ensures no data is transmitted externally.
- Secure API with optional authentication.
- Multi-threaded execution to utilize available CPU cores.
- Future support for distributed inference across devices.
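As a rough illustration of the OpenAI-compatible surface, the request below sends a chat completion to a locally running Cortex server. The host, port (39281 is used here as an assumption), and model name are illustrative and may differ from a given installation.

```sh
# Hedged sketch: call the local OpenAI-compatible chat endpoint.
# The port (39281) and model name are assumptions for illustration.
curl http://127.0.0.1:39281/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Summarize what Cortex does in one sentence."}
    ]
  }'
```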
Cortex consists of three main layers:
- CLI / REST API Interface – Handles user interactions.
- Engine Layer – Loads models, manages execution, and optimizes runtime.
- Inference Backend – Executes computations using different backends (Llama.cpp, ONNXRuntime).
Command-Line Interface (CLI)
- Commands: cortex pull, cortex run, cortex ps provide simplified management of models (see the end-to-end sketch after this section).
REST API Server
- Runs as a local server, exposing AI capabilities via HTTP.
Engine Layer
- Manages model loading, unloading, and switching.
- Uses optimized quantized formats for faster inference.
Inference Backend
- Supports multiple engines (default: llama.cpp; future: ONNXRuntime, TensorRT-LLM).
- GPU acceleration where applicable.
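A minimal end-to-end sketch of that flow using the CLI commands named above; the model name is borrowed from the chatbot use case later in this document, and the comments describe intent rather than exact output.

```sh
# Download a model from a supported source (e.g. Hugging Face).
cortex pull llama3.2

# Load the model; it is then served through the local API server.
cortex run llama3.2

# Check model load status and resource usage (memory, GPU).
cortex ps
```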
Deployment & Installation
5.1 Installation Methods
Local Installer: Standalone package with all dependencies.
Network Installer: Downloads dependencies dynamically.
5.2 Supported Platforms
Linux: .deb, .tar.gz for Arch, generic shell script.
macOS: .pkg installer.
Windows: .exe installer.
Use Cases
6.1 Local AI Chatbot
A developer downloads a Llama3 model and runs a chatbot locally using:
cortex pull llama3.2
cortex run llama3.2
6.2 Offline Code Assistant
A developer integrates Cortex.cpp into a VS Code extension for offline code suggestions.
6.3 Document Analysis
A researcher runs a fine-tuned AI model to analyze and categorize documents.
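For the document-analysis use case, the sketch below shows how text could be embedded through the /v1/embeddings endpoint listed earlier; the port, model name, and input text are assumptions following the OpenAI-style request convention.

```sh
# Hedged sketch: request embeddings for a document snippet from the
# local server. Port, model name, and input are illustrative only.
curl http://127.0.0.1:39281/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "input": "Quarterly report: revenue grew 12% year over year."
  }'
```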
Roadmap
- Model fine-tuning support.
- Integration with AMD and Apple Silicon hardware.
- Support for multi-modal AI (text, audio, vision).