Cortex Design Doc

Abstract

Cortex.cpp is a local AI API platform designed to run and customize machine learning models on user hardware. It provides a lightweight, high-performance runtime for inference tasks, supporting multiple models and custom inference backends.

Motivation

AI has taken the world by storm since the release of GPT-3, Stable Diffusion, and other models, reshaping consumer products and enterprise applications, but it has not yet reached mainstream robotics. There is still no software, open source or proprietary, that can serve as the brain of a robot (the pip install a-robots-brain, so to speak). That is what we will shape Cortex to be:

The open-source brain for robots: vision, speech, language, tabular, and action -- the cloud is optional.

Functional Requirements

Model Management

  • Users can download models from external sources (Hugging Face, custom repositories).
  • Users can store models in a structured, efficient format.
  • Multiple models can be loaded and switched dynamically.
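
As an illustration of the download path above, the sketch below pulls a quantized GGUF file from Hugging Face with the huggingface_hub package and stores it in a local model directory. The repository name, file name, and directory are placeholders, not a layout Cortex prescribes.

from pathlib import Path
from huggingface_hub import hf_hub_download

# Placeholder repository and file; substitute the model you actually want.
MODEL_REPO = "TheBloke/Llama-2-7B-GGUF"
MODEL_FILE = "llama-2-7b.Q4_K_M.gguf"

# Keep downloaded weights in a structured local directory.
models_dir = Path.home() / "cortex-models"
models_dir.mkdir(parents=True, exist_ok=True)

local_path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILE, local_dir=models_dir)
print(f"Model stored at {local_path}")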

Inference Execution

  • Users can execute inference requests via CLI or REST API.
  • Support for different quantized formats and precisions (GGUF, FP16, INT8, etc.).
  • Performance optimizations using CPU and GPU acceleration.
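
A minimal sketch of an inference request over the REST API using Python's requests package; the host, port, and model name are assumptions and should match however the local server is actually configured.

import requests

BASE_URL = "http://127.0.0.1:39281"   # assumed local server address

payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Summarize what Cortex does in one sentence."}],
}

# POST to the OpenAI-style chat completions endpoint exposed by the local server.
response = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])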

API Compatibility

  • Expose REST endpoints similar to OpenAI’s API:
    • /v1/chat/completions
    • /v1/embeddings
    • /v1/fine_tuning
  • Support structured outputs and function calling.
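
Because the endpoints mirror OpenAI's, an existing OpenAI client can be pointed at the local server by overriding its base URL, as in the sketch below; the base URL, placeholder API key, and model name are assumptions.

from openai import OpenAI

# The local server does not need a real key; base_url is an assumed default address.
client = OpenAI(base_url="http://127.0.0.1:39281/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Reply with a JSON object listing three robot sensors."}],
)
print(completion.choices[0].message.content)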

System Monitoring

  • Users can check model load status and resource usage.
  • Provide telemetry on memory and GPU utilization.
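
The telemetry above can be gathered locally; the sketch below reads RAM usage with psutil and, when an NVIDIA GPU is present, queries GPU memory through nvidia-smi. It illustrates the data points, not Cortex's internal implementation.

import shutil
import subprocess
import psutil

# System memory utilization.
mem = psutil.virtual_memory()
print(f"RAM: {mem.used / 2**30:.1f} GiB used of {mem.total / 2**30:.1f} GiB ({mem.percent}%)")

# GPU memory utilization, only if nvidia-smi is available on this machine.
if shutil.which("nvidia-smi"):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    for line in out.stdout.strip().splitlines():
        used, total = line.split(", ")
        print(f"GPU: {used} MiB used of {total} MiB")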

Platform Support

  • Cross-platform installation via standalone executables.
  • Prebuilt binaries for Linux (Deb, Arch), macOS, and Windows.

Nonfunctional Requirements

Performance

  • Must be able to run 7B-parameter models on machines with at least 8 GB of RAM.
  • Response times for inference should be under 500 ms for small queries.
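
A rough way to check the 500 ms target is to time a small request end to end; the sketch below is a quick harness (the local address and model name are assumptions), not a formal benchmark.

import time
import requests

BASE_URL = "http://127.0.0.1:39281"   # assumed local server address
payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 8,
}

start = time.perf_counter()
requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=30).raise_for_status()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Round-trip latency: {elapsed_ms:.0f} ms (target: under 500 ms for small queries)")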

Security

  • Local execution ensures no data is transmitted externally.
  • Secure API with optional authentication.

Scalability

  • Multi-threaded execution to utilize available CPU cores.
  • Future support for distributed inference across devices.
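
One way to exercise multi-threaded execution is to fan requests out from a thread pool sized to the local CPU count, as sketched below; the address and model name are assumptions.

import os
from concurrent.futures import ThreadPoolExecutor
import requests

BASE_URL = "http://127.0.0.1:39281"   # assumed local server address

def ask(prompt):
    payload = {"model": "llama3.2", "messages": [{"role": "user", "content": prompt}]}
    r = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=120)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

prompts = [f"Name one use of sensor #{i} on a robot." for i in range(8)]

# One worker per available CPU core, capped at the number of prompts.
with ThreadPoolExecutor(max_workers=min(len(prompts), os.cpu_count() or 1)) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])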

System Architecture

High-Level Design

Cortex consists of three main layers:

  • CLI / REST API Interface – Handles user interactions.
  • Engine Layer – Loads models, manages execution, and optimizes runtime.
  • Inference Backend – Executes computations using different backends (Llama.cpp, ONNXRuntime).

Key Components

Command-Line Interface (CLI)

  • Commands such as cortex pull, cortex run, and cortex ps provide simplified management of models.

REST API

  • Runs as a local server, exposing AI capabilities via HTTP.

Engine Layer

  • Manages model loading, unloading, and switching. Uses optimized quantized formats for faster inference.
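
A minimal sketch of the load/unload/switch behaviour described above, written as illustrative Python rather than the actual C++ engine code; the class and its simple FIFO eviction policy are invented for the example.

class ModelManager:
    """Keep at most `capacity` models resident; evict the oldest on overflow."""

    def __init__(self, load_fn, unload_fn, capacity=1):
        self.load_fn = load_fn        # maps a model name to a backend handle
        self.unload_fn = unload_fn    # releases a handle's memory
        self.capacity = capacity
        self.loaded = {}              # model name -> handle (insertion-ordered)

    def get(self, name):
        if name not in self.loaded:
            if len(self.loaded) >= self.capacity:
                oldest = next(iter(self.loaded))
                self.unload_fn(self.loaded.pop(oldest))
            self.loaded[name] = self.load_fn(name)
        return self.loaded[name]

# Stub usage: switching to a second model evicts the first.
manager = ModelManager(load_fn=lambda n: f"<{n} weights>", unload_fn=lambda h: None, capacity=1)
manager.get("llama3.2")
manager.get("mistral")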

Inference Backend

  • Supports multiple engines (default: llama.cpp; future: ONNXRuntime, TensorRT-LLM).
  • GPU acceleration where applicable.
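
One way to keep backends pluggable is a small interface that each engine implements; the sketch below is illustrative Python (the real backends are native libraries such as llama.cpp), and every name in it is invented for the example.

from abc import ABC, abstractmethod

class InferenceBackend(ABC):
    """Contract each engine (llama.cpp, ONNXRuntime, ...) would fulfil."""

    @abstractmethod
    def load(self, model_path: str) -> None: ...

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 128) -> str: ...

class EchoBackend(InferenceBackend):
    """Stand-in backend used only to show how an engine plugs into the interface."""

    def load(self, model_path: str) -> None:
        self.model_path = model_path

    def generate(self, prompt: str, max_tokens: int = 128) -> str:
        return f"[{self.model_path}] {prompt[:max_tokens]}"

# A registry would map engine names to implementations ("llama.cpp", "onnx", ...).
backend = EchoBackend()
backend.load("llama-3.2.gguf")
print(backend.generate("Hello from the backend layer"))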

Deployment & Installation

Installation Methods

  • Local Installer: Standalone package with all dependencies.
  • Network Installer: Downloads dependencies dynamically.

Supported Platforms

  • Linux: .deb package, .tar.gz for Arch, and a generic shell script.
  • macOS: .pkg installer.
  • Windows: .exe installer.

Use Cases

Local AI Chatbot

A developer downloads a Llama 3.2 model and runs a chatbot locally using:

cortex pull llama3.2
cortex run llama3.2

AI-Powered Code Completion

A developer integrates Cortex.cpp into a VS Code extension for offline code suggestions.

Private AI Search Engine

A researcher runs a fine-tuned AI model to analyze and categorize documents.

Future Enhancements

  • Model fine-tuning support.
  • Integration with AMD and Apple Silicon.
  • Support for multi-modal AI (text, audio, vision).