Cortex Design Doc

Abstract

Cortex.cpp is a local AI API platform designed to run and customize machine learning models on user hardware. It provides a lightweight, high-performance runtime for inference tasks, supporting multiple models and custom inference backends.

Motivation

AI has taken the world by storm since the release of GPT-3, Stable Diffusion, and other models, reshaping consumer products and enterprise applications, but it has not yet reached mainstream robotics. There is still no software, open source or proprietary, that can serve as the brain of a robot (the pip install a-robots-brain, so to speak). That is what we will shape Cortex to be:

The open-source brain for robots: vision, speech, language, tabular, and action -- the cloud is optional.

Functional Requirements

Model Management

  • Users can download models from external sources (Hugging Face, custom repositories).
  • Users can store models in a structured, efficient format.
  • Multiple models can be loaded and switched dynamically.
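
As an illustration of the download path above, the sketch below pulls a quantized GGUF file from Hugging Face with the huggingface_hub package and stores it in a local model directory. The repository name, file name, and directory are placeholders, not a layout Cortex prescribes.

from pathlib import Path
from huggingface_hub import hf_hub_download

# Placeholder repository and file; substitute the model you actually want.
MODEL_REPO = "TheBloke/Llama-2-7B-GGUF"
MODEL_FILE = "llama-2-7b.Q4_K_M.gguf"

# Keep downloaded weights in a structured local directory.
models_dir = Path.home() / "cortex-models"
models_dir.mkdir(parents=True, exist_ok=True)

local_path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILE, local_dir=models_dir)
print(f"Model stored at {local_path}")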

Inference Execution

  • Users can execute inference requests via CLI or REST API.
  • Support for different quantized formats and precisions (GGUF, FP16, INT8, etc.).
  • Performance optimizations using CPU and GPU acceleration.
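
A minimal sketch of an inference request over the REST API using Python's requests package; the host, port, and model name are assumptions and should match however the local server is actually configured.

import requests

BASE_URL = "http://127.0.0.1:39281"   # assumed local server address

payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Summarize what Cortex does in one sentence."}],
}

# POST to the OpenAI-style chat completions endpoint exposed by the local server.
response = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])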

API Compatibility

  • Expose REST endpoints similar to OpenAI’s API:
    • /v1/chat/completions
    • /v1/embeddings
    • /v1/fine_tuning
  • Support structured outputs and function calling.
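
Because the endpoints mirror OpenAI's, an existing OpenAI client can be pointed at the local server by overriding its base URL, as in the sketch below; the base URL, placeholder API key, and model name are assumptions.

from openai import OpenAI

# The local server does not need a real key; base_url is an assumed default address.
client = OpenAI(base_url="http://127.0.0.1:39281/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Reply with a JSON object listing three robot sensors."}],
)
print(completion.choices[0].message.content)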

System Monitoring

  • Users can check model load status and resource usage.
  • Provide telemetry on memory and GPU utilization.
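
The telemetry above can be gathered locally; the sketch below reads RAM usage with psutil and, when an NVIDIA GPU is present, queries GPU memory through nvidia-smi. It illustrates the data points, not Cortex's internal implementation.

import shutil
import subprocess
import psutil

# System memory utilization.
mem = psutil.virtual_memory()
print(f"RAM: {mem.used / 2**30:.1f} GiB used of {mem.total / 2**30:.1f} GiB ({mem.percent}%)")

# GPU memory utilization, only if nvidia-smi is available on this machine.
if shutil.which("nvidia-smi"):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    for line in out.stdout.strip().splitlines():
        used, total = line.split(", ")
        print(f"GPU: {used} MiB used of {total} MiB")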

Platform Support

  • Cross-platform installation via standalone executables.
  • Prebuilt binaries for Linux (Deb, Arch), macOS, and Windows.

Nonfunctional Requirements

Performance

  • Must be able to run 7B-parameter models on machines with at least 8 GB of RAM.
  • Response times for inference should be under 500 ms for small queries.
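
A rough way to check the 500 ms target is to time a small request end to end; the sketch below is a quick harness (the local address and model name are assumptions), not a formal benchmark.

import time
import requests

BASE_URL = "http://127.0.0.1:39281"   # assumed local server address
payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_tokens": 8,
}

start = time.perf_counter()
requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=30).raise_for_status()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Round-trip latency: {elapsed_ms:.0f} ms (target: under 500 ms for small queries)")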

Security

  • Local execution ensures no data is transmitted externally.
  • Secure API with optional authentication.

Scalability

  • Multi-threaded execution to utilize available CPU cores.
  • Future support for distributed inference across devices.
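
One way to exercise multi-threaded execution is to fan requests out from a thread pool sized to the local CPU count, as sketched below; the address and model name are assumptions.

import os
from concurrent.futures import ThreadPoolExecutor
import requests

BASE_URL = "http://127.0.0.1:39281"   # assumed local server address

def ask(prompt):
    payload = {"model": "llama3.2", "messages": [{"role": "user", "content": prompt}]}
    r = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=120)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

prompts = [f"Name one use of sensor #{i} on a robot." for i in range(8)]

# One worker per available CPU core, capped at the number of prompts.
with ThreadPoolExecutor(max_workers=min(len(prompts), os.cpu_count() or 1)) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])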

System Architecture

High-Level Design

Cortex consists of three main layers:

  • CLI / REST API Interface – Handles user interactions.
  • Engine Layer – Loads models, manages execution, and optimizes runtime.
  • Inference Backend – Executes computations using different backends (Llama.cpp, ONNXRuntime).

Key Components

Command-Line Interface (CLI)

  • Commands such as cortex pull, cortex run, and cortex ps provide simplified management of models.

REST API

  • Runs as a local server, exposing AI capabilities via HTTP.

Engine Layer

  • Manages model loading, unloading, and switching. Uses optimized quantized formats for faster inference.
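
A minimal sketch of the load/unload/switch behaviour described above, written as illustrative Python rather than the actual C++ engine code; the class and its simple FIFO eviction policy are invented for the example.

class ModelManager:
    """Keep at most `capacity` models resident; evict the oldest on overflow."""

    def __init__(self, load_fn, unload_fn, capacity=1):
        self.load_fn = load_fn        # maps a model name to a backend handle
        self.unload_fn = unload_fn    # releases a handle's memory
        self.capacity = capacity
        self.loaded = {}              # model name -> handle (insertion-ordered)

    def get(self, name):
        if name not in self.loaded:
            if len(self.loaded) >= self.capacity:
                oldest = next(iter(self.loaded))
                self.unload_fn(self.loaded.pop(oldest))
            self.loaded[name] = self.load_fn(name)
        return self.loaded[name]

# Stub usage: switching to a second model evicts the first.
manager = ModelManager(load_fn=lambda n: f"<{n} weights>", unload_fn=lambda h: None, capacity=1)
manager.get("llama3.2")
manager.get("mistral")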

Inference Backend

  • Supports multiple engines (default: llama.cpp; future: ONNXRuntime, TensorRT-LLM).
  • GPU acceleration where applicable.
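
One way to keep backends pluggable is a small interface that each engine implements; the sketch below is illustrative Python (the real backends are native libraries such as llama.cpp), and every name in it is invented for the example.

from abc import ABC, abstractmethod

class InferenceBackend(ABC):
    """Contract each engine (llama.cpp, ONNXRuntime, ...) would fulfil."""

    @abstractmethod
    def load(self, model_path: str) -> None: ...

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 128) -> str: ...

class EchoBackend(InferenceBackend):
    """Stand-in backend used only to show how an engine plugs into the interface."""

    def load(self, model_path: str) -> None:
        self.model_path = model_path

    def generate(self, prompt: str, max_tokens: int = 128) -> str:
        return f"[{self.model_path}] {prompt[:max_tokens]}"

# A registry would map engine names to implementations ("llama.cpp", "onnx", ...).
backend = EchoBackend()
backend.load("llama-3.2.gguf")
print(backend.generate("Hello from the backend layer"))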

Deployment & Installation

Installation Methods

  • Local Installer: Standalone package with all dependencies.
  • Network Installer: Downloads dependencies dynamically.

Supported Platforms

  • Linux: .deb package, .tar.gz for Arch, and a generic shell script.
  • macOS: .pkg installer.
  • Windows: .exe installer.

Use Cases

Local AI Chatbot

A developer downloads a Llama 3.2 model and runs a chatbot locally using:

cortex pull llama3.2
cortex run llama3.2

AI-Powered Code Completion

A developer integrates Cortex.cpp into a VS Code extension for offline code suggestions.

Private AI Search Engine

A researcher runs a fine-tuned AI model to analyze and categorize documents.

Future Enhancements

  • Model fine-tuning support.
  • Integration with AMD and Apple Silicon.
  • Support for multi-modal AI (text, audio, vision).