High-Performance Matching Engine in C

A production-grade, cache-optimized order matching engine designed for high-frequency trading (HFT) applications. Built in C11 with a focus on sub-microsecond latency, deterministic performance, and safety-critical coding standards.

Key Features

Core Matching Engine

Price-Time Priority: Orders matched by best price, then earliest timestamp (FIFO)
Order Types: Limit orders and market orders (price = 0)
Partial Fills: Large orders match against multiple resting orders
Multi-Symbol Support: Independent order books per trading symbol
Order Cancellation: Cancel individual orders or flush entire books

High-Performance Architecture

Zero-Allocation Hot Path: Memory pools pre-allocate all structures at startup
Cache-Line Optimized: All hot structures aligned to 64-byte boundaries
Open-Addressing Hash Tables: Linear probing for cache-friendly O(1) lookups
Lock-Free Queues: SPSC queues with false-sharing prevention and batch operations
RDTSC Timestamps: Serialized rdtscp for ~5 cycle timestamps on x86-64
Packed Enums: uint8_t enums save 3 bytes per field vs standard enums
Batch Dequeue: Amortizes atomic operations across up to 32 messages

Network Layer

UDP Mode: High-throughput single-client with multicast market data
TCP Mode: Multi-client support with per-client message routing
Dual-Processor Mode: Horizontal scaling with symbol-based partitioning (A-M / N-Z)
Low-Latency Sockets: TCP_NODELAY, TCP_QUICKACK, SO_BUSY_POLL optimizations
DPDK Kernel Bypass: Optional ~25x latency reduction for UDP/multicast (Linux)

Protocol Support

CSV Protocol: Human-readable format for debugging and testing
Binary Protocol: Packed structs with compile-time size verification
Auto-Detection: Per-client protocol detection (binary vs CSV)
Branchless Routing: Symbol-based routing with zero conditional branches

Safety & Reliability

Power of Ten Compliant: Follows NASA/JPL safety-critical coding standards
Compile-Time Verification: _Static_assert validates all struct layouts and offsets
Bounded Operations: All loops have fixed upper bounds (Rule 2)
No Dynamic Allocation: After initialization, zero malloc/free calls (Rule 3)
Defensive Programming: Minimum 2 assertions per function (Rule 5)
Return Value Checking: All system calls verified (Rule 7)

Performance Characteristics

Metric	Socket Mode	DPDK Mode	Notes
UDP RX latency	~5 µs	~200 ns	Kernel bypass eliminates syscalls
Order latency	< 1 µs	< 1 µs	Engine processing unchanged
Timestamp overhead	~5 cycles	~5 cycles	Serialized `rdtscp` on x86-64
Hash lookup	O(1)	O(1)	1-2 cache lines
Memory per order	64 bytes	64 bytes	Exactly one cache line
Queue throughput	> 10M msgs/sec	> 10M msgs/sec	Lock-free SPSC
Packets per second	~1M pps	~10M+ pps	DPDK poll-mode driver

Cache Optimization Summary

Every data structure has been optimized for modern CPU cache hierarchies:

Structure	Size	Alignment	Optimization
`order_t`	64 bytes	64-byte	One order = one cache line
`price_level_t`	64 bytes	64-byte	Hot fields in first 32 bytes
`order_map_slot_t`	32 bytes	natural	2 slots per cache line
`output_msg_envelope_t`	64 bytes	64-byte	Perfect for DMA transfers
`client_entry_t`	64 bytes	64-byte	Prevents false sharing in registry
`udp_client_entry_t`	32 bytes	natural	2 entries per cache line
`tcp_client_t`	optimized	64-byte	Hot fields grouped first
Queue head/tail	8 bytes each	64-byte padding	Prevents false sharing

DPDK Kernel Bypass

Overview

The matching engine supports optional DPDK (Data Plane Development Kit) integration for ultra-low-latency UDP packet I/O. DPDK bypasses the kernel network stack entirely, reducing UDP receive latency from ~5µs to ~200ns.

┌─────────────────────────────────────────────────────────────────────┐
│                    Standard Socket Path (~5µs)                       │
│  NIC → IRQ → Kernel Driver → sk_buff → Socket Buffer → syscall → App│
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    DPDK Kernel Bypass (~200ns)                       │
│  NIC → DMA to Huge Pages → Poll-Mode Driver → App                   │
└─────────────────────────────────────────────────────────────────────┘

Architecture

The transport layer uses a clean abstraction that allows compile-time backend selection:

┌─────────────────────────────────────────────────────────────────────┐
│                    Application Code                                  │
│  udp_transport_t* t = udp_transport_create(&config, ...);           │
│  udp_transport_start(t);                                            │
│  udp_transport_send_to_client(t, client_id, data, len);             │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
         ┌─────────────────────┴─────────────────────┐
         │                                           │
         ▼                                           ▼
┌─────────────────────┐                   ┌─────────────────────┐
│ Socket Backend      │                   │ DPDK Backend        │
│ (USE_DPDK=OFF)      │                   │ (USE_DPDK=ON)       │
├─────────────────────┤                   ├─────────────────────┤
│ • recvfrom()        │                   │ • rte_eth_rx_burst()│
│ • sendto()          │                   │ • rte_eth_tx_burst()│
│ • Works everywhere  │                   │ • Linux only        │
│ • No special setup  │                   │ • Requires sudo     │
│ • ~5µs latency      │                   │ • ~200ns latency    │
└─────────────────────┘                   └─────────────────────┘

Hybrid Approach

The engine uses a hybrid network strategy:

Protocol	Backend	Rationale
UDP RX/TX	DPDK	Hot path, connectionless, simple 8-byte header
Multicast TX	DPDK	Market data broadcast, high throughput
TCP	Kernel	Complex state machine (SYN/ACK/FIN), flow control

Platform Support

Platform	DPDK Support	Notes
Linux (bare metal)	✅ Full	Best performance
Linux (Cloud/AWS)	✅ Full	ENA driver, dedicated instances
WSL2	❌ No	Virtualized network
macOS	❌ No	Different kernel (XNU)
Windows	❌ No	DPDK is Linux/FreeBSD only

DPDK Prerequisites (Linux)

# 1. Install DPDK
sudo apt update
sudo apt install -y dpdk dpdk-dev libdpdk-dev dpdk-doc

# 2. Allocate huge pages (2GB recommended)
sudo sh -c 'echo 1024 > /proc/sys/vm/nr_hugepages'

# Make permanent (survives reboot):
echo 'vm.nr_hugepages = 1024' | sudo tee -a /etc/sysctl.conf

# 3. Load UIO module
sudo modprobe uio_pci_generic

# Make permanent:
echo 'uio_pci_generic' | sudo tee -a /etc/modules-load.d/dpdk.conf

# 4. Verify setup
./build.sh check-dpdk

Building with DPDK

# Standard build (sockets, works everywhere)
./build.sh build

# DPDK build with virtual device (no physical NIC needed)
./build.sh build-dpdk          # Uses net_null (TX benchmarks)
./build.sh build-dpdk ring     # Uses net_ring (loopback testing)

# DPDK build for physical NIC
./build.sh build-dpdk ""       # Empty string = physical NIC

Running with DPDK

# Check prerequisites first
./build.sh check-dpdk

# Run with DPDK (requires sudo for huge pages)
./build.sh run-dpdk            # Interactive mode
./build.sh run-dpdk-quiet      # Benchmark mode
./build.sh run-dpdk-benchmark  # Binary + quiet

# Run DPDK tests
./build.sh test-dpdk

# DPDK benchmark
./build.sh benchmark-dpdk 24   # 10M matching pairs

Virtual Device Testing

DPDK provides virtual devices for testing without a physical NIC:

Device	Command	Use Case
`net_null`	`./build.sh build-dpdk`	TX benchmarks, drops all packets
`net_ring`	`./build.sh build-dpdk ring`	Loopback testing, internal ring

This allows full DPDK code path testing on any Linux system with huge pages.

Binding a Physical NIC

When ready to use a real NIC:

# 1. Identify your NIC
dpdk-devbind.py --status

# 2. Bring interface down
sudo ip link set eth0 down

# 3. Bind to DPDK driver
sudo dpdk-devbind.py --bind=uio_pci_generic 0000:05:00.0

# 4. Build without virtual device
./build.sh build-dpdk ""

# 5. Run
./build.sh run-dpdk

# To unbind (return to kernel):
sudo dpdk-devbind.py --bind=igc 0000:05:00.0
sudo ip link set eth0 up

Supported NICs

Vendor	Driver	Models
Intel	`igc`	I225-V, I226-V (2.5GbE)
Intel	`ixgbe`	X520, X540, X550 (10GbE)
Intel	`i40e`	X710, XL710, XXV710 (10/25/40GbE)
Intel	`ice`	E810 (25/100GbE)
Mellanox	`mlx5`	ConnectX-4/5/6 (10/25/40/100GbE)
Amazon	`ena`	Elastic Network Adapter (cloud)

DPDK File Structure

matching-engine-c/
├── include/network/
│   ├── transport_types.h        # Shared types (socket & DPDK)
│   ├── udp_transport.h          # Abstract UDP transport API
│   ├── multicast_transport.h    # Abstract multicast API
│   │
│   └── dpdk/
│       ├── dpdk_config.h        # Configuration constants
│       └── dpdk_init.h          # DPDK initialization API
│
├── src/network/
│   ├── udp_socket.c             # Socket backend
│   ├── multicast_socket.c       # Socket backend
│   │
│   └── dpdk/
│       ├── dpdk_init.c          # EAL, mempool, port setup
│       ├── dpdk_udp.c           # UDP via rte_eth_rx/tx_burst
│       └── dpdk_multicast.c     # Multicast via DPDK
│
├── tests/
│   └── test_dpdk.c              # DPDK transport tests
│
└── docs/
    └── KERNEL_BYPASS_SETUP.md   # Detailed setup guide

Building

Prerequisites

GCC 7+ or Clang 6+ with C11 support
CMake 3.10+
Ninja (recommended) or Make
Linux (for full performance) or macOS (with platform fallbacks)
DPDK 21.11+ (optional, for kernel bypass)

Quick Start

# Clone and build
git clone https://github.com/yourrepo/matching-engine-c.git
cd matching-engine-c

# Build with sockets (works everywhere)
./build.sh build

# Run tests
./build.sh test

# Start server
./build.sh run

Build Commands

# Standard builds
./build.sh build           # Release with sockets
./build.sh debug           # Debug build
./build.sh rebuild         # Clean + rebuild

# DPDK builds (Linux only)
./build.sh build-dpdk      # Release with DPDK (net_null vdev)
./build.sh build-dpdk ring # Release with DPDK (net_ring vdev)

# Specialized builds
./build.sh valgrind        # Run tests under valgrind

CMake Options

# Standard socket build
cmake .. -DCMAKE_BUILD_TYPE=Release

# DPDK kernel bypass
cmake .. -DUSE_DPDK=ON -DDPDK_VDEV=null

# Debug with AddressSanitizer
cmake .. -DCMAKE_BUILD_TYPE=Debug -DENABLE_ASAN=ON

# Valgrind-compatible (no AVX-512)
cmake .. -DVALGRIND_BUILD=ON

Compiler Flags

The project compiles with strict warnings:

-Wall -Wextra -Wpedantic -Werror -march=native -O3

Usage

Server Modes

# Standard socket mode
./build.sh run              # CSV format, dual processor
./build.sh run-binary       # Binary format
./build.sh run-quiet        # Benchmark mode (minimal output)
./build.sh run-benchmark    # Binary + quiet

# DPDK kernel bypass mode (Linux, requires sudo)
./build.sh run-dpdk         # CSV format
./build.sh run-dpdk-benchmark # Binary + quiet

Client Commands

./build.sh client           # Interactive TCP client
./build.sh client udp       # Interactive UDP client
./build.sh client-scenario 24 udp  # Run benchmark scenario
./build.sh scenarios        # List all scenarios

Benchmark Commands

# Socket benchmarks
./build.sh benchmark-match 24    # 10M matching pairs
./build.sh benchmark-dual 26     # 250M pairs, dual processor

# DPDK benchmarks (requires sudo)
./build.sh benchmark-dpdk 24     # 10M pairs with kernel bypass

Message Format

Input Messages (CSV):

# New Order: N, user_id, symbol, price, qty, side, order_id
N, 1, IBM, 150, 100, B, 1001

# Cancel: C, symbol, user_id, order_id  
C, IBM, 1, 1001

# Flush all books: F
F

Input Messages (Binary):

Magic (0x4D) | Type (1 byte) | Payload (variable)
New Order:  0x4D 0x01 [user_id:4][symbol:8][price:4][qty:4][side:1][order_id:4] = 27 bytes
Cancel:     0x4D 0x02 [user_id:4][order_id:4] = 10 bytes  
Flush:      0x4D 0x03 = 2 bytes

Output Messages:

# Acknowledgment
A, user_id, order_id

# Trade
T, buy_user, buy_order, sell_user, sell_order, price, qty, symbol

# Top of Book Update
B, symbol, side, price, qty

# Cancel Acknowledgment
X, user_id, order_id

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         Network Layer                                │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐  │
│  │ UDP Transport   │  │ TCP Listener    │  │ Multicast Transport │  │
│  │ ═══════════════ │  │ • TCP_NODELAY   │  │ ═══════════════════ │  │
│  │ Socket Backend: │  │ • TCP_QUICKACK  │  │ Socket Backend:     │  │
│  │ • SO_BUSY_POLL  │  │ • epoll/kqueue  │  │ • IP_MULTICAST_TTL  │  │
│  │ • 10MB rx buf   │  │ • edge-trigger  │  │ • Batch send        │  │
│  │ ─────────────── │  │                 │  │ ─────────────────── │  │
│  │ DPDK Backend:   │  │                 │  │ DPDK Backend:       │  │
│  │ • rx_burst()    │  │                 │  │ • tx_burst()        │  │
│  │ • Poll mode     │  │                 │  │ • Zero-copy         │  │
│  │ • ~200ns RX     │  │                 │  │ • Multicast MAC     │  │
│  └────────┬────────┘  └────────┬────────┘  └──────────▲──────────┘  │
└───────────┼─────────────────────┼─────────────────────┼─────────────┘
            │                     │                     │
            ▼                     ▼                     │
┌───────────────────────────────────────────────────────┼─────────────┐
│                  Lock-Free SPSC Queues                │             │
│  ┌─────────────────────────────┐  ┌───────────────────┼───────────┐ │
│  │ Input Queue                 │  │ Output Queue      │           │ │
│  │ • 64K × 64B envelopes       │  │ • 64K × 64B       │           │ │
│  │ • Batch dequeue (32/op)     │  │ • Non-atomic stats│           │ │
│  │ • Cache-line aligned        │  │ • Round-robin     ◄───────────┘ │
│  └──────────────┬──────────────┘  └───────────────────┘             │
└─────────────────┼───────────────────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Processor Thread                                │
│  • Batch dequeue up to 32 messages (single atomic op)               │
│  • Configurable spin-wait (PAUSE/YIELD) or nanosleep                │
│  • Prefetch next message while processing current                   │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       Matching Engine                                │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ Symbol Map (open-addressing, 512 slots)                       │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                              │                                       │
│              ┌───────────────┼───────────────┐                      │
│              ▼               ▼               ▼                      │
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐             │
│  │ Order Book    │ │ Order Book    │ │ Order Book    │ ...         │
│  │ IBM           │ │ AAPL          │ │ NVDA          │             │
│  │ Bids/Asks     │ │ Bids/Asks     │ │ Bids/Asks     │             │
│  │ Order Map     │ │ Order Map     │ │ Order Map     │             │
│  └───────────────┘ └───────────────┘ └───────────────┘             │
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ Memory Pools (shared across all books)                        │  │
│  │ Order Pool: 10,000 × 64 bytes = 640 KB (pre-allocated)       │  │
│  │ Zero malloc/free in hot path (Rule 3 compliant)              │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

Dual-Processor Mode

For horizontal scaling, the engine supports two processor threads with symbol-based partitioning:

                    ┌──────────────────────┐
                    │ Symbol Router        │
                    │ Branchless:          │
                    │ return (int)is_n_to_z│
                    └───────────┬──────────┘
                                │
              ┌─────────────────┴─────────────────┐
              ▼                                   ▼
    ┌─────────────────────┐             ┌─────────────────────┐
    │ Processor 0         │             │ Processor 1         │
    │ Symbols: A-M        │             │ Symbols: N-Z        │
    │ Input Queue 0       │             │ Input Queue 1       │
    │ Output Queue 0      │             │ Output Queue 1      │
    └─────────────────────┘             └─────────────────────┘

Project Structure

matching-engine-c/
├── include/
│   ├── core/
│   │   ├── order.h              # 64-byte cache-aligned order struct
│   │   ├── order_book.h         # Price levels, open-addressing order map
│   │   └── matching_engine.h    # Multi-symbol engine, symbol map
│   ├── protocol/
│   │   ├── message_types.h      # Packed uint8_t enums
│   │   ├── symbol_router.h      # Branchless A-M/N-Z routing
│   │   ├── binary/              # Binary protocol implementation
│   │   └── csv/                 # CSV protocol implementation
│   ├── threading/
│   │   ├── lockfree_queue.h     # SPSC queue with batch dequeue
│   │   ├── processor.h          # Batched stats, spin-wait config
│   │   └── queues.h             # Queue type instantiations
│   ├── network/
│   │   ├── transport_types.h    # Shared transport types
│   │   ├── udp_transport.h      # Abstract UDP API
│   │   ├── multicast_transport.h # Abstract multicast API
│   │   ├── tcp_listener.h       # TCP multi-client
│   │   └── dpdk/                # DPDK-specific headers
│   └── platform/
│       └── timestamps.h         # RDTSC / clock_gettime abstraction
├── src/
│   ├── core/                    # Matching logic
│   ├── protocol/                # Message parsing/formatting
│   ├── threading/               # Processor implementation
│   ├── network/
│   │   ├── udp_socket.c         # Socket backend
│   │   ├── multicast_socket.c   # Socket backend
│   │   └── dpdk/                # DPDK backend
│   └── modes/                   # Server mode entry points
├── tests/
│   ├── core/                    # Order book & engine tests
│   ├── protocol/                # Message parsing tests
│   ├── threading/               # Queue tests
│   └── test_dpdk.c              # DPDK transport tests
├── docs/
│   ├── ARCHITECTURE.md          # Design documentation
│   └── KERNEL_BYPASS_SETUP.md   # DPDK setup guide
├── CMakeLists.txt
└── build.sh                     # Build automation script

Testing

# Run all tests
./build.sh test

# Run with valgrind
./build.sh valgrind

# Run DPDK tests (requires sudo)
./build.sh test-dpdk

# Memory checking
valgrind --leak-check=full ./build/matching_engine_tests

Power of Ten Compliance

This codebase follows Gerard Holzmann's "Power of Ten" rules:

Rule	Implementation	Verification
1. No goto, setjmp, recursion	All control flow is structured	Code review
2. Fixed loop bounds	`MAX_PROBE_LENGTH`, `MAX_MATCH_ITERATIONS`	`_Static_assert`
3. No malloc after init	Memory pools pre-allocate everything	Code review
4. Functions ≤ 60 lines	Large functions split	`wc -l`
5. ≥ 2 assertions per function	Preconditions + postconditions	`grep -c assert`
6. Smallest variable scope	Declared at point of use	C99 style
7. Check all return values	pthread, clock_gettime, socket ops	Code review
8. Limited preprocessor	Simple macros, no complex logic	`grep -c '#define'`
9. Restrict pointer use	Temp variables eliminate `(*ptr)->field`	Code review
10. Compile warning-free	`-Wall -Wextra -Wpedantic -Werror`	CI build

Assertion Counts by Module

Module	Assertions	Notes
core/	~120	order.h, order_book.c, matching_engine.c
threading/	~80	lockfree_queue.h, processor.c
protocol/	~100	parsers, formatters, validators
network/	~230	Socket + DPDK implementations
Total	~530+	Minimum 2 per function

Platform Support

Feature	Linux x86-64	macOS Intel	macOS ARM
Socket mode	✅	✅	✅
DPDK mode	✅	❌	❌
Cache alignment	64-byte	64-byte	64-byte
Timestamps	`rdtscp`	`rdtscp`	`clock_gettime`
Spin-wait hint	`PAUSE`	`PAUSE`	`YIELD`
Event loop	epoll	kqueue	kqueue
`TCP_QUICKACK`	✅	❌	❌
`SO_BUSY_POLL`	✅	❌	❌

References

Power of Ten - Rules for Developing Safety Critical Code - Gerard Holzmann, NASA/JPL
What Every Programmer Should Know About Memory - Ulrich Drepper
Lock-Free Data Structures - CMU
DPDK Programmer's Guide - DPDK Project

Changelog

v2.2 (December 2024)

DPDK Kernel Bypass: Full implementation with ~25x latency reduction
Transport Abstraction: Clean API supporting socket and DPDK backends
Virtual Device Support: net_null and net_ring for testing without physical NIC
DPDK Files: dpdk_init.c, dpdk_udp.c, dpdk_multicast.c (2,347 lines, 74 assertions)
build.sh: New commands: check-dpdk, build-dpdk, run-dpdk, test-dpdk
Documentation: KERNEL_BYPASS_SETUP.md with complete setup guide

v2.1 (December 2024)

Network: Socket optimizations (TCP_NODELAY, TCP_QUICKACK, SO_BUSY_POLL)
Network: Thread-local parsers/formatters (eliminates global state)
Network: Kernel bypass abstraction points [KB-1] through [KB-5]
Protocol: _Static_assert for all binary struct sizes and offsets
Protocol: Truly branchless symbol routing (no ternary operator)

v2.0 (December 2024)

Critical bug fix: Iterator invalidation in order_book_cancel_client_orders
Critical bug fix: RDTSC serialization (use rdtscp instead of rdtsc)
Performance: Lock-free queue batch dequeue (~20-30x speedup)
Performance: Non-atomic queue statistics (-45-90 cycles/message)
Safety: Rule 5 compliance (minimum 2 assertions per function)
Compatibility: macOS platform support (Intel and Apple Silicon)

v1.0 (Initial Release)

Core matching engine with price-time priority
Lock-free SPSC queues
UDP and TCP modes
Dual-processor support

Name		Name	Last commit message	Last commit date
Latest commit History 179 Commits
data		data
documentation		documentation
include		include
src		src
tests		tests
tools		tools
.dockerignore		.dockerignore
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
Dockerfile		Dockerfile
README.md		README.md
build.sh		build.sh
check_dpdk_prereq.sh		check_dpdk_prereq.sh
check_sizes.c		check_sizes.c
docker-compose.yml		docker-compose.yml
install_dpdk.sh		install_dpdk.sh

tembolo1284/matching-engine-c

Folders and files

Latest commit

History

Repository files navigation

High-Performance Matching Engine in C

Key Features

Core Matching Engine

High-Performance Architecture

Network Layer

Protocol Support

Safety & Reliability

Performance Characteristics

Cache Optimization Summary

DPDK Kernel Bypass

Overview

Architecture

Hybrid Approach

Platform Support

DPDK Prerequisites (Linux)

Building with DPDK

Running with DPDK

Virtual Device Testing

Binding a Physical NIC

Supported NICs

DPDK File Structure

Building

Prerequisites

Quick Start

Build Commands

CMake Options

Compiler Flags

Usage

Server Modes

Client Commands

Benchmark Commands

Message Format

Architecture Overview

Dual-Processor Mode

Project Structure

Testing

Power of Ten Compliance

Assertion Counts by Module

Platform Support

References

Changelog

v2.2 (December 2024)

v2.1 (December 2024)

v2.0 (December 2024)

v1.0 (Initial Release)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages