This guide focuses on NxD inference models, decoder graphs, attention, and porting practices. For project-wide workflows see AGENTS.md.
- Context encoding graph: multi-token prompt → KV cache
- Token generation graph: 1 token in → 1 token out
- Speculation graph: optional multi-token proposal
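Because each graph is compiled for fixed shapes, a variable-length prompt must be padded to a pre-compiled sequence bucket before the context-encoding graph can run. A minimal host-side sketch of that idea (bucket sizes and the `pad_to_bucket` helper are illustrative, not library APIs):

```python
def pad_to_bucket(token_ids, buckets=(128, 512, 2048), pad_id=0):
    """Pad a prompt to the smallest compiled bucket length.

    Neuron graphs are compiled for fixed shapes, so variable-length
    prompts are padded up to one of the pre-compiled buckets.
    (Bucket sizes here are illustrative, not the library defaults.)
    """
    n = len(token_ids)
    for b in buckets:
        if n <= b:
            return token_ids + [pad_id] * (b - n), n
    raise ValueError(f"prompt length {n} exceeds largest bucket {buckets[-1]}")
```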
Implementation details live in:
- optimum/neuron/models/inference/backend/modules/decoder/modeling_decoder.py
- optimum/neuron/models/inference/backend/modules/decoder/decoder_builders.py
- optimum/neuron/models/inference/backend/modules/decoder/decoder_wrappers.py
KV cache is managed by KVCacheManager with a BHSD layout and in-place aliasing.
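To make the BHSD (batch, heads, seq, head_dim) layout concrete, here is a hypothetical flat-buffer index computation (not the KVCacheManager API). In-place aliasing means token generation writes each new K/V entry into this pre-allocated buffer at its sequence position instead of reallocating, keeping shapes static across steps:

```python
def kv_write_index(b, h, s, d, H, S, D):
    """Flat-buffer offset for element (b, h, s, d) in a BHSD cache.

    H, S, D are the per-batch head count, max sequence length, and
    head dimension baked in at compile time. Illustrative helper only.
    """
    return ((b * H + h) * S + s) * D + d
```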
Sampling on NeuronCores uses nxd_topk, nxd_argmax, and NKI cumsum kernels.
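A host-side illustration of the same math those kernels implement (top-k selection, softmax, cumulative sum, inverse-CDF draw). This sketch only shows the algorithm; the device implementation uses the nxd_topk/argmax and NKI cumsum kernels, and the function name here is hypothetical:

```python
import math
import random

def topk_sample(logits, k=4, temperature=1.0, rng=random.Random(0)):
    """Top-k sampling on the host, for illustration only."""
    # Keep the k highest-logit candidates.
    top = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:k]
    # Softmax over the retained candidates (max-subtracted for stability).
    m = max(v for _, v in top)
    exps = [(i, math.exp((v - m) / temperature)) for i, v in top]
    z = sum(e for _, e in exps)
    # Inverse-CDF draw via a running cumulative sum.
    u, acc = rng.random() * z, 0.0
    for i, e in exps:
        acc += e
        if u <= acc:
            return i
    return exps[-1][0]
```

With `k=1` this degenerates to argmax, which is why the same kernel family covers both greedy and sampled decoding.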
- Runtime shapes must match compiled shapes.
- Call context encoding before token generation.
- TP degree must match compiled model.
- Decoder graph changes require a cache prune: `python tools/prune_test_models.py`.
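The constraints above can be expressed as a simple pre-flight guard. The `compiled` dict and `validate_request` function are hypothetical stand-ins; the real checks live inside the NxD runtime:

```python
def validate_request(batch, seq_len, tp_degree, compiled):
    """Reject requests whose shapes differ from the compiled graph.

    `compiled` is a hypothetical record of the shapes baked in at
    compilation time; this is a sketch, not the library's API.
    """
    if batch != compiled["batch"]:
        raise ValueError(f"batch {batch} != compiled batch {compiled['batch']}")
    if seq_len > compiled["max_seq_len"]:
        raise ValueError(f"seq_len {seq_len} exceeds compiled {compiled['max_seq_len']}")
    if tp_degree != compiled["tp_degree"]:
        raise ValueError(f"tp_degree {tp_degree} != compiled {compiled['tp_degree']}")
```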
- Sharding strategy selection: REPLICATE_TO_TP_DEGREE vs CONVERT_TO_MHA
- Logic in optimum/neuron/models/inference/backend/modules/attention/gqa.py
- Attention module guide (dispatch table, NKI kernel for head_dim > 128, sliding window): optimum/neuron/models/inference/backend/modules/attention/AGENTS.md
- The training path uses `attn_implementation="flash_attention_2"`.
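The choice between REPLICATE_TO_TP_DEGREE and CONVERT_TO_MHA can be sketched as a function of KV head count versus TP degree. This is a hypothetical simplification of the logic in gqa.py (the real rules are more involved, and the `SPLIT` branch name is invented here for the even-split case):

```python
def pick_kv_sharding(num_kv_heads, tp_degree):
    """Hypothetical sketch of the GQA sharding decision."""
    if num_kv_heads % tp_degree == 0:
        return "SPLIT"                   # even split: each rank owns num_kv_heads // tp_degree
    if tp_degree % num_kv_heads == 0:
        return "REPLICATE_TO_TP_DEGREE"  # duplicate KV heads so every rank holds one
    return "CONVERT_TO_MHA"              # fall back to one KV head per query head
```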
Parallel QKV and output projections use `ColumnParallelLinear`/`RowParallelLinear`.
- Replace HF `nn.Linear`/`Embedding` with TP-aware parallel layers.
- Replace HF attention with `NeuronAttentionBase` for static shapes.
- Use `KVCacheManager` instead of HF dynamic cache.
- Optional fused QKV/MLP kernels (Neuron-only).
- State dict remaps (e.g., QKV concatenation).
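The QKV-concatenation remap can be sketched as follows. Key names follow common HF Llama-style conventions but are illustrative, and nested lists stand in for tensors; real remaps operate on torch tensors and the model's actual key layout:

```python
def fuse_qkv_state_dict(sd, layer_idx):
    """Concatenate separate HF q/k/v projection weights into one
    fused QKV weight for a given layer. Illustrative sketch only."""
    prefix = f"model.layers.{layer_idx}.self_attn"
    q = sd.pop(f"{prefix}.q_proj.weight")
    k = sd.pop(f"{prefix}.k_proj.weight")
    v = sd.pop(f"{prefix}.v_proj.weight")
    # Concatenate along the output dimension (rows of the weight matrix).
    sd[f"{prefix}.qkv_proj.weight"] = q + k + v
    return sd
```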
See reference implementation:
- Expert routing and sharding are TP/EP aware.
- Expert capacity and dispatch are statically shaped.
- Expert MLPs use parallel layers or fused kernels.
- State dict remaps for expert sharding when required.
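Static expert capacity is what keeps MoE dispatch tensors at a compile-time shape: each expert accepts at most `capacity` tokens per batch, and overflow tokens are dropped or re-routed. The formula below is the common MoE convention, shown as an illustration rather than the exact NxD computation:

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.25):
    """Per-expert token budget for statically shaped dispatch."""
    return math.ceil(tokens_per_batch * capacity_factor / num_experts)
```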
Porting from NxDI
Use NxDI for Neuron-specific graph changes and HF Transformers for the base architecture.
The Optimum Neuron implementation prioritizes stability, maintainability, and HF ecosystem compatibility over cutting-edge performance optimizations. For production deployments requiring maximum throughput, NxDI remains the reference implementation.
Track numerical differences using module-level tests before full graph tests:
- tests/decoder/test_modules.py compares HF layers to their Neuron equivalents using `nxd_testing.build_module()` and `validate_accuracy()`.
- tests/decoder/test_attention.py validates attention with explicit rotary embedding and mask handling.
These isolate drift or state-dict conversion issues early.
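The comparison those tests perform reduces to an element-wise tolerance check between a reference (HF) output and the Neuron output. A stdlib-only sketch of that check (the real tests use the nxd_testing helpers and torch tensors, not plain lists):

```python
def max_abs_diff(ref, test):
    """Worst absolute deviation between two flat output sequences."""
    return max(abs(r - t) for r, t in zip(ref, test))

def assert_close(ref, test, atol=1e-3):
    """Fail if any element drifts beyond the absolute tolerance."""
    diff = max_abs_diff(ref, test)
    assert diff <= atol, f"max abs diff {diff} exceeds {atol}"
```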
When adding a new model directory:
- Create `CLAUDE.md` in the model directory containing `@AGENTS.md` so Claude Code auto-loads the model-specific guide.