haozhx23/HyperPod-InstantStart


HyperPod-InstantStart

English Documentation | Chinese Documentation

HyperPod InstantStart is a training-and-inference integrated platform built on SageMaker HyperPod. It utilizes standard EKS orchestration and supports training and inference tasks with arbitrary GPU resource granularity.

Overview

HyperPod-InstantStart provides a unified interface for managing ML infrastructure, from cluster provisioning to training job orchestration and model serving.

  • For training, it leverages the HyperPod Training Operator (optional; it significantly simplifies distributed configuration and provides process-level recovery and log exception monitoring) or KubeRay (as an orchestrator for the reinforcement learning framework VERL).
  • For inference, it supports deployment on single-node or multi-node setups using arbitrary containers, such as standard vLLM/SGLang images or self-built containers, while also providing standardized API exposure (e.g., an OpenAI-compatible API).
  • Additionally, it offers a managed MLflow Tracking Server for storing training metrics, enabling sharing and collaboration with fine-grained IAM permission controls.
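As a hedged illustration of the OpenAI-compatible API exposure mentioned above, the sketch below builds a chat-completion request against a deployed endpoint. The base URL and model name are placeholders, not values defined by this repository; substitute your own Service endpoint and model.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a request for an OpenAI-compatible /v1/chat/completions endpoint.

    `base_url` and `model` are hypothetical placeholders -- use the in-cluster
    Service address and model name of your own deployment.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Send with urllib.request.urlopen(req) once the Service is reachable.
req = build_chat_request("http://my-vllm-svc:8000", "my-model", "Hello")
print(req.full_url)
```

Because the request shape follows the OpenAI chat-completions convention, the same payload works whether the backing container runs vLLM, SGLang, or a custom server that exposes that API.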

Architecture

Architecture Diagram

Demo Videos

Create HyperPod Cluster
(demo video: hypd create)

Download Model from HuggingFace
(demo video: model download)

Model Deployment from S3
(demo video: deploy)

Distributed Verl Training with KubeRay
(demo video: verl)

Agentic Orchestration and AI Workloads
(demo video: agentic)

Key Components

  • Cluster Management: Supports EKS cluster creation, importing existing EKS clusters, cluster environment configuration, HyperPod cluster creation and scaling, and EKS Node Group creation
  • Model Management: Supports multiple S3 CSI configurations, as well as HuggingFace model downloads (via a CPU Pod)
  • Inference: Hosts vLLM, SGLang, or any custom container, with support for binding Pods to different Services (no need to repeatedly destroy and recreate Pods when rebalancing resources)
  • Training: Supports model training patterns including LlamaFactory, Verl, and Torch Script
  • Training History: Integrates with SageMaker-managed MLflow for creating, displaying, and sharing training performance metrics
  • (NEW) Agentic Orchestration: Provides an integrated MCP server for natural-language AI task orchestration, e.g., cluster management, inference, and (coming soon) training & hosting
  • (NEW) SandBox Service for RL Training: Provides an interactive SandBox Service within the cluster for (coding) RL training
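The Pod-to-Service binding described above can be pictured as rewriting a Service's label selector so traffic moves to a different set of Pods without destroying anything. A minimal sketch, assuming a hypothetical `serving-group` label (the label key and the kubectl invocation are illustrative, not defined by this repository):

```python
import json

def selector_patch(new_group: str) -> str:
    """Return a strategic-merge patch that repoints a Service's selector
    to Pods carrying a different (hypothetical) serving-group label value,
    so traffic is rebalanced without recreating the Pods themselves.
    """
    return json.dumps({"spec": {"selector": {"serving-group": new_group}}})

# Applied with e.g.:  kubectl patch svc my-vllm-svc -p '<patch output>'
print(selector_patch("group-b"))
```

Because only the Service object changes, running inference Pods keep their loaded model weights and warm caches while the endpoint is rebound.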

For detailed setup instructions, please refer to the Feishu Doc (zh_cn) or the Lark Doc (en).

Upcoming Features

Type      Feature                                    Updated At   Status
Agentic   MCP Server for HyperPod InstantStart       2025-12-25   AVAILABLE
Training  RL SandBox as Cluster Service              2025-12-25   Done
Training  TorchTitan Training Recipe Integration     2025-10-17   TBD
