haozhx23/HyperPod-InstantStart


HyperPod-InstantStart

English Documentation | Chinese Documentation

HyperPod InstantStart is a training-and-inference integrated platform built on SageMaker HyperPod. It utilizes standard EKS orchestration and supports training and inference tasks with arbitrary GPU resource granularity.

Overview

HyperPod-InstantStart provides a unified interface for managing ML infrastructure, from cluster provisioning to training job orchestration and model serving.

  • For training, it leverages the HyperPod Training Operator (optional; it significantly simplifies distributed configuration and provides process-level recovery and log exception monitoring) or KubeRay (as an orchestrator for the reinforcement learning framework VERL).
  • For inference, it supports deployment on single-node or multi-node setups using arbitrary containers, such as standard vLLM/SGLang images or self-built containers, while also providing standardized API exposure (e.g., an OpenAI-compatible API).
  • Additionally, it offers a managed MLflow Tracking Server for storing training metrics, enabling sharing and collaboration with fine-grained IAM permission controls.
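As a hedged illustration of the OpenAI-compatible API exposure mentioned above, the sketch below builds a chat-completion request against a deployed endpoint. The base URL and model name are placeholders, not values defined by this repository; substitute your own Service endpoint and model.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a request for an OpenAI-compatible /v1/chat/completions endpoint.

    `base_url` and `model` are hypothetical placeholders -- use the in-cluster
    Service address and model name of your own deployment.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Send with urllib.request.urlopen(req) once the Service is reachable.
req = build_chat_request("http://my-vllm-svc:8000", "my-model", "Hello")
print(req.full_url)
```

Because the request shape follows the OpenAI chat-completions convention, the same payload works whether the backing container runs vLLM, SGLang, or a custom server that exposes that API.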

Architecture

Architecture Diagram

Demo Videos

Create HyperPod Cluster
(demo video: hypd create)

Download Model from HuggingFace
(demo video: model download)

Model Deployment from S3
(demo video: deploy)

Distributed Verl Training with KubeRay
(demo video: verl)

Agentic Orchestration and AI Workloads
(demo video: agentic)

Key Components

  • Cluster Management: Supports EKS cluster creation, importing existing EKS clusters, cluster environment configuration, HyperPod cluster creation and scaling, and EKS Node Group creation
  • Model Management: Supports multiple S3 CSI configurations, as well as HuggingFace model downloads (via a CPU Pod)
  • Inference: Hosts vLLM, SGLang, or any custom container, with support for binding Pods to different Services (no need to repeatedly destroy and recreate Pods when rebalancing resources)
  • Training: Supports model training patterns including LlamaFactory, Verl, and Torch Script
  • Training History: Integrates with SageMaker-managed MLflow for creating, displaying, and sharing training performance metrics
  • (NEW) Agentic Orchestration: Provides an integrated MCP server for natural-language AI task orchestration, e.g., cluster management, inference, and (coming soon) training & hosting
  • (NEW) SandBox Service for RL Training: Provides an interactive SandBox Service within the cluster for (coding) RL training
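The Pod-to-Service binding described above can be pictured as rewriting a Service's label selector so traffic moves to a different set of Pods without destroying anything. A minimal sketch, assuming a hypothetical `serving-group` label (the label key and the kubectl invocation are illustrative, not defined by this repository):

```python
import json

def selector_patch(new_group: str) -> str:
    """Return a strategic-merge patch that repoints a Service's selector
    to Pods carrying a different (hypothetical) serving-group label value,
    so traffic is rebalanced without recreating the Pods themselves.
    """
    return json.dumps({"spec": {"selector": {"serving-group": new_group}}})

# Applied with e.g.:  kubectl patch svc my-vllm-svc -p '<patch output>'
print(selector_patch("group-b"))
```

Because only the Service object changes, running inference Pods keep their loaded model weights and warm caches while the endpoint is rebound.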

For detailed setup instructions, please refer to the Feishu Doc (zh_cn) or the Lark Doc (en).

Upcoming Features

Type      Feature                                    Updated At   Status
Agentic   MCP Server for HyperPod InstantStart       2025-12-25   AVAILABLE
Training  RL SandBox as Cluster Service              2025-12-25   Done
Training  TorchTitan Training Recipe Integration     2025-10-17   TBD
