ETHOS: Efficient Transformers via Hypernetwork-Organized Sparsity

License: AGPL v3

This repository contains the implementation of ETHOS from the paper "ETHOS: Efficient Transformers via Hypernetwork-Organized Sparsity".

ETHOS is a novel architecture that dynamically generates millions of tiny experts from compressed latent representations, achieving 8.7B parameter capacity while using ~20× fewer FLOPs.

Installation

# Clone the repository
git clone https://github.com/wrmedford/ETHOS.git
cd ETHOS

# Install dependencies (requires Python 3.10+, CUDA 11.8+)
pip install -r requirements.txt

Quick Start

Training ETHOS

The simplest way to train ETHOS is:

python train.py

This will:

  1. Download 1% of the C4 dataset (configurable; see the data-loading sketch below)
  2. Train for 3 epochs with default settings
  3. Save checkpoints and training logs
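
For illustration only, step 1 roughly corresponds to the snippet below. The repository's data.py handles the actual loading and tokenization; the Hugging Face datasets dependency and the allenai/c4 dataset name are assumptions here.

# Rough sketch of step 1, not the repository's actual data pipeline.
# Assumes the Hugging Face `datasets` package and the public allenai/c4 dataset.
from datasets import load_dataset

# Stream the English C4 split and take a small sample of documents.
c4_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
sample = list(c4_stream.take(1000))
print(sample[0]["text"][:200])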

To use a custom configuration:

python train.py --config configs/default.yaml

Model Architecture

ETHOS combines several key innovations:

  • Dynamic expert generation: Instead of storing millions of expert parameters, we generate them on the fly from 128-dimensional latent codes (see the sketch below)
  • Product-key routing: Efficient O(√N) routing to 262K experts per layer, using query BatchNorm from PEER
  • Reordered execution: Custom Triton kernel achieving an 8× speedup
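
A minimal sketch of the dynamic expert generation idea follows. It is illustrative only: the 128-dimensional latent comes from the paper, but the other dimensions, class names, and layer layout are assumptions, not the contents of model.py.

# Illustrative sketch: a shared hypernetwork maps a 128-d latent code to the
# weights of one tiny expert, so per-expert parameters never need to be stored.
import torch
import torch.nn as nn

d_latent, d_model, d_expert = 128, 1024, 16  # d_model/d_expert are assumed sizes

class TinyExpertHypernet(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared projection per weight matrix of the generated expert.
        self.to_w_in = nn.Linear(d_latent, d_model * d_expert)
        self.to_w_out = nn.Linear(d_latent, d_expert * d_model)

    def forward(self, latent: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # latent: (d_latent,)  x: (batch, d_model)
        w_in = self.to_w_in(latent).view(d_model, d_expert)
        w_out = self.to_w_out(latent).view(d_expert, d_model)
        return torch.relu(x @ w_in) @ w_out  # apply the generated expert

# Usage: one latent code per expert; only the codes and the hypernet are stored.
hypernet = TinyExpertHypernet()
latent_codes = nn.Parameter(torch.randn(262_144, d_latent) * 0.02)
y = hypernet(latent_codes[42], torch.randn(4, d_model))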

Repository Structure

ethos/
├── model.py          # All model components
├── data.py           # Data loading and tokenization  
├── train.py          # Training script
├── monitor.py        # Training visualization
├── kernels.py        # Triton kernel implementation
├── configs/          # Configuration files
└── notebooks/        # Demo notebooks

Configuration

Key parameters in configs/default.yaml:

  • num_experts: 262,144 (512²) experts per layer (see the routing sketch below)
  • d_latent: 128-dimensional latent codes
  • top_k: 16 experts selected per token
  • num_routing_heads: 8 independent routing heads
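
A simplified, single-head sketch of how product-key routing keeps candidate scoring at O(√N) for 512² = 262,144 experts: the query is split in two, each half is scored against one set of 512 sub-keys, and only the small grid of top candidates from each half is combined. Variable names and the key dimension are illustrative, and the query BatchNorm step is omitted.

# Simplified single-head product-key routing sketch (O(sqrt(N)) candidate scoring).
import torch

n_sub = 512            # 512 x 512 = 262,144 experts per layer
d_key = 64             # per-half key dimension (assumed)
top_k = 16             # experts kept per token

sub_keys_a = torch.randn(n_sub, d_key)   # first-half sub-keys
sub_keys_b = torch.randn(n_sub, d_key)   # second-half sub-keys

def route(query: torch.Tensor):
    # query: (2 * d_key,) split into two halves, one per sub-key set.
    q_a, q_b = query[:d_key], query[d_key:]
    scores_a, idx_a = (sub_keys_a @ q_a).topk(top_k)   # only 2 * 512 scores
    scores_b, idx_b = (sub_keys_b @ q_b).topk(top_k)
    # The score of expert (i, j) is scores_a[i] + scores_b[j]; only the
    # top_k x top_k candidate grid is ever materialized.
    combined = scores_a[:, None] + scores_b[None, :]
    best, flat = combined.flatten().topk(top_k)
    expert_ids = idx_a[flat // top_k] * n_sub + idx_b[flat % top_k]
    return expert_ids, best

ids, scores = route(torch.randn(2 * d_key))
print(ids.shape, scores.shape)   # 16 expert indices in [0, 262144)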

Monitoring Training

Training progress is automatically logged and visualized:

  • Real-time plots of loss, perplexity, learning rate
  • CSV and JSON logs saved to training_logs/ (see the example below)
  • Checkpoints saved to checkpoints/
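
For example, the most recent entries of the CSV log can be inspected with a few lines of Python. The metrics.csv file name and its columns are assumptions and may not match the actual log format.

# Illustrative only: inspect the most recent rows of the training CSV log.
import csv
from pathlib import Path

log_path = Path("training_logs") / "metrics.csv"   # file name is an assumption
with log_path.open() as f:
    rows = list(csv.DictReader(f))

for row in rows[-5:]:          # last five logged steps
    print(row)                 # e.g. step, loss, perplexity, learning rate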

Requirements

  • PyTorch 2.0+
  • CUDA 11.8+
  • Triton 2.1+
  • Flash Attention 2.0+
  • 80GB+ GPU memory recommended

Paper Results

On 1% of C4 dataset:

  • Perplexity: 34.85 after 4B tokens with the c100k tokenizer
  • Training speed: 15K tokens/second on GH200
  • Memory efficiency: 16× reduction vs PEER
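
For reference, assuming the standard definition of perplexity as the exponentiated mean token-level cross-entropy (in nats), a perplexity of 34.85 corresponds to a mean loss of about 3.55:

# Perplexity and mean cross-entropy (nats); standard definition, assumed here.
import math

mean_nll = 3.551                  # mean negative log-likelihood per token
print(math.exp(mean_nll))         # ~34.85, the reported perplexity
print(math.log(34.85))            # ~3.551, the corresponding mean loss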

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPLv3) - see the LICENSE file for details.

Important: The AGPLv3 license requires that any modifications or derivative works be released under the same license, including when used as a network service.

Commercial Licensing

For commercial use cases that require a different license, please contact [email protected] to discuss commercial licensing options.
