This repository contains the implementation of ETHOS from the paper "ETHOS: Efficient Transformers via Hypernetwork-Organized Sparsity".
ETHOS is a novel architecture that dynamically generates millions of tiny experts from compressed latent representations, achieving 8.7B parameter capacity while using ~20× fewer FLOPs.
```bash
# Clone the repository
git clone https://github.com/yourusername/ethos.git
cd ethos

# Install dependencies (requires Python 3.10+, CUDA 11.8+)
pip install -r requirements.txt
```

The simplest way to train ETHOS is:
```bash
python train.py
```

This will:
- Download 1% of the C4 dataset (configurable)
- Train for 3 epochs with default settings
- Save checkpoints and training logs
To use a custom configuration:
```bash
python train.py --config configs/default.yaml
```

ETHOS combines several key innovations:
- Dynamic expert generation: Instead of storing millions of expert parameters, we generate them on the fly from 128-dimensional latent codes (a short sketch follows this list)
- Product-key routing: Efficient O(√N) routing to 262K experts per layer utilizing Query BatchNorm from PEER
- Reordered execution: Custom Triton kernel achieving 8× speedup
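As a rough illustration of the first two points, here is a minimal PyTorch sketch of how tiny experts could be generated from latent codes and selected with product-key routing. This is not the repository's model.py: module names, tensor shapes, the single routing head, and the two-vector expert parameterization are assumptions made for illustration only.

```python
# Minimal sketch (not the actual model.py) of latent-code expert generation
# and product-key routing. All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyExpertHypernet(nn.Module):
    """Stores compressed latent codes and expands each selected code into a
    tiny expert (a single hidden unit) only for the tokens that use it."""
    def __init__(self, d_model=512, d_latent=128, num_experts=512**2):
        super().__init__()
        self.latents = nn.Embedding(num_experts, d_latent)  # compressed expert storage
        self.to_in = nn.Linear(d_latent, d_model)            # generates the "up" weights
        self.to_out = nn.Linear(d_latent, d_model)           # generates the "down" weights

    def forward(self, x, expert_ids, scores):
        # x: (tokens, d_model); expert_ids, scores: (tokens, top_k)
        z = self.latents(expert_ids)                          # (tokens, top_k, d_latent)
        w_in = self.to_in(z)                                  # (tokens, top_k, d_model)
        w_out = self.to_out(z)                                # (tokens, top_k, d_model)
        h = F.gelu(torch.einsum("td,tkd->tk", x, w_in))       # per-expert activation
        h = h * scores                                        # weight by router score
        return torch.einsum("tk,tkd->td", h, w_out)           # combine expert outputs

def product_key_routing(query, keys1, keys2, top_k=16):
    """O(sqrt(N)) routing: score two sub-key sets of size sqrt(N) each, then
    combine the best candidates instead of scoring all N experts directly."""
    d_half = query.shape[-1] // 2
    q1, q2 = query[..., :d_half], query[..., d_half:]
    s1, i1 = (q1 @ keys1.t()).topk(top_k, dim=-1)             # (tokens, top_k)
    s2, i2 = (q2 @ keys2.t()).topk(top_k, dim=-1)
    # Cartesian product of the two candidate sets: top_k * top_k combined scores
    scores = s1.unsqueeze(-1) + s2.unsqueeze(-2)              # (tokens, top_k, top_k)
    ids = i1.unsqueeze(-1) * keys2.shape[0] + i2.unsqueeze(-2)
    scores, flat = scores.flatten(1).topk(top_k, dim=-1)
    return ids.flatten(1).gather(1, flat), scores.softmax(dim=-1)
```

The point of this arrangement is that only the 128-dimensional latent codes are stored per expert; full expert weights exist only transiently for the top_k experts a token actually selects.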
```
ethos/
├── model.py      # All model components
├── data.py       # Data loading and tokenization
├── train.py      # Training script
├── monitor.py    # Training visualization
├── kernels.py    # Triton kernel implementation
├── configs/      # Configuration files
└── notebooks/    # Demo notebooks
```
Key parameters in configs/default.yaml:
- num_experts: 262,144 (512²) experts per layer
- d_latent: 128-dimensional latent codes
- top_k: 16 experts selected per token
- num_routing_heads: 8 independent routing heads
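Based only on the parameters listed above, an excerpt of configs/default.yaml might look like the following; the grouping under a model: key and the comments are assumptions, not the file's actual layout.

```yaml
# Illustrative excerpt only; the actual configs/default.yaml may differ.
model:
  num_experts: 262144      # 512^2 tiny experts per layer
  d_latent: 128            # latent code dimension fed to the hypernetwork
  top_k: 16                # experts selected per token
  num_routing_heads: 8     # independent routing heads
```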
Training progress is automatically logged and visualized:
- Real-time plots of loss, perplexity, learning rate
- CSV and JSON logs saved to training_logs/
- Checkpoints saved to checkpoints/
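If you want to inspect the logs outside the built-in visualization, a minimal sketch like the one below works, assuming the CSV logs contain step and loss columns (hypothetical names; check the actual files in training_logs/).

```python
# Hypothetical example: plotting training loss from the CSV logs.
# The file pattern and the "step"/"loss" column names are assumptions,
# not the repository's documented schema.
import glob
import pandas as pd
import matplotlib.pyplot as plt

log_file = sorted(glob.glob("training_logs/*.csv"))[-1]   # most recent log
df = pd.read_csv(log_file)
plt.plot(df["step"], df["loss"])
plt.xlabel("step")
plt.ylabel("training loss")
plt.savefig("loss_curve.png")
```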
- PyTorch 2.0+
- CUDA 11.8+
- Triton 2.1+
- Flash Attention 2.0+
- 80GB+ GPU memory recommended
On 1% of C4 dataset:
- Perplexity: 34.85 after 4B tokens using the cl100k tokenizer
- Training speed: 15K tokens/second on GH200
- Memory efficiency: 16× reduction vs PEER
This project is licensed under the GNU Affero General Public License v3.0 (AGPLv3) - see the LICENSE file for details.
Important: The AGPLv3 license requires that any modifications or derivative works be released under the same license, including when used as a network service.
For commercial use cases that require a different license, please contact [email protected] to discuss commercial licensing options.