`torchtitan` is currently in a pre-release state and under extensive development. We showcase training Llama 3.1 LLMs at scale, and are working on other types of generative AI models, including LLMs with MoE architectures, multimodal LLMs, and diffusion models, in the [`experiments`](torchtitan/experiments) folder.
To use the latest features of `torchtitan`, we recommend using the most recent PyTorch nightly.
## Latest News
- [2025/04] Our paper has been accepted by [ICLR 2025](https://iclr.cc/virtual/2025/poster/29620). The poster will be presented on Friday, April 25th.
- [2025/04] Initial [Llama 4](torchtitan/experiments/llama4/) support is available as an experiment.
- [2025/04] Training the diffusion model [FLUX](torchtitan/experiments/flux/) with FSDP/HSDP is available as an experiment.
- [2025/04] The frontend implementation of [SimpleFSDP](torchtitan/experiments/simple_fsdp/), a compiler-based FSDP framework, is available as an experiment.
- [2024/12] GPU MODE [lecture](https://www.youtube.com/watch?v=VYWRjcUqW6w) on torchtitan.
- [2024/11] [Presentation](https://www.alluxio.io/videos/ai-ml-infra-meetup-torchtitan-one-stop-pytorch-native-solution-for-production-ready-llm-pre-training) at an AI/ML Infra Meetup.
- [2024/07] [Presentation](https://pytorch2024.sched.com/event/1fHn3) at PyTorch Conference 2024.
- [2024/04] [Intro video](https://youtu.be/ee5DOEqD35I?si=_B94PbVv0V5ZnNKE) - learn more about `torchtitan` in under 4 minutes.
## Overview
`torchtitan` is a PyTorch native platform designed for **rapid experimentation and large-scale training** of generative AI models. As a minimal clean-room implementation of PyTorch native scaling techniques, `torchtitan` provides a flexible foundation for developers to build upon. With `torchtitan` [extension points](docs/extension.md), one can easily create custom extensions tailored to specific needs.
Our mission is to accelerate innovation in the field of generative AI by empowering researchers and developers to explore new modeling architectures and infrastructure techniques.
The guiding principles when building `torchtitan`:
* Designed to be easy to understand, use and extend for different training purposes.
* Minimal changes to the model code when applying multi-dimensional parallelism (see the sketch after this list).
* Bias towards a clean, minimal codebase while providing basic reusable / swappable components.
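
To make the parallelism principle concrete, below is a minimal sketch (our illustration, not torchtitan code) of applying FSDP2-style sharding around an unmodified model. It assumes a recent PyTorch where `fully_shard` is importable from `torch.distributed.fsdp`, and is meant to be launched with `torchrun` on a GPU machine.

```python
# A minimal sketch, not torchtitan's trainer. Assumes a recent PyTorch
# that exposes FSDP2's `fully_shard` under torch.distributed.fsdp.
# Launch with: torchrun --nproc_per_node=<num_gpus> fsdp2_sketch.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard


class ToyBlock(nn.Module):
    """Stand-in for a transformer block; the model code needs no edits."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(x)


dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun

model = nn.Sequential(*[ToyBlock() for _ in range(4)]).cuda()

# Parallelism is applied *around* the model: shard each block, then the root.
for block in model:
    fully_shard(block)
fully_shard(model)
```

Tensor, pipeline, and context parallelism compose in the same spirit: as transformations applied on top of a plain model definition rather than edits inside it.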
`torchtitan` has been showcasing PyTorch's latest distributed training features by pretraining Llama 3.1 LLMs of various sizes.
To accelerate contributions to and innovations around `torchtitan`, we are hosting a new [`experiments`](torchtitan/experiments) folder. We look forward to your contributions!
## Llama 3.1 pretraining
### Key features available
- [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
5. `torch.compile` support
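
The sketch below shows how features like these compose. It is illustrative only: torchtitan itself enables them through its TOML configs, and the tiny model and output path here are placeholders. `torch.compile` wraps the model, while `torch.distributed.checkpoint` (DCP) produces the resharding-friendly checkpoints that interoperability builds on.

```python
# Illustrative sketch; torchtitan wires these up via config, not ad-hoc code.
import torch
import torch.distributed.checkpoint as dcp
import torch.nn as nn

model = nn.Linear(16, 16)      # placeholder for the real (sharded) model
model = torch.compile(model)   # torch.compile support

# DCP writes a checkpoint directory that can be resharded on load;
# torch.distributed.checkpoint.format_utils.dcp_to_torch_save can convert
# it into a single torch.save file for downstream fine-tuning tools.
dcp.save({"model": model.state_dict()}, checkpoint_id="outputs/step-0")
```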
If your GPU count per node is not 8, adjust `--nproc_per_node` in the torchrun command and `#SBATCH --gpus-per-task` in the SBATCH command section.
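
As a worked check of that adjustment, the hedged sketch below (ours; it relies on standard torchrun and Slurm environment variables, which your cluster may set differently) verifies that the torchrun world size matches the Slurm allocation before training starts:

```python
# Hedged sanity check, run at the top of the training script under torchrun.
# SLURM_* variable names are standard but cluster setups vary.
import os

import torch.distributed as dist

dist.init_process_group("nccl")  # use "gloo" for a CPU-only dry run

nnodes = int(os.environ.get("SLURM_JOB_NUM_NODES", "1"))
gpus_per_node = int(os.environ.get("SLURM_GPUS_PER_TASK", "8"))

assert dist.get_world_size() == nnodes * gpus_per_node, (
    f"world size {dist.get_world_size()} != "
    f"{nnodes} nodes x {gpus_per_node} GPUs per node"
)
```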
## Citation
We provide a detailed look into the parallelisms and optimizations available in `torchtitan`, along with summary advice on when to use various techniques.
[TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training](https://openreview.net/forum?id=SFN6Wm7YBI)
```
@inproceedings{
liang2025torchtitan,
title={TorchTitan: One-stop PyTorch native solution for production ready {LLM} pretraining},
author={Wanchao Liang and Tianyu Liu and Less Wright and Will Constable and Andrew Gu and Chien-Chin Huang and Iris Zhang and Wei Feng and Howard Huang and Junjie Wang and Sanket Purandare and Gokul Nadathur and Stratos Idreos},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=SFN6Wm7YBI}
}
```
## License
Source code is made available under a [BSD 3 license](./LICENSE); however, you may have other legal obligations that govern your use of other content linked in this repository, such as the license or terms of service for third-party data and models.
**docs/extension.md**
To support rapid experimentation with torchtitan, we provide several extension points. The principle for adding these extension points is to support various use cases with flexible component swapping and reuse, while trying to keep the code clean and minimal.
The extension points and protocols mentioned in this note are subject to change.
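
As a hedged illustration of the pattern (the names below imitate torchtitan's train-spec style but are simplified stand-ins of our own, not the real protocol), an extension registers its components by name instead of editing the trainer:

```python
# Simplified stand-in for the extension-point pattern; the real torchtitan
# protocol has more fields and may differ in names and signatures.
from dataclasses import dataclass
from typing import Callable

import torch.nn as nn


@dataclass
class TrainSpec:
    name: str
    model_cls: type            # model class to instantiate
    parallelize_fn: Callable   # applies FSDP/TP/PP around the model


_REGISTRY: dict[str, TrainSpec] = {}


def register_train_spec(spec: TrainSpec) -> None:
    """The trainer later looks specs up by name from the config."""
    _REGISTRY[spec.name] = spec


class MyModel(nn.Module):
    def forward(self, x):
        return x


def parallelize_my_model(model: nn.Module) -> nn.Module:
    return model  # apply sharding here; identity keeps the sketch minimal


register_train_spec(TrainSpec("my_model", MyModel, parallelize_my_model))
```

Swapping in a different dataloader, optimizer, or parallelization strategy then becomes a matter of registering a new spec rather than forking the training loop.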