Commit ce754de

Merge branch 'main' into pr_cublas_handle_mgmt
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
2 parents e62a128 + 8a20d66

345 files changed, +31336 -37439 lines changed

.github/PULL_REQUEST_TEMPLATE.md (+1 -1)

@@ -11,7 +11,7 @@ Fixes # (issue)
 - [ ] New feature (non-breaking change which adds functionality)
 - [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
 - [ ] Infra/Build change
-- [ ] Code refractor
+- [ ] Code refactoring
 
 ## Changes
 

.github/workflows/build.yml (+3 -22)

@@ -12,7 +12,7 @@ jobs:
     name: 'Core'
     runs-on: ubuntu-latest
     container:
-      image: nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04
+      image: nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
       options: --user root
     steps:
       - name: 'Dependencies'
@@ -28,14 +28,15 @@ jobs:
         run: pip install . -v
         env:
           NVTE_FRAMEWORK: none
+          MAX_JOBS: 1
       - name: 'Sanity check'
         run: python3 -c "import transformer_engine"
         working-directory: /
   pytorch:
     name: 'PyTorch'
     runs-on: ubuntu-latest
     container:
-      image: nvcr.io/nvidia/cuda:12.5.0-devel-ubuntu22.04
+      image: nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04
       options: --user root
     steps:
       - name: 'Dependencies'
@@ -73,23 +74,3 @@ jobs:
           MAX_JOBS: 1
       - name: 'Sanity check'
         run: python tests/jax/test_sanity_import.py
-  paddle:
-    name: 'PaddlePaddle'
-    runs-on: ubuntu-latest
-    container:
-      image: nvcr.io/nvidia/paddlepaddle:24.10-py3
-      options: --user root
-    steps:
-      - name: 'Checkout'
-        uses: actions/checkout@v3
-        with:
-          submodules: recursive
-      - name: 'Build'
-        run: |
-          apt-get update
-          apt-get install -y libgoogle-glog-dev
-          pip install . -v
-        env:
-          NVTE_FRAMEWORK: paddle
-      - name: 'Sanity check'
-        run: python tests/paddle/test_sanity_import.py
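
For context on the updated 'Core' job, the equivalent local build reduces to a few commands. A minimal sketch, assuming it is run inside the same CUDA devel container the workflow now uses (the image tag, env vars, and sanity check are taken directly from the diff above):

    # Inside nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04:
    # NVTE_FRAMEWORK=none builds only the framework-agnostic core library;
    # the new MAX_JOBS=1 caps parallel compile jobs to limit peak memory.
    NVTE_FRAMEWORK=none MAX_JOBS=1 pip install . -v
    python3 -c "import transformer_engine"   # sanity check, as in the workflow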

.github/workflows/lint.yml (-27)

@@ -61,30 +61,3 @@ jobs:
           export PYTHON_ONLY=1
           export TE_PATH=.
           bash ./qa/L0_jax_lint/test.sh
-  paddle_cpplint:
-    name: 'PaddlePaddle C++'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v3
-      - name: 'Lint'
-        run: |
-          sudo apt-get update
-          sudo apt-get install pip -y
-          export CPP_ONLY=1
-          export TE_PATH=.
-          bash ./qa/L0_paddle_lint/test.sh
-  paddle_pylint:
-    name: 'PaddlePaddle Python'
-    runs-on: ubuntu-latest
-    steps:
-      - name: 'Checkout'
-        uses: actions/checkout@v3
-      - name: 'Lint'
-        run: |
-          sudo apt-get update
-          sudo apt-get install pip -y
-          pip install paddlepaddle-gpu
-          export PYTHON_ONLY=1
-          export TE_PATH=.
-          bash ./qa/L0_paddle_lint/test.sh

.github/workflows/trigger-ci.yml (+2)

@@ -43,6 +43,8 @@ jobs:
         || github.actor == 'youngeunkwon0405'
         || github.actor == 'KshitijLakhani'
         || github.actor == 'jberchtold-nvidia'
+        || github.actor == 'sanandaraj5597'
+        || github.actor == 'negvet'
       )
     steps:
       - name: Check if comment is issued by authorized person

.gitignore (+1 -1)

@@ -8,7 +8,6 @@
 *.nsys-rep
 *.ncu-rep
 *.sqlite
-*.onnx
 *.eggs
 build/
 *.so
@@ -39,3 +38,4 @@ downloads/
 .pytest_cache/
 compile_commands.json
 .nfs
+tensor_dumps/

3rdparty/cudnn-frontend

Submodule cudnn-frontend updated 112 files

README.rst (+30 -23)

@@ -33,11 +33,12 @@ What is Transformer Engine?
 .. overview-begin-marker-do-not-remove
 
 Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including
-using 8-bit floating point (FP8) precision on Hopper GPUs, to provide better performance with lower
-memory utilization in both training and inference. TE provides a collection of highly optimized
-building blocks for popular Transformer architectures and an automatic mixed precision-like API that
-can be used seamlessly with your framework-specific code. TE also includes a framework agnostic
-C++ API that can be integrated with other deep learning libraries to enable FP8 support for Transformers.
+using 8-bit floating point (FP8) precision on Hopper, Ada, and Blackwell GPUs, to provide better
+performance with lower memory utilization in both training and inference. TE provides a collection
+of highly optimized building blocks for popular Transformer architectures and an automatic mixed
+precision-like API that can be used seamlessly with your framework-specific code. TE also includes a
+framework agnostic C++ API that can be integrated with other deep learning libraries to enable FP8
+support for Transformers.
 
 As the number of parameters in Transformer models continues to grow, training and inference for
 architectures such as BERT, GPT and T5 become very memory and compute-intensive. Most deep learning
@@ -51,16 +52,16 @@ not available natively in frameworks today.
 
 TE addresses the problem of FP8 support by providing APIs that integrate with popular Large Language
 Model (LLM) libraries. It provides a Python API consisting of modules to easily build a Transformer
-layer as well as a framework-agnostic library in C++ including structs and kernels needed for FP8 support.
-Modules provided by TE internally maintain scaling factors and other values needed for FP8 training, greatly
-simplifying mixed precision training for users.
+layer as well as a framework-agnostic library in C++ including structs and kernels needed for FP8
+support. Modules provided by TE internally maintain scaling factors and other values needed for FP8
+training, greatly simplifying mixed precision training for users.
 
 Highlights
 ==========
 
 * Easy-to-use modules for building Transformer layers with FP8 support
 * Optimizations (e.g. fused kernels) for Transformer models
-* Support for FP8 on NVIDIA Hopper and NVIDIA Ada GPUs
+* Support for FP8 on NVIDIA Hopper, Ada, and Blackwell GPUs
 * Support for optimizations across all precisions (FP16, BF16) on NVIDIA Ampere GPU architecture generations and later
 
 Examples
@@ -149,48 +150,54 @@ Installation
 Pre-requisites
 ^^^^^^^^^^^^^^^^^^^^
 * Linux x86_64
-* CUDA 12.0+ for Hopper and CUDA 12.1+ for Ada
-* NVIDIA Driver supporting CUDA 12.0 or later
-* cuDNN 8.1 or later
-* For fused attention, CUDA 12.1 or later, NVIDIA Driver supporting CUDA 12.1 or later, and cuDNN 8.9 or later.
+* CUDA 12.1+ (CUDA 12.8+ for Blackwell)
+* NVIDIA Driver supporting CUDA 12.1 or later
+* cuDNN 9.3 or later
 
 Docker
 ^^^^^^^^^^^^^^^^^^^^
 
 The quickest way to get started with Transformer Engine is by using Docker images on
-`NVIDIA GPU Cloud (NGC) Catalog <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`_. For example to use the NGC PyTorch container interactively,
+`NVIDIA GPU Cloud (NGC) Catalog <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`_.
+For example to use the NGC PyTorch container interactively,
 
 .. code-block:: bash
 
-    docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.10-py3
+    docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.01-py3
 
-Where 23.10 is the container version. For example, 23.10 for the October 2023 release.
+Where 25.01 (corresponding to January 2025 release) is the container version.
 
 pip
 ^^^^^^^^^^^^^^^^^^^^
 To install the latest stable version of Transformer Engine,
 
 .. code-block:: bash
 
-    pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
+    pip3 install git+https://github.com/NVIDIA/TransformerEngine.git@stable
 
-This will automatically detect if any supported deep learning frameworks are installed and build Transformer Engine support for them. To explicitly specify frameworks, set the environment variable NVTE_FRAMEWORK to a comma-separated list (e.g. NVTE_FRAMEWORK=jax,pytorch,paddle).
+This will automatically detect if any supported deep learning frameworks are installed and build
+Transformer Engine support for them. To explicitly specify frameworks, set the environment variable
+NVTE_FRAMEWORK to a comma-separated list (e.g. NVTE_FRAMEWORK=jax,pytorch).
 
-Alternatively, the package can be directly installed from `Transformer Engine's PyPI <https://pypi.org/project/transformer-engine/>`_, e.g.
+Alternatively, the package can be directly installed from
+`Transformer Engine's PyPI <https://pypi.org/project/transformer-engine/>`_, e.g.
 
 .. code-block:: bash
 
-    pip install transformer_engine[pytorch]
+    pip3 install transformer_engine[pytorch]
 
-To obtain the necessary Python bindings for Transformer Engine, the frameworks needed must be explicitly specified as extra dependencies in a comma-separated list (e.g. [jax,pytorch,paddle]). Transformer Engine ships wheels for the core library as well as the PaddlePaddle extensions. Source distributions are shipped for the JAX and PyTorch extensions.
+To obtain the necessary Python bindings for Transformer Engine, the frameworks needed must be
+explicitly specified as extra dependencies in a comma-separated list (e.g. [jax,pytorch]).
+Transformer Engine ships wheels for the core library. Source distributions are shipped for the JAX
+and PyTorch extensions.
 
 From source
 ^^^^^^^^^^^
 `See the installation guide <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html#installation-from-source>`_.
 
 Compiling with FlashAttention-2
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Transformer Engine release v0.11.0 adds support for FlashAttention-2 in PyTorch for improved performance.
+Transformer Engine release v0.11.0 added support for FlashAttention-2 in PyTorch for improved performance.
 
 It is a known issue that FlashAttention-2 compilation is resource-intensive and requires a large amount of RAM (see `bug <https://github.com/Dao-AILab/flash-attention/issues/358>`_), which may lead to out of memory errors during the installation of Transformer Engine. Please try setting **MAX_JOBS=1** in the environment to circumvent the issue.
 
@@ -264,10 +271,10 @@ Transformer Engine has been integrated with popular LLM frameworks such as:
 * `NVIDIA NeMo Framework <https://github.com/NVIDIA/NeMo-Megatron-Launcher>`_
 * `Amazon SageMaker Model Parallel Library <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features-v2-tensor-parallelism.html>`_
 * `Levanter <https://github.com/stanford-crfm/levanter>`_
+* `GPT-NeoX <https://github.com/EleutherAI/gpt-neox>`_
 * `Hugging Face Nanotron <https://github.com/huggingface/nanotron>`_ - Coming soon!
 * `Colossal-AI <https://github.com/hpcaitech/ColossalAI>`_ - Coming soon!
 * `PeriFlow <https://github.com/friendliai/periflow-python-sdk>`_ - Coming soon!
-* `GPT-NeoX <https://github.com/EleutherAI/gpt-neox>`_ - Coming soon!
 
 
 Contributing
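
To make the revised pip instructions concrete, a short sketch combining the options described above (all variables, URLs, and extras are taken from the README text; nothing else is assumed):

    # Build from the stable branch with frameworks chosen explicitly
    # instead of relying on auto-detection.
    NVTE_FRAMEWORK=jax,pytorch pip3 install git+https://github.com/NVIDIA/TransformerEngine.git@stable

    # Or install from PyPI, naming the needed framework as an extra.
    # MAX_JOBS=1 works around the FlashAttention-2 compile-memory issue
    # noted in the FlashAttention-2 section.
    MAX_JOBS=1 pip3 install transformer_engine[pytorch]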

build_tools/VERSION.txt (+1 -1)

@@ -1 +1 @@
-1.14.0.dev0
+2.2.0.dev0

build_tools/build_ext.py (+9 -67)

@@ -94,7 +94,7 @@ def _build_cmake(self, build_dir: Path, install_dir: Path) -> None:
     print(f"Time for build_ext: {total_time:.2f} seconds")
 
 
-def get_build_ext(extension_cls: Type[setuptools.Extension]):
+def get_build_ext(extension_cls: Type[setuptools.Extension], install_so_in_wheel_lib: bool = False):
     class _CMakeBuildExtension(extension_cls):
         """Setuptools command with support for CMake extension modules"""
 
@@ -129,81 +129,23 @@ def run(self) -> None:
             super().run()
             self.extensions = all_extensions
 
-            paddle_ext = None
-            if "paddle" in get_frameworks():
-                for ext in self.extensions:
-                    if "paddle" in ext.name:
-                        paddle_ext = ext
-                        break
-
-            # Manually write stub file for Paddle extension
-            if paddle_ext is not None:
-                # Load libtransformer_engine.so to avoid linker errors
-                if not bool(int(os.getenv("NVTE_RELEASE_BUILD", "0"))):
-                    # Source compilation from top-level (--editable)
-                    search_paths = list(Path(__file__).resolve().parent.parent.iterdir())
-                    # Source compilation from top-level
-                    search_paths.extend(list(Path(self.build_lib).iterdir()))
-
-                    # Dynamically load required_libs.
-                    from transformer_engine.common import _load_cudnn, _load_nvrtc
-
-                    _load_cudnn()
-                    _load_nvrtc()
-                else:
-                    # Only during release bdist build for paddlepaddle.
-                    import transformer_engine
-
-                    search_paths = list(Path(transformer_engine.__path__[0]).iterdir())
-                    del transformer_engine
-
-                common_so_path = ""
-                for path in search_paths:
-                    if path.name.startswith("libtransformer_engine."):
-                        common_so_path = str(path)
-                assert common_so_path, "Could not find libtransformer_engine"
-                ctypes.CDLL(common_so_path, mode=ctypes.RTLD_GLOBAL)
-
-                # Figure out stub file path
-                module_name = paddle_ext.name
-                assert module_name.endswith(
-                    "_pd_"
-                ), "Expected Paddle extension module to end with '_pd_'"
-                stub_name = module_name[:-4]  # remove '_pd_'
-                stub_path = os.path.join(self.build_lib, "transformer_engine", stub_name + ".py")
-                Path(stub_path).parent.mkdir(exist_ok=True, parents=True)
-
-                # Figure out library name
-                # Note: This library doesn't actually exist. Paddle
-                # internally reinserts the '_pd_' suffix.
-                so_path = self.get_ext_fullpath(module_name)
-                _, so_ext = os.path.splitext(so_path)
-                lib_name = stub_name + so_ext
-
-                # Write stub file
-                print(f"Writing Paddle stub for {lib_name} into file {stub_path}")
-                from paddle.utils.cpp_extension.extension_utils import custom_write_stub
-
-                custom_write_stub(lib_name, stub_path)
-
             # Ensure that binaries are not in global package space.
-            target_dir = install_dir / "transformer_engine"
+            lib_dir = (
+                "wheel_lib"
+                if bool(int(os.getenv("NVTE_RELEASE_BUILD", "0"))) or install_so_in_wheel_lib
+                else ""
+            )
+            target_dir = install_dir / "transformer_engine" / lib_dir
             target_dir.mkdir(exist_ok=True, parents=True)
 
             for ext in Path(self.build_lib).glob("*.so"):
                 self.copy_file(ext, target_dir)
                 os.remove(ext)
 
-            # For paddle, the stub file needs to be copied to the install location.
-            if paddle_ext is not None:
-                stub_path = Path(self.build_lib) / "transformer_engine"
-                for stub in stub_path.glob("transformer_engine_paddle.py"):
-                    self.copy_file(stub, target_dir)
-
         def build_extensions(self):
-            # BuildExtensions from PyTorch and PaddlePaddle already handle CUDA files correctly
+            # BuildExtensions from PyTorch already handle CUDA files correctly
             # so we don't need to modify their compiler. Only the pybind11 build_ext needs to be fixed.
-            if "pytorch" not in get_frameworks() and "paddle" not in get_frameworks():
+            if "pytorch" not in get_frameworks():
                 # Ensure at least an empty list of flags for 'cxx' and 'nvcc' when
                 # extra_compile_args is a dict.
                 for ext in self.extensions: