Commit 0432a23

address review comments
1 parent 80f228c commit 0432a23

1 file changed: +2 -2 lines changed

Diff for: prototype_source/context_parallel.rst

@@ -26,7 +26,7 @@ Introduction
 Context Parallel is an approach used in large language model training to reduce peak activation size by sharding the long input sequence across multiple devices.
 It breaks the constraint on input sequence length resulting from peak memory usage on storing activations in Transformer blocks.
 
-The core of Context Parallel is Ring Attention, a novel parallel implementation of the Attention layer.
+Ring Attention, a novel parallel implementation of the Attention layer, is critical to performant Context Parallel.
 Ring Attention shuffles the KV shards and calculates the partial attention scores, repeats until all KV shards have been used on each device.
 Two Ring Attention variants have been implemented: `the all-gather based pass-KV <https://arxiv.org/abs/2407.21783>`__ and `the all-to-all based pass-KV <https://openreview.net/forum?id=WsRHpHH4s0>`__:
 
@@ -42,7 +42,7 @@ The Context Parallel APIs consist of two parts:
 1. ``context_parallel()`` allows users to create a Python context where the SDPA function (``torch.nn.functional.scaled_dot_product_attention``)
 will be automatically replaced with Ring Attention. To shard Tensors along a dimension, simply pass the Tensors and their sharding dimensions to
 argument ``buffers`` and ``buffer_seq_dims`` respectively. We recommend that users add tensors computing along the sequence dimension to ``buffers``
-and shard them along this dimension.
+and shard them along this dimension. Taking Llama3 training as an example, missing ``freq_cis`` in ``buffers`` will result in a miscalculated rotary embedding.
 2. ``set_rotate_method()`` allows users to choose between the all-gather based pass-KV approach and the all-to-all based pass-KV approach.
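The Ring Attention loop touched by the first hunk (compute partial attention scores against one KV shard, move to the next shard, repeat until every shard has been seen) works because partial results can be folded together with an online-softmax rescaling. The single-device sketch below is not the PyTorch implementation; it only illustrates that accumulation, with ``k.chunk()``/``v.chunk()`` standing in for the KV shards that would rotate between devices, and ``chunked_attention`` is a name made up for the example.

.. code:: python

    # Illustrative only: accumulate attention over KV chunks with an online
    # softmax, mimicking how Ring Attention folds in one KV shard at a time.
    import torch

    def chunked_attention(q, k, v, num_chunks):
        # q, k, v: (seq_len, head_dim); single head, no causal mask, for clarity.
        scale = q.shape[-1] ** -0.5
        out = torch.zeros_like(q)                        # running (unnormalized) output
        running_max = torch.full((q.shape[0], 1), float("-inf"))
        running_sum = torch.zeros(q.shape[0], 1)         # running softmax denominator
        for k_chunk, v_chunk in zip(k.chunk(num_chunks), v.chunk(num_chunks)):
            scores = (q @ k_chunk.T) * scale             # partial attention scores
            new_max = torch.maximum(running_max, scores.max(dim=-1, keepdim=True).values)
            correction = torch.exp(running_max - new_max)
            out = out * correction                       # rescale what was accumulated so far
            running_sum = running_sum * correction
            probs = torch.exp(scores - new_max)
            out = out + probs @ v_chunk
            running_sum = running_sum + probs.sum(dim=-1, keepdim=True)
            running_max = new_max
        return out / running_sum

    q, k, v = (torch.randn(16, 8) for _ in range(3))
    ref = torch.softmax((q @ k.T) * (8 ** -0.5), dim=-1) @ v
    assert torch.allclose(chunked_attention(q, k, v, num_chunks=4), ref, atol=1e-5)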

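For the ``buffers``/``buffer_seq_dims`` point in the second hunk, here is a rough usage sketch, assuming the prototype APIs under ``torch.distributed.tensor.experimental`` (exact module paths and signatures may differ between releases; ``set_rotate_method`` comes from a private ``_attention`` module) and a launch via ``torchrun --nproc-per-node=<world_size>``:

.. code:: python

    # Rough sketch, not the tutorial's exact example: shard Q/K/V along the
    # sequence dimension and run SDPA inside the context-parallel context.
    import os
    import torch
    import torch.nn.functional as F
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.experimental import context_parallel
    from torch.distributed.tensor.experimental._attention import set_rotate_method
    from torch.nn.attention import sdpa_kernel, SDPBackend

    rank = int(os.environ["LOCAL_RANK"])            # set by torchrun
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(rank)
    mesh = init_device_mesh("cuda", (world_size,))  # 1D mesh for context parallel

    # Each rank starts with the full-length tensors; shape (bs, nheads, seq_len, dim),
    # so the sequence dimension is 2 for all three buffers.
    q, k, v = (
        torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.bfloat16)
        for _ in range(3)
    )

    # "allgather" selects the all-gather based pass-KV variant,
    # "alltoall" the all-to-all based one.
    set_rotate_method("alltoall")

    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        # Shards q, k, v in place along dim 2 and swaps SDPA for Ring Attention
        # inside this context.
        with context_parallel(mesh, buffers=(q, k, v), buffer_seq_dims=(2, 2, 2)):
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

In a real training loop the same ``buffers`` list would also carry any other tensors that compute along the sequence dimension, which is why the changed line calls out the rotary-embedding pitfall of leaving ``freq_cis`` out.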