Llama3 hybrid implementation using submeshes #18777
base: main
Conversation
Clean 👌
To do:
- Add at least one CI test that will exercise DP. I suggest adding a demo to the t3k tests.
"batch-1-DP-4", # DP 4 latency | ||
"batch-1-DP-8", # DP 8 latency |
Could you add a batch 32 + DP test?
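A minimal sketch of how a batch 32 + DP case could be added to the parametrized demo (the parameter tuple and id format are assumptions based on the ids shown above):

import pytest

@pytest.mark.parametrize(
    "batch_size, data_parallel",
    [
        (1, 4),   # existing "batch-1-DP-4" latency case
        (1, 8),   # existing "batch-1-DP-8" latency case
        (32, 4),  # proposed batch 32 + DP throughput case
    ],
    ids=["batch-1-DP-4", "batch-1-DP-8", "batch-32-DP-4"],
)
def test_llama_demo_text_sketch(batch_size, data_parallel):
    # Stand-in body: the real demo wires these values into its model/demo fixtures.
    global_batch_size = batch_size * data_parallel
    assert global_batch_size in (4, 8, 128)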
if is_ci_env and num_devices == 8 and data_parallel > 1 and not ("3.2-1B" in llama_dir or "3.1-8B" in llama_dir):
    pytest.skip("CI runs only hybrid Llama3 1b and 8b on T3K")
What about 3B?
@@ -335,6 +439,19 @@ def test_llama_demo_text(
]:  # If the flag is provided, use it. Take an int instead of bool due to parser limitations
stop_at_eos = request.config.getoption("--stop_at_eos")

num_devices = mesh_device.get_num_devices() if isinstance(mesh_device, ttnn.MeshDevice) else 1
batch_size *= data_parallel  # input batch_size is interpreted as size per DP group
Can batch_size be renamed for clarity (here and throughout the demo)? E.g. global_batch_size for batch_size * data_parallel.
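A tiny sketch of the suggested naming, with example values standing in for the parametrized inputs:

batch_size_per_dp_group = 1   # what the user passes in, interpreted per DP group
data_parallel = 4
global_batch_size = batch_size_per_dp_group * data_parallel  # 4 prompts in total across submeshes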
# Hybrid requires a model per submesh
model_args = []
model = []
page_table = []
unused page_table var (overwritten below)
max_num_blocks=page_params["page_max_num_blocks"],
)
# Implied shuffling of blocks
permutation = torch.randperm(paged_attention_config.max_num_blocks)
According to this, max_num_blocks now represents the max blocks per DP group, right? Can it be renamed to max_num_blocks_per_dp for clarity?
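A short sketch of the rename, reusing the shuffle from the diff; the even per-user split at the end is an assumption about how the page table is built:

import torch

max_num_blocks_per_dp = 1024       # example value: block budget for one DP group
batch_size_per_dp_group = 32       # example value: users served by this DP group

# Implied shuffling of blocks, as in the diff, under the clearer name.
permutation = torch.randperm(max_num_blocks_per_dp)

# Assumed: the shuffled blocks are split evenly across the users of this DP group.
blocks_per_user = max_num_blocks_per_dp // batch_size_per_dp_group
page_table = permutation.reshape(batch_size_per_dp_group, blocks_per_user)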
)
model_args.append(model_args_i)
model.append(model_i)

return cls(model, model_args, mesh_device)

@property
Should be model_args[0] in cache_path and max_cross_attn_tokens?
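A sketch of the fix the comment suggests, assuming model_args is now the per-submesh list built above (the attribute names on the right-hand side are assumptions):

class GeneratorPropertiesSketch:
    def __init__(self, model, model_args, mesh_device):
        self.model = model
        self.model_args = model_args  # list with one entry per submesh
        self.mesh_device = mesh_device

    @property
    def cache_path(self):
        # Every submesh loads the same model, so entry 0 is representative.
        return self.model_args[0].model_cache_path  # attribute name assumed

    @property
    def max_cross_attn_tokens(self):
        return self.model_args[0].max_cross_attn_tokens  # attribute name assumed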

def prefill_forward(self, *args, **kwargs):
    return super().prefill_forward_text(*args, **kwargs)

def decode_forward(self, *args, **kwargs):
    return super().decode_forward_text(*args, **kwargs)

def allocate_kv_cache(self, *args, **kwargs):
    return allocate_kv_cache(*args, **kwargs)


class TtQwen2ForCausalLM(LlamaGenerator):
    def __init__(self, *args, **kwargs):
Should be model_args[0] in cache_path?
model_args = []
model = []
missing state_dict=None for first loop iter
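A stand-alone sketch of the pattern the comment asks for: state_dict starts as None so the first iteration loads weights, and later iterations reuse them (the loading class below is a stand-in, not the real model constructor):

class DummySubmeshModel:
    def __init__(self, name, state_dict=None):
        # Load weights only when no state_dict was handed in.
        self.state_dict = state_dict if state_dict is not None else {"weights": f"loaded-for-{name}"}

model = []
state_dict = None                    # must be None for the first loop iteration
for submesh_name in ("submesh-0", "submesh-1"):
    model_i = DummySubmeshModel(submesh_name, state_dict=state_dict)
    state_dict = model_i.state_dict  # reuse the already-loaded weights afterwards
    model.append(model_i)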
@@ -20,6 +20,44 @@
from vllm.model_executor.models.mllama import MLLAMA_IMAGE_TOKEN_ID, MLLAMA_IMAGE_TOKEN


def generate_submeshes(mesh_device):
    data_parallel = int(os.getenv("TT_DATA_PARALLEL", 1))
Could you replace this env var with a new arg like tt_data_parallel in the initialize_vllm_model class methods? I think it would be better than propagating an env var all the way here.
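A sketch of the suggested change, passing the DP factor down as an argument instead of reading the env var (the call-site name and keyword are assumptions; the create_submeshes call mirrors the diff):

import ttnn

def generate_submeshes(mesh_device, tt_data_parallel=1):
    # The DP factor now arrives as an argument rather than via TT_DATA_PARALLEL.
    num_devices = mesh_device.get_num_devices()
    return tt_data_parallel, mesh_device.create_submeshes(
        ttnn.MeshShape(1, num_devices // tt_data_parallel)
    )

# Hypothetical call site inside an initialize_vllm_model classmethod:
#   data_parallel, submeshes = generate_submeshes(mesh_device, tt_data_parallel=tt_data_parallel)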
    return data_parallel, mesh_device.create_submeshes(ttnn.MeshShape(1, num_devices // data_parallel))


def allocate_kv_cache(kv_cache_shape, dtype, num_layers, mesh_device):
TODO (@ipotkonjak-tt and/or @skhorasganiTT) Modify KV creation in vLLM to use this function and test with DP
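A hedged sketch of the call site the TODO points at; only the allocate_kv_cache signature comes from the diff above, while the shape layout and dtype are illustrative assumptions:

import ttnn

def make_kv_cache_for_submesh(mesh_device, num_layers):
    # Illustrative shape (num_blocks, num_kv_heads, block_size, head_dim);
    # the real layout is whatever vLLM's worker computes for the model.
    kv_cache_shape = (1024, 8, 64, 128)
    return allocate_kv_cache(
        kv_cache_shape=kv_cache_shape,
        dtype=ttnn.bfloat8_b,
        num_layers=num_layers,
        mesh_device=mesh_device,
    )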
Problem description
Llama3 models are missing support for data parallelism and hybrid (data + tensor) parallelism.
What's changed
Adds hybrid parallelism to the Llama3 code base using the concept of submeshes. The implementation is mainly at the LlamaGenerator level: the MeshDevice is partitioned into submeshes, and each subset of devices runs an independent model. Within each submesh, the model remains tensor parallel.
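A minimal sketch of the partitioning described above, assuming the caller supplies a mesh_device and a model factory (the factory is a stand-in for the real per-submesh model construction):

import ttnn

def build_hybrid_models(mesh_device, data_parallel, create_model):
    # Split the mesh into `data_parallel` submeshes; each hosts an independent replica.
    num_devices = mesh_device.get_num_devices()
    submeshes = mesh_device.create_submeshes(ttnn.MeshShape(1, num_devices // data_parallel))

    models = []
    for submesh in submeshes:
        # Within a submesh the model remains tensor parallel across its devices.
        models.append(create_model(submesh))
    return models

On a T3K (8 devices), for example, data_parallel=4 yields four 2-device submeshes, each running its own tensor-parallel model.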
Checklist