From ed502c68de798ae449c323b53cf4dee2d1beeb5c Mon Sep 17 00:00:00 2001
From: Electronic-Waste <2690692950@qq.com>
Date: Tue, 18 Feb 2025 15:55:01 +0000
Subject: [PATCH] doc: move torchtune sections to proposal and design chapters.

Signed-off-by: Electronic-Waste <2690692950@qq.com>
---
 docs/proposals/2401-llm-trainer-v2/README.md | 95 +++++++++-----------
 1 file changed, 42 insertions(+), 53 deletions(-)

diff --git a/docs/proposals/2401-llm-trainer-v2/README.md b/docs/proposals/2401-llm-trainer-v2/README.md
index ed2ecbc385..fde19eb7c1 100644
--- a/docs/proposals/2401-llm-trainer-v2/README.md
+++ b/docs/proposals/2401-llm-trainer-v2/README.md
@@ -38,10 +38,52 @@ By now, Kubeflow Training V1 has implemented a [Trainer for LLM](../2003-train-a

## Proposal

`torchtune` is a PyTorch-native library for easily authoring, fine-tuning and experimenting with LLMs. It provides rich support for LLM fine-tuning:

1. Modular native-PyTorch implementations of popular LLMs
2. Training recipes for a variety of fine-tuning techniques
3. Support for distributed training using [FSDP2](https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md)
4. YAML configs for easily configuring training runs

`torchtune` is conceptually close to our LLM Trainer: its [core concepts](https://pytorch.org/torchtune/main/overview.html#key-concepts), "recipes" and "configs", map naturally to our “LLM Trainer Script” and the “[Trainer field in TrainJob](https://github.com/kubeflow/training-operator/blob/cf741267f8f8ec96592178532b6787bab3f11110/pkg/apis/kubeflow.org/v2alpha1/trainjob_types.go#L110-L111)”. **It’s the easiest way for us to implement the LLM Trainer**.

An example of using `torchtune`:

```bash
$ tune ls
RECIPE                          CONFIG
full_finetune_single_device     llama2/7B_full_low_memory
                                mistral/7B_full_low_memory
full_finetune_distributed       llama2/7B_full
                                llama2/13B_full
                                mistral/7B_full
lora_finetune_single_device     llama2/7B_lora_single_device
                                llama2/7B_qlora_single_device
                                mistral/7B_lora_single_device

$ tune run lora_finetune_single_device --config llama2/7B_lora_single_device epochs=1
INFO:torchtune.utils.logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:
Writing logs to /tmp/lora_finetune_output/log_1713194212.txt
INFO:torchtune.utils.logging:Model is initialized with precision torch.bfloat16.
INFO:torchtune.utils.logging:Tokenizer is initialized from file.
INFO:torchtune.utils.logging:Optimizer and loss are initialized.
INFO:torchtune.utils.logging:Loss is initialized.
INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
INFO:torchtune.utils.logging:Learning rate scheduler is initialized.
1|52|Loss: 2.3697006702423096: 0%|▏ | 52/25880 [00:24<3:55:01, 1.83it/s]
```

## Design Details

### `torchtune` Plugin

As shown in the [torchtune official document](https://pytorch.org/torchtune/main/tune_cli.html#run-a-recipe) and [source code](https://github.com/pytorch/torchtune/blob/75965d4281b9b76c454630d015221b9933c77bf3/torchtune/_cli/run.py#L113-L118), distributed training arguments such as `--nnodes` and `--nproc_per_node` must be passed ahead of the recipe argument on the command line and **cannot be passed via environment variables** following the `PET_XXX` convention. Moreover, `torchtune` differs substantially from the `torchrun` fine-tuning paradigm because it is **recipe and config-based**, which may require additional mutation operations on the config file. Here is an [example](https://github.com/Electronic-Waste/kubeflow-llm-trainer/blob/main/torchtune-llm-finetuning.yaml).
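To make the ordering constraint concrete, the sketch below shows roughly what a distributed `tune run` invocation looks like. The recipe and config names come from the `tune ls` output above, while the node count, process count, and override keys are illustrative placeholders rather than values prescribed by this proposal:

```bash
# Illustrative sketch only: the distributed arguments (--nnodes, --nproc_per_node)
# must appear before the recipe name, while config overrides are appended as
# key=value pairs after --config. All concrete values here are placeholders.
tune run \
  --nnodes 1 \
  --nproc_per_node 4 \
  full_finetune_distributed \
  --config llama2/7B_full \
  epochs=1 \
  batch_size=4
```

Whatever launches `torchtune` on Kubernetes therefore has to assemble a command of this shape and inject config overrides as trailing `key=value` pairs, instead of relying on `PET_*` environment variables.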
Thus, if we decide to adopt `torchtune` as a launcher for LLM fine-tuning on Kubernetes, we need to implement a new plugin for it. The new plugin should have these abilities:

1. Parse the distributed training arguments in the TrainJob and TrainingRuntime APIs, and integrate them with the `tune run` command.
2. Handle overrides in the `torchtune` fine-tuning configuration file.
3. Validate `torchtune`-specific requirements.

### Fine-Tuning Config

@@ -212,47 +254,6 @@ And it's worthwhile to notice that we'll preprocess dataset for users with built

In the future, we'll provide users with more options on launchers (`torchtune`, `accelerate`), frameworks (TensorFlow, Jax, etc.) and fine-tuning techniques (RLHF, Distillation, etc.).

-### Native PyTorch Launcher - `torchtune`
-
-`torchtune` is a PyTorch-native library for easily authoring, fine-tuning and experimenting with LLMs. It provides rich support for LLM fine-tuning:
-
-1. Modular native-PyTorch implementations of popular LLMs
-2. Training recipes for a variety of fine-tuning techniques
-3. Support for distributed training using [FSDP2](https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md)
-4. YAML configs for easily configuring training runs
-
-`torchtune` is something like our LLM Trainer, because its [core concepts](https://pytorch.org/torchtune/main/overview.html#key-concepts) "recipes" and "configs" can be easily corresponded to our “LLM Trainer Script” and “[Trainer field in TrainJob](https://github.com/kubeflow/training-operator/blob/cf741267f8f8ec96592178532b6787bab3f11110/pkg/apis/kubeflow.org/v2alpha1/trainjob_types.go#L110-L111)”. **It’s the easiest way for us to implement the LLM Trainer**.
-
-**However, `torchtune` only supports single-node training**, which means that we can only have 1 pod in the training phase (`--nnodes=1`, [related issue](https://github.com/pytorch/torchtune/issues/2018)). This would put a strong restriction for us on scaling training pods on Kubernetes. And also, **it only supports some popular LLMs** and will bring inflexibility for us to fine-tune other models.
-
-An example for using `torchtune`:
-
-```bash
-$ tune ls
-RECIPE CONFIG
-full_finetune_single_device llama2/7B_full_low_memory
- mistral/7B_full_low_memory
-full_finetune_distributed llama2/7B_full
- llama2/13B_full
- mistral/7B_full
-lora_finetune_single_device llama2/7B_lora_single_device
- llama2/7B_qlora_single_device
- mistral/7B_lora_single_device
-
-$ tune run lora_finetune_single_device --config llama2/7B_lora_single_device epochs=1
-INFO:torchtune.utils.logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:
-Writing logs to /tmp/lora_finetune_output/log_1713194212.txt
-INFO:torchtune.utils.logging:Model is initialized with precision torch.bfloat16.
-INFO:torchtune.utils.logging:Tokenizer is initialized from file.
-INFO:torchtune.utils.logging:Optimizer and loss are initialized.
-INFO:torchtune.utils.logging:Loss is initialized.
-INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
-INFO:torchtune.utils.logging:Learning rate scheduler is initialized.
-1|52|Loss: 2.3697006702423096: 0%|▏ | 52/25880 [00:24<3:55:01, 1.83it/s]
-```
-
-(**Note**: We need to create a new plugin for `torchtune`, so that it can fit in the yaml-based fine-tuning configurations. And also we may need to explore how to integrate the recipes provided by `torchtune`.)
- ### HF Accelerate CLI - `accelerate` Huggingface Accelerate CLI is a simplified distributed training launch tool, which is **targeted to junior users not familiar with distributed training**. The official slogan for Huggingface Accelerate is “Run your raw PyTorch training script on any kind of device”. There are several advantages to adopt it: @@ -489,16 +490,4 @@ class ZeroConfig: ``` -### Backend Design - `torchtune` - -#### New Runtime Plugin - -As is shown in the [torchtune official document](https://pytorch.org/torchtune/main/tune_cli.html#run-a-recipe) and [source code](https://github.com/pytorch/torchtune/blob/75965d4281b9b76c454630d015221b9933c77bf3/torchtune/_cli/run.py#L113-L118), the distributed training arguments like `--nnodes` and `--nproc_per_node` should be passed ahead of the recipe argument in the command line, and **cannot be passed by the environment variables** in the `PET_XXX` convention. And also, `torchtune` is extremely different from the fine-tuning paradigm of `torchrun` because it is **recipe and config-based**, which may need more mutation operations in the config file. Here is an [example](https://github.com/Electronic-Waste/kubeflow-llm-trainer/blob/main/torchtune-llm-finetuning.yaml). - -Thus, we need to implement a new plugin for `torchtune` if we decide to adopt `torchtune` as a launcher for LLM fine-tuning on Kubernetes. And the new plugin should have these abilities: - -1. Parse distributed training arguments in TrainJob and TrainingRuntime API, and integrate them with the `tune run` command. -2. Handle overrides in the `torchtune` fine-tuning configuration file. -3. Validate some requirements, such as `--nnodes` should be equal to 1. - \# WIP