Continued pretraining: Adapt language models to a new language or domain, or simply improve them by continuing pretraining (causal language modeling) on a new or domain-specific dataset.
Supervised fine-tuning: Teach language models to follow instructions, with tips on how to collect and curate your own training dataset.
`litgpt finetune_full`: This method trains all model weight parameters and is the most memory-intensive fine-tuning technique in LitGPT.
Reward modeling: Teach language models to distinguish model responses according to human or AI preferences.
Rejection sampling: A technique to boost the performance of an SFT model.
Direct preference optimization (DPO): A powerful and promising alternative to PPO.
Odds Ratio Preference Optimization (ORPO): A technique to fine-tune language models with human preferences, combining SFT and DPO in a single stage.
`litgpt finetune_lora`: A more memory-efficient alternative to full fine-tuning.
litgpt finetune_lora stabilityai/stablelm-base-alpha-3b
`litgpt finetune_adapter`: A form of prefix-tuning that prepends a learnable adaption prompt to the inputs of the attention blocks in an LLM.
`litgpt finetune_adapter_v2`: A follow-up to the adapter method (Adapter v2) that makes more parameters trainable for improved quality at a modest additional memory cost.
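As with LoRA, the adapter recipes take a model name or checkpoint directory as their argument; a hypothetical invocation, reusing the placeholder model from the LoRA example above:

litgpt finetune_adapter_v2 stabilityai/stablelm-base-alpha-3b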
- What is the relation between fine-tuning and alignment?
- How to use Nvidia NIM to connect an LLM to:
  - combinatorial search
  - logic
  - CSP
  - difference between inductive vs. deductive reasoning
  - soundness vs. completeness of inference/reasoning
- Composer?
Model | Pretrained by | Use case | Workflow Type | Launcher | Trainer library | Hardware | Notes |
---|---|---|---|---|---|---|---|
llama-8b | Meta | text generation, chat | fine-tuning | torchtune | torch | ... | ... |
llama-8b | Meta | text generation, chat | fine-tuning | deepspeed | torch | ... | ... |
llama-8b | Meta | text generation, chat | fine-tuning-lora | torchtune | torch | ... | ... |
mistral-7b | Mistral | text generation, chat | fine-tuning | torchtune | torch | ... | ... |
mistral-7b | Mistral | text generation, chat | fine-tuning | deepspeed | torch | ... | ... |
mistral-7b | Mistral | text generation, chat | fine-tuning-lora | torchtune | torch | ... | ... |
... | ... | ... | ... | ... | ... | ... | ... |
llama-70b | Meta | text generation, chat | fine-tuning | torchtune | ... | ... | ... |
llama-70b | Meta | text generation, chat | fine-tuning | deepspeed | ... | ... | ... |
llama-70b | Meta | text generation, chat | fine-tuning-lora | torchtune | ... | ... | ... |
mixtral-8x7b | Mistral | text generation, chat | fine-tuning | torchtune | ... | ... | ... |
mixtral-8x7b | Mistral | text generation, chat | fine-tuning | deepspeed | ... | ... | ... |
mixtral-8x7b | Mistral | text generation, chat | fine-tuning-lora | torchtune | ... | ... | ... |
codestral | Mistral | text generation, chat | fine-tuning | torchtune | ... | ... | ... |
stable-diffusion-xl | Stability AI | text-to-image, text-to-video | fine-tuning | ... | ... | ... | ... |
stable-diffusion-xl | Stability AI | text-to-image, text-to-video | fine-tuning | ... | ... | ... | ... |
stable-diffusion-xl | Stability AI | text-to-image, text-to-video | fine-tuning-lora | ... | ... | ... | ... |
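To make the Launcher column concrete, here is a sketch of how the llama-8b LoRA row might be launched with torchtune; the recipe and config names follow torchtune's published examples but may differ across versions, and the output directory is a placeholder:

```bash
# Download the base checkpoint, then launch the single-device LoRA recipe.
tune download meta-llama/Meta-Llama-3-8B-Instruct --output-dir /tmp/Meta-Llama-3-8B-Instruct
tune run lora_finetune_single_device --config llama3/8B_lora_single_device
```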
A critical developer-experience bottleneck in AI is the cycle of deciding what you want to do, figuring out what resources that requires, and then finding and using those resources.
Major cloud providers add a layer of indirection by buying GPUs from Nvidia and AMD and repackaging them as VM offerings such as EC2 instances. Even if Robin knows her workflow tasks require GPUs with ≥24 GB of VRAM, mapping that requirement onto the available GPUs and the cloud VMs that carry them can be difficult. Google has a handy CLI filter, and a matching UI table filter, to make this easier:
gcloud compute accelerator-types list --filter="nvidia-h100-80gb"
- A full finetune of a 7B model requires 1-4 cards with ≥24 GB VRAM.
  - Using `PagedAdamW` from `bitsandbytes` reduces the number of cards you'll need (see the sketch after this list).
- A full finetune of a 70B model requires 8 cards with 80 GB VRAM.
- A LoRA finetune of a 7B model requires ≥1 card with ≥24 GB VRAM, possibly 1-2 cards with ≥16 GB VRAM.
- A QLoRA finetune of a 7B model requires ≥1 card with ≥16 GB VRAM.
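To illustrate the `PagedAdamW` point above, here is a minimal sketch of swapping it into a plain PyTorch full-finetune loop; the model id and hyperparameters are placeholders, not recommendations:

```python
# Sketch: use bitsandbytes' paged AdamW to shrink optimizer-state memory pressure.
# Paged optimizers page optimizer state between GPU and CPU on demand, which helps
# avoid OOMs from memory spikes during a full finetune.
import bitsandbytes as bnb
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "stabilityai/stablelm-base-alpha-3b",  # placeholder checkpoint
    torch_dtype=torch.bfloat16,
).to("cuda")

optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-5, weight_decay=0.0)

# Inside the training loop you would then do the usual:
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```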
GPU Type | Architecture | Today's equivalent | Dtypes | VRAM | Memory Bandwidth | Interconnect | Server packaging | AWS | Azure | GCP |
---|---|---|---|---|---|---|---|---|---|---|
H200 SXM | Hopper | N/A | bf16, fp64, fp32, fp16, fp8, int8 | 141GB HBM3e | 4.8TB/s | NVLink 900GB/s | 4 or 8 GPUs | N/A | ... | ... |
H200 NVL | Hopper | N/A | bf16, fp64, fp32, fp16, fp8, int8 | 141GB HBM3e | 4.8TB/s | NVLink 900GB/s | 1-8 GPUs | N/A | ... | ... |
H100 PCIe | Hopper | N/A | bf16, fp64, fp32, fp16, fp8, int8 | 80GB HBM2e | 2TB/s | NVLink 600GB/s | 1-8 GPUs | N/A | ... | ... |
H100 SXM | Hopper | N/A | bf16, fp64, fp32, fp16, fp8, int8 | 80GB HBM3 | 3.35TB/s | NVLink 900GB/s | 4 or 8 GPUs | p5.48xlarge | ... | a3-megagpu-8g |
H100 NVL | Hopper | N/A | bf16, fp64, fp32, fp16, fp8, int8 | 188GB HBM3 | 7.8TB/s | NVLink 600GB/s | 2-4 GPU pairs | N/A | ... | ... |
A100 80GB PCIe | Ampere | H100 | fp64, fp32, tf32, bf16, fp16, int8 | 80GB HBM2e | 1.935TB/s | NVLink 600GB/s | ... | N/A | NC_A100_v4-series | ... |
A100 80GB SXM | Ampere | H100 | fp64, fp32, tf32, bf16, fp16, int8 | 80GB HBM2e | 2.039TB/s | NVLink 600GB/s | ... | p4de.24xlarge | ... | a2-ultragpu-1g, ..., a2-ultragpu-8g |
A100 40GB PCIe | Ampere | H100 | fp64, fp32, tf32, bf16, fp16, int8 | 40GB HBM2 | 1.555TB/s | NVLink 600GB/s | ... | N/A | ... | ... |
A100 40GB SXM | Ampere | H100 | fp64, fp32, tf32, bf16, fp16, int8 | 40GB HBM2 | 1.555TB/s | NVLink 600GB/s | ... | p4d.24xlarge | ... | a2-highgpu-1g, ..., a2-megagpu-16g |
L40S | Ada Lovelace | N/A | fp32, tf32, bf16, fp16, fp8, int4, int8 | 48GB GDDR6 | 864GB/s | PCIe Gen4 x16: 64GB/s | ... | N/A | ... | ... |
L40 | Ada Lovelace | N/A | fp32, tf32, bf16, fp16, fp8, int4, int8 | 48GB GDDR6 | 864GB/s | PCIe Gen4 x16: 64GB/s | ... | N/A | ... | ... |
L4 | Ada Lovelace | N/A | fp32, fp16, bf16, tf32, int8 | 24GB GDDR6 | 300GB/s | PCIe Gen4: 64GB/s | 1-8 GPUs | g6.xlarge, ..., g6.48xlarge | N/A | g2-standard-4, ..., g2-standard-96 |
A2 | Ampere | ... | ... | 16GB GDDR6 | 200GB/s | PCIe Gen4 | ... | ... | N/A | ... |
A40 | Ampere | L40/L40S | ... | 48GB GDDR6 | 696GB/s | NVLink 112.5GB/s | ... | N/A | ... | ... |
A30 | Ampere | ... | bf16, fp64, fp32, tf32, fp16, int4, int8 | 24GB HBM2 | 933GB/s | NVLink 200GB/s | ... | N/A | ... | ... |
A16 | Ampere | ... | fp32, tf32, fp16, int8 | 4x 16GB GDDR6 | 4x 200GB/s | PCIe Gen4: 64GB/s | ... | N/A | ... | ... |
A10G | Ampere | ... | ... | 24GB GDDR6 | 600GB/s | PCIe Gen4: 64GB/s | ... | g5.xlarge, ..., g5.48xlarge | ... | ... |
A10 | Ampere | ... | bf16, fp32, tf32, fp16, int4, int8 | 24GB GDDR6 | 600GB/s | PCIe Gen4: 64GB/s | ... | N/A | ... | ... |
T4 | Turing | ... | int4, int8 | 16GB GDDR6 | ... | PCIe Gen3: 32GB/s | ... | N/A | ... | nvidia-tesla-t4 |
RTX A6000 | Ampere | L40/L40S | ... | 48GB GDDR6 | 768GB/s | NVLink 112.5GB/s | ... | ... | ... | ... |
RTX A5000 | Ampere | L40/L40S | ... | 24GB GDDR6 | 768GB/s | NVLink 112.5GB/s | ... | ... | ... | ... |
RTX A4000 | Ampere | L40/L40S | ... | 16GB GDDR6 | 448GB/s | PCIe Gen4 | ... | ... | ... | ... |
Quadro RTX 8000 | Turing | L40/L40S | ... | 48GB GDDR6 | ... | NVLink 100GB/s | ... | ... | ... | ... |
Quadro RTX 6000 | Turing | L40/L40S | ... | 24GB GDDR6 | ... | NVLink 100GB/s | ... | ... | ... | ... |
Quadro RTX 5000 | Turing | L40/L40S | ... | 16GB GDDR6 | ... | NVLink 100GB/s | ... | ... | ... | ... |
Quadro RTX 4000 | Turing | L40/L40S | ... | 8GB GDDR6 | ... | NVLink 100GB/s | ... | ... | ... | ... |
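Once a VM is provisioned, a quick way to confirm that the cards and VRAM match what the table above promises is to query PyTorch's device properties:

```python
# Print each visible GPU's name and total VRAM.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
```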
NVLink
- A wire-based communications protocol for direct GPU-to-GPU links, first introduced by Nvidia in 2014.
- Direct GPU-to-GPU interconnect that scales multi-GPU input and output (IO) within a server/VM.
- Direct GPU-to-GPU interconnect leads to better compute utilization rates and less need for `@resources(memory=...)` (see the Metaflow sketch after this list).
- "A single NVIDIA Blackwell Tensor Core GPU supports up to 18 NVLink 100 gigabyte-per-second (GB/s) connections for a total bandwidth of 1.8 terabytes per second (TB/s)." - Nvidia
PCIe
- Peripheral Component Interconnect Express
- A standard for moving data on a bus at high speeds between graphics cards, SSDs, Ethernet connections, etc.
- Each device connected to the bus has a dedicated connection to the host, which is faster than shared bus architectures.
- PCIe devices communicate via "interconnects" or "links", point-to-point communication channels between PCIe ports that send/receive requests and interrupts.
- Example use: The PCIe bus is used to attach non-volatile memory express (NVMe), a specification for how hardware and software can better use parallelism in modern SSDs. This reduces I/O overhead.
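A practical way to see which of these links (NVLink lanes vs. PCIe host bridges) actually connect the GPUs in a given server is Nvidia's topology matrix:

nvidia-smi topo -m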
SXM
- An Nvidia product that connects GPUs by directly socketing them to the motherboard, instead of using PCIe slots to connect them to the motherboard.
- So far, each DGX system series (Pascal, Volta, Ampere, Hopper, ...) comes with its own SXM socket generation.
- SXM may have NVLink switches, allowing faster GPU-to-GPU communication.
NVSwitch
- Part of DGX-2
- Extends NVLink across nodes
- Connects multiple NVLinks to provide all-to-all GPU communication at full NVLink speed within a single rack and between racks
NIC
- Network interface card; relevant for distributed/multi-node training.
- Examples: Nvidia and Microsoft use Infiniband. AWS and GCP have proprietary NICs, the Elastic Fabric Adapter (EFA) and gVNIC, respectively.
- End user APIs like MPI rely on guarantees of systems at this level.
RDMA
- Transfers data directly from application memory (or GPU VRAM) to the wire, reducing the need for host resources and the latency of message passing.
- When it comes to Nvidia GPUs, GPUDirect and Infiniband are important examples of RDMA.
Infiniband
- Mellanox manufactured Infiniband host bus adapters and network switches. In 2019, Nvidia acquired Mellanox, an Israeli-American computer networking company and the last independent supplier of Infiniband hardware.
- Infiniband is a networking standard consisting of a physical link-layer protocol and the Verbs API, an RDMA implementation; hosts connect to it through Infiniband NICs (host channel adapters).
- Interconnect bottleneck: when the connections between integrated circuits are slower than the computation running within them, so data movement, not compute, limits throughput.
- On AWS, the proprietary Infiniband substitute is Elastic Fabric Adapter (EFA).
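On a host with an RDMA-capable fabric, the libibverbs utilities will list the Verbs devices the RDMA stack can see, which is a quick sanity check that the adapter is wired up:

ibv_devinfo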
GPUDirect
- Nvidia doesn't market Infiniband much. Instead, they emphasize GPUDirect, a family of technologies that lets NICs, storage devices, and other GPUs read and write GPU memory directly (e.g., GPUDirect RDMA), avoiding extra copies through host CPU memory.
NCCL
- ...
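NCCL (the NVIDIA Collective Communications Library) implements the collective operations (all-reduce, all-gather, broadcast, ...) that distributed training frameworks call under the hood. A minimal PyTorch sketch using the nccl backend, assuming it is launched with torchrun so the rank and world-size environment variables are set:

```python
# Sketch: an all-reduce over the NCCL backend.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; after all_reduce every rank holds the sum.
    x = torch.ones(4, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```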
Here are some resources with heuristics and data that simplify this process:
- What resources does my use case require?
- Is my cluster behaving reasonably?
- HPC latency numbers
- Use the `distributed-training-checklist` to test that core NCCL operations are functional in your Metaflow deployment.
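A common smoke test for those NCCL operations is the `all_reduce_perf` binary from Nvidia's nccl-tests repository, e.g. on a single 8-GPU node (sweeping message sizes from 8 bytes to 128 MB):

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8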
As of June 2024, the GCP zones offering H100s (A3 machine types) include:
- Tokyo, Japan, APAC: `asia-northeast1-b`
- Jurong West, Singapore: `asia-southeast1-b`
- St. Ghislain, Belgium: `europe-west1-b`
- Eemshaven, Netherlands: `europe-west4-b`
- Tel Aviv, Israel: `me-west1-b`
- Ashburn, Virginia: `us-east4-b`
- Council Bluffs, Iowa: `us-central1-c`
- Columbus, Ohio: `us-east5-a`
- The Dalles, Oregon: `us-west1-a`
- Las Vegas, Nevada: `us-west4-a`