Labels: bug (Something isn't working)
Description
Bug description
Using a B200 devgpu.

Repro
- fsdp=4, ep=4, AC=none, compile=False
- Using the debug model with the DSV3 671B config in single-node testing, setting `n_layers=2` and `n_dense_layers=1` to avoid OOM. Is this not a valid config? Could this explain the error?
```python
"debugmodel": DeepSeekV3ModelArgs(
    vocab_size=129280,
    dim=7168,
    inter_dim=18432,
    moe_inter_dim=2048,
    n_layers=2,
    n_dense_layers=1,
    n_heads=128,
    moe_args=MoEArgs(
        num_experts=256,
        num_shared_experts=1,
        top_k=8,
        score_func="sigmoid",
        route_norm=True,
        route_scale=2.5,
        score_before_experts=False,
    ),
    n_expert_groups=8,
    n_limited_groups=4,
    q_lora_rank=1536,
    kv_lora_rank=512,
    qk_nope_head_dim=128,
    qk_rope_head_dim=64,
    v_head_dim=128,
    use_flex_attn=True,
    attn_mask_type="block_causal",
),
```
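On the "is this a valid config?" question, a minimal sketch of the divisibility constraints that expert parallelism and grouped routing usually require. The variable names mirror the config values above, but the checks themselves are my assumptions, not torchtitan's actual validation logic:

```python
# Hypothetical sanity checks on the debug config's MoE/EP parameters.
num_experts = 256
ep_degree = 4          # --parallelism.expert_parallel_degree=4
n_expert_groups = 8
n_limited_groups = 4
top_k = 8

assert num_experts % ep_degree == 0        # local experts per EP rank
assert num_experts % n_expert_groups == 0  # experts per routing group
assert n_limited_groups <= n_expert_groups
assert top_k <= num_experts

print(num_experts // ep_degree)        # 64 local experts per rank
print(num_experts // n_expert_groups)  # 32 experts per group
```

All of these pass for the config above, so the basic MoE/EP arithmetic looks fine; the dense/MoE split (`n_layers=2`, `n_dense_layers=1`) still leaves one MoE layer for EP to shard.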
Default seq len (2048) works:

```bash
CUDA_VISIBLE_DEVICES="4,5,6,7" NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --training.steps=200 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --parallelism.tensor_parallel_degree=1 --activation_checkpoint.mode=none
```
Increasing seq_len to 8192 triggers a CUDA illegal memory access:

```bash
CUDA_VISIBLE_DEVICES="4,5,6,7" NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --training.steps=200 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --parallelism.tensor_parallel_degree=1 --activation_checkpoint.mode=none --training.seq_len=8192
```
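For context on what the seq_len bump changes on the MoE side, a back-of-envelope sketch. `top_k` and `num_experts` come from the config above; `batch=1` is a placeholder assumption, not the actual training batch size:

```python
# Back-of-envelope: average routed token slots per expert as seq_len grows.
def slots_per_expert(seq_len, batch=1, top_k=8, num_experts=256):
    tokens = batch * seq_len
    routed_slots = tokens * top_k      # each token is dispatched to top_k experts
    return routed_slots / num_experts  # average load per expert

print(slots_per_expert(2048))  # 64.0 at the working seq_len
print(slots_per_expert(8192))  # 256.0 at the failing seq_len
```

The 4x jump in per-expert load (and in dispatch/combine buffer sizes) between the working and failing runs is one place an out-of-bounds index could start to matter.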
Versions
- torchtitan: latest main branch
- torch: 10/2 nightly for CUDA 12.8 (`torch 2.10.0.dev20251002+cu128`)