Deepseekv3 CUDA illegal memory access on B200 from increasing seq_len to 8192 #1791

@danielvegamyhre

Bug description

Using B200 devgpu

Repro

  • fsdp=4, ep=4, AC=none, compile=False

  • Using the debug model with the DSV3 671B config for single-node testing, with n_layers=2 and n_dense_layers=1 to avoid OOM.

    • Is this not a valid config? Could this explain the error?
    "debugmodel": DeepSeekV3ModelArgs(
        vocab_size=129280,
        dim=7168,
        inter_dim=18432,
        moe_inter_dim=2048,
        n_layers=2,
        n_dense_layers=1,
        n_heads=128,
        moe_args=MoEArgs(
            num_experts=256,
            num_shared_experts=1,
            top_k=8,
            score_func="sigmoid",
            route_norm=True,
            route_scale=2.5,
            score_before_experts=False,
        ),
        n_expert_groups=8,
        n_limited_groups=4,
        q_lora_rank=1536,
        kv_lora_rank=512,
        qk_nope_head_dim=128,
        qk_rope_head_dim=64,
        v_head_dim=128,
        use_flex_attn=True,
        attn_mask_type="block_causal",
    ),
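On the question of whether this is a valid config: a minimal sketch of some generic consistency checks, using only values from the config and repro above. `DebugConfig` and `check_config` are hypothetical names for illustration, and these checks are assumptions about what a sane MoE config should satisfy, not torchtitan's own validation logic.

```python
from dataclasses import dataclass


@dataclass
class DebugConfig:
    # Values copied from the "debugmodel" entry and the repro settings above.
    n_layers: int = 2
    n_dense_layers: int = 1
    num_experts: int = 256
    top_k: int = 8
    n_expert_groups: int = 8
    n_limited_groups: int = 4
    expert_parallel_degree: int = 4  # ep=4 in the repro


def check_config(cfg):
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    if cfg.n_dense_layers > cfg.n_layers:
        problems.append("n_dense_layers exceeds n_layers")
    if cfg.n_layers - cfg.n_dense_layers < 1:
        problems.append("no MoE layers remain after the dense layers")
    if cfg.num_experts % cfg.expert_parallel_degree != 0:
        problems.append("num_experts is not divisible by the EP degree")
    if cfg.num_experts % cfg.n_expert_groups != 0:
        problems.append("num_experts is not divisible by n_expert_groups")
    if cfg.n_limited_groups > cfg.n_expert_groups:
        problems.append("n_limited_groups exceeds n_expert_groups")
    # Experts that remain selectable after group-limited routing.
    experts_per_group = cfg.num_experts // cfg.n_expert_groups
    if cfg.top_k > cfg.n_limited_groups * experts_per_group:
        problems.append("top_k exceeds the experts available in the limited groups")
    return problems


print(check_config(DebugConfig()))  # → []
```

All of these pass for the values above, so the basic arithmetic relationships look consistent. That does not rule out the reduced layer count interacting with something else at seq_len=8192.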

Default seq len (2048) works:

CUDA_VISIBLE_DEVICES="4,5,6,7" NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --training.steps=200 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --parallelism.tensor_parallel_degree=1 --activation_checkpoint.mode=none 

Increasing seq_len to 8192 -> CUDA illegal memory access:

CUDA_VISIBLE_DEVICES="4,5,6,7" NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --training.steps=200 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --parallelism.tensor_parallel_degree=1 --activation_checkpoint.mode=none --training.seq_len=8192
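Because CUDA kernels launch asynchronously, an illegal memory access is often reported at a later, unrelated call. A hedged sketch of rerunning the failing command with `CUDA_LAUNCH_BLOCKING=1` (a standard CUDA environment variable that makes launches synchronous) so the traceback points closer to the offending kernel. The flags and paths are copied from the command above; `build_debug_env` is a hypothetical helper name, and the actual launch is left commented out.

```python
import os
import shlex

# Flags copied verbatim from the failing seq_len=8192 command in this report.
BASE_CMD = [
    "./run_train.sh",
    "--training.steps=200",
    "--parallelism.data_parallel_shard_degree=4",
    "--parallelism.expert_parallel_degree=4",
    "--parallelism.tensor_parallel_degree=1",
    "--activation_checkpoint.mode=none",
    "--training.seq_len=8192",
]


def build_debug_env(base_env=None):
    """Return the repro environment plus synchronous kernel launches."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "CUDA_VISIBLE_DEVICES": "4,5,6,7",
            "NGPU": "4",
            "CONFIG_FILE": "./torchtitan/models/deepseek_v3/train_configs/debug_model.toml",
            # Forces each kernel launch to complete before the next call returns.
            "CUDA_LAUNCH_BLOCKING": "1",
        }
    )
    return env


# Print the equivalent shell command for inspection.
debug_env = build_debug_env({})
prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in debug_env.items())
print(prefix, " ".join(BASE_CMD))

# To actually launch (assumes the torchtitan repo root as the working directory):
# import subprocess
# subprocess.run(BASE_CMD, env=build_debug_env(), check=False)
```

Synchronous launches slow training considerably, so this is only meant for the few steps needed to reproduce the crash.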

Versions

  • torchtitan latest main branch
  • torch 10/2 nightly for cuda 12.8: torch 2.10.0.dev20251002+cu128

Metadata

Labels: bug (Something isn't working)

Status: In Progress
