Labels: bug (Something isn't working)
Description
Bug description
Using a B200 devgpu.

Repro
- fsdp=4, ep=4, AC=none, compile=False
- Using the debug model with the DSV3 671B config in single-node testing, setting `n_layers=2` and `n_dense_layers=1` to avoid OOM. Is this not a valid config? Could this explain the error?
```python
"debugmodel": DeepSeekV3ModelArgs(
    vocab_size=129280,
    dim=7168,
    inter_dim=18432,
    moe_inter_dim=2048,
    n_layers=2,
    n_dense_layers=1,
    n_heads=128,
    moe_args=MoEArgs(
        num_experts=256,
        num_shared_experts=1,
        top_k=8,
        score_func="sigmoid",
        route_norm=True,
        route_scale=2.5,
        score_before_experts=False,
    ),
    n_expert_groups=8,
    n_limited_groups=4,
    q_lora_rank=1536,
    kv_lora_rank=512,
    qk_nope_head_dim=128,
    qk_rope_head_dim=64,
    v_head_dim=128,
    use_flex_attn=True,
    attn_mask_type="block_causal",
),
```
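On the "is this a valid config?" question, a minimal sketch of the divisibility constraints that expert parallelism and grouped routing usually require. The variable names mirror the config values above, but the checks themselves are my assumptions, not torchtitan's actual validation logic:

```python
# Hypothetical sanity checks on the debug config's MoE/EP parameters.
num_experts = 256
ep_degree = 4          # --parallelism.expert_parallel_degree=4
n_expert_groups = 8
n_limited_groups = 4
top_k = 8

assert num_experts % ep_degree == 0        # local experts per EP rank
assert num_experts % n_expert_groups == 0  # experts per routing group
assert n_limited_groups <= n_expert_groups
assert top_k <= num_experts

print(num_experts // ep_degree)        # 64 local experts per rank
print(num_experts // n_expert_groups)  # 32 experts per group
```

All of these pass for the config above, so the basic MoE/EP arithmetic looks fine; the dense/MoE split (`n_layers=2`, `n_dense_layers=1`) still leaves one MoE layer for EP to shard.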
Default seq len (2048) works:

```bash
CUDA_VISIBLE_DEVICES="4,5,6,7" NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --training.steps=200 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --parallelism.tensor_parallel_degree=1 --activation_checkpoint.mode=none
```
Increasing seq_len to 8192 triggers a CUDA illegal memory access:

```bash
CUDA_VISIBLE_DEVICES="4,5,6,7" NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --training.steps=200 --parallelism.data_parallel_shard_degree=4 --parallelism.expert_parallel_degree=4 --parallelism.tensor_parallel_degree=1 --activation_checkpoint.mode=none --training.seq_len=8192
```
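For context on what the seq_len bump changes on the MoE side, a back-of-envelope sketch. `top_k` and `num_experts` come from the config above; `batch=1` is a placeholder assumption, not the actual training batch size:

```python
# Back-of-envelope: average routed token slots per expert as seq_len grows.
def slots_per_expert(seq_len, batch=1, top_k=8, num_experts=256):
    tokens = batch * seq_len
    routed_slots = tokens * top_k      # each token is dispatched to top_k experts
    return routed_slots / num_experts  # average load per expert

print(slots_per_expert(2048))  # 64.0 at the working seq_len
print(slots_per_expert(8192))  # 256.0 at the failing seq_len
```

The 4x jump in per-expert load (and in dispatch/combine buffer sizes) between the working and failing runs is one place an out-of-bounds index could start to matter.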
Versions
- torchtitan: latest main branch
- torch: 10/2 nightly for CUDA 12.8 (`torch 2.10.0.dev20251002+cu128`)