
Volume attachment limits for p4d.24xlarge are too low? #2301

Open · j-vizcaino opened this issue Jan 22, 2025 · 4 comments
Labels: kind/bug, priority/important-soon

j-vizcaino commented Jan 22, 2025

/kind bug

What happened?

The csinode for p4d.24xlarge reports 6 allocatable EBS volumes, but the instance can support more.

According to the AWS docs, these instance types should support up to 11 EBS volumes.
Since our p4d.24xlarge instances include 4 EFA/ENI devices, that brings the number down to 7. Subtracting the root EBS volume leaves 6.

However, those instances, even with EFA enabled, support at least 8 EBS volumes (+1 for root); see below.

How to reproduce it (as minimally and precisely as possible)?

  • create a p4d.24xlarge instance with EFA enabled
  • describe the associated csinode resource (or look for the ebs-csi-node pod log line): the allocatable volume count is 6
  • attach 6 EBS volumes, using pods
  • create a 7th pod: the pod stays in Pending due to insufficient allocatable EBS capacity (expected behaviour)
  • update the ebs-csi-node DaemonSet and force the number of allocatable EBS volumes to 11 by adding --volume-attach-limit=11
  • watch the 7th pod start, with an additional EBS volume attached
  • (bonus) adding an 8th pod works as well (a kubectl sketch of these steps follows the list)
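A minimal kubectl sketch of the steps above; the node name and the container index are placeholders, and it assumes the driver runs as the ebs-csi-node DaemonSet in kube-system:

```sh
# Placeholder: the p4d.24xlarge node to inspect.
NODE=<node-name>

# 1. Check the allocatable volume count the driver registered (reports 6 here).
kubectl get csinode "$NODE" \
  -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'

# 2. Override the limit by appending the flag to the node plugin's args
#    (assumes the ebs-plugin container is the first container in the pod spec).
kubectl -n kube-system patch daemonset ebs-csi-node --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--volume-attach-limit=11"}]'

# 3. After the node pods restart, the csinode should report 11 and the
#    pending 7th (and 8th) pods should schedule.
kubectl get csinode "$NODE" \
  -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
```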

Anything else we need to know?:

It's unclear whether the issue lies in how the ebs-csi code computes the number of available EBS volumes, or on the AWS side, with the EC2 metadata endpoint reporting incorrect numbers. Either way, it's clear that these instances support more EBS attachments than the driver reports.

Environment

  • Kubernetes version (use kubectl version): v1.29.12-eks-2d5f260
  • Driver version: v1.38.1-eksbuild.2 (EKS addon)
k8s-ci-robot added the kind/bug label Jan 22, 2025
j-vizcaino (Author) commented:

Output of lspci on a p4d.24xlarge node with both EFA and 9 EBS attachments (including the root volume), in case this helps:

# lspci -tv
-+-[0000:a0]-+-01.0  Amazon.com, Inc. Elastic Network Adapter (ENA)
 |           +-1b.0  Amazon.com, Inc. Elastic Fabric Adapter (EFA)
 |           +-1c.0  NVIDIA Corporation GA100 [A100 SXM4 40GB]
 |           +-1d.0  NVIDIA Corporation GA100 [A100 SXM4 40GB]
 |           +-1e.0  Amazon.com, Inc. NVMe SSD Controller
 |           \-1f.0  Amazon.com, Inc. NVMe SSD Controller
 +-[0000:90]-+-01.0  Amazon.com, Inc. Elastic Network Adapter (ENA)
 |           +-1b.0  Amazon.com, Inc. Elastic Fabric Adapter (EFA)
 |           +-1c.0  NVIDIA Corporation GA100 [A100 SXM4 40GB]
 |           +-1d.0  NVIDIA Corporation GA100 [A100 SXM4 40GB]
 |           +-1e.0  Amazon.com, Inc. NVMe SSD Controller
 |           \-1f.0  Amazon.com, Inc. NVMe SSD Controller
 +-[0000:80]-+-1a.0  NVIDIA Corporation GA100 [A100 NVSwitch]
 |           +-1b.0  NVIDIA Corporation GA100 [A100 NVSwitch]
 |           +-1c.0  NVIDIA Corporation GA100 [A100 NVSwitch]
 |           +-1d.0  NVIDIA Corporation GA100 [A100 NVSwitch]
 |           +-1e.0  NVIDIA Corporation GA100 [A100 NVSwitch]
 |           \-1f.0  NVIDIA Corporation GA100 [A100 NVSwitch]
 +-[0000:20]-+-01.0  Amazon.com, Inc. Elastic Network Adapter (ENA)
 |           +-1b.0  Amazon.com, Inc. Elastic Fabric Adapter (EFA)
 |           +-1c.0  NVIDIA Corporation GA100 [A100 SXM4 40GB]
 |           +-1d.0  NVIDIA Corporation GA100 [A100 SXM4 40GB]
 |           +-1e.0  Amazon.com, Inc. NVMe SSD Controller
 |           \-1f.0  Amazon.com, Inc. NVMe SSD Controller
 +-[0000:10]-+-00.0  Amazon.com, Inc. Elastic Network Adapter (ENA)
 |           +-02.0  Amazon.com, Inc. Elastic Network Adapter (ENA)
 |           +-1b.0  Amazon.com, Inc. Elastic Fabric Adapter (EFA)
 |           +-1c.0  NVIDIA Corporation GA100 [A100 SXM4 40GB]
 |           +-1d.0  NVIDIA Corporation GA100 [A100 SXM4 40GB]
 |           +-1e.0  Amazon.com, Inc. NVMe SSD Controller
 |           \-1f.0  Amazon.com, Inc. NVMe SSD Controller
 \-[0000:00]-+-00.0  Intel Corporation 440FX - 82441FX PMC [Natoma]
             +-01.0  Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
             +-01.3  Intel Corporation 82371AB/EB/MB PIIX4 ACPI
             +-03.0  Amazon.com, Inc. Device 1111
             +-04.0  Amazon.com, Inc. NVMe EBS Controller
             +-17.0  Amazon.com, Inc. NVMe EBS Controller
             +-18.0  Amazon.com, Inc. NVMe EBS Controller
             +-19.0  Amazon.com, Inc. NVMe EBS Controller
             +-1a.0  Amazon.com, Inc. NVMe EBS Controller
             +-1c.0  Amazon.com, Inc. NVMe EBS Controller
             +-1d.0  Amazon.com, Inc. NVMe EBS Controller
             +-1e.0  Amazon.com, Inc. NVMe EBS Controller
             \-1f.0  Amazon.com, Inc. NVMe EBS Controller

AndrewSirenko (Contributor) commented Jan 22, 2025

Hi @j-vizcaino, thank you for opening this issue and providing great reproduction steps!

Let me look into this. We will prioritize a fix in the driver or correct the docs.

In the meantime, you can rely on our Additional Node DaemonSets feature to automate overriding the volume attachment limit for p4d.24xlarge nodes.
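For example, a sketch of such an override, assuming the driver is installed via the Helm chart; the values below are illustrative, so check the chart's additional-daemonsets documentation for the exact schema, and make sure the selector does not overlap with the primary node DaemonSet:

```sh
# Illustrative values file for the chart's Additional Node DaemonSets feature;
# the field names follow the documented pattern but should be verified against
# the chart version in use.
cat <<'EOF' > p4d-values.yaml
additionalDaemonSets:
  p4d:
    nodeSelector:
      node.kubernetes.io/instance-type: p4d.24xlarge
    volumeAttachLimit: 11
EOF

helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system --values p4d-values.yaml
```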


Pasting the relevant AWS docs wording below for posterity:

For accelerated computing instances other than VT1 instances, each accelerator counts as an attachment. For example, p4d.24xlarge instances have a shared volume limit of 28, 8 GPUs, and 8 NVMe instance store volumes. This means that you can attach up to 11 EBS volumes (28 volumes - 1 network interface - 8 GPUs - 8 NVMe instance store volumes).
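For reference, the same formula applied with the 4 EFA/ENI interfaces reported on these nodes (an illustrative calculation only, not the driver's actual logic):

```sh
# Illustrative only: the docs formula with 4 network interfaces instead of
# the single ENI the docs example assumes.
#   28 (shared limit) - 4 (network interfaces) - 8 (GPUs) - 8 (instance store) = 8
# That yields 8 attachable EBS volumes, yet the lspci output above shows
# 9 NVMe EBS controllers attached, so the accounting still does not add up.
```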

AndrewSirenko (Contributor) commented:

/priority important-soon

k8s-ci-robot added the priority/important-soon label Jan 22, 2025
torredil (Member) commented:

/assign
