file: Query physical block size and minimum I/O size #3046
base: master
Conversation
Table 9 ("Common device types and their resulting parameters") shows the parameters resulting from various device types being registered with the Linux kernel; quoted from https://people.redhat.com/msnitzer/docs/linux-advanced-storage-6.1.pdf, section 1.5.
src/core/file.cc
// - minimum_io_size: preferred minimum I/O size the device can perform without incurring read-modify-write
// - physical block size: smallest unit a physical storage device can write atomically
// - logical block size: smallest unit the storage device can address (typically 512 bytes)
size_t block_size = std::ranges::max({logical_block_size, physical_block_size, minimum_io_size});
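For illustration of what this max() resolves to (hypothetical values, not taken from the PR or the table above): a 512e Advanced Format disk reporting logical_block_size = 512, physical_block_size = 4096, minimum_io_size = 4096 yields block_size = 4096, while a RAID0 array with 64 KiB chunks, where the kernel reports minimum_io_size = 65536, yields block_size = 65536, forcing every DMA-aligned write up to 64 KiB granularity.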
physical_block_size should be the write alignment, but not the read alignment. There's no downside to reading a 512-byte logical sector from a 4096-byte physical sector disk.
Wrt write alignment, even there it's iffy. Writing 4096 bytes avoids RMW but can generate space amplification. With the currently exposed parameters, physical_block_size is the best match for writes. We may want to expose another write alignment (choosing a name will be hard) to indicate a non-optimal write block size that is smaller than the optimal one.
Thanks for pointing out the read/write differentiation! I've updated the implementation:
- Read alignment: Now uses logical_block_size only (as you suggested - no downside to reading 512-byte sectors)
- Write alignment: Now uses physical_block_size (not max(logical, physical, min_io))
You're right about the space amplification issue. I verified this in the kernel source: the Linux kernel only enforces logical_block_size alignment for O_DIRECT (see block/fops.c:blkdev_dio_invalid()):
return (iocb->ki_pos | iov_iter_count(iter)) &
(bdev_logical_block_size(bdev) - 1);
This confirms that physical_block_size and min_io are optimization hints, not requirements. Using physical_block_size provides the best balance (see the sketch below):
- Avoids hardware-level RMW (4K physical sectors)
- Prevents space amplification from RAID stripe alignment (min_io can be 64 KiB+)
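In code, the updated read/write split might look roughly like this (a minimal sketch with assumed names, not the actual diff):

```c++
// Sketch only: the struct and field names are illustrative, not seastar's real ones.
struct dma_alignment_config {
    unsigned disk_read_dma_alignment;
    unsigned disk_write_dma_alignment;
};

dma_alignment_config make_dma_alignment(unsigned logical_block_size,
                                        unsigned physical_block_size) {
    return {
        // Reads: the kernel only enforces logical_block_size for O_DIRECT, and
        // sub-physical-sector reads carry no penalty, so keep the finer granularity.
        .disk_read_dma_alignment = logical_block_size,
        // Writes: align to the physical sector to avoid device-level RMW,
        // without inflating small writes to min_io / stripe size.
        .disk_write_dma_alignment = physical_block_size,
    };
}
```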
For context, I found that RAID devices set min_io to the chunk/stripe size (see drivers/md/raid0.c:386, raid5.c:7748, raid10.c:4003), which would cause massive space waste if used for write alignment: padding every 4 KiB write up to a 64 KiB min_io, for example, would waste 60 KiB per write.
Regarding exposing another write alignment: This is an interesting idea. We could expose something like:
- disk_write_dma_alignment = physical_block_size (current, safe default)
- disk_write_dma_alignment_optimal = max(physical_block_size, min_io) (for apps willing to trade space for throughput)
However, I'm inclined to defer this until we have a concrete use case. Most Seastar applications probably want "do the right thing" rather than having to choose between alignment strategies.
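For what it's worth, the consumer side of either choice reduces to rounding sizes and offsets up to the exposed alignment. A sketch using existing seastar APIs (disk_write_dma_alignment() and align_up() exist today; a disk_write_dma_alignment_optimal() would be hypothetical but would slot in the same way):

```c++
#include <seastar/core/file.hh>
#include <seastar/core/align.hh>

// Round a write size up to the file's required write alignment.
// A throughput-oriented application could substitute the (hypothetical)
// optimal alignment here and accept the extra padding.
size_t aligned_write_size(const seastar::file& f, size_t payload_size) {
    return seastar::align_up(payload_size, size_t(f.disk_write_dma_alignment()));
}
```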
I don't understand min_io and opt_io for RAID0. The disks could just as easily read and write 512 byte blocks here.
Nor this. Why is 16k more optimal than anything else?
16k makes sense here because it avoids a RMW. But I don't understand 8k.
I looked into the Linux kernel source code to understand how these values are set. Here's what I found. TL;DR: in the Linux kernel, all RAID types set io_min to the chunk size.
Answering Your Questions
Yes, physically they can. You're right that smaller I/Os work, but they defeat the purpose of RAID0's striping.
I couldn't find this in the kernel code. RAID1 doesn't set io_min to a chunk size (it has no striping concept), so the 16k value likely comes from somewhere else; RAID1 should typically just mirror the underlying device topology.
From the kernel code: the 8k ensures I/O aligns with chunk boundaries, and the 16k (which you understood) is optimal because it writes a full data stripe.
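To make the arithmetic explicit (assuming the table's entry is a RAID5 with an 8 KiB chunk over three drives, i.e. two data plus one parity): io_min = chunk size = 8 KiB, so any smaller write forces a partial-chunk update on one member; io_opt = chunk size × data drives = 8 KiB × 2 = 16 KiB, a full data stripe, so parity can be written without first reading the old data and old parity.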
Force-pushed from d5dda7a to b1deb82.
v2:
This is a great document, but it's also ~15 years old. It's less relevant for SSD/NVMe, for example.
Force-pushed from b1deb82 to e6157f5.
Enhance block device initialization to query physical block size (BLKPBSZGET) in addition to logical block size. Differentiate read and write DMA alignment to optimize for both performance and space efficiency:

Read alignment: Use logical_block_size only
- No performance penalty for reading 512-byte logical sectors from 4K physical sector disks
- Allows fine-grained reads without forced alignment overhead

Write alignment: Use physical_block_size
- Avoids read-modify-write at the hardware level (e.g., 4K physical sectors on Advanced Format disks)
- Prevents space amplification that would occur from aligning to larger values like min_io (which can be 64 KiB+ for RAID devices)

Kernel verification: The Linux kernel only enforces logical_block_size alignment for O_DIRECT operations. From block/fops.c:blkdev_dio_invalid():

```c++
return (iocb->ki_pos | iov_iter_count(iter)) &
    (bdev_logical_block_size(bdev) - 1);
```

This confirms that physical_block_size is not enforced by the kernel - it is an optimization hint. Using physical_block_size provides the best balance between avoiding hardware-level RMW and preventing space amplification.

For RAID devices, min_io is set to the chunk/stripe size (see drivers/md/raid0.c:386, raid5.c:7748, raid10.c:4003). While optimal for throughput, using min_io for write alignment would cause excessive space waste for small writes. We leave min_io as a potential future optimization for applications that want to maximize throughput at the cost of space.

Signed-off-by: Kefu Chai <[email protected]>
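For reference, the parameters named in the commit message come from the standard block-device ioctls; a minimal standalone sketch (the device path and error handling are illustrative, not the PR's code):

```c++
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <cstdio>

int main(int argc, char** argv) {
    const char* dev = argc > 1 ? argv[1] : "/dev/sda";  // example device path
    int fd = ::open(dev, O_RDONLY | O_CLOEXEC);
    if (fd < 0) {
        std::perror("open");
        return 1;
    }
    int logical = 0;
    unsigned physical = 0, io_min = 0, io_opt = 0;
    ::ioctl(fd, BLKSSZGET, &logical);    // logical block size: smallest addressable unit
    ::ioctl(fd, BLKPBSZGET, &physical);  // physical block size: smallest atomic write unit
    ::ioctl(fd, BLKIOMIN, &io_min);      // minimum I/O size: avoids device-level RMW
    ::ioctl(fd, BLKIOOPT, &io_opt);      // optimal I/O size: e.g. RAID stripe width
    std::printf("logical=%d physical=%u io_min=%u io_opt=%u\n",
                logical, physical, io_min, io_opt);
    ::close(fd);
    return 0;
}
```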
Force-pushed from e6157f5 to 05785ef.
Good point. While the document is dated, the kernel topology API it describes is still current and works for modern devices. For modern SSDs/NVMe (verified from Linux kernel source drivers/nvme/host/core.c):

/* NOWS = Namespace Optimal Write Size */
if (id->nows)
    io_opt = bs * (1 + le16_to_cpu(id->nows));

The key difference from the RAID examples is that RAID arrays have complex stripe/chunk geometry (64 KiB chunks × multiple drives) leading to much larger min_io/opt_io values. NVMe drives, when they set NOWS, typically indicate page-size granularity (4-16 KiB).

Important: Our implementation in this PR deliberately uses only physical_block_size for write alignment, not optimal_io_size or minimum_io_size. This avoids write amplification that would occur with large optimal_io_size values (e.g., forcing all writes to 256 KiB alignment for RAID0 would be wasteful). We only need to avoid hardware-level RMW, which physical_block_size handles.

So while the document's RAID examples show legacy use cases with their stripe-based optimal sizes, we use the simpler physical_block_size approach that works well for both modern SSDs and RAID without write amplification issues.
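Worked example with illustrative numbers: NOWS is a 0's-based field, so a reported value of 3 means four blocks; with a 4096-byte block size that gives io_opt = 4096 × (1 + 3) = 16 KiB, i.e. page-scale rather than RAID-stripe-scale.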
We recently discovered that some disks lie about physical_block_size. @robertbindar (or @tchaikov if you want), suggest adjusting iotune to detect the physical block size and write it in io_properties.yaml. Then the reactor can pick it up and use it to override what it detects from the disk.
// Configure DMA alignment requirements based on block device characteristics
// - Read alignment: logical_block_size (no performance penalty for reading 512-byte sectors)
// - Write alignment: physical_block_size (avoids hardware-level RMW)
_memory_dma_alignment = write_block_size;
memory_dma_alignment is fixed at 512 IIRC, regardless of logical/physical sector sizes.
I see it can be obtained via statx stx_dio_mem_align.
Interestingly, it returns 4 for my NVMe device (and 512 for XFS files).
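For reference, a minimal standalone sketch of the statx query mentioned above (needs Linux >= 6.1 and recent kernel/glibc headers; the path is just an example):

```c++
#include <sys/stat.h>
#include <fcntl.h>
#include <cstdio>

int main(int argc, char** argv) {
    const char* path = argc > 1 ? argv[1] : "/dev/nvme0n1";  // example path
    struct statx stx {};
    if (statx(AT_FDCWD, path, 0, STATX_DIOALIGN, &stx) != 0) {
        std::perror("statx");
        return 1;
    }
    if (stx.stx_mask & STATX_DIOALIGN) {
        // stx_dio_mem_align:    required alignment of user buffers for O_DIRECT
        // stx_dio_offset_align: required alignment of file offset/length for O_DIRECT
        std::printf("mem_align=%u offset_align=%u\n",
                    stx.stx_dio_mem_align, stx.stx_dio_offset_align);
    } else {
        std::printf("STATX_DIOALIGN not supported here\n");
    }
    return 0;
}
```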
Enhance block device initialization to query additional device characteristics beyond logical block size: