GH-343: Fix BaseVariableWidthVector and BaseLargeVariableWidthVector offset buffer serialization#989
Conversation
This comment has been minimized.
This comment has been minimized.
jbonofre
left a comment
There was a problem hiding this comment.
It's the similar fix as for valueBuffer afair. It sounds good to me.
|
@Yicong-Huang sorry to be a pain, can you please rebase ? Thanks ! |
Thanks @jbonofre . I did rebase but I think the root error in this change is |
|
@Yicong-Huang can you please rebase again ? Sorry about that. Else I can do the rebase for you. |
Thanks. Just merged the latest master back in. |
|
@Yicong-Huang thanks a lot ! I would like to include this fix on Arrow Java 19.0.0 release 😄 |
|
Closes #343 |
Thanks that'd be nice! I can work with you closely on fixing this. However I see it is still failing CI, I think the original issue is still persist
Do you have any suggestions? |
|
I did a new pass on the PR and I don't think it's correct.
I think the problem is that the PR changes |
jbonofre
left a comment
There was a problem hiding this comment.
See my previous comment about the tests.
cac75c2 to
c76dbbb
Compare
c76dbbb to
fcb011a
Compare
|
Thanks a lot, @jbonofre, for the new pass! I've reworked the PR based on your feedback. Here's a summary of the changes:
You were right that the test expectations for empty vectors should stay at 0. my earlier approach of allocating in the constructor was causing cascading issues. I've reverted that entirely. The constructor now continues to use The actual fix is now just in
For the memory leaks in test you flagged, I fixed one failing case, the |
|
@Yicong-Huang awesome ! Thanks ! I'm doing a new pass. |
jbonofre
left a comment
There was a problem hiding this comment.
LGTM with the latest changes.
Thanks !
…columns `checkBinaryOffsetsBuffer` always required at least `sizeof(offset_type)` bytes (the `+1` sentinel entry), even when `chunk.length() == 0`. Apache Arrow Java < 19.0.0, as bundled with Apache Spark, emits a 0-byte offsets buffer for empty String/Binary children (e.g. when every map row is empty so the key child has zero elements). Every other Arrow implementation (arrow-cpp, pyarrow, arrow-rs, Arrow Java >= 19) accepts this; arrow-java fixed the non-compliance in apache/arrow-java#989. When `chunk.length() == 0`, every caller's iteration loop is skipped and no byte of the offsets buffer is ever accessed, so requiring any bytes was a spec-pedantry regression, not a security necessity (confirmed by thorough code review of all 8 call sites). The `checkValidityBitmap` call inside `checkBinaryOffsetsBuffer` is unconditional and continues to validate the null bitmap. Fix: gate the `+1` sentinel on `chunk.length() > 0`; when the column is empty, set `count_plus_one = 0` so any buffer size (including 0 bytes) passes the check. Regression tests in `04356_arrow_empty_string_offsets.sh` cover: standalone empty `String` column, `Map(String, Int32)` with all-empty rows, and `Array(String)` with all-empty rows — all using the 0-byte offsets buffer that Arrow Java < 19.0.0 produces. Closes: ClickHouse#107749 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What's Changed
Fix
BaseVariableWidthVector/BaseLargeVariableWidthVectorIPC serialization whenvalueCountis 0.Problem
When
valueCount == 0,setReaderAndWriterIndex()was settingoffsetBuffer.writerIndex(0), which meansreadableBytes() == 0. IPC serializer usesreadableBytes()to determine buffer size, so 0 bytes were written to the IPC stream. This crashes IPC readers in other libraries because Arrow spec requires offset buffer to have at least one entry[0].This is a follow-up to #967 which fixed the same issue in
ListVector/LargeListVector.Fix
Modify
setReaderAndWriterIndex()to always use(valueCount + 1) * OFFSET_WIDTHfor the offset buffer'swriterIndex, moved outside the if/else branch. When the offset buffer capacity is insufficient (e.g., empty buffer from constructor or loaded vialoadFieldBuffers()), it reallocates a properly sized buffer on demand.Testing
Added tests for empty
VarCharVectorandLargeVarCharVectorverifying offset buffer has correctreadableBytes()aftersetValueCount(0).Closes #343