[9195] optimize group value bytes#23193
Conversation
2e3d6e0 to
d0f3e0d
Compare
ce12740 to
1e74a3d
Compare
perf: zero-copy emit and unified drain for GroupValuesBytes Replace the O(n) Buffer::from(slice) memcpy in ArrowBytesMap::emit() with a zero-copy path: freeze the MutableBuffer (finish()), slice the emitted window from the frozen Arc, and copy only the surviving bytes back into a fresh builder. Add drain_emitted(n) to evict hash table entries for the n emitted values and renumber survivors, enabling EmitTo::First(n) without a full rebuild. Unify GroupValuesBytes::emit() to a single code path for both EmitTo::All and EmitTo::First(n): compute n, call map.emit(n), then map.drain_emitted(n). remove local benchmarks
1e74a3d to
f77b22b
Compare
| pub fn emit(&mut self, n: usize) -> ArrayRef { | ||
| let total = self.offsets.len() - 1; | ||
| let remaining = total - self.emitted; | ||
| assert!(n <= remaining, "emit({n}): only {remaining} values remain"); |
There was a problem hiding this comment.
Honestly not sure when to return a panic as opposed to just clamping down on n. I think it's fair to make the caller responsible for passing in a correct n. Im open to ideas against this
0982ab0 to
d3d6ef4
Compare
d3d6ef4 to
337af59
Compare
|
run benchmarks tpch_sf10 tpch_mem_sf1 clickbench_1 |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-t-kid/optimize-groupValueBytes (902effe) to a27f030 (merge-base) diff using: tpch_sf10 File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-t-kid/optimize-groupValueBytes (902effe) to a27f030 (merge-base) diff using: tpch_mem_sf1 File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-t-kid/optimize-groupValueBytes (902effe) to a27f030 (merge-base) diff using: clickbench_1 File an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
1 similar comment
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
@gabotechs benchmarks failed, I don't see a reason from the bot as to why. |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_1 — base (merge-base)
clickbench_1 — branch
File an issue against this benchmark runner |
|
Well, we got one at least. Something does not look right in the screenshot you shared. In ClickBench, some queries claim to be 25x more performant. That is very suspicious, I don't think that's possible while maintaining correct results. |
|
run benchmarks tpch tpcds |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-t-kid/optimize-groupValueBytes (902effe) to a27f030 (merge-base) diff using: tpcds File an issue against this benchmark runner |
I agree, the queries with the largest speed ups also shouldn't have been targeted by these optimizations. Thats why I wanted to double check using @adriangbot. I need to also investigate why my local benchmarks were so off. |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-t-kid/optimize-groupValueBytes (902effe) to a27f030 (merge-base) diff using: tpch File an issue against this benchmark runner |
| self.num_groups = 0; | ||
| self.map.take().into_state() | ||
| } | ||
| EmitTo::First(n) => { |
There was a problem hiding this comment.
May be worth adding a match if branch here to see if n == group_size. in that case also just use into_state()
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |



Which issue does this PR close?
utf8ViewandbinaryViewRationale for this change
The previous
EmitTo::First(n)path inGroupValuesBytesusedtake().into_state()to drain the entire map, then re-interned the surviving groups back from scratch, rebuilding the hash table on every spill. This PR adds a dedicated emit(n) method toArrowBytesMapthat emits the first n values zero-copy by slicing directly from the frozen buffer, compacts the surviving bytes in-place, and rebases the hash table entries rather than discarding and rebuilding them.What changes are included in this PR?
ArrowBytesMap::emit(n): new method that emits the first n values zero-copy, compacts the surviving buffer in-place, and rebases non-inline hash table offsets so the map stays valid without a rebuildGroupValuesBytes::emit:EmitTo::First(n)now uses map.emit(n) instead of take().into_state() + re-intern, eliminating the O(n) hash table rebuild on every spill;EmitTo::Allkeeps the O(1) take().into_state() fast pathAre these changes tested?
yes, ran into some edge cases that the test caught & fixed.
Are there any user-facing changes?
yes, this adds a public method to
ArrowBytesMap