refactor: improve RowCompactedSerializer by using string_view to avoid data copies#196
refactor: improve RowCompactedSerializer by using string_view to avoid data copies#196lxy-9602 wants to merge 2 commits intoalibaba:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Refactors serialization and compaction selection to reduce copies by using std::string_view in hot paths, and extends full-compaction picking logic to account for deletion vectors via BucketedDvMaintainer.
Changes:
- Enable
use_view=trueforInternalRowfield getters and add string-view based write paths inRowCompactedSerializer/ serializer utilities. - Extend
CompactStrategy::PickFullCompactionto accept aBucketedDvMaintainerand trigger rewrites when deletion vectors exist. - Update/add unit tests to cover the new compaction selection behavior and adjust persist processor test setup.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| src/paimon/core/mergetree/lookup/persist_processor_test.cpp | Reworks test fixture setup to build KeyValue from Arrow/ColumnarRow instead of BinaryRowGenerator. |
| src/paimon/core/mergetree/compact/compact_strategy_test.cpp | Updates PickFullCompaction call sites for new signature and adds DV-maintainer coverage. |
| src/paimon/core/mergetree/compact/compact_strategy.h | Adds DV maintainer parameter and logic to include files with deletion vectors in full compaction. |
| src/paimon/common/data/serializer/row_compacted_serializer.h | Adds RowWriter::WriteStringView to write length-prefixed bytes from std::string_view. |
| src/paimon/common/data/serializer/row_compacted_serializer.cpp | Enables view-based getters and removes intermediate allocations/copies when serializing string/binary. |
| src/paimon/common/data/serializer/binary_serializer_utils.cpp | Switches STRING/BINARY serialization to pass views instead of copying. |
| src/paimon/common/data/generic_row.h | Adds conversion path when a GenericRow field is stored as std::string_view but API requires BinaryString. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| bool force_rewrite_all_files) { | ||
| static std::optional<CompactUnit> PickFullCompaction( | ||
| int32_t num_levels, const std::vector<LevelSortedRun>& runs, | ||
| const std::shared_ptr<BucketedDvMaintainer>& dv_maintainer, bool force_rewrite_all_files) { |
There was a problem hiding this comment.
This changes the public PickFullCompaction signature by adding dv_maintainer, which can ripple through many callers. Consider adding an overload retaining the previous signature (calling the new one with nullptr) to keep the API backwards-compatible and reduce churn for call sites that don’t care about deletion vectors.
| const std::shared_ptr<BucketedDvMaintainer>& dv_maintainer, bool force_rewrite_all_files) { | |
| bool force_rewrite_all_files) { | |
| return PickFullCompaction(num_levels, runs, nullptr, force_rewrite_all_files); | |
| } | |
| static std::optional<CompactUnit> PickFullCompaction( | |
| int32_t num_levels, const std::vector<LevelSortedRun>& runs, | |
| const std::shared_ptr<BucketedDvMaintainer>& dv_maintainer, | |
| bool force_rewrite_all_files) { |
| #pragma once | ||
|
|
||
| #include "paimon/core/compact/compact_unit.h" | ||
| #include "paimon/core/deletionvectors/bucketed_dv_maintainer.h" |
There was a problem hiding this comment.
Including bucketed_dv_maintainer.h in this header increases coupling and compile-time impact, especially since PickFullCompaction is defined inline. If feasible, move the PickFullCompaction implementation to a .cpp file so the header can forward-declare BucketedDvMaintainer and avoid pulling in the heavier dependency transitively.
| } | ||
| case arrow::Type::type::STRING: { | ||
| writer->WriteString(pos, getter->GetString(pos)); | ||
| writer->WriteStringView(pos, getter->GetStringView(pos)); |
There was a problem hiding this comment.
For arrow::Type::BINARY, this switches from getter->GetBinary(pos) to getter->GetStringView(pos) and from WriteBinary to WriteStringView. If GetStringView is implemented with STRING semantics (e.g., expecting UTF-8, or only supported for STRING arrays), this can mis-serialize BINARY data or fail at runtime. Prefer a binary-specific view API (e.g., GetBinaryView / WriteBinaryView) or keep using GetBinary + WriteBinary for the BINARY case while still avoiding allocations.
| } | ||
| case arrow::Type::type::BINARY: { | ||
| writer->WriteBinary(pos, *(getter->GetBinary(pos))); | ||
| writer->WriteStringView(pos, getter->GetStringView(pos)); |
There was a problem hiding this comment.
For arrow::Type::BINARY, this switches from getter->GetBinary(pos) to getter->GetStringView(pos) and from WriteBinary to WriteStringView. If GetStringView is implemented with STRING semantics (e.g., expecting UTF-8, or only supported for STRING arrays), this can mis-serialize BINARY data or fail at runtime. Prefer a binary-specific view API (e.g., GetBinaryView / WriteBinaryView) or keep using GetBinary + WriteBinary for the BINARY case while still avoiding allocations.
| writer->WriteStringView(pos, getter->GetStringView(pos)); | |
| writer->WriteBinary(pos, getter->GetBinary(pos)); |
| std::map<std::string, std::shared_ptr<DeletionVector>> deletion_vectors = { | ||
| {"fake.data", std::make_shared<BitmapDeletionVector>(RoaringBitmap32())}}; |
There was a problem hiding this comment.
This test assumes the file(s) produced by CreateRunsWithLevelAndSize will have file_name == \"fake.data\". If the helper generates different names (common in such factories), DeletionVectorOf(file->file_name) won’t find a DV, and the compaction selection may return nullopt, making the test flaky/incorrect. Make the DV map key match the actual file_name produced by runs (e.g., derive the name from runs/files created in the test) so the test asserts the intended behavior reliably.
| std::map<std::string, std::shared_ptr<DeletionVector>> deletion_vectors = { | ||
| {"fake.data", std::make_shared<BitmapDeletionVector>(RoaringBitmap32())}}; | ||
|
|
||
| auto dv_maintainer = std::make_shared<BucketedDvMaintainer>( | ||
| std::make_shared<DeletionVectorsIndexFile>(nullptr, nullptr, /*bitmap64=*/false, | ||
| GetDefaultPool()), | ||
| deletion_vectors); | ||
| auto runs = CreateRunsWithLevelAndSize(/*levels=*/{3}, /*sizes*/ {10}); |
There was a problem hiding this comment.
This test assumes the file(s) produced by CreateRunsWithLevelAndSize will have file_name == \"fake.data\". If the helper generates different names (common in such factories), DeletionVectorOf(file->file_name) won’t find a DV, and the compaction selection may return nullopt, making the test flaky/incorrect. Make the DV map key match the actual file_name produced by runs (e.g., derive the name from runs/files created in the test) so the test asserts the intended behavior reliably.
| std::map<std::string, std::shared_ptr<DeletionVector>> deletion_vectors = { | |
| {"fake.data", std::make_shared<BitmapDeletionVector>(RoaringBitmap32())}}; | |
| auto dv_maintainer = std::make_shared<BucketedDvMaintainer>( | |
| std::make_shared<DeletionVectorsIndexFile>(nullptr, nullptr, /*bitmap64=*/false, | |
| GetDefaultPool()), | |
| deletion_vectors); | |
| auto runs = CreateRunsWithLevelAndSize(/*levels=*/{3}, /*sizes*/ {10}); | |
| auto runs = CreateRunsWithLevelAndSize(/*levels=*/{3}, /*sizes*/ {10}); | |
| // Derive the file name from the created runs to ensure the DV map key matches. | |
| ASSERT_FALSE(runs.empty()); | |
| ASSERT_TRUE(runs[0].run); | |
| ASSERT_FALSE(runs[0].run->files.empty()); | |
| const std::string& file_name = runs[0].run->files[0]->file_name; | |
| std::map<std::string, std::shared_ptr<DeletionVector>> deletion_vectors = { | |
| {file_name, std::make_shared<BitmapDeletionVector>(RoaringBitmap32())}}; | |
| auto dv_maintainer = std::make_shared<BucketedDvMaintainer>( | |
| std::make_shared<DeletionVectorsIndexFile>(nullptr, nullptr, /*bitmap64=*/false, | |
| GetDefaultPool()), | |
| deletion_vectors); |
|
|
||
| #include "paimon/core/mergetree/lookup/persist_processor.h" | ||
|
|
||
| #include "arrow/ipc/json_simple.h" |
There was a problem hiding this comment.
The test now depends on Arrow IPC ‘internal’ JSON utilities (arrow::ipc::internal::json and json_simple.h). These internal APIs are more likely to change across Arrow upgrades and ValueOrDie() will hard-abort rather than producing a clean gtest failure. Prefer a public Arrow test utility (if available in this repo’s Arrow version) and/or propagate failures via ASSERT-style checks so failures are reported as test assertions rather than process aborts.
| auto key_array = std::dynamic_pointer_cast<arrow::StructArray>( | ||
| arrow::ipc::internal::json::ArrayFromJSON(key_type, R"([[10]])").ValueOrDie()); |
There was a problem hiding this comment.
The test now depends on Arrow IPC ‘internal’ JSON utilities (arrow::ipc::internal::json and json_simple.h). These internal APIs are more likely to change across Arrow upgrades and ValueOrDie() will hard-abort rather than producing a clean gtest failure. Prefer a public Arrow test utility (if available in this repo’s Arrow version) and/or propagate failures via ASSERT-style checks so failures are reported as test assertions rather than process aborts.
| auto value_array = std::dynamic_pointer_cast<arrow::StructArray>( | ||
| arrow::ipc::internal::json::ArrayFromJSON(value_type, R"([["Alice", 10, null, 10.1]])") | ||
| .ValueOrDie()); |
There was a problem hiding this comment.
The test now depends on Arrow IPC ‘internal’ JSON utilities (arrow::ipc::internal::json and json_simple.h). These internal APIs are more likely to change across Arrow upgrades and ValueOrDie() will hard-abort rather than producing a clean gtest failure. Prefer a public Arrow test utility (if available in this repo’s Arrow version) and/or propagate failures via ASSERT-style checks so failures are reported as test assertions rather than process aborts.
| // add all files when force compacted | ||
| files_to_be_compacted.push_back(file); | ||
| } else if (dv_maintainer && dv_maintainer->DeletionVectorOf(file->file_name)) { | ||
| // check deletion vector for large files |
There was a problem hiding this comment.
The comment says ‘check deletion vector for large files’, but the condition does not check file size—only whether a deletion vector exists. To avoid misleading future readers, update the comment to reflect the actual logic (e.g., ‘rewrite files with deletion vectors’) or add an explicit size-based condition if that was intended.
| // check deletion vector for large files | |
| // rewrite files that have deletion vectors |
Purpose
Linked issue: #93
BinaryStringandstd::stringwithstd::string_viewinRowCompactedSerializerserialization interfaces to enable zero-copy access and reduce memory allocations. No functional behavior is changed, but performance is improved in hot paths.CompactStrategysupportsBucketedDvMaintainer.Tests
CompactStrategyTest, TestPickFullCompaction
Generative AI tooling
Partially Generated-by: Claude-4.6-Opus