Commit e4f3146

apacheGH-41317: [C++] Fix crash on invalid Parquet file (apache#41366)
### Rationale for this change

Fixes the crash detailed in apache#41317 in TableBatchReader::ReadNext() on a corrupted Parquet file.

### What changes are included in this PR?

Add a validation that all read columns have the same size.

### Are these changes tested?

I've tested on the reproducer I provided in apache#41317 that it now triggers a clean error:

```
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    [_ for _ in parquet_file.iter_batches()]
  File "test.py", line 3, in <listcomp>
    [_ for _ in parquet_file.iter_batches()]
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: columns do not have the same size
```

I'm not sure if/how unit tests for corrupted datasets should be added.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".**

* GitHub Issue: apache#41317

Authored-by: Even Rouault <[email protected]>
Signed-off-by: mwish <[email protected]>
1 parent de37ee8 commit e4f3146

File tree

3 files changed: +14 -0 lines changed

cpp/src/arrow/table.cc (+2)

```diff
@@ -619,6 +619,7 @@ TableBatchReader::TableBatchReader(const Table& table)
   for (int i = 0; i < table.num_columns(); ++i) {
     column_data_[i] = table.column(i).get();
   }
+  DCHECK(table_.Validate().ok());
 }

 TableBatchReader::TableBatchReader(std::shared_ptr<Table> table)
@@ -632,6 +633,7 @@ TableBatchReader::TableBatchReader(std::shared_ptr<Table> table)
   for (int i = 0; i < owned_table_->num_columns(); ++i) {
     column_data_[i] = owned_table_->column(i).get();
   }
+  DCHECK(table_.Validate().ok());
 }

 std::shared_ptr<Schema> TableBatchReader::schema() const { return table_.schema(); }
```

cpp/src/arrow/table.h (+2)

```diff
@@ -241,6 +241,8 @@ class ARROW_EXPORT Table {
 ///
 /// The conversion is zero-copy: each record batch is a view over a slice
 /// of the table's columns.
+///
+/// The table is expected to be valid prior to using it with the batch reader.
 class ARROW_EXPORT TableBatchReader : public RecordBatchReader {
  public:
   /// \brief Construct a TableBatchReader for the given table
```

cpp/src/parquet/arrow/reader.cc (+10)

```diff
@@ -1043,6 +1043,16 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
     }
   }

+  // Check all columns has same row-size
+  if (!columns.empty()) {
+    int64_t row_size = columns[0]->length();
+    for (size_t i = 1; i < columns.size(); ++i) {
+      if (columns[i]->length() != row_size) {
+        return ::arrow::Status::Invalid("columns do not have the same size");
+      }
+    }
+  }
+
   auto table = ::arrow::Table::Make(batch_schema, std::move(columns));
   auto table_reader = std::make_shared<::arrow::TableBatchReader>(*table);
```
