-
Notifications
You must be signed in to change notification settings - Fork 2.4k
feat: (schema - phase 2) Perform Column Statistics Schema Migration #14311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
e2f5d30 to
efa976a
Compare
|
CI is failing due to out of disk space error: @rahil-c @the-other-tim-brown This is ready for review |
| } else { | ||
| // schema is evolving for the column of interest. | ||
| Schema schema = colsToIndexSchemaMap.get(a.getColumnName()); | ||
| HoodieSchema hoodieSchema = colsToIndexSchemaMap.get(a.getColumnName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: Let's keep the variable name schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@voonhous just a reminder to push this change
| String fieldName = fieldNameFieldPair.getKey(); | ||
| Schema fieldSchema = getNonNullTypeFromUnion(fieldNameFieldPair.getValue().schema()); | ||
| ColumnStats colStats = allColumnStats.computeIfAbsent(fieldName, ignored -> new ColumnStats(getValueMetadata(fieldSchema, indexVersion))); | ||
| HoodieSchema hoodieFieldSchema = HoodieSchema.fromAvroSchema(fieldSchema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update all the methods in this class to also take in HoodieSchema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may take awhile, there's ~3k LoC, it touches quite abit of tests too, will do it slowly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, can we add a separate task for this then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, i am already doing it now. Almost done.
| * @throws IllegalArgumentException if the schema type is not supported | ||
| * @since 1.2.0 | ||
| */ | ||
| public static ValueType fromSchema(HoodieSchema schema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be easier to migrate once we have added the other types in this PR #14312
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me rebase this PR onto your changes since they are merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes are merged now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified fromSchema to not rely on avro for conversion.
965332d to
78a7204
Compare
78a7204 to
94f6e10
Compare
| * @since 1.2.0 | ||
| */ | ||
| public static Comparable<?> coerceToComparable(Schema schema, Object val) { | ||
| public static Comparable<?> coerceToComparable(HoodieSchema hoodieSchema, Object val) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: let's just leave the variable name as schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| public abstract class FileFormatUtils { | ||
| /** | ||
| * Aggregate column range statistics across files in a partition. | ||
| * Aggregate column range statistics across files in a partition using HoodieSchema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to add the context that the HoodieSchema is used for properly extracting the stats based on the data type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| Schema avroSchema = valueSchema.toAvroSchema(); | ||
| return DecimalMetadata.create((LogicalTypes.Decimal) avroSchema.getLogicalType()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be updated to operate directly on HoodieSchema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| */ | ||
| public static ValueType fromSchema(HoodieSchema schema) { | ||
| // Handle logical types first using instanceof checks on specialized classes | ||
| if (schema instanceof HoodieSchema.Decimal) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using instanceof we can just augment the switch statement below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| Map<String, HoodieSchema> hoodieSchemaMap = colsToIndexSchemaMap.entrySet().stream() | ||
| .collect(Collectors.toMap(Map.Entry::getKey, entry -> HoodieSchema.fromAvroSchema(entry.getValue()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's simplify this
Map<String, HoodieSchema> hoodieSchemaMap = Collections.singletonMap(colName, HoodieSchema.create(schemaType));
Let's also update the method signature to take in HoodieSchemaType instead of an avro type to help break away from the avro dependency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
7d35116 to
11ddcc4
Compare
c4dc3d0 to
476d16a
Compare
476d16a to
7a9e53f
Compare
|
Whew, all tests passing, it's ready for a second round of review now. |
Describe the issue this Pull Request addresses
This PR adds the required changes as stipulated in #14267 - phase 2: Perform Column Statistics Schema Migration.
This change is part of an effort to migrate column statistics handling to use the internal HoodieSchema representation instead of direct Avro Schema objects. This facilitates schema evolution support for column statistics, particularly for the min/max values recorded in the Hudi Metadata Table.
Summary and Changelog
This patch introduces HoodieSchema into the column statistics logic across FileFormatUtils, HoodieTableMetadataUtil, and ValueMetadata to support schema migration and evolution.
Comparable<?>.Impact
None, we are just wrapping existing code with a wrapper class and adapting functions that use Avro schema
Risk Level
low. The changes primarily involve replacing direct usage of Avro's
Schemawith Hudi'sHoodieSchemain the metadata logic, which is an internal component. Existing Avro-based serialization is maintained.Documentation Update