Conversation

@voonhous
Member

@voonhous voonhous commented Nov 21, 2025

Describe the issue this Pull Request addresses

This PR adds the required changes as stipulated in #14267 - phase 2: Perform Column Statistics Schema Migration.

This change is part of an effort to migrate column statistics handling to use the internal HoodieSchema representation instead of direct Avro Schema objects. This facilitates schema evolution support for column statistics, particularly for the min/max values recorded in the Hudi Metadata Table.

Summary and Changelog

This patch introduces HoodieSchema into the column statistics logic across FileFormatUtils, HoodieTableMetadataUtil, and ValueMetadata to support schema migration and evolution.

  • Updated methods in FileFormatUtils and HoodieTableMetadataUtil to accept and use HoodieSchema (or a map of HoodieSchema) when aggregating column ranges and coercing values to Comparable<?>.
  • Modified ValueMetadata and ValueType to derive metadata and type information from HoodieSchema instead of Avro Schema, while preserving compatibility with existing Avro-based serialization.
  • Deprecated the use of direct Avro Schema in several column statistics-related utility methods in favor of HoodieSchema.
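
To illustrate the coercion step mentioned in the first bullet, here is a minimal, self-contained Java sketch. It is not the code in this PR: `SketchType`, `SketchSchema`, and `CoercionSketch` are hypothetical stand-ins for HoodieSchema and its utilities, and the set of types is deliberately small. It only shows the general idea that the schema's type tag, rather than the runtime class of the value, decides which Comparable a raw value becomes.

```java
import java.math.BigDecimal;

// Hypothetical type tag; a stand-in, not Hudi's real schema type enum.
enum SketchType { INT, LONG, DOUBLE, STRING, DECIMAL }

// Hypothetical minimal schema wrapper that only carries a type tag.
final class SketchSchema {
    private final SketchType type;

    SketchSchema(SketchType type) {
        this.type = type;
    }

    SketchType getType() {
        return type;
    }
}

final class CoercionSketch {
    // Converts a raw value into a Comparable chosen by the schema type.
    static Comparable<?> coerceToComparable(SketchSchema schema, Object val) {
        if (val == null) {
            return null;
        }
        switch (schema.getType()) {
            case INT:
                return Integer.valueOf(((Number) val).intValue());
            case LONG:
                return Long.valueOf(((Number) val).longValue());
            case DOUBLE:
                return Double.valueOf(((Number) val).doubleValue());
            case STRING:
                return val.toString();
            case DECIMAL:
                return new BigDecimal(val.toString());
            default:
                throw new IllegalArgumentException("Unsupported type: " + schema.getType());
        }
    }
}
```

With a LONG schema, for instance, an Integer input comes back as a Long, so min/max comparisons within one column always see a single consistent type.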

Impact

None. This change only wraps existing code in a wrapper class and adapts the functions that use Avro Schema.

Risk Level

Low. The changes primarily involve replacing direct usage of Avro's Schema with Hudi's HoodieSchema in the metadata logic, which is an internal component. Existing Avro-based serialization is maintained.

Documentation Update

@github-actions bot added the size:M (PR with lines of changes in (100, 300]) label Nov 21, 2025
@voonhous changed the title from "feat:" to "feat: (phase 2) Perform Column Statistics Schema Migration" Nov 21, 2025
@voonhous marked this pull request as draft November 21, 2025 16:23
@voonhous marked this pull request as ready for review November 21, 2025 16:36
@voonhous marked this pull request as draft November 21, 2025 16:37
@voonhous marked this pull request as ready for review November 21, 2025 18:34
@voonhous changed the title from "feat: (phase 2) Perform Column Statistics Schema Migration" to "feat (schema): (phase 2) Perform Column Statistics Schema Migration" Nov 21, 2025
@voonhous changed the title from "feat (schema): (phase 2) Perform Column Statistics Schema Migration" to "feat: (schema - phase 2) Perform Column Statistics Schema Migration" Nov 22, 2025
@voonhous
Member Author

CI is failing due to an out-of-disk-space error:

failed to register layer: write /opt/statsd/node_modules/typescript/lib/typescriptServices.js: no space left on device

@rahil-c @the-other-tim-brown This is ready for review

} else {
// schema is evolving for the column of interest.
- Schema schema = colsToIndexSchemaMap.get(a.getColumnName());
+ HoodieSchema hoodieSchema = colsToIndexSchemaMap.get(a.getColumnName());
Contributor

Nitpick: Let's keep the variable name schema

Member Author

Done

Contributor

@voonhous just a reminder to push this change

String fieldName = fieldNameFieldPair.getKey();
Schema fieldSchema = getNonNullTypeFromUnion(fieldNameFieldPair.getValue().schema());
ColumnStats colStats = allColumnStats.computeIfAbsent(fieldName, ignored -> new ColumnStats(getValueMetadata(fieldSchema, indexVersion)));
HoodieSchema hoodieFieldSchema = HoodieSchema.fromAvroSchema(fieldSchema);
Contributor

Can we update all the methods in this class to also take in HoodieSchema?

Member Author

This may take a while: there are ~3k LoC, and it touches quite a few tests too. I will work through it gradually.

Contributor

Ok, can we add a separate task for this then?

Member Author

No worries, I am already doing it now. Almost done.

* @throws IllegalArgumentException if the schema type is not supported
* @since 1.2.0
*/
public static ValueType fromSchema(HoodieSchema schema) {
Contributor

This should be easier to migrate once we have added the other types in PR #14312.

Member Author

Let me rebase this PR onto your changes since they are merged.

Contributor

These changes are merged now

Member Author

Modified fromSchema to not rely on avro for conversion.

@voonhous force-pushed the phase_2_col_stats branch 4 times, most recently from 965332d to 78a7204 on November 26, 2025 08:48
* @since 1.2.0
*/
- public static Comparable<?> coerceToComparable(Schema schema, Object val) {
+ public static Comparable<?> coerceToComparable(HoodieSchema hoodieSchema, Object val) {
Contributor

nitpick: let's just leave the variable name as schema

Member Author

Done

public abstract class FileFormatUtils {
/**
- * Aggregate column range statistics across files in a partition.
+ * Aggregate column range statistics across files in a partition using HoodieSchema.
Contributor

I think it would be helpful to add the context that the HoodieSchema is used for properly extracting the stats based on the data type.
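
As a rough illustration of why the data type matters when aggregating ranges, the following self-contained Java sketch merges per-file min/max pairs into one partition-level range. `ColumnRange` is a hypothetical class written for this example and is not Hudi's column range metadata type. The same underlying values give different ranges depending on whether they are compared as integers or as strings.

```java
import java.util.List;

// Hypothetical per-file column range; a stand-in for illustration only.
final class ColumnRange<T extends Comparable<T>> {
    final T min;
    final T max;

    ColumnRange(T min, T max) {
        this.min = min;
        this.max = max;
    }

    // Combines two ranges by keeping the smaller min and the larger max.
    ColumnRange<T> merge(ColumnRange<T> other) {
        T newMin = min.compareTo(other.min) <= 0 ? min : other.min;
        T newMax = max.compareTo(other.max) >= 0 ? max : other.max;
        return new ColumnRange<>(newMin, newMax);
    }

    // Folds the per-file ranges of one column into a single range.
    static <T extends Comparable<T>> ColumnRange<T> aggregate(List<ColumnRange<T>> perFile) {
        return perFile.stream()
            .reduce((a, b) -> a.merge(b))
            .orElseThrow(() -> new IllegalArgumentException("no ranges to aggregate"));
    }
}
```

Take two files holding the values {9, 10} and {2, 100}. Compared as integers the aggregated range is [2, 100]. Compared as strings, the per-file ranges are ["10", "9"] and ["100", "2"], and the aggregate is ["10", "9"], which would be wrong for a numeric column.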

Member Author

Done

Comment on lines 262 to 263
Schema avroSchema = valueSchema.toAvroSchema();
return DecimalMetadata.create((LogicalTypes.Decimal) avroSchema.getLogicalType());
Contributor

Can this be updated to operate directly on HoodieSchema?

Member Author

Done

*/
public static ValueType fromSchema(HoodieSchema schema) {
// Handle logical types first using instanceof checks on specialized classes
if (schema instanceof HoodieSchema.Decimal) {
Contributor

Instead of using instanceof we can just augment the switch statement below
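
A minimal sketch of what a single switch-based mapping could look like, assuming the schema exposes an enum type tag that already covers logical types. `DemoSchemaType`, `DemoValueType`, `DemoSchema`, and `ValueTypeSketch` are hypothetical names made up for this example; they are not the real HoodieSchema or ValueType APIs.

```java
// Hypothetical schema type tag that includes logical types such as DECIMAL and DATE.
enum DemoSchemaType { INT, LONG, STRING, DECIMAL, DATE, MAP }

// Hypothetical value type used by column statistics.
enum DemoValueType { INT, LONG, STRING, DECIMAL, DATE }

// Hypothetical schema wrapper carrying only its type tag.
final class DemoSchema {
    private final DemoSchemaType type;

    DemoSchema(DemoSchemaType type) {
        this.type = type;
    }

    DemoSchemaType getType() {
        return type;
    }
}

final class ValueTypeSketch {
    // One switch handles logical and primitive types alike, with no instanceof checks.
    static DemoValueType fromSchema(DemoSchema schema) {
        switch (schema.getType()) {
            case DECIMAL:
                return DemoValueType.DECIMAL;
            case DATE:
                return DemoValueType.DATE;
            case INT:
                return DemoValueType.INT;
            case LONG:
                return DemoValueType.LONG;
            case STRING:
                return DemoValueType.STRING;
            default:
                throw new IllegalArgumentException("Unsupported schema type: " + schema.getType());
        }
    }
}
```

In this sketch an unsupported tag such as MAP falls through to the default branch and raises IllegalArgumentException, in line with the `@throws` note in the Javadoc quoted earlier in this thread.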

Member Author

Done

Comment on lines 456 to 457
Map<String, HoodieSchema> hoodieSchemaMap = colsToIndexSchemaMap.entrySet().stream()
.collect(Collectors.toMap(Map.Entry::getKey, entry -> HoodieSchema.fromAvroSchema(entry.getValue())));
Contributor

Let's simplify this

    Map<String, HoodieSchema> hoodieSchemaMap = Collections.singletonMap(colName, HoodieSchema.create(schemaType));

Let's also update the method signature to take in HoodieSchemaType instead of an avro type to help break away from the avro dependency

Member Author

Done

@github-actions bot added size:XL (PR with lines of changes > 1000) and removed size:M (PR with lines of changes in (100, 300]) labels Nov 27, 2025
@voonhous force-pushed the phase_2_col_stats branch 2 times, most recently from 7d35116 to 11ddcc4 on November 27, 2025 11:37
@voonhous force-pushed the phase_2_col_stats branch 3 times, most recently from c4dc3d0 to 476d16a on November 27, 2025 13:15
@hudi-bot

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@voonhous
Member Author

Whew, all tests passing, it's ready for a second round of review now.

Labels

size:XL PR with lines of changes > 1000

3 participants