@dantengsky
Member

@dantengsky dantengsky commented Nov 26, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Enables Parquet dictionary encoding by default and adds an NDV-based heuristic: the dictionary is disabled for a column when its number of distinct values (NDV) exceeds 10% of the block's row count.

During a streaming fuse table write, the first block’s statistics determine whether dictionaries should be disabled for the remaining blocks in that stream.
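A minimal sketch of the rule described above, not the code in this PR. The names `should_disable_dictionary` and `StreamDictionaryPolicy` are hypothetical, and treating an empty block as "keep the default" is an assumption. Only the 10% threshold and the "first block decides for the stream" behaviour come from the summary.

```rust
/// Hypothetical helper: true when a column's dictionary should be disabled,
/// i.e. when NDV / rows is strictly greater than 10%.
/// Integer arithmetic (ndv * 10 > rows) avoids floating-point edge cases.
fn should_disable_dictionary(ndv: u64, num_rows: u64) -> bool {
    if num_rows == 0 {
        // Assumption: with no rows there is nothing to measure, keep the default.
        return false;
    }
    (ndv as u128) * 10 > (num_rows as u128)
}

/// Hypothetical per-stream state: the decision is taken once, from the
/// first block's statistics, and reused for every later block.
struct StreamDictionaryPolicy {
    disabled: Option<Vec<String>>,
}

impl StreamDictionaryPolicy {
    fn new() -> Self {
        Self { disabled: None }
    }

    /// Returns the columns whose dictionaries are disabled for this stream.
    /// `column_ndv` holds (column name, NDV) pairs for the current block.
    fn observe_block(&mut self, num_rows: u64, column_ndv: &[(&str, u64)]) -> &[String] {
        if self.disabled.is_none() {
            // First block: evaluate the heuristic per column and freeze the result.
            let decided = column_ndv
                .iter()
                .filter(|(_, ndv)| should_disable_dictionary(*ndv, num_rows))
                .map(|(name, _)| name.to_string())
                .collect();
            self.disabled = Some(decided);
        }
        self.disabled.as_deref().unwrap_or(&[])
    }
}
```

In this sketch a later block with very different cardinality does not change the decision, which mirrors the trade-off stated in the summary: one cheap decision per stream rather than a re-evaluation per block.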

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-feature label (this PR introduces a new feature to the codebase) Nov 26, 2025
@dantengsky dantengsky changed the title from "feat: heuristic rules for fuse parquet dictionary page" to "feat: heuristic rule for fuse parquet dictionary page" Nov 26, 2025
@dantengsky dantengsky force-pushed the feat/dictionary-heuristic-rule branch from 81dfffa to 8325306 November 27, 2025 09:56
@dantengsky dantengsky marked this pull request as ready for review December 2, 2025 14:02
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@dantengsky
Member Author

@codex review

@BohuTANG
Member

BohuTANG commented Dec 2, 2025

@codex review

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@BohuTANG
Member

BohuTANG commented Dec 2, 2025

Review comment:

  • [P2] Disable dictionary using wrong column path for nested fields — databend/src/query/storages/common/blocks/src/parquet_rs.rs:130-139
    In build_parquet_writer_properties the dictionary is selectively disabled by calling set_column_dictionary_enabled(ColumnPath::from(field.name().as_str()), false).
    TableSchema::leaf_fields() intentionally returns field names using the "parent:child" colon notation, but Parquet’s ColumnPath expects a vector of the actual nested components
    (e.g. ["parent", "child"]). When Arrow writes nested columns it uses those real components, so the per-column configuration registered under "nested:leaf" is never looked up and
    the dictionary remains enabled. As a result, high-cardinality nested fields (struct members, array elements, etc.) always keep dictionary encoding despite the statistics indicating
    they should be disabled. To make the heuristic effective you need to populate the ColumnPath with the individual path components rather than the colon-joined string.
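To illustrate the fix suggested in the review comment, here is a small sketch that splits a colon-joined leaf name into individual path components. The helper name `leaf_name_to_path_parts` is hypothetical, and the sketch assumes that field names never contain `:` themselves. It also ignores any intermediate components that the Arrow-to-Parquet schema conversion may add for list or map types, so the real path for such columns may need more than a plain split.

```rust
/// Hypothetical helper: turns a "parent:child" leaf name into the
/// component list ["parent", "child"] that a Parquet column path expects.
/// Assumption: ':' is used only as the nesting separator.
fn leaf_name_to_path_parts(leaf_name: &str) -> Vec<String> {
    leaf_name.split(':').map(str::to_string).collect()
}
```

With the `parquet` crate, the resulting `Vec<String>` would presumably be passed to the `ColumnPath` constructor instead of the joined string; that call is left out here so the sketch stays free of external dependencies, and it should be checked against the crate version in use.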

@BohuTANG
Member

BohuTANG commented Dec 3, 2025

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@dantengsky
Member Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?

@dantengsky dantengsky added the ci-benchmark-cloud label (Benchmark: run only cloud tests for tpch/hits) Dec 5, 2025
@github-actions
Contributor

github-actions bot commented Dec 5, 2025

Docker Image for PR

  • tag: pr-19024-fba3f0f-1764930416

note: this image tag is only available for internal use.

@dantengsky dantengsky force-pushed the feat/dictionary-heuristic-rule branch from 6de5681 to ce90366 December 5, 2025 12:31
@dantengsky dantengsky force-pushed the feat/dictionary-heuristic-rule branch from c6fe056 to 6f50265 December 6, 2025 07:31
@dantengsky dantengsky added and removed the ci-benchmark-cloud label (Benchmark: run only cloud tests for tpch/hits) Dec 6, 2025
@github-actions
Contributor

github-actions bot commented Dec 6, 2025

Docker Image for PR

  • tag: pr-19024-2a4ba64-1765024271

note: this image tag is only available for internal use.

@dantengsky
Member Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Keep them coming!
