Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement predicate pruning for not like expressions #14567

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

UBarney
Copy link
Contributor

@UBarney UBarney commented Feb 10, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

For predicate col NOT LIKE 'foo%', we rewrite it as (col_min NOT LIKE 'foo%' OR col_max NOT LIKE 'foo%'). If both col_min and col_max have the prefix foo, we skip the entire row group (as we can be certain that all data in this row group has the prefix foo).

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate labels Feb 10, 2025
@UBarney UBarney marked this pull request as ready for review February 10, 2025 04:48
@alamb
Copy link
Contributor

alamb commented Feb 10, 2025

FYI @adriangb

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this should also work with no wildcards col not like 'foo'? Or do we optimize like expressions without wildcards into an equality check? If we don't already do that optimization we probably should 😄

Comment on lines +113 to +118
#[tokio::test]
async fn test_utf8_not_like_ecsape() {
Utf8Test::new(|value| col("a").not_like(lit(format!("\\%{}%", value))))
.run()
.await;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love these fuzz tests! These 3 lines of code give me great confidence that this PR does the right thing 😄

Operator::NotLikeMatch => {
build_not_like_match(expr_builder).ok_or_else(|| {
plan_datafusion_err!(
"The NOT LIKE expression with wildcards is only supported at the end of the pattern"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a general note, not necessarily for this PR: I do think we should somehow make these errors (also for the existing LIKE case) a bit better if there is any other reason they might be rejected? Or if this is the only reason, a more generic error might be better? I realize these errors likely never surface to users but I think it'd be unfortunate to see an error like this one when the actual reason it wasn't supported has nothing to do with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Now it return Result<Arc<dyn PhysicalExpr>>

Comment on lines +1737 to +1734
// Example: For pattern "foo%bar", the row group might include values like
// ["foobar", "food", "foodbar"], making it unsafe to prune.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand upon the examples here? My naive gut instinct (which I think is wrong, I tried before 😓) is that it should be able to truncate foo%bar to foo%. Maybe an example that can show what would go wrong if you tried that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to explain why it goes wrong. Please take another look.

Copy link
Contributor

@adriangb adriangb Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense thanks. @findepi do you know if Trino handles this and if so does it do it in similar way?

@UBarney UBarney marked this pull request as draft February 11, 2025 01:54
@UBarney UBarney marked this pull request as ready for review February 11, 2025 05:40
@UBarney
Copy link
Contributor Author

UBarney commented Feb 11, 2025

I assume this should also work with no wildcards col not like 'foo'?

Yes. add some test to demonstrate it
There's also an optimization to rewrite col not like 'foo' as col != 'foo'

> explain select * from values ('foo'), ('bar') where column1 not like 'foo';
+---------------+-------------------------------------------------------+
| plan_type     | plan                                                  |
+---------------+-------------------------------------------------------+
| logical_plan  | Filter: column1 != Utf8("foo")                        |
|               |   Values: (Utf8("foo")), (Utf8("bar"))                |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192           |
|               |   FilterExec: column1@0 != foo                        |
|               |     DataSourceExec: partitions=1, partition_sizes=[1] |
|               |                                                       |
+---------------+-------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.006 seconds.

@UBarney UBarney requested a review from adriangb February 11, 2025 05:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support predicate pruning for NOT LIKE expressions
3 participants