Open
Description
Is your feature request related to a problem? Please describe.
Performing an intersect operation adds a Distinct
operation to the query plan, however in situations where Distinct has already been applied to the values undergoing the intersect operation, this adds a redundant Distinct
operation. For example, the query:
import pandas as pd
from dask_sql import Context
df = pd.DataFrame({"a":[1,2,2,3,3], "b":[2,3,3,4,4]})
c = Context()
c.create_table("df", df)
c.explain("SELECT DISTINCT a FROM df INTERSECT SELECT DISTINCT b FROM df")
results in the explain plan:
'LeftSemi Join: df.a = df.b
Distinct:
Distinct:
TableScan: df projection=[a, b]
Distinct:
TableScan: df projection=[a, b]'
where there are 3 Distinct
operations present.
Another example of this is in query 38 where the query:
select count(*) from (
select distinct c_last_name, c_first_name, d_date
from store_sales, date_dim, customer
where store_sales.ss_sold_date_sk = date_dim.d_date_sk
and store_sales.ss_customer_sk = customer.c_customer_sk
and d_month_seq between 1189 and 1189 + 11
intersect
select distinct c_last_name, c_first_name, d_date
from catalog_sales, date_dim, customer
where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
and d_month_seq between 1189 and 1189 + 11
intersect
select distinct c_last_name, c_first_name, d_date
from web_sales, date_dim, customer
where web_sales.ws_sold_date_sk = date_dim.d_date_sk
and web_sales.ws_bill_customer_sk = customer.c_customer_sk
and d_month_seq between 1189 and 1189 + 11
) hot_cust
limit 100
leads to 5 distinct calls as seen in the explain plan:
Limit: skip=0, fetch=100
Projection: COUNT(UInt8(1))
Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]
LeftSemi Join: customer.c_last_name = customer.c_last_name, customer.c_first_name = customer.c_first_name, date_dim.d_date = date_dim.d_date
Distinct:
LeftSemi Join: customer.c_last_name = customer.c_last_name, customer.c_first_name = customer.c_first_name, date_dim.d_date = date_dim.d_date
Distinct:
Distinct:
Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
Inner Join: store_sales.ss_customer_sk = customer.c_customer_sk
Inner Join: store_sales.ss_sold_date_sk = date_dim.d_date_sk
Filter: store_sales.ss_customer_sk IS NOT NULL AND store_sales.ss_sold_date_sk IS NOT NULL
TableScan: store_sales projection=[ss_sold_date_sk, ss_customer_sk], partial_filters=[store_sales.ss_customer_sk IS NOT NULL, store_sales.ss_sold_date_sk IS NOT NULL]
Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
Filter: customer.c_customer_sk IS NOT NULL
TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
Distinct:
Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
Inner Join: catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
Inner Join: catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
Filter: catalog_sales.cs_bill_customer_sk IS NOT NULL AND catalog_sales.cs_sold_date_sk IS NOT NULL
TableScan: catalog_sales projection=[cs_sold_date_sk, cs_bill_customer_sk], partial_filters=[catalog_sales.cs_bill_customer_sk IS NOT NULL, catalog_sales.cs_sold_date_sk IS NOT NULL]
Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
Filter: customer.c_customer_sk IS NOT NULL
TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
Distinct:
Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
Inner Join: web_sales.ws_bill_customer_sk = customer.c_customer_sk
Inner Join: web_sales.ws_sold_date_sk = date_dim.d_date_sk
Filter: web_sales.ws_bill_customer_sk IS NOT NULL AND web_sales.ws_sold_date_sk IS NOT NULL
TableScan: web_sales projection=[ws_sold_date_sk, ws_bill_customer_sk], partial_filters=[web_sales.ws_bill_customer_sk IS NOT NULL, web_sales.ws_sold_date_sk IS NOT NULL]
Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
Filter: customer.c_customer_sk IS NOT NULL
TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
Describe the solution you'd like
I would like to get rid of these redundant Distinct
operrations with an optimizer rule.