[ENH] Create an optimizer rule to remove redundant Distinct calls

**Is your feature request related to a problem? Please describe.**
Performing an intersect operation adds a `Distinct` operation to the query plan, however in situations where Distinct has already been applied to the values undergoing the intersect operation, this adds a redundant `Distinct` operation. For example, the query:
```python
import pandas as pd
from dask_sql import Context

df = pd.DataFrame({"a":[1,2,2,3,3], "b":[2,3,3,4,4]})
c = Context()
c.create_table("df", df)

c.explain("SELECT DISTINCT a FROM df INTERSECT SELECT DISTINCT b FROM df")
```
results in the explain plan:
```
'LeftSemi Join: df.a = df.b
  Distinct:
    Distinct:
      TableScan: df projection=[a, b]
  Distinct:
    TableScan: df projection=[a, b]'
```
where there are 3 `Distinct` operations present.
Another example of this is in query 38 where the query:
```
select  count(*) from (
    select distinct c_last_name, c_first_name, d_date
    from store_sales, date_dim, customer
          where store_sales.ss_sold_date_sk = date_dim.d_date_sk
      and store_sales.ss_customer_sk = customer.c_customer_sk
      and d_month_seq between 1189 and 1189 + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from catalog_sales, date_dim, customer
          where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between 1189 and 1189 + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from web_sales, date_dim, customer
          where web_sales.ws_sold_date_sk = date_dim.d_date_sk
      and web_sales.ws_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between 1189 and 1189 + 11
) hot_cust
limit 100
```
leads to 5 distinct calls as seen in the explain plan:
```
Limit: skip=0, fetch=100
  Projection: COUNT(UInt8(1))
    Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]
      LeftSemi Join: customer.c_last_name = customer.c_last_name, customer.c_first_name = customer.c_first_name, date_dim.d_date = date_dim.d_date
        Distinct:
          LeftSemi Join: customer.c_last_name = customer.c_last_name, customer.c_first_name = customer.c_first_name, date_dim.d_date = date_dim.d_date
            Distinct:
              Distinct:
                Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
                  Inner Join: store_sales.ss_customer_sk = customer.c_customer_sk
                    Inner Join: store_sales.ss_sold_date_sk = date_dim.d_date_sk
                      Filter: store_sales.ss_customer_sk IS NOT NULL AND store_sales.ss_sold_date_sk IS NOT NULL
                        TableScan: store_sales projection=[ss_sold_date_sk, ss_customer_sk], partial_filters=[store_sales.ss_customer_sk IS NOT NULL, store_sales.ss_sold_date_sk IS NOT NULL]
                      Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
                        TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
                    Filter: customer.c_customer_sk IS NOT NULL
                      TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
            Distinct:
              Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
                Inner Join: catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
                  Inner Join: catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
                    Filter: catalog_sales.cs_bill_customer_sk IS NOT NULL AND catalog_sales.cs_sold_date_sk IS NOT NULL
                      TableScan: catalog_sales projection=[cs_sold_date_sk, cs_bill_customer_sk], partial_filters=[catalog_sales.cs_bill_customer_sk IS NOT NULL, catalog_sales.cs_sold_date_sk IS NOT NULL]
                    Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
                      TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
                  Filter: customer.c_customer_sk IS NOT NULL
                    TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
        Distinct:
          Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
            Inner Join: web_sales.ws_bill_customer_sk = customer.c_customer_sk
              Inner Join: web_sales.ws_sold_date_sk = date_dim.d_date_sk
                Filter: web_sales.ws_bill_customer_sk IS NOT NULL AND web_sales.ws_sold_date_sk IS NOT NULL
                  TableScan: web_sales projection=[ws_sold_date_sk, ws_bill_customer_sk], partial_filters=[web_sales.ws_bill_customer_sk IS NOT NULL, web_sales.ws_sold_date_sk IS NOT NULL]
                Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
                  TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
              Filter: customer.c_customer_sk IS NOT NULL
                TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
```

**Describe the solution you'd like**
I would like to get rid of these redundant `Distinct` operrations with an optimizer rule.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ENH] Create an optimizer rule to remove redundant Distinct calls #925

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[ENH] Create an optimizer rule to remove redundant Distinct calls #925

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions