Skip to content

[ENH] Create an optimizer rule to remove redundant Distinct calls #925

Open
@ChrisJar

Description

@ChrisJar

Is your feature request related to a problem? Please describe.
Performing an intersect operation adds a Distinct operation to the query plan, however in situations where Distinct has already been applied to the values undergoing the intersect operation, this adds a redundant Distinct operation. For example, the query:

import pandas as pd
from dask_sql import Context

df = pd.DataFrame({"a":[1,2,2,3,3], "b":[2,3,3,4,4]})
c = Context()
c.create_table("df", df)

c.explain("SELECT DISTINCT a FROM df INTERSECT SELECT DISTINCT b FROM df")

results in the explain plan:

'LeftSemi Join: df.a = df.b
  Distinct:
    Distinct:
      TableScan: df projection=[a, b]
  Distinct:
    TableScan: df projection=[a, b]'

where there are 3 Distinct operations present.
Another example of this is in query 38 where the query:

select  count(*) from (
    select distinct c_last_name, c_first_name, d_date
    from store_sales, date_dim, customer
          where store_sales.ss_sold_date_sk = date_dim.d_date_sk
      and store_sales.ss_customer_sk = customer.c_customer_sk
      and d_month_seq between 1189 and 1189 + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from catalog_sales, date_dim, customer
          where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between 1189 and 1189 + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from web_sales, date_dim, customer
          where web_sales.ws_sold_date_sk = date_dim.d_date_sk
      and web_sales.ws_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between 1189 and 1189 + 11
) hot_cust
limit 100

leads to 5 distinct calls as seen in the explain plan:

Limit: skip=0, fetch=100
  Projection: COUNT(UInt8(1))
    Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]
      LeftSemi Join: customer.c_last_name = customer.c_last_name, customer.c_first_name = customer.c_first_name, date_dim.d_date = date_dim.d_date
        Distinct:
          LeftSemi Join: customer.c_last_name = customer.c_last_name, customer.c_first_name = customer.c_first_name, date_dim.d_date = date_dim.d_date
            Distinct:
              Distinct:
                Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
                  Inner Join: store_sales.ss_customer_sk = customer.c_customer_sk
                    Inner Join: store_sales.ss_sold_date_sk = date_dim.d_date_sk
                      Filter: store_sales.ss_customer_sk IS NOT NULL AND store_sales.ss_sold_date_sk IS NOT NULL
                        TableScan: store_sales projection=[ss_sold_date_sk, ss_customer_sk], partial_filters=[store_sales.ss_customer_sk IS NOT NULL, store_sales.ss_sold_date_sk IS NOT NULL]
                      Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
                        TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
                    Filter: customer.c_customer_sk IS NOT NULL
                      TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
            Distinct:
              Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
                Inner Join: catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
                  Inner Join: catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
                    Filter: catalog_sales.cs_bill_customer_sk IS NOT NULL AND catalog_sales.cs_sold_date_sk IS NOT NULL
                      TableScan: catalog_sales projection=[cs_sold_date_sk, cs_bill_customer_sk], partial_filters=[catalog_sales.cs_bill_customer_sk IS NOT NULL, catalog_sales.cs_sold_date_sk IS NOT NULL]
                    Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
                      TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
                  Filter: customer.c_customer_sk IS NOT NULL
                    TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]
        Distinct:
          Projection: customer.c_last_name, customer.c_first_name, date_dim.d_date
            Inner Join: web_sales.ws_bill_customer_sk = customer.c_customer_sk
              Inner Join: web_sales.ws_sold_date_sk = date_dim.d_date_sk
                Filter: web_sales.ws_bill_customer_sk IS NOT NULL AND web_sales.ws_sold_date_sk IS NOT NULL
                  TableScan: web_sales projection=[ws_sold_date_sk, ws_bill_customer_sk], partial_filters=[web_sales.ws_bill_customer_sk IS NOT NULL, web_sales.ws_sold_date_sk IS NOT NULL]
                Filter: date_dim.d_date_sk IS NOT NULL AND date_dim.d_month_seq >= Int64(1189) AND date_dim.d_month_seq <= Int64(1200)
                  TableScan: date_dim projection=[d_date_sk, d_date, d_month_seq], partial_filters=[date_dim.d_date_sk IS NOT NULL]
              Filter: customer.c_customer_sk IS NOT NULL
                TableScan: customer projection=[c_customer_sk, c_first_name, c_last_name], partial_filters=[customer.c_customer_sk IS NOT NULL]

Describe the solution you'd like
I would like to get rid of these redundant Distinct operrations with an optimizer rule.

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestneeds triageAwaiting triage by a dask-sql maintainer

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions