Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create user facing APIs for building PyDough code without any code-transformation steps #237

Open
knassre-bodo opened this issue Jan 29, 2025 · 1 comment
Labels
documentation Improvements or additions to documentation effort - medium mid-sized issue with average implementation time/difficulty enhancement New feature or request extensibility Increasing situations in which PyDough works refactor improving the quality of PyDough code user feature Adding a new user-facing feature/functionality

Comments

@knassre-bodo
Copy link
Contributor

knassre-bodo commented Jan 29, 2025

Goal: create an equivalent of the spark expressions classes that can be used to build unqualified nodes w/o writing invalid Python code that needs to be transformed by an API such as the Jupyter %%pydough cell magic. This will allow users to write PyDough code that is valid Python code & is formatter/mypy friendly, help users understand PyDough code as native Python code, and also becomes a useful tool for writing unit tests without going through extra steps to define the PyDough code. This API should not be a preferred path for LLMs.

Secondary benefit: some of the builder APIs may become useful if the correct PyDough code is difficult to write due to name overloads between collections/properties vs variables in the namespace. Currently, the variable in the namespace always wins, but using builder nodes can help avoid this.

The proposed API is as follows:

  • A universal builder object can be imported via from pydough.builder import pydough_builder
  • The pydough_builder object has the following methods to create new builder nodes:
    • pydough_builder.var(name) returns the equivalent of an unqualified access (e.g. in TPCH(n=COUNT(Customers.WHERE(nation.name == "GERMANY"))), the terms TPCH, Customers and nation).
    • pydough_builder.function(name, *args, **kwargs) takes in the name of a function and calls it on the provided arguments & keyword arguments, which should either be literals or builder nodes.
    • pydough_builder.partition(data, name, keys) takes in a builder node for the data to partition, the name to use for the partitioned data, and 1+ string names (either a single string or iterable string) for the keys from data to use for partitioning.
    • pydough_builder.back(levels) is the equivalent of a BACK call (down the line, will need to add the same for BEST, PREV and NEXT)
    • pydough_builder.literal(value) creates a literal node from a Python value (from one of the valid Python literal types). This is usually optional since the builder will implicitly convert literals when it can.
  • Builder nodes have the following APIs:
    • node[field] builds a access to a property/subcollection of node using the string field to determine which property to access
    • node.where(cond) builds a WHERE using the builder node cond
    • node.order_by(key1, key2, ...) builds an ORDER_BY using the builder nodes key1, key2, etc.
    • node.asc(na_pos="first") and node.desc(na_pos="last") generate ASC/DESC nodes
    • node.partition works just like pydough_builder but using node as the root instead of implicitly using the graph context
    • node.topk(k, key1, key2, ...) builds a TOP_Kusing the builder nodeskey1, key2,` etc.
    • node.calculate(term1=val1, term2=val2) builds a CALC using pydough builder nodes val1, val2, etc. as the values with names specified by the kwargs (same behavior for unnamed args as a regular calc)
    • Down the line, will need to add a node.BEST(...) method.
    • All of the other magic methods (eg __add__, __and__, __eq__, etc.)
    • Calling node.build() on a builder node flushes it to return an unqualified PyDough node.
    • Calling node.to_string() on a builder node generates the PyDough code as a string that, if transformed by something like the %%pydough cell magic, would create the same unqualified PyDough node a node.build()
  • If a builder node is passed into a context that requires unqualified nodes (to_sql, to_df, explain, passed inside an unqualified node, etc.) it is automatically built.

Note: once #154, #155 and #162 are added, the builder API will need to be updated accordingly.

The builder needs to be clearly & thoroughly documented in the user documentation, and can also become one of the primary means of writing unit tests for PyDough that require unqualified nodes.

Consider the following PyDough snippet, using the graph for the TPC-H schema:

%%pydough
# Question: Find which part types have an average retail price of all parts of that type
# that is below the global average across all types. For each of those part types, count how
# many times parts of that part type with "red" in the name have been purchased. Find
# which 10 part types have the highest number of such purchases, and order alphabetically
# by the part type.
answer = TPCH(
  avg_price=AVG(Parts.retail_price)
).PARTITION(
  Parts, name="p", by="part_type"
).WHERE(
  AVG(p.retail_price) < BACK(1).avg_price
)(
  part_type, 
  n_red_purchases=COUNT(p.WHERE(CONTAINS(name, "red")).lines))
).TOP_K(
  10, by=n_red_purchases.DESC()
).ORDER_BY(
  part_type.ASC()
)

Here is how the builder API would be used to build the example above:

import pydough
from pydough.builder import pydough_builder as pb

tpch = pb.var("TPCH")
parts = pb.var("Parts")
global_retail_price_avg = pb.function("AVG", parts["retail_price"])
global_calc = tpch.calculate(avg_price=global_retail_price_avg)
# global_calc.to_string() returns "TPCH(avg_price=AVG(Parts.retail_price))"

part_types = global_calc.partition(parts, name="p", by="part_type")
parts_of_type = pb.var("p")
type_retail_price_avg = pb.function("AVG", parts_of_type["retail_price"])
back_global_price_avg = pb.back(1)["avg_price"]
selected_types = part_types.where(type_retail_price_avg < back_global_price_avg)
# selected_types.to_string() returns global_calc.to_string() + ".WHERE(AVG(p.retail_price) < BACK(1).avg_price)"

is_red_part = pb.function("CONTAINS", pb.var("name"), "red")
red_part_purchases = parts_of_type.where(is_red_part)["lines"]
part_type = pb.var("part_type")
part_info = selected_types.calculate(part_type, n_red_purchases=pb.function("COUNT", red_part_purchases))
# part_info.to_string() returns selected_types.to_string() + "(part_type, n_red_purchases=COUNT(p.WHERE(CONTAINS(name, "red")).lines)))"

top_parts = part_info.top_k(10, pb.var("n_red_purchases").DESC())
# top_parts.to_string() returns part_info.to_string() + ".TOP_K(10, by=n_red_purchases.DESC())"

result = top_parts.order_by(part_type.ASC())
# result.to_string() returns top_parts.to_string() + ".ORDER_BY(part_type.ASC())"
# i.e., a string that is identical to the code from `answer` in the original example

pydough_code = result.build()
# `pydough_code` is an unqualified node equivalent to `answer` in the original example

Example where the builder helps avoid a name issue. Assume that lines has a field that someone unwisely decided to call date, and parts has a field that someone unwisely decided to name part):

# Setup cell
import pydough
from datetime import date
from pydough.builder import pydough_builder as pb
pydough.active_session.load_metadata_graph(...)
pydough.active_session.connect_database(...)
lines = 42
# Bad PyDough cell.
# This code won't work because `date`, `type`, and `custs` are assumed to be the Python variables with
# those names instead of properties of collections.
%%pydough
from_94 = YEAR(date) == 1994
is_promo = HAS(part.WHERE(STARTSWITH(type, "PROMO")))
result = TPCH(n_selected_lines = COUNT(lines.WHERE(from_94 & is_promo)))
pydough.to_df(result)
# Good PyDough cell.
# This code will work because the correct versions of `date`, `type`, and `custs` are used.
%%pydough
from_94 = YEAR(pb.var("date")) == 1994 # can use pb.var("date").build(), or YEAR will do it implicitly
is_promo = HAS(part.WHERE(STARTSWITH(pb.var("type"), "PROMO"))) # same with pb.var("type").build
result = TPCH(n_selected_lines = COUNT(pb.var("lines").build().WHERE(from_94 & is_promo))) # have to use pb.var("lines").build() since `.WHERE` isn't allowed on a builder node
pydough.to_df(result)
@knassre-bodo knassre-bodo added documentation Improvements or additions to documentation enhancement New feature or request user feature Adding a new user-facing feature/functionality effort - medium mid-sized issue with average implementation time/difficulty extensibility Increasing situations in which PyDough works refactor improving the quality of PyDough code labels Jan 29, 2025
@knassre-bodo
Copy link
Contributor Author

This issue is also a superset of #193. Completing this issue automatically completes that one as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation effort - medium mid-sized issue with average implementation time/difficulty enhancement New feature or request extensibility Increasing situations in which PyDough works refactor improving the quality of PyDough code user feature Adding a new user-facing feature/functionality
Projects
None yet
Development

No branches or pull requests

1 participant