Create user facing APIs for building PyDough code without any code-transformation steps #237
Labels
- documentation: Improvements or additions to documentation
- effort - medium: mid-sized issue with average implementation time/difficulty
- enhancement: New feature or request
- extensibility: Increasing situations in which PyDough works
- refactor: improving the quality of PyDough code
- user feature: Adding a new user-facing feature/functionality
Goal: create an equivalent of the Spark expression classes that can be used to build unqualified nodes without writing invalid Python code that must be transformed by an API such as the Jupyter `%%pydough` cell magic. This will allow users to write PyDough code that is valid Python and formatter/mypy friendly, help users understand PyDough code as native Python code, and provide a useful tool for writing unit tests without extra steps to define the PyDough code. This API should not be a preferred path for LLMs.

Secondary benefit: some of the builder APIs may become useful when the correct PyDough code is difficult to write due to name overloads between collections/properties and variables in the namespace. Currently, the variable in the namespace always wins, but using builder nodes can help avoid this.
The proposed API is as follows:
from pydough.builder import pydough_builder
The `pydough_builder` object has the following methods to create new builder nodes:
- `pydough_builder.var(name)` returns the equivalent of an unqualified access (e.g. in `TPCH(n=COUNT(Customers.WHERE(nation.name == "GERMANY")))`, the terms `TPCH`, `Customers` and `nation`).
- `pydough_builder.function(name, *args, **kwargs)` takes in the name of a function and calls it on the provided arguments & keyword arguments, which should either be literals or builder nodes.
- `pydough_builder.partition(data, name, keys)` takes in a builder node for the data to partition, the name to use for the partitioned data, and 1+ string names (either a single string or an iterable of strings) for the keys from `data` to use for partitioning.
- `pydough_builder.back(levels)` is the equivalent of a `BACK` call (down the line, will need to add the same for `BEST`, `PREV` and `NEXT`).
- `pydough_builder.literal(value)` creates a literal node from a Python value (from one of the valid Python literal types). This is usually optional since the builder will implicitly convert literals when it can.
- `node[field]` builds an access to a property/subcollection of `node`, using the string `field` to determine which property to access.
- `node.where(cond)` builds a `WHERE` using the builder node `cond`.
- `node.order_by(key1, key2, ...)` builds an `ORDER_BY` using the builder nodes `key1`, `key2`, etc.
- `node.asc(na_pos="first")` and `node.desc(na_pos="last")` generate `ASC`/`DESC` nodes.
- `node.partition` works just like `pydough_builder.partition` but using `node` as the root instead of implicitly using the graph context.
- `node.topk(k, key1, key2, ...)` builds a `TOP_K` using the builder nodes `key1`, `key2`, etc.
- `node.calculate(term1=val1, term2=val2)` builds a `CALC` using PyDough builder nodes `val1`, `val2`, etc. as the values with names specified by the kwargs (same behavior for unnamed args as a regular calc).
- A `node.BEST(...)` method.
- Operator methods (`__add__`, `__and__`, `__eq__`, etc.).
- Calling `node.build()` on a builder node flushes it to return an unqualified PyDough node.
- Calling `node.to_string()` on a builder node generates the PyDough code as a string that, if transformed by something like the `%%pydough` cell magic, would create the same unqualified PyDough node as `node.build()`.
- When a builder node is used where an unqualified node is expected (`to_sql`, `to_df`, `explain`, passed inside an unqualified node, etc.) it is automatically built.

Note: once #154, #155 and #162 are added, the builder API will need to be updated accordingly.
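To make the proposed shape more concrete, here is a minimal, self-contained sketch of how a few of these methods might compose. It is a toy stand-in, not the real implementation: the `Node` and `Builder` classes and the `_wrap` helper are hypothetical, nodes only carry the PyDough text they represent, and `build()` is left out because it would need actual unqualified nodes.

```python
from __future__ import annotations

import json
from typing import Any


def _wrap(value: Any) -> Node:
    """Hypothetical helper: implicitly convert Python literals to nodes."""
    if isinstance(value, Node):
        return value
    # Strings are rendered with double quotes, other literals via repr().
    return Node(json.dumps(value) if isinstance(value, str) else repr(value))


class Node:
    """Toy builder node that only stores the PyDough text it stands for."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __getitem__(self, field: str) -> Node:
        # node[field]: access a property/subcollection by its string name.
        return Node(f"{self.text}.{field}")

    def __eq__(self, other: Any) -> Node:  # type: ignore[override]
        # Operators build a new node instead of comparing.
        # (This toy does not add parentheses for operator precedence.)
        return Node(f"{self.text} == {_wrap(other).text}")

    def where(self, cond: Node) -> Node:
        return Node(f"{self.text}.WHERE({cond.text})")

    def calculate(self, **terms: Any) -> Node:
        # Rendered with the call syntax used in the example from this issue.
        args = ", ".join(f"{k}={_wrap(v).text}" for k, v in terms.items())
        return Node(f"{self.text}({args})")

    def to_string(self) -> str:
        return self.text


class Builder:
    """Toy version of the proposed pydough_builder object."""

    def var(self, name: str) -> Node:
        return Node(name)

    def literal(self, value: Any) -> Node:
        return _wrap(value)

    def function(self, name: str, *args: Any, **kwargs: Any) -> Node:
        parts = [_wrap(a).text for a in args]
        parts += [f"{k}={_wrap(v).text}" for k, v in kwargs.items()]
        return Node(f"{name}({', '.join(parts)})")


pydough_builder = Builder()

b = pydough_builder
german = b.var("Customers").where(b.var("nation")["name"] == "GERMANY")
query = b.var("TPCH").calculate(n=b.function("COUNT", german))
print(query.to_string())
```

Everything in the usage lines at the bottom is ordinary Python, so a formatter or mypy can process it without any transformation step, which is the main point of the proposal.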
The builder needs to be clearly & thoroughly documented in the user documentation, and can also become one of the primary means of writing unit tests for PyDough that require unqualified nodes.
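As a rough illustration of the unit-testing use case, the sketch below writes a pytest-style test against a toy builder. The `Node` class, the `var` helper, the `acctbal` field name, and the exact text rendered for `TOP_K`/`DESC` are all assumptions made for this example; a real test would presumably compare built unqualified nodes or their `to_string()` output from the actual builder.

```python
from __future__ import annotations


class Node:
    """Hypothetical stand-in for a builder node (stores text only)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def desc(self, na_pos: str = "last") -> Node:
        # Mirrors the proposed node.desc(na_pos="last") signature.
        return Node(f'{self.text}.DESC(na_pos="{na_pos}")')

    def topk(self, k: int, *keys: Node) -> Node:
        rendered = ", ".join(key.text for key in keys)
        # A single key is rendered bare; several keys as a tuple.
        by = rendered if len(keys) == 1 else f"({rendered})"
        return Node(f"{self.text}.TOP_K({k}, by={by})")

    def to_string(self) -> str:
        return self.text


def var(name: str) -> Node:
    """Toy equivalent of the proposed pydough_builder.var(name)."""
    return Node(name)


def test_top_customers() -> None:
    # The test body is plain Python: no cell magic or source transformation.
    query = var("Customers").topk(3, var("acctbal").desc())
    expected = 'Customers.TOP_K(3, by=acctbal.DESC(na_pos="last"))'
    assert query.to_string() == expected
```

Because the query is assembled from normal method calls, a test like this can live in a regular test module and be collected by the test runner like any other function.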
Consider the following PyDough snippet, using the graph for the TPC-H schema:
Here is how the builder API would be used to build the example above:
Example where the builder helps avoid a name issue. Assume that `lines` has a field that someone unwisely decided to call `date`, and `parts` has a field that someone unwisely decided to name `part`.
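The name issue can be sketched with a toy builder as follows. The `Node` class and `var` helper are hypothetical stand-ins for the proposed API, and the variable values are made up; the point is only that string-based access such as `node["date"]` never consults the Python namespace, so a same-named variable cannot win.

```python
from __future__ import annotations

import datetime


class Node:
    """Hypothetical stand-in for a builder node (stores text only)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __getitem__(self, field: str) -> Node:
        # The field is named by a string, so no variable lookup happens.
        return Node(f"{self.text}.{field}")

    def to_string(self) -> str:
        return self.text


def var(name: str) -> Node:
    """Toy equivalent of the proposed pydough_builder.var(name)."""
    return Node(name)


# Variables in the namespace that share names with the fields.
date = datetime.date(2024, 1, 1)
part = "some unrelated value"

# The builder still refers to the fields, not to the variables above.
line_date = var("lines")["date"]
part_field = var("parts")["part"]

print(line_date.to_string())
print(part_field.to_string())
```

The variables `date` and `part` remain usable as ordinary Python values alongside the builder expressions, since the two never share a lookup path.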