Commit

Overhaul BACK and CALC to use downstreaming of aliases and add CALCULATE method (#256)

Resolves #255. See issue for more details about the goals of the drastic overhaul. The propagation of aliases from parent to child, implicitly creating `BACK` references, is referred to as down-streaming. The vast majority of the changes are updates to documentation, notebooks, and unit tests to align with these new semantics. The collection equivalents of back-references were deleted as they are no longer needed, and weren't fully supported to begin with. Also includes changes to QDAG nodes and hybrid conversion to account for the change in how terms are handled.
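
The down-streaming rule described above can be illustrated with a small toy model in plain Python. This is only a sketch of the semantics as the notebooks in this commit describe them (aliases defined by a parent become visible in the child, and a name collision is an error); it is not PyDough's implementation, and `downstream`, `NameCollisionError`, and the sample values are hypothetical names introduced for illustration.

```python
# Toy model of down-streaming. NOT PyDough's implementation: the helper
# names and the sample data below are hypothetical.


class NameCollisionError(Exception):
    """Raised when a parent alias clashes with a name in the child context."""


def downstream(parent_aliases, child_properties):
    """Return the terms visible in a child context.

    parent_aliases: terms the parent defined (as a CALCULATE would).
    child_properties: terms/properties that already exist on the child.
    """
    collisions = sorted(set(parent_aliases) & set(child_properties))
    if collisions:
        # Mirrors the rule in the notebook text: a collision is an error
        # because it would be ambiguous which term to use.
        raise NameCollisionError(f"ambiguous names: {collisions}")
    visible = dict(child_properties)
    visible.update(parent_aliases)
    return visible


# Parent context: a region that defines the alias region_name.
region = {"name": "ASIA"}
region_aliases = {"region_name": region["name"]}

# Child context: one nation belonging to that region.
nation = {"key": 12, "name": "JAPAN"}

visible = downstream(region_aliases, nation)
row = {"region_name": visible["region_name"], "nation_name": visible["name"]}
print(row)
```

Passing `{"name": "ASIA"}` as the parent aliases instead would raise `NameCollisionError`, since `name` already exists on the nation. That is why the updated notebooks rename terms (for example `region_name=name`) before descending into a child collection.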
knassre-bodo authored Feb 26, 2025
1 parent b2d548f commit 1636bfa
Showing 116 changed files with 3,779 additions and 3,407 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -132,6 +132,7 @@ venv/
ENV/
env.bak/
venv.bak/
.vscode

# Spyder project settings
.spyderproject
2 changes: 1 addition & 1 deletion README.md
@@ -18,7 +18,7 @@ Suppose I want to know for every person their name & the total income they've ma
The following PyDough snippet solves this problem:

```py
result = People(
result = People.CALCULATE(
name,
net_income = SUM(jobs.income_earned) - SUM(schools.tuition_paid)
)
16 changes: 10 additions & 6 deletions demos/notebooks/1_introduction.ipynb
@@ -101,7 +101,7 @@
"source": [
"%%pydough\n",
"\n",
"nations(key, name)"
"nations.CALCULATE(nkey=key, nname=name)"
]
},
{
@@ -121,7 +121,7 @@
"source": [
"%%pydough\n",
"\n",
"nation_keys = nations(key, name)"
"nation_keys = nations.CALCULATE(nkey=key, nname=name)"
]
},
{
@@ -149,7 +149,7 @@
"source": [
"%%pydough\n",
"\n",
"lowest_customer_nations = nation_keys(key, name, cust_count=COUNT(customers)).TOP_K(2, by=cust_count.ASC())\n",
"lowest_customer_nations = nation_keys.CALCULATE(nkey, nname, cust_count=COUNT(customers)).TOP_K(2, by=cust_count.ASC())\n",
"lowest_customer_nations"
]
},
@@ -236,7 +236,9 @@
"id": "f52dfcfe-6e90-44b8-b9c4-7dc08a5b28ca",
"metadata": {},
"source": [
"Finally, while building a statement from smaller components is best practice in Pydough, you can always evaluate the entire expression all at once within a PyDough cell, such as this example that loads the all asian nations in the dataset."
"Finally, while building a statement from smaller components is best practice in Pydough, you can always evaluate the entire expression all at once within a PyDough cell, such as this example that loads the all Asian nations in the dataset.\n",
"\n",
"We can use the optional `columns` argument to `to_sql` or `to_df` to specify which columns to include, or even what they should be renamed as."
]
},
{
@@ -248,7 +250,9 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_df(nations.WHERE(region.name == \"ASIA\"))"
"asian_countries = nations.WHERE(region.name == \"ASIA\")\n",
"print(pydough.to_df(asian_countries, columns=[\"name\", \"key\"]))\n",
"pydough.to_df(asian_countries, columns={\"nation_name\": \"name\", \"id\": \"key\"})"
]
},
{
@@ -290,7 +294,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
"version": "3.12.6"
}
},
"nbformat": 4,
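
The `columns` argument introduced in the notebook text above accepts either a list of names or a dictionary. The sketch below mimics that selection and renaming on plain Python dictionaries. It is a toy stand-in, not `pydough.to_df` itself: `select_columns` is a hypothetical helper, the sample rows are made up for illustration, and the dictionary orientation (output name as key, existing term as value) is an assumption inferred from the notebook example `{"nation_name": "name", "id": "key"}`.

```python
# Toy stand-in for column selection/renaming. NOT the pydough.to_df API:
# select_columns is a hypothetical helper operating on plain dicts.


def select_columns(rows, columns=None):
    """Keep or rename columns of a list of row dictionaries.

    columns=None  -> keep every column unchanged.
    columns=list  -> keep only the listed columns, in that order.
    columns=dict  -> keys are output names, values are existing names
                     (assumed orientation, see the lead-in above).
    """
    if columns is None:
        return [dict(row) for row in rows]
    if isinstance(columns, dict):
        mapping = dict(columns)
    else:
        mapping = {name: name for name in columns}
    return [{out: row[src] for out, src in mapping.items()} for row in rows]


# Illustrative rows only; not output of the real dataset.
asian_countries = [
    {"key": 8, "name": "INDIA", "comment": "..."},
    {"key": 12, "name": "JAPAN", "comment": "..."},
]

print(select_columns(asian_countries, columns=["name", "key"]))
print(select_columns(asian_countries, columns={"nation_name": "name", "id": "key"}))
```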
89 changes: 52 additions & 37 deletions demos/notebooks/2_pydough_operations.ipynb
@@ -84,9 +84,9 @@
"id": "a25a2965-4f88-4626-b326-caf931fdba9c",
"metadata": {},
"source": [
"## Calc\n",
"## Calculate\n",
"\n",
"The next important operation is the `CALC` operation, which is used by \"calling\" a collection as a function."
"The next important operation is the `CALCULATE` operation, which takes in a variable number of positioning and/or keyword arguments."
]
},
{
@@ -98,18 +98,21 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_sql(nations(key))"
"print(pydough.to_sql(nations.CALCULATE(key, nation_name=name)))"
]
},
{
"cell_type": "markdown",
"id": "f89da4ca-5493-493f-bfe3-41d8a5f5d2a1",
"metadata": {},
"source": [
"Calc has a few purposes:\n",
"Calculate has a few purposes:\n",
"* Select which entries you want in the output.\n",
"* Define new fields by calling functions.\n",
"* Allow operations to be evaluated for each entry in the outermost collection's \"context\"."
"* Allow operations to be evaluated for each entry in the outermost collection's \"context\".\n",
"* Define aliases for terms that get down-streamed to descendants ([see here](#down-streaming)).\n",
"\n",
"The terms of the last `CALCULATE` in the PyDough logic are the terms that are included in the result (unless the `columns` argument of `to_sql` or `to_df` is used)."
]
},
{
@@ -121,15 +124,15 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_sql(nations(key + 1))"
"print(pydough.to_sql(nations.CALCULATE(adjusted_key = key + 1)))"
]
},
{
"cell_type": "markdown",
"id": "24031aa2-1df7-441d-b487-aa093b852504",
"metadata": {},
"source": [
"Here the context is the \"nations\" at the root of the graph. This means that for each entry within nations, we compute the result. This has important implications for when we get to more complex expressions. For example, if we want to know how many nations we have stored in each region, we can do via CALC."
"Here the context is the \"nations\" at the root of the graph. This means that for each entry within nations, we compute the result. This has important implications for when we get to more complex expressions. For example, if we want to know how many nations we have stored in each region, we can do via `CALCULATE`."
]
},
{
@@ -141,7 +144,7 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_df(regions(name, nation_count=COUNT(nations)))"
"pydough.to_df(regions.CALCULATE(name, nation_count=COUNT(nations)))"
]
},
{
@@ -151,7 +154,7 @@
"source": [
"Internally, this process evaluates `COUNT(nations)` grouped on each region and then joining the result with the original `regions` table. Importantly, this outputs a \"scalar\" value for each region.\n",
"\n",
"This shows a very important restriction of CALC, each final entry in a calc expression must be scalar with respect to a current context. For example, the expression `regions(region_name=name, nation_name=nations.name)` is not legal because region and nation is a one to many relationship, so there is not a single nation name for each region. \n",
"This shows a very important restriction of `CALCULATE`: each final entry in the operation must be scalar with respect to a current context. For example, the expression `regions(region_name=name, nation_name=nations.name)` is not legal because region and nation is a one to many relationship, so there is not a single nation name for each region. \n",
"\n",
"**The cell below will result in an error because it violates this restriction.**"
]
@@ -165,7 +168,7 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_df(regions(region_name=name, nation_name=nations.name))"
"pydough.to_df(regions.CALCULATE(region_name=name, nation_name=nations.name))"
]
},
{
@@ -185,7 +188,7 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_df(nations(nation_name=name, region_name=region.name))"
"pydough.to_df(nations.CALCULATE(nation_name=name, region_name=region.name))"
]
},
{
@@ -216,29 +219,39 @@
"%%pydough\n",
"\n",
"# Numeric operations\n",
"print(pydough.to_sql(nations(key + 1, key - 1, key * 1, key / 1)))\n",
"print(\"Q1\")\n",
"print(pydough.to_sql(nations.CALCULATE(key + 1, key - 1, key * 1, key / 1)))\n",
"\n",
"# Comparison operators\n",
"print(pydough.to_sql(nations(key == 0, key < 0, key != 0, key >= 5)))\n",
"print(\"\\nQ2\")\n",
"print(pydough.to_sql(nations.CALCULATE(key == 0, key < 0, key != 0, key >= 5)))\n",
"\n",
"# String Operations\n",
"print(pydough.to_sql(nations(LENGTH(name), UPPER(name), LOWER(name), STARTSWITH(name, \"A\"))))\n",
"print(\"\\nQ3\")\n",
"print(pydough.to_sql(nations.CALCULATE(LENGTH(name), UPPER(name), LOWER(name), STARTSWITH(name, \"A\"))))\n",
"\n",
"# Boolean operations\n",
"print(pydough.to_sql(nations((key != 1) & (LENGTH(name) > 5)))) # Boolean AND\n",
"print(pydough.to_sql(nations((key != 1) | (LENGTH(name) > 5)))) # Boolean OR\n",
"print(pydough.to_sql(nations(~(LENGTH(name) > 5)))) # Boolean NOT \n",
"print(pydough.to_sql(nations(ISIN(name, (\"KENYA\", \"JAPAN\"))))) # In\n",
"print(\"\\nQ4\")\n",
"print(pydough.to_sql(nations.CALCULATE((key != 1) & (LENGTH(name) > 5)))) # Boolean AND\n",
"print(\"\\nQ5\")\n",
"print(pydough.to_sql(nations.CALCULATE((key != 1) | (LENGTH(name) > 5)))) # Boolean OR\n",
"print(\"\\nQ6\")\n",
"print(pydough.to_sql(nations.CALCULATE(~(LENGTH(name) > 5)))) # Boolean NOT \n",
"print(\"\\nQ7\") \n",
"print(pydough.to_sql(nations.CALCULATE(ISIN(name, (\"KENYA\", \"JAPAN\"))))) # In\n",
"\n",
"# Datetime Operations\n",
"# Note: Since this is based on SQL lite the underlying date is a bit strange.\n",
"print(pydough.to_sql(lines(YEAR(ship_date), MONTH(ship_date), DAY(ship_date),HOUR(ship_date),MINUTE(ship_date),SECOND(ship_date))))\n",
"print(\"\\nQ8\")\n",
"print(pydough.to_sql(lines.CALCULATE(YEAR(ship_date), MONTH(ship_date), DAY(ship_date),HOUR(ship_date),MINUTE(ship_date),SECOND(ship_date))))\n",
"\n",
"# Aggregation operations\n",
"print(pydough.to_sql(TPCH(NDISTINCT(nations.comment), SUM(nations.key))))\n",
"print(\"\\nQ9\")\n",
"print(pydough.to_sql(TPCH.CALCULATE(NDISTINCT(nations.comment), SUM(nations.key))))\n",
"# Count can be used on a column for non-null entries or a collection\n",
"# for total entries.\n",
"print(pydough.to_sql(TPCH(COUNT(nations), COUNT(nations.comment))))"
"print(\"\\nQ10\")\n",
"print(pydough.to_sql(TPCH.CALCULATE(COUNT(nations), COUNT(nations.comment))))"
]
},
{
@@ -260,9 +273,11 @@
"id": "b70993e8-3cd2-4c45-87e3-8e68f67b92a0",
"metadata": {},
"source": [
"### BACK\n",
"### Down-Streaming\n",
"\n",
"Sometimes you need to load a value from a previous context to use at a later step in a PyDough statement. Any expression from an ancestor context that is placed in a `CALCULATE` is automatically made available to all descendants of that context. However, an error will occur if the name of the term defined in the ancestor collides with a name of a term or property of a descendant context, since PyDough will not know which one to use.\n",
"\n",
"Sometimes you need to load a value from a previous context to use at a later step in a PyDough statement. That can be done using the `BACK` operation. This step moves back `k` steps to find the name you are searching for. This is useful to avoid repeating computation."
"Notice how in the example below, `region_name` is defined in a `CALCULATE` within the context of `regions`, so the calculate within the context of `nations` also has access to `region_name` (interpreted as \"the name of the region that this nation belongs to\")."
]
},
{
@@ -274,15 +289,15 @@
"source": [
"%%pydough\n",
"\n",
"pydough.to_df(regions.nations(region_name=BACK(1).name, nation_name=name))"
"pydough.to_df(regions.CALCULATE(region_name=name).nations.CALCULATE(region_name, nation_name=name))"
]
},
{
"cell_type": "markdown",
"id": "6040a7c5-fc82-4e33-8b2b-a1b3ef394f71",
"metadata": {},
"source": [
"Here is a more complex example showing intermediate values. Here we will first compute `total_value` and then reuse it via `BACK`."
"Here is a more complex example showing intermediate values. Here we will first compute `total_value` and then reuse it downstream."
]
},
{
@@ -294,7 +309,7 @@
"source": [
"%%pydough\n",
"\n",
"nations_value = nations(name, total_value=SUM(suppliers.account_balance))\n",
"nations_value = nations.CALCULATE(nation_name=name, total_value=SUM(suppliers.account_balance))\n",
"pydough.to_df(nations_value)"
]
},
@@ -306,12 +321,12 @@
"outputs": [],
"source": [
"%%pydough\n",
"suppliers_value = nations_value.suppliers(\n",
"suppliers_value = nations_value.suppliers.CALCULATE(\n",
" key,\n",
" name,\n",
" nation_name=BACK(1).name,\n",
" nation_name,\n",
" account_balance=account_balance,\n",
" percentage_of_national_value=100 * account_balance / BACK(1).total_value\n",
" percentage_of_national_value=100 * account_balance / total_value\n",
")\n",
"top_suppliers = suppliers_value.TOP_K(20, by=percentage_of_national_value.DESC())\n",
"pydough.to_df(top_suppliers)"
@@ -324,7 +339,7 @@
"source": [
"## WHERE\n",
"\n",
"The WHERE operation by be used to filter unwanted entries in a context. For example, we can filter `nations` to only consider the `AMERICA` and `EUROPE` regions. A WHERE's context functions similarly to a calc except that it cannot be used to assign new properties. "
"The `WHERE` operation by be used to filter unwanted entries in a context. For example, we can filter `nations` to only consider the `AMERICA` and `EUROPE` regions. A WHERE's context functions similarly to a `CALCULATE` except that it cannot be used to assign new properties; it only contains a single positional argument: the predicate to filter on. "
]
},
{
@@ -367,7 +382,7 @@
"metadata": {},
"source": [
"The `by` argument requirements are:\n",
"* Anything that can be an expression used in a `CALC` or a `WHERE` can be used a component of a `by`.\n",
"* Anything that can be an expression used in a `CALCULATE` or a `WHERE` can be used a component of a `by`.\n",
"* The value in the `by` must end with either `.ASC()` or `.DESC()`\n",
"\n",
"You can also provide a tuple to by if you need to break ties. Consider this alternatives that instead selects the 20 parts with the largest size, starting with the smallest part id."
@@ -428,10 +443,10 @@
"source": [
"%%pydough\n",
"\n",
"updated_nations = nations(key, name_length=LENGTH(name))\n",
"updated_nations = nations.CALCULATE(key, name_length=LENGTH(name))\n",
"grouped_nations = PARTITION(\n",
" updated_nations, name=\"n\", by=(name_length)\n",
")(\n",
").CALCULATE(\n",
" name_length,\n",
" nation_count=COUNT(n.key)\n",
")\n",
@@ -446,7 +461,7 @@
"A couple important usage details:\n",
"* The `name` argument specifies the name of the subcollection access from the partitions to the original unpartitioned data.\n",
"* `keys` can be either be a single expression or a tuple of them, but it can only be references to expressions that already exist in the context of the data (e.g. `name`, not `LOWER(name)` or `region.name`)\n",
"* `BACK` should be used to step back into the partition child without retaining the partitioning. An example is shown below where we select brass european parts but only with the minimum supply cost."
"* Terms defined from the context of the `PARTITION` can be down-streamed to its descendants. An example is shown below where we select brass parts of size 15, but only the ones whose supply is below the average of all such parts."
]
},
{
@@ -459,8 +474,8 @@
"%%pydough\n",
"\n",
"selected_parts = parts.WHERE(ENDSWITH(part_type, \"BRASS\") & (size == 15))\n",
"part_types = PARTITION(selected_parts, name=\"p\", by=part_type)(avg_price=AVG(p.retail_price))\n",
"output = part_types.p.WHERE(retail_price < BACK(1).avg_price)\n",
"part_types = PARTITION(selected_parts, name=\"p\", by=part_type).CALCULATE(avg_price=AVG(p.retail_price))\n",
"output = part_types.p.WHERE(retail_price < avg_price)\n",
"pydough.to_df(output)"
]
},
@@ -532,7 +547,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
"version": "3.12.6"
}
},
"nbformat": 4,
6 changes: 3 additions & 3 deletions demos/notebooks/3_exploration.ipynb
@@ -308,7 +308,7 @@
"\n",
"orders_1995 = customers.orders.WHERE(YEAR(order_date) == 1995)\n",
"\n",
"asian_countries_info = asian_countries(country_name=LOWER(name), total_orders=COUNT(orders_1995))\n",
"asian_countries_info = asian_countries.CALCULATE(country_name=LOWER(name), total_orders=COUNT(orders_1995))\n",
"\n",
"top_asian_countries = asian_countries_info.TOP_K(3, by=total_orders.DESC())\n",
"\n",
@@ -408,7 +408,7 @@
"source": [
"Here, we learn that `customers.orders` invokes a child of the current context (`nations.WHERE(region.name == 'ASIA')`) by accessing the `customers` subcollection, then accessing its `orders` collection, then filtering it on the conedition `YEAR(order_date) == 1995`. \n",
"\n",
"We also know that this resulting child is plural with regards to the context, meaning that `asian_countries(asian_countries.order_date)` would be illegal, but `asian_countries(MAX(asian_countries.order_date))` is legal.\n",
"We also know that this resulting child is plural with regards to the context, meaning that `asian_countries.CALCULATE(asian_countries.order_date)` would be illegal, but `asian_countries.CALCULATE(MAX(asian_countries.order_date))` is legal.\n",
"\n",
"More combinations of `pydough.explain` and `pydough.explain_terms` can be done to learn more about what each of these components does."
]
@@ -438,7 +438,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
"version": "3.12.6"
}
},
"nbformat": 4,