A window function performs a calculation across a set of table rows that are related to the current row within a specific "window" or subset of data. This is comparable to the type of calculation that can be done with an aggregate function (such as SUM(), AVG(), COUNT(), etc.).
But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.
Syntax:
FUNCTION() OVER (PARTITION BY column_name ORDER BY column_name)
A window function always has two components. This second part here defines your window:
OVER (PARTITION BY column_name ORDER BY column_name)
Your window here is how you want to be viewing your data when you're applying your function
-
PARTITION BY: divides the result set into groups (optional).
-
ORDER BY: defines the order of processing rows within the partition.
Common Window Functions:
Ranking Functions:
- ROW_NUMBER(): Assigns a unique row number within a partition.
- RANK(): Similar to ROW_NUMBER(), but assigns the same rank to duplicate values, skipping numbers.
- DENSE_RANK(): Like RANK(), but without gaps in numbering.
Aggregate Functions as Window Functions:
- SUM() OVER(): Computes a running total.
- AVG() OVER(): Computes a moving average.
Lag and Lead Functions:
- LAG(): Retrieves the value from a previous row.
- LEAD(): Retrieves the value from the next row.
ROW_NUMBER() does just what it sounds like—displays the number of a given row. It starts at 1 and numbers the rows according to the ORDER BY part of the window statement. Using the PARTITION BY clause will allow you to begin counting 1 again in each partition.
Syntax:
ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name)
Common Uses:
-
Removing Duplicates: You can use ROW_NUMBER() to identify duplicate rows and keep only one by filtering out rows with a row number greater than 1.
-
Ranking Data: Used when ranking rows based on specific criteria but requiring unique row numbers.
-
Selecting the Latest Record: Helps in selecting the most recent entry per category when combined with PARTITION BY.
Example 1:
SELECT
total_amount,
ROW_NUMBER() OVER (ORDER BY total_amount DESC) AS ranking
FROM `greentaxi_trips`
LIMIT 10;
The query returns the top 10 highest total_amount values from the table, along with a row number indicating their ranking.
total_amount | ranking |
---|---|
4012.3 | 1 |
2878.3 | 2 |
2438.8 | 3 |
2156.3 | 4 |
2109.8 | 5 |
2017.3 | 6 |
1971.05 | 7 |
1958.8 | 8 |
1762.8 | 9 |
1600.8 | 10 |
The column generated with ROW_NUMBER() is temporary and does not modify the original table. It is just a calculation applied to the data in the query result.
Example 2:
Let's modify the previous query to add a partition by pick up location ID
SELECT
total_amount,
PULocationID,
ROW_NUMBER() OVER (PARTITION BY PULocationID ORDER BY total_amount DESC) AS ranking
FROM `greentaxi_trips`
LIMIT 10;
This SQL query assigns a ranking to each row based on total_amount in descending order within each PULocationID group:
total_amount | PULocationID | ranking |
---|---|---|
8.51 | 224 | 432 |
8.3 | 224 | 433 |
8.3 | 224 | 434 |
7.3 | 224 | 435 |
3.3 | 224 | 436 |
86.42 | 234 | 1 |
73.5 | 234 | 2 |
62.7 | 234 | 3 |
61.94 | 234 | 4 |
61.94 | 234 | 5 |
Using the PARTITION BY clause will allow you to begin counting 1 again in each partition.
ROW_NUMBER(), RANK(), and DENSE_RANK() are window functions used to assign a ranking to rows based on a specified order. However, they behave differently when there are duplicate values in the ranking column.
RANK() assigns a ranking, but skips numbers if there are ties. DENSE_RANK() its similar to RANK(), but does not skip numbers when there are ties.
For example:
Score | ROW_NUMBER() | RANK() | DENSE_RANK() |
---|---|---|---|
95 | 1 | 1 | 1 |
90 | 2 | 2 | 2 |
90 | 3 | 2 | 2 |
85 | 4 | 4 | 3 |
It can often be useful to compare rows to preceding or following rows. You can use LAG or LEAD to create columns that pull values from other rows without the need for a self-join. All you need to do is enter which column to pull from and how many rows away you'd like to do the pull. LAG pulls from previous rows and LEAD pulls from following rows
Syntax:
LAG(expression) OVER (PARTITION BY partition_expression ORDER BY order_expression)
- expression: The column whose value you want to retrieve from the previous row
- offset (optional): The number of rows back from the current row to look. The default is 1, meaning it looks at the immediate previous row.
- PARTITION BY (optional): Divides the result set into partitions to apply the function to each partition separately.
- ORDER BY: Specifies the order in which the rows are processed.
Example:
SELECT
lpep_pickup_datetime,
total_amount,
LAG(total_amount) OVER (ORDER BY lpep_pickup_datetime) as prev_total_amount,
LEAD(total_amount) OVER (ORDER BY lpep_pickup_datetime) as next_total_amount
FROM `greentaxi_trips`
ORDER BY lpep_pickup_datetime
The query retrieves the lpep_pickup_datetime, total_amount, the previous trip's total_amount, and the next trip's total_amount.
lpep_pickup_datetime | total_amount | prev_total_amount | next_total_amount |
---|---|---|---|
2008-12-31 23:33:38 UTC | 7.3 | 6.3 | 5.3 |
2008-12-31 23:42:31 UTC | 5.3 | 7.3 | 14.55 |
2008-12-31 23:47:51 UTC | 14.55 | 5.3 | 19.55 |
2008-12-31 23:57:46 UTC | 19.55 | 14.55 | 9.8 |
2009-01-01 00:00:00 UTC | 9.8 | 19.55 | 81.3 |
Computes the specified percentile value for the value_expression, with linear interpolation.
Syntax:
PERCENTILE_CONT(value_expression, percentile ) OVER (PARTITION BY partition_expression)
Example:
Lets calculate the 90th percentile of total_amount for each unique pickup location (PULocationID)
SELECT
PULocationID,
total_amount,
PERCENTILE_CONT(total_amount, 0.9 ) OVER (PARTITION BY PULocationID) AS p90
FROM `greentaxi_trips`
- PERCENTILE_CONT(total_amount, 0.9): calculates the 90th percentile (p90) of total_amount
- PARTITION BY PULocationID: This groups the calculations by PULocationID, so the 90th percentile is computed separately for each location.
Query results looks like this:
PULocationID | total_amount | p90 |
---|---|---|
224 | 17.3 | 51.9 |
224 | 20.67 | 51.9 |
224 | 21 | 51.9 |
224 | 26.06 | 51.9 |
224 | 27.13 | 51.9 |
224 | 40.14 | 51.9 |
224 | 55.46 | 51.9 |
224 | 25.74 | 51.9 |
224 | 27.02 | 51.9 |
224 | 37 | 51.9 |
The P90 value is essentially the amount below which 90% of the values fall. In this table, the P90 is constant at 51.9, which means that for location "224", 90% of the total amounts are below 51.9.
A CTE, short for Common Table Expression, is like a query within a query. With the WITH statement, you can create temporary tables to store results, making complex queries more readable and maintainable. These temporary tables exist only for the duration of the main query.
CTEs and subqueries are both powerful tools and can be used to achieve similar goals, but they have different use cases and advantages. Differences are CTE is reusable during the entire session and more readable
By declaring CTEs at the beginning of the query, you enhance code readability, enabling a clearer grasp of your analysis logic.
Syntax:
WITH cte_name AS (
SELECT column1, column2
FROM some_table
WHERE condition
)
SELECT * FROM cte_name;
Example: Let's find the trip with the second largest total_amount
WITH cte AS(
SELECT
lpep_pickup_datetime,
total_amount,
RANK() OVER (ORDER BY total_amount DESC) AS rank
FROM `greentaxi_trips`
)
SELECT * FROM cte WHERE rank = 2;
The query starts with a Common Table Expression (CTE) named cte. We use the RANK() window function to assign a ranking (rank) to each row based on total_amount in descending order (from highest to lowest).
Now, we use the CTE in the main query: SELECT * FROM cte WHERE rank = 2;
Result of the query:
lpep_pickup_datetime | total_amount | rank |
---|---|---|
2019-10-10 15:22:49 UTC | 2878.3 | 2 |
CTEs and window functions will be used a lot in module 4 on dbt. Let's see an example of application in dbt models
Example:
Suppose we start from the FHV dataset and we want to create a dbt model that enriches the data by calculating the trip duration and the 90th percentile.
WITH trip_duration_calculated AS (
SELECT
*,
timestamp_diff(dropOff_datetime, pickup_datetime, second) as trip_duration
FROM `fhv_trips`
)
SELECT
PUlocationID,
trip_duration,
PERCENTILE_CONT(trip_duration, 0.90) OVER (PARTITION BY PUlocationID) AS trip_duration_p90
FROM trip_duration_calculated
Step 1: Understanding the CTE
The WITH clause creates a CTE named trip_duration_calculated. This CTE acts as a temporary table that contains all columns from the fhv_trips table. Additionally, it calculates the trip duration for each ride
Step 2: Main Query using the CTE and Window Function
This query computes the 90th percentile of trip duration for each PUlocationID using a window function:
The PARTITION BY PUlocationID clause ensures that the percentile calculation is performed separately for each unique PUlocationID.
The percentile 90 means that 90% of the trips have a duration equal to or below this value
Query result looks like this:
PUlocationID | trip_duration | trip_duration_p90 |
---|---|---|
190 | 451 | 2170.0 |
190 | 1373 | 2170.0 |
190 | 817 | 2170.0 |
190 | 589 | 2170.0 |
190 | 1648 | 2170.0 |
32 | 546 | 1988.0 |
32 | 151 | 1988.0 |
32 | 1752 | 1988.0 |
32 | 2426 | 1988.0 |
32 | 888 | 1988.0 |
- For PUlocationID = 190, 90% of trips have a duration ≤ 2170.0 seconds.
- For PUlocationID = 32, 90% of trips have a duration ≤ 1988.0 seconds.