Articles #1 and #2 in my series on window functions introduce their core functionality by showing how to create unordered and then ordered windows of rows over which aggregate and analytic functions are applied. The articles drive from business cases, helping you recognize when and how to apply window functions to generate business results. In this third article, I step back and take a detailed look at the underlying mechanics of these functions and how they are executed.
Note: What follows is a conceptual description of query execution involving window functions. I'm well aware that database engines optimize in their implementations, but a good mental model of their conceptual execution is really the key to understanding these functions, how they operate, and when to apply them.
Prelude
Following is an example query I've created as a vehicle for explaining window function execution. It's an interesting example in that it mixes grouping operations and window functions together in a single SELECT
statement. I'll explain the window function invocations later. For now, just know that the query is an attempt to report average sales by department per year, for those years having average sales greater than $250,000.
select department, year, round(avg(sales),2) as avg_this_year,
max(round(avg(sales),2)) over (partition by year) max_this_year,
lag(round(avg(sales),2),1)
over (partition by year
order by round(avg(sales),2)) closest_this_year
from quarterly_sales
where year in (2008, 2009)
group by department, year
having round(avg(sales),2) > 250000
order by max(round(avg(sales),2)) over (partition by year)
- round(avg(sales),2) desc
You can execute this query using the [example data](/examples/WindowFunctionsData.sql) created for this series of articles. For this specific article, I am pretending that data exists only for the Appliances, Electronics, and Jewelry departments, and only for the years 2008 - 2010. However, the query ultimately generates the same results against the full set of example data as against the subset that I'm using now.
## Step 1: Generating the Available Rows
The ISO SQL standard specifies that query execution begins with the `FROM` clause. It is the `FROM` clause that generates the universe of data—all the rows and all the columns—available to the rest of the query.
SQL won't let us execute `FROM` by itself, but we can put `SELECT *` in front of the clause to get the same result. Following is the data available to the rest of the query:
/*------------------------
select * from quarterly_sales
------------------------*/
department year quarter sales
---------------- ------ ------- -----------
Appliances 2008 1 171343.75
Appliances 2008 2 201019.64
Appliances 2008 3 249473.54
Appliances 2008 4 223878.21
Appliances 2009 1 268765.87
Appliances 2009 2 278433.01
Appliances 2009 3 265891.68
Appliances 2009 4 259873.34
Appliances 2010 1 274473.83
Appliances 2010 2 294689.04
Appliances 2010 3 300174.32
Appliances 2010 4 289756.12
Electronics 2008 1 790333.32
Electronics 2008 2 859887.47
Electronics 2008 3 883433.83
Electronics 2008 4 901493.29
Electronics 2009 1 814949.82
Electronics 2009 2 890009.39
Electronics 2009 3 950388.38
Electronics 2009 4 976645.38
Electronics 2010 1 902332.83
Electronics 2010 2 967483.62
Electronics 2010 3 954383.52
Electronics 2010 4 989403.48
Jewelry 2008 1 287454.44
Jewelry 2008 2 294004.27
Jewelry 2008 3 317217.32
Jewelry 2008 4 329029.02
Jewelry 2009 1 303878.78
Jewelry 2009 2 307434.26
Jewelry 2009 3 356987.76
Jewelry 2009 4 389064.64
Jewelry 2010 1 424543.28
Jewelry 2010 2 444847.74
Jewelry 2010 3 463746.46
Jewelry 2010 4 584746.85
(36 row(s) affected)
There is nothing else. All results from the query in this article must derive from the preceding 36 rows.
Step 2: Filtering the Rows
Next comes the WHERE
clause. It filters the rows prior to any grouping operation. The WHERE
clause in the example pares down the universe of data to those rows for years 2008 and 2009:
/*------------------------
select * from quarterly_sales
where year in (2008, 2009)
------------------------*/
department year quarter sales
---------------- ------ ------- -----------
Appliances 2008 1 171343.75
Appliances 2008 2 201019.64
Appliances 2008 3 249473.54
Appliances 2008 4 223878.21
Appliances 2009 1 268765.87
Appliances 2009 2 278433.01
Appliances 2009 3 265891.68
Appliances 2009 4 259873.34
Electronics 2008 1 790333.32
Electronics 2008 2 859887.47
Electronics 2008 3 883433.83
Electronics 2008 4 901493.29
Electronics 2009 1 814949.82
Electronics 2009 2 890009.39
Electronics 2009 3 950388.38
Electronics 2009 4 976645.38
Jewelry 2008 1 287454.44
Jewelry 2008 2 294004.27
Jewelry 2008 3 317217.32
Jewelry 2008 4 329029.02
Jewelry 2009 1 303878.78
Jewelry 2009 2 307434.26
Jewelry 2009 3 356987.76
Jewelry 2009 4 389064.64
(24 row(s) affected)
These rows are now the only rows available to the rest of the query.
Step 3: Grouping the Rows
Grouping comes next. You can think of grouping as occurring in three substeps: a) sorting the detail rows into groups—in this case by department and year, b) evaluating any aggregate functions, and c) collapsing each group into a single row.
Following is the relevant part of the example query:
select department, year, round(avg(sales),2) as avg_this_year
...
group by department, year
Next is some intermediate data showing the results from substeps a and b. You can see the rows arranged into groups by department and year. You can see that each group includes the rounded average of the sales column.
department year quarter sales
---------------- ------ ------- -----------
Appliances 2008 1 171343.75
Appliances 2008 2 201019.64
Appliances 2008 3 249473.54
Appliances 2008 4 223878.21
---------
round(avg(sales),2) ==========> 211428.79
Appliances 2009 1 268765.87
Appliances 2009 2 278433.01
Appliances 2009 3 265891.68
Appliances 2009 4 259873.34
---------
round(avg(sales),2) ==========> 268240.98
Electronics 2008 1 790333.32
Electronics 2008 2 859887.47
Electronics 2008 3 883433.83
Electronics 2008 4 901493.29
---------
round(avg(sales),2) ==========> 858786.98
Electronics 2009 1 814949.82
Electronics 2009 2 890009.39
Electronics 2009 3 950388.38
Electronics 2009 4 976645.38
---------
round(avg(sales),2) ==========> 907998.24
Jewelry 2008 1 287454.44
Jewelry 2008 2 294004.27
Jewelry 2008 3 317217.32
Jewelry 2008 4 329029.02
---------
round(avg(sales),2) ==========> 306926.26
Jewelry 2009 1 303878.78
Jewelry 2009 2 307434.26
Jewelry 2009 3 356987.76
Jewelry 2009 4 389064.64
---------
round(avg(sales),2) ==========> 339341.36
Substep c is to collapse each group into a single row. Window functions have net yet been computed, so the result so far is simply:
/*------------------------ select department, year, round(avg(sales),2) as avg_this_year from quarterly_sales where year in (2008, 2009) group by department, year ------------------------*/ department year avg_this_year ---------------- ------ --------------- Appliances 2008 211428.790000 Appliances 2009 268240.980000 Electronics 2008 858786.980000 Electronics 2009 907998.240000 Jewelry 2008 306926.260000 Jewelry 2009 339341.360000 (6 row(s) affected)
The quarter column is no longer available. It is not listed in the SELECT
clause, nor can it be. Only the columns listed in the GROUP BY
clause, or computed by an aggregate function such as AVG
, can be a part of the collapsed row set.
Note: The aggregate function
AVG
returns a numeric type having six decimal-places of precision. From here on, all sales results in the article will display with four trailing zeros.
##Step 4: Filtering the Groups
After grouping comes another filtering step, giving the opportunity to eliminate groups through predicates on aggregate values. The HAVING
clause does the work here. In our example, it reads as follows:
having round(avg(sales),2) > 250000
The result is to eliminate one of the rows for the appliances department, leaving just five rows for the rest of the query:
/*------------------------
select department, year, round(avg(sales),2) as avg_this_year
from quarterly_sales
where year in (2008, 2009)
group by department, year
having round(avg(sales),2) > 250000
------------------------*/
department year avg_this_year
---------------- ------ ---------------
Appliances 2009 268240.980000
Electronics 2008 858786.980000
Electronics 2009 907998.240000
Jewelry 2008 306926.260000
Jewelry 2009 339341.360000
(5 row(s) affected)
Pause now and realize that the five rows shown are all that are available to the window function execution that comes next. You must think carefully when writing window functions
about what rows will be available to feed into their execution when they are finally executed, because their placement in the SELECT
clause at the very beginning of a query as written belies their true evaluation near the very end of the query as executed.
Step 5: Creating the Window Partition
Finally, it is time to begin evaluating the window functions. There are two invocations in our example query. You can know them from their inclusion of OVER
clauses.
The first invocation returns the largest of the average quarterly sales amounts for each given year:
max(round(avg(sales),2))
over (partition by year) max_this_year,
The second invocation makes use of LAG
to retrieve the average quarterly sales amount in a given year that is highest without exceeding the amount reported in the current row:
lag(round(avg(sales),2),1)
over (partition by year
order by round(avg(sales),2)) closest_this_year
Each of these functions refers to the result of the aggregation computed earlier in Step 3. Each partitions by year, so each row's execution of a given function gets to “see” all other average sales values from the same year. Look at the following, intermediate result to see the effects from the PARTITION BY
clauses:
department year avg_this_year max_this_year closest_this_year
max(...) lag(...,1)
------------- ---- --------------- --------------- -------------------
Appliances 2009 268240.980000 {268240.980000 {268240.980000
907998.240000 907998.240000
339341.360000} 339341.360000}
Electronics 2008 858786.980000 858786.980000 {858786.980000
306926.260000} 306926.260000}
Electronics 2009 907998.240000 {268240.980000 {268240.980000
907998.240000 907998.240000
339341.360000} 339341.360000}
Jewelry 2008 306926.260000 {858786.980000 {858786.980000
306926.260000} 306926.260000}
Jewelry 2009 339341.360000 {268240.980000 {268240.980000
907998.240000 907998.240000
339341.360000} 339341.360000}
The first row is for Appliance Department sales in 2009. The window function partitions for that row consist of the three average sales values pertaining to the same year. These three average sales values are from the Step 3 results earlier.
The second row pertains to the Electronics Department in 2008. Partitions in that row contain the two sales amounts from 2008. The pattern continues for all rows in the result set, and is the result of specifying `partition by year` in the window function invocations.
## Step 6: Sorting the Partitions
The example query includes an invocation of the analytic function `LAG`. The purpose is to compute for each row the next-highest average sales value from any other department in the same year. The invocation is `LAG(..., 1)`, which means to return the average quarterly sales value _immediately preceding_ -- or that is one behind -- the current row's average quarterly sales value. The concept of _immediately preceding_ lacks a deterministic meaning when values are unsorted, and hence `LAG` _requires_ an ordering to the data in each partition. The invocation of `LAG` includes _order by round(avg(sales),2)` to provide that ordering, and the effects are as follows:
department year avg_this_year max_this_year closest_this_year
max(...) lag(..., 1)
------------- ------ --------------- --------------- -------------------
Appliances 2009 268240.980000 {268240.980000 {268240.980000
907998.240000 339341.360000
339341.360000} 907998.240000}
Electronics 2008 858786.980000 {858786.980000 {306926.260000
306926.260000} 858786.980000}
Electronics 2009 907998.240000 {268240.980000 {268240.980000
907998.240000 339341.360000
339341.360000} 907998.240000}
Jewelry 2008 306926.260000 {858786.980000 {306926.260000
306926.260000} 858786.980000}
Jewelry 2009 339341.360000 {268240.980000 {268240.980000
907998.240000 339341.360000
339341.360000} 907998.240000}
You can see in these results that the partitions for `LAG` are ordered differently than those for `MAX`. The bold in the `LAG` column highlights the partition value corresponding to each current row. It is from those values that `LAG` begins to count either forward or backwards as indicated by the second parameter.
## Step 7: Computing Final Results
Window function partitions are created. Ordering has been applied where needed—in this case to the partitions pertaining to `LAG`. The database engine can now compute the final result set, which in this case is:
/*------------------------
select department, year, round(avg(sales),2) as avg_this_year,
max(round(avg(sales),2)) over (partition by year) max_this_year,
lag(round(avg(sales),2),1)
over (partition by year
order by round(avg(sales),2)) closest_this_year
from quarterly_sales
where year in (2008, 2009)
group by department, year
having round(avg(sales),2) > 250000
------------------------*/
department year avg_this_year max_this_year closest_this_year
---------------- ------ --------------- --------------- -----------------
Jewelry 2008 306926.260000 858786.980000 NULL
Electronics 2008 858786.980000 858786.980000 306926.260000
Appliances 2009 268240.980000 907998.240000 NULL
Jewelry 2009 339341.360000 907998.240000 268240.980000
Electronics 2009 907998.240000 907998.240000 339341.360000
(5 row(s) affected)
Two rows show NULL for closest_this_year. Each of those rows represents the lowest average sales amount for the year. There is no lower value in the partition, so LAG(..., 1)
returns a NULL in those cases.
Step 8: Ordering the Result Set
Sorting of the result set is the final step in query execution. Window function results have been computed, so they are available for use in the ORDER BY
clause. For example, you can order the results in descending order of the difference between the maximum quarterly sales and each department's average quarterly sales in a given year:
/*------------------------
select department, year, round(avg(sales),2) as avg_this_year,
max(round(avg(sales),2)) over (partition by year) max_this_year,
lag(round(avg(sales),2),1)
over (partition by year
order by round(avg(sales),2)) closest_this_year
from quarterly_sales
where year in (2008, 2009)
group by department, year
having round(avg(sales),2) > 250000
order by max(round(avg(sales),2)) over (partition by year)
- round(avg(sales),2) desc
------------------------*/
department year avg_this_year max_this_year closest_this_year
---------------- ------ --------------- --------------- -----------------
Appliances 2009 268240.980000 907998.240000 NULL
Jewelry 2009 339341.360000 907998.240000 268240.980000
Jewelry 2008 306926.260000 858786.980000 NULL
Electronics 2008 858786.980000 858786.980000 306926.260000
Electronics 2009 907998.240000 907998.240000 339341.360000
(5 row(s) affected)
You are not limited in the ORDER BY
clause to only those window functions that you list in the SELECT
clause. You can write any window function that you like into the ORDER BY
clause, so long as it can be evaluated using the rows that make it through the WHERE
and HAVING
clauses, and so long as you respect any grouping that you've done via the GROUP BY
clause.
Postlude
Following are some key points about window functions and their execution to take away from the walk-through in this article:
- Only rows that pass the filters in the
WHERE
andHAVING
clauses feed into window function execution. - Values can be passed to a window function in a
GROUP BY
query only if the same values can also be specified alone in theSELECT
list of the same query (i.e., the grouping operation must be respected). - Each invocation of a window function can specify a different window via the
PARTITION BY
clause. - Each window partition can be sorted independently from the others.
- Sorting of rows in the final result set has nothing to do with the sorting of values in the window function partitions.
It is inconvenient that a query cannot filter on the results of window functions specified in the same query. The reason is that window functions are not evaluated until after the WHERE
and HAVING
filters are applied. One might wish for a third filtering clause to come later just for windowing functions—and do not ask me what such a clause would be called—but the solution we are left with is to perform any filtering in a parent query. Read Detail in the Aggregate for an explanation and example of how such filtering is done.
Published on Dec 7, 2011