Interlude

Articles #1 and #2 in my series on window functions introduce their core functionality by showing how to create unordered and then ordered windows of rows over which aggregate and analytic functions are applied. The articles drive from business cases, helping you recognize when and how to apply window functions to generate business results. In this third article, I step back and take a detailed look at the underlying mechanics of these functions and how they are executed.

Note: What follows is a conceptual description of query execution involving window functions. I'm well aware that database engines optimize in their implementations, but a good mental model of their conceptual execution is really the key to understanding these functions, how they operate, and when to apply them.

Prelude

Following is an example query I've created as a vehicle for explaining window function execution. It's an interesting example in that it mixes grouping operations and window functions together in a single SELECT statement. I'll explain the window function invocations later. For now, just know that the query is an attempt to report average sales by department per year, for those years having average sales greater than $250,000.

select department, year, round(avg(sales),2) as avg_this_year,
       max(round(avg(sales),2)) over (partition by year) max_this_year,
	   lag(round(avg(sales),2),1) 
	      over (partition by year
	            order by round(avg(sales),2)) closest_this_year
from quarterly_sales 
where year in (2008, 2009)
group by department, year
having round(avg(sales),2) > 250000
order by max(round(avg(sales),2)) over (partition by year)
         - round(avg(sales),2) desc

You can execute this query using the example data created for this series of articles. For this specific article, I am pretending that data exists only for the Appliances, Electronics, and Jewelry departments, and only for the years 2008 - 2010. However, the query ultimately generates the same results against the full set of example data as against the subset that I'm using now.

Step 1: Generating the Available Rows

The ISO SQL standard specifies that query execution begins with the FROM clause. It is the FROM clause that generates the universe of data—all the rows and all the columns—available to the rest of the query.

SQL won't let us execute FROM by itself, but we can put SELECT * in front of the clause to get the same result. Following is the data available to the rest of the query:

/*------------------------
select * from quarterly_sales
------------------------*/
department       year quarter     sales
---------------- ------ ------- -----------
Appliances       2008     1   171343.75
Appliances       2008     2   201019.64
Appliances       2008     3   249473.54
Appliances       2008     4   223878.21
Appliances       2009     1   268765.87
Appliances       2009     2   278433.01
Appliances       2009     3   265891.68
Appliances       2009     4   259873.34
Appliances       2010     1   274473.83
Appliances       2010     2   294689.04
Appliances       2010     3   300174.32
Appliances       2010     4   289756.12
Electronics      2008     1   790333.32
Electronics      2008     2   859887.47
Electronics      2008     3   883433.83
Electronics      2008     4   901493.29
Electronics      2009     1   814949.82
Electronics      2009     2   890009.39
Electronics      2009     3   950388.38
Electronics      2009     4   976645.38
Electronics      2010     1   902332.83
Electronics      2010     2   967483.62
Electronics      2010     3   954383.52
Electronics      2010     4   989403.48
Jewelry          2008     1   287454.44
Jewelry          2008     2   294004.27
Jewelry          2008     3   317217.32
Jewelry          2008     4   329029.02
Jewelry          2009     1   303878.78
Jewelry          2009     2   307434.26
Jewelry          2009     3   356987.76
Jewelry          2009     4   389064.64
Jewelry          2010     1   424543.28
Jewelry          2010     2   444847.74
Jewelry          2010     3   463746.46
Jewelry          2010     4   584746.85

(36 row(s) affected)

There is nothing else. All results from the query in this article must derive from the preceding 36 rows.

Step 2: Filtering the Rows

Next comes the WHERE clause. It filters the rows prior to any grouping operation. The WHERE clause in the example pares down the universe of data to those rows for years 2008 and 2009:

/*------------------------
select * from quarterly_sales
where year in (2008, 2009)
------------------------*/
department       year quarter     sales
---------------- ------ ------- -----------
Appliances       2008     1   171343.75
Appliances       2008     2   201019.64
Appliances       2008     3   249473.54
Appliances       2008     4   223878.21
Appliances       2009     1   268765.87
Appliances       2009     2   278433.01
Appliances       2009     3   265891.68
Appliances       2009     4   259873.34
Electronics      2008     1   790333.32
Electronics      2008     2   859887.47
Electronics      2008     3   883433.83
Electronics      2008     4   901493.29
Electronics      2009     1   814949.82
Electronics      2009     2   890009.39
Electronics      2009     3   950388.38
Electronics      2009     4   976645.38
Jewelry          2008     1   287454.44
Jewelry          2008     2   294004.27
Jewelry          2008     3   317217.32
Jewelry          2008     4   329029.02
Jewelry          2009     1   303878.78
Jewelry          2009     2   307434.26
Jewelry          2009     3   356987.76
Jewelry          2009     4   389064.64

(24 row(s) affected)

These rows are now the only rows available to the rest of the query.

Step 3: Grouping the Rows

Grouping comes next. You can think of grouping as occurring in three substeps: a) sorting the detail rows into groups—in this case by department and year, b) evaluating any aggregate functions, and c) collapsing each group into a single row.

Following is the relevant part of the example query:

select department, year, round(avg(sales),2) as avg_this_year
...
group by department, year

Next is some intermediate data showing the results from substeps a and b. You can see the rows arranged into groups by department and year. You can see that each group includes the rounded average of the sales column.

department       year quarter     sales
---------------- ------ ------- -----------
Appliances       2008     1   171343.75
Appliances       2008     2   201019.64
Appliances       2008     3   249473.54
Appliances       2008     4   223878.21
                                  ---------
round(avg(sales),2) ==========>   211428.79

Appliances       2009     1   268765.87
Appliances       2009     2   278433.01
Appliances       2009     3   265891.68
Appliances       2009     4   259873.34
                                  ---------
round(avg(sales),2) ==========>   268240.98


Electronics      2008     1   790333.32
Electronics      2008     2   859887.47
Electronics      2008     3   883433.83
Electronics      2008     4   901493.29
                                  ---------
round(avg(sales),2) ==========>   858786.98


Electronics      2009     1   814949.82
Electronics      2009     2   890009.39
Electronics      2009     3   950388.38
Electronics      2009     4   976645.38
                                  ---------
round(avg(sales),2) ==========>   907998.24


Jewelry          2008     1   287454.44
Jewelry          2008     2   294004.27
Jewelry          2008     3   317217.32
Jewelry          2008     4   329029.02
                                  ---------
round(avg(sales),2) ==========>   306926.26


Jewelry          2009     1   303878.78
Jewelry          2009     2   307434.26
Jewelry          2009     3   356987.76
Jewelry          2009     4   389064.64
                                  ---------
round(avg(sales),2) ==========>   339341.36

Substep c is to collapse each group into a single row. Window functions have net yet been computed, so the result so far is simply:

/*------------------------
select department, year, round(avg(sales),2) as avg_this_year
from quarterly_sales
where year in (2008, 2009)
group by department, year
------------------------*/
department       year   avg_this_year
---------------- ------ ---------------
Appliances       2008   211428.790000
Appliances       2009   268240.980000
Electronics      2008   858786.980000
Electronics      2009   907998.240000
Jewelry          2008   306926.260000
Jewelry          2009   339341.360000

(6 row(s) affected)

The quarter column is no longer available. It is not listed in the SELECT clause, nor can it be. Only the columns listed in the GROUP BY clause, or computed by an aggregative function such as AVG, can be a part of the collapsed row set.

Note: The aggregate function AVG returns a numeric type having six decimal-places of precision. From here on, all sales results in the article will display with four trailing zeros.

Step 4: Filtering the Groups

After grouping comes another filtering step, giving the opportunity to eliminate groups through predicates on aggregate values. The HAVING clause does the work here. In our example, it reads as follows:

having round(avg(sales),2) > 250000

The result is to eliminate one of the rows for the appliances department, leaving just five rows for the rest of the query:

/*------------------------
select department, year, round(avg(sales),2) as avg_this_year
from quarterly_sales
where year in (2008, 2009)
group by department, year
having round(avg(sales),2) > 250000
------------------------*/
department       year   avg_this_year
---------------- ------ ---------------
Appliances       2009   268240.980000
Electronics      2008   858786.980000
Electronics      2009   907998.240000
Jewelry          2008   306926.260000
Jewelry          2009   339341.360000

(5 row(s) affected)

Pause now and realize that the five rows shown are all that are available to the window function execution that comes next. You must think carefully when writing window functions about what rows will be available to feed into their execution when they are finally executed, because their placement in the SELECT clause at the very beginning of a query as written belies their true evaluation near the very end of the query as executed.

Step 5: Creating the Window Partitions

Finally, it is time to begin evaluating the window functions. There are two invocations in our example query. You can know them from their inclusion of OVER clauses.

The first invocation returns the largest of the average quarterly sales amounts for each given year:

max(round(avg(sales),2))
  over (partition by year) max_this_year,

The second invocation makes use of LAG to retrieve the average quarterly sales amount in a given year that is highest without exceeding the amount reported in the current row:

lag(round(avg(sales),2),1)
  over (partition by year 
  order by round(avg(sales),2)) closest_this_year

Each of these functions refers to the result of the aggregation computed earlier in Step 3. Each partitions by year, so each row's execution of a given function gets to "see" all other average sales values from the same year. Look at the following, intermediate result to see the effects from the PARTITION BY clauses:

department  year   avg_this_year   max_this_year   closest_this_year
                                          max(...)          lag(...,1)
------------- ---- --------------- --------------- -------------------
Appliances  2009   268240.980000 {268240.980000    {268240.980000
                                    907998.240000     907998.240000
                                    339341.360000}      339341.360000}

Electronics   2008   858786.980000858786.980000    {858786.980000
                                    306926.260000}      306926.260000}

Electronics   2009   907998.240000 {268240.980000    {268240.980000
                                    907998.240000     907998.240000
                                    339341.360000}      339341.360000}

Jewelry     2008   306926.260000 {858786.980000    {858786.980000
                                    306926.260000}      306926.260000}

Jewelry     2009   339341.360000 {268240.980000    {268240.980000
                                    907998.240000     907998.240000
                                    339341.360000}      339341.360000}

The first row is for Appliance Department sales in 2009. The window function partitions for that row consist of the three average sales values pertaining to the same year. These three average sales values are from the Step 3 results earlier.

The second row pertains to the Electronics Department in 2008. Partitions in that row contain the two sales amounts from 2008. The pattern continues for all rows in the result set, and is the result of specifyingpartition by year in the window function invocations.

Step 6: Sorting the Partitions

The example query includes an invocation of the analytic function LAG. The purpose is to compute for each row the next-highest average sales value from any other department in the same year. The invocation isLAG(..., 1), which means to return the average quarterly sales value immediately preceding -- or that is one behind -- the current row's average quarterly sales value. The concept of immediately preceding lacks a deterministic meaning when values are unsorted, and hence LAG requires an ordering to the data in each partition. The invocation of LAG includes order by round(avg(sales),2) to provide that ordering, and the effects are as follows:

department    year   avg_this_year   max_this_year   closest_this_year
                                            max(...)         lag(..., 1)
------------- ------ --------------- --------------- -------------------
Appliances    2009   268240.980000 {268240.980000    {268240.980000
                                      907998.240000     339341.360000
                                      339341.360000}      907998.240000}

Electronics   2008   858786.980000 {858786.980000    {306926.260000
                                      306926.260000}      858786.980000}

Electronics   2009   907998.240000 {268240.980000    {268240.980000
                                      907998.240000     339341.360000
                                      339341.360000}      907998.240000}

Jewelry       2008   306926.260000 {858786.980000    {306926.260000
                                      306926.260000}      858786.980000}

Jewelry       2009   339341.360000 {268240.980000    {268240.980000
                                      907998.240000     339341.360000
                                      339341.360000}      907998.240000}

You can see in these results that the partitions for LAG are ordered differently than those for MAX. The bold + underlining in the LAG column highlights the partition value corresponding to each current row. It is from those values that LAG begins to count either forward or backwards as indicated by the second parameter.

Step 7: Computing Final Results

Window function partitions are created. Ordering has been applied where needed—in this case to the partitions pertaining to LAG. The database engine can now compute the final result set, which in this case is:

/*------------------------
select department, year, round(avg(sales),2) as avg_this_year,
       max(round(avg(sales),2)) over (partition by year) max_this_year,
	   lag(round(avg(sales),2),1) 
	      over (partition by year
	            order by round(avg(sales),2)) closest_this_year
from quarterly_sales 
where year in (2008, 2009)
group by department, year
having round(avg(sales),2) > 250000
------------------------*/
department       year   avg_this_year   max_this_year closest_this_year
---------------- ------ --------------- --------------- -----------------
Jewelry          2008   306926.260000   858786.980000            NULL
Electronics      2008   858786.980000   858786.980000   306926.260000
Appliances       2009   268240.980000   907998.240000            NULL
Jewelry          2009   339341.360000   907998.240000   268240.980000
Electronics      2009   907998.240000   907998.240000   339341.360000

(5 row(s) affected)

Two rows show NULL for closest_this_year. Each of those rows represents the lowest average sales amount for the year. There is no lower value in the partition, so LAG(..., 1) returns a NULL in those cases.

Step 8: Ordering the Result Set

Sorting of the result set is the final step in query execution. Window function results have been computed, so they are available for use in the ORDER BY clause. For example, you can order the results in descending order of the difference between the maximum quarterly sales and each department's average quarterly sales in a given year:

/*------------------------
select department, year, round(avg(sales),2) as avg_this_year,
       max(round(avg(sales),2)) over (partition by year) max_this_year,
	   lag(round(avg(sales),2),1) 
	      over (partition by year
	            order by round(avg(sales),2)) closest_this_year
from quarterly_sales 
where year in (2008, 2009)
group by department, year
having round(avg(sales),2) > 250000
order by max(round(avg(sales),2)) over (partition by year)
         - round(avg(sales),2) desc
------------------------*/
department       year   avg_this_year   max_this_year closest_this_year
---------------- ------ --------------- --------------- -----------------
Appliances       2009   268240.980000   907998.240000            NULL
Jewelry          2009   339341.360000   907998.240000   268240.980000
Jewelry          2008   306926.260000   858786.980000            NULL
Electronics      2008   858786.980000   858786.980000   306926.260000
Electronics      2009   907998.240000   907998.240000   339341.360000

(5 row(s) affected)

You are not limited in the ORDER BY clause to only those window functions that you list in the SELECT clause. You can write any window function that you like into the ORDER BY clause, so long as it can be evaluated using the rows that make it through the WHERE and HAVING clauses, and so long as you respect any grouping that you've done via the GROUP BY clause.

Postlude

Following are some key points about window functions and their execution to take away from the walk-through in this article:

  • Only rows that pass the filters in the WHERE and HAVING clauses feed into window function execution.
  • Values can be passed to a window function in a GROUP BY query only if the same values can also be specified alone in the SELECT list of the same query (i.e., the grouping operation must be respected).
  • Each invocation of a window function can specify a different window via the PARTITION BY clause.
  • Each window partition can be sorted independently from the others.
  • Sorting of rows in the final result set has nothing to do with the sorting of values in the window function partitions.

It is inconvenient that a query cannot filter on the results of window functions specified in the same query. The reason is that window functions are not evaluated until after the WHERE and HAVING filters are applied. One might wish for a third filtering clause to come later just for windowing functions—and do not ask me what such a clause would be called—but the solution we are left with is to perform any filtering in a parent query. Read Detail in the Aggregate for an explanation and example of how such filtering is done.