Detail in the Aggregate

SQL Server 2012 – the "Denali" release – brings improved support for window functions as defined in the SQL standard. This is the first in a series of articles on using these functions. They help elegantly solve a great many, formerly difficult query problems. One of the most common use cases is to mix or compare summary and detail data in the same row.

Example Data: This is the first in a series of articles exploring window functions and their enhanced implementation in SQL Server 2012. If you would like to follow along, execute this script to create the example table and data. All articles in this series use the same data set.

Window functions first came to SQL Server in the 2005 release. Everything in this article is available from SQL Server 2005 forward. While not brand new, the mixing of aggregate and detail as described in this article is fundamental and thereby a good place to begin exploring the new support added in SQL Server 2012.

A Classic Problem

Have you ever been asked a question involving summary and detail values together? Businesses ask that type of question all the time. Here's a classic example:

In what fiscal quarters were sales below average?

Back in the day I would answer a question like this with a tangled mess of nested joins and subqueries. It was an occult art to write the queries, and the resulting SQL was often inscrutable to the point of being write-only code. Window functions make the job so much easier.

Step 1: Clarify the Business Question

Always make sure to understand the business question, and that it makes sense. After all, we're the data professionals. It's our job to help our clients to create well-formed questions that make good sense and can be answered using the data at hand.

Imagine that you work in a department store chain. You ask questions and learn that your client is the Vice-President of Appliance Sales. You confirm that he manages by fiscal year, so a more precisely-formed question is:

In what fiscal quarters were appliance sales below average for the fiscal year?

There is also that pesky word "average". For this article, I define "average for the fiscal year" to be "the mean of the four quarterly sales values for the year." Other interpretations are possible. Be especially wary of the term "average". Make sure you understand precisely what is meant by it in a given scenario.

Step 2: Gather the Detail

I like to begin with the detail, get that right, and then work towards the summary data. The business problem calls for quarterly sales data, so begin by writing a simple query to list appliance sales by quarter:

/*------------------------
SELECT year, quarter, sales
FROM quarterly_sales
WHERE department = 'Appliances'
ORDER BY year, quarter;
------------------------*/

  year quarter     sales
------ ------- -----------
  2000     1   235956.23
  2000     2   129871.95
  2000     3   398213.39
  2000     4  54111.05
  2001     1  87230.33
  2001     2   105231.94
  2001     3   167323.22
  2001     4   100010.13
  2002     1  95678.13
  2002     2   197399.29
...

Simple enough, right? These results provide the needed detail data.

Step 3: Fold in the Summary

The business problem requires that you compare each quarterly sales result against an average value for the year. For that, you need a fourth column in the query output that you can reference in a WHERE clause. Begin by writing the aggregate function to generate an average sales amount:

SELECT year, quarter, sales,
       AVG(sales)
...

Next is where the magic begins. Add the keyword OVER to indicate a window function:

SELECT year, quarter, sales,
       AVG(sales) OVER
...

For each detail row, you want the average sales for that row's department and year. The PARTITION BY clause added next gives you that result:

SELECT year, quarter, sales,
       AVG(sales) OVER (PARTITION BY department, year)
...

Pause a moment and take a good a look at the expression I've just finished adding to the query. Here are some important points to grasp:

  • Any of the aggregate functions such as AVG and MAX can be written as a window function. Simply add the keyword OVER followed by a PARTITION BY clause.
  • There is no need for a GROUP BY clause to "collapse" detail rows into summary results. You retain all the detail rows that your query generates.
  • The PARTITION BY clause denotes the scope of the aggregation. In this example, the aggregation AVG computes an average sales amount for only those rows having the same department and year as the current row.

You can execute the query now if you like, but two housekeeping details remain in order to do the job right. First, consider rounding the average sales amount to the nearest cent:

SELECT year, quarter, sales,
       ROUND(AVG(sales) OVER (PARTITION BY department, year),2)
...

Then add a column alias. You won't need it now, but you'll want it for the next step. Give the new column the name avg_sales:

SELECT year, quarter, sales,
       ROUND(AVG(sales) OVER (PARTITION BY department, year),2) avg_sales
...

Following is an execution of the new query and some of the results:

/*------------------------
SELECT year, quarter, sales,
       ROUND(AVG(sales) OVER (PARTITION BY department, year),2) avg_sales
FROM quarterly_sales
WHERE department = 'Appliances'
ORDER BY year, quarter;
------------------------*/
  year quarter     sales     avg_sales
------ ------- ----------- ---------------
  2000     1   235956.23   204538.160000
  2000     2   129871.95   204538.160000
  2000     3   398213.39   204538.160000
  2000     4  54111.05   204538.160000
  2001     1  87230.33   114948.910000
  2001     2   105231.94   114948.910000
  2001     3   167323.22   114948.910000
  2001     4   100010.13   114948.910000
  2002     1  95678.13   150994.320000
  2002     2   197399.29   150994.320000
...

Each quarterly sales result now includes the average quarterly sales for the year. You've successfully mixed detail and summary data in the same row. You have all the data needed – summary and detail – to answer the business question.

Step 4: Answer the Question

Here is where the job gets a mite tricky. You have the raw data needed to answer the business question. You must now feed that data into an enclosing query that does the work of computing the final answer. Here's the final query and some of the results:

/*------------------------
SELECT *
FROM
(
SELECT year, quarter, sales,
       ROUND(AVG(sales) OVER (PARTITION BY department, year),2) avg_sales
FROM quarterly_sales
WHERE department = 'Appliances' 
) x
WHERE sales < avg_sales
ORDER BY year, quarter;
------------------------*/
  year quarter     sales     avg_sales
------ ------- ----------- ---------------
  2000     2   129871.95   204538.160000
  2000     4  54111.05   204538.160000
  2001     1  87230.33   114948.910000
  2001     2   105231.94   114948.910000
  2001     4   100010.13   114948.910000
  2002     1  95678.13   150994.320000
...

The ORDER BY clause goes into the outer query. The inner query must be given an alias, in this case x. The inner query generates the raw data, and the outer query filters that data to return the rows representing the answer to the original business question.

Statement Execution Order

Why the need for an enclosing query? The answer lies in understanding the execution order of a SELECT statement as defined in the ISO SQL Standard. Database engines are free to optimize execution order, but their results must be as if the order defined in the standard were followed. That order of execution by clause is:

  1. FROM
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT (includes window functions)
  6. ORDER BY

There is no way to filter the results of a query using window functions written in that same query. Any filtering happens in the WHERE and HAVING clauses before window functions are evaluated, and no further filtering steps occur after the window functions are evaluated as part of the SELECT clause processing. You must pass results to an enclosing query and filter there.

There's another implication from execution order, one that will help you generate wrong results(!) if you don't understand it. Imagine for a moment that the business problem is changed, that the average sales value in the query must be the average across all departments. The solution seems simple: Just change the inner query by removing department from the PARTITION BY clause, and partition only by year. For example:

SELECT *
FROM
(
SELECT year, quarter, sales,
       ROUND(AVG(sales) OVER (PARTITION BY year),2) avg_sales
FROM quarterly_sales
WHERE department = 'Appliances' 
) x
WHERE sales < avg_sales
ORDER BY year, quarter;

You are welcome to execute this query, and you will find the results unchanged. It makes no difference to partition by year, or by department and year – in this particular case. Why? Because the WHERE clause has already eliminated all the data for other departments. If you need to include all departments in the yearly average, then you must move the filter for the Appliances department into the enclosing query. Here's an example:

/*------------------------
SELECT year, quarter, sales, avg_sales
FROM
(
SELECT department, year, quarter, sales,
       ROUND(AVG(sales) OVER (PARTITION BY year),2) avg_sales
FROM quarterly_sales
) x
WHERE department = 'Appliances'
  AND sales < avg_sales
ORDER BY year, quarter
------------------------*/
  year quarter     sales     avg_sales
------ ------- ----------- ---------------
  2000     2   129871.95   204340.850000
  2000     4  54111.05   204340.850000
  2001     1  87230.33   224915.990000
  2001     2   105231.94   224915.990000
  2001     3   167323.22   224915.990000
  2001     4   100010.13   224915.990000
  2002     1  95678.13   259250.270000
  2002     2   197399.29   259250.270000
...

Now the inner query has access to all rows. The average sales calculations can include all departments. The outer query does the work of restricting the detail results to the Appliance department. And hopefully you can see how important it is to think about the order of execution whilst writing window functions into your queries.

Wrap-Up

I've just scratched the surface in this article. You aren't limited to just one aggregation, or at one level. Following is a query that generates several levels of aggregation at the same time:

SELECT department, year, quarter, sales,
       ROUND(AVG(sales) OVER (PARTITION BY year),2) avg_by_year,
       ROUND(AVG(sales) OVER (PARTITION BY department, year), 2) avg_by_deptyear,
       ROUND(AVG(sales) OVER ( ), 2) avg_all
FROM quarterly_sales;

This query generates average sales per year, per department and year, and across the board. The result fromAVG(sales) OVER ( ) is the same as you would get from AVG (sales), but you can access the value from detail rows, as the preceding query in fact does. Anytime you find yourself wanting to work with detail and summary data in the same row, head towards the window function toolbox.

Window functions are useful for a lot more than I've shown in this one article. Future articles in this series will explore other use cases such as reaching across rows, ranking and numbering results, and creating moving computations such as running totals and moving averages.