SQL is traditionally one-dimensional. This is in the sense that scalar expressions within SELECT
and other statements have been limited to values in the current row only. You had access to the horizontal, but not the vertical. SQL Server 2012's support of LAG
and LEAD
and other so-called analytic functions open up the vertical dimension, giving access to all values in a column. LAG
and LEAD
and related functions are especially useful in working with data across time.
Example Data: This is the second in a series of articles exploring window functions and their enhanced implementation in SQL Server 2012. If you would like to follow along, execute this script to create the example table and data. All articles in this series use the same data set.
A Classic Problem in Time
You're a data analyst for a large department store. You have access to quarterly sales by department from your data mart. You decide to compare quarterly sales year-on-year in order to answer the following question:
In what quarters did a department's sales drop below those of the same quarter one year previous?
In other words, if a department sold $100,000 in Quarter 2 of 2005, then only $95,000 in Quarter 2 of 2006, you want to know about that because it represents a year-on-year drop that bears further investigation.
Step 1: Gather the Detail
As in the previous article, I like to review the data available before getting too deep into writing a query. Job #1 is to ensure that I understand the business question being posed. Job #2 is to ensure that I have the data needed to answer the question properly. Following is a simple query against the quarterly_sales table showing the available data:
/*------------------------
select department, year, quarter, sales
from quarterly_sales
order by department, year, quarter;
------------------------*/
department year quarter sales
---------------- ------ ------- -----------
Appliances 2000 1 235956.23
Appliances 2000 2 129871.95
Appliances 2000 3 398213.39
Appliances 2000 4 54111.05
Appliances 2001 1 87230.33
Appliances 2001 2 105231.94
...
No tricks! These results are straightforward. The quarterly_sales
table gives sales by quarter for each department. There are no gaps. Take it on faith for now that the process of loading the data mart involves a data cleansing step to eliminate gaps.
Caution! It is not always reasonable to presume clean data having no gaps. I won't ignore the issue, and will circle back to the possible problem of gaps later in this article.
Step 2: Go Trans-Dimensional
Getting sales data by department and quarter was easy; a simple query sufficed. The interesting part of the problem is to look at the one year prior amounts in each row. The traditional approach is to execute a self-join, but that approach scales poorly and results in obtuse and difficult to modify queries. You are not interested in SQL as a game. You want to get the job done and go home, so you choose SQL Server 2012's new LAG
function.
LAG
provides easy access to a value from a prior row. It's perfect for working with time-series data in which the time interval from row to row is consistent. There are four quarters in a year, so there are four rows per combination of year and department. Write a LAG
function to retrieve the fourth sales amount back:
select department, year, quarter, sales,
LAG(sales, 4) year_ago_sales
...
Don't execute the query just yet. Going back four rows means nothing without some sort of ordering. Add an OVER
clause to the function and specify an ordering that follows the calendar:
select department, year, quarter, sales,
LAG(sales, 4) OVER (ORDER BY year, quarter) year_ago_sales
...
One more thing! Partition by department so as to avoid commingling results from different departments:
select department, year, quarter, sales,
LAG(sales, 4) OVER (PARTITION BY department
ORDER BY year, quarter) year_ago_sales
...
Now you are ready to execute the query. Each row in the output will contain the current quarterly sales value along with the corresponding value from four quarters – i.e., one year – previous. Here is the complete query and some of the output:
/*------------------------
select department, year, quarter, sales,
LAG(sales, 4) OVER (PARTITION BY department
ORDER BY year, quarter) year_ago_sales
from quarterly_sales
order by department, year, quarter;
------------------------*/
department year quarter sales year_ago_sales
---------------- ------ ------- ----------- ----------------
Appliances 2000 1 235956.23 NULL
Appliances 2000 2 129871.95 NULL
Appliances 2000 3 398213.39 NULL
Appliances 2000 4 54111.05 NULL
Appliances 2001 1 87230.33 235956.23
Appliances 2001 2 105231.94 129871.95
...
You can see in year 2001 how rows begin to pick up the sales amount from the same quarter one year prior. There's no history prior to 2000, so rows from that year receive a NULL in the year_ago_sales
column.
You have the data you need. Now it's time to feed these rows into a parent query that answers the business question.
Step 3: Answer the Question
If you haven't read the first article in this series – Detail in the Aggregate – be sure to circle back and read it, because it explains in good detail why you must feed the results from the previous section's query into a parent query that does the work answering the business question. The short story is that analytic function results are not available until after any filtering triggered by WHERE
and HAVING
clauses. You must feed your analytic results to a parent query and filter there. For example:
/*------------------------
select *
from
(
select department, year, quarter, sales,
LAG(sales, 4) OVER (PARTITION BY department
ORDER BY year, quarter) year_ago_sales
from quarterly_sales
) x
where sales < year_ago_sales
order by department, year, quarter;
------------------------*/
department year quarter sales year_ago_sales
---------------- ------ ------- ----------- ----------------
Appliances 2001 1 87230.33 235956.23
Appliances 2001 2 105231.94 129871.95
...
Electronics 2003 2 624831.23 673839.83
Electronics 2003 3 634839.23 665839.73
...
Jewelry 2002 2 323939.83 364983.54
Jewelry 2003 3 383542.32 387213.39
...
The WHERE
clause predicate in the outer query – sales < year_ago_sales
– should be clear enough as written. The presence of NULLs when no prior history is available is enough on its own to prevent any year-2000 data from cluttering the results. The ORDER BY
clause is now in the outer query, and serves to present the results in calendar order by department. You can readily see when a given department has suffered a drop in year-on-year quarterly sales.
A Real Challenge!
Analytic functions come into their own as business questions become increasingly complex. Use of analytics over the traditional self-join approach make a query more malleable, easier and clearer to understand, empowering you to explore variations on the original question.
For example, you might look at the data returned so far and decide that you lack the time to investigate 40 separate instances of year-on-year declines. (40 is the number of such instances returned from this article's data set). Thanksgiving is coming and you want to spend it with your family, so you narrow your question as follows:
In what quarters did a department's sales drop below those of the same quarter one year previous without increasing again the very next year?
Ah! Now you are interested in year-on-year declines only when the trend does not reverse itself the next year. Now you want to look forward and backward at the same time. The self-join approach would require adding a second self-join to your query, possibly hurting performance and definitely making the SQL a mess to read. With analytic functions all you need do is add a LEAD
function to your inner query and modify the WHERE
clause in your outer query. Here are the new query and results:
/*------------------------
select *
from
(
select department, year, quarter, sales,
LAG(sales, 4) OVER (PARTITION BY department
ORDER BY year, quarter) year_ago_sales,
LEAD(sales, 4, 999999999) OVER (PARTITION BY department
ORDER BY year, quarter) year_next_sales
from quarterly_sales
) x
where sales < year_ago_sales
and year_next_sales <= sales
order by department, year, quarter;
------------------------*/
department year quarter sales year_ago_sales year_next_sales
---------------- ------ ------- ----------- ---------------- -----------------
Appliances 2005 2 123863.65 162489.02 123033.29
Electronics 2003 2 624831.23 673839.83 584812.23
Jewelry 2003 3 383542.32 387213.39 369293.39
Jewelry 2006 1 398721.04 404040.04 303879.58
Jewelry 2006 2 364561.98 427023.93 314873.74
Jewelry 2006 3 339828.84 414849.84 298733.50
Jewelry 2006 4 357221.41 429232.13 309883.37
Jewelry 2007 1 303879.58 398721.04 287454.44
Jewelry 2007 2 314873.74 364561.98 294004.27
Vacuums 2004 3 98574.34 123983.11 94598.72
(10 row(s) affected)
Key items to notice in this query are:
LEAD
is used to access the sales value four rows into the future.999999999
is specified as a third argument toLEAD
and forms a default to be used when there is no fourth row into the future, thereby excluding rows for which the business question cannot yet be answered.- The predicate
year_next_sales <= sales
is written with a<=
so as to require that thenext_year_sales
be an increase in order for a given quarter not to be listed on the report. Change the predicate to<
if you prefer to look or two year-on-year declines in succession.
Yes, the use of 999999999
is a hack, but hacks are sometimes ok in ad-hoc queries. Remove the default if you prefer the results to include those cases for which the one-year-forward values are not yet known. (Whether to include or exclude those rows is a business decision. An analyst might choose to look at the results both ways).
A Walk in the Park of business Intelligence
You're curious now, and decide to ask whether any of the year-on-year declines have been more than made up by a subsequent increase in sales. Say that a department's sales went from $100,000 to $95,000 to $105,000 in the same quarter over three years. You want report on cases matching that pattern, where the value in year three exceeds the value in year one. You're new business question is:
In what quarters did a department's sales drop below those of the same quarter one year previous only to increase enough the following year to exceed the first amount?
You can answer this new question by making two, trivial changes to the query:
- Change the second predicate to read:
year_next_sales > year_ago_sales
. - Remove the default of
999999999
from the invocation ofLEAD
. Do this, because the new question really does require this time that you know the next year's outcome.
Following is the new query and results:
/*------------------------
select *
from
(
select department, year, quarter, sales,
LAG(sales, 4) OVER (PARTITION BY department
ORDER BY year, quarter) year_ago_sales,
LEAD(sales, 4) OVER (PARTITION BY department
ORDER BY year, quarter) year_next_sales
from quarterly_sales
) x
where sales < year_ago_sales
and year_next_sales > year_ago_sales
order by department, year, quarter;
------------------------*/
department year quarter sales year_ago_sales year_next_sales
---------------- ------ ------- ----------- ---------------- -----------------
Appliances 2001 2 105231.94 129871.95 197399.29
Appliances 2005 4 137740.47 143670.03 169102.06
Appliances 2006 2 123033.29 123863.65 178345.47
Electronics 2003 3 634839.23 665839.73 701833.96
Electronics 2004 1 594878.97 604832.43 683987.76
Electronics 2004 2 584812.23 624831.23 714452.52
Electronics 2006 3 719542.57 743232.13 833445.73
Jewelry 2002 2 323939.83 364983.54 368843.63
Jewelry 2004 2 360198.03 368843.63 427023.93
Jewelry 2004 3 369293.39 383542.32 414849.84
Jewelry 2004 4 398321.75 401040.94 429232.13
Mattresses 2001 3 43923.82 44893.84 47323.73
Mattresses 2003 1 45889.79 47383.23 52090.87
Mattresses 2010 1 71083.25 72045.84 84109.38
Vacuums 2001 1 64341.54 65873.32 78141.41
Vacuums 2005 3 94598.72 98574.34 108683.80
(16 row(s) affected)
You can see that there have been 16 occasions during the time for which data is available that quarterly sales have dropped year-on-year only to rebound a year later to a value exceeding the pre-drop amount. Just two easy changes was all it took to explore an interesting question that came to mind as I wrote this part of the article.
Holes in the Data
Earlier I promised to return to the problem of holes, or gaps, in the data. All the queries in this article rest on the presumption of an unbroken sequence of quarterly sales for each department. What if you have doubts?
Three suggestions come to mind:
- Execute a query to verify an unbroken sequence and any other assumptions upon which subsequent queries rest.
- Densify the data by joining with a numbers table as I describe in my article on Partition Outer-Joins.
- Add an error message column to your queries along with associated logic to populate that column with error messages should gaps in the data occur.
LAG
and LEAD
can help with the first approach, and probably with the third as well. Following is a query using LAG
to report gaps in the data for this article. I executed the query after deleting the 2002 sales data for the Appliance Department.
/*------------------------
select department, year-1 year, quarter
from
(
select department, year, quarter,
lag(year, 4) over (partition by department
order by year, quarter) last_year,
lag(quarter, 4) over (partition by department
order by year, quarter) same_quarter
from quarterly_sales
) x
where not (
last_year = year-1 and quarter = same_quarter
);
------------------------*/
department year quarter
---------------- ----------- -------
Appliances 2002 1
Appliances 2002 2
Appliances 2002 3
Appliances 2002 4
(4 row(s) affected)
Choose the approach that best fits your circumstances and how you're using the data. I have used all three of the above approaches, depending upon circumstances and what I was trying to accomplish.
Wrap-Up
You're a busy analyst. It's a been a long day. You want to get home to your family. You can see, I hope, the expressive power in analytic functions such as LAG
and LEAD
. Also be aware of FIRST_VALUE
and LAST_VALUE
. They are new as well in SQL Server 2012, and do exactly as their names suggest – enabling access to the very first and last values in an ordered column.
These analytic functions make it easy to access data up and down a column; they enable the vertical dimension. They are especially useful in answering business questions involving time-series data like that used in this article. Instead of writing brittle and obtuse collections of self-joins, you can write easily understandable and malleable queries that make it almost fun to ask that one, last, nagging business question before you run out the door for the evening.
Published on Nov 20, 2011