H. G. Wells and SQL: Traveling in the Second Dimension

 

SQL is traditionally one-dimensional. This is in the sense that scalar expressions within SELECT and other statements have been limited to values in the current row only. You had access to the horizontal, but not the vertical. SQL Server 2012's support of LAG and LEAD and other so-called analytic functions open up the vertical dimension, giving access to all values in a column. LAG and LEAD and related functions are especially useful in working with data across time.

Example Data: This is the second in a series of articles exploring window functions and their enhanced implementation in SQL Server 2012. If you would like to follow along, execute this script to create the example table and data. All articles in this series use the same data set.

A Classic Problem in Time

You're a data analyst for a large department store. You have access to quarterly sales by department from your data mart. You decide to compare quarterly sales year-on-year in order to answer the following question:

In what quarters did a department's sales drop below those of the same quarter one year previous?

In other words, if a department sold $100,000 in Quarter 2 of 2005, and then only $95,000 in Quarter 2 of 2006, you want to know about that because it represents a year-on-year drop that bears further investigation.

Step 1: Gather the Detail

As in the previous article, I like to review the data available before getting too deep into writing a query. Job #1 is to ensure that I understand the business question being posed. Job #2 is to ensure that I have the data needed to answer the question properly. Following is a simple query against the quarterly_sales table showing the available data:

/*------------------------
select department, year, quarter, sales
from quarterly_sales
order by department, year, quarter;
------------------------*/
department         year quarter       sales
---------------- ------ ------- -----------
Appliances         2000       1   235956.23
Appliances         2000       2   129871.95
Appliances         2000       3   398213.39
Appliances         2000       4    54111.05
Appliances         2001       1    87230.33
Appliances         2001       2   105231.94
...

No tricks! These results are straightforward. The quarterly_sales table gives sales by quarter for each department. There are no gaps. Take it on faith for now that the process of loading the data mart involves a data cleansing step to eliminate gaps.

Caution! It is not always reasonable to presume clean data having no gaps. I won't ignore the issue, and will circle back to the possible problem of gaps later in this article.

Step 2: Go Trans-Dimensional

Getting sales data by department and quarter was easy; a simple query sufficed. The "interesting" part of the problem is to look at the one year prior amounts in each row. The traditional approach is to execute a self-join, but that approach scales poorly and results in obtuse and difficult to modify queries. You are not interested in SQL as a game. You want to get the job done and go home, so you choose SQL Server 2012's new LAG function. 

LAG provides easy access to a value from a prior row. It's perfect for working with time-series data in which the time interval from row to row is consistent. There are four quarters in a year, so there are four rows per combination of year and department. Write a LAG function to retrieve the fourth sales amount back:

select department, year, quarter, sales,
       LAG(sales, 4) year_ago_sales
...

Don't execute the query just yet. Going back four rows means nothing without some sort of ordering. Add an OVER clause to the function and specify an ordering that follows the calendar:

select department, year, quarter, sales,
       LAG(sales, 4) OVER (ORDER BY year, quarter) year_ago_sales
...

One more thing! Partition by department so as to avoid commingling results from different departments:

select department, year, quarter, sales,
       LAG(sales, 4) OVER (PARTITION BY department
                           ORDER BY year, quarter) year_ago_sales
...

Now you are ready to execute the query. Each row in the output will contain the current quarterly sales value along with the corresponding value from four quarters – i.e., one year – previous. Here is the complete query and some of the output:

/*------------------------
select department, year, quarter, sales,
       LAG(sales, 4) OVER (PARTITION BY department
                           ORDER BY year, quarter) year_ago_sales
from quarterly_sales
order by department, year, quarter;
------------------------*/
department         year quarter       sales   year_ago_sales
---------------- ------ ------- ----------- ----------------
Appliances         2000       1   235956.23             NULL
Appliances         2000       2   129871.95             NULL
Appliances         2000       3   398213.39             NULL
Appliances         2000       4    54111.05             NULL
Appliances         2001       1    87230.33        235956.23
Appliances         2001       2   105231.94        129871.95
...

You can see in year 2001 how rows begin to pick up the sales amount from the same quarter one year prior. There's no history prior to 2000, so rows from that year receive a NULL in the year_ago_sales column.

You have the data you need. Now it's time to feed these rows into a parent query that answers the business question.

Step 3: Answer the Question

If you haven't read the first article in this series – Detail in the Aggregate – be sure to circle back and read it, because it explains in good detail why you must feed the results from the previous section's query into a parent query that does the work answering the business question. The short story is that analytic function results are not available until after any filtering triggered by WHERE and HAVING clauses. You must feed your analytic results to a parent query and filter there. For example:

/*------------------------
select * 
from
(
select department, year, quarter, sales,
       LAG(sales, 4) OVER (PARTITION BY department
                           ORDER BY year, quarter) year_ago_sales
from quarterly_sales
) x
where sales < year_ago_sales
order by department, year, quarter;
------------------------*/
department         year quarter       sales   year_ago_sales
---------------- ------ ------- ----------- ----------------
Appliances         2001       1    87230.33        235956.23
Appliances         2001       2   105231.94        129871.95
...
Electronics        2003       2   624831.23        673839.83
Electronics        2003       3   634839.23        665839.73
...
Jewelry            2002       2   323939.83        364983.54
Jewelry            2003       3   383542.32        387213.39
...

The WHERE clause predicate in the outer query – sales < year_ago_sales – should be clear enough as written. The presence of NULLs when no prior history is available is enough on its own to prevent any year-2000 data from cluttering the results. The ORDER BY clause is now in the outer query, and serves to present the results in calendar order by department. You can readily see when a given department has suffered a drop in year-on-year quarterly sales.

Real Challenge!

Analytic functions come into their own as business questions become increasingly complex. Use of analytics over the traditional self-join approach make a query more malleable, easier and clearer to understand, empowering you to explore variations on the original question.

For example, you might look at the data returned so far and decide that you lack the time to investigate 40 separate instances of year-on-year declines. (40 is the number of such instances returned from this article's data set). Thanksgiving is coming and you want to spend it with your family, so you narrow your question as follows:

In what quarters did a department's sales drop below those of the same quarter one year previous without increasing again the very next year?

Ah! Now you are interested in year-on-year declines only when the trend does not reverse itself the next year. Now you want to look forward and backward at the same time. The self-join approach would require adding a second self-join to your query, possibly hurting performance and definitely making the SQL a mess to read. With analytic functions all you need do is add a LEAD function to your inner query and modify the WHERE clause in your outer query. Here are the new query and results:

/*------------------------
select * 
from
(
select department, year, quarter, sales,
       LAG(sales, 4) OVER (PARTITION BY department
                           ORDER BY year, quarter) year_ago_sales,
       LEAD(sales, 4, 999999999) OVER (PARTITION BY department
	                         ORDER BY year, quarter) year_next_sales
from quarterly_sales
) x
where sales < year_ago_sales
  and year_next_sales <= sales
order by department, year, quarter;
------------------------*/
department         year quarter       sales   year_ago_sales   year_next_sales
---------------- ------ ------- ----------- ---------------- -----------------
Appliances         2005       2   123863.65        162489.02         123033.29
Electronics        2003       2   624831.23        673839.83         584812.23
Jewelry            2003       3   383542.32        387213.39         369293.39
Jewelry            2006       1   398721.04        404040.04         303879.58
Jewelry            2006       2   364561.98        427023.93         314873.74
Jewelry            2006       3   339828.84        414849.84         298733.50
Jewelry            2006       4   357221.41        429232.13         309883.37
Jewelry            2007       1   303879.58        398721.04         287454.44
Jewelry            2007       2   314873.74        364561.98         294004.27
Vacuums            2004       3    98574.34        123983.11          94598.72

(10 row(s) affected)

Key items to notice in this query are:

  • LEAD is used to access the sales value four rows into the future.

  • 999999999 is specified as a third argument to LEAD and forms a default to be used when there is no fourth row into the future, thereby excluding rows for which the business question cannot yet be answered. 

  • The predicate year_next_sales <= sales is written with a <= so as to require that the next_year_sales be an increase in order for a given quarter not to be listed on the report. Change the predicate to < if you prefer to look or two year-on-year declines in succession.

Yes, the use of 999999999 is a hack, but hacks are sometimes ok in ad-hoc queries. Remove the default if you prefer the results to include those cases for which the one-year-forward values are not yet known. (Whether to include or exclude those rows is a business decision. An analyst might choose to look at the results both ways).

A Walk in the Park of business Intelligence

You're curious now, and decide to ask whether any of the year-on-year declines have been more than made up by a subsequent increase in sales. Say that a department's sales went from $100,000 to $95,000 to $105,000 in the same quarter over three years. You want report on cases matching that pattern, where the value in year three exceeds the value in year one. You're new business question is:

In what quarters did a department's sales drop below those of the same quarter one year previous only to increase enough the following year to exceed the first amount?

You can answer this new question by making two, trivial changes to the query:

  1. Change the second predicate to read: year_next_sales > year_ago_sales.

  2. Remove the default of 999999999 from the invocation of LEAD. Do this, because the new question really does require this time that you know the next year's outcome.

Following is the new query and results:

/*------------------------
select * 
from
(
select department, year, quarter, sales,
       LAG(sales, 4) OVER (PARTITION BY department
                           ORDER BY year, quarter) year_ago_sales,
       LEAD(sales, 4) OVER (PARTITION BY department
	                    ORDER BY year, quarter) year_next_sales
from quarterly_sales
) x
where sales < year_ago_sales
  and year_next_sales > year_ago_sales
order by department, year, quarter;
------------------------*/
department         year quarter       sales   year_ago_sales   year_next_sales
---------------- ------ ------- ----------- ---------------- -----------------
Appliances         2001       2   105231.94        129871.95         197399.29
Appliances         2005       4   137740.47        143670.03         169102.06
Appliances         2006       2   123033.29        123863.65         178345.47
Electronics        2003       3   634839.23        665839.73         701833.96
Electronics        2004       1   594878.97        604832.43         683987.76
Electronics        2004       2   584812.23        624831.23         714452.52
Electronics        2006       3   719542.57        743232.13         833445.73
Jewelry            2002       2   323939.83        364983.54         368843.63
Jewelry            2004       2   360198.03        368843.63         427023.93
Jewelry            2004       3   369293.39        383542.32         414849.84
Jewelry            2004       4   398321.75        401040.94         429232.13
Mattresses         2001       3    43923.82         44893.84          47323.73
Mattresses         2003       1    45889.79         47383.23          52090.87
Mattresses         2010       1    71083.25         72045.84          84109.38
Vacuums            2001       1    64341.54         65873.32          78141.41
Vacuums            2005       3    94598.72         98574.34         108683.80

(16 row(s) affected)

You can see that there have been 16 occasions during the time for which data is available that quarterly sales have dropped year-on-year only to rebound a year later to a value exceeding the pre-drop amount. Just two easy changes was all it took to explore an interesting question that came to mind as I wrote this part of the article.

Holes in the Data

Earlier I promised to return to the problem of holes, or gaps, in the data. All the queries in this article rest on the presumption of an unbroken sequence of quarterly sales for each department. What if you have doubts?

Three suggestions come to mind:

  • Execute a query to verify an unbroken sequence and any other assumptions upon which subsequent queries rest.

  • Densify the data by joining with a numbers table as I describe in my article on Partition Outer-Joins.

  • Add an error message column to your queries along with associated logic to populate that column with error messages should gaps in the data occur.

LAG and LEAD can help with the first approach, and probably with the third as well. Following is a query using LAG to report gaps in the data for this article. I executed the query after deleting the 2002 sales data for the Appliance Department.

/*------------------------
select department, year-1 year, quarter
from 
(
select department, year, quarter,
       lag(year, 4) over (partition by department
	                  order by year, quarter) last_year,
       lag(quarter, 4) over (partition by department 
	                     order by year, quarter) same_quarter
from quarterly_sales
) x
where not (
  last_year = year-1 and quarter = same_quarter
  );
------------------------*/
department              year quarter
---------------- ----------- -------
Appliances              2002       1
Appliances              2002       2
Appliances              2002       3
Appliances              2002       4

(4 row(s) affected)

Choose the approach that best fits your circumstances and how you're using the data. I have used all three of the above approaches, depending upon circumstances and what I was trynig to accomplish.

Wrap-Up

You're a busy analyst. It's a been a long day. You want to get home to your family. You can see, I hope, the expressive power in analytic functions such as LAG and LEAD. Also be aware of FIRST_VALUE and LAST_VALUE. They are new as well in SQL Server 2012, and do exactly as their names suggest – enabling access to the very first and last values in an ordered column. These analytic functions make it easy to access data up and down a column; they enable the vertical dimension. They are especially useful in answering business questions involving time-series data like that used in this article. Instead of writing brittle and obtuse collections of self-joins, you can write easily understandable and malleable queries that make it almost fun to ask that one, last, nagging business question before you run out the door for the evening.