Exceptional SQL

Fifth in a series of posts in response to Tim Ford's #EntryLevel Challenge.


SQL implements a number of so-called union operators that under the right circumstances can make queries easy to write and intuitive to read and understand. One of these is the EXCEPT operator that "subtracts" one set of rows from another. 

Say for example that you're doing some work on data quality and want to investigate products that your firm has sold without ever having first purchased. What have you sold but never bought? You can answer that question easily by executing the following EXCEPT query:

SELECT ProductId 
FROM Sales.SalesOrderDetail
EXCEPT
SELECT ProductID 
FROM Purchasing.PurchaseOrderDetail

  ProductId
-----------
        717
        718
        719

SQL's expressiveness allows me to write the query intuitively in a manner closely corresponding to how I think about my business question. Then the optimizer finds an efficient execution path. Here's what's happening at the conceptual level:

  1. A set of product ID numbers is generated from sales data.
  2. A second set of product ID numbers is generated from purchase data. 
  3. All purchase data values are eliminated from the final result.

There's an implied SELECT DISTINCT operation at work. Relational database technology has roots in set theory, and the union operators such as EXCEPT operate in set-based terms. No matter how many duplicate rows are returned by the individual queries, you'll have only one of each row in your final result. 

Not many know this, but IBM DB2 does in fact support an EXCEPT ALL operation that implements duplicate row semantics similar to those in the more commonly implemented UNION ALL. To my knowledge, IBM DB2 is the only relational database engine to support EXCEPT ALL.

By now you're looking at the query results and are realizing that you do in fact manufacture products as well as just buy them. A logical follow-up is to wonder whether those products sold without being purchased correspond to what it is that you make in house. Investigate that angle by adding a second EXCEPT operation involving bill-of-materials data:

SELECT ProductId 
FROM Sales.SalesOrderDetail
EXCEPT
SELECT ProductID 
FROM Purchasing.PurchaseOrderDetail
EXCEPT 
SELECT ProductAssemblyID 
FROM Production.BillOfMaterials

Read from top down, and you'll find the query intuitive and easy to grasp. Here's what's happening:

  1. Query results are initialized to the list of products that have been sold
  2. Any purchased products are removed from the list
  3. Any manufactured products are removed from the list

You get a null result, which is precisely what you hope for. All your products sold are either purchased, or they are built in house from parts that you have purchased. Thankfully, there is no data quality problem to resolve, and you can break for lunch early.

But! You're curious about the query's efficiency. So instead of an early lunch, you decide to have a quick look at execution plans. Following is an alternative formulation of the query, one written around NOT EXISTS operations, but it will provide the same result as the EXCEPT query. Notice in particular the need for a SELECT DISTINCT to ensure against duplicate rows.

SELECT DISTINCT(ProductID) 
FROM Sales.SalesOrderDetail
WHERE NOT EXISTS (
   SELECT ProductID 
   FROM Purchasing.PurchaseOrderDetail
   WHERE PurchaseOrderDetail.ProductID = SalesOrderDetail.ProductID) 
AND NOT EXISTS (
   SELECT ProductAssemblyID 
   FROM Production.BillOfMaterials
   WHERE BillOfMaterials.ProductAssemblyID = SalesOrderDetail.ProductID)

For a one-off or infrequently executed query, you can just run it and call it a day if you're happy with the execution speed. But it pays to look into execution plans when writing queries that might get executed frequently as part of a transactional system. Here is a screenshot showing the actual execution plans for the two queries as generated on my test system:

The EXCEPT query's actual execution plan comes in with the lowest cost, in this one case

I generated the plans in the figure by putting both queries into one batch, and then executing the batch and requesting the actual execution plans along with the query results. The EXCEPT query has a slight edge in this specific case with a cost amounting to 46% of the total batch cost, whereas the NOT EXISTS query accounted for 54%.

Don't generalize. Don't presume that EXCEPT is always better. My testing with different business questions and queries to answer them showed in some cases that NOT EXISTS had a slight edge, and in other cases the plans and costs were identical. In particular, I noticed that when operating on primary- and foreign-key columns, that it almost didn't seem to matter how I formulated my query.

Anyone who is routinely writing SQL should be familiar with the various union operators such as EXCEPT. These operators can lead to query formulations that closely mirror your thought process around the business question being posed. When that happens, queries are easier to write in the short term, and easier to comprehend when you revisit them in the long term.