Surprise Attack of the Nulls: Hidden Traps in the NOT IN

Eight in a series of posts in response to Tim Ford's #EntryLevel Challenge.


My Fear Nothing post last month introduced some of the trouble that nulls can cause by generating results that are counterintuitive to what you would expect from a superficial reading of a query. This month we’ll go deeper into the rabbit hole, because the situation with nulls is rather worse than you might think. 

Let’s imagine for a moment that we’re interested in answering this business question:

What bicycles do we manufacture for which accessories in matching colors are available?

First we need to determine the colors of our accessories. Then we can restrict our list of bicycles to those in matching colors. Here’s one formulation of a query to answer the business question:

USE AdventureWorks2014;

SELECT p.ProductID, p.Name, p.Color
FROM production.Product AS p
WHERE p.ProductSubcategoryID IN (
    SELECT sc.ProductSubcategoryID
    FROM production.ProductSubcategory AS sc
    WHERE sc.ProductCategoryID = 1
    )
and p.Color IN (
    SELECT p2.Color
    FROM production.Product as p2
    WHERE p2.ProductSubcategoryID IN (
        SELECT sc2.ProductSubcategoryID
        FROM production.ProductSubcategory AS sc2
        WHERE sc2.ProductCategoryID = 4
        )
    )

You can read this query from the bottom up. Here’s a quick summary of the logic behind it:

  1. The bottom-most SELECT retrieves a list of subcategory identifiers in accessories. Category 4 is accessories, individual products are placed into subcategories, so we need the subcategory identifiers.
  2. The enclosing SELECT retrieves a list of colors for all parts in the accessory subcategories.
  3. Similar logic in the main query finds subcategories for bicycles.
  4. An IN predicate in the main query restricts results to those bicycles having matching-color accessories.

I get 79 rows returned in my copy of AdventureWorks2014, and they look as follows:

  ProductID Name                Color
----------- ------------------- ---------------
        749 Road-150 Red, 62    Red
        750 Road-150 Red, 44    Red
        751 Road-150 Red, 48    Red
...

Now let’s turn our business question around and ask instead:

What bicycles lack matching-color accessories?

We can just change our IN to a NOT IN predicate. Right? That seems intuitive. Let’s try it:

USE AdventureWorks2014;

SELECT p.ProductID, p.Name, p.Color
FROM production.Product AS p
WHERE p.ProductSubcategoryID IN (
    SELECT sc.ProductSubcategoryID
    FROM production.ProductSubcategory AS sc
    WHERE sc.ProductCategoryID = 1
    )
and p.Color NOT IN (
    SELECT p2.Color
    FROM production.Product as p2
    WHERE p2.ProductSubcategoryID IN (
        SELECT sc2.ProductSubcategoryID
        FROM production.ProductSubcategory AS sc2
        WHERE sc2.ProductCategoryID = 4
        )
    )

The only change in this query is rewrite of one predicate to and p.Color NOT IN. Execute the query, and you’ll find there are no bicycles for which matching-color accessories do not also exist:

  ProductID Name            Color
----------- --------------- ---------------

(0 row(s) affected)

This is good, right? Except the answer is wrong. It’s wrong, because there happen to be some accessories for which color is not known to the database engine:

USE AdventureWorks2014

SELECT DISTINCT p.Color
FROM production.Product AS p
WHERE p.ProductSubcategoryID IN (
    SELECT sc.ProductSubcategoryID
    FROM production.ProductSubcategory AS sc
    WHERE sc.ProductCategoryID = 4
    )


Color
---------------
NULL
Black
Blue
Grey
Red
Silver

(6 row(s) affected)

What if there are yellow-colored bicycles? Can the database engine determine that no yellow accessories exist? The answer is "no" because of the nulls. Here's how that NOT IN predicate ends up being evaluated: 

Yellow <> NULL AND Yellow <> Black AND Yellow <> Blue …

This expression evaluates to the following truth values in three-valued logic:

UNKNOWN AND FALSE AND FALSE

And the final result is UNKNOWN.

The first query with just an IN clause delivers correct results because the database engine can in fact determine that a color from one list exists in a second list. But the presence of a null in that second list means it’s impossible to determine the lack of a color – because any accessory for which the color is not known could in reality be the color in question; three-valued logic allows for that possibility. 

Here’s what’s interesting: We cannot answer the current business question as it is posed. We must reformulate our question to fit the availability of facts. For example:

What bicycles lack matching-color accessories when the colors of those accessories are known?

Now we have a question we can answer, and following is a query to run to get that answer. Notice the addition of an IS NOT NULL predicate to exclude accessories for which color is unavailable.

USE AdventureWorks2014;

SELECT p.ProductID, p.Name, p.Color
FROM production.Product AS p
WHERE p.ProductSubcategoryID IN (
    SELECT sc.ProductSubcategoryID
    FROM production.ProductSubcategory AS sc
    WHERE sc.ProductCategoryID = 1
    )
and p.Color NOT IN (
    SELECT p2.Color
    FROM production.Product as p2
    WHERE p2.ProductSubcategoryID IN (
        SELECT sc2.ProductSubcategoryID
        FROM production.ProductSubcategory AS sc2
        WHERE sc2.ProductCategoryID = 4
        )
    AND p2.Color IS NOT NULL
    )

We do indeed have some yellow bicycles for which we do not know whether we have any yellow accessories. Here they are:

  ProductID Name                     Color
----------- ------------------------ ---------------
        797 Road-550-W Yellow, 38    Yellow
        798 Road-550-W Yellow, 40    Yellow
        799 Road-550-W Yellow, 42    Yellow
...

The presence of null accessory colors goes beyond just the issue of how we must formulate our query. We are forced to rethink the very business question we’re intending to ask! 

It’s worth my pointing out that I checked ahead of time to ensure no nulls in the bicycle colors. The presence of null bicycle colors would force further rethinking of all the business questions I’ve posed in this article, and then reformulating the queries would follow.

Nulls are tricky, and their effects are not limited to just how you formulate queries against your database. Nulls affect the very business questions you’re able to pose. How you write your queries flows directly from how you rethink your business questions to match the data available.