Understanding the WITH Clause

Recently I researched recursive SQL queries for an Oracle Technology Network (OTN) article on Oracle's CONNECT BY syntax. (By the way, that article is scheduled to appear on OTN sometime in July) While researching that article, someone at Oracle pointed me to the SELECT statement's WITH clause. WITH was introduced in the SQL:1999 standard, and made it's way into Oracle in Oracle9i Database Release 1. As defined in the standard, WITH enables you to do two things:

  • You can use WITH to write recursive queries

  • You can use WITH to factor subqueries out of a main query.

I was only dimly aware of WITH, and hadn't given it much thought. That it could be used to write recursive queries was a surprise to me, probably because Oracle hasn't implemented that aspect of WITH. I decided to do some investigating to find out just what WITH was all about.

Using WITH to Issue Recursive Queries

Perhaps the driving force behind the introduction of WITH in the SQL standard was the need to issue recursive queries. Oracle has long (since version 2) supported such queries through its CONNECT BY syntax. For example, the following CONNECT BY query retrieves the definition of an automobile from a bill-of-mterials table:

SQL> SELECT assembly_id, assembly_name, parent_assembly
  2  FROM bill_of_materials
  3  START WITH assembly_id=100
  4  CONNECT BY parent_assembly = PRIOR assembly_id;

ASSEMBLY_ID ASSEMBLY_NAME           PARENT_ASSEMBLY
----------- ----------------------- ---------------
        100 Automobile
        110 Combustion Engine                   100
        111 Piston                              110
        112 Air Filter                          110
        113 Spark Plug                          110
        114 Block                               110
        115 Starter System                      110
        116 Alternator                          115
        117 Battery                             115
        118 Starter Motor                       115
        120 Body                                100
        121 Roof                                120
        122 Left Door                           120
        123 Right Door                          120
        130 Interior                            100

If you look carefully at this output, you can see that an automobile is made up of a combustion engine, which in turn is made up of a piston, air filter, spark plug, block, and starter system. The starter system is in turn composed of an alternator, battery, and starter motor, and so forth. CONNECT BY gives you the ability to query a table that is recursively related to itself, and the results from a CONNECT BY query are generated in the hierarchical order shown here. When you write a CONNECT BY query, you can think of the START WITH clause as defining the set of root nodes to be returned by the query. The CONNECT BY clause then defines the "join" between parent and child rows. I explain this in more detail in my OTN article. Partly for grins, and partly to expand my horizons a bit, I decided to look at how the preceding query could be implemented using WITH instead of CONNECT BY. To do that, I had to use a database supporting recursive WITH, so I installed a copy of IBM DB2. After a good bit of head-scratching and reading of IBM's example queries, I came up with the following WITH-based solution to my bill-of-materials query:

WITH recursiveBOM
   (assembly_id, assembly_name, parent_assembly) AS
(SELECT parent.assembly_id,
        parent.assembly_name,
        parent.parent_assembly
FROM bill_of_materials parent
WHERE parent.assembly_id=100
UNION ALL
SELECT child.assembly_id,
       child.assembly_name,
       child.parent_assembly
FROM recursiveBOM parent, bill_of_materials child
WHERE child.parent_assembly = parent.assembly_id)
SELECT assembly_id, parent_assembly, assembly_name
FROM recursiveBOM;

Wow! This is inscrutable. It took me a long time fathom this query. Let's take it a piece at a time:

WITH recursiveBOM
   (assembly_id, assembly_name, parent_assembly) AS

The WITH keyword defines the name recursiveBOM for the subquery that is to follow. Next comes a list of column aliases to use when referencing the results of the subquery. Oracle requires that you specify aliases in the subquery's SELECT, something I'll talk more about later. Next comes the first part of the named subquery:

(SELECT parent.assembly_id,
        parent.assembly_name,
        parent.parent_assembly
FROM bill_of_materials parent
WHERE parent.assembly_id=100

The named subquery is a UNION ALL of two queries. This, the first query, defines the starting point for the recursion. As in my CONNECT BY query, I want to know what makes up assembly 100, an automobile. Next up is the part that was most confusing to me:

UNION ALL
SELECT child.assembly_id,
       child.assembly_name,
       child.parent_assembly
FROM recursiveBOM parent, bill_of_materials child
WHERE child.parent_assembly = parent.assembly_id)

This is where things get recursive, and confusing. The second query in the union joins bill_of_materials to the results of the named subquery. Notice how the WHERE clause strikingly resembles the CONNECT BY clause used in the Oracle version of this recursive query. The final part of the query is the main SELECT:

SELECT assembly_id, parent_assembly, assembly_name
FROM recursiveBOM;

This SELECT does nothing more than to retrieve the results returned by named subquery. All the recursion happens in the subquery. The important thing to notice here is that the three column names in the main query's select-list match the three aliases specified near the beginning of the WITH clause. The aliases you specify for a named subquery are the names you must use when retrieving the results of that subquery. How does WITH's recursion work? It's important to form a clear mental-model of how a query such as this produces the results that it does.

For more on mental models, see my article at:
http://gennick.com/mental_models.html

My first attempt at understanding the preceding WITH query was to think in terms of replacing the query *text* recursively. When I came across the text "recursiveBOM", I replaced it with the subquery text. The resulting mess that I envisioned in my mind looked as follows, with ellipses (...) indicating where further query text expansion would occur:

WITH recursiveBOM
   (assembly_id, assembly_name, parent_assembly) AS
(SELECT parent.assembly_id,
        parent.assembly_name,
        parent.parent_assembly
FROM bill_of_materials parent
WHERE parent.assembly_id=100
UNION ALL
SELECT child.assembly_id,
       child.assembly_name,
       child.parent_assembly
FROM (...) parent, bill_of_materials child
WHERE child.parent_assembly = parent.assembly_id)
SELECT assembly_id, parent_assembly, assembly_name
FROM (SELECT parent.assembly_id,
             parent.assembly_name,
             parent.parent_assembly
FROM bill_of_materials parent
WHERE parent.assembly_id=100
UNION ALL
SELECT child.assembly_id,
       child.assembly_name,
       child.parent_assembly
FROM (...) parent, bill_of_materials child
WHERE child.parent_assembly = parent.assembly_id);

The above is 

not

 the way to understand a recursive WITH query. The recursion is not modeled in terms of the text. You have to think in terms of subquery *results*. Following is the mental model I'm currently using to understand this recursive WITH query: 1. The parent query executes:

   SELECT assembly_id, parent_assembly, assembly_name
   FROM recursiveBOM;

This triggers execution of the named subquery. 2 The first query in the subquery's union executes, giving us a seed row with which to begin the recursion:

   SELECT parent.assembly_id,
          parent.assembly_name,
          parent.parent_assembly
   FROM bill_of_materials parent
   WHERE parent.assembly_id=100

The seed row in this case will be for "Automobile". Let's refer to the seed row from here on out as the "new results", new in the sense that we haven't finished processing them yet. 3 The second query in the subquery's union executes:

   SELECT child.assembly_id,
          child.assembly_name,
          child.parent_assembly
   FROM recursiveBOM parent, bill_of_materials child
   WHERE child.parent_assembly = parent.assembly_id

This is where things get interesting. Notice the recursive reference to recursiveBOM. That plays into Step 4. 4. The recursive reference to recursiveBOM is replaced by the query results. And what are the query results? They are:

  • The new results ("Automobile" the first time)

  • The results from executing this query using the new results in place of recursiveBOM. And what are the results from that? They are:

    • The new results (brought in by the join)

    • The results from executing this query using the new results in place of recursiveBOM. And what are the results from that? They are:

      • The new results (brought in by the join)

      • The results from executing this query using the new results in place of recursiveBOM ...

Do you see the recursion here? It took me most of a day to get to the point where I felt comfortable that I understood this. Hopefully you'll catch on more quickly than I did. By the way, the recursion stops when a join fails to bring back any more new rows. Oracle doesn't support the use of WITH to write recursive queries, so if you're using only Oracle, everything I've discussed so far is somewhat academic. However, WITH is part of the ANSI/ISO SQL standard, and it's good to be familiar with what the SQL standard calls for as opposed to what Oracle implements. Who knows? Someday you may encounter a recursive WITH in code written for another database, and now you're equipped to understand it.

Factoring Out Subqueries

The factoring out of subqueries is one use of WITH that Oracle does support, and that use has positive implications for performance. Consider the following SQL query which uses a non-correlated subquery to retrieve all parts with inventories greater than the average:

SELECT part_number, (SELECT AVG(current_inventory)

                     FROM part)
FROM part
WHERE current_inventory > (SELECT AVG(current_inventory)
                           FROM part);

This query uses the same subquery twice, once to return the average inventory along with each result row, and again to determine the average inventory for use in the WHERE clause. Using WITH, you can factor out the subquery:

WITH partavg AS (SELECT AVG(current_inventory) avg_inventory
                 FROM part)
SELECT part_number, (SELECT avg_inventory FROM partavg)
FROM part
WHERE current_inventory > (SELECT avg_inventory
                           FROM partavg);

The main query still contains two subqueries, but each of those subqueries is a SELECT from the named subquery partavg. This gives the optimizer enough information to determine that the partavg subquery needs to be executed only once. Notice here that I've named the column returned by the partavg query by specifying avg_inventory as a column alias within the named subqueries SELECT clause. This is different from the IBM DB2 approach you saw earlier. To see how these queries executed, I connected to my test database using SQL*Plus, issued the SQL*Plus command SET AUTOTRACE TRACEONLY, and then executed each query. Following are the two execution plans:

 Execution Plan
---------------------------------------------------------
  0      SELECT STATEMENT Optimizer=CHOOSE
  1    0   FILTER
  2    1     TABLE ACCESS (FULL) OF 'PART'
  3    1     SORT (AGGREGATE)
  4    3       TABLE ACCESS (FULL) OF 'PART'

Execution Plan
---------------------------------------------------------
  0      SELECT STATEMENT Optimizer=CHOOSE
  1    2     RECURSIVE EXECUTION OF 'SYS_LE_2_0'
  2    0   TEMP TABLE TRANSFORMATION
  3    2     FILTER
  4    3       TABLE ACCESS (FULL) OF 'PART'
  5    3       VIEW
  6    5         TABLE ACCESS (FULL) OF 'SYS_TEMP_0FD9D660B_4B6F5C4'

It's interesting to notice that the first execution plan does not indicate a double execution of the subquery. I'll come back to this point shortly, because the execution statistics make it rather obvious that the subquery is, in fact, being executed twice. The second execution plan is from the WITH version of the query. You can see the effects of WITH in the plan. The "RECURSIVE EXECUTION OF 'SYS_LE_2_0'" no doubt represents execution of the named subquery. The results are stored in a temporary table named SYS_TEMP_0FD9D660B_4B6F5C4, and that table feeds into the processing of the main query. How do I know that the subquery is executed twice the first time, but only once when it's factored out using WITH? The execution statistics provide a good clue. Look at the following two sets of statistics, and focus on the number of consistent gets in each set:

Statistics
---------------------------------------------------------
         1  recursive calls
         2  db block gets
      9427  consistent gets
         0  physical reads
     53880  redo size
   2290609  bytes sent via SQL*Net to client
    572834  bytes received via SQL*Net from client
      4371  SQL*Net roundtrips to/from client
         0  sorts (memory)
         0  sorts (disk)
     65536  rows processed

Statistics
---------------------------------------------------------
        69  recursive calls
        10  db block gets
      7141  consistent gets
         2  physical reads
      1264  redo size
   2290603  bytes sent via SQL*Net to client
    572834  bytes received via SQL*Net from client
      4371  SQL*Net roundtrips to/from client
         0  sorts (memory)
         0  sorts (disk)
     65536  rows processed

The first set of statistics is from the first query, which specifies the same subquery redundantly. The execution of that query against my test data required 9427 consistent gets. (A "consistent get" represents the logical read of a block from the database buffer cache) The second set of statistics is from the WITH version of the query, and required only 7141 consistent gets. Clearly the WITH version of the query is doing less work, and I can only attribute that to the fact that the subquery is being executed once, not twice. It's a bit disappointing that the execution plan doesn't make that crystal clear.

Correlated Subqueries

The previous example was easy, because it used a non-correlated subquery, one that did not reference values from the parent query. It's trivial to factor out a non-correlated subqueries. But what if your subquery is correlated, and references one or more values from the main query. What then? It turns out you can still benefit from WITH, but you may have to think just a bit harder. The following query is a slight revision of the part query shown earlier, but this time, rather than compare each part's inventory to the average of all inventories, it compares each part's inventory to the average of all inventories of parts made by the same manufacturer. Thus, the subqueries are correlated, referencing the manufactured_by value from the parent row.

SELECT p1.part_number,
       (SELECT AVG(p2.current_inventory)
        FROM part p2
        WHERE p2.manufactured_by
              = p1.manufactured_by) avg_inventory
FROM part p1
WHERE p1.current_inventory
      > (SELECT AVG(p2.current_inventory)
         FROM part p2
         WHERE p2.manufactured_by=p1.manufactured_by);

WITH clause subqueries can't be correlated. I can't factor out these subqueries by copying them as-is into a WITH clause. It isn't that simple this time. However, after thinking about things a bit, I realized that I was after the average part inventory for each manufacturer, and that I could generate those values all at once using a simple GROUP BY query:

SELECT manufactured_by,
       AVG(current_inventory) avg_inventory
FROM part
GROUP BY manufactured_by;

This subquery became the basis for the WITH version of my correlated part query. Rather than compute the average inventory for a given manufacturer anew for each part, I can generate all the manufacturer-specific part inventory averages just once, and then reference those results from the main query. Here's what I came up with:

WITH partavg AS (SELECT manufactured_by,
                 AVG(current_inventory) avg_inventory
                 FROM part
                 GROUP BY manufactured_by)
SELECT part_number,
       (SELECT avg_inventory
        FROM partavg
        WHERE part.manufactured_by
              = partavg.manufactured_by) avg_inventory
FROM part
WHERE part.current_inventory
      > (SELECT avg_inventory
         FROM partavg
         WHERE part.manufactured_by
               = partavg.manufactured_by)

Notice that I didn't totally eliminate subqueries from my main query. What I did eliminate from the main query is the constant summarizing and averaging of data. The WITH subquery computes part inventory averages for each manufacturer just once, placing those averages into a temporary table. The subqueries in the main query now select from that much smaller, pre-aggregated temporary table. The results are telling. I won't show the execution plans, because they are similar to the previous two plans, but look at the statistics:

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      16794  consistent gets
          0  physical reads
          0  redo size
    2934767  bytes sent via SQL*Net to client
     572834  bytes received via SQL*Net from client
       4371  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      65536  rows processed

Statistics
----------------------------------------------------------
          4  recursive calls
          7  db block gets
       7139  consistent gets
          1  physical reads
        520  redo size
    2934767  bytes sent via SQL*Net to client
     572834  bytes received via SQL*Net from client
       4371  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
      65536  rows processed

The number of consistent gets issued by the first version of the query, the one with the redundant, correlated subqueries, is 16,794. Look at the improvement in the WITH version of the query, which issued only 7139 consistent gets. That's less than half the work! Using WITH in this case leads to a dramatic decrease in the amount of logical I/O, and decreasing logical I/O is one key to improving performance, both for the database as a whole and for the query in question. WITH can be an important tool in your query-writing arsenal. Anytime you find yourself using the same subquery twice, give some thought to using WITH to factor out that subquery, thus reducing the amount of logical I/O your query must perform. You can also use WITH to factor out non-redundant subqueries in order to make a main query more readable. Be sure to weigh the benefits of using WITH, against the lack of compatibility with older releases of Oracle, because WITH is not available prior to Oracle9i.

Acknowledgments

I'd like to thank Jim Melton of Oracle Corporation, who is also the editor of the ANSI/ISO SQL Standard, for his long-suffering patience when it came to answering my many, many questions about the recursive use of WITH. If my explanation of recursive WITH execution has any clarity, it's largely due to Jim's patient explanations.