Understanding the WITH Clause
Recently I researched recursive SQL queries for an Oracle Technology Network (OTN) article on Oracle's CONNECT BY syntax. (By the way, that article is scheduled to appear on OTN sometime in July) While researching that article, someone at Oracle pointed me to the SELECT statement's WITH clause. WITH was introduced in the SQL:1999 standard, and made it's way into Oracle in Oracle9i Database Release 1. As defined in the standard, WITH enables you to do two things:You can use WITH to write recursive queries
You can use WITH to factor subqueries out of a main query.
Using WITH to Issue Recursive Queries
Perhaps the driving force behind the introduction of WITH in the SQL standard was the need to issue recursive queries. Oracle has long (since version 2) supported such queries through its CONNECT BY syntax. For example, the following CONNECT BY query retrieves the definition of an automobile from a bill-of-mterials table:SQL> SELECT assembly_id, assembly_name, parent_assembly
2 FROM bill_of_materials
3 START WITH assembly_id=100
4 CONNECT BY parent_assembly = PRIOR assembly_id;
ASSEMBLY_ID ASSEMBLY_NAME PARENT_ASSEMBLY
----------- ----------------------- ---------------
100 Automobile
110 Combustion Engine 100
111 Piston 110
112 Air Filter 110
113 Spark Plug 110
114 Block 110
115 Starter System 110
116 Alternator 115
117 Battery 115
118 Starter Motor 115
120 Body 100
121 Roof 120
122 Left Door 120
123 Right Door 120
130 Interior 100
If you look carefully at this output, you can see that an
automobile is made up of a combustion engine, which in turn
is made up of a piston, air filter, spark plug, block, and
starter system. The starter system is in turn composed of
an alternator, battery, and starter motor, and so forth.
CONNECT BY gives you the ability to query a table that is
recursively related to itself, and the results from a
CONNECT BY query are generated in the hierarchical order
shown here.
When you write a CONNECT BY query, you can think of the
START WITH clause as defining the set of root nodes to be
returned by the query. The CONNECT BY clause then defines
the "join" between parent and child rows. I explain
this in more detail in my OTN article.
Partly for grins, and partly to expand my horizons a bit, I
decided to look at how the preceding query could be
implemented using WITH instead of CONNECT BY. To do that, I
had to use a database supporting recursive WITH, so I
installed a copy of IBM DB2. After a good bit of
head-scratching and reading of IBM's example queries, I
came up with the following WITH-based solution to my
bill-of-materials query:
WITH recursiveBOM
(assembly_id, assembly_name, parent_assembly) AS
(SELECT parent.assembly_id,
parent.assembly_name,
parent.parent_assembly
FROM bill_of_materials parent
WHERE parent.assembly_id=100
UNION ALL
SELECT child.assembly_id,
child.assembly_name,
child.parent_assembly
FROM recursiveBOM parent, bill_of_materials child
WHERE child.parent_assembly = parent.assembly_id)
SELECT assembly_id, parent_assembly, assembly_name
FROM recursiveBOM;
Wow! This is inscrutable. It took me a long time
fathom this query. Let's take it a piece at a time:
WITH recursiveBOM (assembly_id, assembly_name, parent_assembly) ASThe WITH keyword defines the name recursiveBOM for the subquery that is to follow. Next comes a list of column aliases to use when referencing the results of the subquery. Oracle requires that you specify aliases in the subquery's SELECT, something I'll talk more about later. Next comes the first part of the named subquery:
(SELECT parent.assembly_id,
parent.assembly_name,
parent.parent_assembly
FROM bill_of_materials parent
WHERE parent.assembly_id=100
The named subquery is a UNION ALL of two queries. This,
the first query, defines the starting point for the
recursion. As in my CONNECT BY query, I want to know what
makes up assembly 100, an automobile.
Next up is the part that was most confusing to me:
UNION ALL
SELECT child.assembly_id,
child.assembly_name,
child.parent_assembly
FROM recursiveBOM parent, bill_of_materials child
WHERE child.parent_assembly = parent.assembly_id)
This is where things get recursive, and confusing. The
second query in the union joins bill_of_materials to the
results of the named subquery. Notice how the WHERE clause
strikingly resembles the CONNECT BY clause used in the
Oracle version of this recursive query.
The final part of the query is the main SELECT:
SELECT assembly_id, parent_assembly, assembly_name FROM recursiveBOM;This SELECT does nothing more than to retrieve the results returned by named subquery. All the recursion happens in the subquery. The important thing to notice here is that the three column names in the main query's select-list match the three aliases specified near the beginning of the WITH clause. The aliases you specify for a named subquery are the names you must use when retrieving the results of that subquery. How does WITH's recursion work? It's important to form a clear mental-model of how a query such as this produces the results that it does.
For more on mental models, see my article at:My first attempt at understanding the preceding WITH query was to think in terms of replacing the query *text* recursively. When I came across the text "recursiveBOM", I replaced it with the subquery text. The resulting mess that I envisioned in my mind looked as follows, with ellipses (...) indicating where further query text expansion would occur:
http://gennick.com/mental_models.html
WITH recursiveBOM
(assembly_id, assembly_name, parent_assembly) AS
(SELECT parent.assembly_id,
parent.assembly_name,
parent.parent_assembly
FROM bill_of_materials parent
WHERE parent.assembly_id=100
UNION ALL
SELECT child.assembly_id,
child.assembly_name,
child.parent_assembly
FROM (...) parent, bill_of_materials child
WHERE child.parent_assembly = parent.assembly_id)
SELECT assembly_id, parent_assembly, assembly_name
FROM (SELECT parent.assembly_id,
parent.assembly_name,
parent.parent_assembly
FROM bill_of_materials parent
WHERE parent.assembly_id=100
UNION ALL
SELECT child.assembly_id,
child.assembly_name,
child.parent_assembly
FROM (...) parent, bill_of_materials child
WHERE child.parent_assembly = parent.assembly_id);
The above is not the way to understand a recursive WITH
query. The recursion is not modeled in terms of the text.
You have to think in terms of subquery *results*.
Following is the mental model I'm currently using to
understand this recursive WITH query:
1. The parent query executes:
SELECT assembly_id, parent_assembly, assembly_name FROM recursiveBOM;This triggers execution of the named subquery. 2 The first query in the subquery's union executes, giving us a seed row with which to begin the recursion:
SELECT parent.assembly_id,
parent.assembly_name,
parent.parent_assembly
FROM bill_of_materials parent
WHERE parent.assembly_id=100
The seed row in this case will be for "Automobile".
Let's refer to the seed row from here on out as the
"new results", new in the sense that we haven't finished
processing them yet.
3 The second query in the subquery's union executes:
SELECT child.assembly_id,
child.assembly_name,
child.parent_assembly
FROM recursiveBOM parent, bill_of_materials child
WHERE child.parent_assembly = parent.assembly_id
This is where things get interesting. Notice the
recursive reference to recursiveBOM. That plays into
Step 4.
4. The recursive reference to recursiveBOM is replaced by
the query results. And what are the query results?
They are:
The new results ("Automobile" the first time)
The results from executing this query using the new results in place of recursiveBOM. And what are the results from that? They are:
The new results (brought in by the join)
The results from executing this query using the new results in place of recursiveBOM. And what are the results from that? They are:
The new results (brought in by the join)
The results from executing this query using the new results in place of recursiveBOM ...
Factoring Out Subqueries
The factoring out of subqueries is one use of WITH that Oracle does support, and that use has positive implications for performance. Consider the following SQL query which uses a non-correlated subquery to retrieve all parts with inventories greater than the average:SELECT part_number, (SELECT AVG(current_inventory) FROM part) FROM part WHERE current_inventory > (SELECT AVG(current_inventory) FROM part);This query uses the same subquery twice, once to return the average inventory along with each result row, and again to determine the average inventory for use in the WHERE clause. Using WITH, you can factor out the subquery:WITH partavg AS (SELECT AVG(current_inventory) avg_inventory FROM part) SELECT part_number, (SELECT avg_inventory FROM partavg) FROM part WHERE current_inventory > (SELECT avg_inventory FROM partavg);The main query still contains two subqueries, but each of those subqueries is a SELECT from the named subquery partavg. This gives the optimizer enough information to determine that the partavg subquery needs to be executed only once. Notice here that I've named the column returned by the partavg query by specifying avg_inventory as a column alias within the named subqueries SELECT clause. This is different from the IBM DB2 approach you saw earlier. To see how these queries executed, I connected to my test database using SQL*Plus, issued the SQL*Plus command SET AUTOTRACE TRACEONLY, and then executed each query. Following are the two execution plans:Execution Plan --------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE 1 0 FILTER 2 1 TABLE ACCESS (FULL) OF 'PART' 3 1 SORT (AGGREGATE) 4 3 TABLE ACCESS (FULL) OF 'PART' Execution Plan --------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE 1 2 RECURSIVE EXECUTION OF 'SYS_LE_2_0' 2 0 TEMP TABLE TRANSFORMATION 3 2 FILTER 4 3 TABLE ACCESS (FULL) OF 'PART' 5 3 VIEW 6 5 TABLE ACCESS (FULL) OF 'SYS_TEMP_0FD9D660B_4B6F5C4'
It's interesting to notice that the first execution plan does not indicate a double execution of the subquery. I'll come back to this point shortly, because the execution statistics make it rather obvious that the subquery is, in fact, being executed twice. The second execution plan is from the WITH version of the query. You can see the effects of WITH in the plan. The "RECURSIVE EXECUTION OF 'SYS_LE_2_0'" no doubt represents execution of the named subquery. The results are stored in a temporary table named SYS_TEMP_0FD9D660B_4B6F5C4, and that table feeds into the processing of the main query. How do I know that the subquery is executed twice the first time, but only once when it's factored out using WITH? The execution statistics provide a good clue. Look at the following two sets of statistics, and focus on the number of consistent gets in each set:Statistics --------------------------------------------------------- 1 recursive calls 2 db block gets 9427 consistent gets 0 physical reads 53880 redo size 2290609 bytes sent via SQL*Net to client 572834 bytes received via SQL*Net from client 4371 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 65536 rows processed Statistics --------------------------------------------------------- 69 recursive calls 10 db block gets 7141 consistent gets 2 physical reads 1264 redo size 2290603 bytes sent via SQL*Net to client 572834 bytes received via SQL*Net from client 4371 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 65536 rows processedThe first set of statistics is from the first query, which specifies the same subquery redundantly. The execution of that query against my test data required 9427 consistent gets. (A "consistent get" represents the logical read of a block from the database buffer cache) The second set of statistics is from the WITH version of the query, and required only 7141 consistent gets. Clearly the WITH version of the query is doing less work, and I can only attribute that to the fact that the subquery is being executed once, not twice. It's a bit disappointing that the execution plan doesn't make that crystal clear.Correlated Subqueries
The previous example was easy, because it used a non-correlated subquery, one that did not reference values from the parent query. It's trivial to factor out a non-correlated subqueries. But what if your subquery is correlated, and references one or more values from the main query. What then? It turns out you can still benefit from WITH, but you may have to think just a bit harder. The following query is a slight revision of the part query shown earlier, but this time, rather than compare each part's inventory to the average of all inventories, it compares each part's inventory to the average of all inventories of parts made by the same manufacturer. Thus, the subqueries are correlated, referencing the manufactured_by value from the parent row.SELECT p1.part_number, (SELECT AVG(p2.current_inventory) FROM part p2 WHERE p2.manufactured_by = p1.manufactured_by) avg_inventory FROM part p1 WHERE p1.current_inventory > (SELECT AVG(p2.current_inventory) FROM part p2 WHERE p2.manufactured_by=p1.manufactured_by);WITH clause subqueries can't be correlated. I can't factor out these subqueries by copying them as-is into a WITH clause. It isn't that simple this time. However, after thinking about things a bit, I realized that I was after the average part inventory for each manufacturer, and that I could generate those values all at once using a simple GROUP BY query:SELECT manufactured_by, AVG(current_inventory) avg_inventory FROM part GROUP BY manufactured_by;This subquery became the basis for the WITH version of my correlated part query. Rather than compute the average inventory for a given manufacturer anew for each part, I can generate all the manufacturer-specific part inventory averages just once, and then reference those results from the main query. Here's what I came up with:WITH partavg AS (SELECT manufactured_by, AVG(current_inventory) avg_inventory FROM part GROUP BY manufactured_by) SELECT part_number, (SELECT avg_inventory FROM partavg WHERE part.manufactured_by = partavg.manufactured_by) avg_inventory FROM part WHERE part.current_inventory > (SELECT avg_inventory FROM partavg WHERE part.manufactured_by = partavg.manufactured_by)Notice that I didn't totally eliminate subqueries from my main query. What I did eliminate from the main query is the constant summarizing and averaging of data. The WITH subquery computes part inventory averages for each manufacturer just once, placing those averages into a temporary table. The subqueries in the main query now select from that much smaller, pre-aggregated temporary table. The results are telling. I won't show the execution plans, because they are similar to the previous two plans, but look at the statistics:Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 16794 consistent gets 0 physical reads 0 redo size 2934767 bytes sent via SQL*Net to client 572834 bytes received via SQL*Net from client 4371 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 65536 rows processed Statistics ---------------------------------------------------------- 4 recursive calls 7 db block gets 7139 consistent gets 1 physical reads 520 redo size 2934767 bytes sent via SQL*Net to client 572834 bytes received via SQL*Net from client 4371 SQL*Net roundtrips to/from client 1 sorts (memory) 0 sorts (disk) 65536 rows processedThe number of consistent gets issued by the first version of the query, the one with the redundant, correlated subqueries, is 16,794. Look at the improvement in the WITH version of the query, which issued only 7139 consistent gets. That's less than half the work! Using WITH in this case leads to a dramatic decrease in the amount of logical I/O, and decreasing logical I/O is one key to improving performance, both for the database as a whole and for the query in question. WITH can be an important tool in your query-writing arsenal. Anytime you find yourself using the same subquery twice, give some thought to using WITH to factor out that subquery, thus reducing the amount of logical I/O your query must perform. You can also use WITH to factor out non-redundant subqueries in order to make a main query more readable. Be sure to weigh the benefits of using WITH, against the lack of compatibility with older releases of Oracle, because WITH is not available prior to Oracle9i.Acknowledgments
I'd like to thank Jim Melton of Oracle Corporation, who is also the editor of the ANSI/ISO SQL Standard, for his long-suffering patience when it came to answering my many, many questions about the recursive use of WITH. If my explanation of recursive WITH execution has any clarity, it's largely due to Jim's patient explanations.