Analytic Magic

One aspect of SQL that I've long enjoyed is the possibility of taking a feature of the language that doesn't seem to bear on the problem at hand, and applying that feature in some clever way to solve the problem. Until recently, I hadn't seen any clever applications of the new analytic functions, and I often found myself wondering whether Oracle's newer SQL features were so specific as to preclude their use for other than their target problems as described in the documentation.

I needn't have worried.

Last fall, someone on the ORACLE-L list posed an interesting problem. To begin, you have verses of poetry stored one line of verse to one database row. For example:

SELECT *
FROM sliced_kipling
ORDER BY VERSE, PIECE;

     VERSE      PIECE CHUNK
---------- ---------- ---------------------------------------------
         1          1 Oh, East is East,
         1          2 and West is West,
         1          3 and never the twain shall meet,
         2          1 Till Earth and Sky stand
         2          2 presently at God's great Judgment Seat;
         3          1 But there is neither East nor West,
         3          2 Border,
         3          3 nor Breed,
         3          4 nor Birth,
         4          1 When two strong men stand face to face,
         4          2 tho' they come from the ends of the earth!

Given this data, what SQL query can you issue to bring back each *verse* in its own row? I'd solved similar problems using procedural code back in my COBOL programming days, but had never thought about how to solve such problems from SQL. Stephane's, and then Larry's, solutions are fascinating.

The root problem here is the need to combine data from multiple rows into just one row. One approach to doing that is a self-join:

SELECT ROWNUM, a.chunk || ' ' || b.chunk
FROM sliced_kipling a INNER JOIN sliced_kipling b
     ON a.verse = b.verse AND a.piece+1 = b.piece;

    ROWNUM A.CHUNK||''||B.CHUNK
---------- ------------------------------------------------------------
         1 Oh, East is East, and West is West,
         2 and West is West, and never the twain shall meet,
...

These results are somewhat promising, but the self-join approach is the wrong path. Not all verses have the same number of lines. You'd need two self-joins to combine all three lines from verse 1, only one self-join to combine both lines from verse 2, and three self-joins to combine all four lines from verse 3. There is no single number of self-joins that works in all cases, and for verses that have only one line, you don't want a self-join at all.

It turns out that this problem fits the same "linkage" pattern as the Finding Flight Legs problem I wrote about in September of last year. By this, I mean that you can think of each verse as a chain of linked rows, one row for each line in the verse. Line 1 links to line 2 links to line 3, and so forth until the end of each verse is reached. Such linkage problems can attacked using recursive queries.

The following CONNECT BY query uses the SYS_CONNECT_BY_PATH function to concatenate the lines of each verse together:

SELECT verse, LEVEL lvl, SYS_CONNECT_BY_PATH(chunk, '/') text
FROM sliced_kipling
CONNECT BY verse = PRIOR verse
       AND piece - 1 = PRIOR piece
START WITH piece = 1;

This query transforms each line of each verse into a node in a hierarchy. The first line of a verse is the root node, the next line in the verse is a child of the root, and so forth. SYS_CONNECT_BY_PATH returns the concatenation of all chunk values from the root node to the current node:

 VERSE  LVL TEXT
------ ---- ------------------------------------------------------------
     1    1 /Oh, East is East,
     1    2 /Oh, East is East,/and West is West,
     1    3 /Oh, East is East,/and West is West,/and never the twain
            shall meet,

     2    1 /Till Earth and Sky stand
     2    2 /Till Earth and Sky stand/presently at God's great Judgment
            Seat;

     3    1 /But there is neither East nor West,
     3    2 /But there is neither East nor West,/Border,
     3    3 /But there is neither East nor West,/Border,/nor Breed,
     3    4 /But there is neither East nor West,/Border,/nor Breed,/nor
            Birth,

     4    1 /When two strong men stand face to face,
     4    2 /When two strong men stand face to face,/tho' they come from
            the ends of the earth!

This query does return a row containing the complete text of that verse. However, the results include many extraneous rows. The full solution lies in returning only the leaf nodes. For each verse, the leaf node is the one the with highest LEVEL value. To know which nodes to return from the query, you need a way to determine whether a given node represents the highest LEVEL value for its verse. The following GROUP BY query returns the highest LEVEL value for each verse:

SELECT verse, MAX(piece) piecemax
FROM sliced_kipling
GROUP BY verse;

 VERSE   PIECEMAX
------ ----------
     1          3
     2          2
     3          4
     4          2

This query works, because there is a correspondence between line numbers (from the PIECE column) and levels in the hierarchy. Line 1 is LEVEL 1, line 2 is LEVEL 2, and so forth. The query really returns the maximum line number for each verse, but that value happens to be the same as the maximum, hierarchical LEVEL number.

You can now join the preceding two queries together, and treat the result as a subquery:

SELECT  TRANSLATE(LTRIM(x.text, '/'), '/', ' ') text
FROM (SELECT verse, LEVEL lvl, SYS_CONNECT_BY_PATH(chunk, '/') text
      FROM sliced_kipling
      CONNECT BY verse = PRIOR verse
             AND piece - 1 = PRIOR piece
      START WITH piece = 1) x
     INNER JOIN
     (SELECT verse, MAX(piece) piecemax
      FROM sliced_kipling
      GROUP BY verse) y
     ON x.verse = y.verse AND x.lvl = y.piecemax
ORDER BY x.verse;

TEXT
--------------------------------------------------
Oh, East is East, and West is West, and never the
twain shall meet,

Till Earth and Sky stand presently at God's great
Judgment Seat;

But there is neither East nor West, Border, nor
Breed, nor Birth,

When two strong men stand face to face, tho' they
come from the ends of the earth!

The join between the list of highest LEVEL numbers (the y alias) and the rowset with all LEVEL numbers (the x alias) leaves only the following rows:

 VERSE  LVL TEXT
------ ---- ------------------------------------------------------------
     1    3 /Oh, East is East,/and West is West,/and never the twain
            shall meet,

     2    2 /Till Earth and Sky stand/presently at God's great Judgment
            Seat;

     3    4 /But there is neither East nor West,/Border,/nor Breed,/nor
            Birth,

     4    2 /When two strong men stand face to face,/tho' they come from
            the ends of the earth!

The outer SELECT retrieves only the text of each verse, uses LTRIM to remove the leading, forward-slash (/) delimiter, uses TRANSLATE to replace the other forward-slash delimiters with spaces, and uses ORDER BY to ensure that verses are finally returned in their proper order.

The query works, but it's a bit difficult to fathom at first glance, and it sure would be nice not to join two subqueries that collectively read the source table twice. This is where analytic functions come into play. The following query uses an analytic function to avoid joining two subqueries, and to avoid scanning the entire source table twice:

SELECT TRANSLATE(LTRIM(text, '/'), '/', ' ') text
FROM (SELECT ROW_NUMBER() OVER
             (PARTITION BY verse ORDER BY verse, lvl DESC) rn,
             verse, text
      FROM (SELECT verse, LEVEL lvl,
                   SYS_CONNECT_BY_PATH(chunk, '/') text
            FROM sliced_kipling
            CONNECT BY verse = PRIOR verse
                   AND piece - 1 = PRIOR piece
            START WITH piece = 1))
WHERE rn = 1
ORDER BY verse;

The innermost subquery returns one row per node, as you saw earlier:

 VERSE  LVL TEXT
------ ---- ------------------------------------------------------------
     1    1 /Oh, East is East,
     1    2 /Oh, East is East,/and West is West,
     1    3 /Oh, East is East,/and West is West,/and never the twain
            shall meet,
...

The outermost subquery uses ROW_NUMBER() in its analytic form to rank the rows for each verse in descending order of LEVEL. The "PARTITION BY verse" clause treats each verse's nodes separately. The "ORDER BY verse" clause is there only because the syntax requires ORDER BY in order to specify a windowing clause such as "lvl DESC". The results from this outermost subquery are as follows:

 RN  VERSE TEXT
--- ------ ------------------------------------------------------------
  1      1 /Oh, East is East,/and West is West,/and never the twain
           shall meet,
  2      1 /Oh, East is East,/and West is West,
  3      1 /Oh, East is East,
...

>From this result set, the main query retrieves only those rows where rn = 1, and sorts those rows in verse order:

TEXT
------------------------------------------------------------
Oh, East is East, and West is West, and never the twain
shall meet,

Till Earth and Sky stand presently at God's great Judgment
Seat;

But there is neither East nor West, Border, nor Breed, nor
Birth,

When two strong men stand face to face, tho' they come from
the ends of the earth!

Is the analytic version of the query to return each verse as one row really easier to read? I don't believe I can make that argument very strongly. One solution uses a join between two subqueries, while the other solution uses a subquery within a subquery. Both queries take a bit of thought to understand.

However, on my system at least, the analytic solution performs better. On my system, the solution involving a join between two subqueries results in 42 logical I/Os, whereas the analytic solution results in only 35 logical I/Os. That's almost a 17% reduction in logical I/O, and if your tables are large (mine aren't) that's nothing to sneeze at.

Sample Data

Use the following statements to create and populate the tables I used for the examples in this article.

CREATE TABLE sliced_kipling (
    verse   NUMBER,
    piece   NUMBER,
    chunk   VARCHAR2(45));

INSERT INTO sliced_kipling VALUES(  1, 1, 'Oh, East is East,');
INSERT INTO sliced_kipling VALUES(  1, 2, 'and West is West,');
INSERT INTO sliced_kipling VALUES(  1, 3, 'and never the twain shall meet,');
INSERT INTO sliced_kipling VALUES(  2, 1, 'Till Earth and Sky stand');
INSERT INTO sliced_kipling VALUES(  2, 2, 'presently at God''s great Judgment Seat;');
INSERT INTO sliced_kipling VALUES(  3, 1, 'But there is neither East nor West,');
INSERT INTO sliced_kipling VALUES(  3, 2, 'Border,');
INSERT INTO sliced_kipling VALUES(  3, 3, 'nor Breed,');
INSERT INTO sliced_kipling VALUES(  3, 4, 'nor Birth,');
INSERT INTO sliced_kipling VALUES(  4, 1, 'When two strong men stand face to face,');
INSERT INTO sliced_kipling VALUES(  4, 2, 'tho'' they come from the ends of the earth!');

COMMIT;