Is Yours a UNION Shop?

In the early 1990s I interviewed for a contract programming position with Ford Motor Company in Dearborn, Michigan. During the technical part of the interview process, my interviewer posed this question: "Do you know what a union is?" I did know, and I explained the concept, but then I made a fatal misstep: I volunteered that I hardly ever used unions. Almost the moment the words were out of my mouth, I realized I'd blown the interview. I could read the look on the interviewer's face as he promptly redounded with: "We use unions a lot around here." I walked out the door a few minutes later knowing that my job-hunt was not yet over.

The UNION Operator

In the world of relational databases, the term "union" is often used in reference to a class of operators that perform set-manipulation operations on the results from SELECT queries. Perhaps the most well-known union operator is UNION, which combines the results from two SELECTs into one rowset.

NOTE: You can find SQL statements to create and populate the example tables following the end of the article. Look for the heading "Example Data".

UNION can be useful when you have similar data (such as names of people) stored in more than one table, and you wish to query those tables and treat the data as if it were one rowset from one table. For example, you might use a query such as the following to return the names of all people with whom your company has contact:

SELECT name FROM employee
UNION
SELECT name FROM customer
UNION
SELECT name FROM supplier;

The result from this query will be a single, combined list of names including employees, customers, and supplier representatives. Each SELECT returns a "set" of rows, and the UNION operators combine the three separate sets into one, large set. The UNION operation eliminates duplicates from the final result. For example, if the same name appears in the employee and customer tables, the above query will return that name only once. Duplicate elimination is usually accomplished by sorting and merging the data from the rowsets returned by each SELECT, imparting a built-in performance penalty to any union query. Later I'll talk about the ALL keyword, which you can use to avoid the penalty of duplicate elimination.

Other Set Operators

The ISO SQL standard (SQL:1999) defines two other set operators besides UNION: EXCEPT - Removes rows from one set that are present in another. INTERSECT - Returns rows that are present in both of two sets. Oracle implements both these operators, except that Oracle uses the keyword MINUS instead of EXCEPT. Other database brands, such as DB2 and SQL Server, do implement EXCEPT, so it pays to be aware that EXCEPT is the standard. One use for MINUS is to compare two tables. You might wish to do this in a test environment, to see whether the results of running a batch job match your expected results. For example, to compare two tables, each having a primary key, you can write a query similar to:

(SELECT 'Extra in employee', e.* FROM employee e
MINUS
SELECT 'Extra in employee', ee.* FROM employee_expected ee)
UNION
(SELECT 'Missing from employee', ee.* FROM employee_expected ee
MINUS
SELECT 'Missing from employee', e.* FROM employee e);

The first MINUS operation returns rows in employee that are not also in employee_expected. The second MINUS operation finds rows in employee_expected that are not also in employee. I use parentheses in this example to force the two MINUS operations to occur first, followed by the UNION. If no rows are returned by this query, then no differences were found in either direction, and you know the two tables contain the same data. If rows are returned, those rows represent differences. The problem becomes more difficult if primary keys do not exist on the tables. This is because the MINUS operation eliminates duplicates, just as UNION does. A single row in one of the tables will cancel out multiple identical rows in the other. Therefore, if no primary keys exist to ensure that each row is unique within its table, you have to somehow take into account the number of times each row occurs in each table. Following is one way to do this:

(SELECT e.*, COUNT(*)
FROM employee e
GROUP BY e.id, e.name, e.county, e.hire_date, e.fire_date
MINUS
SELECT ee.*, COUNT(*)
FROM employee_expected ee
GROUP BY ee.id, ee.name, ee.county, ee.hire_date, ee.fire_date)
UNION
(SELECT ee.*, COUNT(*)
FROM employee_expected ee
GROUP BY ee.id, ee.name, ee.county, ee.hire_date, ee.fire_date
MINUS
SELECT e.*, COUNT(*)
FROM employee e
GROUP BY e.id, e.name, e.county, e.hire_date, e.fire_date);

The key to this query is that each SELECT contains a GROUP BY clause listing ALL columns in the table being queried. It's critical to GROUP BY all columns. As a result, COUNT(*) will return an occurrence count telling you how many time each row occurs. If each table has the same rows as the other, and each row occurs the same number of times in each table, everything will cancel out and the query will return no rows, indicating that the two tables are identical. Another use for MINUS is to determine rows for which there are no children, or no parents. For example, the following query returns a list of employees who have never been paid:

SELECT e2.id, e2.name
FROM (SELECT id FROM employee
      MINUS
      SELECT emp_id FROM paycheck) e1
      INNER JOIN employee e2 ON e1.id = e2.id;

INTERSECT is the final set operator. It allows you to determine rows in common between two sets. Like UNION and MINUS, the INTERSECT operation eliminates duplicate rows. For example, to find employees who are also customers:

SELECT name FROM employee
INTERSECT
SELECT name FROM customer;

This particular example breaks down if you push it too far. You would very likely want more than just the employee names. If you did want columns other than the name, you would lose the ability to use INTERSECT. For example, you can't get away with:

SELECT name, hire_date FROM employee
INTERSECT
SELECT name FROM customer;

You could join the result of the INTERSECT to the employee table:

SELECT *
FROM (SELECT name FROM employee
      INTERSECT
      SELECT name FROM customer) e1
      INNER JOIN employee e2 ON e1.name = e2.name;

However, at this point, you might be better off using a subquery, or perhaps a join to customer:

SELECT *
FROM employee e INNER JOIN customer c
     ON e.name = c.name;

For September, I'm working on an article showing a much more interesting and innovative use of INTERSECT than I've shown here. Stay tuned!

ALL Versus DISTINCT

So far I've talked about how the various set operators eliminate duplicate rows from the result set. This duplicate elimination extracts a performance penalty, so if duplicate rows are not a problem in your application, or if you know that duplicates will never occur, it'd be good to bypass the overhead of duplicate elimination. It turns out you can do that. The SQL:1999 standard specifies the use of the ALL keyword for this purpose: UNION ALL EXCEPT ALL INTERSECT ALL Of these three, Oracle supports only UNION ALL. That's actually unfortunate, because MINUS ALL would greatly simplify the table-difference solution when no primary keys are involved.

NOTE: The SQL:1999 standard also defines the DISTINCT keyword for you to use in explicitly specifying the default behavior, as in UNION DISTINCT. Oracle does not support that keyword.

I once used UNION ALL to good effect in a case where I needed to generate a report based on several GROUP BY queries, but the reporting tool I was using allowed for only one query per report. My solution was to "concatenate" my GROUP BY queries together using UNION ALL. For example:

SELECT 1, county, COUNT(*)
FROM employee
WHERE hire_date < TO_DATE('1-Jul-2003','dd-mon-yyyy')
AND (fire_date >= TO_DATE('1-Jul-2003','dd-mon-yyyy')
     OR fire_date IS NULL)
GROUP BY county
UNION ALL
SELECT 2, county, COUNT(*)
FROM employee
WHERE hire_date < TO_DATE('1-Aug-2003','dd-mon-yyyy')
AND (fire_date >= TO_DATE('1-Aug-2003','dd-mon-yyyy')
     OR fire_date IS NULL)
GROUP BY county
ORDER BY 1, county;

The first SELECT generates a count of employees by county as of the beginning of July 2003. The second SELECT generates a count of employees by county as of the end of the month. The UNION ALL combines the two result sets into one. Each SELECT returns an arbitrary number (1 & 2 in this example) that I reference in the ORDER BY clause to keep rows from each SELECT grouped together.

Set Complements

An interesting set-oriented problem is that of finding all potential members of a set that are not, in fact, members of the set. For example, I might ask for a list of people who are not customers. To solve this problem, I first need to know the "universe" for my set, where "universe" is defined as all possible elements. Once I know the universe, a simple MINUS operation gives the list of missing elements:

SELECT name FROM all_people_on_earth
MINUS
SELECT name FROM customer;

Of course, no one maintains a table with a row for everyone on the planet, so the particular problem I stated is not solvable. However, it is sometimes possible to generate a "universe" on-the-fly, and I describe such a case in an article that ran in the September 2002 issue of Oracle Magazine. The article is titled "Turning on Pivot Tables", and describes a case in which a team I was part of was assigned the task of generating a report of missing rows. To generate that report, we used a pivot table to generate, on-the-fly, a set defining the "universe" of all possible rows in a given time period. We then used MINUS to eliminate those rows that we actually had, leaving us with a list of the missing rows.

Wrap-up

So what finally happened with that job interview? It came at a dark time in my career, which is probably why I remember the incident so clearly. I'd been unemployed for some time, was having difficulty finding steady work, desperately needed a job, and my self-confidence was in the dumps. Blowing the interview over union queries didn't help. It turned out that a few weeks later, to my surprise, I got a call to return for a second interview, so I must not have made as bad an impression as I initially thought. But by then I'd found a position elsewhere. I hope reading this article leaves you better-equipped to answer questions about unions than I was at the time.

Acknowledgments

I'd like to thank the following people from the ORACLE-L list who took the time to talk about their use of union queries: Prasad Gunda, Denny, Mark Richard, Nuno Souto, Jack Applewhite, Rachel Carmichael, Ian MacGregor, and Larry Elkins. I'd also like to thank Jim Melton of Oracle Corporation, who is also the editor of the ANSI/ISO SQL Standard, for taking the time to answer a few questions about Oracle's implementation vis-a-vis the SQL standard.

Example Data

Following are the statements to create and populate the tables I used for all the SQL statements in this article:

/* DROP TABLE employee;
DROP TABLE customer;
DROP TABLE supplier;
DROP TABLE employee_expected;
DROP TABLE paycheck; */

CREATE TABLE employee (
   id NUMBER,
   name VARCHAR2(20),
   county VARCHAR2(20),
   hire_date DATE,
   fire_date DATE
);

CREATE TABLE customer (
   name VARCHAR2(30)
);

CREATE TABLE supplier (
   name VARCHAR2(30)
);

CREATE TABLE employee_expected (
   id NUMBER,
   name VARCHAR2(20),
   county VARCHAR2(20),
   hire_date DATE,
   fire_date DATE
);

CREATE TABLE paycheck (
   emp_id NUMBER,
   pay_date DATE,
   amount NUMBER);

INSERT INTO employee VALUES (1,'Andrew Sears','Alger',
   TO_DATE('10-Jun-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee VALUES (2,'Jeff Gennick','Alger',
   TO_DATE('1-Jan-2003','dd-mon-yyyy'),TO_DATE('15-Jul-2003','dd-mon-yyyy'));
INSERT INTO employee VALUES (3,'Pat Murphy','Marquette',
   TO_DATE('2-Jul-2003','dd-mon-yyyy'),NULL);

INSERT INTO employee VALUES (4,'John Doe','Luce',
   TO_DATE('30-Jul-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee VALUES (5,'Jane Doe','Ontonagon',
   TO_DATE('23-Apr-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee VALUES (6,'Larry Moe','Delta',
   TO_DATE('27-May-2003','dd-mon-yyyy'),TO_DATE('27-Jul-2003','dd-mon-yyyy'));

INSERT INTO customer VALUES ('Andrew Sears');
INSERT INTO customer VALUES ('Mike Loukides');
INSERT INTO customer VALUES ('Larry Ellison');

INSERT INTO supplier VALUES ('Bill Gates');

INSERT INTO employee_expected VALUES (1,'Andrew Sears','Alger',
   TO_DATE('10-Jun-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee_expected VALUES (2,'Jeff Gennick','Alger',
   TO_DATE('1-Jan-2003','dd-mon-yyyy'),TO_DATE('15-Jul-2003','dd-mon-yyyy'));
INSERT INTO employee_expected VALUES (3,'Heidi Murphy','Marquette',
   TO_DATE('2-Jul-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee_expected VALUES (4,'John Doe','Luce',
   TO_DATE('30-Jul-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee_expected VALUES (5,'Jane Doe','Ontonagon',
   TO_DATE('23-Apr-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee_expected VALUES (6,'Larry Moe','Delta',
   TO_DATE('27-May-2003','dd-mon-yyyy'),TO_DATE('27-Jul-2003','dd-mon-yyyy'));

INSERT INTO paycheck VALUES (1,TO_DATE('30-Jun-2003','dd-mon-yyyy'),500);
INSERT INTO paycheck VALUES (2,TO_DATE('31-Jan-2003','dd-mon-yyyy'),500);
INSERT INTO paycheck VALUES (3,TO_DATE('31-Jul-2003','dd-mon-yyyy'),500);
INSERT INTO paycheck VALUES (4,TO_DATE('30-Jun-2003','dd-mon-yyyy'),500);

COMMIT;