Is Yours a UNION Shop?
by Jonathan Gennick, 1-August-2003
In the early 1990s I interviewed for a contract programming
position with Ford Motor Company in Dearborn, Michigan.
During the technical part of the interview process, my
interviewer posed this question: "Do you know what a union
is?" I did know, and I explained the concept, but then I
made a fatal misstep: I volunteered that I hardly ever used
unions. Almost the moment the words were out of my mouth, I
realized I'd blown the interview. I could read the look on
the interviewer's face as he promptly redounded with: "We
use unions a lot around here." I walked out the door a few
minutes later knowing that my job-hunt was not yet over.
The UNION Operator
In the world of relational databases, the term "union" is
often used in reference to a class of operators that
perform set-manipulation operations on the results from
SELECT queries. Perhaps the most well-known union operator
is UNION, which combines the results from two SELECTs into
one rowset.
NOTE: You can find SQL statements to create and
populate the example tables following the end of
the article. Look for the heading "Example Data".
UNION can be useful when you have similar data (such as
names of people) stored in more than one table, and you
wish to query those tables and treat the data as if it were
one rowset from one table. For example, you might use a
query such as the following to return the names of all
people with whom your company has contact:
SELECT name FROM employee
UNION
SELECT name FROM customer
UNION
SELECT name FROM supplier;
The result from this query will be a single, combined list
of names including employees, customers, and supplier
representatives. Each SELECT returns a "set" of rows, and
the UNION operators combine the three separate sets into
one, large set.
The UNION operation eliminates duplicates from the final
result. For example, if the same name appears in the
employee and customer tables, the above query will return
that name only once. Duplicate elimination is usually
accomplished by sorting and merging the data from the
rowsets returned by each SELECT, imparting a built-in
performance penalty to any union query. Later I'll talk
about the ALL keyword, which you can use to avoid the
penalty of duplicate elimination.
The ISO SQL standard (SQL:1999) defines two other set
operators besides UNION:
EXCEPT - Removes rows from one set that are
present in another.
INTERSECT - Returns rows that are present in
both of two sets.
Oracle implements both these operators, except that Oracle
uses the keyword MINUS instead of EXCEPT. Other database
brands, such as DB2 and SQL Server, do implement EXCEPT, so
it pays to be aware that EXCEPT is the standard.
One use for MINUS is to compare two tables. You might wish
to do this in a test environment, to see whether the
results of running a batch job match your expected results.
For example, to compare two tables, each having a primary
key, you can write a query similar to:
(SELECT 'Extra in employee', e.* FROM employee e
MINUS
SELECT 'Extra in employee', ee.* FROM employee_expected ee)
UNION
(SELECT 'Missing from employee', ee.* FROM employee_expected ee
MINUS
SELECT 'Missing from employee', e.* FROM employee e);
The first MINUS operation returns rows in employee that are
not also in employee_expected. The second MINUS operation
finds rows in employee_expected that are not also in
employee. I use parentheses in this example to force the
two MINUS operations to occur first, followed by the
UNION. If no rows are returned by this query, then no
differences were found in either direction, and you know the
two tables contain the same data. If rows are returned,
those rows represent differences.
The problem becomes more difficult if primary keys do not
exist on the tables. This is because the MINUS operation
eliminates duplicates, just as UNION does. A single row in
one of the tables will cancel out multiple identical rows
in the other. Therefore, if no primary keys exist to ensure
that each row is unique within its table, you have to
somehow take into account the number of times each row
occurs in each table. Following is one way to do this:
(SELECT e.*, COUNT(*)
FROM employee e
GROUP BY e.id, e.name, e.county, e.hire_date, e.fire_date
MINUS
SELECT ee.*, COUNT(*)
FROM employee_expected ee
GROUP BY ee.id, ee.name, ee.county, ee.hire_date, ee.fire_date)
UNION
(SELECT ee.*, COUNT(*)
FROM employee_expected ee
GROUP BY ee.id, ee.name, ee.county, ee.hire_date, ee.fire_date
MINUS
SELECT e.*, COUNT(*)
FROM employee e
GROUP BY e.id, e.name, e.county, e.hire_date, e.fire_date);
The key to this query is that each SELECT contains a GROUP
BY clause listing ALL columns in the table being queried.
It's critical to GROUP BY all columns. As a result,
COUNT(*) will return an occurrence count telling you how
many time each row occurs. If each table has the same rows
as the other, and each row occurs the same number of times
in each table, everything will cancel out and the query
will return no rows, indicating that the two tables are
identical.
Another use for MINUS is to determine rows for which there
are no children, or no parents. For example, the following
query returns a list of employees who have never been paid:
SELECT e2.id, e2.name
FROM (SELECT id FROM employee
MINUS
SELECT emp_id FROM paycheck) e1
INNER JOIN employee e2 ON e1.id = e2.id;
INTERSECT is the final set operator. It allows you to
determine rows in common between two sets. Like UNION
and MINUS, the INTERSECT operation eliminates duplicate
rows. For example, to find employees who are also customers:
SELECT name FROM employee
INTERSECT
SELECT name FROM customer;
This particular example breaks down if you push it too far.
You would very likely want more than just the employee
names. If you did want columns other than the name, you
would lose the ability to use INTERSECT. For example, you
can't get away with:
SELECT name, hire_date FROM employee
INTERSECT
SELECT name FROM customer;
You could join the result of the INTERSECT to the employee
table:
SELECT *
FROM (SELECT name FROM employee
INTERSECT
SELECT name FROM customer) e1
INNER JOIN employee e2 ON e1.name = e2.name;
However, at this point, you might be better off using a
subquery, or perhaps a join to customer:
SELECT *
FROM employee e INNER JOIN customer c
ON e.name = c.name;
For September, I'm working on an article showing a much
more interesting and innovative use of INTERSECT than I've
shown here. Stay tuned!
ALL Versus DISTINCT
So far I've talked about how the various set operators
eliminate duplicate rows from the result set. This
duplicate elimination extracts a performance penalty, so if
duplicate rows are not a problem in your application, or if
you know that duplicates will never occur, it'd be good to
bypass the overhead of duplicate elimination. It turns out
you can do that. The SQL:1999 standard specifies the use of
the ALL keyword for this purpose:
UNION ALL
EXCEPT ALL
INTERSECT ALL
Of these three, Oracle supports only UNION ALL. That's
actually unfortunate, because MINUS ALL would greatly
simplify the table-difference solution when no primary keys
are involved.
NOTE: The SQL:1999 standard also defines the DISTINCT
keyword for you to use in explicitly specifying the
default behavior, as in UNION DISTINCT. Oracle does
not support that keyword.
I once used UNION ALL to good effect in a case where I
needed to generate a report based on several GROUP BY
queries, but the reporting tool I was using allowed for
only one query per report. My solution was to "concatenate"
my GROUP BY queries together using UNION ALL. For example:
SELECT 1, county, COUNT(*)
FROM employee
WHERE hire_date < TO_DATE('1-Jul-2003','dd-mon-yyyy')
AND (fire_date >= TO_DATE('1-Jul-2003','dd-mon-yyyy')
OR fire_date IS NULL)
GROUP BY county
UNION ALL
SELECT 2, county, COUNT(*)
FROM employee
WHERE hire_date < TO_DATE('1-Aug-2003','dd-mon-yyyy')
AND (fire_date >= TO_DATE('1-Aug-2003','dd-mon-yyyy')
OR fire_date IS NULL)
GROUP BY county
ORDER BY 1, county;
The first SELECT generates a count of employees by county
as of the beginning of July 2003. The second SELECT
generates a count of employees by county as of the end of
the month. The UNION ALL combines the two result sets into
one. Each SELECT returns an arbitrary number (1 & 2 in this
example) that I reference in the ORDER BY clause to keep
rows from each SELECT grouped together.
Set Complements
An interesting set-oriented problem is that of finding all
potential members of a set that are not, in fact, members
of the set. For example, I might ask for a list of people
who are not customers. To solve this problem, I first need
to know the "universe" for my set, where "universe" is
defined as all possible elements. Once I know the universe,
a simple MINUS operation gives the list of missing
elements:
SELECT name FROM all_people_on_earth
MINUS
SELECT name FROM customer;
Of course, no one maintains a table with a row for everyone
on the planet, so the particular problem I stated is not
solvable. However, it is sometimes possible to generate a
"universe" on-the-fly, and I describe such a case in an
article that ran in the September 2002 issue of Oracle
Magazine.
The article is titled "Turning on Pivot Tables", and
describes a case in which a team I was part of was assigned
the task of generating a report of missing rows. To
generate that report, we used a pivot table to generate,
on-the-fly, a set defining the "universe" of all possible
rows in a given time period. We then used MINUS to
eliminate those rows that we actually had, leaving us with
a list of the missing rows.
Wrap-up
So what finally happened with that job interview? It came
at a dark time in my career, which is probably why I
remember the incident so clearly. I'd been unemployed for
some time, was having difficulty finding steady work,
desperately needed a job, and my self-confidence was in the
dumps. Blowing the interview over union queries didn't
help. It turned out that a few weeks later, to my surprise,
I got a call to return for a second interview, so I must
not have made as bad an impression as I initially thought.
But by then I'd found a position elsewhere. I hope reading
this article leaves you better-equipped to answer questions
about unions than I was at the time.
Acknowledgments
I'd like to thank the following people from the ORACLE-L
list who took the time to talk about their use of union
queries: Prasad Gunda, Denny, Mark Richard, Nuno Souto,
Jack Applewhite, Rachel Carmichael, Ian MacGregor, and
Larry Elkins. I'd also like to thank Jim Melton of Oracle
Corporation, who is also the editor of the ANSI/ISO SQL
Standard, for taking the time to answer a few questions
about Oracle's implementation vis-a-vis the SQL standard.
Example Data
Following are the statements to create and populate the
tables I used for all the SQL statements in this article:
/* DROP TABLE employee;
DROP TABLE customer;
DROP TABLE supplier;
DROP TABLE employee_expected;
DROP TABLE paycheck; */
CREATE TABLE employee (
id NUMBER,
name VARCHAR2(20),
county VARCHAR2(20),
hire_date DATE,
fire_date DATE
);
CREATE TABLE customer (
name VARCHAR2(30)
);
CREATE TABLE supplier (
name VARCHAR2(30)
);
CREATE TABLE employee_expected (
id NUMBER,
name VARCHAR2(20),
county VARCHAR2(20),
hire_date DATE,
fire_date DATE
);
CREATE TABLE paycheck (
emp_id NUMBER,
pay_date DATE,
amount NUMBER);
INSERT INTO employee VALUES (1,'Andrew Sears','Alger',
TO_DATE('10-Jun-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee VALUES (2,'Jeff Gennick','Alger',
TO_DATE('1-Jan-2003','dd-mon-yyyy'),TO_DATE('15-Jul-2003','dd-mon-yyyy'));
INSERT INTO employee VALUES (3,'Pat Murphy','Marquette',
TO_DATE('2-Jul-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee VALUES (4,'John Doe','Luce',
TO_DATE('30-Jul-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee VALUES (5,'Jane Doe','Ontonagon',
TO_DATE('23-Apr-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee VALUES (6,'Larry Moe','Delta',
TO_DATE('27-May-2003','dd-mon-yyyy'),TO_DATE('27-Jul-2003','dd-mon-yyyy'));
INSERT INTO customer VALUES ('Andrew Sears');
INSERT INTO customer VALUES ('Mike Loukides');
INSERT INTO customer VALUES ('Larry Ellison');
INSERT INTO supplier VALUES ('Bill Gates');
INSERT INTO employee_expected VALUES (1,'Andrew Sears','Alger',
TO_DATE('10-Jun-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee_expected VALUES (2,'Jeff Gennick','Alger',
TO_DATE('1-Jan-2003','dd-mon-yyyy'),TO_DATE('15-Jul-2003','dd-mon-yyyy'));
INSERT INTO employee_expected VALUES (3,'Heidi Murphy','Marquette',
TO_DATE('2-Jul-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee_expected VALUES (4,'John Doe','Luce',
TO_DATE('30-Jul-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee_expected VALUES (5,'Jane Doe','Ontonagon',
TO_DATE('23-Apr-2003','dd-mon-yyyy'),NULL);
INSERT INTO employee_expected VALUES (6,'Larry Moe','Delta',
TO_DATE('27-May-2003','dd-mon-yyyy'),TO_DATE('27-Jul-2003','dd-mon-yyyy'));
INSERT INTO paycheck VALUES (1,TO_DATE('30-Jun-2003','dd-mon-yyyy'),500);
INSERT INTO paycheck VALUES (2,TO_DATE('31-Jan-2003','dd-mon-yyyy'),500);
INSERT INTO paycheck VALUES (3,TO_DATE('31-Jul-2003','dd-mon-yyyy'),500);
INSERT INTO paycheck VALUES (4,TO_DATE('30-Jun-2003','dd-mon-yyyy'),500);
COMMIT;