External Tables and "Bad" Files

Last fall I wrote an article for Oracle Magazine, titled "Load Up with the Latest", on Oracle9i's new extraction, transformation, and load (ETL) features. In that article I showed how external tables provide an alternative to SQL*Loader for loading data from a text file (or any other external file) into an Oracle database. Since then, I've discovered an important semantic difference between the two mechanisms in how each defines a "bad" input record.

With both SQL*Loader and external tables, you can define a bad file to which "bad" input records are copied. A "bad" record is one that can't be loaded into the target database table for one reason or another: perhaps a numeric field contains non-numeric characters, or perhaps the record causes a constraint violation. After the article ran, a reader pointed out to me that the bad file mechanism for external tables functions slightly differently from the bad file mechanism for SQL*Loader. When SQL*Loader encounters a record that can't be loaded, for whatever reason, it writes that record to the bad file and continues loading all the good records that it can. With external tables, however, certain types of "bad" records can cause the entire load to fail, leaving you to reload all the records from the beginning.

I decided to do some experimenting to better understand how Oracle9i's external table mechanism handles bad input records. To begin, I recreated all the tables and sample data that I used for my earlier article on ETL. I then edited my data file and inserted some non-numeric characters into a numeric field. Here is the resulting data; note the "xxxx" in the second population field for Ishpeming:

         Baraga    1231 1285
         Ishpeming 7200 xxxx
         Munising  2783 2539

Using the following statement, I attempted to load the above data into my database:

         INSERT INTO city_populations (city_name, census_year, population)
         SELECT city_name, 1990, pop_1990
         FROM city_populations_ext
         WHERE pop_1990 IS NOT NULL
         UNION ALL
         SELECT city_name, 2000, pop_2000
         FROM city_populations_ext
         WHERE pop_2000 IS NOT NULL;
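
In case you're wondering about the external table itself, city_populations_ext looked something like the following (the ext_dir directory object name here is a placeholder; yours needs to point to the directory holding the data file). Note the REJECT LIMIT clause: with the default limit of zero, the first rejected record would abort the query instead of quietly landing in the bad file:

         CREATE TABLE city_populations_ext (
            city_name  VARCHAR2(10),
            pop_1990   NUMBER,
            pop_2000   NUMBER
         )
         ORGANIZATION EXTERNAL (
            TYPE ORACLE_LOADER
            DEFAULT DIRECTORY ext_dir  -- placeholder for a directory mapped to e:\a
            ACCESS PARAMETERS (
               RECORDS FIXED 21
               BADFILE 'city_populations.bad'
               FIELDS (
                  city_name POSITION (1:10)  CHAR(10),
                  pop_1990  POSITION (11:14) INTEGER EXTERNAL(4),
                  pop_2000  POSITION (16:19) INTEGER EXTERNAL(4)
               )
            )
            LOCATION ('city_populations.dat')
         )
         REJECT LIMIT UNLIMITED;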

After the INSERT completed, I checked the resulting log file. It indicated that a record with a bad numeric value was found:

         ...
         Field Definitions for table CITY_POPULATIONS_EXT
           Record format FIXED, record length 21
           Data in file has same endianness as the platform
           Fields in Data Source:
             CITY_NAME                       CHAR (10)
               Record position (1, 10)
               Trim whitespace same as SQL Loader
             POP_1990                        Integer external (4)
               Record position (11, 14)
               Trim whitespace same as SQL Loader
             POP_2000                        Integer external (4)
               Record position (16, 19)
               Trim whitespace same as SQL Loader
         error processing column POP_2000 in row 2 for datafile e:\a\city_populations.dat
         ORA-01722: invalid number
         ...

Sure enough, the bad file contained the offending input record:

         E:\A>type city_populations.bad
         Ishpeming 7200 xxxx

And the rest of the input data was successfully loaded:

         SQL> SELECT * FROM city_populations;
         CITY_NAME CENSUS_YEAR POPULATION
         --------- ----------- ----------
         Baraga           1990       1231
         Ishpeming        1990       7200
         Munising         1990       2783
         Baraga           2000       1285
         Munising         2000       2539

This is the behavior we've come to know and expect from SQL*Loader, and in this instance it applies to the external table mechanism as well. What's happening here, I believe, is that the external table access driver is able to recognize the bad input data when it reads the data from the external file. The bad record is then diverted to the bad file instead of being passed on to the SELECT statement's result set. The key here is that the nature of the bad data is such that the access driver can detect the problem.

So far so good, but what happens if bad data isn't recognized as bad until after it's been returned as the result of a query against an external table? To find out, I added the following constraint to my target table:

         ALTER TABLE city_populations ADD (
         CONSTRAINT bad_bad_7200 CHECK (population <> 7200)
         );

I then removed the "xxxx" from my input file so that all numeric data was now valid:

         Baraga    1231 1285
         Ishpeming 7200
         Munising  2783 2539

The record for Ishpeming still contains bad data, in this case data that will violate a constraint, but there's no way to know that until an attempt is made to insert the data into the target table. Only at that point will the constraint violation occur. Let's look at how SQL*Loader deals with this situation. Following is the control file that I used:

         LOAD DATA
         REPLACE INTO TABLE city_populations
         (
            city_name   POSITION(1:10)  CHAR,
            census_year CONSTANT 1990,
            population  POSITION(11:14) INTEGER EXTERNAL,
            -- pop_2000 is a FILLER field: read from the file, but not loaded
            pop_2000    FILLER POSITION(16:19) INTEGER EXTERNAL
         )

Here is the SQL*Loader command I used, along with the resulting output:

         E:\A>sqlldr gennick/gennick@jonathan.gennick.org control=city_populations.ctl
         log=ldr.log data=city_populations.dat bad=ldr.bad
         SQL*Loader: Release 9.0.1.1.1 - Production on Sun Jan 20 15:56:06 2002
         (c) Copyright 2001 Oracle Corporation. All rights reserved.
         Commit point reached - logical record count 3

The log file entry showing that a bad record was encountered:

         Record 2: Rejected - Error on table CITY_POPULATIONS.
         ORA-02290: check constraint (GENNICK.BAD_BAD_7200) violated

The bad file contents:

         E:\A>type ldr.bad
         Ishpeming 7200

And finally, the good data that made it into the target table:

         SQL> SELECT * FROM city_populations;
         CITY_NAME CENSUS_YEAR POPULATION
         --------- ----------- ----------
         Baraga           1990       1231
         Munising         1990       2783

Note that for the SQL*Loader example I loaded only the 1990 data. The one bad value, 7200 for Ishpeming, caused a constraint violation, and that record was written to the bad file. The other two records were good and were loaded into the target table. As in the previous example, this is the behavior we've come to expect when loading data into Oracle.

Things changed, however, when I attempted the exact same load via an external table instead of SQL*Loader:

         SQL> INSERT INTO city_populations (city_name, census_year, population)
         2 SELECT city_name, 1990, pop_1990
         3 FROM city_populations_ext
         4 WHERE pop_1990 IS NOT NULL
         5 UNION ALL
         6 SELECT city_name, 2000, pop_2000
         7 FROM city_populations_ext
         8 WHERE pop_2000 IS NOT NULL;
         INSERT INTO city_populations (city_name, census_year, population)
         *
         ERROR at line 1:
         ORA-02290: check constraint (GENNICK.BAD_BAD_7200) violated
         SQL> COMMIT;
         Commit complete.
         SQL> SELECT * FROM city_populations;
         no rows selected

The constraint violation was reported, as I expected. I then committed my changes, hoping to preserve the good data that was loaded. However, as you can plainly see, none of the good data remained in the target table. Further, no entries were written to the bad file and no error messages were written to the log file.

What's going on here? Why is it that SQL*Loader can write the Ishpeming record to the bad file as a result of the constraint violation, but the external table mechanism can't? I believe the behavior we see here is the result of a core tenet of relational database theory: any given SQL statement must execute in its entirety or not at all; there can be no halfway about it. SQL*Loader, at least during a conventional path load, uses a separate INSERT for each logical input record. Sure, SQL*Loader uses array processing, but the end result is the equivalent of one INSERT per input record. When an INSERT fails because of a constraint violation, SQL*Loader can recognize that and write the offending input record to the bad file. With SQL*Loader, the definition of "bad input record" includes those records that result in SQL errors.
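
Conceptually, then, you can think of the conventional path load shown earlier as a series of single-row INSERTs, each of which succeeds or fails on its own. This is only an illustration of the logical behavior, not what SQL*Loader literally executes:

         INSERT INTO city_populations (city_name, census_year, population)
            VALUES ('Baraga', 1990, 1231);     -- succeeds
         INSERT INTO city_populations (city_name, census_year, population)
            VALUES ('Ishpeming', 1990, 7200);  -- fails with ORA-02290; SQL*Loader
                                               -- writes the record to the bad file
         INSERT INTO city_populations (city_name, census_year, population)
            VALUES ('Munising', 1990, 2783);   -- succeeds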

Note: During a direct path load, SQL*Loader disables CHECK constraints such as the one I created for the examples in this article. It's then up to you to revalidate the data in your target table after the load, and reenable any such constraints.
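
For example, after a direct path load you might re-enable and validate the check constraint with a statement like the following; if any loaded row violates the constraint, the ALTER TABLE fails, and it's up to you to find and fix the offending rows:

         ALTER TABLE city_populations
            ENABLE VALIDATE CONSTRAINT bad_bad_7200;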

When you use an external table to load data, an access driver reads the external file, interprets the input data, and passes good data on to the Oracle engine to be returned as the result of the SELECT statement against the external table. The best the access driver can do is check that the input data conforms to the description in the CREATE TABLE...ORGANIZATION EXTERNAL statement. Once the data is returned from the SELECT, it is out of the access driver's hands, and any subsequent errors (such as a constraint violation) are treated in accordance with normal SQL statement processing. Thus, for the external table mechanism, the definition of "bad input record" includes only those records that the access driver can identify as bad. Records that ultimately result in SQL errors may still be considered "good".

Note: When you think through the implications of being able to load external table data using statements much more complex than the simple INSERT INTO...SELECT FROM shown in this article, you'll realize that the external table mechanism's current behavior with respect to bad input records is the only behavior possible.
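
For instance, suppose you used the external table to load summarized data into a hypothetical summary table named county_populations:

         INSERT INTO county_populations (census_year, total_population)
         SELECT 1990, SUM(pop_1990)
         FROM city_populations_ext;

If that SUM violated a constraint on county_populations, which input record would be the "bad" one? Every record in the external file contributed to the aggregate, so there's no single offending record for the access driver to divert to a bad file.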

Is this semantic difference between the SQL*Loader bad file mechanism and the external table bad file mechanism a bad thing? I don't think so. It may be inconvenient if you're not expecting it, but I don't think it's fair to say that it's bad, only different. If you're used to using SQL*Loader to load data, and you decide to make the switch to using external tables, be sure you understand that with external tables, any input record that causes a SQL error will cause the entire load to fail. Such a record won't be written to a bad file, and any good records that were loaded will be lost when the effects of the failed statement are rolled back.