Regular Expression Anti-Patterns

Oracle Database 10g's new regular expression support is very powerful when it comes to detecting patterns in text. However, you face an interesting problem when it comes to finding text that comes close to matching a pattern while still violating that pattern in some way.

For example, you can use the following regular expression to search for 10-digit phone numbers in a free-form comment column:

[0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}

This pattern will detect phone numbers in the following U.S.-centric patterns:

906-387-1698      906.387.1698      906 387 1698

Three possible separators are allowed between digit-groups: a hyphen (-), a period (.), and a space. Well-formed phone numbers use the same separator between both digit groups. I'm ignoring, for this article, the (906) 387-1698 format in which area-codes are in enclosed within parentheses. Consider now, the problem of identifying comments containing phone numbers such as 906-387.1698 and 906-387 1698, that are written with mismatched separators. How do you search for text that doesn't match a pattern?

Note: A script to create the example table and data is available at the end of this article.

Attempt #1 - Negating a Backreference

In the regular expression given earlier, I use the backreference \1 to require that the second delimiter match the first. My first thought was that it'd be nice to have a way to negate a backreference, to specify in a given location that I want anything but what the backreference resolves to. There is no syntax to negate a backreference, but I did think to try enclosing the backreference within a negated bracket-expression:

SQL> SELECT emp_id, text
  2  FROM employee_comment
  3  WHERE REGEXP_LIKE(text, '[0-9]{3}([-. ])[0-9]{3}[^\1][0-9]{4}');

    EMP_ID TEXT
---------- ------------------------------------------------------------
      7369 126 Varnum, Edmore MI 48829, 989 313-5351
      7499 1105 McConnell Court
           Cedar Lake MI 48812
           Home: 989-387-4321
           Cell: (237) 438-3333
      7566 Phone number: 989.387.4444, Address: 23 Fleepy Lane, Edmore,
            MI, 48829
      7654 231-898-9823
           1873 Lilac
           Howard City MI 49329
      7844 989-387.5359
      9999 906-387-1698, 313-535.8886
      9998 313-906-235-4330535-8886

When used inside a bracket-expression, the notation \1 is not a backreference. All I did here was to search for phone numbers using other than a backslash or the digit "1" as the second separator, and even the valid phone numbers match that pattern.

Attempt #2 - Using the NOT Operator

Another thought I had was to use NOT REGEXP_LIKE to detect phone numbers not matching my pattern. This seemed a likely solution at first, but the results prove otherwise:

SQL> SELECT emp_id, text
  2  FROM employee_comment
  3  WHERE NOT REGEXP_LIKE(text, '[0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}');

    EMP_ID TEXT
---------- ------------------------------------------------------------
      7369 126 Varnum, Edmore MI 48829, 989 313-5351
      7521 Her address is: 12646 Bramell. Her city and state are Vestab
           urg, MI 48891
           Home phone is 387-1698
      7698 Marion Blake
           N1375 Lakeshore Drive
           Vestaburg MI 48891
           (989) 388-1234
      7782 Academy Apartments, #138, Cedar Lake MI 48812
      7788 #1 Water Tower Lane
           Cedar Lake MI 48812
      7839 100 Magnetic Drive, Edmore MI 48829
      7844 989-387.5359
      7876 Home phone: (231) 453-9999
      7900 Cares for 100-year-old aunt during the day. Schedule only fo
           r evening and night shifts.

Following are some things to notice about these results:

  • The first row, for 7369, contains the sort of badly formatted phone number I'm looking for. The number 989 313-5351 uses two different separator characters: a space and a hyphen. So far, so good. However, the results quickly break down.

  • The second row, for 7521, contains the seven-digit phone number 387-1698, which is perfectly fine. The phone number does not contain mismatched separator characters, and I do not want to see this row in my results.

  • Similarly, the row for 7698 contains a phone number properly formatted with the area-code in parentheses. I don't want this row in my results either.

The problem with my NOT REGEXP_LIKE query is that the results include any comment without a phone number in my specified format, including comments without phone numbers at all, and also including comments with phone numbers in other, equally valid, formats.

Attempt #3 - Extracting and Testing

Finally, I decided that what I really want to do is find comments containing phone numbers in one pattern that do not match a second pattern. I want to begin by considering any phone number matching the following pattern, which I'll call Pattern A:

Pattern A: [0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}

Notice absence of \1 here. This pattern allows mismatched separators, and it defines the universe of phone numbers that I want to consider. Given a phone number matching Pattern A, I then want to detect those that fail to match Pattern B:

Pattern B: [0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}

Aha! Now I've got the problem nailed, or so I thought. I wrote the following query using REGEXP_SUBSTR to extract phone numbers matching Pattern A, and to then find those that do not also match Pattern B:

SQL> SELECT emp_id, text
  2  FROM employee_comment
  3  WHERE NOT REGEXP_LIKE(
  4     REGEXP_SUBSTR(text, '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}'),
  5     '[0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}');

    EMP_ID TEXT
---------- ------------------------------------------------------------
      7369 126 Varnum, Edmore MI 48829, 989 313-5351
      7844 989-387.5359

I was happy for about five-minutes, until I realized that this solution considered only the first candidate phone number in a given comment field. The following row, for 9999, is missed:

SQL> SELECT emp_id, text
  2  FROM employee_comment
  3  WHERE emp_id=9999;

    EMP_ID TEXT
---------- ---------------------------
      9999 906-387-1698, 313-535.8886

This comment for 9999 contains two phone numbers matching Pattern A. Yet my REGEXP_SUBSTR solution only looks at the first such phone number, allowing the fact that this record contains a second, badly formatted phone number to go unnoticed.

I was at an impasse. This problem of detecting badly formatted phone numbers was proving to be far less trivial than I'd first expected. The REGEXP_SUBSTR solution came tantalizingly close to generating the results I was after. How then, could I extend that solution to consider all potential phone numbers within a comment without resorting to procedural code?

Attempt #4 - The Solution!

As is often the case, the solution hit me after a good night's sleep, and at a most unexpected moment: during Cary Millsap's Hotsos Symposium keynote address. The solution lay not in using REGEXP_SUBSTR, but rather in using REGEXP_REPLACE. Why? Because REGEXP_REPLACE is the only regular expression function that can operate on multiple instances of a pattern within a string.

I'll walk you through this solution. To begin, I wanted to detect any row containing at least one occurrence of Pattern A. The following query does that:

SELECT emp_id, text
FROM employee_comment
WHERE REGEXP_LIKE(text, '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}');

Next, I applied REGEXP_LIKE to the task of removing any phone numbers matching Pattern B. The following query won't execute yet, but shows how I used REGEXP_REPLACE to replace any validly formatted phone number with a string of three asterisks:

SELECT emp_id, text
FROM employee_comment
WHERE REGEXP_LIKE(text, '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')
  AND REGEXP_REPLACE(text,
            '[0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}','***')

After eliminating all occurrences of Pattern A that were valid, that also matched Pattern B, any remaining occurrences of Pattern A must, therefor, be invalid. The following query feeds the results from REGEXP_REPLACE into REGEXP_LIKE to find any remaining, invalid patterns:

SQL> SELECT emp_id, text
  2  FROM employee_comment
  3  WHERE REGEXP_LIKE(text, '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')
  4    AND REGEXP_LIKE(
  5           REGEXP_REPLACE(text,
  6              '[0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}','***'),
  7           '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}');

    EMP_ID TEXT
---------- ----------------------------------------------------------
      7369 126 Varnum, Edmore MI 48829, 989 313-5351
      7844 989-387.5359
      9999 906-387-1698, 313-535.8886

These results are what I've been after. They show all comments containing phone numbers matching Pattern A, but with mismatched separator characters.

Why the Asterisks?

You might wonder why, in my final solution, I chose to replace each occurrence of Pattern A with asterisks rather than simply deleting the text. The following row illustrates why I didn't simply delete occurrences of Pattern A:

SQL> SELECT emp_id, text
  2  FROM employee_comment
  3  WHERE emp_id=9998;

    EMP_ID TEXT
---------- -------------------------
      9998 313-906-235-4330535-8886

There are two ways of looking at the number in this row:

  • It's not a phone number at all, in which case I don't want the row in my results.

  • It's the phone number 906-235-4330 surrounded by some other numbers, but 906-235-4330 uses consistent separators, so again, I don't want the row in my results.

Had I simply deleted occurrences of Pattern A, I'd have converted "313-906-235-4330535-8886" into "313-535-8886", Pattern A would be found in the resulting text, and the row would show up erroneously in my results.

Some Thoughts on Negating Backreferences

For awhile, I thought it would be nice to have a mechanism for negating backreferences. What if I could use ^1 to look for text not matching that returned by \1? Then I could use the following pattern to find mismatched separators:

[0-9]{3}([-. ])[0-9]{3}^1[0-9]{4}

Alas, while this approach of using a notation like ^1 to negate a backreference seems desirable on the surface, it's implementation is fraught with ambiguity. If \1 resolves to a hyphen, then ^1 would match a period or a space, but ^1 would also match any string of characters other than a single hyphen. For example, a string such as the following would necessarily have to match the preceding regular expression:

110-000 is the wrong way to write the postal code for New Delhi. The correct way is 110000.

Why would this match? Because it contains three digits, followed by a hyphen, followed by a whole bunch of characters not matching that hyphen, followed by four more digits. Don't let those final two digits mislead you. Their presence or absence wouldn't change things.

Even my early idea of using [^\1] to create a bracket-expression matching other than the separator returned by \1 isn't all that workable. Were such notation to be accepted, the resulting bracket-expression would match any character other than that returned by \1. However, I want to limit the possible universe of separators to just three characters: a space, a hyphen, and a period.

Conclusion

When you find yourself searching for text that does not match a given pattern, you may find that what you really want is to find text matching one pattern, but not another. In such cases, you can apply the following, general solution:

  1. Refer to the more inclusive pattern as Pattern A. This pattern should include properly formatted values, as well as those values that are improperly formatted.

  2. Refer to the "good" pattern, as Pattern B.

  3. Use REGEXP_LIKE to detect text containing Pattern A.

  4. Use REGEXP_REPLACE to eliminate occurrences of Pattern B. You may want to replace Pattern B with some block of text, such as a few asterisks, that won't throw off the results generated in Step 5.

  5. Use REGEXP_LIKE to again detect occurrences of Pattern A. Any such occurrences detected now must necessarily not match Pattern B, and thus will represent improperly formatted text.

I hope you've enjoyed this article. I doubt that it's the last word on this topic. If you have any thoughts on my solution, or any alternative approaches to suggest, please send me an email. I'd love to hear from you.

Example Data

You can use the following table and data to follow along with the examples in the article:

DROP TABLE employee_comment;

CREATE TABLE employee_comment (
   emp_id NUMBER(5),
   text VARCHAR2(500));

INSERT INTO employee_comment VALUES (7369,
'126 Varnum, Edmore MI 48829, 989 313-5351');

INSERT INTO employee_comment VALUES (7499,
'1105 McConnell Court
Cedar Lake MI 48812
Home: 989-387-4321
Cell: (237) 438-3333');

INSERT INTO employee_comment VALUES (7521,
'Her address is: 12646 Bramell. Her city and state are Vestaburg, MI 48891
Home phone is 387-1698');

INSERT INTO employee_comment VALUES (7566,
'Phone number: 989.387.4444, Address: 23 Fleepy Lane, Edmore, MI, 48829');

INSERT INTO employee_comment VALUES (7654,
'231-898-9823
1873 Lilac
Howard City MI 49329');

INSERT INTO employee_comment VALUES (7698,
'Marion Blake
N1375 Lakeshore Drive
Vestaburg MI 48891
(989) 388-1234');

INSERT INTO employee_comment VALUES (7782,
'Academy Apartments, #138, Cedar Lake MI 48812');

INSERT INTO employee_comment VALUES (7788,
'#1 Water Tower Lane
Cedar Lake MI 48812');

INSERT INTO employee_comment VALUES (7839,
'100 Magnetic Drive, Edmore MI 48829');

INSERT INTO employee_comment VALUES (7844,
'989-387.5359');

INSERT INTO employee_comment VALUES (7876,'Home phone: (231) 453-9999');

INSERT INTO employee_comment VALUES (7900,
'Cares for 100-year-old aunt during the day. Schedule only for evening and night shifts.');

INSERT INTO employee_comment VALUES (9998,
'313-906-235-4330535-8886');

INSERT INTO employee_comment VALUES (9999,
'906-387-1698, 313-535.8886');