Oracle® Database Advanced Application Developer's Guide
11g Release 2 (11.2)

Part Number E10471-02

3 Using Regular Expressions in Database Applications

This chapter explains how to use regular expressions in database applications.

Overview of Regular Expressions

What Are Regular Expressions?

Regular expressions enable you to search for patterns in string data by using standardized syntax conventions. You specify a regular expression through these types of characters:

  • Metacharacters, which are operators that specify search algorithms

  • Literals, which are the characters for which you are searching

A regular expression can specify complex patterns of character sequences. For example, this regular expression searches for the literals f or ht, the t literal, the p literal optionally followed by the s literal, and finally the colon (:) literal:

(f|ht)tps?:

The parentheses are metacharacters that group a series of pattern elements into a single element; the pipe symbol (|) matches an alternative in the group. The question mark (?) is a metacharacter indicating that the preceding pattern, in this case the s character, is optional. Thus, the preceding regular expression matches the http:, https:, ftp:, and ftps: strings.
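The following query is a minimal sketch of how the pattern could be tried out. The sample strings and the column aliases are invented for illustration, and the ^ and $ anchors (described later in this chapter) are added so that each sample must match the pattern in full.

```sql
-- Test the pattern (f|ht)tps?: against a few sample strings.
-- The inline view and its values are illustrative only.
SELECT s.sample_text,
       CASE WHEN REGEXP_LIKE(s.sample_text, '^(f|ht)tps?:$')
            THEN 'match' ELSE 'no match' END AS verdict
FROM (
  SELECT 'http:'  AS sample_text FROM DUAL UNION ALL
  SELECT 'https:' FROM DUAL UNION ALL
  SELECT 'ftp:'   FROM DUAL UNION ALL
  SELECT 'ftps:'  FROM DUAL UNION ALL
  SELECT 'smtp:'  FROM DUAL
) s;
```

The first four samples should be reported as matches; smtp: should not, because neither f nor ht can match its leading characters.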

How Are Regular Expressions Useful?

Regular expressions are a powerful text processing component of programming languages such as PERL and Java. For example, a PERL script can process each HTML file in a directory, read its contents into a scalar variable as a single string, and then use regular expressions to search for URLs in the string. One reason that many developers write in PERL is for its robust pattern matching functionality.

Oracle Database support of regular expressions enables developers to implement complex match logic in the database. This technique is useful for these reasons:

  • By centralizing match logic in Oracle Database, you avoid intensive string processing of SQL result sets by middle-tier applications. For example, life science customers often rely on PERL to do pattern analysis on bioinformatics data stored in huge databases of DNA and proteins. Previously, finding a match for a protein sequence such as [AG].{4}GK[ST] was handled in the middle tier. The SQL regular expression functions move the processing logic closer to the data, thereby providing a more efficient solution.

  • Before Oracle Database 10g, developers often coded data validation logic on the client, requiring the same validation logic to be duplicated for multiple clients. Using server-side regular expressions to enforce constraints solves this problem.

  • The built-in SQL and PL/SQL regular expression functions and conditions make string manipulations more powerful and less cumbersome than in previous releases of Oracle Database.
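As a rough illustration of moving match logic into the database, the protein pattern mentioned above can be evaluated directly in SQL. The sequence literal below is made up for this sketch; in practice the source would more likely be a table column.

```sql
-- Locate the motif [AG].{4}GK[ST] in a sample sequence.
-- 'MKGAPTLGKSVV' is a hypothetical value used only for illustration.
SELECT REGEXP_INSTR('MKGAPTLGKSVV',  '[AG].{4}GK[ST]') AS match_position,  -- 3
       REGEXP_SUBSTR('MKGAPTLGKSVV', '[AG].{4}GK[ST]') AS matched_motif    -- GAPTLGKS
FROM DUAL;
```

The match starts at the G in position 3: G satisfies [AG], APTL satisfies .{4}, and GKS satisfies GK[ST].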

Oracle Database Implementation of Regular Expressions

Oracle Database implements regular expression support with a set of Oracle Database SQL functions and conditions that enable you to search and manipulate string data. You can use these functions in any environment that supports Oracle Database SQL. You can use these functions on a text literal, bind variable, or any column that holds character data such as CHAR, NCHAR, CLOB, NCLOB, NVARCHAR2, and VARCHAR2 (but not LONG).

Table 3-1 describes the regular expression functions and conditions.

Table 3-1 SQL Regular Expression Functions and Conditions

SQL Element Category Description
REGEXP_LIKE
Condition

Searches a character column for a pattern. Use this condition in the WHERE clause of a query to return rows matching a regular expression. The condition is also valid in a constraint or as a PL/SQL function returning a boolean.

This WHERE clause filters employees with a first name of Steven or Stephen:

WHERE REGEXP_LIKE(first_name, '^Ste(v|ph)en$')
REGEXP_REPLACE
Function

Searches for a pattern in a character column and replaces each occurrence of that pattern with the specified string.

This function call puts a space after each character in the country_name column:

REGEXP_REPLACE(country_name, '(.)', '\1 ')
REGEXP_INSTR
Function

Searches a string or substring for a given occurrence of a regular expression pattern (a substring) and returns an integer indicating the position in the string or substring where the match is found. You specify which occurrence you want to find and the start position.

This function call performs a boolean test for a valid email address in the email column:

REGEXP_INSTR(email, '\w+@\w+(\.\w+)+') > 0
REGEXP_SUBSTR
Function

Searches a string or substring for a given occurrence of a regular expression pattern (a substring) and returns the substring itself. You specify which occurrence you want to find and the start position.

This function call uses the x flag to match the first string by ignoring spaces in the regular expression:

REGEXP_SUBSTR('oracle', 'o r a c l e', 1, 1, 'x')
REGEXP_COUNT
Function

Returns the number of times a pattern appears in a string. You specify the string and the pattern. You can also specify the start position and matching options (for example, c for case sensitivity).

This function call returns the number of times that e (but not E) appears in the string 'Albert Einstein', starting at character position 7 (the result is 1):

REGEXP_COUNT('Albert Einstein', 'e', 7, 'c')
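The calls in Table 3-1 can be tried against text literals rather than table columns. The following sketch assumes the literals 'Peru' and 'jdoe@example.com' as stand-ins for the country_name and email columns; the column aliases are likewise invented.

```sql
-- Each function from Table 3-1 applied to a literal instead of a column.
SELECT REGEXP_REPLACE('Peru', '(.)', '\1 ')                 AS spaced,     -- 'P e r u ' (note the trailing space)
       REGEXP_INSTR('jdoe@example.com', '\w+@\w+(\.\w+)+')  AS email_pos,  -- 1
       REGEXP_SUBSTR('oracle', 'o r a c l e', 1, 1, 'x')    AS ignore_ws,  -- oracle
       REGEXP_COUNT('Albert Einstein', 'e', 7, 'c')         AS lower_e     -- 1
FROM DUAL;
```

Because REGEXP_INSTR returns 0 when there is no match, comparing its result with 0 (as in the table) turns it into a test.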

A string literal in a REGEXP function or condition conforms to the rules of SQL text literals. By default, regular expressions must be enclosed in single quotation marks. If your regular expression includes the single quotation mark, then enter two single quotation marks to represent one single quotation mark within the expression. This technique ensures that the entire expression is interpreted by the SQL function and improves the readability of your code. You can also use the q-quote syntax to define your own character to terminate a text literal. For example, you can delimit your regular expression with the pound sign (#) and then use a single quotation mark within the expression.
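The two quoting styles described above might look as follows. The source string is an arbitrary example; both queries are expected to return It's.

```sql
-- Style 1: double the single quotation mark inside an ordinary text literal.
SELECT REGEXP_SUBSTR('It''s 5 o''clock', '\w+''\w+') AS doubled_quote
FROM DUAL;

-- Style 2: q-quote syntax with the pound sign (#) as the delimiter,
-- so the single quotation mark can appear unescaped in the pattern.
SELECT REGEXP_SUBSTR('It''s 5 o''clock', q'#\w+'\w+#') AS q_quote
FROM DUAL;
```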

Note:

If your expression comes from a column or a bind variable, then the preceding rules for quotation marks do not apply.

Oracle Database Support for the POSIX Regular Expression Standard

Oracle Database implementation of regular expressions conforms to these standards:

  • IEEE Portable Operating System Interface (POSIX) standard draft 1003.2/D11.2

  • Unicode Regular Expression Guidelines of the Unicode Consortium

Oracle Database follows the exact syntax and matching semantics for these operators as defined in the POSIX standard for matching ASCII (English language) data. You can find the POSIX standard draft at this URL:

http://www.opengroup.org/onlinepubs/007908799/xbd/re.html

Oracle Database enhances regular expression support in these ways:

  • Extends the matching capabilities for multilingual data beyond what is specified in the POSIX standard.

  • Adds support for the common PERL regular expression extensions that are not included in the POSIX standard but do not conflict with it. Oracle Database provides built-in support for some heavily used PERL regular expression operators, for example, character class shortcuts, the "nongreedy" modifier, and so on.

Oracle Database supports a set of common metacharacters used in regular expressions. For information about the action of supported metacharacters and related features, see "Metacharacters in Regular Expressions".

Note:

The interpretation of metacharacters differs between tools that support regular expressions. If you are porting regular expressions from another environment to Oracle Database, ensure that the regular expression syntax is supported and the action is what you expect.

Metacharacters in Regular Expressions

POSIX Metacharacters in Oracle Database Regular Expressions

Table 3-2 lists the metacharacters supported for use in regular expressions passed to SQL regular expression functions and conditions. These metacharacters conform to the POSIX standard; any differences in action from the standard are noted in the "Description" column.

Table 3-2 POSIX Metacharacters in Oracle Database Regular Expressions

Syntax Operator Name Description Example

.

Any Character — Dot

Matches any character in the database character set. If the n flag is set, it matches the newline character. The newline is recognized as the linefeed character (\x0a) on Linux, UNIX, and Windows or the carriage return character (\x0d) on Macintosh platforms.

Note: In the POSIX standard, this operator matches any English character except NULL and the newline character.

The expression a.b matches the strings abb, acb, and adb, but does not match acc.

+

One or More — Plus Quantifier

Matches one or more occurrences of the preceding subexpression.

The expression a+ matches the strings a, aa, and aaa, but does not match bbb.

?

Zero or One — Question Mark Quantifier

Matches zero or one occurrence of the preceding subexpression.

The expression ab?c matches the strings abc and ac, but does not match abbc.

*

Zero or More — Star Quantifier

Matches zero or more occurrences of the preceding subexpression. By default, a quantifier match is "greedy," because it matches as many occurrences as possible while allowing the rest of the match to succeed.

The expression ab*c matches the strings ac, abc, and abbc, but does not match abb.

{m}

Interval—Exact Count

Matches exactly m occurrences of the preceding subexpression.

The expression a{3} matches the string aaa, but does not match aa.

{m,}

Interval—At Least Count

Matches at least m occurrences of the preceding subexpression.

The expression a{3,} matches the strings aaa and aaaa, but does not match aa.

{m,n}

Interval—Between Count

Matches at least m, but not more than n occurrences of the preceding subexpression.

The expression a{3,5} matches the strings aaa, aaaa, and aaaaa, but does not match aa.

[ ... ]

Matching Character List

Matches any single character in the list within the brackets. These operators are allowed within the list, but other metacharacters included are treated as literals:

  • Range operator: -

  • POSIX character class: [: :]

  • POSIX collation element: [. .]

  • POSIX character equivalence class: [= =]

A dash (-) is a literal when it occurs first or last in the list, or as an ending range point in a range expression, as in [#--]. A right bracket (]) is treated as a literal if it occurs first in the list.

Note: In the POSIX standard, a range includes all collation elements between the start and end of the range in the linguistic definition of the current locale. Thus, ranges are linguistic rather than byte values ranges; the semantics of the range expression are independent of character set. In Oracle Database, the linguistic range is determined by the NLS_SORT initialization parameter.

The expression [abc] matches the first character in the strings all, bill, and cold, but does not match any characters in doll.

[^ ... ]

Nonmatching Character List

Matches any single character not in the list within the brackets. Characters not in the nonmatching character list are returned as a match. See the description of the Matching Character List operator for an account of metacharacters allowed in the character list.

The expression [^abc] matches the character d in the string abcdef, but not the character a, b, or c. The expression [^abc]+ matches the sequence def in the string abcdef, but not a, b, or c.

The expression [^a-i] excludes any character between a and i from the search result. This expression matches the character j in the string hij, but does not match any characters in the string abcdefghi.

|

Or

Matches an alternative.

The expression a|b matches character a or character b.

( ... )

Subexpression or Grouping

Treats the expression within parentheses as a unit. The subexpression can be a string of literals or a complex expression containing operators.

The expression (abc)?def matches the optional string abc, followed by def. Thus, the expression matches abcdefghi and def, but does not match ghi.

\n

Back reference

Matches the nth preceding subexpression, that is, whatever is grouped within parentheses, where n is an integer from 1 to 9. The parentheses cause an expression to be remembered; a back reference refers to it. A back reference counts subexpressions from left to right, starting with the opening parenthesis of each preceding subexpression. The expression is invalid if the source string contains fewer than n subexpressions preceding the \n.

Oracle Database supports the back reference expression in the regular expression pattern and the replacement string of the REGEXP_REPLACE function.

The expression (abc|def)xy\1 matches the strings abcxyabc and defxydef, but does not match abcxydef or abcxy.

A back reference enables you to search for a repeated string without knowing the actual string ahead of time. For example, the expression ^(.*)\1$ matches a line consisting of two adjacent instances of the same string.

\

Escape Character

Treats the subsequent metacharacter in the expression as a literal. Use a backslash (\) to search for a character that is normally treated as a metacharacter. Use consecutive backslashes (\\) to match the backslash literal itself.

The expression \+ searches for the plus character (+). It matches the plus character in the string abc+def, but does not match abcdef.

^

Beginning of Line Anchor

Matches the beginning of a string (default). In multiline mode, it matches the beginning of any line within the source string.

The expression ^def matches def in the string defghi but does not match def in abcdef.

$

End of Line Anchor

Matches the end of a string (default). In multiline mode, it matches the end of any line within the source string.

The expression def$ matches def in the string abcdef but does not match def in the string defghi.

[:class:]

POSIX Character Class

Matches any character belonging to the specified POSIX character class. You can use this operator to search for characters with specific formatting such as uppercase characters, or you can search for special characters such as digits or punctuation characters. The full set of POSIX character classes is supported.

Note: In English regular expressions, range expressions often indicate a character class. For example, [a-z] indicates any lowercase character. This convention is not useful in multilingual environments, where the first and last character of a given character class might not be the same in all languages. Oracle Database supports the character classes in Table 3-3 based on character class definitions in Globalization classification data.

The expression [[:upper:]]+ searches for one or more consecutive uppercase characters. This expression matches DEF in the string abcDEFghi but does not match the string abcdefghi.

[.element.]

POSIX Collating Element Operator

Specifies a collating element to use in the regular expression. The element must be a defined collating element in the current locale. Use any collating element defined in the locale, including single-character and multicharacter elements. The NLS_SORT initialization parameter determines supported collation elements. This operator lets you use a multicharacter collating element in cases where only one character is otherwise allowed. For example, you can ensure that the collating element ch, when defined in a locale such as Traditional Spanish, is treated as one character in operations that depend on the ordering of characters.

The expression [[.ch.]] searches for the collating element ch and matches ch in string chabc, but does not match cdefg. The expression [a-[.ch.]] specifies the range a to ch.

[=character=]

POSIX Character Equivalence Class

Matches all characters that are members of the same character equivalence class in the current locale as the specified character.

The character equivalence class must occur within a character list, so the character equivalence class is always nested within the brackets for the character list in the regular expression.

Usage of character equivalents depends on how canonical rules are defined for your database locale. See Oracle Database Globalization Support Guide for more information about linguistic sorting and string searching.

The expression [[=n=]] searches for characters equivalent to n in a Spanish locale. It matches both N and ñ in the string El Niño.
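A few of the operators in Table 3-2 can be exercised with literals, as in the sketch below. It assumes the default binary NLS_SORT setting, since linguistic settings can affect case sensitivity and character ranges; the aliases are illustrative.

```sql
-- Character class, back reference, interval, and a doubled-string test.
SELECT REGEXP_SUBSTR('abcDEFghi', '[[:upper:]]+')  AS upper_run,     -- DEF
       REGEXP_SUBSTR('abcxyabc',  '(abc|def)xy\1') AS back_ref,      -- abcxyabc
       REGEXP_SUBSTR('aaaaaaa',   'a{3,5}')        AS interval_run,  -- aaaaa (greedy, at most five)
       CASE WHEN REGEXP_LIKE('abcabc', '^(.*)\1$')
            THEN 'doubled' ELSE 'not doubled' END  AS doubled_test   -- doubled
FROM DUAL;
```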


See Also:

Oracle Database SQL Language Reference for syntax, descriptions, and examples of the REGEXP functions and conditions

Multilingual Extensions to POSIX Regular Expression Standard

When applied to multilingual data, Oracle Database implementation of the POSIX operators extends beyond the matching capabilities specified in the POSIX standard. Table 3-3 shows the relationship of the operators in the POSIX standard.

  • The first column lists the supported operators.

  • The second column indicates whether the POSIX standard for Basic Regular Expression (BRE) defines the operator.

  • The third column indicates whether the POSIX standard for Extended Regular Expression (ERE) defines the operator.

  • The fourth column indicates whether the Oracle Database implementation extends the operator's semantics for handling multilingual data.

Oracle Database lets you enter multibyte characters directly, if you have a direct input method, or use functions to compose the multibyte characters. You cannot use the Unicode hexadecimal encoding value of the form \xxxx. Oracle Database evaluates the characters based on the byte values used to encode the character, not the graphical representation of the character.

Table 3-3 POSIX and Multilingual Operator Relationships

Operator   POSIX BRE Syntax   POSIX ERE Syntax   Multilingual Enhancement
--------   ----------------   ----------------   ------------------------
\          Yes                Yes                --
*          Yes                Yes                --
+          --                 Yes                --
?          --                 Yes                --
|          --                 Yes                --
^          Yes                Yes                Yes
$          Yes                Yes                Yes
.          Yes                Yes                Yes
[ ]        Yes                Yes                Yes
( )        Yes                Yes                --
{m}        Yes                Yes                --
{m,}       Yes                Yes                --
{m,n}      Yes                Yes                --
\n         Yes                Yes                Yes
[..]       Yes                Yes                Yes
[::]       Yes                Yes                Yes
[==]       Yes                Yes                Yes


PERL-Influenced Extensions to POSIX Regular Expression Standard

Table 3-4 describes PERL-influenced metacharacters supported in Oracle Database regular expression functions and conditions. These metacharacters are not in the POSIX standard, but are common, at least partly because of the popularity of PERL. PERL character class matching is based on the locale model of the operating system, whereas Oracle Database regular expressions are based on the language-specific data of the database. In general, a regular expression involving locale data cannot be expected to produce the same results between PERL and Oracle Database.

Table 3-4 PERL-Influenced Extensions in Oracle Database Regular Expressions

Reg. Exp. Matches . . . Example

\d

A digit character. It is equivalent to the POSIX class [[:digit:]].

The expression ^\(\d{3}\) \d{3}-\d{4}$ matches (650) 555-0100 but does not match 650-555-0100.

\D

A nondigit character. It is equivalent to the POSIX class [^[:digit:]].

The expression \w\d\D matches b2b and b2_ but does not match b22.

\w

A word character, which is defined as an alphanumeric or underscore (_) character. It is equivalent to the POSIX class [[:alnum:]_]. If you do not want to include the underscore character, you can use the POSIX class [[:alnum:]].

The expression \w+@\w+(\.\w+)+ matches the string jdoe@company.co.uk but not the string jdoe@company.

\W

A nonword character. It is equivalent to the POSIX class [^[:alnum:]_].

The expression \w+\W\s\w+ matches the string to: bill but not the string to bill.

\s

A whitespace character. It is equivalent to the POSIX class [[:space:]].

The expression \(\w\s\w\s\) matches the string (a b ) but not the string (ab).

\S

A nonwhitespace character. It is equivalent to the POSIX class [^[:space:]].

The expression \(\w\S\w\S\) matches the string (abde) but not the string (a b d e).

\A

Only at the beginning of a string. In multi-line mode, that is, when embedded newline characters in a string are considered the termination of a line, \A does not match the beginning of each line.

The expression \AL matches only the first L character in the string Line1\nLine2\n, regardless of whether the search is in single-line or multi-line mode.

\Z

Only at the end of a string or before a newline ending a string. In multi-line mode, that is, when embedded newline characters in a string are considered the termination of a line, \Z does not match the end of each line.

In the expression \s\Z, the \s matches the last space in the string L i n e \n, regardless of whether the search is in single-line or multi-line mode.

\z

Only at the end of a string.

In the expression \s\z, the \s matches the newline in the string L i n e \n, regardless of whether the search is in single-line or multi-line mode.

*?

The preceding pattern element 0 or more times ("nongreedy"). This quantifier matches the empty string whenever possible.

The expression \w*?x\w is "nongreedy" and so matches abxc in the string abxcxd. The expression \w*x\w is "greedy" and so matches abxcxd in the string abxcxd. The expression \w*?x\w also matches the string xa.

+?

The preceding pattern element 1 or more times ("nongreedy").

The expression \w+?x\w is "nongreedy" and so matches abxc in the string abxcxd. The expression \w+x\w is "greedy" and so matches abxcxd in the string abxcxd. The expression \w+?x\w does not match the string xa, but does match the string axa.

??

The preceding pattern element 0 or 1 time ("nongreedy"). This quantifier matches the empty string whenever possible.

The expression a??aa is "nongreedy" and matches aa in the string aaaa. The expression a?aa is "greedy" and so matches aaa in the string aaaa.

{n}?

The preceding pattern element exactly n times ("nongreedy"). In this case {n}? is equivalent to {n}.

The expression (a|aa){2}? matches aa in the string aaaa.

{n,}?

The preceding pattern element at least n times ("nongreedy").

The expression a{2,}? is "nongreedy" and matches aa in the string aaaaa. The expression a{2,} is "greedy" and so matches aaaaa.

{n,m}?

The preceding pattern element at least n but not more than m times ("nongreedy"). {0,m}? matches the empty string whenever possible.

The expression a{2,4}? is "nongreedy" and matches aa in the string aaaaa. The expression a{2,4} is "greedy" and so matches aaaa.
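The difference between greedy and nongreedy quantifiers is easiest to see side by side. This sketch reuses the strings and patterns from Table 3-4; only the column aliases are new.

```sql
-- Greedy quantifiers take as much as they can; nongreedy ones take as little.
SELECT REGEXP_SUBSTR('abxcxd', '\w*x\w')  AS greedy_star,         -- abxcxd
       REGEXP_SUBSTR('abxcxd', '\w*?x\w') AS nongreedy_star,      -- abxc
       REGEXP_SUBSTR('aaaaa',  'a{2,4}')  AS greedy_interval,     -- aaaa
       REGEXP_SUBSTR('aaaaa',  'a{2,4}?') AS nongreedy_interval   -- aa
FROM DUAL;
```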


The Oracle Database regular expression functions and conditions support the pattern matching modifiers described in Table 3-5.

Table 3-5 Pattern Matching Modifiers

Mod. Description Example

i

Specifies case-insensitive matching.

This regular expression returns AbCd:

REGEXP_SUBSTR('AbCd', 'abcd', 1, 1, 'i')

c

Specifies case-sensitive matching.

This regular expression fails to match:

REGEXP_SUBSTR('AbCd', 'abcd', 1, 1, 'c')

n

Allows the period (.), which by default does not match newlines, to match the newline character.

This regular expression matches the string only because the n flag is specified:

REGEXP_SUBSTR('a'||CHR(10)||'d', 'a.d', 1, 1, 'n')

m

Performs the search in multi-line mode. The metacharacters ^ and $ signify the start and end, respectively, of any line anywhere in the source string, rather than only at the start or end of the entire source string.

This regular expression returns ac:

REGEXP_SUBSTR('ab'||CHR(10)||'ac', '^a.', 1, 2, 'm') 

x

Ignores whitespace characters in the regular expression. By default, whitespace characters match themselves.

This regular expression returns abcd:

REGEXP_SUBSTR('abcd', 'a b c d', 1, 1, 'x')
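Modifiers can also be combined in one match parameter string. The following sketch uses REGEXP_COUNT to show the effect of the m modifier alone and together with i; the source strings are variations on the Table 3-5 example and the aliases are illustrative.

```sql
-- Count lines that start with "a" followed by any character.
SELECT REGEXP_COUNT('ab'||CHR(10)||'ac', '^a.')          AS default_mode,    -- 1: ^ matches only the start of the string
       REGEXP_COUNT('ab'||CHR(10)||'ac', '^a.', 1, 'm')  AS multiline_mode,  -- 2: ^ matches the start of each line
       REGEXP_COUNT('ab'||CHR(10)||'AC', '^a.', 1, 'im') AS multiline_nocase -- 2: case-insensitive as well
FROM DUAL;
```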

Using Regular Expressions in SQL Statements: Scenarios

Using a Constraint to Enforce a Phone Number Format

Regular expressions are useful for enforcing constraints. For example, suppose that you want to ensure that phone numbers are entered into the database in a standard format. Example 3-1 creates a contacts table and adds a CHECK constraint to the p_number column to enforce this format mask:

(XXX) XXX-XXXX

Example 3-1 Enforcing a Phone Number Format with Regular Expressions

DROP TABLE contacts;
CREATE TABLE contacts (
  l_name    VARCHAR2(30),
  p_number  VARCHAR2(30)
  CONSTRAINT c_contacts_pnf
  CHECK (REGEXP_LIKE (p_number, '^\(\d{3}\) \d{3}-\d{4}$'))
);

Table 3-6 explains the elements of the regular expression.

Table 3-6 Explanation of the Regular Expression Elements in Example 3-1

Regular Expression Element Matches . . .

^

The beginning of the string.

\(

A left parenthesis. The backslash (\) is an escape character that indicates that the left parenthesis after it is a literal rather than a grouping expression.

\d{3}

Exactly three digits.

\)

A right parenthesis. The backslash (\) is an escape character that indicates that the right parenthesis after it is a literal rather than a grouping expression.

(space character)

A space character.

\d{3}

Exactly three digits.

-

A hyphen.

\d{4}

Exactly four digits.

$

The end of the string.


Example 3-2 Inserting Phone Numbers in Correct and Incorrect Formats

These are correct:

INSERT INTO contacts (p_number) VALUES('(650) 555-0100');
INSERT INTO contacts (p_number) VALUES('(215) 555-0100');
 

These generate CHECK constraint errors:

INSERT INTO contacts (p_number) VALUES('650 555-0100');
INSERT INTO contacts (p_number) VALUES('650 555 0100');
INSERT INTO contacts (p_number) VALUES('650-555-0100');
INSERT INTO contacts (p_number) VALUES('(650)555-0100');
INSERT INTO contacts (p_number) VALUES(' (650) 555-0100');
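To see which values the constraint would accept without inserting any rows, the same pattern can be evaluated in a query. This is only a sketch: the inline view and the aliases candidate and check_result are invented for illustration.

```sql
-- Apply the constraint's pattern to candidate values without touching the table.
SELECT c.candidate,
       CASE WHEN REGEXP_LIKE(c.candidate, '^\(\d{3}\) \d{3}-\d{4}$')
            THEN 'accepted' ELSE 'rejected' END AS check_result
FROM (
  SELECT '(650) 555-0100' AS candidate FROM DUAL UNION ALL
  SELECT '650-555-0100'   FROM DUAL UNION ALL
  SELECT '(650)555-0100'  FROM DUAL
) c;
```

Only the first candidate follows the (XXX) XXX-XXXX format mask, so it is the only one expected to be reported as accepted.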

Using Back References to Reposition Characters

As explained in Table 3-2, back references store matched subexpressions in a temporary buffer, enabling you to reposition characters. You access buffers with the \n notation, where n is a number in the range from 1 through 9. Each subexpression is enclosed in parentheses, and subexpressions are numbered from left to right, in the order of their opening parentheses.
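A small sketch of the numbering, using literals chosen for illustration: the first query swaps two words, and the second shows that nested subexpressions are numbered by the position of their opening parenthesis.

```sql
-- \1 is the first word and \2 the second; the replacement swaps them.
SELECT REGEXP_REPLACE('hello world', '^(\S+)\s(\S+)$', '\2 \1') AS swapped  -- world hello
FROM DUAL;

-- Nested groups: \1 is (a)(b) as a whole, \2 is a, and \3 is b.
SELECT REGEXP_REPLACE('ab', '((a)(b))', '\3\2|\1') AS nested  -- ba|ab
FROM DUAL;
```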

Example 3-3 creates a table, populates it with names in different formats, and uses a query that repositions names that are in the format "first middle last" to the format "last, first middle". It ignores names not in the format "first middle last". The elements of the regular expression in the query are explained in Table 3-7.

Example 3-3 Using Back References to Reposition Characters

Create and populate table:

DROP TABLE famous_people;
CREATE TABLE famous_people (names VARCHAR2(20));
INSERT INTO famous_people (names) VALUES ('John Quincy Adams');
INSERT INTO famous_people (names) VALUES ('Harry S. Truman');
INSERT INTO famous_people (names) VALUES ('John Adams');
INSERT INTO famous_people (names) VALUES (' John Quincy Adams');
INSERT INTO famous_people (names) VALUES ('John_Quincy_Adams');

SQL*Plus formatting command:

COLUMN "names after regexp" FORMAT A20

Repositioning query:

SELECT names "names",
  REGEXP_REPLACE(names, '^(\S+)\s(\S+)\s(\S+)$', '\3, \1 \2')
    AS "names after regexp"
FROM famous_people;
 

Result:

names                names after regexp
-------------------- --------------------
John Quincy Adams    Adams, John Quincy
Harry S. Truman      Truman, Harry S.
John Adams           John Adams
 John Quincy Adams    John Quincy Adams
John_Quincy_Adams    John_Quincy_Adams
 
5 rows selected.

Table 3-7 Explanation of the Regular Expression Elements in Example 3-3

Regular Expression Element Description

^

Matches the beginning of the string.

$

Matches the end of the string.

(\S+)

Matches one or more nonspace characters. The parentheses are not escaped so they function as a grouping expression.

\s

Matches a whitespace character.

\1

Substitutes the first subexpression, that is, the first group of parentheses in the matching pattern.

\2

Substitutes the second subexpression, that is, the second group of parentheses in the matching pattern.

\3

Substitutes the third subexpression, that is, the third group of parentheses in the matching pattern.

,

Inserts a comma character.
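Because REGEXP_REPLACE returns nonmatching names unchanged, it can be useful to list only the rows that the repositioning pattern applies to. The following sketch assumes the famous_people table from Example 3-3 and reuses its pattern without the grouping parentheses, which are not needed for a pure match test.

```sql
-- List only the names in "first middle last" format.
SELECT names
FROM famous_people
WHERE REGEXP_LIKE(names, '^\S+\s\S+\s\S+$');
```

With the rows inserted in Example 3-3, this query should return John Quincy Adams and Harry S. Truman, the same two names that the repositioning query changed.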