What is a Regular Expression?
Purpose
A regular expression is a flexible way of defining patterns of text. It is a formal language which is interpreted by a regular expression engine (which might be part of an application or a programming language) that parses input text and compares it to the regular expression, and then performs operations on text that matches the regular expression.
Common uses of regular expressions include:
- Matching text
- Substituting text
- Extracting text
Syntax
The basic syntax of a regular expression is /pattern/flags. The main part is the text pattern description, and the flags control the behaviour of the regular expression engine.
Different regular expression engines support different features, and also vary slightly in their syntax. After an overview of general regexp syntax we will look at some common applications and languages and how they support regular expressions.
Examples
%\d\d?/\d\d?/\d\d\d?\d?%
This will match something that looks like a date, in a format like dd/mm/yyyy or m/d/yy. Note that it does not check that it is a valid date; a string like 75/33/9876 would match. Also note that a percentage mark has been used as the regexp delimiter; this can be clearer when the pattern contains slashes.
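As a rough illustration of the point about validity, the same pattern can be tried out in Python's re module. This is only a sketch: the function name find_date_like and the sample strings are made up for the example, and the % delimiters are dropped because Python takes the pattern as a plain string.

```python
import re

# The example pattern, without the % delimiters, as a Python raw string.
DATE_LIKE = re.compile(r"\d\d?/\d\d?/\d\d\d?\d?")


def find_date_like(text):
    """Return every substring of text that looks like a date."""
    return DATE_LIKE.findall(text)


# Both the long and the short format are found.
print(find_date_like("Invoiced 25/12/2007, paid 1/2/08"))
# The pattern only checks the shape, so an impossible date still matches.
print(find_date_like("Not a real date: 75/33/9876"))
```

Checking that the day and month are in range is better done in ordinary code after the match than in the pattern itself.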
/<p( [^>]*)?>.*?<\/p>/m
This regular expression will match a paragraph element and its contents in an HTML document.
/ (?: (?:(-?\d{1,3})m {0,3}(-?\d{1,4})y (?:\(( {0,2}-?\d{1,2})\))?) | (?: {0,2} (?:(-?\d{1,3})\/(\d)|(Junct))[ ] - {0,2} (?:(-?\d{1,3})\/(\d)|(Junct)) ) ) \s+ (C|C&A&T|C & [AT]|L[234]|\*L4|SD[12])? \s+ (?:\(((?:[ -]\d{2})|(?:\d\.\d))\))? \s+ (-?\d+)? \s+ (ALIG35|AL70|GAUGE|MT70|[LR]TOP | TW[35]M|CYC(?:[69]_|1[38])(?:BO|[LR]T)) \s*=?\s* (-?\d{1,3}\.\d+)mm (?:\(1: ?(\d{2,3})\))? \s+ \[ {0,2}(\d{1,3})\] \s*(.*) (?: +> :)_+ *_+:_+\/_+\/_+ \s+ (?:to +(?:(-?\d{1,3})m {0,3}(-?\d{1,4})y[ ] (?:\(( {0,2}-?\d{1,2})\))?): *(\d+)cycles)? \s+ ((?:P )?(?:IN)?VALID (?: BUT OFF ROUTE)?|OFF ROUTE|UNVERIFIED)? /gmx
This is a much more complex example. It is a regular expression that was written to match text in reports produced by a legacy system. These reports had been designed to be printed and read; by using a regular expression it was possible to parse the report and extract the important information from it. This regular expression matches a group of lines in the report and captures the bits of data that we are interested in. It would be possible to use other methods to parse this report, but the flexibility of regular expressions makes them well suited to coping with the quirks of the report formatting produced by different versions of the legacy software; the use of alternation and variable matches means that this regexp can match all formats of the report, instead of the parsing code having to be rewritten for each version.
Regexp elements
Characters
Normal characters
Normal characters match themselves only.
a b c X Y Z 0 1 2 3 4 5 6 7 8 9 " _ = #
Special characters
More exotic characters are matched using character sequences.
.
- The dot character will match almost any single character. It does not usually match line break characters, unless a flag is set to change this (the s flag in Perl).
\* \? \} \[ \] \/ \\ \^ \$
- If you need to match a literal character that has a special meaning in regular expressions then it needs to be escaped using a backslash.
\n \t \e \a
- There are several predefined sequences for non-printable characters. \n is a new line character, \t is a tab, \e is an escape character and \a is a bell. These will be familiar to anyone who has used C or many other programming languages.
\xB0 \u0260
- Some regular expression engines allow arbitrary hexadecimal or Unicode code points to be represented using a \x00 or \u0000 syntax.
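The effect of escaping can be checked with a short Python sketch. The helper matches_whole and the sample strings are invented for this illustration; re.fullmatch and re.escape are standard functions in Python's re module, which is only one of many engines and does not support every sequence listed above (\e, for instance, is left out here).

```python
import re


def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# An unescaped dot matches any single character, not just a full stop.
print(matches_whole(r"3.14", "3x14"))
# Escaping the dot makes it match only a literal full stop.
print(matches_whole(r"3\.14", "3x14"))
print(matches_whole(r"3\.14", "3.14"))

# \t is a tab; \xB0 is the degree sign written as a hexadecimal code point.
print(matches_whole(r"a\tb", "a\tb"))
print(matches_whole(r"20\xB0C", "20\u00b0C"))

# re.escape() adds whatever backslashes are needed to match text literally.
literal = re.escape("1+1=2?")
print(matches_whole(literal, "1+1=2?"))
```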
Character classes
By using [square brackets] you can match any of several different characters.
Collections of characters
[abc]
- The simplest form is a list of characters in square brackets; this will match any one of those characters.
[0-9] [a-z] [0-9a-zA-Z]
- To make it simpler to match a large number of possible characters you can specify ranges.
[-+0-9]
- Simple characters and ranges can be combined as shown above. Note that, because the hyphen has a special meaning for ranges, to match a literal hyphen you can place it at the start of a character class (alternatively you can escape it with a backslash).
[^abc]
- Negation is done by having a caret at the start of a character class. The above example will match any character apart from a, b, or c.
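The character classes above can be tried out with Python's re.findall, which returns every non-overlapping match. This is a small sketch; the variable names and sample strings are chosen purely for illustration.

```python
import re

# A simple list: any one of a, b or c.
any_of = re.findall(r"[abc]", "a big cab")

# Ranges combined in one class, repeated to pick out alphanumeric runs.
alnum_runs = re.findall(r"[0-9a-zA-Z]+", "Room 42, Block B")

# A leading hyphen is literal, so this class matches signs and digits.
signed = re.findall(r"[-+0-9]+", "offsets: +12 -7 3")

# A leading caret negates the class: anything except a, b or c.
not_abc = re.findall(r"[^abc]", "cabbage")

print(any_of, alnum_runs, signed, not_abc)
```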
Pre-defined character classes
There are many predefined shorthand sequences for commonly used character classes.
\d
- Any digit.
\D
- Any character other than a digit.
\s
- Any space character, e.g. space, tab.
\S
- Any non-space character.
\w
- Any word character. The definition of word characters can vary, but it usually means any letter, any digit, or an underscore.
\W
- Any non-word character.
[[:alpha:]]
- Any letter character. This is an example of a POSIX character class. Note the double square brackets used here; the POSIX character class is [:alpha:] which can only be used inside the normal square brackets for character classes. POSIX character classes can be combined with other elements within a character class, e.g. [[:alpha:]ab[:digit:]].
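The shorthand classes behave as follows in Python's re module, used here only as a convenient engine for a sketch; the sample strings and variable names are made up. POSIX classes such as [[:alpha:]] are left out because the re module does not support them.

```python
import re

# \d picks out the digits; \D is everything else, so removing \D keeps digits.
digits = re.findall(r"\d", "Room 42b")
digits_only = re.sub(r"\D", "", "+44 (0)191 555-0100")

# \s matches spaces and tabs alike, which makes it handy for splitting fields.
fields = re.split(r"\s+", "foo  bar\tbaz")

# \w covers letters, digits and underscore, so a hyphen ends a word.
words = re.findall(r"\w+", "snake_case, kebab-case")
non_word = re.findall(r"\W", "a-b c")

print(digits, digits_only, fields, words, non_word)
```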
Repetition
Quantifiers are used to control repetitive matching. Greedy quantifiers will try to match as much text as possible; lazy quantifiers will try to match as little as possible. Lazy quantifiers are used much less frequently than greedy quantifiers.
Normal quantifiers
ab?c
- The question mark character will match either zero or one occurrence of the preceding expression. The above example will match either ac or abc, preferring the latter if possible.
ab*c
- The asterisk character matches zero or more occurrences. The above example will match ac, abc, abbc, abbbc, …
ab+c
- The plus character matches one or more occurrences. The above example will match abc, abbc, abbbc, …
Range quantifiers
ab{3}c
- A number inside braces indicates an exact number of occurrences. The above will only match abbbc.
ab{2,4}c
- Two numbers inside braces, separated by a comma, indicate a range of occurrences. This example will match abbc, abbbc, or abbbbc.
ab{2,}c
- Omitting the second number, but keeping the comma, gives a minimum number of occurrences. This example will match abbc, abbbc, abbbbc, …
ab{,3}c
- Omitting the first number gives a maximum number of occurrences. This example will match ac, abc, abbc, or abbbc. Not every engine supports this form; ab{0,3}c is the portable equivalent.
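One way to check which strings each quantifier accepts is to test a list of candidates against every pattern. The sketch below does this with Python's re.fullmatch; the helper accepted and the candidate list are invented for the example, and Python happens to be one of the engines that accepts the {,3} form.

```python
import re

CANDIDATES = ["ac", "abc", "abbc", "abbbc", "abbbbc", "abbbbbc"]


def accepted(pattern, candidates=CANDIDATES):
    """Return the candidates that the pattern matches in full."""
    return [text for text in candidates if re.fullmatch(pattern, text)]


# Print each quantifier pattern with the candidates it accepts.
for pattern in [r"ab?c", r"ab*c", r"ab+c",
                r"ab{3}c", r"ab{2,4}c", r"ab{2,}c", r"ab{,3}c"]:
    print(pattern, accepted(pattern))
```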
Lazy quantifiers
ab??c
- Like ab?c this will match either ac or abc, but the double question mark will make it prefer to match the former if this is possible.
ab*?c
- This will match the same set of possibilities as ab*c, but if there are several possible matches then it will match as few b characters as it can.
ab+?c
- This is the lazy equivalent of ab+c.
ab{1,2}?c
- Similarly, putting a question mark after a range quantifier makes it lazy.
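The difference between greedy and lazy matching only shows when more than one match is possible, so the sketch below drops the trailing c from the examples and adds a markup example. It uses Python's re module; the sample text and variable names are assumptions made for the illustration.

```python
import re

# With nothing after the quantifier, greedy takes the b and lazy leaves it.
greedy_optional = re.match(r"ab?", "abc").group()
lazy_optional = re.match(r"ab??", "abc").group()

# The same applies to range quantifiers.
greedy_range = re.match(r"ab{1,2}", "abbb").group()
lazy_range = re.match(r"ab{1,2}?", "abbb").group()

# A common practical case: greedy .* runs to the last closing tag,
# lazy .*? stops at the first one.
text = "<b>one</b> and <b>two</b>"
greedy_tags = re.findall(r"<b>.*</b>", text)
lazy_tags = re.findall(r"<b>.*?</b>", text)

print(greedy_optional, lazy_optional, greedy_range, lazy_range)
print(greedy_tags, lazy_tags)
```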
Alternation, grouping and matching
Alternation
a|b
- Matching one of a set of alternatives is done by using the pipe operator. This will match either a or b.
foo|bar
- The alternation operator has very low precedence, in particular lower than a sequence of characters. This means that this example will match either foo or bar, not fooar or fobar.
foo|bar|baz
- Matching one of more than two possibilities is simply done by using multiple pipe operators. This will match any one of foo, bar, or baz.
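The low precedence of the pipe can be confirmed with a quick Python sketch. The helper matches_whole is made up for the example, and the pattern fo(o|b)ar is included only as a contrast, to show what would be needed to match fooar or fobar.

```python
import re


def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# foo|bar means "foo" or "bar": the pipe splits the whole pattern.
print(matches_whole(r"foo|bar", "foo"), matches_whole(r"foo|bar", "bar"))
print(matches_whole(r"foo|bar", "fooar"), matches_whole(r"foo|bar", "fobar"))

# Limiting the alternation to one character needs parentheses.
print(matches_whole(r"fo(o|b)ar", "fooar"), matches_whole(r"fo(o|b)ar", "fobar"))

# Several alternatives in a row, found wherever they occur in the text.
found = re.findall(r"foo|bar|baz", "bazfoobar")
print(found)
```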
Grouping and matching
foo(bar)?
- Parentheses group a set of characters together. Here the ? quantifier applies to everything inside the brackets, so this will match either foo or foobar.
foo(bar|foo)
- Parentheses can be combined with other operators, such as the pipe alternation operator. This will match either foobar or foofoo.
(fooba[rz])
- As well as grouping characters together, parentheses are used to capture elements within a regular expression which can then be examined later on. This will match either foobar or foobaz and the matching text will be captured; in sed it will be stored as \1, in perl in the variable $1.
Grouping without matching
(?:foo)
- If you want to group a set of characters together without capturing them, then the (?:…) operator will do this.
Positional markers
As well as matching text itself, you can control where the text occurs by using positional markers. These markers do not match any text themselves, but control where the other patterns in the regular expression are able to match text.
Beginning/end of lines
^foo
- ^ matches the start of a line or piece of text. This example will only match foo if it is at the start of a line.
bar$
- $ matches the end of a line or the end of the text. This will match bar when it is at the end of a line.
Beginning and end of words
\b
- This matches word boundaries. In the string foo bar it will match at the start of the string, between the o and the space at the end of the word foo, between the space and the b at the start of the word bar, and at the end of the string.
\B
- This is the opposite of \b and will match anywhere other than a word boundary, i.e. in the middle of words, and within sequences of non-word characters.
Lookaround
foo(?=bar)
- This is a positive lookahead: it will match if the text contains foobar, but will only match the foo part, and not the bar part.
foo(?!bar)
- This is a negative lookahead: it will match foo, unless it is immediately followed by bar.
(?<=foo)bar
- This is a positive lookbehind: it will match the text bar, but only if it occurs as foobar. The text foo will not be part of the match.
(?<!foo)bar
- This is a negative lookbehind: it will match bar unless it is preceded by foo.
Flags
Flags control the overall behaviour of the regular expression.
i
- The i (insensitive) flag tells the regular expression to match in a case insensitive manner. /foo/ will only match foo, but /foo/i will also match FOO, fOo, and so on.
g
- The g (global) flag tells the regexp engine to match all possible instances of the regular expression. Normally it will stop after the first match, but if this flag is set then it will look for any further matches.
m
- The m (multiline) flag is for matching against text that spans more than one line. Its exact effect varies between engines. In Perl it makes ^ and $ match at the start and end of every line, rather than only at the start and end of the text, and a separate s flag makes the dot match line end characters. In some other engines, such as Ruby, the m flag itself makes the dot match line end characters.
x
- Unlike the other flags, this does not alter the behaviour of the regexp engine. Instead it allows you to write more legible regular expressions by splitting them across multiple lines: white space in the pattern, including line breaks, is ignored, so a literal space has to be escaped or written as [ ]. The earlier example used this flag to break up a very long regular expression.
Programs and Languages
grep
grep is a simple program usually used to extract lines from text files that match a given pattern. It is often used to match plain character sequences, so there are very few special characters: most regular expression operators have to be preceded by a backslash to give them their special meaning. Exceptions to this are the * quantifier and the ^ and $ anchors, which work as normal.
Supported syntax:
- Metacharacters: . \n \t \s \S \w \W and POSIX character classes, e.g. [[:digit:]]
- Repetition: \? * \+ \{n,m\}
- Alternation and grouping: \| \(…\)
- Anchoring: ^ $ \b \B \< \>
Examples
grep 'FIXME\|TODO' */*.p[lm]
This will print any lines containing either FIXME or TODO from perl files.
egrep
Grep also has an extended mode which makes most of the operator characters behave as normal, so you do not need to prefix them with a backslash like in its basic mode. If you are using anything other than very simple regular expressions with grep then it is best to use this mode.
Supported syntax:
- Metacharacters: . \n \t \s \S \w \W and POSIX character classes, e.g. [[:digit:]]
- Repetition: ? * + {n,m}
- Alternation and grouping: | (…)
- Anchoring: ^ $ \b \B \< \>
Examples
grep -E 'FIXME|TODO' **/*.p[lm]
sed
sed performs operations on streams of characters. The most common operation is to replace strings, but many more powerful things are possible. Its regexp syntax is very similar to the basic mode of grep.
Supported syntax:
- Metacharacters: . \n \t \s \S \w \W and POSIX character classes, e.g. [[:digit:]]
- Repetition: * \? \+ \{m,n\}
- Alternation and grouping: \| \( \)
- Anchors: ^ $ \b \B \` \'
Examples
sed 's%\([0-9][0-9]\)/\([0-9][0-9]\)/\([0-9][0-9]\)%20\3-\1-\2%'
This will transform dates from the format mm/dd/yy to the format yyyy-mm-dd, assuming that the date is in the 21st century. Note that [0-9] is used because sed does not support the \d shorthand.
sed '/^__END__$/,$d' foo.pl
This will strip the POD, and anything else that follows a __END__ line, from a perl file.
sed 's/^\s\+//;s/\s\+$//'
This strips all leading and trailing spaces from text.
sed 's/^\s*\(.*\S\)\?\s*$/\1/'
This also strips leading and trailing spaces from text. The previous example uses two statements, one for leading space and one for trailing space; this one uses a single statement with a backreference. This approach is much less efficient and will be several orders of magnitude slower than the previous example due to the increased memory requirements from the backreference.
perl
Perl has by far the most comprehensive support for regular expression features. Many features appear first in Perl before being copied by other languages and programs.
The Perl regular expression syntax is used in many applications and other programming languages through the PCRE library. This library is used by PHP, the Apache webserver, the Exim mailserver, and many others.
Supported syntax:
- Metacharacters: all metacharacters are supported.
- Repetition: ? * + {n,m} ?? *? +? {n,m}?
- Alternation and grouping: | (…) (?:…)
- Anchors: ^ $ \b \B (?=…) (?!…) (?<=…) (?<!…)
Perl Example 1
# fragment: $template, $elr and $tid are assumed to be declared elsewhere
if ( /^ *ELR : +([A-Z]{3}\d?|[A-Z]{2}\d{2})/o ) {
    $elr = $1;
} elsif ( /^ *Track Id : +(\d{4})/o ) {
    $tid = $1;
} elsif ( /^ *\d{1,3}.\d{4}/o) {
    my @data = unpack($template, $_);
    # work backwards from the last index so that splicing does not shift
    # the elements still to be visited
    for (my $i = $#data; $i >= 0; --$i) {
        if ($i % 2 == 0) {
            # every other element is a separator -- delete these
            splice(@data, $i, 1);
        } else {
            # remove leading/trailing spaces
            $data[$i] =~ s/^ +//;
            $data[$i] =~ s/ +$//;
        }
    }
    print $elr, $tid, @data;
}
Perl Example 2
# decodes a standard deviation value
my %errcodes = ( NA => -1, NF => -2, NV => -3, SS => -4, ST => -5 );
# alternation of the error codes, e.g. NA|NF|NV|SS|ST, for use in the regexp
my $errcodes = join '|', keys %errcodes;
sub sdval {
    my $val = $_[0];
    if ($val =~ m/\d\.\d/) {
        return $val;
    } elsif ($val =~ m/\*\*/) {
        return 10;
    } elsif ($val =~ m/($errcodes)/) {
        return $errcodes{$1};
    } else {
        return "";
    }
}
Further reading
- perldoc perlretut
- Mastering Regular Expressions by Jeffrey E F Friedl. 2nd edition published by O'Reilly, 2002.
Based on a talk presented by Oliver Burnett-Hall at Durham LUG on 17 February 2008.