separated by blank spaces?” The word “separated” means that there is a blank between words, but not before
the first or after the last. What we need here is to present the last word as a “special case,”
one that does not include the space. How?
^([A-Za-z][a-z]* ){1,4}[A-Za-z][a-z]*$
Now our expression says “one to four words, each followed by a space, followed by one
additional word.” Thus, our words are separated by spaces, but there is no space before the
first word or after the last word.
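The behavior of this expression can be checked with a short script. The sketch below is only an illustration: it assumes Python's re module as a stand-in for a Linux regex tool, and the function name and sample strings are hypothetical. As written, the regex accepts two to five words.

```python
import re

# Regex from the text: one to four words, each followed by a single space,
# then one final word with no trailing space.
WORDS = re.compile(r"^([A-Za-z][a-z]* ){1,4}[A-Za-z][a-z]*$")


def is_space_separated_words(text: str) -> bool:
    """Return True if text is two to five words separated by single spaces."""
    return WORDS.search(text) is not None


if __name__ == "__main__":
    samples = [
        "Hello world",
        "one two three four five",
        "single",
        " leading space",
        "trailing space ",
        "a b c d e f",
    ]
    for sample in samples:
        print(repr(sample), is_space_separated_words(sample))
```

Only the first two samples should be accepted; the others fail because they contain a single word, a leading or trailing space, or more than five words.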
6.3 EXAMPLES
In this section, we reinforce the discussion of Section 6.2 by providing numerous examples
and descriptions.
• [0-9]+ Match if the string contains at least one digit.
• ^[0-9]+ Match if the string starts with at least one digit.
214
◾
Linux with Operating System Concepts
• ^[0-9]+$ Match if the string only consists of digits. The empty string will not match.
Use * in place of + to also match the empty string.
• [b-df-hj-np-tv-z]+[aeiou][b-df-hj-np-tv-z]+ Match a string that includes at least one
vowel that is surrounded by at least one consonant on each side. Add ^ and $ around the
regex to specify a string that consists exactly of one vowel with consonants on either side.
• [A-Z]{4,} Match if the string contains at least four consecutive upper-case letters.
• [A-Z]{4} Although this appears to say “match if the string contains exactly four
consecutive upper-case letters,” it will in fact match the same strings as the preceding
example because we are not forcing the characters surrounding the four upper-case letters
to be non-upper-case characters.
• [^A-Z][A-Z]{4}[^A-Z] Match if the string contains exactly four upper-case letters
surrounded by other characters. For instance, this would match “abcDEFGhi” and
“Hi There FRED, how are you?” but not “abcDEFGHijk.” It will also not match “FRED”
because we are insisting that there be non-upper-case characters around the four upper-case
letters. We can fix this as shown below.
• [^A-Z][A-Z]{4}[^A-Z]|^[A-Z]{4}$
• ^$ Match only the empty string (blank lines).
• ... Match a string that contains at least three characters of any type. Add ^ and $ to
specify a regex that matches a string of exactly three characters.
• [Vv].?[Ii1!].?[Aa@].?[Gg9].?[Rr].?[Aa@] This regex might be used in a spam filter to
match the word “Viagra” along with variations. Notice the .? used in the regex. This states
that between each letter of the word viagra, we can accept another character. This could
account for such variations as Via!gra or V.I.A.G.R.A. The characters 1, !, @, and 9 are
there to account for variations where letters are replaced with look-alike characters, for
instance @ for a.
• ([A-Z][[:alpha:]]+ )?[A-Z][[:alpha:]]+, [A-Z]{2} [0-9]{5}$ This regex can be used to
match the city/state/zip code of a US postal address. First, we expect a city name. A city
name should appear as an upper-case letter followed by additional letters. The additional
letters may include upper-case letters as in McAllen. Some cities are two names like Los
Angeles. To cover a two-word city, we expect a blank space and another name. Notice the ?
that follows the close parenthesis to indicate that we would expect to see this one or zero
times. So we either expect a word, a space and a word, or just a word. This is followed by a
comma and space followed by two upper-case letters to denote the state abbreviation and a
space and five digits to end the line. We might expect two spaces between state and zip
code. We need to include an optional space. We could use either of [ ]{1,2} or [ ][ ]?. We
can also add (-[0-9]{4})? to indicate that the four-digit zip code extension is optional.
Regular Expressions
◾
215
• [A-Za-z_][A-Za-z_0-9]* In most programming languages, a variable’s name is a collection
of letters, digits, and underscores. The variable name must start with a letter or
underscore, not a digit. In some languages, variable names can also contain a dollar sign,
so we can enhance our regex by adding the $ character in the second set of brackets. In some
languages, variable names are restricted in length. For instance, we might restrict
variables to 32 or fewer characters. To denote this, we can replace the * with {0,31}. We
use 31 instead of 32 because we already have one character specified. Unfortunately, this
would not prevent our regex from matching within a 33-character variable name because
nothing in the regex forbids additional name characters before or after the matched
portion. We could resolve this by placing delimiters around the regex. There are a number of
delimiters such as spaces, commas, semicolons, arithmetic symbols, and parentheses. Instead,
we could also state that before and after the variable name we would not expect additional
letters, digits, or underscores. So we could improve our regex above to be
• [^A-Za-z_0-9][A-Za-z_][A-Za-z_0-9]{0,31}[^A-Za-z_0-9]
• ([(][0-9]{3}[)] )?[0-9]{3}-[0-9]{4} In this expression, we describe a US phone number. A
phone number consists of three digits, a hyphen, and four digits. If the number is long
distance, we include the area code before the number. The area code is three digits enclosed
in parentheses. For instance, a phone number can be 555-5555 or (123) 555-5555. If the area
code is included, the close paren is followed by a space. In the above regex, the two parens
are placed in [] to differentiate the literal paren from the metacharacter paren as used to
indicate a sequence. We could have also used \( and \). Some people will write the 10-digit
phone number (with area code) without the parens. We can include this by adding a ? after
each paren as in [(]? and [)]?; however, this would cause the regex to match if only one of
the two parens is supplied as in (123 555-5555. Alternatively, we can provide three
different versions of the regex with an OR between them as in
• [(][0-9]{3}[)] [0-9]{3}-[0-9]{4}|[0-9]{3} [0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{4}
• [0-9]+(\.[0-9]+)? This regex will match a numeric value with or without a decimal point.
The period is escaped so that it is treated as a literal decimal point rather than as the
“any character” metacharacter. We assume that there must be at least one digit and if there
is a decimal point, there must be at least one digit to the right of the decimal point. By
placing the ? after the sequence of period and digits, we are saying that if one appears the
other must appear. This allows us to have 99.99 or 0.0 but not 0. with nothing after the
decimal point.
• $[0-9]+\.[0-9]{2} Here, the $ indicates that we seek a dollar sign and not “end of
string.” This is followed by some number of digits, a period, and two digits. This makes up
a dollar amount as in $123.45. We have three problems with this regex if we want to match
any dollar amount. First, we are discounting a dollar amount that has no cents such as $123.
Second, the regex would not prevent a match against something like $123.45678. We would not
expect to see more than two digits after
the decimal point but our regex does not prevent this. Finally, if the dollar amount
contains commas, our regex will not match. To resolve the first problem, we provide
two versions:
• $([0-9]+|[0-9]+\.[0-9]{2}) Now we can match either a dollar sign and digits or a dollar
sign, digits, a period, and two digits. To resolve the second problem, we have to embed our
regex such that it is not followed by digits.
• $([0-9]+|[0-9]+\.[0-9]{2})[^0-9] This in itself would require that a dollar amount not
end a line. So we could enhance it as follows:
• $([0-9]+|[0-9]+\.[0-9]{2})[^0-9]|$([0-9]+|[0-9]+\.[0-9]{2})$ Although this looks a bit
bizarre with two dollar signs around the latter portion of the regex, the first is treated
literally and the last means “end of string.” The commas can be more of a challenge and this
is left to an end of chapter exercise.
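Several of the patterns in the list above can be tried out with a short script. The sketch below is only an illustration: it assumes Python's re module rather than a Linux tool, and the dictionary labels, the helper function, and the sample strings are hypothetical names chosen for this example.

```python
import re

# Patterns copied from the list above. The labels are illustrative names only.
PATTERNS = {
    "all_digits": r"^[0-9]+$",
    "four_upper": r"[^A-Z][A-Z]{4}[^A-Z]|^[A-Z]{4}$",
    "identifier": r"[^A-Za-z_0-9][A-Za-z_][A-Za-z_0-9]{0,31}[^A-Za-z_0-9]",
    "phone": r"([(][0-9]{3}[)] )?[0-9]{3}-[0-9]{4}",
}


def matches(label: str, text: str) -> bool:
    """Return True if the labelled pattern matches anywhere in text."""
    return re.search(PATTERNS[label], text) is not None


if __name__ == "__main__":
    checks = [
        ("all_digits", "2024"),
        ("all_digits", ""),
        ("four_upper", "abcDEFGhi"),
        ("four_upper", "abcDEFGHijk"),
        ("four_upper", "FRED"),
        ("identifier", " total_1 "),
        ("identifier", " " + "x" * 33 + " "),  # 33-character name, too long
        ("phone", "(123) 555-5555"),
        ("phone", "5555555"),
    ]
    for label, text in checks:
        print(f"{label:10} {text!r:40} {matches(label, text)}")
```

Two adjustments would be needed before trying the remaining patterns this way. Python's re module does not support POSIX classes such as [[:alpha:]], so the city/state/zip pattern would have to use a range such as [A-Za-z] instead. Python also always treats $ as an anchor, so a literal dollar sign would have to be written as \$ or [$].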
Now that we have introduced regular expressions (and no doubt confused the reader),
we will examine in the next three sections how to use regular expressions in three common
pieces of Linux software, grep, sed, and awk. Numerous examples will be offered which
will hopefully help the reader understand regular expressions.
6.4 GREP
The name grep comes from