212
◾
Linux with Operating System Concepts
• abc12345
• 12345
The expression would not match 1234567 or abc123def because neither of these strings
contains exactly five digits in sequence. It would match 1ab23456de7, though, because the
five digits are surrounded by nondigits. How does ^[^0-9]*[0-9]{5}[^0-9]*$ differ?
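One way to explore this question is to try the anchored expression on the sample strings above. The following is a minimal sketch that assumes Python's re module as a stand-in for egrep (the constructs used here are written the same way in both); the helper name matches_anchored is hypothetical.

```python
import re

# Assumption: Python's re module stands in for egrep here.
# The whole string must be: nondigits, exactly five digits, nondigits.
ANCHORED = re.compile(r"^[^0-9]*[0-9]{5}[^0-9]*$")

def matches_anchored(text):
    """Return True if the anchored expression matches text."""
    return ANCHORED.search(text) is not None

for sample in ["abc12345", "12345", "1234567", "abc123def", "1ab23456de7"]:
    print(sample, matches_anchored(sample))
```

Because of the ^ and $, the anchored form rejects 1ab23456de7: the string may contain no digits other than the single run of five.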
6.2.8 Selecting between Sequences
Now that we can control the exact number of repetitions that we expect to see, let us define
a regular expression to match a zip code. If we consider a five-digit zip code, we can use
[0-9]{5}. If we want to match a five-digit zip code that is in a string that contains other
characters, we might use the previously defined expression ^[^0-9]*[0-9]{5}[^0-9]*$.
In fact, if we know that the zip code will always follow a two-letter state abbreviation
followed by a blank space, we could be more precise, as in [A-Z]{2} [0-9]{5}[^0-9]*$.
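The state-plus-zip expression can be tried on a few address-like strings. This is only a sketch: Python's re module is assumed as a stand-in for egrep, and the sample addresses and the helper name has_state_zip are made up for illustration.

```python
import re

# Assumption: Python's re module stands in for egrep here.
# Two capital letters, a blank, five digits, then only nondigits to the end.
STATE_ZIP = re.compile(r"[A-Z]{2} [0-9]{5}[^0-9]*$")

def has_state_zip(text):
    """Return True if text ends with a state abbreviation and 5-digit zip."""
    return STATE_ZIP.search(text) is not None

print(has_state_zip("Cincinnati, OH 45202"))      # state and zip at the end
print(has_state_zip("Cincinnati, OH 45202 USA"))  # nondigits may follow the zip
print(has_state_zip("Cincinnati, Oh 45202"))      # lower-case letter in the state
print(has_state_zip("Cincinnati, OH 452021"))     # six digits, not five
```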
We also have nine-digit zip codes. These are zip codes that consist of five digits, a
hyphen, and four digits, as in 12345-6789. We would define this sequence as
[0-9]{5}-[0-9]{4}. Now we have a new problem. Which expression should we specify? If we
specify both, as in
[0-9]{5} [0-9]{5}-[0-9]{4}
we are stating that the string must have a five-digit sequence followed by a space followed by a
five-digit sequence, a hyphen, and a four-digit sequence. We need to be able to say “or.” Recall
from earlier that we were expressing “or” using [list]. However, the items in [] indicated
that any single character should match, not that we want to match an entire sequence.
We use another metacharacter to denote “or” in the sense that we have two or more
expressions and we want to match either (any) expression against a string. This
metacharacter is | (the vertical bar). Now we can express a zip code using the following.
[0-9]{5}|[0-9]{5}-[0-9]{4}
In the above expression, the | appears between the two definitions: the five-digit zip code
and the nine-digit zip code. That is, the regex will match a five-digit number OR a five-digit
number followed by a hyphen followed by a four-digit number.
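The effect of | can be checked with a short sketch, written with [0-9] in every digit position. It assumes Python's re module as a stand-in for egrep, and the helper name is_zip is hypothetical.

```python
import re

# Assumption: Python's re module stands in for egrep here.
# Five digits OR five digits, a hyphen, and four digits.
ZIP = re.compile(r"[0-9]{5}|[0-9]{5}-[0-9]{4}")

def is_zip(text):
    """Return True if text, in its entirety, is a 5- or 9-digit zip code."""
    return ZIP.fullmatch(text) is not None

for candidate in ["12345", "12345-6789", "12345 12345-6789", "1234", "12345-678"]:
    print(candidate, is_zip(candidate))
```

fullmatch is used here because Python tries the alternatives from left to right, so a plain search on 12345-6789 would stop after the five-digit alternative has matched 12345.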
Let us consider another example. The Cincinnati metropolitan region extends into three
states, Ohio, Kentucky, and Indiana. If we want to define a regular expression that will
match any of these three states’ abbreviations, our first idea might be to express this as
[IKO][NYH]. This will match any of IN, KY, and OH, so it seems to solve the problem.
However, there is no way to enforce the idea that “if the first character in the first list
matches, then only use the first character in the second list.” So this expression could also
match any of IY, IH, KN, KH, ON, or OY. By using the | we can avoid this problem through
IN|KY|OH.
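The difference between the two forms can be seen by listing every two-letter pair that each one accepts. This is a minimal sketch, assuming Python's re module as a stand-in for egrep; the helper name accepted is hypothetical.

```python
import re
from itertools import product

# Assumption: Python's re module stands in for egrep here.
CLASS_FORM = re.compile(r"[IKO][NYH]")  # one character from each list
ALT_FORM = re.compile(r"IN|KY|OH")      # one of three whole sequences

def accepted(pattern):
    """List each pair from IKO x NYH that pattern matches in full."""
    pairs = ("".join(p) for p in product("IKO", "NYH"))
    return [pair for pair in pairs if pattern.fullmatch(pair)]

print(accepted(CLASS_FORM))  # all nine combinations
print(accepted(ALT_FORM))    # only IN, KY, and OH
```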
The final metacharacters are the parentheses, (). These are used when you want to
encapsulate an entire pattern of metacharacters, literal characters, and enumerated lists
inside another set of metacharacters. This allows you to state that the trailing
metacharacter applies to the entire pattern rather than the single preceding character.
For instance, we want to match against a list of words. Each word will be either
capitalized or entirely lower-case, and the words will be separated by spaces. A single word is
indicated using [A-Za-z][a-z]*, that is, any upper- or lower-case letter followed by 0 or
more lower-case letters. To express the blank space, we will follow the above pattern with
a blank space, giving us ‘[A-Za-z][a-z]* ’. The quote marks are shown to clearly indicate
the blank space. Now, to express that there are several of these words in the string, we will
want to add {2,}. However, if we place the {2,} after the blank space, it will only modify
the blank space. We instead want the {2,} to modify the entire expression. Therefore, we
place the expression in (), giving us ([A-Za-z][a-z]* ){2,}. Now we have an expression
that will match two or more sequences of “upper- or lower-case letter followed by 0 or more
lower-case letters followed by a blank.”
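The scope of {2,} with and without the parentheses can be compared directly. The sketch below assumes Python's re module as a stand-in for egrep; the names UNGROUPED and GROUPED and the sample strings are chosen only for illustration.

```python
import re

# Assumption: Python's re module stands in for egrep here.
UNGROUPED = re.compile(r"[A-Za-z][a-z]* {2,}")  # {2,} modifies only the blank
GROUPED = re.compile(r"([A-Za-z][a-z]* ){2,}")  # {2,} modifies word plus blank

two_words = "Hello world "   # two words, each followed by one blank
two_blanks = "Hello  "       # one word followed by two blanks

print(bool(UNGROUPED.search(two_words)), bool(GROUPED.search(two_words)))
print(bool(UNGROUPED.search(two_blanks)), bool(GROUPED.search(two_blanks)))
```

The ungrouped form accepts only the string with two consecutive blanks, while the grouped form accepts only the string that contains two words.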
If we expect to see between two and five words in the string, we would express this as
([A-Za-z][a-z]* ){2,5}
To ensure that the two to five words make up the entire string, we might enclose the
expression within ^ and $ marks as in
^([A-Za-z][a-z]* ){2,5}$
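The anchored expression can be tried on strings with different word counts. This is a sketch only, assuming Python's re module as a stand-in for egrep; the helper name is_two_to_five_words is hypothetical.

```python
import re

# Assumption: Python's re module stands in for egrep here.
WORDS_2_TO_5 = re.compile(r"^([A-Za-z][a-z]* ){2,5}$")

def is_two_to_five_words(text):
    """Return True if text is entirely two to five 'word plus blank' units."""
    return WORDS_2_TO_5.search(text) is not None

print(is_two_to_five_words("Hello world "))                  # two words
print(is_two_to_five_words("one two three four five "))      # five words
print(is_two_to_five_words("one two three four five six "))  # six words
print(is_two_to_five_words("Hello "))                        # one word
```

Every sample above ends in a blank, since each word, including the last, must be followed by a blank for this expression to match.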
However, there is a flaw in our expression. We might assume that the final word in the
string does not end in a blank space. How can we say “two to five words