Regular Expressions
b’s and 0 c’s. In fact, we have to be careful when we use the * metacharacter: because it
means “0 or more,” it can match just about anything and everything. The same can be said of
the ? metacharacter, as it can also match 0 occurrences of the preceding literal character.
In Section 6.2.3, we will see how to better control our regular expressions.
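The point about * and ? can be checked with a minimal sketch. It uses Python's re module as a stand-in for the regular expression tools discussed here, which is an assumption, as are the sample strings. Because b* and b? can each match zero b's, a search succeeds on every string, including strings with no b at all.

```python
import re

# Illustrative strings; most of them contain no 'b'.
samples = ["bbb", "abc", "xyz", ""]

for s in samples:
    # re.search looks for the pattern anywhere in the string.
    star = re.search(r"b*", s)
    question = re.search(r"b?", s)
    # Both patterns may match zero b's, so both searches always succeed.
    print(repr(s), star is not None, question is not None)

# For a string with no b, the "match" is the empty string at position 0.
print(re.search(r"b*", "xyz").span())  # (0, 0)
```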
6.2.2 Using and Modifying the ‘.’ Metacharacter
The ‘.’ metacharacter is perhaps the easiest to understand. It is a true wildcard
metacharacter, meaning that it will match any single character. So, b.t will match any string
that contains a ‘b,’ then any one character, then a ‘t,’ no matter what that middle character
is. The . will match any character, whether that character is a letter, digit, punctuation
mark, or white space. Consider the regular expression ... (three periods), which will match
any three-character sequence, whether it is abc, a b, 123, 3*5, or ^%#.
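As a rough illustration of the wildcard, the sketch below again assumes Python's re module. It uses re.fullmatch so that each short sample is tested as a whole rather than as a substring. The sample strings are illustrative. One caveat specific to Python: the . does not match a newline unless the re.DOTALL flag is given.

```python
import re

# b.t needs exactly one character, of any kind, between 'b' and 't'.
candidates = ["bit", "b t", "b9t", "b*t", "bt", "boat"]
hits = [s for s in candidates if re.fullmatch(r"b.t", s)]
print(hits)  # ['bit', 'b t', 'b9t', 'b*t']

# Three periods match any sequence of exactly three characters.
sequences = ["abc", "a b", "123", "3*5", "^%#", "ab"]
three = [s for s in sequences if re.fullmatch(r"...", s)]
print(three)  # ['abc', 'a b', '123', '3*5', '^%#']
```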
We can combine the . metacharacter with the previously described three metacharacters *, +,
and ?. If *, +, or ? appears after the . then the *, +, or ? modifies the . itself. For
instance, .* means “0 or more of any character” whereas .? means “0 or 1 of any character.”
The expression b.?t means ‘b’ followed by 0 or 1 of anything followed by ‘t.’ This would
match bit, bat, but, b3t, b+t, as well as bt, in which case the ? is used to match against
0 characters. If we use .+ instead, we are requiring that there be one or more of any
character. If there is more than one character in the sequence, those characters do not have
to be the same character.
For instance, b.+t will match a ‘b’ followed by one or more characters of any kind and then
a ‘t,’ including any of the following:
bit boat baaaaat b1234t b+13*&%3wert
but not bt. The regular expression b.*t allows there to be 0 or more characters between the
‘b’ and the ‘t,’ and so it is the same as b.+t except that bt also matches.
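The three combinations can be compared side by side. This is a sketch under the same assumptions as before: Python's re module, with re.fullmatch so each sample is judged as a whole string. The helper name whole_matches is hypothetical.

```python
import re

patterns = [r"b.?t", r"b.+t", r"b.*t"]
samples = ["bt", "bit", "b+t", "boat", "b1234t", "b+13*&%3wert"]

def whole_matches(pattern, strings):
    """Return the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]

results = {p: whole_matches(p, samples) for p in patterns}
for p in patterns:
    print(p, results[p])

# b.?t allows at most one character between 'b' and 't';
# b.+t needs at least one; b.*t accepts any number, including none.
```

Only b.?t and b.*t accept bt, and only b.+t and b.*t accept the longer samples such as boat.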
How do the regular expressions b*t, b+t, and b?t differ from the expressions b.*t, b.+t, and
b.?t? In the former three cases, the metacharacters *, +, and ? modify the b in the
expression while in the latter three cases, the metacharacters *, +, and ? modify the . in
the expressions. In the first three expressions, only the ‘b’ is variable in nature. These
three expressions match against any number of b’s followed by a t, at least one b followed
by a t, and 0 or 1 b followed by a t, respectively.
In the latter three cases, the *, +, and ? are applied to the . so that there are any number
of characters between the b and t, at least one character between the b and t, and either 0
or 1 character between the b and t, respectively. So, for instance, b*t will match any of
bbbbbt, bt, bbbbbbbbt, and t (0 b’s) while b.*t will match babcdt, b12345678t, bbbbbt, and
bt.
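The contrast between the two families of expressions can be tabulated with a short sketch. As before, Python's re module and the sample strings are assumptions, and re.fullmatch is used so that each sample must fit the pattern from its first character to its last.

```python
import re

samples = ["t", "bt", "bbbbbt", "bat", "babcdt"]
patterns = [r"b*t", r"b+t", r"b?t", r"b.*t", r"b.+t", r"b.?t"]

# Map each pattern to the samples it matches in their entirety.
table = {
    p: [s for s in samples if re.fullmatch(p, s)]
    for p in patterns
}

for p in patterns:
    print(f"{p:5} {table[p]}")
```

In the first three patterns only the number of b's may vary, so bat and babcdt never match. In the last three the b is required and the characters after it may be anything, so t alone never matches.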
Linux with Operating System Concepts
6.2.3 Controlling Where a Pattern Matches
Let us reconsider a problem that we mentioned earlier. Recall that a regular expression will
match a substring of a string. Now consider the string bbbbbbbattttttt. The regular
expression b?.t+ will match this string even though it only matches a substring. See
Figure 6.1, where we can see that b?.t+ matches the last b in the string (b? requires 0 or
1 b’s), followed by any single character (the a), followed by 1 or more t’s. Even though the
string has seven b’s, an a, and seven t’s, the b? does not have to match all seven b’s. The
period matches the single a and the t+ matches the 7 t’s. We do not require that the regular
expression match the entire string. Note that the figure also shows ^b?.t+, which we
describe later.
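The substring match described above can be reproduced with a small sketch, again assuming Python's re module rather than any tool named in the text. re.search reports where in the string the pattern matched. re.fullmatch, a Python-specific call, is used only to show that the same pattern does not describe the string as a whole.

```python
import re

# The string from the text: seven b's, one a, seven t's.
text = "b" * 7 + "a" + "t" * 7

m = re.search(r"b?.t+", text)
print(m.group())  # battttttt
print(m.span())   # (6, 15)

# Only the last b, the a, and the t's take part in the match;
# the first six b's are simply skipped over.
skipped = text[:m.start()]
print(skipped)    # bbbbbb

# Requiring the pattern to cover the entire string fails.
print(re.fullmatch(r"b?.t+", text))  # None
```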
Unfortunately, the lack of control of where an expression matches can lead to a significant
problem. What if we want a regular expression to match the entire string? How do we specify
that there must not be multiple b’s? Before we describe how we can do this, let us define
what we mean by