Linux with Operating System Concepts
of 1 or more characters in the set a, b, c while [abc]* will also match the empty string. In
this latter case, we actually have a regular expression that will match anything, because every
string contains a run of zero a’s, b’s, and c’s. For instance, 12345 contains no a’s, b’s, or c’s,
and so it can match [abc]* when * is interpreted as 0 repetitions.
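The difference between * and + on a string with no a's, b's, or c's can be checked with a short sketch. It uses Python's re module as a stand-in for a Linux regex tool, which is an assumption of this example and not something the text prescribes.

```python
import re

# [abc]* may match zero characters, so a search succeeds on any string.
star = re.search(r"[abc]*", "12345")

# [abc]+ needs at least one a, b, or c, so the same search fails.
plus = re.search(r"[abc]+", "12345")

print(repr(star.group()))  # the matched text is the empty string: ''
print(plus)                # None
```

The * version reports a match even though nothing in 12345 was consumed, which is why [abc]* on its own filters out no strings at all.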
Now we have a means of expressing a regular expression where order is not important. The
expression [abc]+ will match any of these four strings that we saw earlier that matched
a*b*c*:
• aaaabbbbcccc
• abcccc
• accccc
• aaaaaabbbbbb
This expression will also match strings like the following.
• abcabcabcabc
• abacab
• aaaaaccccc
• a
• cccccbbbbbbaaaa
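The two lists above can be checked with a minimal sketch in Python's re module (our choice of tool for illustration). We use fullmatch so that the whole string must fit the pattern; that choice, and the variable names, are assumptions of this sketch.

```python
import re

ordered = re.compile(r"a*b*c*")     # a's first, then b's, then c's
any_order = re.compile(r"[abc]+")   # one or more of a, b, c in any order

strings = [
    "aaaabbbbcccc", "abcccc", "accccc", "aaaaaabbbbbb",
    "abcabcabcabc", "abacab", "aaaaaccccc", "a", "cccccbbbbbbaaaa",
]

# fullmatch succeeds only when the entire string fits the pattern.
any_order_hits = [s for s in strings if any_order.fullmatch(s)]
ordered_hits = [s for s in strings if ordered.fullmatch(s)]

for s in strings:
    print(f"{s:16} [abc]+: {s in any_order_hits}  a*b*c*: {s in ordered_hits}")
```

Every string fits [abc]+. Strings such as abacab and cccccbbbbbbaaaa do not fit a*b*c* because their letters are out of order, while aaaaaccccc and a fit both patterns.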
We can combine any characters in the brackets as in [abcxyz], [abcd1234], or [abcdABCD]. If we
have a number of characters to enumerate, a range is more practical. We would certainly prefer
to use a range like [a-z] rather than list all of the letters. We can also combine ranges and
enumerations. For instance, the three sequences above could also be written as [a-cx-z],
[a-d1-4], and [a-dA-D], respectively. Now consider the list of all lowercase consonants. We
could enumerate them all as [bcdfghjklmnpqrstvwxyz] or we could use several ranges as in
[b-df-hj-np-tv-z].
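To confirm that each range form describes the same set of characters as its enumeration, we can compare them character by character. This is a sketch in Python's re module; the helper name matched_chars is hypothetical and the comparison covers printable ASCII only.

```python
import re
import string

def matched_chars(pattern):
    """Return the printable ASCII characters that a one-character pattern matches."""
    return {ch for ch in string.printable if re.fullmatch(pattern, ch)}

pairs = [
    ("[abcxyz]", "[a-cx-z]"),
    ("[abcd1234]", "[a-d1-4]"),
    ("[abcdABCD]", "[a-dA-D]"),
    ("[bcdfghjklmnpqrstvwxyz]", "[b-df-hj-np-tv-z]"),
]

# Each entry records whether the enumeration and the range form agree.
results = [(enum, rng, matched_chars(enum) == matched_chars(rng)) for enum, rng in pairs]

for enum, rng, same in results:
    print(f"{enum:26} {rng:20} same set: {same}")
```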
While we can use ranges for letters and digits, there is no range available for the punctuation
marks. You could enumerate all of the punctuation marks in brackets to capture “any punctuation
mark” but this would be tedious. Instead, we also have a class named :punct: which is applied in
double brackets, as in [[:punct:]]. Table 6.2 provides a listing of the classes available in
Linux.
Let us now combine all of the metacharacters we have learned with some examples. We want to
find a string that consists only of letters. We can use ^[a-zA-Z]+$ or ^[[:alpha:]]+$. The ^
and $ force the regex to match an entire string. Thus, any string that contains nonletters will
not match. If we had used only [a-zA-Z]+, then it could match any string that contains letters
but could also have other characters that precede or follow the letters such as abc123, 123abc,
abc!def, as well as ^#!$a*%&. Why do we use the + in this regex? If we had used *, this could
also match the empty string, that is,
a string with no characters. The + insists that there be at least one letter and the ^ and $
insist that the only characters found are letters.
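The effect of the anchors and of + versus * can be seen by running the three variants over the sample strings from the text. This is a sketch in Python's re module, where [a-zA-Z] is used because that module does not provide [[:alpha:]]. One caveat: Python's $ also matches just before a final newline, so this approximates a line-oriented tool.

```python
import re

anchored = re.compile(r"^[a-zA-Z]+$")    # entire string must be letters, at least one
unanchored = re.compile(r"[a-zA-Z]+")    # any run of letters anywhere in the string
star = re.compile(r"^[a-zA-Z]*$")        # entire string is letters, possibly none

samples = ["abc", "abc123", "123abc", "abc!def", "^#!$a*%&", ""]

for s in samples:
    print(
        f"{s!r:12}",
        "anchored:", anchored.search(s) is not None,
        "unanchored:", unanchored.search(s) is not None,
        "star:", star.search(s) is not None,
    )
```

Only abc passes the anchored pattern. Every nonempty sample passes the unanchored one, including ^#!$a*%&, because it contains a single letter. The empty string passes only the * variant.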
We could similarly match a string of only binary digits. Binary digits are 0 and 1. So instead
of [a-zA-Z] or [[:alpha:]], we use [01]. The regex is ^[01]+$. Again, we use the ^ and $ to
force the expression to match entire strings and we use + instead of * to disallow the empty
string. If we wanted to match strings that consist solely of digits, but any digits, we would
use either ^[0-9]+$ or ^[[:digit:]]+$.
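As a brief sketch, the binary and decimal patterns can be wrapped in two small checks. The function names is_binary and is_digits are hypothetical, and Python's re module is again our assumed stand-in for a Linux regex tool.

```python
import re

def is_binary(s):
    """True when s consists only of the digits 0 and 1, with at least one digit."""
    return re.search(r"^[01]+$", s) is not None

def is_digits(s):
    """True when s consists only of the digits 0 through 9, with at least one digit."""
    return re.search(r"^[0-9]+$", s) is not None

for s in ["010011", "0102", "12a", ""]:
    print(f"{s!r:10} binary: {is_binary(s)}  digits: {is_digits(s)}")
```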
If we want to match a string of only punctuation marks, we would use ^[[:punct:]]+$. Unlike the
previous examples, we would not use […] and enumerate the list of punctuation marks. Why not?
There are too many and we might (carelessly) miss some. There is no range that indicates exactly
the punctuation marks: a range such as [!-?] does not work because the punctuation marks are not
contiguous in the character set (in ASCII, for instance, the digits lie between ! and ?). So we
must either list them all, or use :punct:.
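The problem with a range like [!-?] can be demonstrated directly. Python's re module does not support [[:punct:]], so this sketch builds an equivalent bracket list from string.punctuation. Treating that constant as a stand-in for :punct: is an assumption that holds only for ASCII punctuation.

```python
import re
import string

# Assumption: string.punctuation (the 32 ASCII marks) stands in for [[:punct:]].
punct_class = "[" + re.escape(string.punctuation) + "]"
only_punct = re.compile("^" + punct_class + "+$")

# [!-?] is a legal range, but it is not "all punctuation marks".
bang_to_question = re.compile(r"[!-?]")

print(only_punct.search("!?...;") is not None)       # only punctuation marks
print(only_punct.search("wow!") is not None)         # contains letters
print(bang_to_question.fullmatch("5") is not None)   # a digit falls inside the range
print(bang_to_question.fullmatch("@") is not None)   # a punctuation mark outside it
```

The range [!-?] accepts the digit 5 and rejects @, so it both admits nonpunctuation and misses some punctuation marks.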
If we want to match a string that consists only of digits and letters where the digits precede
the letters, we would use ^[0-9]+[[:alpha:]]+$. If we wanted to match a string that consists
only of letters and digits where the first character must be a letter and then can be followed
by any (0 or more) letters and digits, we would use ^[[:alpha:]][0-9a-zA-Z]*$.
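These two patterns place bracket expressions in sequence, and a short sketch shows which strings each accepts. Python's re module has no [[:alpha:]], so [a-zA-Z] is substituted here (an ASCII-only assumption). The variable names are hypothetical.

```python
import re

# One or more digits followed by one or more letters, and nothing else.
digits_then_letters = re.compile(r"^[0-9]+[a-zA-Z]+$")

# A letter first, then zero or more letters or digits.
letter_first = re.compile(r"^[a-zA-Z][0-9a-zA-Z]*$")

for s in ["123abc", "abc123", "123", "a", "x25b", "9lives", ""]:
    print(
        f"{s!r:10}",
        "digits then letters:", digits_then_letters.search(s) is not None,
        "letter first:", letter_first.search(s) is not None,
    )
```

Note that 123 fails the first pattern because each + demands at least one character from its set, while the single letter a passes the second pattern because * allows zero further characters.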
6.2.5 Matching Characters That Must Not Appear
In some cases, you will have to express a pattern that seeks to match a string that does not
contain specific character(s). We might want to match a string that has no blank spaces in it.
You might think to use [. . .]+ where the . . . is “all characters except the blank space.” That
would require enumerating quite a list as it would have to include every letter, every digit,
and every punctuation mark. In such a case, we would prefer to indicate “no space” by using the
notation [^ ]. The ^, when used as the first character inside of [], means “do not match”
against the characters listed in the brackets. The blank space after ^ indicates that the only
character we do not want to match against is the blank.
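A negated bracket expression can be tried out on single characters first. This is a minimal sketch using Python's re module as an assumed stand-in for a Linux regex tool.

```python
import re

not_blank = re.compile(r"[^ ]")   # any one character other than the blank space

print(not_blank.fullmatch("a") is not None)    # True: a letter is not a blank
print(not_blank.fullmatch(" ") is not None)    # False: the blank is excluded
print(not_blank.fullmatch("\t") is not None)   # True: only the space itself is listed

# Every nonblank character in the string is an individual match.
nonblanks = not_blank.findall("hi there")
print(nonblanks)
```

Because only the blank appears after ^, other whitespace such as a tab still matches.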
Unfortunately, our regex [^ ] will have the same flaw as earlier expressions in that if it
locates any single nonblank character within the string, it is a match to the string. If our
string is “hi there,” the [^ ] regex will match the ‘h’ at the beginning of the string because it
TABLE 6.2 Classes Defined for Regular Expressions