Regular Expressions
Aside from examining regular expressions (and numerous examples), this chapter presents three
useful tools: grep, sed, and awk. The grep program searches a text file for instances of the
given regular expression, returning any lines that contain a matching string. This can be very
useful for searching multiple files for specific content such as people’s names, IP addresses,
or program instructions. The egrep program is a variation of grep which allows for more
metacharacters, and so we will primarily examine it when we look at grep in Section 6.4.
Regular expressions are also used in the programs sed and awk, although we will see that these
programs are far more complex than grep/egrep. We examine sed in Section 6.5 and awk in
Section 6.6. Regular expressions can also be used in both vi and emacs when searching for
strings. If you are a programmer, you will also find that many modern programming languages
have regular expression facilities. One of the earliest languages to make use of regex was
Perl, but we find it in such languages as Java, Python, Ruby, PHP, and .NET platform
languages (C++, C#, J#, ASP, VB, etc.).
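As a small illustration of the line-oriented behavior described above, the following sketch runs grep over a few made-up log lines. The sample text and the function name sample_log are invented for this example and do not come from the chapter; grep -E is used because it accepts the same extended syntax as egrep.

```shell
# Hypothetical sample data standing in for a text file (an assumption,
# not taken from the chapter).
sample_log() {
  printf '%s\n' \
    'alice logged in from host1' \
    'bob logged in from host2' \
    'alice logged out'
}

# grep prints every input line that contains a match for the pattern.
sample_log | grep -E 'alice'

# With -c, grep reports how many lines matched instead of printing them.
sample_log | grep -cE 'alice'
```

Note that grep reports whole lines, not just the matching text, which is why it is convenient for locating content inside files.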
6.2 METACHARACTERS
In this section, we present the regex metacharacters available in Linux. Each of these will
be discussed in detail and illustrated through a number of examples. The metacharacters build
on one another: some of them modify not only literal characters but also other metacharacters,
so they can be applied in combination. You might find this section challenging, as this is a
complicated topic. Working through the examples presented should help.
Table 6.1 lists the various metacharacters that are part of the standard and extended
regular expression set. In this section, we examine their usage with numerous examples.
The first thing you might notice from the characters in Table 6.1 is that some of them are
the same as Linux wildcard characters. Unfortunately, the * and ? differ between their use
as wildcards and their use in regular expressions. So we need to understand the context in
which the symbol is being used. For instance, the *, when used as part of a regular expression
in grep, differs from the use of * when it is used in the ls instruction. We will consider
this difference in more detail in Section 6.3.
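The contrast can be previewed with a short sketch. It assumes that the shell's case statement applies the same wildcard rules that the shell uses when expanding arguments to ls, and it uses grep -x so that the regular expression must describe the entire string. The helper names glob_match and regex_match are invented for this illustration.

```shell
# Wildcard (glob) matching: * stands for any run of characters,
# ? stands for exactly one character.
glob_match() {  # usage: glob_match STRING PATTERN
  case "$1" in
    $2) echo yes ;;
    *)  echo no ;;
  esac
}

# Regular expression matching against the whole string (-x):
# * means "0 or more of the preceding character",
# ? means "0 or 1 of the preceding character".
regex_match() {  # usage: regex_match STRING PATTERN
  if printf '%s\n' "$1" | grep -Eqx "$2"; then echo yes; else echo no; fi
}

glob_match  abc 'a*'   # yes: a followed by anything
regex_match abc 'a*'   # no:  only a's are allowed
glob_match  ab  'a?'   # yes: a followed by one character
regex_match ab  'a?'   # no:  at most a single a is allowed
```

The same pattern text therefore accepts different strings depending on whether the shell or grep is interpreting it.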
6.2.1 Controlling Repeated Characters through *, +, and ?
The first three metacharacters from Table 6.1 all specify how many times the character that
precedes the metacharacter can appear. With *, the preceding character can appear 0 or more
times. With +, the preceding character can appear 1 or more times. With ?, the preceding
character can appear exactly 0 or 1 times.
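A minimal sketch of the three definitions, using strings that are not from the text: ac, abc, and abbc contain zero, one, and two b's between a and c. The helper names are invented, and grep -Ex is assumed so that each pattern must account for the whole string.

```shell
# Candidate strings with zero, one, and two b's between a and c
# (hypothetical examples chosen for this sketch).
candidates() {
  printf '%s\n' ac abc abbc
}

# List the candidates that PATTERN matches in full (-x = whole line).
# "|| true" hides grep's non-zero exit status when nothing matches,
# so the function is safe in scripts run with "set -e".
full_matches() {
  candidates | { grep -Ex "$1" || true; } | paste -sd ' ' -
}

full_matches 'ab*c'   # b may appear 0 or more times: ac abc abbc
full_matches 'ab+c'   # b must appear at least once:  abc abbc
full_matches 'ab?c'   # b may appear at most once:    ac abc
```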
Consider the following strings:
1. aaaabbbbcccc
2. abcccc
3. accccc
4. aaaaaabbbbbb
Linux with Operating System Concepts
The regular expression a*b*c* would match all four strings because this regular expression
will match any string that has 0 or more a’s followed by 0 or more b’s followed by 0 or more
c’s. String #3 has no b and string #4 has no c but * permits this as “0 or more.”
The regular expression a+b+c+ would match only the first two because each letter, a, b, and c,
must appear at least one time. The regular expression a?b?c* will match the second and third
strings because the letters a and b may each appear 0 or 1 times: the letter a appears once
in each, and the letter b appears once in the second string and 0 times in the third. The
first and fourth strings have too many occurrences of a and b to match a?b?c*.
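The results for the four strings can be checked with a short sketch. It assumes that each regular expression is meant to describe the entire string, as the discussion so far does, and uses grep -x to enforce that. The function names are invented for this illustration.

```shell
# The four example strings from the text, one per line.
example_strings() {
  printf '%s\n' aaaabbbbcccc abcccc accccc aaaaaabbbbbb
}

# Print the numbers of the strings that PATTERN matches in full.
#   -x  the pattern must match the whole line
#   -n  prefix each matching line with its line number
matching_numbers() {
  example_strings | { grep -Enx "$1" || true; } | cut -d: -f1 | paste -sd ' ' -
}

for pattern in 'a*b*c*' 'a+b+c+' 'a?b?c*'; do
  printf '%-8s matches strings: %s\n' "$pattern" "$(matching_numbers "$pattern")"
done
```

Under that assumption the loop reports strings 1 through 4 for a*b*c*, strings 1 and 2 for a+b+c+, and strings 2 and 3 for a?b?c*.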
Notice that in the previous examples, order must be maintained. The string aaaaacccccbbbbbb
would not match a+b+c+ because the b’s must appear before the c’s. Would a*b*c* match this new
string? It seems counter-intuitive to say yes, it matches, because the c’s appear before the
b’s in the string. But in fact, a*b*c* does match, for a reason that is not obvious. When we
use the * metacharacter, we are saying that the string must contain “0 or more” instances of
the given literal character. In this case, we are requiring that the string contain at least
0 a’s, at least 0 b’s, and at least 0 c’s, in that order. It does, because the string
contains, for instance, aaaaa, which means that the a* portion matched. The b* and c* portions
also match because immediately following the a’s are 0
TABLE 6.1 Regular Expression Metacharacters
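Returning to the string aaaaacccccbbbbbb discussed above, the following sketch contrasts a match found anywhere in the line with a match that must cover the whole line. It assumes grep's default behavior of accepting a match at any position within a line; the -x option is used only to show the stricter whole-line reading. The helper names anywhere and whole_line are invented for this illustration.

```shell
s='aaaaacccccbbbbbb'

# Count the lines (0 or 1 here) in which PATTERN matches somewhere.
anywhere() { printf '%s\n' "$s" | grep -cE "$1" || true; }

# Count the lines that PATTERN matches from start to end (-x).
whole_line() { printf '%s\n' "$s" | grep -cEx "$1" || true; }

echo "a+b+c+ anywhere:   $(anywhere 'a+b+c+')"     # 0
echo "a*b*c* anywhere:   $(anywhere 'a*b*c*')"     # 1
echo "a*b*c* whole line: $(whole_line 'a*b*c*')"   # 0
```

The pattern a*b*c* succeeds only because part of the line satisfies it; when the pattern is required to cover the entire line, the c's that precede the b's cause it to fail.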