Regular Expressions
We have a series of files, some of which include financial information. Assume that such
files contain dollar amounts in the form $number.number as in $1234.56 and $12.00.
If we want to quickly identify these files, we might issue the grep instruction
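egrep '\$[0-9]+\.[0-9]{2}' *
The dollar sign is escaped as \$ because an unescaped $ in an extended regular expression
anchors the match to the end of a line rather than matching a literal dollar sign.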
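The search for dollar amounts can be tried out on a few throwaway files. The following is a minimal sketch, not part of the original text: the file names and their contents are hypothetical, grep -E is used as the equivalent spelling of egrep, and the -l option is added so that only the names of matching files are printed.

```shell
# Scratch directory with hypothetical sample files (names are illustrative only).
dir=$(mktemp -d)
printf 'Invoice total: $1234.56\n' > "$dir/invoice.txt"
printf 'Refund of $12.00 issued\n' > "$dir/refund.txt"
printf 'Meeting moved to room 1234.56\n' > "$dir/notes.txt"

# -l prints only the names of files that contain a match.
# \$ is a literal dollar sign; a bare $ would be the end-of-line anchor.
matches=$(cd "$dir" && grep -El '\$[0-9]+\.[0-9]{2}' * | paste -sd' ' -)
echo "$matches"   # invoice.txt refund.txt

rm -rf "$dir"
```

The file notes.txt is not reported because its number is not preceded by a dollar sign.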
As another example, we want to locate files that contain an address. In this case, let us
assume the address will contain the zip code 41099. The obvious instruction is
egrep '41099' *
However, this instruction will match any file that contains that five-digit sequence of
numbers. Since this is part of an address, it should only appear within an address, which
will include city and state. The 41099 zip code is part of Highland Heights, KY. So we might
further refine our regular expression with the following instruction:
egrep 'Highland Heights, KY  41099' *
In this case though, we might have too precise a regular expression. Perhaps some
addresses placed a period after KY and others have only one space after KY. In yet others,
Highland Heights may not appear on the same line. We can resolve the first two issues by
using a list of “or” possibilities, as in KY |KY|KY\. and we can resolve the second
problem by removing Highland Heights entirely. Now our instruction is
egrep '(KY |KY|KY\.) 41099' *
The use of the () makes it clear that there are three choices, KY with a space, KY without
a space, and KY with a period, followed by a space and 41099.
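To see the three alternatives at work, the expression can be run against a handful of sample lines. This is a sketch under stated assumptions: the address lines are invented for illustration, grep -E stands in for egrep, and -c is used to count matching lines. The second pattern, KY[ .]?, is not from the text; it is shown only as an equivalent, more compact spelling of the same three choices.

```shell
# Hypothetical address lines; only the first three should match.
lines='Highland Heights, KY  41099
Highland Heights, KY 41099
Highland Heights, KY. 41099
Louisville, KY 40202
Order number 41099'

# -c counts matching lines instead of printing them.
count_alt=$(printf '%s\n' "$lines" | grep -Ec '(KY |KY|KY\.) 41099')

# Equivalent spelling (an assumption, not from the text):
# KY, then an optional space or period, then a space and 41099.
count_opt=$(printf '%s\n' "$lines" | grep -Ec 'KY[ .]? 41099')

echo "$count_alt $count_opt"   # 3 3
```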
Let us now turn to a more elaborate example. Here, we wish to use egrep to help us
locate a particular file. In this case, we want to find the file in /etc that stores the DNS
name server IP address(es). We do not recall the file name and there are hundreds of files
in this directory to examine. The simple approach is to let egrep find any file that
contains an IP address and report it by name. How do we express an IP address as a regular
expression?
An IP address is of the form 1.2.3.4 where each number can range between 0 and 255.
We will want to issue the egrep command:
egrep 'regex-for-ip-address' /etc/*
where regex-for-ip-address is the regular expression that we come up with that will
match an IP address. The instruction will return all matching lines of all files. Included
in this list should be the line(s) from the file that we are looking for (which is
resolv.conf).
Linux with Operating System Concepts
An IP address (version 4) consists of four numbers separated by periods. If we could
permit any number, the regular expression could be
[0-9]+.[0-9]+.[0-9]+.[0-9]+
However, this is not accurate because the . represents any character. We really mean “a
period” so we want the period interpreted literally. We need to modify this by using \. or
[.]. Our regular expression is now
[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+
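The difference between the unescaped and escaped period is easy to check. In the sketch below, the three sample strings are hypothetical and grep -E is used in place of egrep; -c counts the matching lines for each version of the expression.

```shell
# Hypothetical test strings: one dotted address and two strings with no periods.
samples='192.168.1.1
10a20b30c40
1234567'

# Unescaped . matches any character, so all three lines match.
loose=$(printf '%s\n' "$samples" | grep -Ec '[0-9]+.[0-9]+.[0-9]+.[0-9]+')

# Escaped \. matches only a literal period, so only the first line matches.
strict=$(printf '%s\n' "$samples" | grep -Ec '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+')

echo "unescaped: $loose  escaped: $strict"   # unescaped: 3  escaped: 1
```

Even 1234567 matches the unescaped version, because every other digit can serve as the "any character" between the digit groups.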
The above regular expression certainly matches any IP address, but in fact it can match
against any four numbers that are separated by periods. The four numbers that make up
the IP address must be within the range of 0–255. How can we express that the number
must be no greater than 255? Your first thought may be to use [0-255]. Unfortunately,
that does not work nor does it make sense. Recall that in [] we enumerate a list of
choices to match one character in the string. The expression [0-255] can match one of
three different sets of single characters: a character in the range 0–2, the character 5,
and the character 5. This expression is equivalent to [0-25] or [0125] as that second 5 is
not needed. Obviously, this expression is not what we are looking for.
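A quick experiment confirms how [0-255] behaves. This sketch assumes grep -E as the equivalent of egrep and adds the -x option, which requires the whole line to match; the input values are chosen only for illustration.

```shell
# Each input line is tested as a complete string because of -x.
# [0-255] is a single-character class: 0, 1, 2, or 5.
matched=$(printf '%s\n' 0 1 2 3 5 9 25 255 | grep -Ex '[0-255]' | paste -sd' ' -)
echo "$matched"   # 0 1 2 5
```

Neither 25 nor 255 is matched as a whole, since the bracket expression can only ever consume one character.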
What we need to do is express that the item to match can range from 0, a single digit,
all the way up to 255, three digits long. How do we accomplish this? Let us consider this
solution:
[0-9]{1,3}
This regular expression will match any sequence of one to three digits. This includes
0, 1, 2, 10, 11, 12, 20, 21, 22, 99, 100, 101, 102, 201, 202, and 255 so it appears to work.
Unfortunately, it is too liberal an expression because it also matches 256, 257, 258, 301,
302, 400, 401, 500, 501, 502, 998, and 999 (as well as 000, 001, up through 099), none of
which are permissible as parts of an IP address.
If we do not mind our regular expression matching strings that it should not, then we
can solve our problem with the command
egrep '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /etc/*
We are asking grep to search all of the files in /etc for a string that consists of 1–3 digits, a
period, 1–3 digits, a period, 1–3 digits, a period and 1–3 digits. This would match 1.2.3.4 or
10.11.12.13 or 172.31.185.3 for instance, all of which are IP addresses.
It would also match
999.998.997.996 which is not an IP address.
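The behavior of this looser expression can be sketched on sample input rather than on /etc itself. The lines below are hypothetical stand-ins for configuration file content, and grep -E is used as the equivalent of egrep.

```shell
# The permissive pattern from the text: four groups of 1-3 digits.
ip_loose='[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

# Hypothetical lines: a valid address, an invalid one, and a three-part version number.
count=$(printf '%s\n' \
    'nameserver 172.31.185.3' \
    'nameserver 999.998.997.996' \
    'kernel 5.14.0' | grep -Ec "$ip_loose")

echo "$count"   # 2
```

Both nameserver lines are counted, including the one that is not a legal IP address, while the three-part version number is correctly ignored.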
If we want to build a more precise regular expression, then we have some work to do. Let
us consider again the range of possible numbers in 0–255. First, we have 0–9. That is easily
captured as [0-9]. Next, we have 10–99. This is also easily captured as [1-9][0-9]. We
could use the expression [0-9]|[1-9][0-9] to cover 0–99.
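One way to check that [0-9]|[1-9][0-9] covers exactly 0 through 99 is to feed it a run of integers. This is a sketch that assumes the seq utility is available (it is on typical Linux systems), uses grep -E for egrep, and adds -x so that each number must match in its entirety.

```shell
# Generate 0..300, one number per line, and count whole-line matches.
low_count=$(seq 0 300 | grep -Ecx '[0-9]|[1-9][0-9]')
echo "$low_count"   # 100 (the numbers 0 through 99)
```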
What about 100–255? Here, we have a little bit more of a challenge. We cannot just
use [1-2][0-9][0-9] because this would range from 100 to 299. So we need to enumerate
even more possible sequences. First, we would have 1[0-9][0-9] to allow for any
value from 100 to 199. Next, we would have 2[0-4][0-9] to allow for any combination
from 200 to 249. Finally, we would have 25[0-5] to allow for the last few combinations,
250–255.
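The three-digit cases can be verified the same way. The sketch below assumes seq is available, uses grep -E for egrep with -x for whole-line matching, and joins the three pieces with | purely for testing; it compares them against the overly broad [1-2][0-9][0-9].

```shell
# Too broad: accepts every number from 100 to 299.
too_wide=$(seq 0 300 | grep -Ecx '[1-2][0-9][0-9]')

# The three enumerated pieces: 100-199, 200-249, 250-255.
exact=$(seq 0 300 | grep -Ecx '1[0-9][0-9]|2[0-4][0-9]|25[0-5]')

echo "too wide: $too_wide  exact: $exact"   # too wide: 200  exact: 156
```

The count of 156 corresponds to the numbers 100 through 255, so the enumerated pieces admit nothing above 255.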
Let us put all this together to find a regex for any legal IP address octet. Combining the