pattern/ {print . . .}' filename
The pattern is a string, regular expression, or a condition. The action, in this case, is a print statement which specifies what to print from the matching line. The … would need to be replaced with the content to output. Given such a command, awk examines filename line-by-line looking for pattern.
awk considers each line to be a series of data separated into fields. Each field is denoted by $n, where n is the column of that field; the leftmost (first) field of a line is $1. The notation $0 is reserved to indicate the entire line (all of its fields), which we could use in the print statement.
Let us reconsider our file from Section 6.5, names.txt, which contained first names, middle initials, and last names. We want to print out the full names of those with middle initials. We could use /[A-Z]\./ for the pattern and {print $0} for the action (assuming that each line only contains names; otherwise $0 would give us the full line, which might be more information than we want to output). If we want to ensure that only the names are output, and assuming that the names appear in the first three fields of each line, we could use {print $1 $2 $3}. The full awk command is as follows.
awk '/[A-Z]\./ {print $1 $2 $3}' names.txt
FIGURE 6.4 awk operates on fields within rows of a file. (Figure labels: “awk matches entries on these lines”; “awk operates on values in these fields.”)
232
◾
Linux with Operating System Concepts
Here, we see that $1 represents the first field (first name) of the row, $2 represents the
second field (middle initial) of the row, and $3 represents the third field (last name) of the
row. If we do not want to output the middle initial, we could use the following command.
awk '/[A-Z]\./ {print $1 $3}' names.txt
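The field references can be tried on a small sample. The sketch below assumes a hypothetical names.txt (the real file from Section 6.5 may differ) and runs the command that omits the middle initial. Because the two field references are separated only by a space, awk joins the values with nothing between them.

```shell
# Hypothetical sample data; the real names.txt from Section 6.5 may differ.
cat > names.txt <<'EOF'
Frank V. Zappa
George Duke
Mike J. Keneally
EOF

# Select lines containing a capital letter followed by a period (a middle
# initial) and print the first and third fields of each.
initials=$(awk '/[A-Z]\./ {print $1 $3}' names.txt)
echo "$initials"
```

The line for George Duke is skipped because it contains no middle initial.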
What is a field? Fields are indicated by a delimiter (a separator). For awk, the default delimiters are spaces and tabs (a tab is indicated with \t). Therefore, whether a file is clearly in tabular form or is just text with spaces, awk can be used on it. However, if it is a text file, we may not know the exact number of items (fields) per line, as the number of items (words) can vary line-by-line.
The simple awk structure and example above only specify a single pattern. As with sed, you are able to specify any number of patterns for awk, each with its own action. The structure of a more elaborate awk command is
awk '/pattern1/ {action1}
/pattern2/ {action2}
/pattern3/ {action3}
…
/patternN/ {actionN}' filename
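This instruction is interpreted much like a sequence of independent if-then statements in a programming language. Working line-by-line in filename, if pattern1 matches, execute action1; then, if pattern2 matches, execute action2; and so on through patternN and actionN. Every pattern is tested against every line, so a line that matches several patterns triggers each of the corresponding actions, in order, before awk moves on to the next line. To obtain if-then-else behavior, end an action with the next statement, which tells awk to skip the remaining patterns for the current line.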
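A quick experiment shows how awk treats a line that matches more than one pattern. The input below is made up for illustration; its second line contains both words. The first command runs every matching action, while the second uses awk's next statement to stop at the first match.

```shell
# Hypothetical three-line input; the middle line matches both patterns.
out=$(printf 'alpha\nalpha beta\nbeta\n' |
  awk '/alpha/ {print "A: " $0}
       /beta/  {print "B: " $0}')
echo "$out"

# Same patterns, but "next" skips the remaining patterns once /alpha/ matches.
out_next=$(printf 'alpha\nalpha beta\nbeta\n' |
  awk '/alpha/ {print "A: " $0; next}
       /beta/  {print "B: " $0}')
echo "$out_next"
```

In the first run the line "alpha beta" is printed twice, once per matching pattern; in the second it is printed only by the first action.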
Let us consider some more interesting uses of awk. Assume we have a text file, sales.dat, which contains sales information. The file consists of rows of sales information using the following fields:
Month   Salesman   Sales   Commission amount   Region
Jan     Zappa      3851    .15                 CA, OR, AZ
Aside from the first row, which is the file’s header, each entry is the sales information for a given employee. There may be multiple rows for the same month and/or the same salesman. For instance, another row might contain
Feb     Zappa      6781    .20                 CA, OR, WA
First, let us compute the salary earned for each salesman whose region includes AZ.
Here, we have a single pattern, a regular expression to match AZ. In fact, since there is
no variability in the regular expression, our pattern is literally the two characters “AZ”.
For each matching line, we have a salesman who worked the Arizona region. To compute
the salary earned, we need to multiply the sales by the commission amount. These two
Regular Expressions
◾
233
amounts are fields 3 and 4, respectively. We want to print their product. Our awk command could be
awk '/AZ/ {print $3*$4}' sales.dat
Unfortunately, this will only provide us with a list of earnings, but not who earned them or in which month. We should instead have a more expressive output using {print $1 $2 $3*$4}. This would give us JanZappa577.65! We need to explain to awk how to format the output. There are two general choices for formatting. First, separate the fields with commas, which forces awk to output each field separated by a space. Second, between each field, we can use "\t" or " " to indicate that a tab or a blank space should be output.
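The formatting choices can be compared side by side. This sketch feeds the sample Zappa row to awk three times: with no separators, with commas, and with explicit "\t" strings. The variable names are illustrative only.

```shell
# The sample January row from sales.dat.
line='Jan Zappa 3851 .15 CA, OR, AZ'

# No separators: the values are concatenated.
plain=$(echo "$line" | awk '/AZ/ {print $1 $2 $3*$4}')

# Commas: awk places a single space between the values.
spaced=$(echo "$line" | awk '/AZ/ {print $1, $2, $3*$4}')

# Explicit "\t" strings: a tab is output between the values.
tabbed=$(echo "$line" | awk '/AZ/ {print $1 "\t" $2 "\t" $3*$4}')

echo "$plain"
echo "$spaced"
echo "$tabbed"
```

The first command prints JanZappa577.65, the second prints Jan Zappa 577.65, and the third prints the same three values separated by tabs.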
Next, let us compute the total amount earned for Zappa. The awk command
awk '/Zappa/ {print $1 "\t" $3*$4}' sales.dat
will provide one output (line) for each Zappa entry in the file. This will not give us a grand total, merely all of Zappa’s monthly sales results. What we need to do is accumulate each value in some running total. Fortunately, awk allows us to define and use variables. Let us use a variable named total. Our action will now be {total = total + $3*$4}. We can also print out each month’s result if we wish, so we could use {print $1 "\t" $3*$4; total = total + $3*$4;} or, if we want to be more efficient, {temp = $3*$4; print $1 "\t" temp; total = total + temp}. Our new awk command is
awk '/Zappa/ {temp = $3*$4; print $1 "\t" temp;
total = total + temp}' sales.dat
Notice that we are not outputting total in the above awk statement. We will come back
to this in the next subsection.
While awk is very useful in pulling out information from a file and performing compu-
tations, we can also use awk to provide specific results from a Linux command. We would
do this by piping the result of an instruction to an awk statement. Let us consider a couple
of simple examples.
Let us output the permissions and filenames of all files in a directory. The ls -l long listing will provide 10 characters that display the item’s file type and permissions. The first character should be a hyphen to indicate a regular file. If we have a match, we then want to output the first and last entries on the line ($1 and $9). This can be accomplished as follows.
ls -l | awk '/^-/ {print $1, $9}'
Notice that the awk instruction does not have a filename after it because its input is coming from the long listing. The regular expression used as our pattern, ^-, means that the line starts with a hyphen.
In another example, we want to obtain process information using ps of all running bash shells. This solution is even easier because our regex is simply bash. We print $0 to output the full line including, for instance, the PID and statistics about each bash shell’s processor usage.
ps aux | awk '/bash/ {print $0}'
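Piping into awk can be tried in a controlled setting. The sketch below builds a scratch directory (the file and directory names are invented) and filters its long listing for regular files. It assumes the usual nine-column ls -l layout; a filename containing spaces would spill past $9.

```shell
# Scratch directory holding one regular file and one subdirectory.
dir=$(mktemp -d)
touch "$dir/report.txt"
mkdir "$dir/subdir"

# LC_ALL=C keeps the date in its usual three-column form so that the
# filename is the ninth field. Only lines starting with "-" are printed.
listing=$(LC_ALL=C ls -l "$dir" | awk '/^-/ {print $1, $9}')
echo "$listing"

rm -r "$dir"
```

Only report.txt appears in the result; the entry for subdir begins with d, and the leading "total" line of the listing does not start with a hyphen either.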
6.6.2 BEGIN and END Sections
Our earlier example of computing Zappa’s total earnings computed his total pay but did not print it out. We could change our action to be {temp = $3*$4; print $1 "\t" temp; total = total + temp; print total}. This would then explicitly output the value of total for each match. But this will have the unfortunate effect of outputting the total for every row in which Zappa appears; in addition, the total will increase with each of these outputs. What we want to do is hold off on printing total until the very end of awk’s run.
Fortunately, awk does have this capability. We can enhance the awk command to include a BEGIN section and/or an END section. The BEGIN section is executed automatically before awk begins to search the file. The END section is executed automatically after the search ends. The BEGIN section might be useful to output some header information and to initialize variables if necessary. The END section might be useful to wrap up the computations (for instance, by computing an average) and output any results. We enhance our previous awk instruction to first output a report header and then, at the end, output the result.
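awk 'BEGIN {print "Sales results for Zappa"; total = 0}
/Zappa/ {temp = $3*$4; print $1 "\t" temp;
total = total + temp}
END {print "Zappa\047s total sales is $" total}' sales.dat
The apostrophe in the closing message is written as \047 (the octal escape for a single quote) because a literal apostrophe would end the single-quoted shell string early.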
The above instruction works as follows. First, the BEGIN statement executes, outputting the header (“Sales results for Zappa”) and initializing the variable total to 0. This initialization is not necessary as, in awk, any variable used in a numeric context is automatically initialized to 0. However, initializing all variables is a good habit to get into. Next, awk scans the file line-by-line for the pattern Zappa. For each line that matches, temp is set to the values of the third and fourth columns multiplied together. Then, awk outputs $1 (the month), a tab, and the value of temp. Finally, temp is added to the variable total. After completing its scan of the file, awk ends by outputting a closing message with Zappa’s total. Note that if no lines contained Zappa, the output would be simply:
Sales results for Zappa
Zappa’s total sales is $0
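The BEGIN and END structure can be tried on a small file. The sketch below assumes a hypothetical three-row sample of sales.dat built from the rows shown earlier plus one invented row for Duke. The closing message is reworded to avoid an apostrophe inside the single-quoted awk program.

```shell
# Hypothetical sample of sales.dat; the Duke row is invented for illustration.
cat > sales.dat <<'EOF'
Month Salesman Sales Commission Region
Jan Zappa 3851 .15 CA, OR, AZ
Jan Duke 5000 .10 OH, KY
Feb Zappa 6781 .20 CA, OR, WA
EOF

# BEGIN prints the header, the middle rule accumulates each month's
# earnings for Zappa, and END prints the grand total once.
report=$(awk 'BEGIN {print "Sales results for Zappa"; total = 0}
/Zappa/ {temp = $3*$4; print $1 "\t" temp; total = total + temp}
END {print "Total for Zappa: $" total}' sales.dat)
echo "$report"
```

With this sample the report has four lines: the header, one line each for Jan (577.65) and Feb (1356.2), and the total of 1933.85.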
Now, let us combine the use of the BEGIN and END sections with a multipatterned instruction. In this case, let us compute the total salaries for three employees. We want to
have, as output, each employee’s total earnings from sale commissions. This will require
maintaining three different totals, unlike the previous example with just the total for
Zappa. We will call these variables total1, total2, and total3.
awk 'BEGIN {total1 = 0; total2 = 0; total3 = 0}
/Zappa/ {total1 = total1 + $3*$4}
/Duke/ {total2 = total2 + $3*$4}
/Keneally/ {total3 = total3 + $3*$4}
END {print "Zappa $" total1 "\n" "Duke $" total2 "\n" "Keneally $" total3}' sales.dat
As with our previous examples, the regular expression exactly matches the string we are
looking for, so it is not a very challenging set of code. However, the logic is slightly more
involved because we are utilizing three different running totals.
6.6.3 More Complex Conditions
Let us look at an example that requires a greater degree of sophistication with our patterns. In this case, let us obtain the number of salesmen who operated in either OH or KY. To specify “or,” we use /pattern1/||/pattern2/ where the notation || means “or.” If we have a matching pattern, we want to increment a counter variable. In the END statement, we will want to output this counter’s value. We omit the BEGIN statement because we do not need a header in this case (the END statement outputs an explanation of the information that the command computed for us, and the variable, counter, is automatically initialized to 0).
awk '/OH/||/KY/ {counter = counter + 1}
END {print "Total number of employees who serve OH or KY: " counter}' sales.dat
If we wanted to count the number in both OH and KY, we would use && instead of ||.
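The difference between the two operators shows up on a small sample. The file name regions.dat and the Duke and Keneally rows below are invented for illustration. The counter is printed as counter + 0 so that a run with no matches prints 0 rather than an empty string. Note that the count is of matching rows, so a salesman who appears in several rows is counted once per row.

```shell
# Hypothetical sample rows in the sales.dat layout.
cat > regions.dat <<'EOF'
Jan Zappa 3851 .15 CA, OR, AZ
Jan Duke 5000 .10 OH, KY
Feb Keneally 4200 .12 OH, IN
Feb Zappa 6781 .20 CA, OR, WA
EOF

# Rows mentioning OH or KY (or both).
either=$(awk '/OH/ || /KY/ {counter = counter + 1} END {print counter + 0}' regions.dat)

# Rows mentioning both OH and KY.
both=$(awk '/OH/ && /KY/ {counter = counter + 1} END {print counter + 0}' regions.dat)

echo "either: $either, both: $both"
```

Here two rows match the "or" condition and only the Duke row matches the "and" condition.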
Let us consider a different file, courses.dat, to motivate additional examples. Imagine
that this file contains a student’s schedule for several semesters. The file contains fields for
semester (fall, spring, summer, and the year as in fall12 or summer14), the course which is a
designator and a course number as in CSC 362.001 (this is divided into two separate fields,
one for designator, one for course number), number of credit hours, location (building,
room), and time. For instance, one entry might be
fall12 CSC 362.001 3 GH 314 MWF 9:00-10:00 am
Let us create an awk command to output the courses taken in a particular year, for instance 2012. We would not want to use the pattern /12/ because the “12” could match the year, the course number or section number, the classroom number, or the time. Instead, we need to ensure that any 12 occurs near the beginning of the line. We could use the expression /fall12/||/spring12/||/summer12/. A shorter regular expression is one that finds 12 in the first field. Since the first field will be a string of letters representing the
season (fall, spring, summer), we can denote this as [a-z]+12. To indicate that this must occur at the beginning of the line, we add ^ to the beginning of the expression. This gives us the command
awk '/^[a-z]+12/ {print $0}' courses.dat
An awk command to output all of the 400-level courses should not just contain the pattern /4/ nor /4[0-9][0-9]/ because these could potentially match other things like a section number, a classroom number, or, in the case of /4/, credit hours. Instead, we will assume that all course designators are three-letter combinations while all classroom buildings are two-letter combinations. Therefore, the course number, written as 4[0-9][0-9], should follow [A-Z][A-Z][A-Z]. Since we require three letters in a row, this would not match a building. Our awk command then would look like this:
awk '/[A-Z][A-Z][A-Z] 4[0-9][0-9]/ {print $0}' courses.dat
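The effect of the stricter pattern can be seen on a few sample rows. Only the first row below comes from the text; the other two are invented, including a 200-level course held in room 405 of a hypothetical building ST. For brevity the sketch prints only the designator and course number.

```shell
# Hypothetical sample of courses.dat; rows two and three are invented.
cat > courses.dat <<'EOF'
fall12 CSC 362.001 3 GH 314 MWF 9:00-10:00 am
spring13 CSC 460.001 3 GH 412 TR 1:00-2:15 pm
fall13 MAT 234.002 4 ST 405 MWF 11:00-11:50 am
EOF

# Loose pattern: also matches the classroom number 405.
loose=$(awk '/4[0-9][0-9]/ {print $2, $3}' courses.dat)

# Strict pattern: the number must follow a three-letter designator.
strict=$(awk '/[A-Z][A-Z][A-Z] 4[0-9][0-9]/ {print $2, $3}' courses.dat)

echo "$loose"
echo "$strict"
```

The loose pattern reports MAT 234.002 as well as CSC 460.001, while the strict pattern reports only CSC 460.001.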
We can compute the number of hours earned in any particular semester or year. Let us compute the total hours for all of 2012. Again, we will use ^[a-z]+12 to indicate the pattern as we match the “12” only after the season at the beginning of the line. But rather than printing out the entry, we want to sum the value of hours, which is the fourth field ($4). Our awk command will be as follows.
awk '/^[a-z]+12/ {sum = sum + $4}
END {print "Total hours earned in 2012 is " sum}' courses.dat
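The summation can be checked against a small sample. Only the first row below appears in the text; the other two are invented, and the 2013 row deliberately contains "12" in its room number to confirm that the anchored pattern ignores it.

```shell
# Hypothetical sample of courses.dat; rows two and three are invented.
cat > courses.dat <<'EOF'
fall12 CSC 362.001 3 GH 314 MWF 9:00-10:00 am
spring12 MAT 128.002 4 ST 112 TR 12:15-1:30 pm
fall13 CSC 460.001 3 GH 412 TR 1:00-2:15 pm
EOF

# Add up the credit hours ($4) of rows whose first field ends in 12.
hours=$(awk '/^[a-z]+12/ {sum = sum + $4}
END {print "Total hours earned in 2012 is " sum}' courses.dat)
echo "$hours"
```

The two 2012 rows contribute 3 and 4 hours, so the command reports a total of 7.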
The metacharacter ^ is used to denote that the regular expression must match at the beginning of the line. To indicate “not,” we use ! before the pattern. We would use !/