filename2 filename
.
There is only one replacement string in our sed command, while the pattern could
potentially match many strings. Thus, any string that matches the pattern has the matching
portion of the string (the substring) replaced by the one replacement string. However,
the replacement string can reference the matched string in such a way that you can, with
effort, have different replacement strings appear. We will explore these ideas later in this
section. First, we start with a simple example.
Let us look at some examples using a file of addresses called addresses.txt. As is common,
people who live in apartments might list their apartment with the abbreviation apt or
Apt. We want to convert these to read “Apartment.” We can do so with a sed command as
follows:
sed 's/[Aa]pt/Apartment/' addresses.txt > revised_addresses.txt
In this instruction, we are searching each line of the addresses.txt file for either Apt or
apt and replacing the string with Apartment.
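To try the substitution without an addresses.txt file on hand, the following minimal sketch pipes two made-up address lines through the same sed command. The sample addresses and the variable name result are assumptions for illustration only.

```shell
# Two hypothetical address lines, one using "Apt" and one using "apt".
result=$(printf '%s\n' '12 Oak Street Apt 3a' '40 Elm Avenue apt 7' |
  sed 's/[Aa]pt/Apartment/')

# Show the rewritten lines.
printf '%s\n' "$result"
```

Both spellings should come out as Apartment, since the bracket expression [Aa] accepts either case for the first letter.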
226
◾
Linux with Operating System Concepts
People may also indicate an apartment using #, Number or number. To incorporate
these into our regular expression, we must use | to separate each possibility. However, | is
a character that cannot be used in a sed command as is. We need to precede it with a \ in
our regular expression. Our previous instruction becomes
sed 's/[Aa]pt\|[Nn]umber\|#/Apartment/' addresses.txt > revised_addresses.txt
This expression reads “replace Apt, apt, Number, number and # with Apartment.”
Unfortunately, our solution is not perfect. While we might see an address in the form
Apt 3a or Number 7, when people use #, it is typically with the format #12. The difference
is that with Apt and Number there is a space before the apartment number, yielding the
replacement in the form of “Apartment 3a” or “Apartment 7.” But the replacement for “#12”
is “Apartment12.” How can we enforce a blank space after the word “Apartment?” One way
to resolve this is to change our instruction to be
sed 's/[Aa]pt\|[Nn]umber\|#/Apartment /' addresses.txt > revised_addresses.txt
We are inserting a blank space after the word “Apartment” in the replacement string.
This, however, leads to these strings as replacements: “Apartment  3a,” “Apartment  7,”
and “Apartment 12.” That is, when Apt, apt, Number, or number is used, a blank space
appears after Apartment as part of our replacement string, but there was already a blank
space between Apt/apt/Number/number and the apartment number, leaving us with
two blank spaces rather than one. Only when # is used do we receive a single blank
space.
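The dilemma can be reproduced on three made-up lines. This is a minimal sketch that assumes GNU sed (the \| alternation is accepted there in basic regular expressions); the sample lines and variable names are hypothetical.

```shell
# Hypothetical sample lines covering the three spellings discussed above.
sample=$(printf '%s\n' 'Apt 3a' 'Number 7' '#12')

# Replacement without a trailing space: "#12" ends up with no separator.
no_space=$(printf '%s\n' "$sample" | sed 's/[Aa]pt\|[Nn]umber\|#/Apartment/')

# Replacement with a trailing space: the first two lines get two spaces.
with_space=$(printf '%s\n' "$sample" | sed 's/[Aa]pt\|[Nn]umber\|#/Apartment /')

printf '%s\n' "$no_space" "$with_space"
```

Neither single replacement string gives one space in all three cases, which is why a second pattern/replacement pair is needed.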
We resolve this problem by specifying several /pattern/replacement/ pairs. We do this
by providing sed with the option -e. The format of such a sed command is:
sed -e 's/pattern1/replacement1/' -e 's/pattern2/replacement2/' … filename
The … indicates additional -e 's/pattern/replacement/' pairs. Notice
that -e precedes each pair.
We will want to use a different replacement for the pattern #. Our new sed instruction is
sed -e 's/[Aa]pt\|[Nn]umber/Apartment/' -e 's/#/Apartment /' addresses.txt > revised_addresses.txt
Here, we see that the replacement string differs slightly between the patterns that match
Apt/apt/number/Number and #. For the latter case, the replacement string includes a space
while in the former case, the replacement string does not contain a space.
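As a quick check of the two-expression approach, the sketch below runs it on three made-up lines (again assuming GNU sed for \|; the sample lines and the variable name fixed are hypothetical).

```shell
# Each -e expression is applied, in order, to every input line.
fixed=$(printf '%s\n' 'Apt 3a' 'Number 7' '#12' |
  sed -e 's/[Aa]pt\|[Nn]umber/Apartment/' -e 's/#/Apartment /')

printf '%s\n' "$fixed"
```

Every line should now carry exactly one space between Apartment and the apartment number.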
Continuing on with our example of a file of addresses, assume each address includes
a zip code in the form #####-#### where each # would be a digit. We want to eliminate
Regular Expressions
◾
227
the -#### portion. Let us assume that four-digit numbers following hyphens only appear in
zip codes (for instance, we would not see a street address in the form 12-3456 Pine Street).
In order to accomplish this task, we want to search for a hyphen followed by four consecutive
digits and replace that string with nothing. To denote the pattern, we could use the
regular expression -[0-9]{4}. Our command should then be
sed 's/-[0-9]{4}//' addresses.txt > revised_addresses.txt
This instruction specifies that the pattern of a hyphen followed by four digits is replaced
by nothing.
Unfortunately, we are using {} in this regular expression. Remember from Section 6.2
that {} are part of the extended regular expression set. If we wish to use the extended set,
we have to add the option -r. We have two alternatives then. First, add -r. Second, use
-[0-9][0-9][0-9][0-9] as our regular expression. Obviously, the -r is easier. Our new command
is
sed -r 's/-[0-9]{4}//' addresses.txt > revised_addresses.txt
This instruction will now convert a string like
Ruth Underwood, 915 Inca Road, Los Angeles, CA 90125-1234
to
Ruth Underwood, 915 Inca Road, Los Angeles, CA 90125
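The difference the -r option makes can be seen by running the variants side by side on the sample line. This is a minimal sketch assuming GNU sed; the variable names are hypothetical.

```shell
line='Ruth Underwood, 915 Inca Road, Los Angeles, CA 90125-1234'

# Extended syntax via -r, as in the text.
with_r=$(printf '%s\n' "$line" | sed -r 's/-[0-9]{4}//')

# Basic syntax: spell out the four digits instead of using {4}.
spelled_out=$(printf '%s\n' "$line" | sed 's/-[0-9][0-9][0-9][0-9]//')

# Without -r, {4} is taken as literal characters, so nothing matches
# and the line passes through unchanged.
without_r=$(printf '%s\n' "$line" | sed 's/-[0-9]{4}//')

printf '%s\n' "$with_r" "$spelled_out" "$without_r"
```

The first two variants should drop the -1234 suffix, while the third leaves the line as it was, which is the failure the text describes.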
Recall that grep matched the regular expression to each line in a file. If any line matched
the regular expression, the line was output. It did not really matter if the expression
matched multiple substrings of the line. As long as any single match was found, the line
was output. With sed, we are looking to replace the matching string with a replacement
string. What if there are multiple matching strings in one line of the file? Will sed replace
each matched string?
The answer is: not normally. By default, sed replaces only the first match on each line
and then moves on to the next line. In order to force sed to continue to work on the same
line, you have to specify that the search and replace should be “global.” This is done by
adding the letter g after the final / and before the close quote. Let us imagine a file that
contains a number of mathematical problems where the words “equals,” “times,” “plus,” and
“minus” are used in place of “=,” “*,” “+,” and “-.” We want sed to replace all of the words
with the operators. Further, since any particular line might have multiple instances of
these words, we need to do a global replace. Our sed command might look like this
(assuming that the file is called math.txt):
sed -e 's/equals/=/g' -e 's/times/*/g' -e 's/plus/+/g' -e 's/minus/-/g' math.txt
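The effect of the g flag is easiest to see on a line that contains the same word twice. The sketch below uses a made-up problem line in place of math.txt; the variable names are hypothetical.

```shell
problem='2 plus 3 plus 4 equals 9'

# Without g, only the first "plus" on the line is replaced.
first_only=$(printf '%s\n' "$problem" | sed 's/plus/+/')

# With g on every expression, all occurrences are replaced.
all=$(printf '%s\n' "$problem" |
  sed -e 's/equals/=/g' -e 's/times/*/g' -e 's/plus/+/g' -e 's/minus/-/g')

printf '%s\n' "$first_only" "$all"
```

The first result should still contain one “plus,” while the second should read as an ordinary arithmetic expression.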
6.5.2 Placeholders
While using -e is a convenient way to express multiple patterns and their replacement
strings, this will not work in a more general case. Let us consider a situation where we
want to place any fully capitalized word inside of parentheses (I have no idea why you
would want to do this, but it illustrates the problem and the solution nicely). Our regular
expression is simply [A-Z]+, that is, any collection of one or more capital letters. If we had
a specific list of words, for instance ALPHA, BETA, GAMMA, we could solve the problem
easily enough using the -e option as in
sed -e 's/ALPHA/(ALPHA)/g' -e 's/BETA/(BETA)/g' -e 's/GAMMA/(GAMMA)/g' . . .