− 1 times
for each of the O(n
2
) possible concatenations. We needed a faster dictionary data
structure, since search was the innermost operation in such a deep loop.
“How about using a hash table?” I suggested. “It should take O(k) time to hash
a k-character string and look it up in our table. That should knock off a factor of
O(log n), which will mean something when n
≈ 2,000.”
Dimitris went back and implemented a hash table implementation for our dic-
tionary. Again, it worked great up until the moment we ran it.
“Our program is still too slow,” Dimitris complained. “Sure, it is now about
ten times faster on strings of length 2,000. So now we can get up to about 4,000
characters. Big deal. We will never get up to 50,000.”
“We should have expected this,” I mused. “After all, lg
2
(2, 000)
≈ 11. We need
a faster data structure to search in our dictionary of strings.”
“But what can be faster than a hash table?” Dimitris countered. “To look up
a k-character string, you must read all k characters. Our hash table already does
O(k) searching.”
“Sure, it takes k comparisons to test the first substring. But maybe we can do
better on the second test. Remember where our dictionary queries are coming from.
When we concatenate ABCD with EF GH, we are first testing whether BCDE
is in the dictionary, then CDEF . These strings differ from each other by only one
character. We should be able to exploit this so each subsequent test takes constant
time to perform. . . .”
“We can’t do that with a hash table,” Dimitris observed. “The second key is not
going to be anywhere near the first in the table. A binary search tree won’t help,
either. Since the keys ABCD and BCDE differ according to the first character,
the two strings will be in different parts of the tree.”
“But we can use a suffix tree to do this,” I countered. “A suffix tree is a trie
containing all the suffixes of a given set of strings. For example, the suffixes of
ACAC are
{ACAC, CAC, AC, C}. Coupled with suffixes of string CACT , we get
the suffix tree of Figure
3.12
. By following a pointer from ACAC to its longest
proper suffix CAC, we get to the right place to test whether CACT is in our set
of strings. One character comparison is all we need to do from there.”
Suffix trees are amazing data structures, discussed in considerably more detail
in Section
12.3
(page
377
). Dimitris did some reading about them, then built a nice
3 . 9
W A R S T O R Y : S T R I N G ’ E M U P
97
a
t
c
c
c
c
a
a
t
t
t
Figure 3.12: Suffix tree on ACAC and CACT , with the pointer to the suffix of ACAC
suffix tree implementation for our dictionary. Once again, it worked great up until
the moment we ran it.
“Now our program is faster, but it runs out of memory,” Dimitris complained.
“The suffix tree builds a path of length k for each suffix of length k, so all told there
can be Θ(n
2
) nodes in the tree. It crashes when we go beyond 2,000 characters.
We will never get up to strings with 50,000 characters.”
I wasn’t ready to give up yet. “There is a way around the space problem, by
using compressed suffix trees,” I recalled. “Instead of explicitly representing long
paths of character nodes, we can refer back to the original string.” Compressed
suffix trees always take linear space, as described in Section
12.3
(page
377
).
Dimitris went back one last time and implemented the compressed suffix tree
data structure. Now it worked great! As shown in Figure
3.13
, we ran our simu-
lation for strings of length n = 65,536 without incident. Our results showed that
interactive SBH could be a very efficient sequencing technique. Based on these
simulations, we were able to arouse interest in our technique from biologists. Mak-
ing the actual wet laboratory experiments feasible provided another computational
challenge, which is reported in Section
7.7
(page
263
).
The take-home lessons for programmers from Figure
3.13
should be apparent.
We isolated a single operation (dictionary string search) that was being performed
repeatedly and optimized the data structure we used to support it. We started with
a simple implementation (binary search trees) in the hopes that it would suffice,
and then used profiling to reveal the trouble when it didn’t. When an improved
dictionary structure still did not suffice, we looked deeper into the kind of queries we
were performing, so that we could identify an even better data structure. Finally,
we didn’t give up until we had achieved the level of performance we needed. In
algorithms, as in life, persistence usually pays off.
98
3 .
D A T A S T R U C T U R E S
String
Binary
Hash
Suffix
Compressed
length
tree
table
tree
tree
8
0.0
0.0
0.0
0.0
16
0.0
0.0
0.0
0.0
32
0.1
0.0
0.0
0.0
64
0.3
0.4
0.3
0.0
128
2.4
1.1
0.5
0.0
256
17.1
9.4
3.8
0.2
512
31.6
67.0
6.9
1.3
1,024
1,828.9
96.6
31.5
2.7
2,048
11,441.7
941.7
553.6
39.0
4,096
> 2 days
5,246.7
out of
45.4
8,192
> 2 days
memory
642.0
16,384
1,614.0
32,768
13,657.8
65,536
39,776.9
Figure 3.13: Run times (in seconds) for the SBH simulation using various data structures
Do'stlaringiz bilan baham: |