Sorted Arrays, Trees, and Dicts: Choices, Choices
Bisection (on sorted arrays), binary search trees, and dicts (that is, hash tables) all implement the same basic functionality: they let you search efficiently. There are some important differences, though. Bisection is fast, with little overhead, but works only on sorted arrays (such as Python lists). And sorted arrays are hard to maintain; adding elements takes linear time. Search trees have more overhead but are dynamic and let you insert and remove elements. In many cases, though, the clear winner will be the hash table, in the form of dict. Its average asymptotic running time is constant (as opposed to the logarithmic running time of bisection and search trees), and it is close to that in practice, with little overhead.
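As a quick illustration, here is a minimal sketch of the two extremes in Python (the sample data is made up), using the standard bisect module for the sorted-array case:

from bisect import bisect_left, insort

names = ["alice", "bob", "eve"]            # A sorted array
i = bisect_left(names, "bob")              # Logarithmic search by bisection
found = i < len(names) and names[i] == "bob"
insort(names, "carol")                     # Linear: insertion shifts elements

ages = {"alice": 30, "bob": 25}            # A dict (hash table)
age = ages.get("bob")                      # Constant time, on average
ages["carol"] = 28                         # Insertion is constant too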
Hashing requires you to be able to compute a hash value for your objects. In practice, you can almost always do this, but in theory, bisection and search trees are a bit more flexible here—they need only to compare objects and figure out which one is smaller.⁴ This focus on ordering also means that search trees will let you access your values in sorted order—either all of them or just a portion. Trees can also be extended to work in multiple dimensions (to search for points inside a hyperrectangular region) or to even stranger forms of search criteria, where hashing may be hard to achieve. There are more common cases, too, when hashing isn't immediately applicable. For example, if you want the entry that is closest to your lookup key, a search tree would be the way to go.
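With a sorted array, the same closest-entry lookup can be done by bisecting and then comparing the neighbors of the insertion point. Here is a minimal sketch (the function name closest is mine, not from the book); a search tree would support the same operation while also allowing cheap insertions:

from bisect import bisect_left

def closest(sorted_seq, key):
    # Return the element of sorted_seq closest to key
    i = bisect_left(sorted_seq, key)       # Where key would be inserted
    if i == 0: return sorted_seq[0]
    if i == len(sorted_seq): return sorted_seq[-1]
    before, after = sorted_seq[i-1], sorted_seq[i]
    return before if key - before <= after - key else after

closest([2, 5, 11, 17], 9)                 # Returns 11 (11 - 9 < 9 - 5)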
⁴ Actually, more flexible may not be entirely correct. There are many objects (such as complex numbers) that can be hashed but that cannot be compared for size.
Selection
I’ll round off this section on “searching by half” with an algorithm you may not end up using a lot in practice but that takes the idea of bisection in an interesting direction. Besides, it sets the stage for quicksort (next section), which is one of the classics.
The problem is to find the kth smallest number in an unsorted sequence, in linear time. The most important case is, perhaps, to find the median—the element that would have been in the middle position (that is, (n+1)//2), had the sequence been sorted.⁵ Interestingly, as a side effect of how the algorithm works, it will also allow us to identify which objects are smaller than the object we seek. That means we’ll be able to find the k smallest (and, simultaneously, the n-k largest) elements with a running time of Θ(n), meaning that the value of k doesn’t matter!
This may be stranger than it seems at first glance. The running time constraint rules out sorting (unless we can
count occurrences and use counting sort, as discussed in Chapter 4). Any other obvious algorithm for finding the k
smallest objects would use some data structure to keep track of them. For example, you could use an approach similar
to insertion sort: Keep the k smallest objects found so far either at the beginning of the sequence or in a separate
sequence.
If you kept track of which one of them was largest, checking each large object in the main sequence would be fast
(just a constant-time check). If you needed to add an object, though, and you already had k, you’d have to remove one.
You’d remove the largest, of course, but then you’d have to find out which one was now largest. You could keep them
sorted (that is, stay close to insertion sort), but the running time would be Θ(nk) anyway.
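To make the Θ(nk) bound concrete, here is a minimal sketch of that partial insertion sort (the function name is mine, not from the book):

from bisect import insort

def k_smallest_slow(seq, k):
    # Keep the k smallest seen so far, in sorted order
    smallest = []
    for x in seq:
        if len(smallest) < k:
            insort(smallest, x)            # Shifting makes this Θ(k)
        elif x < smallest[-1]:             # x beats the current largest
            insort(smallest, x)
            smallest.pop()                 # Evict the largest
    return smallest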
One step up from this (asymptotically) would be to use a heap, essentially transforming our “partial insertion
sort” into a “partial heap sort,” making sure that there are never more than k elements in the heap. (See the “Black
Box” sidebar about binary heaps, heapq, and heapsort for more information.) This would give you a running time of
Θ(n lg k), and for a reasonably small k, this is almost identical to Θ(n). It also lets you iterate over the main sequence without jumping around in memory, so in practice it might be the solution of choice.
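A minimal sketch of that bounded-heap idea might look like this (the function name is mine; heapq supplies only a min-heap, so the k smallest are kept in a simulated max-heap by negating values):

from heapq import heappush, heappushpop

def k_smallest(seq, k):
    # Θ(n lg k): a max-heap (negated values) of the k smallest so far
    heap = []
    for x in seq:
        if len(heap) < k:
            heappush(heap, -x)             # Heap not full yet: just add
        elif -x > heap[0]:                 # x is below the current maximum
            heappushpop(heap, -x)          # Replace the maximum with x
    return sorted(-v for v in heap)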
Tip ■ If you’re looking for the k smallest (or largest) objects in an iterable in Python, you would probably use the nsmallest (or nlargest) function from the heapq module if your k is small, relative to the total number of objects. If the k is large, you should probably sort the sequence (either using the sort method or using the sorted function) and pick out the first k objects. Time your results to see what works best—or just choose the version that makes your code as clear as possible.
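In code, the two alternatives look like this (toy data):

from heapq import nsmallest

seq = [7, 3, 9, 1, 5]
nsmallest(2, seq)                          # [1, 3]; good for small k
sorted(seq)[:2]                            # [1, 3]; often better for large k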
So, how can we take the next step, asymptotically, and remove dependence on k altogether? It turns out that
guaranteeing a linear worst case is a bit knotty, so let’s focus on the average case. Now, if I tell you to try applying the
idea of divide and conquer, what would you do? A first clue might be that we’re aiming for a linear running time; what
“divide by half” recurrence does that? It’s the one with a single recursive call (which is equivalent to the knockout
tournament sum): T(n) = T(n/2) + n, which unfolds to the geometric series n + n/2 + n/4 + ... + 1 < 2n, or Θ(n). In other words, we divide the problem in half (or, for now, in half on average) by performing linear work, just like the more canonical divide-and-conquer approach, but we manage to eliminate one half, taking us closer to binary search. What we need to figure out, in order to design this algorithm, is how to partition the data in linear time so that we end up with all our objects in one half.
As always, systematically going through the tools at our disposal, and framing the problem as clearly as we can,
makes it much easier to figure out a solution. We’ve arrived at a point where what we need is to partition a sequence
into two halves, one consisting of small values and one of large values. And we don’t have to guarantee that the halves
are equal—only that they’ll be equal on average. A simple way of doing this is to choose one of the values as a so-called
pivot and use it to divide the others: All those smaller than (or equal to) the pivot end up in the left half, while those
larger end up on the right. Listing 6-3 gives you a possible implementation of partition and select. Note that this version of partition is primarily meant to be readable; Exercise 6-11 asks you to see whether you can remove some overhead. The way select is written here, it returns the kth smallest element; if you’d rather have all the k smallest elements, you can simply rewrite it to return lo instead of pi.

⁵ In statistics, the median is also defined for sequences of even length. It is then the average of the two middle elements. That’s not an issue we worry about here.
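As a concrete sketch of what such a partition and select might look like (pivoting on the first element; this is an illustration and not necessarily identical to Listing 6-3):

def partition(seq):
    pi, rest = seq[0], seq[1:]             # Pick the first value as pivot
    lo = [x for x in rest if x <= pi]      # Values at most the pivot
    hi = [x for x in rest if x > pi]       # Values greater than the pivot
    return lo, pi, hi

def select(seq, k):
    # Return the kth smallest element of seq, counting from zero
    lo, pi, hi = partition(seq)
    m = len(lo)
    if m == k:
        return pi                          # The pivot is the answer
    elif m < k:
        return select(hi, k - m - 1)       # Recurse right; adjust k
    else:
        return select(lo, k)               # Recurse left; k unchanged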