Selection
I’ll round off this section on “searching by half” with an algorithm you may not end up using a lot in practice but that takes the idea of bisection in an interesting direction. Besides, it sets the stage for quicksort (next section), which is one of the classics.
The problem is to find the kth smallest number in an unsorted sequence, in linear time. The most important case is, perhaps, to find the median: the element that would have been in the middle position (that is, (n+1)//2), had the sequence been sorted. (In statistics, the median is also defined for sequences of even length, as the average of the two middle elements; that’s not an issue we worry about here.)
Interestingly, as a side effect of how the algorithm works, it will also allow us to identify which objects are smaller than the object we seek. That means we’ll be able to find the k smallest (and, simultaneously, the n - k largest) elements with a running time of Θ(n), meaning that the value of k doesn’t matter!
This may be stranger than it seems at first glance. The running time constraint rules out sorting (unless we can count occurrences and use counting sort, as discussed in Chapter 4). Any other obvious algorithm for finding the k smallest objects would use some data structure to keep track of them. For example, you could use an approach similar to insertion sort: Keep the k smallest objects found so far either at the beginning of the sequence or in a separate sequence. If you kept track of which one of them was largest, checking each large object in the main sequence would be fast (just a constant-time check). If you needed to add an object, though, and you already had k, you’d have to remove one. You’d remove the largest, of course, but then you’d have to find out which one was now largest. You could keep them sorted (that is, stay close to insertion sort), but the running time would be Θ(nk) anyway.
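As a concrete sketch of this “partial insertion sort” (my own illustration, not code from the book; the function name and the use of bisect.insort are assumptions):

from bisect import insort

def k_smallest_insertion(seq, k):
    # Keep the k smallest objects seen so far in a sorted list.
    # Rejecting an object is a constant-time comparison with best[-1],
    # but each insertion may shift up to k elements, so the worst-case
    # running time is Theta(nk).
    if k <= 0: return []
    best = []                      # Sorted; never more than k long
    for x in seq:
        if len(best) < k:
            insort(best, x)        # Still collecting the first k
        elif x < best[-1]:         # Smaller than our current largest?
            best.pop()             # Drop the largest ...
            insort(best, x)        # ... and insert x in sorted order
    return best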
One step up from this (asymptotically) would be to use a heap, essentially transforming our “partial insertion sort” into a “partial heap sort,” making sure that there are never more than k elements in the heap. (See the “Black Box” sidebar about binary heaps, heapq, and heapsort for more information.) This would give you a running time of Θ(n lg k), and for a reasonably small k, this is almost identical to Θ(n), and it lets you iterate over the main sequence without jumping around in memory, so in practice it might be the solution of choice.
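Here is what such a bounded heap might look like (again a sketch of my own; heapq implements a min-heap, so the values are negated to simulate a max-heap of the k smallest):

from heapq import heapify, heapreplace
from itertools import islice

def k_smallest_heap(seq, k):
    # A "partial heapsort": a max-heap (negated values in a min-heap)
    # holding the k smallest objects seen so far. Each object costs at
    # most O(lg k) heap work, for a total of Theta(n lg k).
    if k <= 0: return []
    it = iter(seq)
    heap = [-x for x in islice(it, k)]  # The first k objects, negated
    heapify(heap)
    for x in it:
        if -x > heap[0]:           # x is smaller than the largest kept
            heapreplace(heap, -x)  # Pop the largest, push x
    return sorted(-y for y in heap)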
Tip ■ If you’re looking for the k smallest (or largest) objects in an iterable in Python, you would probably use the nsmallest (or nlargest) function from the heapq module if your k is small, relative to the total number of objects. If k is large, you should probably sort the sequence (either using the sort method or using the sorted function) and pick out the k first objects. Time your results to see what works best, or just choose the version that makes your code as clear as possible.
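For example (illustrative values, not from the book):

from heapq import nsmallest

data = [7, 3, 9, 1, 4, 8, 2]
nsmallest(3, data)    # [1, 2, 3], using a bounded heap internally
sorted(data)[:3]      # [1, 2, 3], usually preferable for large k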
So, how can we take the next step, asymptotically, and remove dependence on k altogether? It turns out that guaranteeing a linear worst case is a bit knotty, so let’s focus on the average case. Now, if I tell you to try applying the idea of divide and conquer, what would you do? A first clue might be that we’re aiming for a linear running time; what “divide by half” recurrence gives us that? It’s the one with a single recursive call (which is equivalent to the knockout tournament sum): T(n) = T(n/2) + n. In other words, we divide the problem in half (or, for now, in half on average) by performing linear work, just like the more canonical divide-and-conquer approach, but we manage to eliminate one half, taking us closer to binary search. What we need to figure out, in order to design this algorithm, is how to partition the data in linear time so that the objects we’re looking for all end up in one half.
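To see why this recurrence gives linear time, unroll it (assuming, for simplicity, that n is a power of 2):

T(n) = n + T(n/2)
     = n + n/2 + T(n/4)
     = n + n/2 + n/4 + ... + 1
     < 2n

So T(n) is Θ(n).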
As always, systematically going through the tools at our disposal, and framing the problem as clearly as we can, makes it much easier to figure out a solution. We’ve arrived at a point where what we need is to partition a sequence into two halves, one consisting of small values and one of large values. And we don’t have to guarantee that the halves are equal, only that they’ll be equal on average. A simple way of doing this is to choose one of the values as a so-called pivot and use it to divide the others: All those smaller than (or equal to) the pivot end up in the left half, while those larger end up on the right.
Listing 6-3 gives you a possible implementation of partition and select. Note that this version of partition is primarily meant to be readable; Exercise 6-11 asks you to see whether you can remove some overhead. The way select is written here, it returns the kth smallest element; if you’d rather have all the k smallest elements, you can simply rewrite it to return lo instead of pi.
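Since the listing itself falls outside this excerpt, here is a sketch of what partition and select might look like (a reconstruction in the spirit of Listing 6-3, not necessarily the book’s exact code):

def partition(seq):
    pi, seq = seq[0], seq[1:]         # Pick and remove the pivot
    lo = [x for x in seq if x <= pi]  # Smaller than (or equal to) the pivot
    hi = [x for x in seq if x > pi]   # Larger than the pivot
    return lo, pi, hi

def select(seq, k):
    lo, pi, hi = partition(seq)
    m = len(lo)
    if m == k:
        return pi                     # pi would sit at index k if seq were sorted
    elif m < k:
        return select(hi, k-m-1)      # Look right; adjust k for lo and pi
    else:
        return select(lo, k)          # Look left; k is unchanged

In the terminal call, lo holds exactly the k smallest elements, which is why returning lo instead of pi gives you all of them. Note that with the first element as pivot, an already sorted sequence triggers the quadratic worst case; the linear running time holds on average.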