Python Programming for Biology: Bioinformatics and Beyond

Figure 25.3. Trajectories of subsequent trials from Metropolis-Hastings Monte

Download 7,75 Mb.

Pdf ko'rish

bet	420/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 416 417 418 419 420 421 422 423 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Figure 25.3. Trajectories of subsequent trials from Metropolis-Hastings Monte

Carlo searches to minimise a shallow function. In this plot the lighter the background

shade the smaller is the value of a two-dimensional test function. The trajectories of test

points that result from applying the Metropolis-Hastings Monte Carlo method are

visualised as darker lines. This method works by selecting subsequent test points (after a

random start) relative to the position of the previous point, but only accepting the new

solution if it is better and occasionally if it is worse, with a probability specified by the

Metropolis-Hastings criterion. In this way the routes of accepted points will follow the

function to its minimum but have occasional jumps, which may be important in

overcoming local minima.

The travelling salesman problem solely involves considering the distances between

cities, but in other, more biological, situations there may be more information to reduce

the complexity of the problem. For example, in nuclear magnetic resonance (NMR) it is a

common operation to assign signals that occur in a spectrum to the amino acid residues of

a protein chain. The problem is that you don’t know which signal corresponds to which

amino acid in the sequence. Specific NMR experiments allow you to determine which

signals are likely to be neighbours in a sequence (although not unambiguously so) and

also what kinds of amino acid a signal could represent. Thus, although you can solve the

problem is the same basic way as the travelling salesman, to determine the best data

sequence you have considerably more information and can exclude many possibilities;

many data points are known to not be neighbours, and certain data points can be excluded

from particular locations in the sequence. Generalising for other biological problems, the

more information you have, to give better probabilities, the more robust the solution.

To solve the travelling salesman problem with a simple Python function we will first

make a few handy imports and then define a helper function, which has the job of

determining the total distance for a particular route between all the cities.

from math import sqrt, exp

from random import shuffle, randint

from numpy import array

The function getRouteLength() takes a dictionary containing the distances between

points (cities in our example) and the route to calculate the distance for. It should be noted

that if the distances between all of the relevant points can be determined in advance then

this will generally lead to a quicker program, given that we won’t have to repeat the

distance calculation. However, caching the distances in this way requires memory, which

may be insufficient for large problems. The innards of the function are fairly simple. The

initial distance is initialised as zero, then we go through each pair of subsequent points

visited in the route; the order of cities in the list defines the route. The inbuilt enumerate()

is used to extract the index and each location of the route at the same time. Note that this

loop only goes up to the penultimate point in the route (route[:-1]) so that we always have

room to fetch the next point in a pair: route[i+1]. The two neighbouring points are then

used as a key to look up the distance between them in the distanceData dictionary passed

in, which is then added to the total. If it is not practical to have a look-up table of distances

then this function could calculate the distances from the data points themselves.

Construction of the dictionary of city distances is illustrated in the function

calcCityDistances() discussed later.

def getRouteLength(distanceData, route):

distance = 0.0

for i, pointA in enumerate(route[:-1]):

pointB = route[i+1]

key = frozenset((pointA, pointB))

distance += distanceData[key]

return distance

Next we come to the function that does the actual solving of the travelling salesman

problem using the Monte Carlo method. The general idea is that we will start with a

random route (order of cities) and then swap random pairs of cities on the route to

generate the next test route. A test route will be accepted or rejected in the same manner

that we did for the previous function optimisation. Swapping pairs of cities is not the only

way that we could generate new routes (states) to test: for example, random neighbours

could be swapped. However, swapping completely random points works well for the test

problem and generally allows a wide range of solutions to be sampled. Swapping only

neighbours gives less drastic changes and may be better at refining good solutions, but

will not be as quick to search a wide variety of routes. Subsequently this method that uses

random swaps will be refined, using simulated annealing, so that we get an initially wide

and unbiased search that subsequently narrows to a refined solution.

The function travellingSalesman() is defined as taking the distanceData dictionary that

provides the distance look-up, a list of the cities involved and the number of search steps

to use. Naturally, the number of steps to use will depend on the number of cities that are

considered and acceptable values may be determined by experimentation.

def travellingSalesman(distanceData, cities, numSteps=10000):

The number of cities in the problem is recorded and the initial bestRoute list is a

randomly shuffled copy of the input list of cities.

n = len(cities)

bestRoute = cities[:]

shuffle(bestRoute)

The acceptance scores we use below will depend on the differences in route lengths, but

the magnitude of these distances will vary according to the situation at hand (in our

example we happen to be dealing with thousands of kilometres). Hence, so that it doesn’t

matter what the scale of the distances are, we calculate a normalising factor, which we call

scale. The heuristic we use here is to take half of the standard deviation of the inter-city

distances as the scale factor. This is easily calculated with the function std() inbuilt into

NumPy arrays:

dists = list(distanceData.values()) # list() not needed in Python 2

scale = 0.5 * array(dists).std()

The corresponding initial bestDistance uses the helper function getRouteLength() and

the dictionary of distance information to calculate the length of the initial route. Copies of

the initial route and distance are then made, which will be the starting point for the Monte

Carlo search.

bestDistance = getRouteLength(distanceData, bestRoute)

prevRoute = bestRoute

prevDistance = bestDistance

A for loop is defined to go through each of the search steps. Two indices a and b are

defined as random numbers, between zero and the index of the last city. These will be the

indices of the two cities that will be swapped in the route. The test route for this cycle is

initially defined as a copy of the previous ‘good’ route. Then the entries at the two random

indices are swapped; position a in the old route is b in the new one and vice versa.

for i in range(numSteps):

a = randint(0, n-1)

b = randint(0, n-1)

route = prevRoute[:]

route[a] = prevRoute[b]

route[b] = prevRoute[a]

With the test route defined, the length going through the cities is calculated. The score

(equivalent to a Metropolis-Hastings acceptance probability) is calculated as the exponent

of the difference in distances between the test route and the prior one, divided by the

normalising distance scale factor. The result of this is that if a test route length is longer by

the value of scale there is an e

−1

(37.8%) chance of acceptance.

distance = getRouteLength(distanceData, route)

score = exp((prevDistance-distance)/scale)

The score is compared against a random number in the range 0.0 to 1.0. If the new route

is shorter the score will be greater than 1.0 and it will definitely be accepted, otherwise it

is a matter of chance depending on how much longer the distance is. If the test route is

accepted it is recorded as prevRoute so that it will be the state to go from in the next cycle.

if score > uniform():

prevRoute = route

prevDistance = distance

A second test determines whether the distance is the overall shortest so far measured. If

so the route is remembered as the best one. The best distance and step number are printed

to indicate progress.

if distance < bestDistance:

bestRoute = route[:]

bestDistance = distance

print('%5d %.5f' % (i, distance))

return bestDistance, bestRoute

To finally test the function some distance information is required. Here this is done

using a dictionary that contains the coordinates (latitude and longitude in degrees) of the

cities concerned. Of course, this will only allow us to calculate approximate distance

across the Earth’s globe, rather than practical road or rail distances. Nonetheless, this is

good enough to illustrate the procedure.

cityCoords = {'Paris':(48.856667, 2.350833),

'Marseille':(43.296386, 5.369954),

'Lyon':(45.759723, 4.842223),

'Toulouse':(43.604503, 1.444026),

'Nice':(43.703393, 7.266274),

'Strasbourg':(48.584445, 7.748612),

'Nantes':(47.21806, -1.55278),

'Bordeaux':(44.838611, -0.578333),

'Montpellier':(43.61194, 3.87722),

'Rennes':(48.114722, -1.679444),

'Lille':(50.637222, 3.063333),

'Le Havre':(49.498889, 0.121111),

'Reims':(49.26278, 4.03472),

'Saint-Etienne':(45.434722, 4.390278),

'Toulon':(43.125, 5.930556)}

The next function uses the city coordinates to calculate a dictionary containing the

distances between each pair, hence we use two for loops to get all city pairs. The distance

metric derives from the length of the great arc between the cities, given that we are dealing

with positions on the surface of the Earth rather than a flat plane. Hence we first calculate

the central angle between the cities (in radians) and then multiply this by the radius of the

Earth. Note that the key to the distances dictionary is a Python frozenset object, thus the

key to get a distance value does not depend on the order of the two cities in the key. If we

used a Python tuple as a key then we would have to store both (cityA, cityB) and (cityB,

cityA) or always sort the key items. Here a simple Python set cannot be used, it must be

the frozen kind as dictionary keys cannot have modifiable values.

from math import acos, cos, sin, radians

def calcCityDistances(coordDict):

cities = list(coordDict.keys())

n = len(cities)

distances = {}

for i in range(n-1):

cityA = cities[i]

latA, longA = coordDict[cityA]

latA = radians(latA)

longA = radians(longA)

for j in range(i+1, n):

cityB = cities[j]

latB, longB = coordDict[cityB]

latB = radians(latB)

longB = radians(longB)

dLong = abs(longA - longB)

angle = acos(sin(latA)*sin(latB) + cos(latA)*cos(latB)*cos(dLong))

dist = angle * 6371.1 # Mean Earth radius (km)

key = frozenset((cityA, cityB))

distances[key] = dist

return distances

Using the distance information, the best route between the cities can hopefully be found

using the Monte Carlo method.

distances = calcCityDistances(cityCoords)

cities = list(cityCoords.keys()) # Use all the cities

dist, route = travellingSalesman(distances, cities, 1000000)

print('%.3f %s' % (dist, ', '.join(route)))

If all goes well, the test will give the following route, or its equally good reverse, with

the distance of 2465.56 km:

Strasbourg, Reims, Lille, Paris, Le Havre, Rennes, Nantes, Bordeaux,

Toulouse, Montpellier, Marseille, Toulon, Nice, Lyon, Saint-Etienne

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 416 417 418 419 420 421 422 423 ... 514