Python Programming for Biology: Bioinformatics and Beyond

Homologous structure alignment

Download 7,75 Mb.

Pdf ko'rish

bet	231/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 227 228 229 230 231 232 233 234 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Homologous structure alignment

In this section we will link the structural examples described in this chapter with the

sequence alignment routine described in

Chapter 12

. The objective here is to superimpose

two homologous structures (i.e. they have a common ancestor) which have similarly

shaped structures, but which have different residue sequences, and hence different atoms.

Because our structural superimposition function only optimises the transformation

between two lists of coordinates that are of the same size, we first need to determine a

subset of atoms that are common to both structures. This is achieved by first doing a

sequence alignment to gauge which residues are equivalent (those that pair up in the

alignment), and then selecting backbone atoms that are present on both residues of each

pair. Despite there being potentially different side-chain atoms for the aligned residues the

backbone atoms will be common and we can use their coordinates alone to do a structural

alignment.

The first thing we need before doing the structural alignment is to convert the

sequences, of two molecular chains we wish to align, from the structure data model

representation of Residue objects into one-letter residue codes. Because the data model

uses three-letter codes, we define a dictionary that gives the equivalent one-letter codes,

the required input to our previous sequence alignment function. Note that we have residue

codes for DNA and RNA as well as protein residues, although the nucleic acid codes are

not truly three-letter.

We need such unaltered codes in our dictionary so that we can

detect an unknown code. An alternative would be to have a new sequence alignment

function that works with lists of three-letter codes, but that is more work.

THREE_LETTER_TO_ONE = {'ALA':'A','CYS':'C','ASP':'D','GLU':'E',

'PHE':'F','GLY':'G','HIS':'H','ILE':'I',

'LYS':'K','LEU':'L','MET':'M','ASN':'N',

'PRO':'P','GLN':'Q','ARG':'R','SER':'S',

'THR':'T','VAL':'V','TRP':'W','TYR':'Y',

'G':'G','C':'C','A':'A','T':'T','U':'U'}

Extracting the sequence of one-letter residue codes from a chain is a fairly simple

matter of looping through each of the residues that belong to the input chain. For each

residue the above dictionary is used to convert the (mostly three-letter) PDB residue

codes, which are used in the structure data model, into a string of one-letter codes that can

be used in the alignment routines. The one-letter codes are added to a list, which is then

joined into a long string after the loop is done. Note that the dictionary look-up uses a

.get() function call, which supplies an ‘X’ character for an unknown residue type that has

no key in the dictionary.

def getChainSequence(chain):

letters = []

for residue in chain.residues:

code = residue.code

letter = THREE_LETTER_TO_ONE.get(code, 'X')

letters.append(letter)

seq = ''.join(letters)

return seq

Now we define the function that does the sequence alignment to get the pairs of

corresponding residues. These pairs are then used to define a subset of common backbone

atoms which are used to perform the coordinate superimposition. Although this function is

fairly long and may look complicated at first glance, it is mostly connecting together

existing functionality to do the job. Firstly, we make sure that we have imported the

required substitution matrix (the default for the function) and the function that will

perform the pairwise sequence alignment.

from Alignments import sequenceAlign, BLOSUM62

The function is defined and takes two Chain objects as mandatory input arguments and

has optional arguments for the names of the atoms to align and for the substitution matrix

to use in the alignment. The atom names default to the heavy polypeptide backbone atoms,

and so will be present in all amino acids (including proline). The substitution matrix

defaults to a fairly general one, although a better one for structural purposes may of course

be passed in. Both defaulting arguments naturally assume that our molecular chains are

proteins. If we were using this function on nucleic acid structures we could need to input a

different set of atom names, for the ribose sugar phosphate backbone, and a different

substitution matrix.

def seqStructureBackboneAlign(chainA, chainB,

atomNames=set(['CA','C','N']),

simMatrix=BLOSUM62):

Inside the function we initially find the two Structure objects that contain the input

chains (following the parent link). These objects will be used later on in selecting

structural subsets, and when applying the transformations for the final coordinate

superimposition. Then we use the chains to get a complete list of all Residue objects that

we are going to align.

structureA = chainA.structure

structureB = chainB.structure

residuesA = chainA.residues

residuesB = chainB.residues

The getChainSequence() function defined earlier is used to obtain the one-letter

sequence strings from the two Chain objects, so that we can pass input to the function that

performs the pairwise sequence alignment (see

Chapter 12

). The alignment strings that are

passed back from the function will contain gaps (‘-’ symbol) to indicate how one sequence

is aligned with the other. Gaps will indicate that a residue has no equivalent in the other

sequence, but otherwise we will have pairs of residue letters that indicate which positions

are deemed to be equivalent when comparing the two molecules. Although in this instance

we are ignoring the alignment score that is generated, we could be more prudent and use it

to check whether the sequence alignment is sufficiently good to proceed further. Note that

in order to align the two sequences we also pass in the similarity matrix as an argument.

seqA = getChainSequence(chainA)

seqB = getChainSequence(chainB)

score, alignA, alignB = sequenceAlign(seqA, seqB, simMatrix)

Next a few variables are initialised: two lists to contain the locations of the aligned

residues (relative to their position in the chain, not the alignment) and positional counters

that are used to track these locations as we look through the sequence alignment.

pairedPosA = []

pairedPosB = []

posA = 0

posB = 0

Then the output strings from the alignment (alignA and alignB) are interrogated to find

the residue pairs. This is achieved by defining a loop that goes through the complete range

of index positions in the alignment string. For each index i, we look at the residue codes at

that alignment position and work out the position of each residue relative to the start of its

chain (posA, posB). In essence the sequence position of each residue in the molecule is

out of step with the position in the alignment because of the gaps that are inserted as

padding, to make the alignment possible. If at a given position there is a gap symbol in

one of the alignment strings then we know that this represents a point where a residue on

one sequence has no equivalent in the other sequence. Accordingly, we increase a counter

(+= 1) for the sequence that does not have the gap, because for this sequence we still go

one position along the chain. Hence, if alignA[i] is a gap, we increment posB, and vice

versa. Otherwise, if there are no gaps, both residue chain positions are added to the lists

that record the aligned residue pair and both positions increment for the next loop. Note

that we add the old residue positions before the increment, so that we start the positions

from zero; handy for Python lists.

for i in range(len(alignA)):

# No dashes in both at same location

if alignA[i] == '-':

posB += 1

elif alignB[i] == '-':

posA += 1

else:

pairedPosA.append(posA)

pairedPosB.append(posB)

posA += 1

posB += 1

Then we simply use the list of paired alignment positions, which are the indices to the

residues in their respective lists, to get hold of the required Residue object and place its

number (seqId attribute) in a list. In essence, the positions in the alignment are not the

same thing as the residue number in the structures, so we need a conversion between the

two.

filterIdsA = [residuesA[p].seqId for p in pairedPosA]

filterIdsB = [residuesB[p].seqId for p in pairedPosB]

The filterSubStructure()function defined above, which takes a structure and copies

specified parts, is used to define new Structure objects representing only backbone atoms.

Note that we make a filter to select the required coordinates using the chain’s code, a list

of residue numbers obtained via the sequence alignment, and the names of the backbone

atoms which the structures will have in common. The resulting sub-structures will have

exactly the same number of atoms, with a one-to-one correspondence that can be used for

structure superimposition.

backboneStrucA = filterSubStructure(structureA, [chainA.code],

filterIdsA, atomNames)

backboneStrucB = filterSubStructure(structureB, [chainB.code],

filterIdsB, atomNames)

For each of the backbone-only sub-structures we collect a list of atoms and the array of

coordinates (we move from the data model to the NumPy array). These atom lists are not

used in the end, but are generated by the function in any case. The weights that are used to

perform the superimposition are all set to 1.0 for each atom, so all coordinates carry equal

worth in the initial instance.

atomsA, coordsA = getAtomCoords(backboneStrucA)

atomsB, coordsB = getAtomCoords(backboneStrucB)

weights = ones(len(atomsA))

The two sets of coordinates are moved to the centre (zero on all axes) using the

previously defined function. It is important to do this so that the remainder of the

structural superimposition, described by a rotation, can be estimated. Note that the

translations that were applied to move the structures are recorded, i.e. in the centerA and

centerB vectors, so that we can use them again when we align the full structures.

coordsA, centerA = centerCoords(coordsA, weights)

coordsB, centerB = centerCoords(coordsB, weights)

With the coordinates extracted, for those atoms we wish to superimpose, we use the

superimposition function we described above, noting that the input argument is a list of

coordinates. In this instance we are not actually interested in the relocated coordinates, but

rather in the rotational transformation that was applied. Also, we collect the RMSD values

to report at the end.

coords = [coordsA, coordsB]

c, rmsds, atomRmsds, rotations = superimposeCoords(coords, weights)

The rotations that were recorded, from the coordinate superimposition, and the

locations of the coordinate centres are used to transform the original structures. Thus,

although this transformation data is calculated only using the common, sequence-aligned,

subset of backbone atoms, the transformations are applied to all of the atoms from both

structures. Finally, at the end of the function the RMSD values for the chains and

individual atoms are returned, to see how well the structures are superimposed.

affineTransformStructure(structureA, rotations[0], -centerA)

affineTransformStructure(structureB, rotations[1], -centerB)

return rmsds, atomRmsds

We can test the function with two known homologous proteins. Here we download the

PDB file data and extract the file data to define Structure objects. Note that in this

example we only use the first available coordinate set (hence the [0] to select the first). For

the first structure this means the first conformation of an NMR-derived ensemble of

structures. The second structure only has one set of coordinates in any case, the usual

situation for crystal structures. From both structures we select the chains with code ‘A’ and

perform the sequence-coordinate alignment.

struc1 = getStructuresFromFile(downloadPDB('1UST'))[0]

struc2 = getStructuresFromFile(downloadPDB('1HST'))[0]

chain1 = struc1.getChain('A')

chain2 = struc2.getChain('A')

rmsds, atomRmsds = seqStructureBackboneAlign(chain1, chain2)

print(rmsds)

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 227 228 229 230 231 232 233 234 ... 514