Homologous structure alignment
In this section we will link the structural examples described in this chapter with the
sequence alignment routine described in
Chapter 12
. The objective here is to superimpose
two homologous structures (i.e. they have a common ancestor) which have similarly
shaped structures, but which have different residue sequences, and hence different atoms.
Because our structural superimposition function only optimises the transformation
between two lists of coordinates that are of the same size, we first need to determine a
subset of atoms that are common to both structures. This is achieved by first doing a
sequence alignment to gauge which residues are equivalent (those that pair up in the
alignment), and then selecting backbone atoms that are present on both residues of each
pair. Despite there being potentially different side-chain atoms for the aligned residues the
backbone atoms will be common and we can use their coordinates alone to do a structural
alignment.
The first thing we need before doing the structural alignment is to convert the
sequences, of two molecular chains we wish to align, from the structure data model
representation of Residue objects into one-letter residue codes. Because the data model
uses three-letter codes, we define a dictionary that gives the equivalent one-letter codes,
the required input to our previous sequence alignment function. Note that we have residue
codes for DNA and RNA as well as protein residues, although the nucleic acid codes are
not truly three-letter.
17
We need such unaltered codes in our dictionary so that we can
detect an unknown code. An alternative would be to have a new sequence alignment
function that works with lists of three-letter codes, but that is more work.
THREE_LETTER_TO_ONE = {'ALA':'A','CYS':'C','ASP':'D','GLU':'E',
'PHE':'F','GLY':'G','HIS':'H','ILE':'I',
'LYS':'K','LEU':'L','MET':'M','ASN':'N',
'PRO':'P','GLN':'Q','ARG':'R','SER':'S',
'THR':'T','VAL':'V','TRP':'W','TYR':'Y',
'G':'G','C':'C','A':'A','T':'T','U':'U'}
Extracting the sequence of one-letter residue codes from a chain is a fairly simple
matter of looping through each of the residues that belong to the input chain. For each
residue the above dictionary is used to convert the (mostly three-letter) PDB residue
codes, which are used in the structure data model, into a string of one-letter codes that can
be used in the alignment routines. The one-letter codes are added to a list, which is then
joined into a long string after the loop is done. Note that the dictionary look-up uses a
.get() function call, which supplies an ‘X’ character for an unknown residue type that has
no key in the dictionary.
def getChainSequence(chain):
letters = []
for residue in chain.residues:
code = residue.code
letter = THREE_LETTER_TO_ONE.get(code, 'X')
letters.append(letter)
seq = ''.join(letters)
return seq
Now we define the function that does the sequence alignment to get the pairs of
corresponding residues. These pairs are then used to define a subset of common backbone
atoms which are used to perform the coordinate superimposition. Although this function is
fairly long and may look complicated at first glance, it is mostly connecting together
existing functionality to do the job. Firstly, we make sure that we have imported the
required substitution matrix (the default for the function) and the function that will
perform the pairwise sequence alignment.
from Alignments import sequenceAlign, BLOSUM62
The function is defined and takes two Chain objects as mandatory input arguments and
has optional arguments for the names of the atoms to align and for the substitution matrix
to use in the alignment. The atom names default to the heavy polypeptide backbone atoms,
and so will be present in all amino acids (including proline). The substitution matrix
defaults to a fairly general one, although a better one for structural purposes may of course
be passed in. Both defaulting arguments naturally assume that our molecular chains are
proteins. If we were using this function on nucleic acid structures we could need to input a
different set of atom names, for the ribose sugar phosphate backbone, and a different
substitution matrix.
def seqStructureBackboneAlign(chainA, chainB,
atomNames=set(['CA','C','N']),
simMatrix=BLOSUM62):
Inside the function we initially find the two Structure objects that contain the input
chains (following the parent link). These objects will be used later on in selecting
structural subsets, and when applying the transformations for the final coordinate
superimposition. Then we use the chains to get a complete list of all Residue objects that
we are going to align.
structureA = chainA.structure
structureB = chainB.structure
residuesA = chainA.residues
residuesB = chainB.residues
The getChainSequence() function defined earlier is used to obtain the one-letter
sequence strings from the two Chain objects, so that we can pass input to the function that
performs the pairwise sequence alignment (see
Chapter 12
). The alignment strings that are
passed back from the function will contain gaps (‘-’ symbol) to indicate how one sequence
is aligned with the other. Gaps will indicate that a residue has no equivalent in the other
sequence, but otherwise we will have pairs of residue letters that indicate which positions
are deemed to be equivalent when comparing the two molecules. Although in this instance
we are ignoring the alignment score that is generated, we could be more prudent and use it
to check whether the sequence alignment is sufficiently good to proceed further. Note that
in order to align the two sequences we also pass in the similarity matrix as an argument.
seqA = getChainSequence(chainA)
seqB = getChainSequence(chainB)
score, alignA, alignB = sequenceAlign(seqA, seqB, simMatrix)
Next a few variables are initialised: two lists to contain the locations of the aligned
residues (relative to their position in the chain, not the alignment) and positional counters
that are used to track these locations as we look through the sequence alignment.
pairedPosA = []
pairedPosB = []
posA = 0
posB = 0
Then the output strings from the alignment (alignA and alignB) are interrogated to find
the residue pairs. This is achieved by defining a loop that goes through the complete range
of index positions in the alignment string. For each index i, we look at the residue codes at
that alignment position and work out the position of each residue relative to the start of its
chain (posA, posB). In essence the sequence position of each residue in the molecule is
out of step with the position in the alignment because of the gaps that are inserted as
padding, to make the alignment possible. If at a given position there is a gap symbol in
one of the alignment strings then we know that this represents a point where a residue on
one sequence has no equivalent in the other sequence. Accordingly, we increase a counter
(+= 1) for the sequence that does not have the gap, because for this sequence we still go
one position along the chain. Hence, if alignA[i] is a gap, we increment posB, and vice
versa. Otherwise, if there are no gaps, both residue chain positions are added to the lists
that record the aligned residue pair and both positions increment for the next loop. Note
that we add the old residue positions before the increment, so that we start the positions
from zero; handy for Python lists.
for i in range(len(alignA)):
# No dashes in both at same location
if alignA[i] == '-':
posB += 1
elif alignB[i] == '-':
posA += 1
else:
pairedPosA.append(posA)
pairedPosB.append(posB)
posA += 1
posB += 1
Then we simply use the list of paired alignment positions, which are the indices to the
residues in their respective lists, to get hold of the required Residue object and place its
number (seqId attribute) in a list. In essence, the positions in the alignment are not the
same thing as the residue number in the structures, so we need a conversion between the
two.
filterIdsA = [residuesA[p].seqId for p in pairedPosA]
filterIdsB = [residuesB[p].seqId for p in pairedPosB]
The filterSubStructure()function defined above, which takes a structure and copies
specified parts, is used to define new Structure objects representing only backbone atoms.
Note that we make a filter to select the required coordinates using the chain’s code, a list
of residue numbers obtained via the sequence alignment, and the names of the backbone
atoms which the structures will have in common. The resulting sub-structures will have
exactly the same number of atoms, with a one-to-one correspondence that can be used for
structure superimposition.
backboneStrucA = filterSubStructure(structureA, [chainA.code],
filterIdsA, atomNames)
backboneStrucB = filterSubStructure(structureB, [chainB.code],
filterIdsB, atomNames)
For each of the backbone-only sub-structures we collect a list of atoms and the array of
coordinates (we move from the data model to the NumPy array). These atom lists are not
used in the end, but are generated by the function in any case. The weights that are used to
perform the superimposition are all set to 1.0 for each atom, so all coordinates carry equal
worth in the initial instance.
atomsA, coordsA = getAtomCoords(backboneStrucA)
atomsB, coordsB = getAtomCoords(backboneStrucB)
weights = ones(len(atomsA))
The two sets of coordinates are moved to the centre (zero on all axes) using the
previously defined function. It is important to do this so that the remainder of the
structural superimposition, described by a rotation, can be estimated. Note that the
translations that were applied to move the structures are recorded, i.e. in the centerA and
centerB vectors, so that we can use them again when we align the full structures.
coordsA, centerA = centerCoords(coordsA, weights)
coordsB, centerB = centerCoords(coordsB, weights)
With the coordinates extracted, for those atoms we wish to superimpose, we use the
superimposition function we described above, noting that the input argument is a list of
coordinates. In this instance we are not actually interested in the relocated coordinates, but
rather in the rotational transformation that was applied. Also, we collect the RMSD values
to report at the end.
coords = [coordsA, coordsB]
c, rmsds, atomRmsds, rotations = superimposeCoords(coords, weights)
The rotations that were recorded, from the coordinate superimposition, and the
locations of the coordinate centres are used to transform the original structures. Thus,
although this transformation data is calculated only using the common, sequence-aligned,
subset of backbone atoms, the transformations are applied to all of the atoms from both
structures. Finally, at the end of the function the RMSD values for the chains and
individual atoms are returned, to see how well the structures are superimposed.
affineTransformStructure(structureA, rotations[0], -centerA)
affineTransformStructure(structureB, rotations[1], -centerB)
return rmsds, atomRmsds
We can test the function with two known homologous proteins. Here we download the
PDB file data and extract the file data to define Structure objects. Note that in this
example we only use the first available coordinate set (hence the [0] to select the first). For
the first structure this means the first conformation of an NMR-derived ensemble of
structures. The second structure only has one set of coordinates in any case, the usual
situation for crystal structures. From both structures we select the chains with code ‘A’ and
perform the sequence-coordinate alignment.
struc1 = getStructuresFromFile(downloadPDB('1UST'))[0]
struc2 = getStructuresFromFile(downloadPDB('1HST'))[0]
chain1 = struc1.getChain('A')
chain2 = struc2.getChain('A')
rmsds, atomRmsds = seqStructureBackboneAlign(chain1, chain2)
print(rmsds)
Do'stlaringiz bilan baham: |