Python Programming for Biology: Bioinformatics and Beyond


Homologous structure alignment



Download 7,75 Mb.
Pdf ko'rish
bet231/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   227   228   229   230   231   232   233   234   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Homologous structure alignment

In  this  section  we  will  link  the  structural  examples  described  in  this  chapter  with  the

sequence alignment routine described in

Chapter 12

. The objective here is to superimpose

two  homologous  structures  (i.e.  they  have  a  common  ancestor)  which  have  similarly

shaped structures, but which have different residue sequences, and hence different atoms.

Because  our  structural  superimposition  function  only  optimises  the  transformation

between  two  lists  of  coordinates  that  are  of  the  same  size,  we  first  need  to  determine  a

subset  of  atoms  that  are  common  to  both  structures.  This  is  achieved  by  first  doing  a

sequence  alignment  to  gauge  which  residues  are  equivalent  (those  that  pair  up  in  the

alignment),  and  then  selecting  backbone  atoms  that  are  present  on  both  residues  of  each

pair. Despite there being potentially different side-chain atoms for the aligned residues the

backbone atoms will be common and we can use their coordinates alone to do a structural

alignment.

The  first  thing  we  need  before  doing  the  structural  alignment  is  to  convert  the

sequences,  of  two  molecular  chains  we  wish  to  align,  from  the  structure  data  model

representation  of  Residue  objects  into  one-letter  residue  codes.  Because  the  data  model

uses  three-letter  codes,  we  define  a  dictionary  that  gives  the  equivalent  one-letter  codes,

the required input to our previous sequence alignment function. Note that we have residue

codes for DNA and RNA as well as protein residues, although the nucleic acid codes are

not  truly  three-letter.

17

 We  need  such  unaltered  codes  in  our  dictionary  so  that  we  can



detect  an  unknown  code.  An  alternative  would  be  to  have  a  new  sequence  alignment

function that works with lists of three-letter codes, but that is more work.

THREE_LETTER_TO_ONE = {'ALA':'A','CYS':'C','ASP':'D','GLU':'E',

'PHE':'F','GLY':'G','HIS':'H','ILE':'I',

'LYS':'K','LEU':'L','MET':'M','ASN':'N',

'PRO':'P','GLN':'Q','ARG':'R','SER':'S',

'THR':'T','VAL':'V','TRP':'W','TYR':'Y',

'G':'G','C':'C','A':'A','T':'T','U':'U'}

Extracting  the  sequence  of  one-letter  residue  codes  from  a  chain  is  a  fairly  simple

matter  of  looping  through  each  of  the  residues  that  belong  to  the  input  chain.  For  each

residue  the  above  dictionary  is  used  to  convert  the  (mostly  three-letter)  PDB  residue

codes, which are used in the structure data model, into a string of one-letter codes that can

be used in the alignment routines. The one-letter codes are added to a list, which is then

joined  into  a  long  string  after  the  loop  is  done.  Note  that  the  dictionary  look-up  uses  a

.get() function call, which supplies an ‘X’ character for an unknown residue type that has

no key in the dictionary.

def getChainSequence(chain):

letters = []

for residue in chain.residues:

code = residue.code




letter = THREE_LETTER_TO_ONE.get(code, 'X')

letters.append(letter)

seq = ''.join(letters)

return seq

Now  we  define  the  function  that  does  the  sequence  alignment  to  get  the  pairs  of

corresponding residues. These pairs are then used to define a subset of common backbone

atoms which are used to perform the coordinate superimposition. Although this function is

fairly  long  and  may  look  complicated  at  first  glance,  it  is  mostly  connecting  together

existing  functionality  to  do  the  job.  Firstly,  we  make  sure  that  we  have  imported  the

required  substitution  matrix  (the  default  for  the  function)  and  the  function  that  will

perform the pairwise sequence alignment.

from Alignments import sequenceAlign, BLOSUM62

The function is defined and takes two Chain objects as mandatory input arguments and

has optional arguments for the names of the atoms to align and for the substitution matrix

to use in the alignment. The atom names default to the heavy polypeptide backbone atoms,

and  so  will  be  present  in  all  amino  acids  (including  proline).  The  substitution  matrix

defaults to a fairly general one, although a better one for structural purposes may of course

be  passed  in.  Both  defaulting  arguments  naturally  assume  that  our  molecular  chains  are

proteins. If we were using this function on nucleic acid structures we could need to input a

different  set  of  atom  names,  for  the  ribose  sugar  phosphate  backbone,  and  a  different

substitution matrix.

def seqStructureBackboneAlign(chainA, chainB,

atomNames=set(['CA','C','N']),

simMatrix=BLOSUM62):

Inside  the  function  we  initially  find  the  two  Structure  objects  that  contain  the  input

chains  (following  the  parent  link).  These  objects  will  be  used  later  on  in  selecting

structural  subsets,  and  when  applying  the  transformations  for  the  final  coordinate

superimposition. Then we use the chains to get a complete list of all Residue objects that

we are going to align.

structureA = chainA.structure

structureB = chainB.structure

residuesA = chainA.residues

residuesB = chainB.residues

The  getChainSequence()  function  defined  earlier  is  used  to  obtain  the  one-letter

sequence strings from the two Chain objects, so that we can pass input to the function that

performs the pairwise sequence alignment (see

Chapter 12

). The alignment strings that are

passed back from the function will contain gaps (‘-’ symbol) to indicate how one sequence

is aligned with the other. Gaps will indicate that a residue has no equivalent in the other

sequence, but otherwise we will have pairs of residue letters that indicate which positions

are deemed to be equivalent when comparing the two molecules. Although in this instance

we are ignoring the alignment score that is generated, we could be more prudent and use it

to check whether the sequence alignment is sufficiently good to proceed further. Note that




in order to align the two sequences we also pass in the similarity matrix as an argument.

seqA = getChainSequence(chainA)

seqB = getChainSequence(chainB)

score, alignA, alignB = sequenceAlign(seqA, seqB, simMatrix)

Next  a  few  variables  are  initialised:  two  lists  to  contain  the  locations  of  the  aligned

residues (relative to their position in the chain, not the alignment) and positional counters

that are used to track these locations as we look through the sequence alignment.

pairedPosA = []

pairedPosB = []

posA = 0


posB = 0

Then the output strings from the alignment (alignA and alignB) are interrogated to find

the residue pairs. This is achieved by defining a loop that goes through the complete range

of index positions in the alignment string. For each index i, we look at the residue codes at

that alignment position and work out the position of each residue relative to the start of its

chain  (posA,  posB).  In  essence  the  sequence  position  of  each  residue  in  the  molecule  is

out  of  step  with  the  position  in  the  alignment  because  of  the  gaps  that  are  inserted  as

padding,  to  make  the  alignment  possible.  If  at  a  given  position  there  is  a  gap  symbol  in

one of the alignment strings then we know that this represents a point where a residue on

one sequence has no equivalent in the other sequence. Accordingly, we increase a counter

(+= 1) for the sequence that does not have the gap, because for this sequence we still go

one  position  along  the  chain.  Hence,  if  alignA[i]  is  a  gap,  we  increment  posB,  and  vice

versa. Otherwise, if there are no gaps, both residue chain positions are added to the lists

that record the aligned residue pair and both positions increment for the next loop. Note

that we add the old residue positions before the increment, so that we start the positions

from zero; handy for Python lists.

for i in range(len(alignA)):

# No dashes in both at same location

if alignA[i] == '-':

posB += 1

elif alignB[i] == '-':

posA += 1

else:

pairedPosA.append(posA)



pairedPosB.append(posB)

posA += 1

posB += 1

Then we simply use the list of paired alignment positions, which are the indices to the

residues  in  their  respective  lists,  to  get  hold  of  the  required  Residue  object  and  place  its

number  (seqId  attribute)  in  a  list.  In  essence,  the  positions  in  the  alignment  are  not  the

same thing as the residue number in the structures, so we need a conversion between the

two.



filterIdsA = [residuesA[p].seqId for p in pairedPosA]

filterIdsB = [residuesB[p].seqId for p in pairedPosB]

The  filterSubStructure()function  defined  above,  which  takes  a  structure  and  copies

specified parts, is used to define new Structure objects representing only backbone atoms.

Note that we make a filter to select the required coordinates using the chain’s code, a list

of residue numbers obtained via the sequence alignment, and the names of the backbone

atoms  which  the  structures  will  have  in  common.  The  resulting  sub-structures  will  have

exactly the same number of atoms, with a one-to-one correspondence that can be used for

structure superimposition.

backboneStrucA = filterSubStructure(structureA, [chainA.code],

filterIdsA, atomNames)

backboneStrucB = filterSubStructure(structureB, [chainB.code],

filterIdsB, atomNames)

For each of the backbone-only sub-structures we collect a list of atoms and the array of

coordinates (we move from the data model to the NumPy array). These atom lists are not

used in the end, but are generated by the function in any case. The weights that are used to

perform the superimposition are all set to 1.0 for each atom, so all coordinates carry equal

worth in the initial instance.

atomsA, coordsA = getAtomCoords(backboneStrucA)

atomsB, coordsB = getAtomCoords(backboneStrucB)

weights = ones(len(atomsA))

The  two  sets  of  coordinates  are  moved  to  the  centre  (zero  on  all  axes)  using  the

previously  defined  function.  It  is  important  to  do  this  so  that  the  remainder  of  the

structural  superimposition,  described  by  a  rotation,  can  be  estimated.  Note  that  the

translations that were applied to move the structures are recorded, i.e. in the centerA  and

centerB vectors, so that we can use them again when we align the full structures.

coordsA, centerA = centerCoords(coordsA, weights)

coordsB, centerB = centerCoords(coordsB, weights)

With  the  coordinates  extracted,  for  those  atoms  we  wish  to  superimpose,  we  use  the

superimposition  function  we  described  above,  noting  that  the  input  argument  is  a  list  of

coordinates. In this instance we are not actually interested in the relocated coordinates, but

rather in the rotational transformation that was applied. Also, we collect the RMSD values

to report at the end.

coords = [coordsA, coordsB]

c, rmsds, atomRmsds, rotations = superimposeCoords(coords, weights)

The  rotations  that  were  recorded,  from  the  coordinate  superimposition,  and  the

locations  of  the  coordinate  centres  are  used  to  transform  the  original  structures.  Thus,

although this transformation data is calculated only using the common, sequence-aligned,

subset  of  backbone  atoms,  the  transformations  are  applied  to  all  of  the  atoms  from  both

structures.  Finally,  at  the  end  of  the  function  the  RMSD  values  for  the  chains  and

individual atoms are returned, to see how well the structures are superimposed.

affineTransformStructure(structureA, rotations[0], -centerA)




affineTransformStructure(structureB, rotations[1], -centerB)

return rmsds, atomRmsds

We can test the function with two known homologous proteins. Here we download the

PDB  file  data  and  extract  the  file  data  to  define  Structure  objects.  Note  that  in  this

example we only use the first available coordinate set (hence the [0] to select the first). For

the  first  structure  this  means  the  first  conformation  of  an  NMR-derived  ensemble  of

structures.  The  second  structure  only  has  one  set  of  coordinates  in  any  case,  the  usual

situation for crystal structures. From both structures we select the chains with code ‘A’ and

perform the sequence-coordinate alignment.

struc1 = getStructuresFromFile(downloadPDB('1UST'))[0]

struc2 = getStructuresFromFile(downloadPDB('1HST'))[0]

chain1 = struc1.getChain('A')

chain2 = struc2.getChain('A')

rmsds, atomRmsds = seqStructureBackboneAlign(chain1, chain2)

print(rmsds)


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   227   228   229   230   231   232   233   234   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish