Python Programming for Biology: Bioinformatics and Beyond


Using biological sequences in computing



[Tim J. Stevens, Wayne Boucher] Python Programming


Whatever the origins of a biological sequence, before writing programs to work with biological sequence information one must first have the sequences represented in some data structure; ideally this should suit the purpose of any subsequent analyses. There are various ways in which people store sequences, ranging from the simplistic to the exceedingly complex, and each will have its own advantages and disadvantages.

The commonest and simplest method is to store sequences as text, i.e. as strings of letters, where each letter represents a different kind of residue. Thus for DNA and RNA we will be working with alphabets of four letters, representing nucleotides, and for proteins an alphabet of 20, representing amino acids. For the standard set of residues that make up the majority of biological polymers, this representation is sufficient as we have more than enough letters on a standard keyboard. However, a simple one-letter representation is not good enough if we need to describe unusual amino acids (both naturally modified and artificially created). In such circumstances people usually resort to three-letter code strings for amino acids: for example one can distinguish between proline ‘PRO’ and hydroxyproline ‘HYP’. One can go further still and define a biological sequence as a series of purpose-made object data structures, rather than a series of text codes. While using lists of complex objects will be cumbersome and unnecessary for many tasks, they are certainly a good choice if you need to work with the underlying atoms within a residue, as is the case in structural biochemistry.
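
The three levels of representation described above can be sketched as follows. This is only an illustration under stated assumptions: the sequences are made up, and the `Atom` and `Residue` classes are hypothetical minimal stand-ins, not the data structures used elsewhere in this book or in any library.

```python
# One-letter codes: a plain Python string (made-up example sequences).
dna_seq = "ATGGTGCATCTG"
protein_seq = "MVHLTPEEK"

# Three-letter codes: a list of strings, which leaves room for
# non-standard residues such as hydroxyproline ('HYP').
peptide = ["GLY", "PRO", "HYP", "GLY", "ALA", "HYP"]


class Atom:
    """Hypothetical minimal atom record: a name and a chemical element."""

    def __init__(self, name, element):
        self.name = name
        self.element = element


class Residue:
    """Hypothetical residue object holding a code and a list of atoms."""

    def __init__(self, code, atoms=None):
        self.code = code
        self.atoms = list(atoms) if atoms else []


# A sequence as a list of purpose-made objects. Only the backbone heavy
# atoms of glycine are listed here; the other residues are left empty.
glycine = Residue("GLY", [Atom("N", "N"), Atom("CA", "C"),
                          Atom("C", "C"), Atom("O", "O")])
object_seq = [glycine, Residue("PRO"), Residue("HYP")]

# The text codes can still be recovered from the objects when needed.
codes = [residue.code for residue in object_seq]
print(codes)
print(peptide.count("HYP"))
```

The string form is the most compact and works with all of Python's text operations, the list form handles codes longer than one letter, and the object form costs more to set up but gives each residue somewhere to keep its atoms.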

In Python, a sequence of one-letter residue codes will usually be represented as a string data type and three-letter codes as a list, although other arrangements are of course possible. Also, if we are being cautious with our sequences then we may like to check that our data structures only contain valid codes. A biological sequence can also be included in a larger data structure if it needs to be annotated with further information. Although Python dictionaries can be used for this purpose when you need something quick, we sometimes advocate defining a custom object that can link your sequences to other data.
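
As a rough sketch of the ideas in this paragraph, the code below checks that a sequence contains only allowed codes and then annotates it, first with a dictionary and then with a custom class. The names `has_valid_codes`, `AnnotatedSequence`, `DNA_CODES` and `RESIDUE_CODES`, and the annotation keys, are assumptions made for illustration; the set of three-letter codes is deliberately tiny and far from complete.

```python
DNA_CODES = set("GCAT")
RESIDUE_CODES = {"GLY", "ALA", "PRO", "HYP"}  # illustrative subset only


def has_valid_codes(sequence, allowed):
    """Return True if every code in the sequence is in the allowed set.

    Works for a string of one-letter codes and for a list of
    three-letter codes, because both are iterated item by item.
    """
    return all(code in allowed for code in sequence)


# Quick annotation: a dictionary that holds the sequence and extra data.
record = {"name": "demo_1",
          "sequence": "ATGGCCTTAG",
          "source": "typed in by hand"}


class AnnotatedSequence:
    """Hypothetical custom object linking a sequence to other data."""

    def __init__(self, name, sequence, allowed, **annotations):
        if not has_valid_codes(sequence, allowed):
            raise ValueError("sequence %r contains unknown codes" % name)
        self.name = name
        self.sequence = sequence
        self.annotations = annotations


dna = AnnotatedSequence(record["name"], record["sequence"], DNA_CODES,
                        source=record["source"])

print(has_valid_codes("GATXACA", DNA_CODES))
print(has_valid_codes(["GLY", "HYP"], RESIDUE_CODES))
print(dna.annotations["source"])
```

The dictionary is quicker to write, while the class can refuse invalid data at the moment the object is created and gives a natural home for any functions that later operate on the sequence.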

For testing and demonstration purposes, like the examples in this book, sequence data can be entered directly into the code of your programs. For real-world applications, of course, we would want our programs to work on arbitrary sequences that are read in from a file or database. These could be sequences that have been output from another program, something you have obtained by searching a large sequence database or even an entire genome sequence that you have downloaded. Interacting with files and databases directly is dealt with in Chapters 6 and 20, and for the moment we will simply demonstrate with short sequences.
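
To make the contrast concrete, here is a minimal sketch that places sequences typed directly into the code next to the same sequences read from a file-like object. It assumes a very simple, hypothetical text layout with one sequence per line; real sequence file formats carry more information than this. The function name `read_sequences` is likewise an assumption, and `io.StringIO` merely stands in for an opened file.

```python
import io


def read_sequences(lines):
    """Return a list of sequences, one for each non-empty line.

    Assumes one sequence per line; letters are converted to upper case.
    """
    sequences = []
    for line in lines:
        line = line.strip()
        if line:
            sequences.append(line.upper())
    return sequences


# Sequences entered directly into the code, as in a demonstration.
hard_coded = ["ATGGCC", "TTAGGC"]

# The same data arriving as text; io.StringIO imitates an open file.
file_like = io.StringIO("atggcc\n\nttaggc\n")
from_file = read_sequences(file_like)

print(from_file == hard_coded)
```

Because the function only needs something that yields lines, the object returned by Python's built-in `open()` could be passed to it in exactly the same way as the `StringIO` stand-in.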

Once you have your sequences in some kind of data structure, it is time to start analysis. While we cannot hope to anticipate all that you might need to do, we can at least give some idea of what is possible. At the same time we aim to show how some of the things that are commonly done with sequences can be readily achieved with Python. The following examples are simple scripts that each deduce some property of an input DNA, RNA or protein sequence, giving some real-world information or prediction about that sequence. Note that in all of the examples we will forgo checking that the sequences we use are valid: that they are the right kind of object and that they contain only the known types of residue code or letter. In an important real-world application you would clearly make such checks before you try to run any analyses, and the BioPython modules that we demonstrate at the end can help you do this.
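
For comparison, a script that does make such checks before analysing anything might look roughly like the sketch below. It is not one of the book's examples: the function name `gc_fraction` and the choice of property (the fraction of G and C letters in a DNA string) are assumptions picked only to keep the illustration short.

```python
def gc_fraction(dna_seq):
    """Return the fraction of G and C letters in a DNA string.

    The object type and the residue letters are checked before any
    counting is done. An empty sequence gives 0.0.
    """
    # Check that we have the right kind of object.
    if not isinstance(dna_seq, str):
        raise TypeError("expected a string of one-letter codes")

    # Check that only known residue letters are present.
    unknown = set(dna_seq) - set("GCAT")
    if unknown:
        raise ValueError("unknown residue letters: %s"
                         % ", ".join(sorted(unknown)))

    if not dna_seq:
        return 0.0

    gc_count = dna_seq.count("G") + dna_seq.count("C")
    return gc_count / float(len(dna_seq))


print(gc_fraction("GGCCAATT"))
```

The checks take only a few lines but change how failures appear: a list passed by mistake, or a stray letter in the data, raises an error with a clear message instead of silently producing a misleading number.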



