Python Programming for Biology: Bioinformatics and Beyond


Figure 6.4.  An example of a sequence entry from a FASTA-format file



[Tim J. Stevens, Wayne Boucher] Python Programming

Figure 6.4. An example of a sequence entry from a FASTA-format file. Here the data
comprises an annotation line, which contains the database codes and name of a protein,
followed by the amino acid sequence of the protein, represented as one-letter codes.

Here we will define a function to read a FASTA file and return a list of the sequences,
with each element of the list representing one sequence as a string of one-letter
codes. It would be a relatively simple modification to also collect the annotation (name)
data. In this function we choose to read one line at a time because the length and number
of sequences can become very large, even a whole genome. Each sequence is potentially
specified across multiple lines, so we need to keep track of that, and it is normally only
clear that the end of a sequence record has been reached when the next annotation line
(starting with ‘>’) is found or the end of the file is reached. The example below does this
by creating a list, seqFragments, appending each part of the sequence as it is found, and
then joining all the parts together at the end using the string join() method.



The function accepts a single argument, which is the name of the file to open (the full
path if not in the current directory). Within the function the file name is used to create a
file object opened for reading, and two empty lists are initialised: one to collect complete
sequences and one to store fragments of sequences as they are extracted from separate
lines. Older code, written for Python 2 compatibility, uses the universal newline mode ‘rU’
here; in Python 3 the plain mode ‘r’ already handles all line-ending styles, and the ‘U’ flag
was removed in Python 3.11. The lines of the opened file are read using a for loop, which
iterates through the file data as it is extracted from fileObj. The loop naturally yields
lines until the end of the file is reached.

def readFastaFile(fileName):

    # 'r' rather than the older 'rU': the 'U' flag was removed in Python 3.11
    fileObj = open(fileName, 'r')
    sequences = []
    seqFragments = []

    for line in fileObj:
        if line.startswith('>'):
            # found start of next sequence
            if seqFragments:
                sequence = ''.join(seqFragments)
                sequences.append(sequence)

            seqFragments = []

        else:
            # found more of existing sequence
            seq = line.rstrip()  # remove newline character
            seqFragments.append(seq)

    if seqFragments:
        # should be the case if file is not empty
        sequence = ''.join(seqFragments)
        sequences.append(sequence)

    fileObj.close()

    return sequences

Inside the loop we check whether the line begins with the comment identifier ‘>’. If
it does, the line is either the annotation of the first sequence record or the start of a
new record. In the latter case the complete one-letter sequence of the previous record is
defined, by joining all of the fragments from separate lines, and added to the list of
sequences. After joining, the list of fragments is reset for the next sequence record.
If the line does not begin with a comment identifier we must be on a sequence line, in
which case the trailing ‘\n’ character is removed (rstrip() strips any trailing whitespace)
and the line is stored in the list of sequence fragments (to be joined at the end of the
record). After the loop ends, any remaining sequence is added and the list of sequences is
passed back.
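The text above notes that collecting the annotation (name) data would be a simple modification. A minimal sketch of such a variant follows, assuming we want a list of (annotation, sequence) pairs. The function name readFastaRecords and the demonstration file contents are hypothetical and not taken from the book. The sketch also uses a with statement so that the file is closed automatically. Unlike the function above, it keeps a record whose sequence is empty.

```python
import os
import tempfile


def readFastaRecords(fileName):
    """Return a list of (annotation, sequence) tuples from a FASTA file."""

    records = []
    name = None          # annotation of the record currently being read
    seqFragments = []

    # The with statement closes the file even if an error occurs
    with open(fileName, 'r') as fileObj:
        for line in fileObj:
            if line.startswith('>'):
                # Start of a new record: store the previous one, if any
                if name is not None:
                    records.append((name, ''.join(seqFragments)))

                name = line[1:].strip()  # drop '>' and surrounding whitespace
                seqFragments = []

            else:
                seqFragments.append(line.strip())

    # Store the final record, which has no following '>' line
    if name is not None:
        records.append((name, ''.join(seqFragments)))

    return records


# Made-up two-record file, purely for demonstration
demoText = ('>seq1 first demo protein\nMKTAYIAK\nQRQISFVK\n'
            '>seq2 second demo protein\nGSHMLEDP\n')

with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
    tmp.write(demoText)
    demoPath = tmp.name

demoRecords = readFastaRecords(demoPath)
os.remove(demoPath)

for name, sequence in demoRecords:
    print(name, sequence)
```

Any sequence lines that appear before the first ‘>’ line are silently discarded in this sketch; whether that is acceptable depends on how strictly the input should be checked.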

An alternative, shorter and perhaps more understandable version would be to just
concatenate strings together:

def readFastaFile(fileName):

    # 'r' rather than the older 'rU': the 'U' flag was removed in Python 3.11
    fileObj = open(fileName, 'r')
    sequences = []
    seq = ''

    for line in fileObj:
        if line.startswith('>'):
            if seq:
                sequences.append(seq)

            seq = ''

        else:
            seq += line.rstrip()

    if seq:
        sequences.append(seq)

    fileObj.close()

    return sequences

In Python it is sometimes recommended to avoid too much string concatenation, given
that it can be less efficient than other methods: strings are immutable, so each +=
may have to copy the whole string built so far. For short files it would not matter, but
for longer ones the join() method is typically somewhat faster. This is an example where
it pays to be a bit more careful and write slightly longer, and perhaps more opaque, code.
Nonetheless, it is up to the programmer to decide what to optimise and what not to optimise.
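To judge whether this optimisation matters for your own data, the two approaches can be timed directly. The following is a rough sketch using the standard timeit module. The function names, fragment size and fragment count are arbitrary assumptions, and the measured times vary with the Python version and machine, so no particular result is implied.

```python
import timeit

# Arbitrary test data: 20,000 fragments of 60 letters each
fragments = ['ACGT' * 15] * 20000


def buildByConcatenation(parts):
    # Mirrors the shorter function: grow one string with +=
    seq = ''
    for part in parts:
        seq += part
    return seq


def buildByJoin(parts):
    # Mirrors the first function: collect fragments, then join once
    seqFragments = []
    for part in parts:
        seqFragments.append(part)
    return ''.join(seqFragments)


# Both approaches must give exactly the same string
assert buildByConcatenation(fragments) == buildByJoin(fragments)

# Time 20 repetitions of each approach
timeConcat = timeit.timeit(lambda: buildByConcatenation(fragments), number=20)
timeJoin = timeit.timeit(lambda: buildByJoin(fragments), number=20)

print('concatenation: %.4f s' % timeConcat)
print('join:          %.4f s' % timeJoin)
```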

We shall pause to consider what might go wrong with the above code. Someone might
pass in the name of a file that does not exist, or for which the user does not have read
permission, in which case the open() function will raise an exception indicating the error.
There are various functions in the os module (see Appendix 3) that can help avoid such
problems. For example, to check whether a file exists you can do:

import os

fileName = 'examples/chromoData.tsv'

if os.path.exists(fileName):
    print('File exists')
    # …
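Checking for existence first does not cover every case: the file could lack read permission, or disappear between the check and the call to open(). A common alternative is to attempt the open and catch the exception. The following is a minimal sketch, assuming Python 3; the helper name openForReading and the path used are hypothetical.

```python
def openForReading(fileName):
    """Return an open file object, or None if the file cannot be read."""

    try:
        return open(fileName, 'r')

    except FileNotFoundError:
        print('File does not exist: %s' % fileName)

    except PermissionError:
        print('No permission to read: %s' % fileName)

    # Reached only if one of the exceptions above was caught
    return None


# Hypothetical path that is assumed not to exist
fileObj = openForReading('examples/noSuchFile_hypothetical.fasta')

if fileObj is None:
    print('Nothing to read')
else:
    fileObj.close()
```

Whether to return None, as here, or to let the exception propagate to the caller is a design choice; propagating is often preferable in library code, because the caller can then decide how to report the problem.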

Alternatively, someone might use a file that exists and for which the user has read
permission, but which is not actually a FASTA-format file, or which does not follow the
current FASTA convention where comment lines start with ‘>’. This will lead to junk output
rather than an error. You could check that the first line starts with the character ‘>’ and
raise an exception if it does not. Of course, it is possible that a non-FASTA file happens
to start with ‘>’. You could also check that all the other lines contain only valid
nucleotide or protein one-letter codes. It is up to the programmer to decide how much to
check for, though the more you want your code to be used by other people, the more checks
you should have. It should be noted that the BioPython module, which can read FASTA format,
will do some of these checks for you: see Chapter 11 for examples.
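The checks described above could be sketched as follows. This is one possible approach under stated assumptions, not the book's method: the function name checkFastaLines and the allowed-letter set (all 26 letters plus ‘*’ and ‘-’, which is deliberately permissive) are hypothetical choices, and real data may need a stricter or different alphabet.

```python
# Assumed alphabet: any letter, plus '*' (stop) and '-' (gap); adjust as needed
VALID_CODES = set('ABCDEFGHIJKLMNOPQRSTUVWXYZ*-')


def checkFastaLines(lines):
    """Raise ValueError if the lines do not look like FASTA-format data."""

    firstLine = True

    for lineNum, line in enumerate(lines, start=1):
        line = line.strip()

        if not line:
            continue  # tolerate blank lines

        if firstLine:
            # The first non-blank line must be an annotation line
            if not line.startswith('>'):
                raise ValueError('Line %d: expected ">" at start of file' % lineNum)
            firstLine = False
            continue

        if line.startswith('>'):
            continue  # annotation line of a later record

        # Sequence line: every character must be an allowed code
        badCodes = set(line.upper()) - VALID_CODES
        if badCodes:
            raise ValueError('Line %d: invalid codes %s' % (lineNum, sorted(badCodes)))


# Made-up inputs, purely for demonstration
goodLines = ['>demo record (made up)\n', 'MKTAYIAK\n', 'QRQISFVK\n']
badLines = ['name\tlength\n', 'chr1\t1000\n']

checkFastaLines(goodLines)  # passes silently

try:
    checkFastaLines(badLines)
except ValueError as err:
    print('Rejected:', err)
```

Because an open file object yields its lines when iterated, it can be passed to this function directly in place of a list.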

