Python Programming for Biology: Bioinformatics and Beyond

Figure 6.4. An example of a sequence entry from a FASTA-format file

Download 7,75 Mb.

Pdf ko'rish

bet	79/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 75 76 77 78 79 80 81 82 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Figure 6.4. An example of a sequence entry from a FASTA-format file. Here the data

comprises an annotation line, which contains the database codes and name of a protein,

followed by the amino acid sequence of the protein, represented as one-letter codes.

Here we will define a function to read a FASTA file, and return a list of the sequences,

with each element of the list representing one sequence as a string containing one-letter

codes. It would be a relatively simple modification to also collect the annotation (name)

data. In this function we choose to read one line at a time because the length and number

of sequences can become very large, even a whole genome. Each sequence is potentially

specified across multiple lines so we need to keep track of that, and it is normally only

clear that the end of a sequence record is reached when the next comment line is found, or

the end of the file is reached. The example below does this by creating a list seqFragments

and appending each part of the sequence as it finds it, and then at the end joining all the

parts together using the join() function.

The function accepts a single argument, which is the name of the file to open (the full

path if not in the current directory). Within the function the file name is used to create a

file handle object, opened for reading in universal mode ‘rU’, and two empty lists are

initialised: one to collect complete sequences and one to store fragments of sequences as

they are extracted from separate lines. The line of the opened file is read by using a for

loop, to iterate through the file data as it is extracted from the fileObj. The loop naturally

yields lines until the end of the file is reached.

def readFastaFile(fileName):

fileObj = open(fileName, 'rU')

sequences = []

seqFragments = []

for line in fileObj:

if line.startswith('>'):

# found start of next sequence

if seqFragments:

sequence = ''.join(seqFragments)

sequences.append(sequence)

seqFragments = []

else:

# found more of existing sequence

seq = line.rstrip() # remove newline character

seqFragments.append(seq)

if seqFragments:

# should be the case if file is not empty

sequence = ''.join(seqFragments)

sequences.append(sequence)

fileObj.close()

return sequences

Inside the loop we check whether the line begins with the comment identifier ‘>’, and if

it does the line is either at the first sequence record or it has found the start of a new

record. In the latter case the complete one-letter sequence of the previous record is

defined, by joining all of the fragments from separate lines, and added to the list of

sequences. After joining, each list of fragments is then reset for the next sequence record.

If the line does not begin with a comment identifier we must be on a sequence line, in

which case the trailing ‘\n’ character is removed and the line is stored in the list of

sequence fragments (to be joined at the end of the record). After the loop ends, and any

remaining sequence is added, the list of sequences is passed back.

An alternative, shorter and perhaps more understandable version would be to just

concatenate strings together:

def readFastaFile(fileName):

fileObj = open(fileName, 'rU')

sequences = []

seq = ''

for line in fileObj:

if line.startswith('>'):

if seq:

sequences.append(seq)

seq = ''

else:

seq += line.rstrip()

if seq:

sequences.append(seq)

fileObj.close()

return sequences

In Python it is sometimes recommended to avoid too much string concatenation, given

that it can be less efficient than other methods. For short files it would not matter, but for

longer ones the join() method works slightly faster. This is an example where it pays to be

a bit more careful and write slightly longer, and perhaps more opaque, code. Nonetheless,

it is up to the programmer to decide what to optimise and what not to optimise.

We shall pause to consider what might go wrong with the above code. Someone might

pass in the name of a file that does not exist, or for which the user does not have read

permission, in which case the open() function will throw an exception, indicating the error.

There are various functions in the os module (see

Appendix 3

) that can help avoid such

problems. For example, to check whether a file exists you can do:

import os

fileName = 'examples/chromoData.tsv'

if os.path.exists(fileName):

print('File exists')

# …

Alternatively, someone might use a file that exists and for which the user has read

permission, but which is not actually a FASTA-format file, or is not a recent FASTA file

where comment lines start with ‘>’. This will lead to junk output, rather than an error. You

could check that the first line starts with the character ‘>’, and throw an exception if it

does not. Of course it’s possible there is a non-FASTA file that happens to start with ‘>’.

You could check that all the other lines have valid nucleotide or protein one-letter codes. It

is up to the programmer to decide how much to check for. Though, the more you want

your code to be used by other people, the more checks you should have. It should be noted

that the BioPython module that can read FASTA format will do some of these checks for

you: see

Chapter 11

for examples.

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 75 76 77 78 79 80 81 82 ... 514