Figure 6.4. An example of a sequence entry from a FASTA-format file. Here the data
comprises an annotation line, which contains the database codes and name of a protein,
followed by the amino acid sequence of the protein, represented as one-letter codes.
Here we will define a function to read a FASTA file, and return a list of the sequences,
with each element of the list representing one sequence as a string containing one-letter
codes. It would be a relatively simple modification to also collect the annotation (name)
data. In this function we choose to read one line at a time because the length and number
of sequences can become very large, even a whole genome. Each sequence is potentially
specified across multiple lines so we need to keep track of that, and it is normally only
clear that the end of a sequence record is reached when the next comment line is found, or
the end of the file is reached. The example below does this by creating a list seqFragments
and appending each part of the sequence as it finds it, and then at the end joining all the
parts together using the join() function.
The function accepts a single argument, which is the name of the file to open (the full
path if not in the current directory). Within the function the file name is used to create a
file handle object, opened for reading in universal mode ‘rU’, and two empty lists are
initialised: one to collect complete sequences and one to store fragments of sequences as
they are extracted from separate lines. The line of the opened file is read by using a for
loop, to iterate through the file data as it is extracted from the fileObj. The loop naturally
yields lines until the end of the file is reached.
def readFastaFile(fileName):
fileObj = open(fileName, 'rU')
sequences = []
seqFragments = []
for line in fileObj:
if line.startswith('>'):
# found start of next sequence
if seqFragments:
sequence = ''.join(seqFragments)
sequences.append(sequence)
seqFragments = []
else:
# found more of existing sequence
seq = line.rstrip() # remove newline character
seqFragments.append(seq)
if seqFragments:
# should be the case if file is not empty
sequence = ''.join(seqFragments)
sequences.append(sequence)
fileObj.close()
return sequences
Inside the loop we check whether the line begins with the comment identifier ‘>’, and if
it does the line is either at the first sequence record or it has found the start of a new
record. In the latter case the complete one-letter sequence of the previous record is
defined, by joining all of the fragments from separate lines, and added to the list of
sequences. After joining, each list of fragments is then reset for the next sequence record.
If the line does not begin with a comment identifier we must be on a sequence line, in
which case the trailing ‘\n’ character is removed and the line is stored in the list of
sequence fragments (to be joined at the end of the record). After the loop ends, and any
remaining sequence is added, the list of sequences is passed back.
An alternative, shorter and perhaps more understandable version would be to just
concatenate strings together:
def readFastaFile(fileName):
fileObj = open(fileName, 'rU')
sequences = []
seq = ''
for line in fileObj:
if line.startswith('>'):
if seq:
sequences.append(seq)
seq = ''
else:
seq += line.rstrip()
if seq:
sequences.append(seq)
fileObj.close()
return sequences
In Python it is sometimes recommended to avoid too much string concatenation, given
that it can be less efficient than other methods. For short files it would not matter, but for
longer ones the join() method works slightly faster. This is an example where it pays to be
a bit more careful and write slightly longer, and perhaps more opaque, code. Nonetheless,
it is up to the programmer to decide what to optimise and what not to optimise.
We shall pause to consider what might go wrong with the above code. Someone might
pass in the name of a file that does not exist, or for which the user does not have read
permission, in which case the open() function will throw an exception, indicating the error.
There are various functions in the os module (see
Appendix 3
) that can help avoid such
problems. For example, to check whether a file exists you can do:
import os
fileName = 'examples/chromoData.tsv'
if os.path.exists(fileName):
print('File exists')
# …
Alternatively, someone might use a file that exists and for which the user has read
permission, but which is not actually a FASTA-format file, or is not a recent FASTA file
where comment lines start with ‘>’. This will lead to junk output, rather than an error. You
could check that the first line starts with the character ‘>’, and throw an exception if it
does not. Of course it’s possible there is a non-FASTA file that happens to start with ‘>’.
You could check that all the other lines have valid nucleotide or protein one-letter codes. It
is up to the programmer to decide how much to check for. Though, the more you want
your code to be used by other people, the more checks you should have. It should be noted
that the BioPython module that can read FASTA format will do some of these checks for
you: see
Chapter 11
for examples.
Do'stlaringiz bilan baham: |