To read a FASTA-format file using BioPython we use the SeqIO module, which in this
case takes an open file object and extracts each sequence of the file, in turn creating a
special object for each record. These objects have the attributes .id and .seq that
respectively allow us to get hold of Python strings representing the identifier and one-
letter sequence.
Quite simply we import the SeqIO BioPython module and create an open file object for
the sequence file (in FASTA format) that we wish to read.
The SeqIO module has a parse() function that takes the file object and a file format
string to yield sequence record objects. Here the records are extracted in a for loop and
allocated to the protein variable, which we can then interrogate. In this example we send
the sequence to the estimateIsoelectric function defined above.
Writing a FASTA file using BioPython is slightly trickier because we have to first
create the right type of BioPython objects (SeqRecord), which we then pass into a function
for writing. Despite the complication of making these objects there is the added benefit
that the sequence will be checked, e.g. that it has the right set of letters, before it is written.
We make several more imports from the BioPython library. The SeqRecord is the final
object we wish to make, and which will be written out. The Seq object is needed internally
to make a SeqRecord and IUPAC is needed to check the sequence letters according to
some (the IUPAC) standard.
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
An open file object is created in writing mode, with the desired output file name. If we
were prudent we would check that we are not overwriting an existing file (using
os.path.exists()).
fileObj = open("output.fasta", "w")
Next a Seq class of object is made, which accepts a one-letter sequence in its
construction, and a sequence validation alphabet specification, which in this case is the
IUPAC protein codes. This sequence object is in turn used to make a SeqRecord which
associates the sequence object with an identifier (and potentially other kinds of
annotation).
seqObj = Seq(proteinSeq, IUPAC.protein)
proteinObj = SeqRecord(seqObj, id="TEST")
The proteinObj (a SeqRecord class object) is then written to file using the SeqIO.write()
function. Note that this takes a list of sequence records, as we can have many sequences in
one file, hence we put proteinObj in a list of one, using square brackets. The other
arguments to this function are naturally the open file object to write to and the format type
of the file.
SeqIO.write([proteinObj,], fileObj, 'fasta')
fileObj.close()
Do'stlaringiz bilan baham: