Python Programming for Biology: Bioinformatics and Beyond

Reading and writing FASTA files

Download 7,75 Mb.

Pdf ko'rish

bet	155/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 151 152 153 154 155 156 157 158 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Reading and writing FASTA files

To read a FASTA-format file using BioPython we use the SeqIO module, which in this

case takes an open file object and extracts each sequence of the file, in turn creating a

special object for each record. These objects have the attributes .id and .seq that

respectively allow us to get hold of Python strings representing the identifier and one-

letter sequence.

Quite simply we import the SeqIO BioPython module and create an open file object for

the sequence file (in FASTA format) that we wish to read.

from Bio import SeqIO

fileObj = open("examples/demoSequences.fasta", "rU")

The SeqIO module has a parse() function that takes the file object and a file format

string to yield sequence record objects. Here the records are extracted in a for loop and

allocated to the protein variable, which we can then interrogate. In this example we send

the sequence to the estimateIsoelectric function defined above.

for protein in SeqIO.parse(fileObj, 'fasta'):

print(protein.id)

print(protein.seq)

print(estimateIsoelectric(protein.seq))

fileObj.close()

Writing a FASTA file using BioPython is slightly trickier because we have to first

create the right type of BioPython objects (SeqRecord), which we then pass into a function

for writing. Despite the complication of making these objects there is the added benefit

that the sequence will be checked, e.g. that it has the right set of letters, before it is written.

We make several more imports from the BioPython library. The SeqRecord is the final

object we wish to make, and which will be written out. The Seq object is needed internally

to make a SeqRecord and IUPAC is needed to check the sequence letters according to

some (the IUPAC) standard.

from Bio.SeqRecord import SeqRecord

from Bio.Seq import Seq

from Bio.Alphabet import IUPAC

An open file object is created in writing mode, with the desired output file name. If we

were prudent we would check that we are not overwriting an existing file (using

os.path.exists()).

fileObj = open("output.fasta", "w")

Next a Seq class of object is made, which accepts a one-letter sequence in its

construction, and a sequence validation alphabet specification, which in this case is the

IUPAC protein codes. This sequence object is in turn used to make a SeqRecord which

associates the sequence object with an identifier (and potentially other kinds of

annotation).

seqObj = Seq(proteinSeq, IUPAC.protein)

proteinObj = SeqRecord(seqObj, id="TEST")

The proteinObj (a SeqRecord class object) is then written to file using the SeqIO.write()

function. Note that this takes a list of sequence records, as we can have many sequences in

one file, hence we put proteinObj in a list of one, using square brackets. The other

arguments to this function are naturally the open file object to write to and the format type

of the file.

SeqIO.write([proteinObj,], fileObj, 'fasta')

fileObj.close()

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 151 152 153 154 155 156 157 158 ... 514