Using ClustalW from Python
We will use input to the ClustalW program in the simple FASTA format, which you are
hopefully already familiar with. Of course, if the data is not already in a FASTA file we
will have to make one; indeed the following example will make the files for you starting
from a simple list of one-letter sequences as we have been using in the examples.
Although plain sequence strings are fine for basic analyses you may wish to consider more
detailed sequence representations, like the Bio.Seq module, when situations are more
complicated.
Firstly, in our example we import all of the external modules we will need upfront. It is
generally a good idea to do this because you can quickly look at the start of a file and see
all of the modules you are depending upon. Most of the modules are BioPython
7
libraries,
so it is assumed that this is installed on your system.
import os
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio import SeqIO, AlignIO
Two file names are defined; one for the input to the alignment program (.fasta) and one
for the output from the program (.aln).
fastaFileName = "test2.fasta"
alignFileName = "test2.aln"
Next we loop though the sequences, seqs, defined as needed, to make a list of sequence
record objects. For each one-letter sequence we first make a BioPython Seq object using
the standard IUPAC
8
protein sequence alphabet. If there were an invalid protein sequence
we would get an error. The basic sequence object is then combined with a name and a
description to make a fully fledged sequence record object. The SeqRecord object is
required for writing a FASTA file.
records = []
for i, seq in enumerate(seqs):
seqObj = Seq(seq, IUPAC.protein)
name = 'test%d' % i
recordObj = SeqRecord(seqObj, id=name, description='demo only')
records.append(recordObj)
We use the open() function to create a file handle object into which the sequence
records are written. The SeqIO module of BioPython has a .write() function that can
output various formats, so naturally we use the desired ‘fasta’ option. With the file written
the file handle is closed; this is not essential here but a good habit to get into.
outFileObj = open(fastaFileName, "w")
SeqIO.write(records, outFileObj, "fasta")
outFileObj.close()
With the input file made, the next job is to run the external alignment program; in this
case ClustalW. This requires we have installed the alignment program and that it can be
run as the command ‘clustalw’ from the operating system. We will invoke the command
using the subprocess.call() function, as we illustrated for BLAST in
Chapter 12
. Also, in
this instance we specify -INFILE and -OUTFILE options, to define the input and output
file names respectively, formatted in the form –OPTION=VALUE (unlike with BLAST
where the option name and associated value are separated with a space). If desired, we
could read the ClustalW manual and include other options to control other aspects like gap
penalties and the substitution matrix.
from subprocess import call
cmdArgs = ['clustalw',
'-INFILE=' + fastaFileName,
'-OUTFILE=' + alignFileName]
call(cmdArgs)
A perhaps better, but more complicated, alternative to the call() is to use the
subprocess.Popen class, which amongst other things provides a means of executing jobs in
parallel; very handy to do if you have lots of alignments and a multi-core computer.
After the system execution we then open and read the output alignments file. We use
the AlignIO module from BioPython to do the interpretation of the file, specifying the
read format as ‘clustal’. Note that we use the .read() function because we only have one
alignment in the file, if there were several we would use .parse() instead, to get back a list
of alignments. The reading function makes an alignment object from which we can print
out attributes, like the length, and loop through to get at the individual records. Note that
the record variables are BioPython objects and thus come with .seq and .id attributes from
which you can easily get at the one-letter sequence and name.
fileObj = open(alignFileName)
alignment = AlignIO.read(fileObj, "clustal")
print("Alignment length %i" % alignment.get_alignment_length())
for record in alignment:
print(record.seq, record.id)
Finally, we will illustrate how you can write the alignment object out into a different
format (other than the Clustal-format file we already have). This is a simple matter of
making a list of alignments and an output file handle object, with a new name, which we
pass on to the AlignIO.write() function. In this example we are using the ‘phylip’ format
corresponding to the PYHLIP suite
9
of programs that makes and analyses phylogenetic
trees.
alignments = [alignment,]
outputHandle = open("test2.phylip", "w")
AlignIO.write(alignments, outputHandle, "phylip")
Do'stlaringiz bilan baham: |