Python Programming for Biology: Bioinformatics and Beyond

Using ClustalW from Python

Download 7,75 Mb.

Pdf ko'rish

bet	187/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 183 184 185 186 187 188 189 190 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Using ClustalW from Python

We will use input to the ClustalW program in the simple FASTA format, which you are

hopefully already familiar with. Of course, if the data is not already in a FASTA file we

will have to make one; indeed the following example will make the files for you starting

from a simple list of one-letter sequences as we have been using in the examples.

Although plain sequence strings are fine for basic analyses you may wish to consider more

detailed sequence representations, like the Bio.Seq module, when situations are more

complicated.

Firstly, in our example we import all of the external modules we will need upfront. It is

generally a good idea to do this because you can quickly look at the start of a file and see

all of the modules you are depending upon. Most of the modules are BioPython

libraries,

so it is assumed that this is installed on your system.

import os

from Bio.Seq import Seq

from Bio.SeqRecord import SeqRecord

from Bio.Alphabet import IUPAC

from Bio import SeqIO, AlignIO

Two file names are defined; one for the input to the alignment program (.fasta) and one

for the output from the program (.aln).

fastaFileName = "test2.fasta"

alignFileName = "test2.aln"

Next we loop though the sequences, seqs, defined as needed, to make a list of sequence

record objects. For each one-letter sequence we first make a BioPython Seq object using

the standard IUPAC

protein sequence alphabet. If there were an invalid protein sequence

we would get an error. The basic sequence object is then combined with a name and a

description to make a fully fledged sequence record object. The SeqRecord object is

required for writing a FASTA file.

records = []

for i, seq in enumerate(seqs):

seqObj = Seq(seq, IUPAC.protein)

name = 'test%d' % i

recordObj = SeqRecord(seqObj, id=name, description='demo only')

records.append(recordObj)

We use the open() function to create a file handle object into which the sequence

records are written. The SeqIO module of BioPython has a .write() function that can

output various formats, so naturally we use the desired ‘fasta’ option. With the file written

the file handle is closed; this is not essential here but a good habit to get into.

outFileObj = open(fastaFileName, "w")

SeqIO.write(records, outFileObj, "fasta")

outFileObj.close()

With the input file made, the next job is to run the external alignment program; in this

case ClustalW. This requires we have installed the alignment program and that it can be

run as the command ‘clustalw’ from the operating system. We will invoke the command

using the subprocess.call() function, as we illustrated for BLAST in

Chapter 12

. Also, in

this instance we specify -INFILE and -OUTFILE options, to define the input and output

file names respectively, formatted in the form –OPTION=VALUE (unlike with BLAST

where the option name and associated value are separated with a space). If desired, we

could read the ClustalW manual and include other options to control other aspects like gap

penalties and the substitution matrix.

from subprocess import call

cmdArgs = ['clustalw',

'-INFILE=' + fastaFileName,

'-OUTFILE=' + alignFileName]

call(cmdArgs)

A perhaps better, but more complicated, alternative to the call() is to use the

subprocess.Popen class, which amongst other things provides a means of executing jobs in

parallel; very handy to do if you have lots of alignments and a multi-core computer.

After the system execution we then open and read the output alignments file. We use

the AlignIO module from BioPython to do the interpretation of the file, specifying the

read format as ‘clustal’. Note that we use the .read() function because we only have one

alignment in the file, if there were several we would use .parse() instead, to get back a list

of alignments. The reading function makes an alignment object from which we can print

out attributes, like the length, and loop through to get at the individual records. Note that

the record variables are BioPython objects and thus come with .seq and .id attributes from

which you can easily get at the one-letter sequence and name.

fileObj = open(alignFileName)

alignment = AlignIO.read(fileObj, "clustal")

print("Alignment length %i" % alignment.get_alignment_length())

for record in alignment:

print(record.seq, record.id)

Finally, we will illustrate how you can write the alignment object out into a different

format (other than the Clustal-format file we already have). This is a simple matter of

making a list of alignments and an output file handle object, with a new name, which we

pass on to the AlignIO.write() function. In this example we are using the ‘phylip’ format

corresponding to the PYHLIP suite

of programs that makes and analyses phylogenetic

trees.

alignments = [alignment,]

outputHandle = open("test2.phylip", "w")

AlignIO.write(alignments, outputHandle, "phylip")

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 183 184 185 186 187 188 189 190 ... 514