Python Programming for Biology: Bioinformatics and Beyond

Reading a genome alignment file

Download 7,75 Mb.

Pdf ko'rish

bet	263/514
Sana	30.12.2021
Hajmi	7,75 Mb.
	#91066

1 ... 259 260 261 262 263 264 265 266 ... 514

Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Reading a genome alignment file

The next example illustrates how to read a genome alignment file, which in this case is in

SAM format and could have been generated by the genomeAlign() defined above. Firstly,

the name of the genomic alignment is specified as alignFile and a set is defined to collect

the chromosome identifier. The SAM file reader is used to create Alignment objects,

noting here that we choose to loop directly though the reader’s output, rather than

explicitly defining a variable that represents the open file object. The alignment objects

can then be interrogated with the .read attribute to access a sequence read object.

alignFile = 'examples/EcoliGenomeAlign.sam'

chromosomes = set()

for alignment in SAM_Reader(alignFile):

if alignment.aligned:

seqRead = alignment.read

print(seqRead.name)

print(seqRead.seq)

Also, the alignment’s .iv attribute (a GenomicInterval) provides a means of accessing

the location that was aligned in the genome sequence:

genomeRegion = alignment.iv

chromo = genomeRegion.chrom

strand = genomeRegion.strand

start = genomeRegion.start

end = genomeRegion.end

chromosomes.add(chromo)

print(chromo, start, end, strand)

The set of chromosome identifiers which were collected above are converted into a list

so that they can be used to make a GenomicArray(), an object that can be used to house

any data that we may care to map to an efficient Python representation of the genome,

which in this case will store the locations of where short-read sequences aligned. Note that

the typecode argument dictates what kind of data is housed in the array and can take the

values ‘d’, ‘i’, ‘b’ and ‘O’, respectively representing floating point numbers, integers,

Booleans and general Python objects.

chromosomes = list(chromosomes)

hitMap = GenomicArray(chromosomes, stranded=True, typecode='i')

This genomic array can be filled with the results from a SAM alignment file. A check is

made to ensure that the alignment was successful (i.e. that alignment.aligned is true) and

then the genomic interval genomeRegion can be used to set values in the genomic array,

noting that we can access the array as if it were a Python dictionary and set the value of 1

for the leading strand and -1 otherwise.

for alignment in SAM_Reader(alignFile):

if alignment.aligned:

genomeRegion = alignment.iv

if genomeRegion.strand == '+':

hitMap[genomeRegion] = 1

else:

hitMap[genomeRegion] = -1

To access the results of the hitMap genomic array GenomicInterval objects are used

again, although here they are created from scratch to represent the region of interest. As an

example intervals are created for the first chromosome from position zero to 2 Mbp, for

different strands.

chromo = chromosomes[0]

endPoint = 2000000

plusStrand = GenomicInterval(chromo, 0, endPoint, '+')

minusStrand = GenomicInterval(chromo, 0, endPoint, '-')

bothStrands = GenomicInterval(chromo, 0, endPoint, '.')

The hitMap can then simply be queried with the interval object. Converting the result to

a list allows the result to be easily plotted in a graph.

pyplot.plot(list(hitMap[plusStrand]))

pyplot.plot(list(hitMap[minusStrand]))

pyplot.show()

Download 7,75 Mb.

Do'stlaringiz bilan baham:

1 ... 259 260 261 262 263 264 265 266 ... 514