The next example illustrates how to read a genome alignment file, which in this case is in
SAM format and could have been generated by the genomeAlign() defined above. Firstly,
the chromosome identifier. The SAM file reader is used to create Alignment objects,
noting here that we choose to loop directly though the reader’s output, rather than
can then be interrogated with the .read attribute to access a sequence read object.
alignFile = 'examples/EcoliGenomeAlign.sam'
chromosomes = set()
for alignment in SAM_Reader(alignFile):
if alignment.aligned:
seqRead = alignment.read
print(seqRead.name)
print(seqRead.seq)
Also, the alignment’s .iv attribute (a GenomicInterval) provides a means of accessing
the location that was aligned in the genome sequence:
genomeRegion = alignment.iv
chromo = genomeRegion.chrom
strand = genomeRegion.strand
start = genomeRegion.start
end = genomeRegion.end
chromosomes.add(chromo)
print(chromo, start, end, strand)
The set of chromosome identifiers which were collected above are converted into a list
so that they can be used to make a GenomicArray(), an object that can be used to house
any data that we may care to map to an efficient Python representation of the genome,
which in this case will store the locations of where short-read sequences aligned. Note that
the typecode argument dictates what kind of data is housed in the array and can take the
values ‘d’, ‘i’, ‘b’ and ‘O’, respectively representing floating point numbers, integers,
Booleans and general Python objects.
chromosomes = list(chromosomes)
hitMap = GenomicArray(chromosomes, stranded=True, typecode='i')
This genomic array can be filled with the results from a SAM alignment file. A check is
made to ensure that the alignment was successful (i.e. that alignment.aligned is true) and
then the genomic interval genomeRegion can be used to set values in the genomic array,
noting that we can access the array as if it were a Python dictionary and set the value of 1
for the leading strand and -1 otherwise.
for alignment in SAM_Reader(alignFile):
if alignment.aligned:
genomeRegion = alignment.iv
if genomeRegion.strand == '+':
hitMap[genomeRegion] = 1
else:
hitMap[genomeRegion] = -1
To access the results of the hitMap genomic array GenomicInterval objects are used
again, although here they are created from scratch to represent the region of interest. As an
example intervals are created for the first chromosome from position zero to 2 Mbp, for
different strands.
chromo = chromosomes[0]
endPoint = 2000000
plusStrand = GenomicInterval(chromo, 0, endPoint, '+')
minusStrand = GenomicInterval(chromo, 0, endPoint, '-')
bothStrands = GenomicInterval(chromo, 0, endPoint, '.')
The hitMap can then simply be queried with the interval object. Converting the result to
a list allows the result to be easily plotted in a graph.
pyplot.plot(list(hitMap[plusStrand]))
pyplot.plot(list(hitMap[minusStrand]))
pyplot.show()
Do'stlaringiz bilan baham: