Matching sequence reads to genome data
Lastly we look at how we can relate high-throughput sequence information to annotation
information that accompanies a genome. Here the Python example will focus on the data
contained within a ‘GFF’-formatted text file, although the HTSeq library can be used in an
analogous manner for different annotation formats.
We have included an example GFF file with the on-line material that accompanies this
book. Also, using the file downloading function defined above, we can obtain GFF files
from the same NCBI download site as the genome sequence data.
18
remoteFileName = '/Bacteria/Escherichia_coli_536_uid58531/NC_008253.gff'
gffFile = 'examples/EcoliGenomeFeatures.gff'
downloadFile(FTP_ROOT+remoteFileName, gffFile)
When using genome annotation data it is especially important to make sure that it
matches the genome sequence that sequence reads were mapped to. Naturally the
annotations should be for the same organism, and any sub-type, as the sequence, but care
also needs to be taken to ensure that the same genome assembly (as indicated by the build
number) is used for both data sets. For larger eukaryotic genomes, especially if the data is
relatively new, there can be some significant improvements between releases.
Assuming the GFF annotation file downloaded successfully, the GFF_Reader() import
from the HTSeq library makes it easy to access the file data. In this case you can see the
file object that is opened and the annotation data can be iterated through in a loop to get
GenomicFeature class objects, which naturally link a genomic location to a description of
the feature.
fileObj = GFF_Reader(gffFile)
for genomeFeature in fileObj:
Here we fetch the chromosome location information for the feature, by accessing the
genomeFeature.iv attribute, a genomic interval:
genomeRegion = genomeFeature.iv
data = (genomeRegion.chrom,
genomeRegion.start,
genomeRegion.end,
genomeRegion.strand)
print('%s %s - %s (%s)' % data)
Similarly there are other attributes that indicate the kind of annotation that goes along
with the location. These will include features like genes, introns, exons etc.
data = (genomeFeature.name,
genomeFeature.type,
genomeFeature.source)
print('%s %s (%s)' % data)
One notable attribute is the attr dictionary, which contains a variety of information,
including cross-links to other databases.
print(genomeFeature.attr)
This will yield results like the following:
NC_008253.1 4933963 – 4935385 (+)
CreC CDS (RefSeq)
{'locus_tag': 'ECP_4785', 'exon_number': '1', 'product': 'sensory histidine
kinase CreC','EC_number': '2.7.3.-', 'note': 'part of a two-component
regulatory
system with CreB or PhoB%3B involved in catabolic regulation', 'db_xref':
'GeneID:4190641', 'transl_table': '11', 'protein_id': 'YP_672568.1'}
For the next example we will look again at creating GenomicArray objects, which will
map the annotations on to a Python representation of the chromosomes. Here geneMap
will link the genomic array directly to the annotation objects and the genePlot will hold
numbers to plot a graph of gene locations. Note the use of the argument ‘auto’, which
means that the data structure will automatically expand to represent all of the chromosome
locations that we add.
geneMap = GenomicArray('auto', stranded=False, typecode='O')
genePlot = GenomicArray('auto', stranded=False, typecode='i')
To fill the GenomicArray data we simply use the genome region from the annotation
objects as keys in a dictionary-like manner, setting the corresponding values to the
genomeFeature object and a number. Note that we only do this if the feature is of ‘gene’
type. Thus, by setting genePlot to hold the value 1 at the feature location we will get an
intermittent array of values we can turn into a rough graph to show gene positions.
for genomeFeature in fileObj:
if genomeFeature.type == 'gene':
genomeRegion = genomeFeature.iv
geneMap[genomeRegion] = genomeFeature
genePlot[genomeRegion] = 1
Once a genomic array is created we can naturally interrogate it and fetch the data we
initially mapped. The steps() function is useful to provide a mechanism to loop through
what is contained, extracting a record of the sequence region covered and whatever data
we associated with that location. Here we loop through the positioned GenomicFeature
objects (although we could have used other types of Python object).
for region, feature in geneMap.steps():
if feature:
data = (feature.name,
region.start,
region.end,
feature.iv.strand)
print('%s: %s - %s (%s)' % data)
If we wish to make a graph, plotting the locations with the value 1, which were
arbitrarily set for the genes, then we choose a chromosome (from an arbitrary region used
above) and construct a 40,000 base GenomicInterval to state the region of interest. As
before the region is used like a dictionary key to fetch part of the mapped array, and we
can convert this to a list for graphing:
chromosome = genomeRegion.chrom
region = GenomicInterval(chromosome, 0, 40000, '.')
pyplot.plot(list(genePlot[region]))
pyplot.show()
Do'stlaringiz bilan baham: |