Python Programming for Biology: Bioinformatics and Beyond


Matching sequence reads to genome data



Download 7,75 Mb.
Pdf ko'rish
bet264/514
Sana30.12.2021
Hajmi7,75 Mb.
#91066
1   ...   260   261   262   263   264   265   266   267   ...   514
Bog'liq
[Tim J. Stevens, Wayne Boucher] Python Programming

Matching sequence reads to genome data

Lastly we look at how we can relate high-throughput sequence information to annotation

information that accompanies a genome. Here the Python example will focus on the data

contained within a ‘GFF’-formatted text file, although the HTSeq library can be used in an

analogous manner for different annotation formats.

We have included an example GFF file with the on-line material that accompanies this

book.  Also,  using  the  file  downloading  function  defined  above,  we  can  obtain  GFF  files

from the same NCBI download site as the genome sequence data.

18

remoteFileName = '/Bacteria/Escherichia_coli_536_uid58531/NC_008253.gff'



gffFile = 'examples/EcoliGenomeFeatures.gff'

downloadFile(FTP_ROOT+remoteFileName, gffFile)

When  using  genome  annotation  data  it  is  especially  important  to  make  sure  that  it

matches  the  genome  sequence  that  sequence  reads  were  mapped  to.  Naturally  the

annotations should be for the same organism, and any sub-type, as the sequence, but care

also needs to be taken to ensure that the same genome assembly (as indicated by the build

number) is used for both data sets. For larger eukaryotic genomes, especially if the data is

relatively new, there can be some significant improvements between releases.

Assuming the GFF annotation file downloaded successfully, the GFF_Reader()  import

from the HTSeq library makes it easy to access the file data. In this case you can see the

file object that is opened and the annotation data can be iterated through in a loop to get

GenomicFeature class objects, which naturally link a genomic location to a description of

the feature.

fileObj = GFF_Reader(gffFile)

for genomeFeature in fileObj:

Here  we  fetch  the  chromosome  location  information  for  the  feature,  by  accessing  the

genomeFeature.iv attribute, a genomic interval:

genomeRegion = genomeFeature.iv

data = (genomeRegion.chrom,

genomeRegion.start,

genomeRegion.end,



genomeRegion.strand)

print('%s %s - %s (%s)' % data)

Similarly there are other attributes that indicate the kind of annotation that goes along

with the location. These will include features like genes, introns, exons etc.

data = (genomeFeature.name,

genomeFeature.type,

genomeFeature.source)

print('%s %s (%s)' % data)

One  notable  attribute  is  the  attr  dictionary,  which  contains  a  variety  of  information,

including cross-links to other databases.

print(genomeFeature.attr)

This will yield results like the following:

NC_008253.1 4933963 – 4935385 (+)

CreC CDS (RefSeq)

{'locus_tag': 'ECP_4785', 'exon_number': '1', 'product': 'sensory histidine

kinase CreC','EC_number': '2.7.3.-', 'note': 'part of a two-component

regulatory

system with CreB or PhoB%3B involved in catabolic regulation', 'db_xref':

'GeneID:4190641', 'transl_table': '11', 'protein_id': 'YP_672568.1'}

For the next example we will look again at creating GenomicArray objects, which will

map  the  annotations  on  to  a  Python  representation  of  the  chromosomes.  Here  geneMap

will  link  the  genomic  array  directly  to  the  annotation  objects  and  the  genePlot  will  hold

numbers  to  plot  a  graph  of  gene  locations.  Note  the  use  of  the  argument  ‘auto’,  which

means that the data structure will automatically expand to represent all of the chromosome

locations that we add.

geneMap = GenomicArray('auto', stranded=False, typecode='O')

genePlot = GenomicArray('auto', stranded=False, typecode='i')

To  fill  the  GenomicArray  data  we  simply  use  the  genome  region  from  the  annotation

objects  as  keys  in  a  dictionary-like  manner,  setting  the  corresponding  values  to  the

genomeFeature object and a number. Note that we only do this if the feature is of ‘gene’

type. Thus, by setting genePlot to hold the value 1  at  the  feature  location  we  will  get  an

intermittent array of values we can turn into a rough graph to show gene positions.

for genomeFeature in fileObj:

if genomeFeature.type == 'gene':

genomeRegion = genomeFeature.iv

geneMap[genomeRegion] = genomeFeature

genePlot[genomeRegion] = 1

Once  a  genomic  array  is  created  we  can  naturally  interrogate  it  and  fetch  the  data  we

initially  mapped.  The  steps()  function  is  useful  to  provide  a  mechanism  to  loop  through



what is contained, extracting a record of the sequence region covered and whatever data

we  associated  with  that  location.  Here  we  loop  through  the  positioned  GenomicFeature

objects (although we could have used other types of Python object).

for region, feature in geneMap.steps():

if feature:

data = (feature.name,

region.start,

region.end,

feature.iv.strand)

print('%s: %s - %s (%s)' % data)

If  we  wish  to  make  a  graph,  plotting  the  locations  with  the  value  1,  which  were

arbitrarily set for the genes, then we choose a chromosome (from an arbitrary region used

above)  and  construct  a  40,000  base  GenomicInterval  to  state  the  region  of  interest.  As

before the region is used like a dictionary key to fetch part of the mapped array, and we

can convert this to a list for graphing:

chromosome = genomeRegion.chrom

region = GenomicInterval(chromosome, 0, 40000, '.')

pyplot.plot(list(genePlot[region]))

pyplot.show()


Download 7,75 Mb.

Do'stlaringiz bilan baham:
1   ...   260   261   262   263   264   265   266   267   ...   514




Ma'lumotlar bazasi mualliflik huquqi bilan himoyalangan ©hozir.org 2024
ma'muriyatiga murojaat qiling

kiriting | ro'yxatdan o'tish
    Bosh sahifa
юртда тантана
Боғда битган
Бугун юртда
Эшитганлар жилманглар
Эшитмадим деманглар
битган бодомлар
Yangiariq tumani
qitish marakazi
Raqamli texnologiyalar
ilishida muhokamadan
tasdiqqa tavsiya
tavsiya etilgan
iqtisodiyot kafedrasi
steiermarkischen landesregierung
asarlaringizni yuboring
o'zingizning asarlaringizni
Iltimos faqat
faqat o'zingizning
steierm rkischen
landesregierung fachabteilung
rkischen landesregierung
hamshira loyihasi
loyihasi mavsum
faolyatining oqibatlari
asosiy adabiyotlar
fakulteti ahborot
ahborot havfsizligi
havfsizligi kafedrasi
fanidan bo’yicha
fakulteti iqtisodiyot
boshqaruv fakulteti
chiqarishda boshqaruv
ishlab chiqarishda
iqtisodiyot fakultet
multiservis tarmoqlari
fanidan asosiy
Uzbek fanidan
mavzulari potok
asosidagi multiservis
'aliyyil a'ziym
billahil 'aliyyil
illaa billahil
quvvata illaa
falah' deganida
Kompyuter savodxonligi
bo’yicha mustaqil
'alal falah'
Hayya 'alal
'alas soloh
Hayya 'alas
mavsum boyicha


yuklab olish