Python Programming for Biology: Bioinformatics and Beyond



Date: 30.12.2021
[Tim J. Stevens, Wayne Boucher] Python Programming

GC content of DNA

The next example investigates a DNA sequence by measuring its GC content, i.e. the percentage of the total base pairs that are G:C (rather than A:T). All we need to do for this is to take the sequence of one strand of DNA and count how many of the nucleotides are G or C; one strand is enough because every G or C on it pairs with a C or G on the complementary strand. Measuring the GC content of DNA is biologically relevant because regions of a chromosome that are rich in G and C give a hint that they might be coding for genes.
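
As a minimal sketch of the whole-sequence measurement described above, the count can be written in a few lines. The function name calcOverallGc and the example sequence are illustrative assumptions, not taken from the book, and the sketch assumes the sequence contains only the letters A, C, G and T.

```python
def calcOverallGc(seq):
    """Return the fraction of G and C letters in one DNA strand (0.0 to 1.0)."""
    seq = seq.upper()  # accept lower-case input as well
    if not seq:
        return 0.0  # avoid dividing by zero for an empty sequence
    numGc = seq.count('G') + seq.count('C')
    return numGc / float(len(seq))

exampleSeq = 'ATGCGCTA'  # made-up sequence, for illustration only
print(calcOverallGc(exampleSeq) * 100.0)  # 50.0 (per cent)
```

Multiplying the returned fraction by 100 gives the percentage mentioned in the text.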


To make things more interesting, rather than just report the final GC content for the whole of a sequence, we will measure the GC content for every possible 10-residue sub-sequence and then plot the values along the length of the sequence as a graph. In other words, we will perform the calculation on a sliding window of residues.
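
To make the idea of a sliding window concrete, the following sketch lists every window of a short sequence. The helper name getWindows and the six-letter example sequence are hypothetical, chosen only for illustration; a sequence of length n has n - winSize + 1 full windows.

```python
def getWindows(seq, winSize=10):
    """Return every contiguous sub-sequence of length winSize, in order."""
    # range() is empty when the sequence is shorter than the window
    return [seq[i:i+winSize] for i in range(len(seq) - winSize + 1)]

windows = getWindows('ATGCGC', winSize=4)  # made-up sequence
print(windows)  # ['ATGC', 'TGCG', 'GCGC']
```

Each window overlaps its neighbour by all but one residue, which is why the resulting values change smoothly along the sequence.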

First, we define a function that takes a DNA sequence and, optionally, a window size as input and gives a list of numerical GC content values (fractions between 0 and 1) as output. We will take the output data and use it to draw a graph using an external Python module called Matplotlib, which is very useful for plotting numerical data (see Chapter 9 for more details). As with the profile search above, we will use a for loop to scan through the sequence, while taking care to avoid falling off the end. However, this time, because we don't need the position of each nucleotide within the search window (we only needed this before to get the right position in a profile), we can find the number of G and C letters by using the .count() method that is built into Python strings and lists.
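
The behaviour of .count() can be checked on a single window before it is used inside a loop. This is a small illustration with a made-up ten-letter sequence; the variable names are assumptions for the example.

```python
subSeq = 'GATTACAGGC'  # made-up 10-residue window

numG = subSeq.count('G')   # 3
numC = subSeq.count('C')   # 2
numGc = numG + numC        # 5 of the 10 letters are G or C

# The same method exists on lists, here a list of single letters
numA = list(subSeq).count('A')  # 3

# On strings, .count() counts non-overlapping occurrences,
# which only matters when searching for more than one letter
numPairs = 'GGGG'.count('GG')   # 2, not 3

print(numGc, numA, numPairs)
```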

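def calcGcContent(seq, winSize=10):
    # Return the GC fraction (0.0 to 1.0) of each winSize-long window in seq
    gcValues = []
    # The +1 includes the final full window, which ends at the last residue
    for i in range(len(seq)-winSize+1):
        subSeq = seq[i:i+winSize]
        numGc = subSeq.count('G') + subSeq.count('C')
        value = numGc/float(winSize)
        gcValues.append(value)

    return gcValues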
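
The following worked example shows what such a function returns for a short input. It restates the function so that the block runs on its own, and it assumes the loop bound len(seq) - winSize + 1 so that the final full window is counted; the test sequence and the window size of 4 are made up for illustration.

```python
def calcGcContent(seq, winSize=10):
    gcValues = []
    # Assumption: the +1 includes the last full window of the sequence
    for i in range(len(seq) - winSize + 1):
        subSeq = seq[i:i+winSize]
        numGc = subSeq.count('G') + subSeq.count('C')
        gcValues.append(numGc / float(winSize))
    return gcValues

testSeq = 'ATGCGCGATA'  # made-up sequence of 10 residues
values = calcGcContent(testSeq, winSize=4)
print(values)  # [0.5, 0.75, 1.0, 1.0, 0.75, 0.5, 0.25]
```

The seven values correspond to the seven windows ATGC, TGCG, GCGC, CGCG, GCGA, CGAT and GATA, so the GC-rich middle of the sequence shows up as a peak.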

The measurement for each position of the sliding window is added to the output list, which we can then plot as a graph as follows. Note that this example assumes that the Matplotlib module is installed (see http://www.cambridge.org/pythonforbiology for download links); otherwise the import statement will raise an error.

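from matplotlib import pyplot

# dnaSeq is assumed to hold a DNA sequence string defined earlier
gcResults = calcGcContent(dnaSeq)
pyplot.plot(gcResults)  # x axis: window start position, y axis: GC fraction
pyplot.show()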

