Python Programming for Biology: Bioinformatics and Beyond


Figure 21.3.  Nucleotide probabilities for two DNA positions



Date: 30.12.2021
[Tim J. Stevens, Wayne Boucher] Python Programming

Figure 21.3. Nucleotide probabilities for two DNA positions. For two positions in a DNA sequence there are 16 possible outcomes, given the four different types of nucleotide. If the probabilities of single nucleotides are equal then the probabilities of all nucleotide pairs are also equal (1/16), and naturally sum to one.
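The 16 outcomes described in the caption can be enumerated directly. The following is only a minimal sketch: the names bases, singleProbs and pairProbs are illustrative, and it assumes that the two positions are independent, so that a pair probability is the product of the two single-nucleotide probabilities.

```python
from itertools import product

bases = 'GCAT'

# Assumption: each nucleotide is equally likely at a single position
singleProbs = {base: 0.25 for base in bases}

# Assumption: the two positions are independent, so the probability of an
# ordered pair is the product of the two single-position probabilities
pairProbs = {}
for first, second in product(bases, repeat=2):
    pairProbs[first + second] = singleProbs[first] * singleProbs[second]

print(len(pairProbs))           # number of distinct ordered pairs
print(pairProbs['GC'])          # probability of one particular pair
print(sum(pairProbs.values()))  # total probability over all pairs
```

If the single-nucleotide probabilities in singleProbs were made unequal, the pair probabilities would no longer all be the same, although they would still sum to one.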

As we have alluded to, in order to obtain a realistic probability estimate for different outcomes we generally count the number of occurrences of each in a large data set. Hence for our mouse-breeding example, even if we didn’t have a good theoretical model we could cross the two strains, count the different coat colours of the progeny and then express the counts as a proportion of the total. We may do such experiments to validate a given model, which in this case might show something of genetic interest if the model does not fit. However, for this kind of hypothesis testing (which is described more fully in Chapter 22) we have to be mindful of how the amount of data affects our confidence.

To take an arbitrary example with a mouse cross, just because eight black mice were born in a litter does not mean that the model of a 3:1 black-white ratio is wrong; litters of eight would be all black about 10% of the time (0.75^8). You would need a much larger sample of data to be confident of the probabilities; the more experimental examples we have, the more closely the experimental ratios will match the long-term probabilities. Likewise for DNA nucleotide probabilities we can count the C:G and A:T base pairs we find in a genome,² and will get the most accurate results by choosing as large a sample of sequence data as possible. If we want our probabilities to be general for the whole genome we would not want to look at only a small part, which may not be representative.
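The arithmetic behind the litter example can be checked with a few lines of Python. This is a sketch rather than a method taken from the text: the function name litterProb is hypothetical, and it assumes that each pup is an independent draw with a 0.75 chance of being black, so that the litter counts follow a binomial distribution.

```python
from math import comb

def litterProb(numBlack, litterSize, probBlack=0.75):
    """Binomial probability of exactly numBlack black mice in a litter,
    assuming each pup is black independently with probability probBlack."""
    numWhite = litterSize - numBlack
    ways = comb(litterSize, numBlack)  # orderings giving this many black mice
    return ways * probBlack**numBlack * (1.0 - probBlack)**numWhite

# Chance that a litter of eight is entirely black under the 3:1 model
allBlack = litterProb(8, 8)
print('Pr(8 of 8 black) = %.4f' % allBlack)

# The same all-black outcome becomes rapidly less likely as the sample
# grows, which is why larger samples discriminate better between models
for size in (8, 16, 32, 64):
    print(size, litterProb(size, size))
```

An all-black sample of 64 mice would be very hard to reconcile with the 3:1 model, whereas an all-black litter of eight is unremarkable.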

In Python, if we know the number of G:C and the number of A:T pairs for a whole genome then the probability of each nucleotide, i.e. Pr(G), Pr(C), Pr(A) and Pr(T), at a random position can be calculated as the proportion of the total:

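# Whole-genome counts for each nucleotide
counts = {'G':2356491, 'C':2356491, 'A':2283184, 'T':2283184}

# Convert to float so the division below is not truncated in Python 2
total = float(sum(counts.values()))

letterProbs = {}

for letter in counts:
    # Probability is the count expressed as a proportion of the total
    letterProbs[letter] = counts[letter] / total

print(letterProbs)

# Result, rounded: {'A':0.24605, 'C':0.25395, 'T':0.24605, 'G':0.25395}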
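Where the counts are not already known they can be collected from the sequence itself. The sketch below uses collections.Counter from the standard library; the function name getLetterProbs and the short sequence are made up for illustration, and a real analysis would of course use as much sequence as possible.

```python
from collections import Counter

def getLetterProbs(seq, alphabet='GCAT'):
    """Estimate nucleotide probabilities as proportions of observed counts.
    Letters outside the alphabet (e.g. 'N') are ignored."""
    counts = Counter(seq.upper())
    total = float(sum(counts[letter] for letter in alphabet))

    if total == 0:
        raise ValueError('Sequence contains no recognised nucleotides')

    return {letter: counts[letter] / total for letter in alphabet}

# Hypothetical toy sequence, far too short to give reliable estimates
seq = 'ATGGCGCATTAGCCGATATG'
letterProbs = getLetterProbs(seq)

for letter in 'GCAT':
    print('Pr(%s) = %.3f' % (letter, letterProbs[letter]))
```

Counting only the letters in the alphabet means that ambiguity codes do not distort the total used as the denominator.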

Even though these probabilities are an improvement on ¼ for all bases, they should still be considered an approximation, depending on the situation at hand. You may have noticed that we have been quite careful to say that this is the probability at a random position. If the DNA position we are considering is not random then the above whole-genome average would just be a first approximation.³ The G:C content of DNA is actually different for different chromosomes and generally varies depending on whether a position is in a gene or non-gene region. We could end up with endless categorisations and qualifications for probabilities. So while it is possible to define the probabilities for C or G being at (to take an arbitrary and complex example) the last position of the first exon of all carbohydrate metabolism genes, we wouldn’t want to go into so much detail unless there was a special reason. In general a balance is struck between having accurate general probabilities, supported by large amounts of data, and contextualised probabilities, which may be supported by very little data. In a probabilistic analysis we may wish to account for context, to make more accurate predictions, but naturally we must have data for the different situations and know when to use them. There will be some further discussion of such matters in the Markov chains section below.
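The trade-off between general and contextualised probabilities can be illustrated with a small sketch. Everything here is hypothetical: the regions dictionary, its two context labels and the short sequences are invented purely to show the mechanics, and the function gcFraction is not from the text.

```python
# Hypothetical toy sequences grouped by context; not real genomic data
regions = {'gene':     ['ATGGCGCCGTAA', 'ATGCCCGGGTGA'],
           'non-gene': ['ATATTTAAATAT', 'TTAAGATATTTA']}

def gcFraction(seqs):
    """Return the proportion of G or C letters and the number of
    positions that the estimate is based on."""
    joined = ''.join(seqs)
    numGc = joined.count('G') + joined.count('C')
    return numGc / float(len(joined)), len(joined)

# Contextualised estimates: each is supported by only part of the data
for context in regions:
    fraction, numSites = gcFraction(regions[context])
    print('%-8s Pr(G or C) = %.3f from %d positions' % (context, fraction, numSites))

# General estimate: all data pooled, ignoring the context
allSeqs = [seq for seqs in regions.values() for seq in seqs]
fraction, numSites = gcFraction(allSeqs)
print('%-8s Pr(G or C) = %.3f from %d positions' % ('overall', fraction, numSites))
```

Each additional level of context splits the data further, so the number of positions behind each estimate shrinks as the categories become more specific.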

