Python Programming for Biology: Bioinformatics and Beyond


[Tim J. Stevens, Wayne Boucher] Python Programming





21

Probability

Contents

The basics of probability theory

Sample space

Probability values

Restriction enzyme example

Combining probabilities

Conditional probabilities

Bayesian analysis

Random variables

Binomial distribution

Poisson distribution

Geometric distribution

Markov chains

Markov processes

Hidden Markov models

Using Python for hidden Markov models

The Viterbi algorithm

The forward-backward algorithm

Implementing a protein sequence HMM

The basics of probability theory

The theory of probability was based on the observation of random physical events, most
notably in games of chance. Naturally, calculating accurate probabilities became
especially important when money was wagered on the outcome. Probability is a way of
ascribing numerical values to the possible outcomes to help us understand a random
process more fully. This enables us to ask questions like how much more often one event
occurs compared to another, but because of the random nature of what we are studying we
can never say what the outcome will definitely be. Rather, we tend to think of the
process in terms of the long-term proportions of the different outcomes if the random
experiment were repeated a very large number of times, or perhaps, if money is involved,
what a wager on a particular outcome is worth.



Turning to biological systems, some things in living organisms occur as a result of
random processes, like the segregation of a parent’s chromosomes among their children or
base-pair changes in DNA (for example as a result of replication errors or ionising
radiation), though under most circumstances we don’t get to see the actual random event.
For the most part we just view the outcomes, sometimes billions of years later in the
case of DNA sequence changes. Of course a DNA sequence isn’t actually random, given that
it exists to contain biologically meaningful information representing genes, gene control
elements etc., which have been selected for their function during evolution, even if the
initial mutations were random. Nonetheless, for a sufficiently large and unbiased
selection of DNA we can treat the sequence as if it were random in order to ask various
questions. For example, how often do I find the sub-sequence AAGCTT in a megabase-long
region of DNA?
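
As a rough illustration of the AAGCTT question, the sketch below compares a simple expected count with the count found in one simulated megabase. It assumes, purely for illustration, that all four bases are equally likely and that positions are independent, which real genomes do not obey exactly. The function names expected_count and observed_count are our own and do not come from any library.

```python
import random

def expected_count(motif, seq_len, alphabet_size=4):
    """Expected occurrences of motif in a uniform, independent random sequence."""
    # Each possible start position matches with probability (1/alphabet_size)**len(motif)
    n_windows = seq_len - len(motif) + 1
    return n_windows / alphabet_size ** len(motif)

def observed_count(motif, seq_len, seed=0):
    """Count motif in one simulated random sequence (seed fixed for repeatability)."""
    rng = random.Random(seed)
    seq = "".join(rng.choices("GCAT", k=seq_len))
    # str.count() only finds non-overlapping matches; that is enough here because
    # AAGCTT cannot overlap a shifted copy of itself. A motif such as AAAA would
    # need an overlapping search instead.
    return seq.count(motif)

motif = "AAGCTT"
expected = expected_count(motif, 1_000_000)   # 999995 / 4096, roughly 244.1
observed = observed_count(motif, 1_000_000)
print(f"expected about {expected:.1f}, observed {observed} in one simulated megabase")
```

The simulated count will scatter around the expected value from run to run, which is the sense in which we speak of long-term proportions rather than definite outcomes.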


Probability theory is often also useful in situations where there is no underlying
randomness in the biology, but rather an uncertainty in our scientific interpretation.
Here a probabilistic treatment of our uncertainty can lead to informative predictions. An
example of this would be for the classification of whether two genes have the same
function as one another (generally because they have a common ancestor). They either do
or do not, and the underlying assignment of this status is not a random process, but our
prediction based on the available data does have an uncertain component, and so it can be
helpful to treat the situation probabilistically. It is also notable that in biological
analyses it may be rare to actually deal with probabilities directly, but probability
theory underpins statistical tests which are very commonly used, and we describe some of
those in Chapter 22.

Here we will lightly go through some of the fundamentals of probability theory. Being
mindful of our expected readership, we will endeavour to avoid going into too much
detailed mathematical notation. We won’t escape the equations entirely but hopefully
these will serve as a primer for further reading.

Sample space

Firstly, we need to define a probabilistic system by knowing what the range of possible
outcomes is. In mathematical jargon this means defining the sample space. The range of
possible outcomes can be fairly straightforward, so for a six-sided die we know that
there are simply six outcomes corresponding to the numbers of spots on the different
faces. If we are thinking about the occurrence of a DNA base at a position in a genome
then we know that it must be either G, C, A or T. Often though we are thinking about
multiple dice rolls or several positions in a DNA sequence. In these cases we think of
the sample space in terms of the combinations of possibilities for each roll or position.
Hence for rolling two dice we have six possibilities for the first roll, and then for any
given first roll there are a further six possibilities for the second roll. Overall there
will be six times six, i.e. 36, possible outcomes. Naturally if there is a further roll
there are six more possibilities for each of the 36 two-roll outcomes. So here the
general rule is that the size of the sample space is 6^N if there are N rolls, i.e.
multiplied by six for each roll. The same idea can be applied to sequential positions in
DNA. Here there are four nucleotide possibilities at each position and so for a sequence
(or sub-sequence) of length N there are 4^N different combinations. Although the
nucleotides of a DNA sequence are actually all present in the same molecule, it may be
helpful for understanding to imagine the sequence being generated by rolls of a
four-sided die.
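
The counting rule for dice rolls and DNA positions can be checked by listing the sample space explicitly. The following is a minimal sketch using itertools.product from the Python standard library; the helper name sample_space is our own, chosen for illustration.

```python
from itertools import product

def sample_space(outcomes, n):
    """Return every ordered combination of n draws from the given outcomes."""
    return list(product(outcomes, repeat=n))

# Two rolls of a six-sided die: 6 * 6 = 36 ordered outcomes
two_rolls = sample_space(range(1, 7), 2)
print(len(two_rolls), 6 ** 2)

# Three consecutive DNA positions: 4 * 4 * 4 = 64 possible sub-sequences
triplets = ["".join(bases) for bases in sample_space("GCAT", 3)]
print(len(triplets), 4 ** 3)
```

Listing every outcome like this is only practical for small N, because the list grows by a factor of six (or four) with each extra roll or position; for longer sequences we would work with the count alone.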

More generally there can sometimes be the complication that we don’t actually have a
fixed number of dice rolls or a fixed length of DNA. For example, we may be interested in
finding out how many dice rolls we would expect to make, on average, before we roll three
sixes in a row. The DNA equivalent of this is to ask what the expected length of DNA
(number of positions) is before we find a given small sub-sequence. The latter is quite a
relevant question biologically because the small sub-sequence might be a cut site for a
restriction enzyme,[1] where it can be useful to know the average size of DNA fragments
the enzyme would generate. In these examples the sample space may be unbounded, or at
least very large in the case of a genome. Nonetheless we still have a firm idea of what
the range of possibilities is, even if it is technically infinite. For example, even
though it may be technically possible to never roll three consecutive sixes if a die were
rolled continuously for the history of the universe, the odds are so astronomically small
(close to zero) that this, and similar extremes, don’t have any practical effect.
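
To make the waiting-time question concrete, here is a small sketch that estimates the average number of rolls needed to see three consecutive sixes, assuming a fair die, and compares it with a standard closed-form result for runs of successes. That result is not derived in this section, so we simply quote it. The function names are hypothetical and chosen for illustration.

```python
import random
from fractions import Fraction

def expected_rolls_for_run(run_length, p=Fraction(1, 6)):
    """Mean number of trials until run_length consecutive successes.

    Quotes the standard result E = 1/p + 1/p**2 + ... + 1/p**run_length,
    where p is the chance of success on a single trial.
    """
    return sum((1 / p) ** k for k in range(1, run_length + 1))

def simulate_rolls_for_run(run_length, rng):
    """Roll a fair die until run_length sixes appear in a row; return the roll count."""
    rolls = 0
    streak = 0
    while streak < run_length:
        rolls += 1
        if rng.randint(1, 6) == 6:
            streak += 1
        else:
            streak = 0   # any other face breaks the run
    return rolls

rng = random.Random(1)   # fixed seed so that the sketch is repeatable
trials = 2000
mean_rolls = sum(simulate_rolls_for_run(3, rng) for _ in range(trials)) / trials
exact = expected_rolls_for_run(3)   # 6 + 36 + 216 = 258
print(f"exact mean: {exact}, simulated mean over {trials} trials: {mean_rolls:.1f}")
```

Although any single experiment could in principle run for ever, the simulated mean settles close to 258 rolls. Under the analogous assumption of equally likely, independent bases, a six-base site that cannot overlap a shifted copy of itself, such as AAGCTT, would be expected to appear after about 4^6 = 4096 positions on average.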



