Python Programming for Biology: Bioinformatics and Beyond



[Tim J. Stevens, Wayne Boucher] Python Programming

Mode, median and mean

For a collection of values, one of the most useful things to estimate is where the centre of the distribution lies. The general idea is to obtain a single value that is most representative of the data set as a whole. Three measures are generally used for this: the mode, the median and the mean. Naturally these have different properties and are useful in different situations, although the mean is the most commonly used.

The mode is the most commonly occurring value in a set of data. For example, the mode of the list of values [1,2,2,3,2,1,4,2,3,1,0] is 2, because the number 2 appears most often. Naturally, if each value occurs only once then the mode tells you nothing. Hence, whether the mode is a useful measure depends on the amount of data and on the precision with which the values are represented. This is especially true for floating point numbers, where repeated values can be unlikely, in which case it is commonplace to represent the data as a histogram. If the values are assigned to suitable ranges, the shape of the distribution can become more apparent and the mode will be the histogram bin with the most values.
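As a rough illustration of the histogram idea, the sketch below bins some floating point values with NumPy's histogram() function and reports the fullest bin. The data values, the choice of four bins and the 0.0 to 2.0 range are all assumptions made for this example, and a different binning could well give a different modal bin.

```python
import numpy as np

# Hypothetical floating point measurements; no value occurs twice
data = np.array([0.12, 0.48, 0.51, 0.55, 0.58, 0.61, 0.93, 1.37, 1.42, 1.88])

# Assign the values to four equal-width bins spanning 0.0 to 2.0
counts, edges = np.histogram(data, bins=4, range=(0.0, 2.0))

# The modal bin is the one holding the most values
i = int(counts.argmax())
modal_bin = (float(edges[i]), float(edges[i + 1]))
modal_centre = 0.5 * (modal_bin[0] + modal_bin[1])

print('Counts per bin:', counts)   # [2 5 2 1]
print('Modal bin:', modal_bin)     # (0.5, 1.0)
print('Bin centre:', modal_centre) # 0.75
```

The bin centre is only a convenient single number to quote for the mode; the resolution of the estimate is limited by the bin width.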

Using standard Python we can calculate the mode of the values in a list using the list’s .count() method. We use a list comprehension to build a counts list containing (count, val) pairs, noting that we use set() to remove any repeats in the values. Using max() on these pairs finds the one with the largest count; the mode is then the second item of the pair, i.e. the value that went with the count. Note that if several values share the largest count, max() picks the largest of those values.

values = [1,2,2,3,2,1,4,2,3,1,0]
counts = [(values.count(val), val) for val in set(values)]
count, mode = max(counts)
print(mode)
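An alternative that counts every value in a single pass is the Counter class from the standard collections module. The sketch below is not the book's method; get_modes() is a hypothetical helper name, and it returns every value that shares the highest count, so ties are reported rather than silently resolved.

```python
from collections import Counter

def get_modes(values):
    """Return (modes, count): all values that share the highest count."""
    tally = Counter(values)            # maps value -> number of occurrences
    top_count = max(tally.values())
    modes = sorted(val for val, n in tally.items() if n == top_count)
    return modes, top_count

values = [1,2,2,3,2,1,4,2,3,1,0]
print(get_modes(values))             # ([2], 4)
print(get_modes([1, 1, 2, 2, 3]))    # ([1, 2], 2) - two tied modes
```

For the tied list, the max()-based approach would report only the value 2, since tuples with equal counts are ordered by their second item.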

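Calculating the mode is easier with SciPy, as there is a pre-constructed stats.mode() function that works with NumPy array objects. It gives back both the mode and its count. Older SciPy releases return these as arrays even for one-dimensional input, whereas recent releases return plain scalars, so here we pass the mode through NumPy's atleast_1d() before taking the [0] item, which works in either case:

from scipy import stats
from numpy import array, atleast_1d

valArray = array(values, float)
mode, count = stats.mode(valArray)
print('Mode:', atleast_1d(mode)[0]) # Result is 2.0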

The median represents the middle-ranked value when the data is placed in sorted order. Put differently, the median is the 50th percentile point, which separates the top and bottom halves of the values. Taking the example [1,2,2,3,2,1,4,2,3,1,0] again, sorting this gives [0,1,1,1,2,2,2,2,3,3,4] and the middle value is 2. If there is an even number of points the median is generally taken to be the average of the two middle points. The median is a fairly robust statistic to use, including where the underlying probability distribution is not known, because the middle ranking is insensitive to outlier points (with extreme values).
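The robustness of the median can be seen in a small experiment. The sketch below uses the standard statistics module and appends a single extreme value (1000, chosen arbitrarily for this illustration) to the example list, then compares how the median and the mean respond.

```python
from statistics import mean, median

values = [1,2,2,3,2,1,4,2,3,1,0]
with_outlier = values + [1000]   # one hypothetical extreme measurement

print('Median:', median(values), '->', median(with_outlier))
print('Mean: %.3f -> %.3f' % (mean(values), mean(with_outlier)))
# The median stays at 2, while the mean jumps from 1.909 to 85.083
```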



We can calculate the median in standard Python by sorting the values and selecting the

middle index, though if there is an even number of values (nValues % 2 == 0) we take the

average of the central two:

def getMedian(values):

  vSorted = sorted(values)
  nValues = len(values)

  if nValues % 2 == 0: # even number
    index = nValues//2
    median = sum(vSorted[index-1:index+1])/2.0

  else:
    index = (nValues-1)//2
    median = vSorted[index]

  return median

med = getMedian(values)

Calculating the median is easy using NumPy, given its median() function:

from numpy import median

med = median(valArray)

print('Median:', med) # Result is 2
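As a quick check of the two cases described above, the following sketch applies NumPy's median() to an odd-length and an even-length array, and also shows the equivalence with the 50th percentile via percentile(). The even-length array is an assumed example, and the percentile result relies on NumPy's default linear interpolation between neighbouring values.

```python
import numpy as np

odd_values = np.array([1,2,2,3,2,1,4,2,3,1,0], float)
even_values = np.array([4, 1, 3, 2], float)

print(np.median(odd_values))           # 2.0, the single middle value
print(np.median(even_values))          # 2.5, the average of 2 and 3
print(np.percentile(even_values, 50))  # 2.5, same as the median
```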

The mean is the numerical average of a set of values. It is analogous to the centre of ‘mass’ of the distribution. In simple terms the sample mean is calculated by adding up all the values and dividing by the number of values. The mean of [1,2,2,3,2,1,4,2,3,1,0] is 21/11 = 1.909. In terms of an underlying probability distribution, the mean of a random variable, X, is referred to as the expectation of the random variable, written E(X), because it is the long-term average, considering an unlimited amount of data, and thus also the most representative centre value for the distribution. It should be noted that in this chapter we will be considering two types of mean value. The first is the true mean value of the underlying probability distribution, and for a random variable X we will give it the label $\mu_x$. The other kind of mean is the sample mean, labelled $\bar{x}$, which often acts as an estimate for the true mean, and which is calculated as an average value of a series of $n$ measurements, $x_i$, as one might expect:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$


We can readily calculate the sample mean in standard Python:

values = [1,2,2,3,2,1,4,2,3,1,0]

mean = sum(values)/float(len(values))

or using NumPy arrays, noting that mean() is both a stand-alone function and a method bound to array objects:

from numpy import array, mean

valArray = array(values, float)
m = valArray.mean()
# or
m = mean(valArray)
print('Mean', m) # Result is 1.909

It is handy that these NumPy functions also take an axis argument, so that in a multi-dimensional array you can calculate the mean across rows or columns of values, etc.:

valArray2 = array([[7,9,5],
                   [1,4,3]])

print(valArray2.mean())       # All elements - result is 4.8333
print(valArray2.mean(axis=0)) # Column means - result is [4.0, 6.5, 4.0]
print(valArray2.mean(axis=1)) # Row means - result is [7.0, 2.6667]
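The distinction between the true mean of a distribution and a sample mean can be illustrated with a simulated fair die, which is an assumed example rather than one taken from the text. The true mean follows from the probabilities alone, while each sample mean only estimates it and tends to get closer as the sample grows. The seed and sample sizes below are arbitrary choices.

```python
import numpy as np

# True mean (expectation) of a fair six-sided die: sum of value * probability
faces = np.arange(1, 7)
probs = np.full(6, 1.0 / 6.0)
true_mean = float((faces * probs).sum())   # 3.5, up to floating point rounding

# Sample means from increasingly large simulated samples
rng = np.random.default_rng(1)             # fixed seed for repeatability
sample_means = {}
for n in (10, 1000, 100000):
    rolls = rng.integers(1, 7, size=n)     # upper bound 7 is exclusive
    sample_means[n] = float(rolls.mean())
    print(n, sample_means[n])

print('True mean:', true_mean)
```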

For most named probability distributions the mean is either a fundamental parameter that is used in the description of the distribution (e.g. for Gaussian) or is readily derived from the fundamental parameters (e.g. binomial, geometric). However, there are some curious cases where the mean is undefined, e.g. for the Cauchy distribution.
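One way to get a feel for the Cauchy case is to draw random samples and watch the sample mean as more data is included. In the sketch below, which uses an arbitrary seed and arbitrary sample sizes, the sample median settles close to the centre of the distribution (zero for the standard Cauchy), whereas the sample mean typically does not settle down, because occasional enormous values dominate the sum.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed for repeatability
sample = rng.standard_cauchy(100000)    # standard Cauchy, centred on zero

for n in (100, 10000, 100000):
    subset = sample[:n]
    print(n, 'mean: %.3f' % subset.mean(), 'median: %.3f' % np.median(subset))

final_median = float(np.median(sample))
```

This is only a numerical illustration, not a proof: any finite sample has a perfectly well-defined sample mean, it is the underlying expectation that does not exist.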
