Python Programming for Biology: Bioinformatics and Beyond

[Tim J. Stevens, Wayne Boucher] Python Programming

Mode, median and mean

For a collection of values, one of the most useful measures is to estimate where the centre

of the distribution is. The general idea here is that we get a single value that is most

representative of the data set as a whole. There are a few different ways that are generally

used to get such a measure, which are called mode, median and mean. Naturally these

have different properties and are useful in different situations, although the mean is the

most common parameter used.

The mode is the most commonly occurring value in a set of data. For example, the

mode of the list of values [1,2,2,3,2,1,4,2,3,1,0] is 2, because the number 2 appears most

often. Naturally, if each value only occurs once then the mode tells you nothing. Hence,

the mode is only a useful measure when there is enough data and the values are recorded

at a limited precision, so that repeats can occur. This is especially true when using floating point

numbers, where repeated values can be unlikely, in which case it is commonplace to

represent the data as a histogram. If the values are assigned to suitable ranges the shape of

the distribution can become more apparent and the mode will be the histogram bin with

the most values.
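As a minimal sketch of the histogram approach, the following bins some made-up floating point values with NumPy's histogram() function and reports the fullest bin. The data, the number of bins and the range are all arbitrary choices made for illustration, and a different binning could give a different modal bin.

```python
from numpy import array, histogram

# Hypothetical floating point measurements; no value repeats exactly
data = array([0.12, 0.48, 0.51, 0.53, 0.55, 0.58, 0.91, 1.37, 1.42, 1.96])

# Four equal-width bins covering 0.0 to 2.0 (an arbitrary layout)
counts, edges = histogram(data, bins=4, range=(0.0, 2.0))

# Index of the fullest bin; argmax() reports the first one if several tie
iBin = counts.argmax()
modeLow, modeHigh = edges[iBin], edges[iBin+1]

print('Counts:', counts)
print('Modal bin: %.1f to %.1f' % (modeLow, modeHigh))
```

Here the bin from 0.5 to 1.0 holds five of the ten values, so it plays the role of the mode even though no individual value occurs twice.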

Using standard Python we can calculate the mode of the values in a list using the list’s

.count() method. We use a list comprehension to build a counts list containing (count, val)

pairs, noting that we use set() to remove any repeats in the values. Using max() on these

pairs will find the one with the largest count, although the mode will be the second item of

the pair; the value that went with the count.

values = [1,2,2,3,2,1,4,2,3,1,0]

counts = [(values.count(val), val) for val in set(values)]

count, mode = max(counts)

print( mode )
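Because values.count() rescans the whole list for every distinct value, the approach above can be slow for large data sets. One possible alternative, not taken from the text above, is the Counter class from the standard collections module, which tallies all values in a single pass:

```python
from collections import Counter

values = [1,2,2,3,2,1,4,2,3,1,0]

# Count every distinct value in one pass over the list
counter = Counter(values)

# most_common(1) gives a list holding the single (value, count) pair
# that has the highest count
modeVal, modeCount = counter.most_common(1)[0]

print(modeVal, modeCount)
```

Note that the two approaches can differ when several values tie for the highest count: max() on (count, val) pairs picks the largest tied value, whereas most_common() keeps the tied value that was encountered first.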

Calculating the mode is easier to do with SciPy, as there is a pre-constructed stats.mode() function that works with NumPy array objects. Older SciPy releases give the result back as an array, while more recent ones give a plain number for a one-dimensional input. Hence, to cope with both cases, we pass the result through NumPy's atleast_1d() before taking the [0] item:

from scipy import stats
from numpy import array, atleast_1d

valArray = array(values, float)
mode, count = stats.mode(valArray)
print('Mode:', atleast_1d(mode)[0] ) # Result is 2.0

The median represents the middle-ranked value when the data is placed in its sorted

order. Or put differently, the median is the 50th percentile point that separates the top and

bottom halves of the values. Taking the example [1,2,2,3,2,1,4,2,3,1,0] again, sorting this

gives [0,1,1,1,2,2,2,2,3,3,4] and the middle value is 2. If there is an even number of points

the median is generally represented as the average of the two middle points. The median is

a fairly robust statistic to use, including where the underlying probability distribution is

not known, because the middle ranking will be insensitive to outlier points (with extreme

values).
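To illustrate this robustness, the sketch below appends one made-up extreme value to the example data and compares how far the median and the mean move. It uses the median() and mean() functions of the standard statistics module purely for brevity; that module is an assumption of this example and is not otherwise used in this section.

```python
from statistics import mean, median

values = [1,2,2,3,2,1,4,2,3,1,0]
withOutlier = values + [100]  # one made-up extreme value appended

medianPlain, meanPlain = median(values), mean(values)
medianOut, meanOut = median(withOutlier), mean(withOutlier)

print('Median: %.3f -> %.3f' % (medianPlain, medianOut))
print('Mean:   %.3f -> %.3f' % (meanPlain, meanOut))
```

The median stays at 2 because the extra point only changes which values are ranked in the middle, while the mean is dragged from about 1.909 to about 10.083 by the single outlier.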

We can calculate the median in standard Python by sorting the values and selecting the

middle index, though if there is an even number of values (nValues % 2 == 0) we take the

average of the central two:

def getMedian(values):

  vSorted = sorted(values)
  nValues = len(values)

  if nValues % 2 == 0: # even number
    index = nValues//2
    median = sum(vSorted[index-1:index+1])/2.0

  else:
    index = (nValues-1)//2
    median = vSorted[index]

  return median

med = getMedian(values)

Calculating the median is easy using NumPy, given its median() function:

from numpy import median

med = median(valArray)

print('Median:', med) # Result is 2.0

The mean is the numerical average of a set of values. It is analogous to the centre of

‘mass’ of the distribution. In simple terms the sample mean is calculated by adding up all

the values and dividing by the number of values. The mean of [1,2,2,3,2,1,4,2,3,1,0] is

21/11 = 1.909. In terms of an underlying probability distribution, the mean of a random

variable, X, is referred to as the expectation of the random variable, written E(X), because

it represents the value that represents the long-term average, considering an unlimited

amount of data, and thus also the most representative centre value for the distribution. It

should be noted that in this chapter we will be considering two types of mean value. The first is the true mean value of the underlying probability distribution, and for a random variable X we will give it the label $\mu_x$. The other kind of mean is the sample mean, labelled $\bar{x}$, which often acts as an estimate for the true mean, and which is calculated as an average value of a series of $n$ measurements, $x_i$, as one might expect:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

We can readily calculate the sample mean in standard Python:

values = [1,2,2,3,2,1,4,2,3,1,0]

mean = sum(values)/float(len(values))

or using NumPy arrays, noting that mean() is both a stand-alone function and a method

bound to array objects:

from numpy import array, mean

valArray = array(values, float)

m = valArray.mean()

# or

m = mean(valArray)

print('Mean', m) # Result is 1.909

It is handy that these NumPy functions also take an axis argument, so that in a multi-

dimensional array you can calculate the mean across rows or columns of values etc:

valArray2 = array([[7,9,5],
                   [1,4,3]])

print(valArray2.mean()) # All elements - result is 4.8333

print(valArray2.mean(axis=0)) # Column means - result is [4.0, 6.5, 4.0]

print(valArray2.mean(axis=1)) # Row means - result is [7.0, 2.6667]
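The same axis argument is accepted by NumPy's median() function. As a small sketch, reusing the two-row example array:

```python
from numpy import array, median

valArray2 = array([[7,9,5],
                   [1,4,3]])

colMedians = median(valArray2, axis=0)  # down each column
rowMedians = median(valArray2, axis=1)  # along each row

print(colMedians)
print(rowMedians)
```

With only two rows, each column median is simply the average of the two values in that column, so here the column medians are the same as the column means.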

For most named probability distributions the mean is either a fundamental parameter

that is used in the description of the distribution (e.g. for Gaussian) or is readily derived

from the fundamental parameters (e.g. binomial, geometric). However, there are some

curious cases where the mean is undefined, e.g. for the Cauchy distribution.
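A rough way to see what an undefined mean implies in practice is to compare sample means for Gaussian and Cauchy random numbers as the sample size grows. This is only an illustrative sketch: the seed and the sample sizes are arbitrary choices, and the exact numbers printed depend on the NumPy version used.

```python
from numpy import mean, median
from numpy.random import default_rng

rng = default_rng(7)  # arbitrary seed, only so that runs are repeatable

for n in (100, 10000, 1000000):
    gauss = rng.normal(0.0, 1.0, n)   # Gaussian with true mean 0.0
    cauchy = rng.standard_cauchy(n)   # Cauchy centred on 0.0
    print(n, mean(gauss), mean(cauchy), median(cauchy))
```

The Gaussian sample mean should settle ever closer to 0.0 as n increases, whereas the Cauchy sample mean typically keeps jumping around because occasional very extreme values dominate the sum. The Cauchy sample median, by contrast, does home in on the centre of the distribution.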
