Mode, median and mean
For a collection of values, one of the most useful things to estimate is where the centre
of the distribution lies. The general idea is to obtain a single value that is most
representative of the data set as a whole. Three measures are generally used for this:
the mode, the median and the mean. Naturally these have different properties and are
useful in different situations, although the mean is the most commonly used.
The mode is the most commonly occurring value in a set of data. For example, the
mode of the list of values [1,2,2,3,2,1,4,2,3,1,0] is 2, because the number 2 appears most
often. Naturally, if each value only occurs once then the mode tells you nothing. Hence,
for the mode to be a useful measure there must be enough data, and the values must be
recorded with a limited precision so that repeats can occur. This is especially important
when using floating point numbers, where repeated values can be unlikely, in which case
it is commonplace to represent the data as a histogram. If the values are assigned to
suitable ranges the shape of the distribution can become more apparent and the mode will
be the histogram bin with the most values.
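As a minimal sketch of the histogram approach, the following uses NumPy's histogram() function to bin some floating point values and pick out the fullest bin. The data values, the number of bins and the 0.0 to 1.0 range are all assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical measurements; exact repeats are unlikely with floats
data = np.array([0.12, 0.48, 0.51, 0.53, 0.47, 0.95, 0.55, 0.49, 0.31, 0.52])

# Assign the values to five equal-width bins spanning 0.0 to 1.0
counts, edges = np.histogram(data, bins=5, range=(0.0, 1.0))

# The modal bin is the one holding the most values
i = int(np.argmax(counts))
modal_bin = (edges[i], edges[i + 1])
bin_centre = 0.5 * (edges[i] + edges[i + 1])

print('Modal bin:', modal_bin, 'count:', counts[i], 'centre:', bin_centre)
```

The choice of bin width matters here: bins that are too narrow reproduce the original problem of no repeats, while bins that are too wide hide the shape of the distribution.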
Using standard Python we can calculate the mode of the values in a list using the list’s
.count() method. We use a list comprehension to build a counts list containing (count, val)
pairs, noting that we use set() to remove any repeats in the values. Using max() on these
pairs finds the one with the largest count; the mode is then the second item of that pair,
i.e. the value that goes with the count.
values = [1,2,2,3,2,1,4,2,3,1,0]
counts = [(values.count(val), val) for val in set(values)]
count, mode = max(counts)
print( mode )
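An alternative sketch, not part of the approach described above, uses the Counter class from the standard collections module, which tallies the values in a single pass. It also makes ties easy to see: the max() approach on (count, val) pairs silently picks the largest value when several values share the top count, whereas here we can collect them all. The name modes is just an illustrative choice.

```python
from collections import Counter

values = [1,2,2,3,2,1,4,2,3,1,0]

# Tally how many times each distinct value occurs
tally = Counter(values)

# most_common(1) gives a one-item list holding a (value, count) pair
mode, count = tally.most_common(1)[0]

# Collect every value that shares the top count, in case of a tie
modes = sorted(val for val, n in tally.items() if n == count)

print('Mode:', mode, 'Count:', count, 'All modes:', modes)
```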
Calculating the mode is easier to do with SciPy, as there is a pre-constructed
stats.mode() function that works with NumPy array objects. Depending on the SciPy
version the mode comes back either as a one-element array or as a plain number, hence we
pass it through NumPy's atleast_1d() function before taking the [0] item:
from scipy import stats
from numpy import array, atleast_1d
valArray = array(values, float)
mode, count = stats.mode(valArray)
print('Mode:', atleast_1d(mode)[0] ) # Result is 2.0
The median represents the middle-ranked value when the data is placed in its sorted
order. Or put differently, the median is the 50th percentile point that separates the top and
bottom halves of the values. Taking the example [1,2,2,3,2,1,4,2,3,1,0] again, sorting this
gives [0,1,1,1,2,2,2,2,3,3,4] and the middle value is 2. If there is an even number of points
the median is generally represented as the average of the two middle points. The median is
a fairly robust statistic to use, including where the underlying probability distribution is
not known, because the middle ranking will be insensitive to outlier points (with extreme
values).
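To illustrate this robustness, here is a small sketch using the standard library's statistics module (a convenience chosen for this example). A single extreme value, invented for the demonstration, is appended to the example list; the median does not move, while the mean changes dramatically.

```python
from statistics import mean, median

values = [1,2,2,3,2,1,4,2,3,1,0]

# Append one hypothetical outlier with an extreme value
with_outlier = values + [1000]

print('Median without outlier:', median(values))
print('Median with outlier:   ', median(with_outlier))
print('Mean without outlier:  ', mean(values))
print('Mean with outlier:     ', mean(with_outlier))
```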
We can calculate the median in standard Python by sorting the values and selecting the
middle index, though if there is an even number of values (nValues % 2 == 0) we take the
average of the central two:
def getMedian(values):
    vSorted = sorted(values)
    nValues = len(values)
    if nValues % 2 == 0: # even number
        index = nValues//2
        median = sum(vSorted[index-1:index+1])/2.0
    else:
        index = (nValues-1)//2
        median = vSorted[index]
    return median
med = getMedian(values)
Calculating the median is easy using NumPy, given its median() function:
from numpy import median
med = median(valArray)
print('Median:', med) # Result is 2
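Since the median is the 50th percentile, the same value can be obtained from NumPy's percentile() function. The following sketch checks this on the example data and also shows the even-count case, where the two middle values are averaged. The four-element list is an assumption added for illustration.

```python
import numpy as np

valArray = np.array([1,2,2,3,2,1,4,2,3,1,0], float)

# The 50th percentile and the median are the same quantity
p50 = np.percentile(valArray, 50)
med = np.median(valArray)

# With an even number of values the two middle values are averaged
evenMed = np.median(np.array([1.0, 2.0, 3.0, 4.0]))

print('50th percentile:', p50, 'Median:', med, 'Even-count median:', evenMed)
```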
The mean is the numerical average of a set of values. It is analogous to the centre of
‘mass’ of the distribution. In simple terms the sample mean is calculated by adding up all
the values and dividing by the number of values. The mean of [1,2,2,3,2,1,4,2,3,1,0] is
21/11 = 1.909. In terms of an underlying probability distribution, the mean of a random
variable, X, is referred to as the expectation of the random variable, written E(X), because
it is the long-term average, considering an unlimited amount of data, and thus also the
most representative centre value for the distribution. It should be noted that in this
chapter we will be considering two types of mean value. The first is the true mean value of
the underlying probability distribution, and for a random variable X we will give it the
label $\mu_x$. The other kind of mean is the sample mean, labelled $\bar{x}$, which often
acts as an estimate for the true mean, and which is calculated as an average value of a
series of $n$ measurements, $x_i$, as one might expect:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
We can readily calculate the sample mean in standard Python:
values = [1,2,2,3,2,1,4,2,3,1,0]
mean = sum(values)/float(len(values))
or using NumPy arrays, noting that mean() is both a stand-alone function and a method
bound to array objects:
from numpy import array, mean
valArray = array(values, float)
m = valArray.mean()
# or
m = mean(valArray)
print('Mean', m) # Result is 1.909
It is handy that these NumPy functions also take an axis argument, so that in a multi-
dimensional array you can calculate the mean across rows or columns of values etc:
valArray2 = array([[7,9,5],
                   [1,4,3]])
print(valArray2.mean()) # All elements - result is 4.8333
print(valArray2.mean(axis=0)) # Column means - result is [4.0, 6.5, 4.0]
print(valArray2.mean(axis=1)) # Row means - result is [7.0, 2.6667]
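The axis argument works the same way for NumPy's median() function. As a brief sketch on the same two-row array, the column and row medians can be found as follows. The variable names colMedians and rowMedians are only illustrative.

```python
import numpy as np

valArray2 = np.array([[7,9,5],
                      [1,4,3]])

# Median down each column (two values each, so they are averaged)
colMedians = np.median(valArray2, axis=0)

# Median along each row (three values each, so the middle one is taken)
rowMedians = np.median(valArray2, axis=1)

print(colMedians)  # [4.  6.5 4. ]
print(rowMedians)  # [7. 3.]
```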
For most named probability distributions the mean is either a fundamental parameter
that is used in the description of the distribution (e.g. for Gaussian) or is readily derived
from the fundamental parameters (e.g. binomial, geometric). However, there are some
curious cases where the mean is undefined, e.g. for the Cauchy distribution.
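The practical consequence of an undefined mean can be seen in a short simulation. This is only a sketch: the seed, the sample size and the Gaussian parameters are arbitrary choices. The running sample mean of Gaussian data settles towards the true mean as more values are included, whereas for Cauchy data it keeps being thrown around by occasional extreme values, even though the sample median remains well behaved.

```python
import numpy as np

rng = np.random.default_rng(1)  # Seed chosen arbitrarily, for repeatability
n = 100000

gauss = rng.normal(loc=2.0, scale=1.0, size=n)  # True mean is 2.0
cauchy = rng.standard_cauchy(size=n)            # Mean is undefined

# Running sample means after 1, 2, ..., n values
counts = np.arange(1, n + 1)
gaussRunning = np.cumsum(gauss) / counts
cauchyRunning = np.cumsum(cauchy) / counts

for k in (100, 10000, 100000):
    print(k, gaussRunning[k-1], cauchyRunning[k-1])

# The median of the Cauchy sample is still a stable estimate of its centre
print('Cauchy sample median:', np.median(cauchy))
```

For such distributions the median is usually the more sensible measure of the centre.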