Samples and significance
A key principle underpinning most statistical analyses is the idea that the
data we collect represent a limited number of samples from some underlying
probability distribution. This probability distribution can be thought of as the mechanism
by which the data values are generated. Naturally the data actually arise from some physical
process, and by ascribing a probability distribution we are merely forming a mathematical
model, often a significantly simplified one, to approximate the data-generation process.
For a given situation, if we have an idea of what type of underlying probability
distribution would be appropriate, then by looking at the observed data we can begin to
estimate what the parameters of the distribution are, such as where its centre is and how
much it spreads. Given parameter estimates we can then begin to answer questions which
relate to the probabilistic model, such as how likely it is that a given value is generated by
the model. In virtually all cases the answer provided is not certain; rather, it is
given as being true with a certain probability, which for parameter estimation is often
called a confidence level. A 95% probability is often considered a suitable
confidence level for inferring significance, but of course even at this seemingly strict
level, 5% (1 in 20) of the sampled values would lie outside the quoted range.
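As a sketch of this idea, assuming a normal distribution, scipy.stats can report the central interval that contains 95% of the probability mass; sampled values fall outside it about 1 time in 20:

```python
from scipy.stats import norm

mean = 0.0
stdDev = 1.0

# Central interval containing 95% of the probability mass of a
# normal distribution with the given mean and standard deviation
lower, upper = norm.interval(0.95, loc=mean, scale=stdDev)

print(lower, upper)  # roughly -1.96 and +1.96
```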
Several of the commonly used probability distributions are represented in Python,
notably in the scipy.stats module, which we will routinely refer to in this chapter, and
in the numpy.random module, which allows us to draw random samples from a
distribution. Here we illustrate creating random samplings of different sizes
from a normal distribution using random.normal, which we then show as
a histogram:
from matplotlib import pyplot
from numpy import random

mean = 0.0
stdDev = 1.0

for nPoints in (10, 100, 1000, 10000, 100000):
    sample = random.normal(mean, stdDev, nPoints)
    pyplot.hist(sample, bins=20, range=(-4, 4), density=True)
    pyplot.show()
Predictions from a probability distribution are often coupled to the idea of a competing
hypothesis. Here the probability distribution is often a model of what we expect at
random, and the competing hypothesis would mean that something significantly non-random
was happening. Hence, rather than inferring significance when this model appears to
fit the data, we assert that there is significance if the random model is unlikely to explain
the data samples; that is, our data do not fit the probability distribution of the random
situation. By applying a probabilistic model we are generally not assuming that we
actually have a good physical model for our data, but rather that there is a mathematical
approximation to the data-generation process, which is nonetheless useful for making
predictions and for understanding key aspects of what we are studying.
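To make this concrete, here is a minimal sketch, assuming a standard normal distribution as the random (null) model and an illustrative observed value; we ask how likely the model is to generate a value at least this far from the mean, and call the result significant if that probability falls below 5%:

```python
from scipy.stats import norm

mean = 0.0
stdDev = 1.0
value = 2.6  # an illustrative observed data value

# Two-tailed probability of drawing a value at least this far
# from the mean under the random (null) model
pValue = 2.0 * norm.sf(abs(value), loc=mean, scale=stdDev)

if pValue < 0.05:
    print('Significant: the random model is unlikely to explain this value')
else:
    print('Not significant at the 95% confidence level')
```

Here a small p-value does not prove the competing hypothesis; it only says the random model is an unlikely explanation for the observation.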
Lastly, it is important to note that even in situations where the underlying probability
distribution is not known we can nonetheless estimate some statistical parameters. In the
simplest situation, we might simply try to estimate the mean (average) or standard
deviation (spread) of the distribution, given the data, without worrying too much about what
the distribution is.
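Such distribution-free estimates are readily made with NumPy; the values below are illustrative:

```python
from numpy import array

data = array([1.2, 0.8, 1.5, 0.9, 1.1, 1.3])  # illustrative sample values

# Sample mean and standard deviation, making no assumption about the
# underlying distribution; ddof=1 gives the unbiased sample estimate
mean = data.mean()
stdDev = data.std(ddof=1)

print(mean, stdDev)
```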