Managing Big Data
Sampling means drawing a limited set of examples from your stream and treating
them as if they represented the entire stream. It is a well-known tool in statistics
through which you can make inferences on a larger context (technically called the
universe or the population) by using a small part of it.
Reserving the right data
Statistics was born in a time when obtaining a census was often impractical. A census is
a systematic investigation of a population that counts its members and collects useful
data about them. The government asks all the people in a country about where they live,
their family, their daily life, and their work. The census has its origins in ancient
times. In the Bible, a census occurs in the book of Numbers; the Israelite population
is counted after the exodus from Egypt. For tax purposes, the ancient Romans
periodically held a census to count the population of their large empire. Historical
documents provide accounts of similar census activities in ancient Egypt, Greece,
India, and China.
Statistics, in particular the branch of statistics called inferential statistics, can
achieve nearly the same outcome as a census, within an acceptable margin of error, by
questioning a smaller number of individuals (called a sample). Thus, by querying
a few people, pollsters can determine the general opinion of a larger population on
a variety of issues, such as who will win an election. In the United States, for
instance, the statistician Nate Silver made news by predicting the winner of the
2012 presidential election in all 50 states, using data from samples
(https://www.cnet.com/news/obamas-win-a-big-vindication-for-nate-silver-king-of-the-quants/).
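To make the idea of an "acceptable margin of error" concrete, here is a minimal sketch of a poll. It assumes a made-up population of one million voters, 52 percent of whom support a hypothetical candidate, and it uses the common normal approximation for a 95 percent margin of error on a proportion. The function name, the population, and the sample size are illustrative assumptions, not anything taken from the poll cited above.

```python
import math
import random


def estimate_proportion(sample):
    """Return the sample proportion and an approximate 95% margin of error.

    Uses the normal approximation 1.96 * sqrt(p * (1 - p) / n), which is
    reasonable when the sample is a small fraction of the population.
    """
    n = len(sample)
    p_hat = sum(sample) / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin


# Hypothetical population: 1 = supports the candidate, 0 = does not.
population = [1] * 520_000 + [0] * 480_000

rng = random.Random(42)                  # fixed seed so the run is repeatable
sample = rng.sample(population, 1_000)   # ask only 1,000 of 1,000,000 people

p_hat, margin = estimate_proportion(sample)
print(f"Estimated support: {p_hat:.3f} +/- {margin:.3f}")
```

With only a thousand answers, the estimate typically lands within a few percentage points of the true share, which is the trade-off between cost and precision that a sample offers over a census.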
Clearly, holding a census implies huge costs (the larger the population, the greater
the costs) and requires a lot of organization (which is why censuses are infrequent),
whereas a statistical sample is faster and cheaper. Reduced costs and lower
organizational requirements also make statistical sampling well suited to big data
streaming: Users of big data streaming don’t need every scrap of information, because
a sample can summarize the data’s complexity.
However, there’s a problem with using statistical samples. At the core of statistics
is sampling, and sampling requires randomly picking a few examples from the pool
of the entire population. The key ingredient of the recipe is that every member of
the population has exactly the same probability of being part of the sample. If a
population consists of a million people and your sample size is one, each person’s
probability of being part of the sample is one out of a million. In mathematical
terms, if the population contains N individuals and the sample contains n of them,
the probability that any given individual is part of the sample is n/N, as shown in
Figure 12-2. The
represented sample is a simple random sample. (Other sample types have greater
complexity; this is the simplest sample type and all the others build upon it.)
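The n/N rule is easy to check empirically. The following sketch assumes a tiny population of N = 10 items and a sample size of n = 3, draws many simple random samples with Python's standard `random.sample`, and counts how often each item is picked. The helper name `simple_random_sample` and the chosen numbers are illustrative assumptions.

```python
import random
from collections import Counter


def simple_random_sample(population, n, rng):
    """Draw n distinct items so that every item is equally likely to be picked."""
    return rng.sample(population, n)


N, n = 10, 3                      # population size and sample size (assumed)
population = list(range(N))
rng = random.Random(0)            # fixed seed so the run is repeatable

trials = 20_000
counts = Counter()
for _ in range(trials):
    counts.update(simple_random_sample(population, n, rng))

# Observed share of samples that contained each item; each should be near n/N.
inclusion = {item: counts[item] / trials for item in population}
for item, freq in inclusion.items():
    print(f"item {item}: observed {freq:.3f}, expected {n / N:.3f}")
```

Every item should appear in roughly 30 percent of the samples, which matches n/N = 3/10. Note that this approach needs the whole population in memory and its size known in advance, which is not the case when the data arrives as a stream.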