distribution. This is how the data is spread out on our graph. It shows the
frequency of values of our data and how they appear in conjunction with
one another.
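We use our variance to find the standard deviation: the standard deviation
is the square root of the variance. It describes how far the data points
typically lie from the mean, in the same units as the data. The typical
distance between the predicted data points and the real data points on a
regression or prediction model is a related but separate measure, the root
mean squared error.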
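As a minimal sketch of how the variance leads to the standard deviation, the
snippet below works through a small set of invented sample values. It assumes
we treat the values as a whole population, so it divides by the number of
values rather than by one less than that number.

```python
import math
import statistics

# Invented sample values, used only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Population variance: the average squared distance from the mean.
variance = sum((v - mean) ** 2 for v in values) / len(values)

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 5.0 4.0 2.0

# The standard library reports the same population figures.
assert math.isclose(variance, statistics.pvariance(values))
assert math.isclose(std_dev, statistics.pstdev(values))
```

For a sample drawn from a larger population, statistics.variance and
statistics.stdev divide by one less than the number of values instead.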
We must also watch for models that suffer from overfitting and
underfitting. An overfitted model is good at predicting outcomes on the
training data, but it struggles when you introduce new data. It’s like a
model that memorizes instead of learning. It can happen if your training
sample is not randomly selected from the data.
Underfitting describes a model that is too simple to pick up the
significant patterns in the data. It predicts poorly even on the training
data, and its variables and parameters aren't specific enough to give us
any meaningful insights. If you don't have enough training data, your model
could be underfitted.
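The contrast between the two failures can be sketched with three toy models.
Everything here is an assumption made for illustration: the data is generated
from the hypothetical rule y = 2x + 1 plus random noise, the underfitted model
always predicts the training mean, and the overfitted model simply memorizes
the training points and returns the answer of the nearest one.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def make_data(n):
    """Hypothetical data: y = 2x + 1 plus random noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + random.gauss(0, 2) for x in xs]
    return xs, ys

train_x, train_y = make_data(30)
test_x, test_y = make_data(30)

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)

def underfit(x):
    # Too simple: ignores x and always predicts the training mean.
    return mean_y

def overfit(x):
    # Memorizes: returns the y of the closest training x.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

# Least-squares straight line as a reasonable middle ground.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def line(x):
    return slope * x + intercept

def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

for name, model in [("underfit", underfit), ("line", line), ("overfit", overfit)]:
    print(name,
          "train:", round(mse(model, train_x, train_y), 2),
          "new:", round(mse(model, test_x, test_y), 2))
```

The memorizing model scores an error of zero on the training data because it
has stored every answer, yet its error on the new data is higher. The
mean-only model has a large error on both sets, which is the usual sign of
underfitting.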
One of the most common mistakes people make when looking at data is
confusing correlation with causation. If I told you that every person who
committed a murder last year bought eggs every week, I couldn’t claim that
people who buy eggs are murderers. Or maybe, looking at my data, I see a
rise in people buying milk as well as a rise in teen pregnancy. Would I be
able to claim that people drinking a lot of milk causes teen pregnancy? Or
that teenagers getting pregnant caused people to buy more milk?
This is the difference between correlation and causation. Sometimes the
data shows trends that seem like they are related. When two events are
correlated, it means that they seem to have a relationship because they
move along the graph on a similar trajectory over a similar span of time.
Causation, by contrast, means that the relationship between the two events
involves one event causing the other.
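To make the idea concrete, here is a small sketch that measures how closely
two series move together using the Pearson correlation coefficient. The milk
and pregnancy figures are invented for illustration and are not real data;
the helper name pearson is also just a label chosen for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(spread_x * spread_y)

# Invented yearly figures: both series simply rise over time.
milk_sales = [100, 104, 109, 115, 120, 126]
teen_pregnancies = [50, 52, 55, 57, 61, 63]

r = pearson(milk_sales, teen_pregnancies)
print(round(r, 3))
```

The coefficient comes out close to 1, which only tells us that the two series
rise together. Nothing in the calculation says which one, if either, is the
cause.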
In order to claim that two things have a causal relationship, a few criteria
need to be met. The first is covariation. The causal variable and the event
it is supposed to have caused need to covary, meaning that a change in one
is accompanied by a change in the other.
The second criterion that needs to be met is that the causal event needs to
occur before the event it is supposed to have caused. For an event to be
considered causal, it must come first.
Third, the data scientist must control for outside factors. In order to make
a strong case that one thing causes another, you need to be able to present
evidence that other variables are not the true cause. If the causal variable
still produces the effect even when the other variables are accounted for,
then you can claim there is a causal relationship.
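One common way to control for an outside factor is a partial correlation,
which asks how strongly two variables are related once a third has been
accounted for. The sketch below is only an illustration under stated
assumptions: the data is simulated, and the hidden factor z is built to drive
both x and y while x has no effect on y at all.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    spread_x = sum((a - mean_x) ** 2 for a in xs)
    spread_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(spread_x * spread_y)

n = 2000
z = [random.gauss(0, 1) for _ in range(n)]     # hidden outside factor
x = [2 * zi + random.gauss(0, 1) for zi in z]  # driven by z
y = [2 * zi + random.gauss(0, 1) for zi in z]  # driven by z, not by x

r_xy = pearson(x, y)
r_xz = pearson(x, z)
r_yz = pearson(y, z)

# Partial correlation of x and y with z held fixed.
partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print("raw correlation:", round(r_xy, 2))
print("after controlling for z:", round(partial, 2))
```

The raw correlation between x and y is strong, but it falls to roughly zero
once z is held fixed, so in this simulated case the outside factor is the
true cause. If the relationship had survived the control, the causal claim
would be on firmer ground.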