Simple Linear Regression. In simple linear regression, we study the
relationship between a predicted value Y and a single predictor X. We call
Y the dependent variable because its value depends on the value of X. X is
known as the independent variable.
If you took algebra or pre-calculus in high school or college, then you might
remember that the equation of a line is:
Y = mX + b
where m is the slope of the line and b is the Y-intercept.
If you were to graph this equation, then you would have a graph that looks
something like this:
As you can see, the line shows that for every value of X there is a
corresponding value of Y. You can make a prediction of Y for every new
value of X. In this graph, the value of Y increases as the value of X
increases, but this is not always the case.
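To make this concrete, here is a minimal sketch in Python (the slope and intercept values are invented purely for illustration):

# Predict Y from X using the line Y = mX + b.
m = 2.0  # slope (invented for illustration)
b = 1.0  # Y-intercept (invented for illustration)

def predict(x):
    return m * x + b

for x in [0, 1, 2, 3]:
    print(x, predict(x))  # Y increases as X increases

Each new value of X gives a predicted Y by plugging it into the same equation.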
This is the simplest form of regression, but it's important to understand how
it works because we will be building on it from here on out. Much of
statistical analysis involves a plot like the one pictured above, which makes
a prediction for an output, the dependent variable, based on an input, the
independent variable. This is an example of supervised learning because we
specify the Y variable and the X variable before we start modeling.
With almost all predictions, there will be more than one independent
variable determining our dependent variable. This leads us to our next type
of regression.
Multiple Linear Regression. In data science and most tasks in statistics,
this will be the most popular type of regression. In multiple linear
regression, we will have one output variable Y, just like before. The
difference now though, is that we will have multiple Xs or independent
variables that will predict our Y.
An example is using multiple linear regression to predict the price of
apartments in the New York City real estate market. Our Y, or dependent
variable, is the price of a New York City apartment. The price will be
determined by our independent variables, the Xs, such as square footage,
distance to transportation, and number of rooms. If we were to write this
out as an expression, it would look something like:
apt_price = β0 + β1 sq_foot + β2 dist_transport + β3 num_rooms
We take sample data, data that we already have where we know both the Xs
and their corresponding Ys, and we look at it on a graph like this:
You can see that the values of X and Y don't create a perfect line, but there
is a trend. We can use that trend to make predictions about future values of
Y. So we create a multiple linear regression, and we end up with a line going
through the middle of our data points. This is called the best-fit line, and
it's how we will predict Y when we get new X values in the future.
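To sketch what finding a best-fit line looks like in code, NumPy's polyfit can recover a slope and intercept from noisy points (the data here is invented purely for illustration):

import numpy as np

# Invented sample data: Y roughly follows 3*X + 5, with some noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([8.1, 10.9, 14.2, 16.8, 20.1, 22.9])

# Fit a degree-1 polynomial (a straight line) to the points.
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # close to 3 and 5

# The best-fit line lets us predict Y for a new X value.
new_x = 7.0
print(slope * new_x + intercept)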
The difference here is that instead of writing m for the slope, we have
written β. This equation is pretty much the same thing as if I had written
Y = b + m1X1 + m2X2 + m3X3
Except now we have labels, and we know what our Xs and our Ys are. In
the future, if you see a multiple linear regression equation, it will most
likely be written in this form. Our β is what we call a parameter. It's like a
magic number that tells us the effect that the value of each X has on Y.
Each independent variable will have its own parameter. We find the
parameters by creating a regression model. Over time, with machine
learning, our model will be exposed to more and more data, so our
parameters will improve and our model will become more accurate.
We can create the model by using training data that has the actual prices of
New York City apartments and the actual input variables of square footage,
distance to transportation, and number of rooms. Our model will 'learn' to
approximate the price from real data. Afterward, when we plug in the
independent variables for an apartment with an unknown price, our model
can make a prediction of what the price will be.
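A minimal sketch of this workflow in Python, using scikit-learn's LinearRegression (the apartment data below is invented purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented training data: [sq_foot, dist_transport, num_rooms] per apartment.
X_train = np.array([
    [650, 0.5, 2],
    [900, 1.2, 3],
    [1200, 0.3, 4],
    [500, 2.0, 1],
    [1500, 0.8, 5],
])
# Invented prices (in dollars) for the same apartments.
y_train = np.array([750_000, 980_000, 1_500_000, 520_000, 1_900_000])

model = LinearRegression()
model.fit(X_train, y_train)  # learn the parameters beta_0 through beta_3

print(model.intercept_)  # beta_0
print(model.coef_)       # beta_1, beta_2, beta_3

# Predict the price of an apartment the model has never seen.
new_apt = np.array([[1000, 1.0, 3]])
print(model.predict(new_apt))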
This is supervised learning using a linear regression model. It's supervised
because we are telling the model what answer we want it to give us: the
price of New York City apartments. It learns to predict the price more
accurately as it is given more data, and we continue to evaluate its
accuracy.
Ordinary Least Squares. OLS will try to find the regression line that
minimizes the sum of squared errors, that is, the sum of the squared
differences between the actual Y values and the Y values predicted by the
line.
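As a rough sketch of what minimizing the sum of squared errors means in code (again with invented data), we can compute the OLS solution with NumPy's lstsq and check that any other candidate line has a larger squared error:

import numpy as np

# Invented data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Design matrix with a column of ones so the line has an intercept term.
A = np.column_stack([np.ones_like(x), x])

# OLS solution: the (intercept, slope) pair that minimizes the squared error.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

def sse(b0, b1):
    # Sum of squared errors for the line Y = b0 + b1 * X.
    return np.sum((y - (b0 + b1 * x)) ** 2)

print(sse(beta[0], beta[1]))  # the smallest achievable sum of squared errors
print(sse(0.0, 2.0))          # any other line does at least as badly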