Recall that a feature is a measurable characteristic of the items in a
dataset. Each column in a tabular dataset holds one feature, that is, one
measurable dimension or attribute. Here we have used data describing the
market value of a house as a function of the number of bedrooms, the number
of bathrooms, the square footage, and whether the house has a pool. The
market value is the Y, our dependent variable. Our independent variables, or
Xs, are num_bedrooms, num_bathrooms, sq_ft, and pool.
In supervised learning, you already have the Y in your dataset. In this
case, it is the market value of the home. With enough of this data in our
model, we should be able to predict the market value of a house we have not
seen, as long as we know its number of bedrooms, number of bathrooms, square
footage, and whether it has a pool. Data organized in this way is relatively
easy to work with, and having several independent variables like this makes
it an example of multiple regression (often loosely called multivariate
regression).
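To make the idea concrete, here is a minimal sketch of fitting a linear model to the four independent variables and predicting the market value of an unseen house. It assumes NumPy is available and uses ordinary least squares, which is one common way to fit such a model and not necessarily the method used elsewhere in this text. The eight houses and their prices are invented for illustration, and the names `X_with_intercept` and `new_house` are hypothetical.

```python
import numpy as np

# Hypothetical training data: one row per house.
# Columns: num_bedrooms, num_bathrooms, sq_ft, pool (1 = pool, 0 = no pool)
X = np.array([
    [2, 1,  900, 0],
    [3, 2, 1500, 0],
    [3, 1, 1200, 1],
    [4, 2, 2000, 1],
    [4, 3, 2400, 0],
    [5, 3, 3000, 1],
    [2, 2, 1100, 1],
    [3, 3, 1800, 0],
], dtype=float)

# Invented market values (the Y). They follow an exact linear formula so the
# result is easy to check by hand; real housing data would be noisy.
y = np.array(
    [195000, 290000, 270000, 385000, 415000, 520000, 255000, 335000],
    dtype=float,
)

# Add a column of ones so the model can learn an intercept (a base price).
X_with_intercept = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: coefficients that minimize the squared error.
coefficients, *_ = np.linalg.lstsq(X_with_intercept, y, rcond=None)

# Predict the market value of a house whose Y we do not know:
# 3 bedrooms, 2 bathrooms, 1600 sq ft, with a pool.
new_house = np.array([1.0, 3, 2, 1600, 1])  # leading 1.0 is the intercept term
predicted_value = float(new_house @ coefficients)
print(round(predicted_value))
```

Because the invented prices are perfectly linear in the features, the prediction lands at about 325,000. With real data the fit would only be approximate.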
How much data should you use?
There is no set rule for how much data you will need for your model, but
there are guidelines you should follow. Most importantly, when you have
several independent variables to analyze, your model will work best if your
data covers as many combinations of the independent variables as you can
get. With that coverage, the model is more likely to predict reasonably well
when it encounters a combination of features that it has not seen before.
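One rough way to check how well a dataset covers the combinations described above is to list the combinations that never occur. The sketch below uses only the Python standard library. The rows are hypothetical, the helper `missing_combinations` is an illustrative name, and sq_ft is left out on the assumption that a continuous variable would first need to be grouped into ranges.

```python
from itertools import product

# Hypothetical rows: (num_bedrooms, num_bathrooms, pool)
houses = [
    (2, 1, False),
    (2, 2, True),
    (3, 1, True),
    (3, 2, False),
    (3, 2, True),
]

def missing_combinations(rows):
    """Return the value combinations that never appear together in rows."""
    # Distinct values observed in each column, in sorted order.
    values_per_column = [sorted(set(column)) for column in zip(*rows)]
    seen = set(rows)
    # Every possible combination of observed values, minus those present.
    return [combo for combo in product(*values_per_column) if combo not in seen]

gaps = missing_combinations(houses)
print(gaps)
```

Here three of the eight possible combinations are absent, for example a two-bedroom, one-bathroom house with a pool. Gaps like these suggest where more data would be most useful.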
A good general rule is to have about ten times as many observations as
independent variables. In our market value example above, we have
num_bedrooms, num_bathrooms, sq_ft, and pool. That is four independent
variables, which means we should have at least forty observations like the
ones listed above to create a reliable model.
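The arithmetic behind this rule of thumb can be written as a small helper. This is only a sketch of the guideline stated above; the function name `minimum_rows` and the constant `ROWS_PER_VARIABLE` are hypothetical.

```python
ROWS_PER_VARIABLE = 10  # the "about ten times" rule of thumb from the text

def minimum_rows(independent_variables, rows_per_variable=ROWS_PER_VARIABLE):
    """Rough minimum number of observations for the given variables."""
    return rows_per_variable * len(independent_variables)

features = ["num_bedrooms", "num_bathrooms", "sq_ft", "pool"]
needed = minimum_rows(features)
print(needed)  # 40
```

Treat the result as a lower bound to aim for, not a guarantee that the model will be reliable.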
Having a lot of variables can help us predict the Y more accurately, but
that can be costly and can make your data harder to process. You must also
consider how you are pooling your data. The market values of houses in Los
Angeles will be much different from the market values of houses in
Cleveland.
It’s also important to keep features as relevant as possible. Having multiple
variables will help you make a better prediction, but there are variables that
may just create bias in the model.
Refer to the scikit-learn documentation to see what it recommends for data
sizes for certain types of analysis.
But not all data is useful. We often talk about big data, and it might be easy
to assume that the more data we have, the better. But that’s not always the
case. Some data may not be helpful. Certain variables might get in the way
and may make it harder to find the true answer.