independent variables as well as their respective dependent variables. This
means that for every input, you will already know the output of your data.
From this data, your model will learn to predict the output on its own. Our
training data gives us the parameters we need to make predictions. This is
the data that our machine learns from.
Test data is the data that the machine gets once you are satisfied with the
model, and you want to see what it does out in the wild. In this data, the
model sees only the independent variables, not the output. With test data, we
can see how well our model does at predicting an outcome with new data.
Your training data should account for most of your data, approximately 70%,
while your test data is the remaining 30%. In order to avoid bias, make sure
that the split between training data and test data is completely random.
Don’t choose which data to use; let it be random. Don’t use the same data for
training and testing. Start by giving the training data to the machine so it
can learn the relationships between X and Y, then use the test data to see
how well your model did.
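To make the random 70/30 split described above concrete, here is a minimal sketch using only Python's standard library. The function name `train_test_split`, the `seed` parameter, and the toy data are assumptions made for illustration; they are not part of the original text.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)                      # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)      # random order, reproducible via seed
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become test data, the rest training data,
    # so no row can end up in both sets.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (x, y) pairs where y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))
```

Because the rows are shuffled before they are cut, you never pick by hand which examples go where, and fixing the seed lets you repeat the same split later.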
The most important question to consider during this process is whether your
model will still work when it is presented with new data. You can test this
by validating on held-out data, the simplest form of cross-validation. This
means you will test your model on data you have not used yet. Keep some data
to the side that you didn't use during training to see how accurate your
model is at the end.
You can also use K-fold validation to check the accuracy of your model.
This method is pretty easy to use and generally unbiased. It’s a good
technique to use when we don’t have a lot of data to work with for testing.
For K-fold validation, we will break our data into k folds, usually between 5
and 10. Hold out each fold in turn as test data, train on the remaining
folds, and see how the model performed across all the folds once you are
finished with testing. Usually, the larger your value of k, the less biased
your test will be.
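The K-fold procedure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `k_fold_indices` and `mean_predictor_error`, the toy outputs, and the "model" (which simply predicts the average of the training outputs) are all hypothetical stand-ins, not something the text prescribes.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)            # assign samples to folds at random
    folds = [indices[i::k] for i in range(k)]       # k roughly equal-sized folds
    for i in range(k):
        test_idx = folds[i]                         # fold i is held out for testing
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def mean_predictor_error(train_y, test_y):
    """Stand-in model: predict the training mean, return mean absolute error."""
    prediction = sum(train_y) / len(train_y)
    return sum(abs(y - prediction) for y in test_y) / len(test_y)

# Hypothetical outputs, made up for illustration.
ys = [float(v) for v in range(20)]

scores = []
for train_idx, test_idx in k_fold_indices(len(ys), k=5):
    train_y = [ys[i] for i in train_idx]
    test_y = [ys[i] for i in test_idx]
    scores.append(mean_predictor_error(train_y, test_y))

average_error = sum(scores) / len(scores)           # performance across all folds
print(len(scores), average_error)
```

Every sample is used for testing exactly once and for training k - 1 times, which is why this approach is helpful when you don't have much data to spare.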
So far, we have talked about models interpreting data to find meaning and
patterns. But what kind of data are we going to use? Where will we get our
data, and what is it going to look like?
Data is the most critical component for machine learning. After all, your
model will only learn with data, so it’s important that you have data that is
relevant and meaningful. It can come in many shapes and sizes, structured
differently depending on the kind of data. The more structured the data is,
the easier it is to work with. Some data has very little structure, and this
data is harder to interpret. Data for facial recognition can be huge and have
very little meaning to the untrained eye.
Structured data is more organized. This is the type of data that you will
likely use when you are first starting out. It will help you get your feet wet,
and you can start understanding the statistics involved in machine learning.
Usually, structured data will come in a familiar form of rows and columns.
This is called a tabular dataset.
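As a rough illustration of a tabular dataset, the sketch below builds a tiny table in plain Python. The column names and every value in it are hypothetical and made up purely for illustration. Each row is one example, and each column is one variable.

```python
# Hypothetical toy table: every name and value here is made up for illustration.
columns = ["size_sqft", "bedrooms", "price"]
rows = [
    [850, 2, 120000],
    [1200, 3, 175000],
    [1500, 3, 210000],
]

# Treat the last column as the dependent variable (y) and the rest as the
# independent variables (X).
X = [row[:-1] for row in rows]
y = [row[-1] for row in rows]

# Print the table as tab-separated rows and columns.
print("\t".join(columns))
for row in rows:
    print("\t".join(str(value) for value in row))
```

Here the first two columns play the role of the independent variables and the last column is the output the model would learn to predict.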