of data cleaning first. The process of data cleaning is often referred to as
data scrubbing.
We might have data that comes in the form of images or emails. We need to
rewrite it so that it has numerical values that will be interpretable by our
algorithms. After all, our machine learning models are algorithms or math
equations, so the data needs to have numerical values for it to be modeled.
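As a rough illustration of turning text into numbers, the sketch below counts how often each word appears in a few short messages. The sample emails and the helper names (tokenize, to_vector) are made up for this example, and word counting is only one simple way to do it, not necessarily the one you would use in a real project.

```python
from collections import Counter

# Hypothetical sample emails, invented for this illustration.
emails = [
    "Win a free prize now",
    "Meeting moved to noon",
    "Free free prize inside",
]

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

# The vocabulary is every distinct word seen, in a fixed (sorted) order.
vocabulary = sorted({word for email in emails for word in tokenize(email)})

def to_vector(text):
    """Represent a text as a list of word counts, one slot per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [to_vector(email) for email in emails]

for email, vector in zip(emails, vectors):
    print(vector, "<-", email)
```

Each email is now a list of numbers of the same length, which is the kind of input a model can work with.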
You might also have pieces of data that were recorded incorrectly or in the
wrong format. There may be variables that you don’t need and must
get rid of. It can be tedious and time-consuming, but it’s extremely important
to have data that will work and can easily be read by your model. It’s the
least sexy part of being a data scientist.
This is the part of machine learning where you will probably spend most of
your time. As a data scientist, you will probably spend about 20% of your
time doing data science and the other 80% of your time making sure your
data is clean and ready to be processed by your model. We may be combining
multiple types of data, and we need to reformat the recordings so that they
fit together. First, in the case of supervised learning, pick the variables
that you think are most important for your model. If we choose irrelevant
variables, we may introduce bias and make our model less effective.
A simple example of cleaning or scrubbing data is recoding a response for
gender. In your data, you have a column for male/female. Unfortunately,
male and female do not have a numerical value. But you can easily change
this by making this a binary variable. Assign female = 1 and male = 0. Now
you can find a numerical value for the effect that being female has on the
outcome of your model.
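The recoding described above can be sketched in a few lines of Python. The records, the column names, and the encode_gender helper are hypothetical and exist only for this illustration. The helper also trims spaces and ignores capitalization, which is an extra assumption on top of the simple female = 1, male = 0 rule.

```python
# Hypothetical survey records; the "gender" field holds text labels.
records = [
    {"gender": "female", "income": 52000},
    {"gender": "male", "income": 48000},
    {"gender": "Female ", "income": 61000},  # stray space and capital letter
]

# The mapping from the text: female = 1, male = 0.
GENDER_CODES = {"female": 1, "male": 0}

def encode_gender(label):
    """Return 1 for female, 0 for male, and None for an unrecognized label."""
    return GENDER_CODES.get(label.strip().lower())

# Add a new binary column instead of overwriting the original one.
for row in records:
    row["is_female"] = encode_gender(row["gender"])

encoded = [row["is_female"] for row in records]
print(encoded)
```

Keeping the original column next to the new one makes it easy to check the recoding by eye before you drop the text version.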
We can also combine variables to make the data simpler to interpret. Let’s say
you are creating a regression model that predicts a person’s income based on
several variables. One of the variables is the education level, which you
have recorded in years. So, the possible responses for years of education are
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16. This is a lot of discrete
categories. You could simplify it by creating groups. For example, you
could recode the values 1, 2, 3, 4, 5, 6, 7, 8 as primary_ed, the values 9, 10,
11, 12 as secondary_ed, and the values 13, 14, 15, 16 as tertiary_ed. Instead
of having sixteen categories, you have three. Respondents either have some
primary education, secondary education, or some level of post-secondary or
college-level education. This is known as binning data, and it can be a good
way to clean up your data if it’s used properly.
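Here is a minimal sketch of that binning step in Python. The function name education_bin and the sample values are invented for the example, and the cut points (1 to 8, 9 to 12, 13 to 16) are the ones used in the text. Anything outside 1 to 16 is returned as None so that it can be inspected rather than silently placed in a group.

```python
def education_bin(years):
    """Map years of education (1 to 16) to one of three groups."""
    if 1 <= years <= 8:
        return "primary_ed"
    if 9 <= years <= 12:
        return "secondary_ed"
    if 13 <= years <= 16:
        return "tertiary_ed"
    return None  # outside the expected range; flag for review

# Hypothetical responses, including the edges of each group.
years_of_education = [3, 8, 9, 12, 13, 16]
binned = [education_bin(years) for years in years_of_education]

for years, group in zip(years_of_education, binned):
    print(years, "->", group)
```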
When you are combining variables to make interpretation simpler, you must
consider the tradeoff between having more streamlined data and losing
some important information about relationships in the data. Consider that in
this example, by combining these variables into three groups instead of
sixteen, you may be creating bias in your model.
There are a lot of factors that could require you to clean your data. Even a
misspelling or an extra space somewhere in your data can have a negative
impact on your model.
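To see why a stray space or a misspelling matters, consider the small sketch below. The city names, the CORRECTIONS table, and the normalize helper are all hypothetical. In practice you would build the table of corrections by looking at your own data.

```python
from collections import Counter

# Hypothetical raw entries: the same city typed several different ways.
raw_cities = ["Boston", "boston ", " Boston", "Bostn", "Chicago"]

# Hand-built table of known misspellings (an assumption for this example).
CORRECTIONS = {"bostn": "boston"}

def normalize(value):
    """Collapse extra whitespace, lowercase, and apply known corrections."""
    cleaned = " ".join(value.split()).lower()
    return CORRECTIONS.get(cleaned, cleaned)

before = Counter(raw_cities)
after = Counter(normalize(city) for city in raw_cities)

print("distinct values before cleaning:", len(before))
print("distinct values after cleaning:", len(after))
```

Before cleaning, each variant of the same city is treated as its own category. After cleaning, the variants collapse into one.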
You might have data that is missing. In order to fix this situation, you can
replace the missing values with either the mode or the median of that