The American Crow and the Fish Crow are almost indistinguishable, at least
visually. The attributes for each photo, such as color and size, have actually been labeled by
humans. Caltech and UCSD used human workers on Amazon's Mechanical Turk to label
the dataset. Researchers often use Mechanical Turk, a web service in which a
person is paid a small amount of money for each item they label, so that datasets
are built with human insight rather than machine predictions.
If you have your own dataset that needs lots of human-provided labels,
you might consider spending some money on Mechanical Turk to
complete that task.
Prediction with Random Forests
Chapter 2
[ 28 ]
Here's an example of a single photo and its labels:
http://www.vision.caltech.edu/visipedia-data/CUB-200-2011/browse/Summer_Tanager.html
We can see that the Summer Tanager is marked as having a red throat, a solid belly pattern,
a perching-like shape, and so on. The dataset includes information about how long it took
each person to decide on the labels and how confident the person is with their decisions,
but we're not going to use that information.
The data is split into several files. We'll discuss those files before jumping into the code:
The classes.txt file shows class IDs with the bird species names. The images.txt
file shows image IDs and filenames. The species for each photo is given in the
image_class_labels.txt file, which connects the class IDs with the image IDs.
The attributes.txt file gives the name of each attribute, which ultimately is not
going to be that important to us; we're only going to need the attribute IDs.
Finally, the most important file is image_attribute_labels.txt:
It connects each image with its attributes, using a binary value to mark each
attribute as either present or absent. Each row in this file was produced by a
user on Mechanical Turk.
Now, let's look at the code:
We will first load the CSV file with all the image attribute labels.
Here are a few things that need to be noted:
The values are space-separated
There is no header column or row
Ignore the messages or warnings by passing error_bad_lines=False
and warn_bad_lines=False
Use columns 0, 1, and 2, which hold the image ID, the attribute ID, and the
present or non-present value
We don't need to worry about the certainty values or the time taken to select the labels.
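The loading step above can be sketched as follows. This is a minimal, runnable sketch that reads a tiny in-memory stand-in for image_attribute_labels.txt rather than the real file; the sample rows, the column renaming, and the file layout (image ID, attribute ID, present flag, certainty, time) are assumptions based on the description above, not the book's exact code:

```python
import io
import pandas as pd

# Tiny in-memory stand-in for image_attribute_labels.txt; each row is:
# <image id> <attribute id> <present flag> <certainty> <time taken>
sample = io.StringIO(
    "1 1 0 3 12.5\n"
    "1 2 0 3 2.1\n"
    "1 5 1 4 7.0\n"
    "2 1 1 3 3.3\n"
)

# Space-separated, no header row; keep only columns 0, 1, and 2.
# (On older pandas versions, error_bad_lines=False and
# warn_bad_lines=False silence complaints about malformed rows;
# newer pandas uses on_bad_lines="skip" instead.)
imgatt = pd.read_csv(sample, sep=" ", header=None, usecols=[0, 1, 2])
imgatt.columns = ["imgid", "attid", "present"]
print(imgatt.shape)  # (4, 3)
```

With the real file, you would pass its path instead of the StringIO object; everything else stays the same.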
Here's the top of that dataset:
Image ID number 1 does not have attributes 1, 2, 3, or 4, but it does have attribute 5.
The shape will tell us how many rows and columns we have:
It has 3.7 million rows and three columns. This is not the format we want:
the attributes should be columns, not rows.
Therefore, we have to use pivot, just like Excel's pivot method:
1. Pivot on the image ID, so there is one row for each image ID; there will be
only one row for image number one.
2. Turn the attributes into distinct columns, with values of ones or twos.
We can now see that each image ID is just one row and each attribute is its own column,
and we have the ones and the twos:
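The two pivot steps above can be sketched with a toy long-format table. The three images-by-three attributes data here is made up for illustration; the ones-and-twos encoding follows the book's screenshots:

```python
import pandas as pd

# Toy long-format attribute table: one row per (image, attribute) pair
imgatt = pd.DataFrame({
    "imgid":   [1, 1, 1, 2, 2, 2],
    "attid":   [1, 2, 3, 1, 2, 3],
    "present": [1, 2, 1, 2, 2, 1],
})

# Pivot: one row per image ID, one column per attribute ID
imgatt2 = imgatt.pivot(index="imgid", columns="attid", values="present")
print(imgatt2.shape)  # (2, 3)
```

On the real data, the same pivot call turns 3.7 million rows into roughly 12,000 rows by 312 attribute columns.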
Let's feed this data into a random forest. We have 312 columns, one for each
attribute, and about 12,000 rows, one for each example of a bird:
Now, we need to load the answers, such as whether it's a bird and which species it is. Since
it is the image class labels file, the separator is a space, there is no header row, and the two
columns are imgid and label. We will use set_index('imgid') to get the same kind of
result produced by imgatt.head(), where the rows are identified by the image ID:
Here's what it looks like:
The imgid column has 1, 2, 3, 4, and 5, all labeled as 1; they're all albatrosses at the
top of the file. As seen, there are about 12,000 rows, which is perfect:
This is the same number of rows as the attributes data, so we will use a join.
In the join, we will use the index on the image ID to join the two data frames; effectively,
the label is stuck on as the last column.
We will then shuffle the rows and split off the attributes. In other words, we want to
separate the attributes from the labels. So, the attributes are the first 312 columns, and the
last column is the label:
After shuffling, we have the first row as image 527, the second row as image 1532, and so
forth. The attributes and the label data are in agreement: the first row is image 527, which is
species number 10. You may not know which bird species 10 is, but these
are its attributes. The data is finally in the right form, so we need to do a train-test split.
There were 12,000 rows, so let's take the first 8,000 and call them training, and call the rest
(about 4,000) testing. We'll get the answers using
RandomForestClassifier:
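The steps above can be sketched as follows. The random 120-row dataset stands in for the real attributes and labels, and the max_features=50 and n_estimators=100 values are illustrative assumptions, not necessarily the book's exact settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(1, 3, size=(120, 312))  # toy attribute matrix of 1s and 2s
y = rng.integers(1, 201, size=120)       # toy species labels in 1..200

# First two-thirds for training, the rest for testing
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

# max_features limits how many columns each tree may consider;
# n_estimators is the number of trees in the forest
clf = RandomForestClassifier(max_features=50, n_estimators=100,
                             random_state=0)
clf.fit(X_train, y_train)  # fit actually builds the forest

print(clf.predict(X_train[:5]))   # predictions for the first five rows
print(clf.score(X_test, y_test))  # accuracy on the held-out test set
```

With the real df_att and df_label frames, you would slice the first 8,000 rows for training and the rest for testing instead of this toy split.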
max_features sets the number of different columns each tree can look at.
For instance, if we let each tree look at only two attributes, that's probably not enough to
actually figure out which bird it is; some birds are distinctive, but others need many more
attributes, so we pass a larger max_features value. The number of estimators denotes the
number of trees created, and calling fit actually builds the forest.
Let's predict a few cases. Let's use attributes from the first five rows of the training set,
which will predict species 10, 28, 156, 10, and 43. After testing, we get 44% accuracy:
44% accuracy may not sound like a great result, but there are 200 species, so random
guessing would achieve only about 0.5% accuracy; 44% is much better than chance.