plt.legend()
plt.show()
There we have it. We have 5 clusters, and Cluster #2 (blue points, High Annual
Income and Low Spending Score) is significant enough to merit attention. It might
be worthwhile for the marketing department to focus on that group.
Also notice the centroids (the yellow points). They are part of how K-Means
clustering works. It’s an iterative approach: the centroids are placed at random
initially and then moved repeatedly until they converge to a minimum (e.g. the
sum of squared distances from each point to its nearest centroid is minimized).
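The iteration described above can be sketched in plain Python so that each step is visible. This is only a minimal illustration, not the scikit-learn implementation: the function name kmeans, the toy points, and the choice of picking the starting centroids from the data itself are all assumptions made for the example.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-Means sketch for 2-D points given as (x, y) tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid
        # (squared Euclidean distance is enough for comparing).
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(c)  # leave an empty cluster's centroid in place
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Toy data (made up for illustration): two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On data this cleanly separated, the centroids settle at the middle of each group after a few passes. On real data the result can depend on the random starting positions, which is one reason library implementations usually run several initializations.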
As mentioned earlier, it can all be arbitrary and it may depend heavily on our
judgment and the intended application. We can set n_clusters to anything other
than 5. We only used the Elbow Method so we can have a more sound and
consistent basis for the number of clusters. But it’s still up to our judgment
which value we should use and whether the results are good enough for our
application.
Anomaly Detection
Aside from revealing the natural clusters, it’s also common to check whether
there are obvious points that don’t belong to any of those clusters. This is the
heart of detecting anomalies or outliers in data.
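One simple way to look for points that don’t belong to any cluster is to measure how far each point lies from its nearest centroid. The sketch below assumes the centroids are already known (for example, from a fitted K-Means model). The centroid values, the points, and the max_distance cutoff are all made up for illustration.

```python
import math

# Hypothetical centroids, e.g. taken from a fitted K-Means model.
centroids = [(1.5, 1.5), (10.5, 10.5)]

# Hypothetical data points; the last one sits far from both centroids.
points = [(1, 1), (2, 2), (10, 11), (11, 10), (6, 20)]

# Assumed cutoff: anything farther than this from every centroid is flagged.
max_distance = 3.0

def nearest_centroid_distance(point):
    """Euclidean distance from a point to the closest centroid."""
    return min(math.dist(point, c) for c in centroids)

anomalies = [p for p in points if nearest_centroid_distance(p) > max_distance]
print(anomalies)
```

The cutoff is a judgment call. A smaller value flags more points, and a larger value lets more of them pass.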
This is a crucial task because any large deviation from the normal can cause a
catastrophe. Is a credit card transaction fraudulent? Is a login activity suspicious
(you might be logging in from a totally different location or device)? Are the
temperature and pressure levels in a tank being maintained consistently (any
outlier might cause an explosion or an operational halt)? Is a certain data point
caused by wrong entry or measurement (e.g. perhaps inches were used instead of
centimeters)?
With straightforward data visualization we can often see the outliers
immediately. We can then evaluate whether these outliers present a major threat.
We can also assess them by referring to the mean and standard deviation. If a
data point lies several standard deviations away from the mean (three is a
common rule of thumb), it could be an anomaly. A deviation of only one standard
deviation is usually not enough, because many ordinary points fall that far from
the mean.
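The mean and standard deviation check can be sketched as follows. The function name flag_outliers, the threshold of 3, and the sample heights are assumptions for illustration. The last value imitates the wrong-unit case mentioned earlier, where a height was entered in inches instead of centimeters.

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical heights in centimeters; the final entry was recorded in inches.
heights = [168, 170, 172, 169, 171, 170, 168, 172, 171, 169,
           170, 173, 167, 170, 171, 169, 172, 168, 170, 67]

outliers = flag_outliers(heights)
print(outliers)
```

Note that the outlier itself pulls the mean and inflates the standard deviation, so with very small samples even an extreme value may not cross a threshold of 3.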
This is also where our domain expertise comes in. If there’s an anomaly, how
serious are the consequences? For instance, there might be thousands of
purchase transactions happening in an online store every day. If we’re too
strict with our anomaly detection, many legitimate transactions will be rejected
(which results in lost sales and profits). On the other hand, if we’re too
lenient, our system will approve more transactions, including some that are
actually fraudulent. This might lead to complaints later and possibly the loss
of customers in the long term.
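To get a feel for this tradeoff, we can count how many transactions would be flagged at different sensitivity levels. The sketch below uses synthetic purchase amounts and a simple standard deviation rule. The amounts, the candidate thresholds, and the helper count_flagged are all assumptions for illustration, not a real fraud model.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic purchase amounts: mostly around 50, plus a few unusually large ones.
amounts = [rng.gauss(50, 10) for _ in range(1000)] + [120, 150, 400]

mean = statistics.mean(amounts)
std = statistics.pstdev(amounts)

def count_flagged(threshold):
    """Number of transactions more than `threshold` standard deviations from the mean."""
    return sum(1 for a in amounts if abs(a - mean) / std > threshold)

# A lower threshold means stricter detection and therefore more rejections.
flagged_by_threshold = {t: count_flagged(t) for t in (1.5, 2.0, 3.0, 4.0)}
for threshold, n_flagged in flagged_by_threshold.items():
    print(f"threshold={threshold}: {n_flagged} transactions flagged")
```

The stricter the threshold, the more transactions get flagged. Where to draw the line depends on how costly a rejected legitimate purchase is compared with an approved fraudulent one.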
Notice here that it’s not all about algorithms, especially when we’re dealing
with business cases. Each field might require a different sensitivity level.
There’s always a tradeoff, and either option could be costly. It’s a matter of
testing and finding out whether our system for detecting anomalies is sufficient
for our application.