If you’re familiar with clusters, then you’ll know the importance of scatter plots. These plots help distinguish groups from one another by drawing one dot per record, positioned by two of its characteristics. Using two characteristics, like the height and width of a flower, we can see which species a flower most likely belongs to. Let’s create some fake data and plot the points:
# creating a scatter plot to represent height-weight distribution
from random import randint, seed
from matplotlib import pyplot as plt # skip this line if plt was imported in an earlier cell
seed(2) # fixed seed so the same fake data is generated on every run

height = [ randint(58, 78) for x in range(20) ] # 20 records between 4'10" and 6'6"
weight = [ randint(90, 250) for x in range(20) ] # 20 records between 90lbs. and 250lbs.

plt.scatter(weight, height)

plt.title("Height-Weight Distribution")
plt.xlabel("Weight (lbs)")
plt.ylabel("Height (inches)")

plt.show( )
Go ahead and run the cell. To create some fake data, we use the randint() function from the random module, seeded so that the same numbers come up on every run. Here, we create 20 records each for the height and weight lists. To plot the data, we use the scatter() function, then add a title and axis labels to the plot. You should get an output like Figure 10-5.
Chapter 10 Introduction to Data Analysis
Figure 10-5. Scatter plot of height-weight data
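To connect this back to the flower example, here is a minimal sketch of how a scatter plot can separate groups. The two species, their typical sizes, and the make_species() helper are all invented for illustration; they don’t come from a real dataset. Each group gets its own scatter() call, so each is drawn in its own color:

```python
import random
from matplotlib import pyplot as plt

rng = random.Random(2)  # separate generator, so the seed only affects this example

def make_species(n, height_center, width_center):
    """Hypothetical helper: n fake (heights, widths) scattered around the given centers."""
    heights = [rng.gauss(height_center, 0.4) for _ in range(n)]
    widths = [rng.gauss(width_center, 0.2) for _ in range(n)]
    return heights, widths

# Invented species with made-up typical sizes in centimeters
species = {
    "Species A": make_species(20, height_center=5.0, width_center=1.5),
    "Species B": make_species(20, height_center=7.0, width_center=3.0),
}

fig, ax = plt.subplots()
for name, (heights, widths) in species.items():
    ax.scatter(widths, heights, label=name)  # one call per group, one color per group

ax.set_title("Flower Size by Species (fake data)")
ax.set_xlabel("Width (cm)")
ax.set_ylabel("Height (cm)")
ax.legend()

plt.show()
```

Because the two centers are far apart compared to the spread, the dots should form two visibly separate clouds. Move the centers closer together and the groups start to overlap, which is exactly when classifying a new flower gets harder.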
Histogram
While line plots are great for visualizing trends in time series data, histograms are the king of visualizing distributions. Often, the distribution of a variable is what you’re interested in, and a visualization provides a lot more information than a group of summary statistics. First, let’s see how we can create a histogram:
1| # creating a histogram to show age data for a fake population
2| import numpy as np # import the numpy module to generate data
3| np.random.seed(5)
5| ages = [ np.random.normal(loc=40, scale=10) for x in range(1000) ] # ages distributed around 40
7| plt.hist(ages, bins=45) # bins is the number of bars
9| plt.title("Ages per Population")
10| plt.xlabel("Age")
11| plt.ylabel("# of People")
13| plt.show( )
Go ahead and run the cell. We’ve mentioned the NumPy module previously. It’s used in data science to perform extremely fast numerical calculations, and Pandas’ DataFrames are built on top of NumPy arrays. For the purpose of this cell, however, you just need to know that we’re using it to create random numbers that are centered around a given number. That number is passed into the loc argument on line 5. The scale argument is the standard deviation, which controls how widely the numbers spread around that center. With a scale of 10, roughly two-thirds of the values land between 30 and 50; the rest fall outside that range, but all 1000 random numbers are centered around the age of 40.
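If you want to check what loc and scale do without looking at a chart, the short sketch below prints a few numbers instead. It uses the size argument of np.random.normal() to draw all 1000 values in one call rather than looping; the variable name within_one is just a label chosen for this example:

```python
import numpy as np

np.random.seed(5)  # fixed seed so the numbers are repeatable

# size=1000 draws every value in a single call and returns a NumPy array
ages = np.random.normal(loc=40, scale=10, size=1000)

print("mean:", ages.mean())  # should land close to loc
print("std: ", ages.std())   # should land close to scale

# share of values within one scale of loc, i.e. between 30 and 50
within_one = np.mean((ages >= 30) & (ages <= 50))
print("share between 30 and 50:", within_one)
```

The mean and standard deviation won’t be exactly 40 and 10, since the sample is random, but with 1000 values they should be close.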
To create the histogram, we use the hist() function and pass in the data. A histogram groups the values into ranges, called bins, and shows how many values fall into each one. In our example, the bar around the age of 40 holds more than 60 values. The y axis represents the frequency of the values in each x axis range. The bins argument specifies how many bars you see on the chart. You may be thinking: the more bins the better, right? Wrong. There’s a fine line between too many and too few, and often you’ll just have to try a few values to find the right number. We complete this chart by adding a title and axis labels. The result should look like Figure 10-6.
Figure 10-6. Histogram of centrally distributed age data
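To get a feel for the trade-off in the number of bins, you can count the values per bin without drawing anything. The sketch below uses np.histogram(), which returns the bar heights and the bin edges; the three bin counts and the summary dictionary are arbitrary choices for this illustration:

```python
import numpy as np

np.random.seed(5)
ages = np.random.normal(loc=40, scale=10, size=1000)

summary = {}
for bins in (5, 45, 500):
    # counts holds the height of each bar, edges the boundaries between bars
    counts, edges = np.histogram(ages, bins=bins)
    summary[bins] = {
        "bar_width": edges[1] - edges[0],        # how many years each bar covers
        "tallest_bar": int(counts.max()),        # the highest frequency on the chart
        "empty_bars": int(np.sum(counts == 0)),  # bars with no data at all
    }
    print(bins, summary[bins])
```

With only a handful of bins, each bar covers so many years that the shape of the distribution is lost. With hundreds of bins, each bar holds only a few values and gaps tend to appear, so the chart looks noisy. A value somewhere in between usually reads best.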
Although the data is fake, we can deduce a lot of information from the chart. We can see outliers that may exist, where the general age range sits, and much more.
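What we read off the chart by eye can also be put into numbers. The following is a rough sketch, not the only way to do it: it takes the middle 95% of the values as the “general age range” and treats anything more than three standard deviations from the mean as an outlier. Both cutoffs are assumptions made for this example:

```python
import numpy as np

np.random.seed(5)
ages = np.random.normal(loc=40, scale=10, size=1000)

# general range: the ages between the 2.5th and 97.5th percentiles
low, high = np.percentile(ages, [2.5, 97.5])
print(f"middle 95% of ages: {low:.1f} to {high:.1f}")

# outliers: values more than three standard deviations away from the mean
mean, std = ages.mean(), ages.std()
outliers = ages[np.abs(ages - mean) > 3 * std]
print("number of outliers:", len(outliers))
```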