    Name  Weight in Pounds
0    Nik               180
1    Jim               175
2  Alice               143
3   Jane               155
4   Matt               167
5   Kate               189
Let's convert these weights to kilograms:
In [42]:
df['Weight in Kg'] = df['Weight in Pounds'] / 2.2
df
Out[42]:
    Name  Weight in Pounds  Weight in Kg
0    Nik               180         81.82
1    Jim               175         79.55
2  Alice               143         65.00
3   Jane               155         70.45
4   Matt               167         75.91
5   Kate               189         85.91
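The table above shows the kilogram values rounded to two decimals; the raw division actually produces longer floats (e.g. 81.818182). If you want the stored values to match the display, one option is to round them explicitly. A minimal sketch, assuming the df created above (the two-decimal choice is just an example):

# Round the converted weights so the stored values match the
# two-decimal figures shown in the table above.
df['Weight in Kg'] = (df['Weight in Pounds'] / 2.2).round(2)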
Tip #22: "Text to Columns" from Excel in Pandas
Need to split a column containing text into multiple columns?
Use the split() function with the expand=True argument!
In [43]:
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice'],
                   'Location': ['Toronto, Ontario', 'Los Angeles, California', 'Portland, Oregon']})
df
df
Out[43]:
    Name                 Location
0    Nik         Toronto, Ontario
1    Jim  Los Angeles, California
2  Alice         Portland, Oregon
Let's split location into two columns:
In [44]:
df[['City', 'State']] = df['Location'].str.split(pat=',', expand=True)
df
Out[44]:
    Name                 Location         City       State
0    Nik         Toronto, Ontario      Toronto     Ontario
1    Jim  Los Angeles, California  Los Angeles  California
2  Alice         Portland, Oregon     Portland      Oregon
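One thing to watch for: splitting on ',' leaves a leading space on the second column (' Ontario' rather than 'Ontario'). A small sketch of one way to clean that up, assuming the df created above:

# Strip the whitespace left behind by splitting on ','.
df['State'] = df['State'].str.strip()

# Alternatively, split on the comma plus the space in one go:
# df[['City', 'State']] = df['Location'].str.split(pat=', ', expand=True)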
Tip #23: Explode a Column in Pandas
Sometimes you'll get data back where a dataframe column contains lists of values.
This is particularly true for surveys.
Using the Pandas explode() function, you can turn this into a long dataset:
In [45]:
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice'],
                   'Cities Visited': [['Toronto', 'Los Angeles', 'Oregon'],
                                      ['Montreal', 'Toronto', 'Atlanta'],
                                      ['Toronto', 'Vancouver', 'Los Angeles']]})
df
Out[45]:
    Name                     Cities Visited
0    Nik     [Toronto, Los Angeles, Oregon]
1    Jim       [Montreal, Toronto, Atlanta]
2  Alice  [Toronto, Vancouver, Los Angeles]
Now let's explode this dataset:
In [46]:
df = df.explode(
'Cities Visited'
)
df.head()
Out[46]:
  Name Cities Visited
0  Nik        Toronto
0  Nik    Los Angeles
0  Nik         Oregon
1  Jim       Montreal
1  Jim        Toronto
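Notice that the original index labels are repeated after exploding (0 appears three times). If you'd rather have a clean, sequential index, one option is to reset it; a quick sketch, assuming the exploded df above:

# Replace the repeated index labels with a fresh 0..n-1 RangeIndex.
df = df.reset_index(drop=True)
df.head()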
Tip #24: Create a Pivot Table
Let's turn this exploded dataset into a table.
In [47]:
df['count'] = 1
pivot = df.pivot_table(index='Name', columns='Cities Visited',
                       values='count', aggfunc='count', fill_value=0)
pivot
Out[47]:
Cities Visited  Atlanta  Los Angeles  Montreal  Oregon  Toronto  Vancouver
Name
Alice                 0            1         0       0        1          1
Jim                   1            0         1       0        1          0
Nik                   0            1         0       1        1          0
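As a side note, the same count table can be produced without adding a helper 'count' column by using pd.crosstab. A sketch, assuming the exploded df from Tip #23:

# pd.crosstab counts co-occurrences of the two columns directly,
# so the temporary 'count' column isn't needed.
pivot = pd.crosstab(df['Name'], df['Cities Visited'])
pivot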
To learn everything you need to know about Pandas Pivot Tables, check out
my post here:
https://datagy.io/python-pivot-tables/
Tip #25: Change Dataframe Column Order
Need to change the order that your columns appear in?
This simple trick makes it incredibly easy:
In [48]:
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
                   'Score': [100, 120, 96, 75, 68, 123],
                   'Height': [178, 180, 160, 165, 185, 187],
                   'Weight': [180, 175, 143, 155, 167, 189]})
df
Out[48]:
    Name  Score  Height  Weight
0    Nik    100     178     180
1    Jim    120     180     175
2  Alice     96     160     143
3   Jane     75     165     155
4   Matt     68     185     167
5   Kate    123     187     189
Use two sets of square brackets to change the order:
In [49]:
df = df[['Name', 'Weight', 'Height', 'Score']]
df
Out[49]:
    Name  Weight  Height  Score
0    Nik     180     178    100
1    Jim     175     180    120
2  Alice     143     160     96
3   Jane     155     165     75
4   Matt     167     185     68
5   Kate     189     187    123
Tip #26: Calculate a Rolling Average
Sometimes you will want to smooth out your charts by producing a rolling
average.
This is especially helpful when loading in data that fluctuates wildly, and
where trends are more important than individual data points.
In [50]:
df =
pd.read_csv(
'https://raw.githubusercontent.com/datagy/mediumdata/m
aster/sales.csv'
)
df.plot(kind=
'line'
)
Out[50]: [line plot of the raw sales data]
You can see how jagged the lines are.
Let's smooth this out with a rolling average.
In [51]:
df[
'rolling_average'
] = df[
'Sales'
].rolling(
5
).mean()
df[
'rolling_average'
].plot(kind=
'line'
)
Out[51]: [line plot of the 5-period rolling average]
With this, you get a much better picture of the trends of the data!
The rolling() function takes a number of arguments, the first being window, which sets the number of observations used to calculate each rolling statistic.
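A short sketch of a couple of the other commonly used arguments (min_periods and center), assuming the same Sales column as above:

# min_periods lets the average be computed even before a full window
# of 5 values is available; center aligns each average with the middle
# of its window instead of the right edge.
df['rolling_average'] = df['Sales'].rolling(window=5, min_periods=1, center=True).mean()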
Tip #27: Convert String-Numbers to Numbers
In some datasets, numbers will import as strings.
This can be annoying, and trying to use the astype() function can fail when some values can't be parsed.
This is where the to_numeric() function comes in handy:
In [52]:
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
                   'Score': ['100', 120, '96', 75, 68, 123]})
df.info()
Out[52]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
Name     6 non-null object
Score    6 non-null object
dtypes: object(2)
memory usage: 224.0+ bytes
Let's convert the Score column to integers:
In [53]:
df['Score'] = df['Score'].apply(pd.to_numeric, errors='coerce')
df.info()
Out[53]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
Name     6 non-null object
Score    6 non-null int64
dtypes: int64(1), object(1)
memory usage: 224.0+ bytes
Let’s see what this dataframe looks like:
In [54]:
df
Out[54]:
Name
Score
0
Nik
100
1
Jim
120
2
Alice
96
3
Jane
75
4
Matt
68
5
Kate
123
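The errors='coerce' argument is what makes this robust: any value that can't be parsed as a number becomes NaN instead of raising an error. A small sketch with a hypothetical bad value to illustrate (the 'N/A' string is just an example, not from the original data):

# 'N/A' can't be parsed as a number, so errors='coerce' turns it into NaN.
scores = pd.Series(['100', 120, 'N/A'])
pd.to_numeric(scores, errors='coerce')
# 0    100.0
# 1    120.0
# 2      NaN
# dtype: float64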
Tip #28: Impute Missing Values with fillna()
Most data sets aren't perfect. You'll often encounter missing data. What to do
with it depends on the data itself. One option is to impute the value by using
its mean (average).
This can be done easily by using the fillna() function in Pandas:
In [55]:
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
                   'Sales': [100, 120, 90, 75, 68, 123]})
df.iloc[3, 1] = np.nan  # np is NumPy (import numpy as np)
df
Out[55]:
    Name  Sales
0    Nik  100.0
1    Jim  120.0
2  Alice   90.0
3   Jane    NaN
4   Matt   68.0
5   Kate  123.0
Imputing a number isn't always the best option. If you feel confident it won't
skew the data, try the code below:
In [56]:
df = df.fillna(df.mean())  # newer pandas versions may require df.mean(numeric_only=True)
df
Out[56]:
    Name  Sales
0    Nik  100.0
1    Jim  120.0
2  Alice   90.0
3   Jane  100.2
4   Matt   68.0
5   Kate  123.0
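If the column contains outliers, the mean can be pulled toward them, so a median-based imputation is sometimes a safer choice. A quick sketch, assuming the same df with the missing Sales value:

# The median is less sensitive to extreme values than the mean.
df['Sales'] = df['Sales'].fillna(df['Sales'].median())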
Tip #29: Get the Year from a Date
You've been given a lot of data and want to create a column that tells you the
year that row belongs to.
Pandas makes this easy:
In [57]:
raw_data = {'name': ['Nik', 'Jen', 'Alex', 'Mel'],
            'birth_date': ['01-02-1988', '08-05-1989', '04-28-1987', '12-16-1990']}
df = pd.DataFrame(raw_data)
df['birth_date'] = pd.to_datetime(df['birth_date'])
df
Out[57]:
   name birth_date
0   Nik 1988-01-02
1   Jen 1989-08-05
2  Alex 1987-04-28
3   Mel 1990-12-16
Let's calculate the birth year of everyone in the dataframe:
In [58]:
df['birthyear'] = df['birth_date'].dt.year
df
Out[58]:
   name birth_date  birthyear
0   Nik 1988-01-02       1988
1   Jen 1989-08-05       1989
2  Alex 1987-04-28       1987
3   Mel 1990-12-16       1990
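The same .dt accessor exposes the other date parts as well; a small sketch, assuming the df above:

# Month, day, and day of week (Monday=0) are pulled out the same way.
df['birthmonth'] = df['birth_date'].dt.month
df['birthday'] = df['birth_date'].dt.day
df['weekday'] = df['birth_date'].dt.dayofweek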
Tip #30: Create a Frequency Table (incl. with Percentages!)
Sometimes you just need a quick frequency table to see how often different
values show up.
Pandas makes this quite easy using the value_counts() function:
In [59]:
df =
pd.read_excel(
'https://github.com/datagy/pivot_table_pandas/raw/ma
ster/sample_pivot.xlsx'
)
df
Out[59]: