    Name  Score  Height  Weight
0    Nik    100     178     180
1    Jim    120     180     175
2  Alice     96     160     143
3   Jane     75     165     155
4   Matt     68     185     167
5   Kate    123     187     189
Let's see how to reshape this using the melt() function:
In [24]:
melted = df.melt(id_vars='Name', value_vars=['Score', 'Height', 'Weight'])
melted
Out[24]:
    Name variable  value
0    Nik    Score    100
1    Jim    Score    120
2  Alice    Score     96
3   Jane    Score     75
4   Matt    Score     68
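The melted frame actually has one row for every Name and variable pair (18 rows for this data); the preview above shows only the first few Score rows. If you'd like more descriptive column names than the default variable and value, melt() also accepts var_name and value_name. A quick sketch (the Measurement and Amount names are just illustrative choices):

melted = df.melt(id_vars='Name',
                 value_vars=['Score', 'Height', 'Weight'],
                 var_name='Measurement',   # replaces the default 'variable'
                 value_name='Amount')      # replaces the default 'value'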
Tip #14: set_option to Format Values
You may run into data that has far too many decimal places.
This can be less than optimal when you're trying to display data.
Thankfully, Pandas has a set_option() function that lets you set the display
precision for decimal values.
In [25]:
import numpy as np
df = pd.DataFrame(np.random.randn(5, 5))
df
Out[25]:
          0         1         2         3         4
0  0.495358  0.256962 -0.356198  0.190293  0.182815
1 -0.208766 -0.454087 -1.231011 -0.284256 -0.080261
2  0.936962 -0.111395 -0.695576 -1.777353 -1.794223
3  1.089960  0.005278 -0.313701  1.622452  0.720692
4  0.818865 -0.795980 -0.000649  0.866517  0.628177
Let's say you wanted to show only two decimal places; you could write:
In [26]:
pd.set_option('display.precision', 2)
df
Out[26]:
      0     1     2     3     4
0  0.50  0.26 -0.36  0.19  0.18
1 -0.21 -0.45 -1.23 -0.28 -0.08
2  0.94 -0.11 -0.70 -1.78 -1.79
3  1.09  0.01 -0.31  1.62  0.72
4  0.82 -0.80 -0.00  0.87  0.63
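Keep in mind that set_option() changes how every dataframe is displayed from then on, not just this one. If you later want the default back, you can reset the option. A quick sketch:

pd.reset_option('display.precision')   # restore the default display precision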
Tip #15: Filter by Relative Dates
Sometimes you may want to filter a dataframe based on dates. If you have a
column that is formatted as dates, you can use those dates to filter your data
down effectively.
In [27]:
raw_data = {'name': ['Nik', 'Jen', 'Alex', 'Mel'],
            'birth_date': ['01-02-1988', '08-05-1989', '04-28-1987', '12-16-1990']}
df = pd.DataFrame(raw_data)
df['birth_date'] = pd.to_datetime(df['birth_date'])
df
Out[27]:
   name birth_date
0   Nik 1988-01-02
1   Jen 1989-08-05
2  Alex 1987-04-28
3   Mel 1990-12-16
In [28]:
filtered = df[df['birth_date'] < '1990-01-01']
filtered
Out[28]:
   name birth_date
0   Nik 1988-01-02
1   Jen 1989-08-05
2  Alex 1987-04-28
Similarly, you can filter for a single year by using the .dt.year attribute.
In [29]:
just_1988 = df[df['birth_date'].dt.year == 1988]
just_1988
Out[29]:
  name birth_date
0  Nik 1988-01-02
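Because the column holds real datetimes, you can also filter relative to today's date. A quick sketch (the 40-year cutoff is an arbitrary choice for illustration):

# keep only people born within the last 40 years, counted from today
cutoff = pd.Timestamp.now() - pd.DateOffset(years=40)
recent = df[df['birth_date'] > cutoff]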
Tip #16: Calculate Cumulative Sum
Need a running total of a particular column in a dataframe?
In [30]:
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
                   'Sales': [100, 120, 96, 75, 68, 123]})
df
Out[30]:
    Name  Sales
0    Nik    100
1    Jim    120
2  Alice     96
3   Jane     75
4   Matt     68
5   Kate    123
This is where the cumsum() function comes in handy.
Assign its result to a new column and it'll hold the cumulative sum
as a new series!
In [31]:
df['running_total'] = df['Sales'].cumsum()
df
Out[31]:
    Name  Sales  running_total
0    Nik    100            100
1    Jim    120            220
2  Alice     96            316
3   Jane     75            391
4   Matt     68            459
5   Kate    123            582
This also works with other running functions, such as:
● cummax() - for the running max
● cummin() - for the running min
● cumprod() - for the running product
Let's get the running max:
In [32]:
df['running_max'] = df['Sales'].cummax()
df
Out[32]:
    Name  Sales  running_total  running_max
0    Nik    100            100          100
1    Jim    120            220          120
2  Alice     96            316          120
3   Jane     75            391          120
4   Matt     68            459          120
5   Kate    123            582          123
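These running functions can also be computed within groups by chaining them onto groupby(). A quick sketch (the Region column is invented just for illustration):

# running total of Sales, restarted for each Region
grouped = pd.DataFrame({'Region': ['East', 'East', 'West', 'West'],
                        'Sales': [100, 120, 96, 75]})
grouped['region_running_total'] = grouped.groupby('Region')['Sales'].cumsum()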
Tip #17: Drop Multiple Columns in Pandas
When importing a data set, chances are you'll run into many columns you
don't need. It's often best not to import these at all.
However, say you've been manipulating a dataset and no longer need some
of its columns.
Use Pandas' drop() function:
In [33]:
df = pd.read_csv('https://raw.githubusercontent.com/datagy/pivot_table_pandas/master/select_columns.csv')
df
Out[33]:
      Name  Age Height  Score  Random_A  Random_B  Random_C  Random_D  Random_E
0      Joe   28    5'9     30        73        59         5         4        31
1  Melissa   26    5'5     32        30        85        38        32        80
2      Nik   31   5'11     34        80        71        59        71        53
3   Andrea   33    5'6     38        16        63        86        81        42
4     Jane   32    5'8     29        19        40        48         5        68
Let's drop the last five columns:
In [34]:
df.drop(columns=['Random_A', 'Random_B', 'Random_C', 'Random_D', 'Random_E'], inplace=True)
df
Out[34]:
      Name  Age Height  Score
0      Joe   28    5'9     30
1  Melissa   26    5'5     32
2      Nik   31   5'11     34
3   Andrea   33    5'6     38
4     Jane   32    5'8     29
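If the unwanted columns share a common prefix, you don't have to type out every name. A quick sketch of one alternative (run against the original dataframe, before the drop above):

# filter(like=...) picks out every column whose name contains 'Random_'
df.drop(columns=df.filter(like='Random_').columns, inplace=True)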
Tip #18: Change all your columns to Lowercase
Sometimes when you import data, your column names will be all over the
place.
Inconsistency in column names can make it tricky to select data.
With this trick, you'll learn how to turn all your column names to lowercase:
In [35]:
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
                   'score': [100, 120, 96, 75, 68, 123],
                   'Height in CM': [178, 180, 160, 165, 185, 187],
                   'Weight in Pounds': [180, 175, 143, 155, 167, 189]})
df
Out[35]:
    Name  score  Height in CM  Weight in Pounds
0    Nik    100           178               180
1    Jim    120           180               175
2  Alice     96           160               143
3   Jane     75           165               155
4   Matt     68           185               167
5   Kate    123           187               189
In [36]:
columns = df.columns
df.columns = [column.lower() for column in columns]
df
Out[36]:
    name  score  height in cm  weight in pounds
0    Nik    100           178               180
1    Jim    120           180               175
2  Alice     96           160               143
3   Jane     75           165               155
4   Matt     68           185               167
5   Kate    123           187               189
Similarly, spaces in column names cause issues when you want to use dot
notation to select columns.
You can use this same trick to swap spaces for underscores:
In [37]:
columns = df.columns
df.columns = [column.replace(" ", "_") for column in columns]
df
Out[37]:
    name  score  height_in_cm  weight_in_pounds
0    Nik    100           178               180
1    Jim    120           180               175
2  Alice     96           160               143
3   Jane     75           165               155
4   Matt     68           185               167
5   Kate    123           187               189
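If you'd rather skip the list comprehensions, the columns index has its own string methods, so both clean-ups can be chained in a single line. A quick sketch of that alternative:

# lowercase every column name, then swap spaces for underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')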
Tip #19: Create a List of Unique Values in a Column
Sometimes you may want to pull all the unique values in a column into a list.
This can be helpful for plotting, or for using them as variables in a machine
learning model.
Let's start by selecting the unique values and assigning them to a variable:
In [38]:
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
                   'Score': [100, 120, 96, 75, 68, 123]})
unique = df['Name'].unique()
print('unique is type:', type(unique))
Out[38]:
unique is type: <class 'numpy.ndarray'>
You can see that unique is a numpy.ndarray, not a Python list, so it doesn't
behave like one (for example, it has no list methods such as append()).
Let's turn it into a list:
In [39]:
unique_list = df['Name'].unique().tolist()
print('unique_list is type: ', type(unique_list))
print(unique_list)
Out[39]:
unique_list is type:  <class 'list'>
['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate']
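Once you have a plain list, anything that works on Python lists works here. A quick sketch of one way you might use it, looping over the names to work with each person's rows separately:

for name in unique_list:
    person_rows = df[df['Name'] == name]   # just this person's rows
    print(name, person_rows['Score'].tolist())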
Tip #20: Export to Clipboard for Easy Pasting
You may be working with a dataset you don't really want to save, but you still
want to bring it into Excel.
Pandas gives you that option with the to_clipboard() function.
Let's give it a try:
In [40]:
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
                   'Weight in Pounds': [180, 175, 143, 155, 167, 189]})
df.to_clipboard()
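After running this, switch to Excel and paste: the whole dataframe, index included, lands one value per cell. If you don't want the row index pasted along with the data, to_clipboard() accepts an index argument. A quick sketch:

df.to_clipboard(index=False)   # copy without the row index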
Tip #21: Create New Columns with Math
Do you have columns that you need to apply formulas to? For example:
● Calculating sales tax?
● Changing pounds to kilograms?
This can be done easily in Pandas:
In [41]:
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
                   'Weight in Pounds': [180, 175, 143, 155, 167, 189]})
df
Out[41]:
    Name  Weight in Pounds
0    Nik               180
1    Jim               175
2  Alice               143
3   Jane               155
4   Matt               167
5   Kate               189
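As a quick sketch of the kind of column math this tip describes (the 'Weight in KG' column name and the 0.453592 pounds-to-kilograms factor are illustrative choices, not from the original example):

df['Weight in KG'] = df['Weight in Pounds'] * 0.453592   # convert pounds to kilograms
df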