  Corporate Sales Report Unnamed: 1           Unnamed: 2 Unnamed: 3 Unnamed: 4
0       Quarterly Report        NaN                  NaN        NaN        NaN
1                    NaN        NaN                  NaN        NaN        NaN
2                   Date     Region                 Type      Units      Sales
3    2020-07-11 00:00:00       East  Children's Clothing         18        306
4    2020-09-23 00:00:00      North  Children's Clothing         14        448
5    2020-04-02 00:00:00      South     Women's Clothing         17        425
6    2020-02-28 00:00:00       East  Children's Clothing         26        832
In order to read this into a proper dataframe, you can use the read_excel()
function with the skiprows argument.
In [10]:
df = pd.read_excel('https://github.com/datagy/pivot_table_pandas/raw/master/sampleweird.xlsx',
                   skiprows=3)
df.head(2)
Out[10]:
        Date Region                 Type  Units  Sales
0 2020-07-11   East  Children's Clothing     18    306
1 2020-09-23  North  Children's Clothing     14    448
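As an aside, skiprows also accepts a list of specific row indices to drop. A minimal sketch using a made-up inline CSV in place of the Excel file:

```python
import pandas as pd
from io import StringIO

# A toy stand-in for a file with two junk rows above the real header
raw = "junk line\nmore junk\nDate,Region\n2020-07-11,East\n"

# Passing a list skips exactly those row indices before parsing
df = pd.read_csv(StringIO(raw), skiprows=[0, 1])
print(list(df.columns))  # ['Date', 'Region']
```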
Nik Piepenbreier - datagy.io
Tip #8: Select Data Types
When you're working with a dataframe, you might be interested in only
selecting a certain data type.
For example, you might have a dataframe where you only want to select
integers and omit all other data types.
In [11]:
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [True, False, False, False, True, True],
    'c': [1.0, 2.0, 3.1, 2.4, 1.2, 4.7]})
integers = df.select_dtypes(int)
integers
Out[12]:
   a
0  1
1  2
2  3
3  4
4  5
5  6
Similarly, we can exclude any data type by using the exclude argument:
In [13]:
not_integers = df.select_dtypes(exclude='int')
not_integers
Out[13]:
       b    c
0   True  1.0
1  False  2.0
2  False  3.1
3  False  2.4
4   True  1.2
5   True  4.7
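As another aside, include can also take a list of dtypes, or the 'number' alias that covers both integers and floats. A minimal sketch with the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [True, False, False, False, True, True],
    'c': [1.0, 2.0, 3.1, 2.4, 1.2, 4.7]})

# 'number' keeps every numeric column, whatever its precise dtype
numeric = df.select_dtypes(include='number')
print(list(numeric.columns))  # ['a', 'c']
```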
Tip #9: Use Categories to Reduce
Dataframe Size
If some of your data is categorical, you can dramatically reduce the memory
usage of your dataframe by assigning the datatype of category to a column.
Let's try this in action:
In [14]:
df =
pd.read_excel(
'https://github.com/datagy/pivot_table_pandas/raw/ma
ster/sample_pivot.xlsx'
)
df.info()
Out[14]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
Date      1000 non-null datetime64[ns]
Region    1000 non-null object
Type      1000 non-null object
Units     911 non-null float64
Sales     1000 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 39.2+ KB
We can see that the dataframe is currently using 39.2 KB of memory.
Since our Region and Type columns are categorical, we can specify that in
the import.
In [15]:
df2 =
pd.read_excel(
'https://github.com/datagy/pivot_table_pandas/raw/ma
ster/sample_pivot.xlsx'
, dtype = {
'Region'
:
'category'
,
'Type'
:
'category'
})
df2.info()
Out[15]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
Date      1000 non-null datetime64[ns]
Region    1000 non-null category
Type      1000 non-null category
Units     911 non-null float64
Sales     1000 non-null int64
dtypes: category(2), datetime64[ns](1), float64(1), int64(1)
memory usage: 25.6 KB
We were able to reduce the dataframe's memory usage from 39.2 KB to
25.6 KB - a reduction of roughly a third!
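If the dataframe is already in memory, you don't need to re-read the file: columns can be converted after the fact with astype('category'). A sketch with made-up data:

```python
import pandas as pd

# Made-up data: a long column with only a handful of distinct labels
df = pd.DataFrame({'Region': ['East', 'West', 'East', 'North'] * 250})

before = df.memory_usage(deep=True)['Region']
df['Region'] = df['Region'].astype('category')
after = df.memory_usage(deep=True)['Region']

# Each label is stored once, with rows holding small integer codes
print(after < before)  # True
```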
Tip #10: Filter Based on Multiple Criteria
using isin()
If you want to filter a dataframe on multiple conditions in a column, you could
chain together multiple or statements.
Or... you could save yourself a lot of typing by using the isin() function.
In [16]:
df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123]})
df
Out[16]:
    Name  Score
0    Nik    100
1    Jim    120
2  Alice     96
3   Jane     75
4   Matt     68
5   Kate    123
Let's use the traditional filter method to select the rows with names Nik,
Alice, and Kate.
In [17]:
filtered = df[(df['Name'] == 'Nik') | (df['Name'] == 'Alice') | (df['Name'] == 'Kate')]
filtered
Out[17]:
    Name  Score
0    Nik    100
2  Alice     96
5   Kate    123
Now, let's make it a lot less typing by using the isin() function:
In [18]:
filtered_faster = df[df['Name'].isin(['Nik', 'Alice', 'Kate'])]
filtered_faster
Out[18]:
    Name  Score
0    Nik    100
2  Alice     96
5   Kate    123
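Because isin() just returns a boolean mask, it combines cleanly with other conditions via & and |. A small illustration on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123]})

# Names from the shortlist AND a score above 100
filtered = df[df['Name'].isin(['Nik', 'Kate']) & (df['Score'] > 100)]
print(filtered['Name'].tolist())  # ['Kate']
```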
Tip #11: Filter Using the ~ (not in)
Operator
Sometimes you need to exclude rows that meet multiple conditions.
This can be accomplished using the ~ (tilde) operator:
In [19]:
df = pd.read_csv('https://raw.githubusercontent.com/datagy/pivot_table_pandas/master/select_columns.csv',
                 usecols=['Name', 'Age', 'Height'],
                 dtype={'Name': str, 'Age': int, 'Height': str})
df.at[0, 'Name'] = 'Nik'
df
Out[19]:
      Name  Age Height
0      Nik   28    5'9
1  Melissa   26    5'5
2      Nik   31   5'11
3   Andrea   33    5'6
4     Jane   32    5'8
Now let's exclude the rows where Name equals Nik and Age equals 28.
In [20]:
tilde = df[~((df['Name'] == 'Nik') & (df['Age'] == 28))]
tilde
Out[20]:
      Name  Age Height
1  Melissa   26    5'5
2      Nik   31   5'11
3   Andrea   33    5'6
4     Jane   32    5'8
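The tilde also pairs naturally with isin() to express "not in". A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123]})

# Keep every row whose Name is NOT in the list
not_these = df[~df['Name'].isin(['Nik', 'Alice', 'Kate'])]
print(not_these['Name'].tolist())  # ['Jim', 'Jane', 'Matt']
```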
Tip #12: Using nlargest To Filter Only
Top Values
If you need to return the top rows of a dataframe ranked by a column, you
can use the nlargest() function. This can be helpful if you want to get a
sense of outliers, or just of the records at the top.
It returns full rows of the dataframe, sorted by the single column you pass in.
In [21]:
df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123]})
df
Out[21]:
    Name  Score
0    Nik    100
1    Jim    120
2  Alice     96
3   Jane     75
4   Matt     68
5   Kate    123
Let's select only the rows with the top 3 scores.
We can use the keep argument to identify what to do with duplicate entries:
● first keeps the first instance,
● last keeps the last instance,
● all doesn't drop any records.
In [22]:
top3 = df.nlargest(3, 'Score', keep='first')
top3
Out[22]:
   Name  Score
5  Kate    123
1   Jim    120
0   Nik    100
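For the mirror image, pandas provides nsmallest(), which works the same way but returns the rows with the lowest values:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123]})

# The three lowest scores, lowest first
bottom3 = df.nsmallest(3, 'Score')
print(bottom3['Name'].tolist())  # ['Matt', 'Jane', 'Alice']
```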
Tip #13: Reshape Data Using Melt
Sometimes you're given a dataset in "wide" format - such as a preformatted
report prepared by a colleague.
You can use the melt() function to reshape this into a format that you can
build pivot tables and other analyses from.
In [23]:
df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123],
    'Height': [178, 180, 160, 165, 185, 187],
    'Weight': [180, 175, 143, 155, 167, 189]})
df
Out[23]:
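A typical melt call on a frame like this pins Name as the identifier column via id_vars, so every other column unpivots into variable/value pairs - a sketch on a smaller, made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Jim'],
    'Score': [100, 120],
    'Height': [178, 180],
    'Weight': [180, 175]})

# Every column other than Name becomes a (variable, value) pair
long_df = df.melt(id_vars='Name')
print(long_df.shape)  # (6, 3): 2 rows x 3 measure columns
```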