  Corporate Sales Report Unnamed: 1           Unnamed: 2 Unnamed: 3 Unnamed: 4
0       Quarterly Report        NaN                  NaN        NaN        NaN
1                    NaN        NaN                  NaN        NaN        NaN
2                   Date     Region                 Type      Units      Sales
3    2020-07-11 00:00:00       East  Children's Clothing         18        306
4    2020-09-23 00:00:00      North  Children's Clothing         14        448
5    2020-04-02 00:00:00      South     Women's Clothing         17        425
6    2020-02-28 00:00:00       East  Children's Clothing         26        832
In order to read this into a proper dataframe, you can use the read_excel()
function with the skiprows argument.
In [10]:
df = pd.read_excel('https://github.com/datagy/pivot_table_pandas/raw/master/sampleweird.xlsx',
                   skiprows=3)
df.head(2)
Out[10]:
        Date Region                 Type  Units  Sales
0 2020-07-11   East  Children's Clothing     18    306
1 2020-09-23  North  Children's Clothing     14    448
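As an aside, skiprows also accepts a list of specific row indices to drop. A minimal sketch using a made-up inline CSV in place of the Excel file:

```python
import pandas as pd
from io import StringIO

# A toy stand-in for a file with two junk rows above the real header
raw = "junk line\nmore junk\nDate,Region\n2020-07-11,East\n"

# Passing a list skips exactly those row indices before parsing
df = pd.read_csv(StringIO(raw), skiprows=[0, 1])
print(list(df.columns))  # ['Date', 'Region']
```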
Nik Piepenbreier - datagy.io
Tip #8: Select Data Types
When you're working with a dataframe, you might be interested in only
selecting a certain data type.
For example, you might have a dataframe where you only want to select
integers and omit all other data types.
In [11]:
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [True, False, False, False, True, True],
    'c': [1.0, 2.0, 3.1, 2.4, 1.2, 4.7]})
integers = df.select_dtypes(int)
integers
Out[12]:
   a
0  1
1  2
2  3
3  4
4  5
5  6
Similarly, we can exclude any data type by using the exclude argument:
In [13]:
not_integers = df.select_dtypes(exclude='int')
not_integers
Out[13]:
       b    c
0   True  1.0
1  False  2.0
2  False  3.1
3  False  2.4
4   True  1.2
5   True  4.7
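As another aside, include can also take a list of dtypes, or the 'number' alias that covers both integers and floats. A minimal sketch with the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [True, False, False, False, True, True],
    'c': [1.0, 2.0, 3.1, 2.4, 1.2, 4.7]})

# 'number' keeps every numeric column, whatever its precise dtype
numeric = df.select_dtypes(include='number')
print(list(numeric.columns))  # ['a', 'c']
```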
Tip #9: Use Categories to Reduce
Dataframe Size
If some of your data is categorical, you can dramatically reduce the memory
usage of your dataframe by assigning the datatype of category to a column.
Let's try this in action:
In [14]:
df =
pd.read_excel(
'https://github.com/datagy/pivot_table_pandas/raw/ma
ster/sample_pivot.xlsx'
)
df.info()
Out[14]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
Date      1000 non-null datetime64[ns]
Region    1000 non-null object
Type      1000 non-null object
Units     911 non-null float64
Sales     1000 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 39.2+ KB
We can see that the dataframe is currently using 39.2 KB of memory.
Since our Region and Type columns are categorical, we can specify that in
the import.
In [15]:
df2 =
pd.read_excel(
'https://github.com/datagy/pivot_table_pandas/raw/ma
ster/sample_pivot.xlsx'
, dtype = {
'Region'
:
'category'
,
'Type'
:
'category'
})
df2.info()
Out[15]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
Date      1000 non-null datetime64[ns]
Region    1000 non-null category
Type      1000 non-null category
Units     911 non-null float64
Sales     1000 non-null int64
dtypes: category(2), datetime64[ns](1), float64(1), int64(1)
memory usage: 25.6 KB
We were able to reduce the dataframe's memory usage from 39.2 KB to
25.6 KB - a reduction of roughly a third!
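If the dataframe is already in memory, you don't need to re-read the file: columns can be converted after the fact with astype('category'). A sketch with made-up data:

```python
import pandas as pd

# Made-up data: a long column with only a handful of distinct labels
df = pd.DataFrame({'Region': ['East', 'West', 'East', 'North'] * 250})

before = df.memory_usage(deep=True)['Region']
df['Region'] = df['Region'].astype('category')
after = df.memory_usage(deep=True)['Region']

# Each label is stored once, with rows holding small integer codes
print(after < before)  # True
```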
Tip #10: Filter Based on Multiple Criteria
using isin()
If you want to filter a dataframe on multiple conditions in a column, you could
chain together multiple or statements.
Or... you could save yourself a lot of typing by using the isin() function.
In [16]:
df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123]})
df
Out[16]:
    Name  Score
0    Nik    100
1    Jim    120
2  Alice     96
3   Jane     75
4   Matt     68
5   Kate    123
Let's use the traditional filter method to select the rows with names Nik,
Alice, and Kate.
In [17]:
filtered = df[(df['Name'] == 'Nik') | (df['Name'] == 'Alice') | (df['Name'] == 'Kate')]
filtered
Out[17]:
    Name  Score
0    Nik    100
2  Alice     96
5   Kate    123
Now, let's make it a lot less typing by using the isin() function:
In [18]:
filtered_faster = df[df['Name'].isin(['Nik', 'Alice', 'Kate'])]
filtered_faster
Out[18]:
    Name  Score
0    Nik    100
2  Alice     96
5   Kate    123
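Because isin() just returns a boolean mask, it combines cleanly with other conditions via & and |. A small illustration on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123]})

# Names from the shortlist AND a score above 100
filtered = df[df['Name'].isin(['Nik', 'Kate']) & (df['Score'] > 100)]
print(filtered['Name'].tolist())  # ['Kate']
```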
Tip #11: Filter Using the ~ (not in)
Operator
Sometimes you need to exclude rows that meet multiple conditions.
This can be accomplished using the ~ (tilde) operator:
In [19]:
df = pd.read_csv('https://raw.githubusercontent.com/datagy/pivot_table_pandas/master/select_columns.csv',
                 usecols=['Name', 'Age', 'Height'],
                 dtype={'Name': str, 'Age': int, 'Height': str})
df.at[0, 'Name'] = 'Nik'
df
Out[19]:
      Name  Age Height
0      Nik   28    5'9
1  Melissa   26    5'5
2      Nik   31   5'11
3   Andrea   33    5'6
4     Jane   32    5'8
Now let's exclude the rows where Name equals Nik and Age equals 28.
In [20]:
tilde = df[~((df['Name'] == 'Nik') & (df['Age'] == 28))]
tilde
Out[20]:
      Name  Age Height
1  Melissa   26    5'5
2      Nik   31   5'11
3   Andrea   33    5'6
4     Jane   32    5'8
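The tilde also pairs naturally with isin() to express "not in". A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123]})

# Keep every row whose Name is NOT in the list
not_these = df[~df['Name'].isin(['Nik', 'Alice', 'Kate'])]
print(not_these['Name'].tolist())  # ['Jim', 'Jane', 'Matt']
```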
Tip #12: Using nlargest To Filter Only
Top Values
If you need to return the top rows of a dataframe ranked by a column, you
can use the nlargest() function. This can be helpful if you want to get a
sense of outliers, or just of the records at the top.
It returns full rows of the dataframe, sorted by the single column you pass in.
In [21]:
df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123]})
df
Out[21]:
    Name  Score
0    Nik    100
1    Jim    120
2  Alice     96
3   Jane     75
4   Matt     68
5   Kate    123
Let's select only the rows with the top 3 scores.
We can use the keep argument to identify what to do with duplicate entries:
● first keeps the first instance,
● last keeps the last instance,
● all doesn't drop any records.
In [22]:
top3 = df.nlargest(3, 'Score', keep='first')
top3
Out[22]:
   Name  Score
5  Kate    123
1   Jim    120
0   Nik    100
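For the mirror image, pandas provides nsmallest(), which works the same way but returns the rows with the lowest values:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123]})

# The three lowest scores, lowest first
bottom3 = df.nsmallest(3, 'Score')
print(bottom3['Name'].tolist())  # ['Matt', 'Jane', 'Alice']
```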
Tip #13: Reshape Data Using Melt
Sometimes you're given a dataset in "wide" format - such as a preformatted
report prepared by a colleague.
You can use the melt() function to reshape this into a format that you can
build pivot tables and other analyses from.
In [23]:
df = pd.DataFrame({
    'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'],
    'Score': [100, 120, 96, 75, 68, 123],
    'Height': [178, 180, 160, 165, 185, 187],
    'Weight': [180, 175, 143, 155, 167, 189]})
df
Out[23]:
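A typical melt call on a frame like this pins Name as the identifier column via id_vars, so every other column unpivots into variable/value pairs - a sketch on a smaller, made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Jim'],
    'Score': [100, 120],
    'Height': [178, 180],
    'Weight': [180, 175]})

# Every column other than Name becomes a (variable, value) pair
long_df = df.melt(id_vars='Name')
print(long_df.shape)  # (6, 3): 2 rows x 3 measure columns
```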