Nik Piepenbreier - datagy.io: 31 Tips and Tricks for Super Stars



  Corporate Sales Report Unnamed: 1           Unnamed: 2 Unnamed: 3 Unnamed: 4
0       Quarterly Report        NaN                  NaN        NaN        NaN
1                    NaN        NaN                  NaN        NaN        NaN
2                   Date     Region                 Type      Units      Sales
3    2020-07-11 00:00:00       East  Children's Clothing         18        306
4    2020-09-23 00:00:00      North  Children's Clothing         14        448
5    2020-04-02 00:00:00      South     Women's Clothing         17        425
6    2020-02-28 00:00:00       East  Children's Clothing         26        832
In order to read this into a proper dataframe, you can use the read_excel() function with the skiprows argument.
In [10]: 
df = pd.read_excel('https://github.com/datagy/pivot_table_pandas/raw/master/sampleweird.xlsx', skiprows=3)
df.head(2)
Out[10]:
        Date Region                 Type  Units  Sales
0 2020-07-11   East  Children's Clothing     18    306
1 2020-09-23  North  Children's Clothing     14    448
Tip #8: Select Data Types 
When you're working with a dataframe, you might be interested in only 
selecting a certain data type. 
For example, you might have a dataframe where you only want to select 
integers and omit all other data types. 
In [11]: 
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 
                   'b': [True, False, False, False, True, True], 
                   'c': [1.0, 2.0, 3.1, 2.4, 1.2, 4.7]}) 

In [12]: 
integers = df.select_dtypes(int) 
integers
Out[12]:
   a
0  1
1  2
2  3
3  4
4  5
5  6
Similarly, we can exclude any data type by using the exclude argument: 
In [13]: 
not_integers = df.select_dtypes(exclude='int') 
not_integers
Out[13]:
       b    c
0   True  1.0
1  False  2.0
2  False  3.1
3  False  2.4
4   True  1.2
5   True  4.7
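As a small extension (not from the original tip), select_dtypes also accepts lists and generic selectors such as 'number', which matches every numeric dtype. A quick sketch using the same df:

# 'number' matches ints and floats in one go
numeric = df.select_dtypes(include='number')

# include/exclude also take lists of dtypes
numbers_and_bools = df.select_dtypes(include=['number', 'bool'])
no_floats = df.select_dtypes(exclude=['float'])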
Tip #9: Use Categories to Reduce Dataframe Size 
If some of your data is categorical, you can dramatically reduce the memory 
usage of your dataframe by assigning the datatype of category to a column. 
Let's try this in action: 
In [14]: 
df = pd.read_excel('https://github.com/datagy/pivot_table_pandas/raw/master/sample_pivot.xlsx')
df.info()
Out[14]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
Date      1000 non-null datetime64[ns]
Region    1000 non-null object
Type      1000 non-null object
Units     911 non-null float64
Sales     1000 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 39.2+ KB
We can see that the dataframe is currently using 39.2+ KB of memory. 
Since our Region and Type columns are categorical, we can specify that in the import. 
In [15]: 
df2 = pd.read_excel('https://github.com/datagy/pivot_table_pandas/raw/master/sample_pivot.xlsx', 
                    dtype={'Region': 'category', 'Type': 'category'}) 
df2.info()
Out[15]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
Date      1000 non-null datetime64[ns]
Region    1000 non-null category
Type      1000 non-null category
Units     911 non-null float64
Sales     1000 non-null int64
dtypes: category(2), datetime64[ns](1), float64(1), int64(1)
memory usage: 25.6 KB
We were able to reduce the dataframe's reported memory usage from 39.2+ KB to 25.6 KB, and since the "+" means the string storage of the object columns wasn't even counted, the real savings are even larger!
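If your data is already loaded, you can get the same effect after the fact with astype('category'). A minimal sketch using the df from In [14] (the df3 name is just for illustration; memory_usage(deep=True) is used because it also counts the strings stored inside object columns):

# Convert the repetitive text columns of the already-loaded dataframe
df3 = df.copy()
df3['Region'] = df3['Region'].astype('category')
df3['Type'] = df3['Type'].astype('category')

# deep=True counts the actual string storage, so the comparison is honest
print(df.memory_usage(deep=True).sum())   # original object columns
print(df3.memory_usage(deep=True).sum())  # category columns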
Tip #10: Filter Based on Multiple Criteria using isin() 
If you want to filter a dataframe on multiple conditions in a column, you could 
chain together multiple or statements. 
Or... you could save yourself a lot of typing by using the isin() function. 
In [16]: 
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'], 
                   'Score': [100, 120, 96, 75, 68, 123]}) 
df
Out[16]: 
    Name  Score
0    Nik    100
1    Jim    120
2  Alice     96
3   Jane     75
4   Matt     68
5   Kate    123
Let's use the traditional filter method to select the rows with names Nik, 
Alice, and Kate. 
In [17]: 
filtered = df[(df['Name'] == 'Nik') | (df['Name'] == 'Alice') | (df['Name'] == 'Kate')] 
filtered
Out[17]: 
    Name  Score
0    Nik    100
2  Alice     96
5   Kate    123
Now, let's do the same thing with a lot less typing by using the isin() function: 
In [18]: 
filtered_faster = df[df['Name'].isin(['Nik', 'Alice', 'Kate'])] 
filtered_faster
Out[18]: 
    Name  Score
0    Nik    100
2  Alice     96
5   Kate    123
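One nice property of isin() (not covered in the original example) is that it accepts any iterable, so the values to match can come from a set, another Series, or a column of a second dataframe. A quick sketch with made-up names:

# The values can live in a set...
allowed = {'Nik', 'Alice', 'Kate'}
same_result = df[df['Name'].isin(allowed)]

# ...or in a column of another dataframe (vip is a made-up example)
vip = pd.DataFrame({'Name': ['Kate', 'Nik']})
vip_rows = df[df['Name'].isin(vip['Name'])]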
Tip #11: Filter Using the ~ (not in) Operator 
Sometimes you need to exclude rows that meet multiple conditions. 
This can be accomplished using the ~ (tilde) operator, which inverts a boolean filter: 
In [19]: 
df = pd.read_csv('https://raw.githubusercontent.com/datagy/pivot_table_pandas/master/select_columns.csv', 
                 usecols=['Name', 'Age', 'Height'], 
                 dtype={'Name': str, 'Age': int, 'Height': str}) 
df.at[0, 'Name'] = 'Nik' 
df
Out[19]: 
      Name  Age Height
0      Nik   28    5'9
1  Melissa   26    5'5
2      Nik   31   5'11
3   Andrea   33    5'6
4     Jane   32    5'8
Now let's keep every row except those where Name equals 'Nik' and Age equals 28. 
In [20]: 
tilde = df[~((df['Name'] == 'Nik') & (df['Age'] == 28))] 
tilde
Out[20]: 
      Name  Age Height
1  Melissa   26    5'5
2      Nik   31   5'11
3   Andrea   33    5'6
4     Jane   32    5'8
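For what it's worth, the tilde filter above is logically equivalent to negating each condition and combining them with | (De Morgan's law), which can be a handy way to sanity-check a filter. A minimal sketch on the same df:

# ~(A & B) is the same as (~A | ~B): keep rows where the name isn't 'Nik'
# OR the age isn't 28; only rows matching both conditions get dropped
equivalent = df[(df['Name'] != 'Nik') | (df['Age'] != 28)]

print(equivalent.equals(tilde))  # True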
Tip #12: Using nlargest To Filter Only Top Values 
If you need to return the rows with the largest values in a column, you can use the nlargest() function. This can be helpful if you want to get a sense of outliers, or just the records that are at the top. 
It returns complete rows from the dataframe (all columns), sorted in descending order by the column you choose for ranking. 
In [21]: 
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'], 
                   'Score': [100, 120, 96, 75, 68, 123]}) 
df
Out[21]:
    Name  Score
0    Nik    100
1    Jim    120
2  Alice     96
3   Jane     75
4   Matt     68
5   Kate    123
Let's select only the rows with the top 3 scores. 
We can use the keep argument to identify what to do with duplicate entries: 
- first keeps the first instance, 
- last keeps the last instance, 
- all keeps every tied record, even if that means returning more than n rows. 
In [22]: 
top3 = df.nlargest(3, 'Score', keep='first') 
top3
Out[22]:
    Name  Score
5   Kate    123
1    Jim    120
0    Nik    100
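The mirror image, nsmallest(), works the same way (including the keep argument) but returns the rows with the lowest values. A quick sketch with the same df:

# Rows with the three lowest scores, smallest values first
bottom3 = df.nsmallest(3, 'Score')
bottom3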
Tip #13: Reshape Data Using Melt 
Sometimes you're given a dataset in "wide" format - such as a preformatted 
report you're given by a colleague. 
You can use the melt() function to reshape it into a long format that you can build pivot tables and other analyses on.
In [23]: 
df = pd.DataFrame({'Name': ['Nik', 'Jim', 'Alice', 'Jane', 'Matt', 'Kate'], 
                   'Score': [100, 120, 96, 75, 68, 123], 
                   'Height': [178, 180, 160, 165, 185, 187], 
                   'Weight': [180, 175, 143, 155, 167, 189]}) 
df
Out[23]:
    Name  Score  Height  Weight
0    Nik    100     178     180
1    Jim    120     180     175
2  Alice     96     160     143
3   Jane     75     165     155
4   Matt     68     185     167
5   Kate    123     187     189
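The melt() call itself continues past this excerpt, but as a rough sketch of the reshape described above (the id_vars/value_vars choices and the column names below are assumptions, not necessarily the author's exact call):

# Keep Name as an identifier and stack the measurement columns into
# long-format variable/value pairs
melted = df.melt(id_vars='Name',
                 value_vars=['Score', 'Height', 'Weight'],
                 var_name='Measure', value_name='Value')
melted.head()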
