Dealing with data duplication
Duplicated data occurs for a number of reasons, some of them obvious. A user
could enter the same data more than once, perhaps because a distraction caused
the user to lose track of a place in a list, or two users could enter the same
record. Other sources are less obvious. Combining two or more datasets could
create multiple records when the same data appears in more than one of them.
You could also create data duplications when using various data-shaping
techniques to create new data from existing data sources. Fortunately, packages
such as Pandas let you remove duplicate data, as shown in the following
example. (You can find this code in the A4D; 06; Remediation.ipynb file on the
Dummies site as part of the downloadable code; see the Introduction for
details.)
import pandas as pd

# Rows 0, 4, and 6 hold identical values (all zeros), so two of them are duplicates.
df = pd.DataFrame({'A': [0,0,0,0,0,1,0],
                   'B': [0,2,3,5,0,2,0],
                   'C': [0,3,4,1,0,2,0]})
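A minimal sketch of how the duplicates in a DataFrame like this one could be flagged and removed, assuming the pandas methods duplicated() and drop_duplicates(). The variable names flags and deduped are illustrative and do not come from the original listing.

```python
import pandas as pd

# Same sample data as above: rows 0, 4, and 6 are identical.
df = pd.DataFrame({'A': [0,0,0,0,0,1,0],
                   'B': [0,2,3,5,0,2,0],
                   'C': [0,3,4,1,0,2,0]})

# Flag each row that repeats an earlier row; the first occurrence is not flagged.
flags = df.duplicated()
print(flags)

# Keep only the first occurrence of each distinct row (returns a new DataFrame).
deduped = df.drop_duplicates()
print(deduped)
```

With this data, rows 4 and 6 repeat row 0, so they are the rows flagged and dropped. The surviving rows keep their original index labels unless you renumber them, for example with reset_index(drop=True).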