Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Removing Records by Duplicate Values

Removing records from a data table based on duplicate values in one or more columns is a commonly used but important data cleaning technique. Below shows an example about how to accomplish this task by SAS, R, and Python respectively.

SAS Example

data _data_;
  input label $ value;
datalines;
A     4
B     3
C     6
B     3
B     1
A     2
A     4
A     4
;
run;

proc sort data = _last_;
  by label value;
run;

data _data_;
  set _last_;
  by label;
  if first.label then output;
run;

proc print data = _last_ noobs;
run;

/* OUTPUT:
label    value
  A        2  
  B        1  
  C        6 
*/

R Example

> # INPUT DATA INT THE CONSOLE
> df <- read.table(header = T, text = '
+  label value
+      A     4
+      B     3
+      C     6
+      B     3
+      B     1
+      A     2
+      A     4
+      A     4
+ ')
> # SORT DATA FRAME BY COLUMNS
> df2 <- df[order(df$label, df$value), ]
> print(df2)
  label value
6     A     2
1     A     4
7     A     4
8     A     4
5     B     1
2     B     3
4     B     3
3     C     6
> # DEDUP RECORDS
> df3 <- df2[!duplicated(df2$label), ]
> print(df3)
  label value
6     A     2
5     B     1
3     C     6

Python Example

In [1]: import pandas as pd

In [2]: # INPUT DATA INTO DATAFRAME

In [3]: df = pd.DataFrame({'label': ['A', 'B', 'C'] + ['B'] * 2 + ['A'] * 3, 'value': [4, 3, 6, 3, 1, 2, 4, 4]})

In [4]: # SORT DATA BY COLUMNS

In [5]: df2 = df.sort(['label', 'value'])

In [6]: print(df2)
  label  value
5     A      2
0     A      4
6     A      4
7     A      4
4     B      1
1     B      3
3     B      3
2     C      6

In [7]: # DEDUP RECORDS

In [8]: df3 = df2.drop_duplicates(['label'])

In [9]: print(df3)
  label  value
5     A      2
4     B      1
2     C      6
Advertisements

Written by statcompute

December 20, 2012 at 1:43 am

Posted in PYTHON, S+/R, SAS

Tagged with , ,

%d bloggers like this: