thr3ads.net - R help - [R] help with duplicates [Jun 2009]


Chris Anderson

2009-Jun-05 16:38 UTC

[R] help with duplicates

I have a large dataset that contains duplicate records. How do I identify and
remove the duplicate records?


Chris Anderson
707.315.8486
www.sassydeals4u.com

Henrique Dallazuanna

2009-Jun-05 17:33 UTC


Try this:

d <- data.frame(a = c(1, 1, 2, 3), b = c(10, 10, 9, 8))
unique(d)
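
A minimal sketch building on the same toy data frame, in case you also want to see which rows are the repeats before dropping them. It uses only base R (duplicated() and which()); the variable names dup, dup_rows and d_unique are just illustrative.

```r
# Toy data frame from the example above: rows 1 and 2 are identical.
d <- data.frame(a = c(1, 1, 2, 3), b = c(10, 10, 9, 8))

# duplicated() flags each row that repeats an earlier row;
# the first occurrence itself is not flagged.
dup <- duplicated(d)

# Identify: row numbers of the repeated records.
dup_rows <- which(dup)

# Remove: keep only the rows that were not flagged.
d_unique <- d[!dup, ]

print(dup_rows)
print(d_unique)
```

Indexing with !duplicated(d) keeps the first copy of each record, which is the same set of rows that unique(d) returns.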





-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O


Jim Porzak

2009-Jun-05 17:37 UTC


Chris,

How large is large? How many columns?

"Duplicate" across all columns or just some?

Henrique gave you the simple R answer. Perhaps doing it in SQL would be more
efficient? E.g.:

SELECT DISTINCT
             <stuff>
  FROM <somewhere>;


HTH,
Jim Porzak
TGN.com
San Francisco, CA
www.linkedin.com/in/jimporzak
use R! Group SF: www.meetup.com/R-Users/



Peter Dalgaard

2009-Jun-05 17:59 UTC


Chris Anderson wrote:
> I have a large dataset that contain duplicate records. How do I identify
> and remove duplicate records?
> 
Here's one way:

 > aq <- airquality[sample(NROW(airquality), replace=TRUE),]
 > any(duplicated(aq))
[1] TRUE
 > which(duplicated(aq))
 [1]   2  15  34  44  45  47  49  50  52  53  65  75  76  78  83  86  88  90  91
[20]  94  96  98  99 100 103 104 107 108 110 111 112 114 117 119 120 121 122 124
[39] 125 126 127 129 130 132 133 135 137 140 141 143 145 146 147 151 152
 > aqs <- subset(aq,!duplicated(aq))
 > any(duplicated(aqs))
[1] FALSE
 > dim(aqs)
[1] 98  6
 > dim(aq)
[1] 153   6

For data frames with many columns you might want to think more carefully
about how you recognize duplicates and maybe use a subset of columns.
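
As a rough sketch of the subset-of-columns idea, assuming two key columns are enough to define a record: the data frame below and the column names id, date and value are made up for illustration, and only base R's duplicated() is used.

```r
# Hypothetical data: 'id' and 'date' are assumed to define a record;
# 'value' may differ between otherwise repeated records.
d <- data.frame(
  id    = c(1, 1, 2, 2, 3),
  date  = c("2009-06-01", "2009-06-01", "2009-06-01", "2009-06-02", "2009-06-03"),
  value = c(5.1, 5.3, 7.0, 7.0, 2.4)
)

key_cols <- c("id", "date")

# Comparing all columns, count how many rows repeat an earlier row.
n_full <- sum(duplicated(d))

# Comparing only the key columns, flag rows whose key has been seen before.
dup_key <- duplicated(d[, key_cols])

# Keep the first record for each key ...
d_first <- d[!dup_key, ]

# ... or keep the last record for each key instead.
d_last <- d[!duplicated(d[, key_cols], fromLast = TRUE), ]

print(n_full)
print(which(dup_key))
print(d_first)
print(d_last)
```

Which copy to keep (first or last) is a judgment call that depends on the data; fromLast = TRUE simply reverses the direction in which repeats are flagged.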

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
