I have a large dataset that contains duplicate records. How do I identify and remove them?

Chris Anderson
Try this:

d <- data.frame(a = c(1, 1, 2, 3), b = c(10, 10, 9, 8))
unique(d)

On Fri, Jun 5, 2009 at 1:38 PM, Chris Anderson <chris6764@netzero.net> wrote:
> I have a large dataset that contains duplicate records. How do I identify
> and remove duplicate records?
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
Chris,

How large is large? How many columns? "Duplicate" across all columns or just some?

Henrique gave you a simple R answer. Perhaps doing it in SQL would be more efficient? E.g.,

SELECT DISTINCT <stuff> FROM <somewhere>;

HTH,
Jim Porzak
TGN.com
San Francisco, CA
www.linkedin.com/in/jimporzak
use R! Group SF: www.meetup.com/R-Users/
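The difference between "all columns" and "just some" can be sketched in base R without a database. This is only an illustration: the data frame `d` and its column names `id`, `grp` and `val` are made up, and the SQL in the comments is a rough analogy, not something the code executes.

```r
# Hypothetical data: rows 1 and 2 agree on 'id' and 'grp' but differ in 'val'.
d <- data.frame(id  = c(1, 1, 2, 3),
                grp = c("x", "x", "y", "z"),
                val = c(10, 11, 9, 8))

# Roughly SELECT DISTINCT * FROM d: every row is distinct, so nothing is dropped.
all_cols <- unique(d)

# Roughly SELECT DISTINCT id, grp FROM d: duplicates are judged on two columns only.
some_cols <- unique(d[, c("id", "grp")])

nrow(all_cols)   # 4
nrow(some_cols)  # 3
```

Note that the second form returns only the selected columns, just as the SQL projection would.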
Here's one way:

> aq <- airquality[sample(NROW(airquality), replace=TRUE),]
> any(duplicated(aq))
[1] TRUE
> which(duplicated(aq))
 [1]   2  15  34  44  45  47  49  50  52  53  65  75  76  78  83  86  88  90  91
[20]  94  96  98  99 100 103 104 107 108 110 111 112 114 117 119 120 121 122 124
[39] 125 126 127 129 130 132 133 135 137 140 141 143 145 146 147 151 152
> aqs <- subset(aq,!duplicated(aq))
> any(duplicated(aqs))
[1] FALSE
> dim(aqs)
[1] 98  6
> dim(aq)
[1] 153   6

For data frames with many columns you might want to think more carefully about how you recognize duplicates, and maybe use a subset of columns.

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
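A minimal sketch of the subset-of-columns idea, assuming a record counts as a duplicate when it repeats an earlier row on a chosen set of key columns. The data frame `d`, its columns, and the choice of `key_cols` are hypothetical and only there for illustration.

```r
# Hypothetical data: rows 1 and 2 share 'id' and 'date' but differ in 'value'.
d <- data.frame(
  id    = c(1, 1, 2, 3, 3),
  date  = c("2009-06-01", "2009-06-01", "2009-06-02", "2009-06-03", "2009-06-04"),
  value = c(10, 12, 9, 8, 8)
)

key_cols <- c("id", "date")        # columns that define a duplicate (an assumption)

dup <- duplicated(d[, key_cols])   # TRUE for the second and later rows with the same key
d[dup, ]                           # identify: the rows that would be dropped
d_unique <- d[!dup, ]              # remove: keep the first row for each key, all columns
```

Unlike calling unique() on the key columns alone, this keeps every column of the surviving rows. If the last occurrence of each key should be kept instead of the first, duplicated() accepts fromLast = TRUE.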