Hello everyone,

I have a large dataset (x) with some rows that have duplicate variables that I would like to remove. I find which rows are the duplicates with

    X1 <- which(duplicated(x))

That gives me the row numbers of the duplicates. Now, how can I remove just those rows from the original data frame? I think I can create a new data frame without the duplicates using subset(). I have tried:

    subset(x, !X1)
    subset(x, !x[X1, ])

but I can't seem to find the correct syntax. Any advice?

Thanks in advance,

Cameron Guenther, Ph.D.
Associate Research Scientist
FWC/FWRI, Marine Fisheries Research
100 8th Avenue S.E.
St. Petersburg, FL 33701
(727)896-8626 Ext. 4305
cameron.guenther at myfwc.com
On Tue, 2006-05-16 at 14:37 -0400, Guenther, Cameron wrote:
> Hello everyone,
>
> I have a large dataset (x) with some rows that have duplicate variables
> that I would like to remove. I find which rows are the duplicates with
> X1 <- which(duplicated(x)). That gives me the rows with duplicated
> variables. Now, how can I remove just those rows from the original data
> frame? I think I can create a new data frame without the duplicates
> using subset. I have tried:
> subset(x, !X1) and subset(x, !x[X1, ])
> I can't seem to find the correct syntax. Any advice?
> Thanks in advance

Even easier would be to use unique():

    NewDF <- unique(x)

NewDF will contain the rows from 'x' with duplicates removed. See ?unique for more information.

unique(), which has a data.frame method, is basically:

    x[!duplicated(x), , drop = FALSE]

which covers the case where the result may contain a single row and which remains a data frame.

Note that the above presumes that you want to test all columns in 'x' for dups.

HTH,

Marc Schwartz
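A minimal sketch of the equivalence described above, using a small made-up data frame (the column names `id` and `val` are illustrative and not from the original post). It also shows where `drop = FALSE` changes the result: with a single-column data frame, the subscript form without it falls back to a plain vector.

```r
# Toy data (hypothetical): rows 2 and 4 are exact copies of rows 1 and 3.
x <- data.frame(id  = c(1, 1, 2, 2, 3),
                val = c("a", "a", "b", "b", "c"))

new_df <- unique(x)                          # whole-row duplicates removed
same   <- x[!duplicated(x), , drop = FALSE]  # the subscript form of the same thing

# With only one column, omitting drop = FALSE returns a vector, not a data frame.
one_col <- x["id"]
kept    <- one_col[!duplicated(one_col), , drop = FALSE]
dropped <- one_col[!duplicated(one_col), ]

is.data.frame(kept)     # TRUE
is.data.frame(dropped)  # FALSE
```

Both forms compare every column, so two rows count as duplicates only when they match in full.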
Marc,

I have tried unique(), but unique() compares entire rows. I have a data set with a variable TRIPID. The dataset has 469,000 rows. In most cases TRIPID is a unique value. However, in some cases I have the same TRIPID value but different values for other variables. What this amounts to is a data entry error. I need to get rid of the repeated rows that have the same TRIPID but different co-variables.

Thanks for your help.

Cam

Cameron Guenther
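One way to handle duplicates on a single key column, sketched under assumptions: only the name TRIPID comes from the thread, while the toy data, the `catch` column, and the choice between the two options are illustrative. Which option is right depends on whether the first record for a repeated TRIPID can be trusted.

```r
# Hypothetical stand-in for the real data; only the TRIPID name is from the thread.
x <- data.frame(TRIPID = c(101, 102, 102, 103, 104, 104),
                catch  = c(5, 7, 9, 2, 4, 6))

# Option 1: keep the first row seen for each TRIPID and drop later repeats.
first_only <- x[!duplicated(x$TRIPID), , drop = FALSE]

# Option 2: drop every row whose TRIPID occurs more than once.
repeated_ids <- unique(x$TRIPID[duplicated(x$TRIPID)])
no_repeats   <- x[!(x$TRIPID %in% repeated_ids), , drop = FALSE]
```

Passing only the TRIPID column to duplicated() is what restricts the comparison to that variable; the other columns are ignored when deciding which rows to keep.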
Thanks Phil,

That worked perfectly.

Cameron Guenther

-----Original Message-----
From: Phil Spector [mailto:spector at stat.Berkeley.EDU]
Sent: Tuesday, May 16, 2006 3:01 PM
To: Guenther, Cameron
Subject: Re: [R] subset

Cameron -

Is

    X1 = which(duplicated(x))
    x[-X1,]

or

    x[!duplicated(x),]

or

    subset(x,!duplicated(x))

what you're looking for?

Remember that which() will always return indices, so negating them (with regards to subscripts) means making them negative, not applying the not operator (!). The not operator only makes sense for logical values, like those returned by duplicated().

- Phil Spector
  Statistical Computing Facility
  Department of Statistics
  UC Berkeley
  spector at stat.berkeley.edu
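The difference between the index form and the logical form can be sketched on a small made-up data frame (the columns `id` and `val` are illustrative). The last lines show a caveat of the `which()` approach that is not raised in the thread: when there are no duplicates at all, the negative-index form selects nothing.

```r
# Toy data (hypothetical): rows 2 and 5 repeat rows 1 and 4.
x <- data.frame(id  = c(1, 1, 2, 3, 3),
                val = c("a", "a", "b", "c", "c"))

dup <- duplicated(x)   # logical vector: FALSE TRUE FALSE FALSE TRUE
X1  <- which(dup)      # integer positions: 2 5

by_index   <- x[-X1, ]         # negative indices drop rows 2 and 5
by_logical <- x[!dup, ]        # logical negation keeps the non-duplicates
by_subset  <- subset(x, !dup)  # same selection via subset()

# Applying ! to the integer positions coerces them to TRUE first,
# so the result is all FALSE and no rows are selected.
bad <- x[!X1, ]

# Caveat: with no duplicates, which() returns integer(0), and
# -integer(0) is still empty, so every row is lost.
y     <- x[!dup, ]
none  <- which(duplicated(y))
empty <- y[-none, ]
```

For that reason the logical forms, `x[!duplicated(x), ]` or `subset(x, !duplicated(x))`, are the safer default when the data may contain no duplicates.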