Robert Lynch
2013-Sep-07 07:02 UTC
[R] finding both rows that are duplicated in a data frame
I have a data frame that looks like id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10) id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91) GENDER<-sample(c("G-UNK","G-M","G-F"),16, replace = TRUE) ETH <-sample(c("E-AF","E-UNK","E-VT"),16, replace = TRUE) example<-cbind(id1,id2,GENDER,ETH) where there are two id's and some duplicate entries for ID's that have different GENDER or ETH(nicity) I would like to get a data frame that doesn't have the duplicates, but the ones that are kept are which ever GENDER is not G-UNK (unknown) and the kept ETH is what ever is not E-UNK the resultant data frame should have 10 rows with no *-UNK in either of the last two columns ( unless both entries were UNK) yes the example data may have some impossible results but it does capture important aspects. 1) G-UNK is alphabetically last of G-F, G-M & G-UNK 2) E-UNK is in the middle alphabetically 3) some times the first entry is the unknown gender, some times it is the second *likely to happen with random sample 4) some times both entries for one variable, GENDER or ETH are unknown. 5) only appears to be two of each row, * not 100% sure Thanks! Robert [[alternative HTML version deleted]]
Hi, example<- data.frame(id1,id2,GENDER,ETH,stringsAsFactors=FALSE) res<-unique(example[!(grepl("UNK",example$GENDER)|grepl("UNK",example$ETH)),]) ?res #?? id1 id2 GENDER? ETH #1??? 1? 22??? G-M E-VT #3??? 2? 34??? G-M E-AF #5??? 3? 15??? G-M E-AF #7??? 4? 76??? G-F E-VT #8??? 5? 45??? G-F E-VT #12?? 7? 37??? G-F E-AF #13?? 8? 52??? G-F E-AF #14?? 9? 66??? G-F E-AF #16? 10? 91??? G-F E-VT It is a bit unclear about the condition for id1 #6.? If I include both of them, the nrows will be 11, now it is 9. 10?? 6? 84? G-UNK? E-AF 11?? 6? 84??? G-F E-UNK A.K. ----- Original Message ----- From: Robert Lynch <robert.b.lynch at gmail.com> To: R help <r-help at r-project.org> Cc: Sent: Saturday, September 7, 2013 3:02 AM Subject: [R] finding both rows that are duplicated in a data frame I have a data frame that looks like id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10) id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91) GENDER<-sample(c("G-UNK","G-M","G-F"),16, replace = TRUE) ETH <-sample(c("E-AF","E-UNK","E-VT"),16, replace = TRUE) example<-cbind(id1,id2,GENDER,ETH) where there are two id's and some duplicate entries for ID's that have different GENDER or ETH(nicity) I would like to get a data frame that doesn't have the duplicates, but the ones that are kept are which ever GENDER is not G-UNK (unknown) and the kept ETH is what ever is not E-UNK the resultant data frame should have 10 rows with no *-UNK in either of the last two columns ( unless both entries were UNK) yes the example data may have some impossible results but it does capture important aspects. 1) G-UNK is alphabetically last of G-F, G-M & G-UNK 2) E-UNK is in the middle alphabetically 3) some times the first entry is the unknown gender, some times it is the second *likely to happen with random sample 4) some times both entries for one variable, GENDER or ETH are unknown. 5) only appears to be two of each row, * not 100% sure Thanks! Robert ??? [[alternative HTML version deleted]] ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi, Suppose you have situations like this: (duplicates are both UNKNOWN and want to remove those) example1<-rbind(example,data.frame(id1=c(11,12,12),id2=c(93,95,95),GENDER=rep("G-UNK",3),ETH=rep("E-UNK",3))) spl<- as.character(interaction(example1$id1,example1$id2)) ?res1<-do.call(rbind,lapply(split(example1,spl),function(x) {indx<-!(grepl("UNK",x[,3])|grepl("UNK",x[,4]));if(sum(indx)==0) {x[,3]<-x[,3][-grep("UNK",x[,3])];x[,4]<- x[,4][-grep("UNK",x[,4])];unique(x) } else unique(x[indx,])})) res1<-res1[!(is.na(res1[,3])|is.na(res1[,4])),]? ##remove the rows with NA ?res2<-res1[order(res1$id1),] ?row.names(res2)<- 1:nrow(res2) ?res2 #? id1 id2 GENDER? ETH #1??? 1? 22??? G-M E-VT #2??? 2? 34??? G-M E-AF #3??? 3? 15??? G-M E-AF #4??? 4? 76??? G-F E-VT #5??? 5? 45??? G-F E-VT #6??? 6? 84??? G-F E-AF #7??? 7? 37??? G-F E-AF #8??? 8? 52??? G-F E-AF #9??? 9? 66??? G-F E-AF #10? 10? 91??? G-F E-VT A.K. ----- Original Message ----- From: arun <smartpink111 at yahoo.com> To: Robert Lynch <robert.b.lynch at gmail.com> Cc: R help <r-help at r-project.org> Sent: Saturday, September 7, 2013 11:30 AM Subject: Re: [R] finding both rows that are duplicated in a data frame HI, May be this is what you are looking for. spl<- as.character(interaction(example$id1,example$id2)) res<-do.call(rbind,lapply(split(example,spl),function(x) {indx<-!(grepl("UNK",x[,3])|grepl("UNK",x[,4]));if(sum(indx)==0) {x[,3]<-x[,3][-grep("UNK",x[,3])];x[,4]<- x[,4][-grep("UNK",x[,4])];unique(x) } else unique(x[indx,])})) ? ?res1<-res[order(res$id1),] ?row.names(res1)<-1:nrow(res1) ?res1 #?? id1 id2 GENDER? ETH #1??? 1? 22??? G-M E-VT #2??? 2? 34??? G-M E-AF #3??? 3? 15??? G-M E-AF #4??? 4? 76??? G-F E-VT #5??? 5? 45??? G-F E-VT #6??? 6? 84??? G-F E-AF #7??? 7? 37??? G-F E-AF #8??? 8? 52??? G-F E-AF #9??? 9? 66??? G-F E-AF #10? 10? 91??? G-F E-VT A.K. ----- Original Message ----- From: arun <smartpink111 at yahoo.com> To: Robert Lynch <robert.b.lynch at gmail.com> Cc: R help <r-help at r-project.org> Sent: Saturday, September 7, 2013 10:52 AM Subject: Re: [R] finding both rows that are duplicated in a data frame Hi, example<- data.frame(id1,id2,GENDER,ETH,stringsAsFactors=FALSE) res<-unique(example[!(grepl("UNK",example$GENDER)|grepl("UNK",example$ETH)),]) ?res #?? id1 id2 GENDER? ETH #1??? 1? 22??? G-M E-VT #3??? 2? 34??? G-M E-AF #5??? 3? 15??? G-M E-AF #7??? 4? 76??? G-F E-VT #8??? 5? 45??? G-F E-VT #12?? 7? 37??? G-F E-AF #13?? 8? 52??? G-F E-AF #14?? 9? 66??? G-F E-AF #16? 10? 91??? G-F E-VT It is a bit unclear about the condition for id1 #6.? If I include both of them, the nrows will be 11, now it is 9. 10?? 6? 84? G-UNK? E-AF 11?? 6? 84??? G-F E-UNK A.K. ----- Original Message ----- From: Robert Lynch <robert.b.lynch at gmail.com> To: R help <r-help at r-project.org> Cc: Sent: Saturday, September 7, 2013 3:02 AM Subject: [R] finding both rows that are duplicated in a data frame I have a data frame that looks like id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10) id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91) GENDER<-sample(c("G-UNK","G-M","G-F"),16, replace = TRUE) ETH <-sample(c("E-AF","E-UNK","E-VT"),16, replace = TRUE) example<-cbind(id1,id2,GENDER,ETH) where there are two id's and some duplicate entries for ID's that have different GENDER or ETH(nicity) I would like to get a data frame that doesn't have the duplicates, but the ones that are kept are which ever GENDER is not G-UNK (unknown) and the kept ETH is what ever is not E-UNK the resultant data frame should have 10 rows with no *-UNK in either of the last two columns ( unless both entries were UNK) yes the example data may have some impossible results but it does capture important aspects. 1) G-UNK is alphabetically last of G-F, G-M & G-UNK 2) E-UNK is in the middle alphabetically 3) some times the first entry is the unknown gender, some times it is the second *likely to happen with random sample 4) some times both entries for one variable, GENDER or ETH are unknown. 5) only appears to be two of each row, * not 100% sure Thanks! Robert ??? [[alternative HTML version deleted]] ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
jim holtman
2013-Sep-07 16:22 UTC
[R] finding both rows that are duplicated in a data frame
try this. Splits the dataframe based on the two IDs and then chooses the first one in cases where condition not met.> id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10) > id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91) > GENDER<-sample(c("G-UNK","G-M","G-F"),16, replace = TRUE) > ETH <-sample(c("E-AF","E-UNK","E-VT"),16, replace = TRUE) > example<-data.frame(id1,id2,GENDER,ETH, stringsAsFactors = FALSE) > # find dups by spliting on id1,id2 > result <- do.call(rbind+ , lapply(split(example, list(example$id1, example$id2), drop TRUE), function(x){ + indx <- which(!grepl("UNK", x$GENDER) & !grepl("UNK", x$ETH))[1L] # choose first one + if (is.na(indx)) indx <- 1L # none match so choose one + x[indx,] + }) + )> resultid1 id2 GENDER ETH 3.15 3 15 G-F E-AF 1.22 1 22 G-F E-VT 2.34 2 34 G-F E-AF 7.37 7 37 G-UNK E-VT 5.45 5 45 G-M E-AF 8.52 8 52 G-F E-AF 9.66 9 66 G-UNK E-AF 4.76 4 76 G-M E-AF 6.84 6 84 G-M E-VT 10.91 10 91 G-F E-UNK Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. On Sat, Sep 7, 2013 at 3:02 AM, Robert Lynch <robert.b.lynch at gmail.com> wrote:> I have a data frame that looks like > > id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10) > id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91) > GENDER<-sample(c("G-UNK","G-M","G-F"),16, replace = TRUE) > ETH <-sample(c("E-AF","E-UNK","E-VT"),16, replace = TRUE) > example<-cbind(id1,id2,GENDER,ETH) > > where there are two id's and some duplicate entries for ID's that have > different GENDER or ETH(nicity) > I would like to get a data frame that doesn't have the duplicates, but the > ones that are kept are which ever GENDER is not G-UNK (unknown) and the > kept ETH is what ever is not E-UNK > > the resultant data frame should have 10 rows with no *-UNK in either of the > last two columns ( unless both entries were UNK) > > yes the example data may have some impossible results but it does capture > important aspects. > 1) G-UNK is alphabetically last of G-F, G-M & G-UNK > 2) E-UNK is in the middle alphabetically > 3) some times the first entry is the unknown gender, some times it is the > second *likely to happen with random sample > 4) some times both entries for one variable, GENDER or ETH are unknown. > 5) only appears to be two of each row, * not 100% sure > > Thanks! > Robert > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.