Dear R experts,

Sorry for this basic question, but I can't seem to find a solution.

I have this data frame:

df <- data.frame(id = c("id1", "id1", "id1", "id2", "id2", "id2"),
                 A = c(11905, 11907, 11907, 11829, 11829, 11829),
                 v1 = c(NA, 3, NA, 1, 2, NA),
                 v2 = c(NA, 2, NA, 2, NA, NA),
                 v3 = c(NA, 1, NA, 1, NA, NA),
                 v4 = c("N", "Y", "N", "Y", "N", "N"),
                 v5 = c(0, 0, 0, 1, 0, 0),
                 numMiss = c(3, 0, 3, 0, 2, 3))

> df
   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
2 id1 11907  3  2  1  Y  0       0
3 id1 11907 NA NA NA  N  0       3
4 id2 11829  1  2  1  Y  1       0
5 id2 11829  2 NA NA  N  0       2
6 id2 11829 NA NA NA  N  0       3

Among the rows that share the same value of "A" within an id, I need to keep only the one with the fewest missing values across all the variables (i.e. min(numMiss)), to get this:

   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
2 id1 11907  3  2  1  Y  0       0
4 id2 11829  1  2  1  Y  1       0

Then, among the rows that have the same id, I have to choose the record with the smallest value of "A", like this:

   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
4 id2 11829  1  2  1  Y  1       0

For groupings I have used the package "plyr" before, but this would involve a sort of double grouping, by id and by duplicated values of A. Could you please help me understand how this can be done?

Thank you very much.
-f

--
View this message in context: http://r.789695.n4.nabble.com/Choose-between-duplicated-rows-tp4557833p4557833.html
Sent from the R help mailing list archive at Nabble.com.
try this:

> x  # print data
   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
2 id1 11907  3  2  1  Y  0       0
3 id1 11907 NA NA NA  N  0       3
4 id2 11829  1  2  1  Y  1       0
5 id2 11829  2 NA NA  N  0       2
6 id2 11829 NA NA NA  N  0       3
> # select best data
> xBest <- do.call(rbind, lapply(split(x, x$A), function(.grp){
+     best <- which.min(apply(.grp, 1, function(a) sum(is.na(a))))
+     .grp[best, ]
+ }))
> xBest
       id     A v1 v2 v3 v4 v5 numMiss
11829 id2 11829  1  2  1  Y  1       0
11905 id1 11905 NA NA NA  N  0       3
11907 id1 11907  3  2  1  Y  0       0
>
> xWorst <- do.call(rbind, lapply(split(x, x$A), function(.grp){
+     worst <- which.max(apply(.grp, 1, function(a) sum(is.na(a))))
+     .grp[worst, ]
+ }))
> xWorst
       id     A v1 v2 v3 v4 v5 numMiss
11829 id2 11829 NA NA NA  N  0       3
11905 id1 11905 NA NA NA  N  0       3
11907 id1 11907 NA NA NA  N  0       3

On Sat, Apr 14, 2012 at 3:03 PM, francy <francy.casalino@gmail.com> wrote:
> [...]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
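A side note on the NA count used in the reply above: apply(.grp, 1, ...) counts missing values row by row after coercing the data frame to a matrix. A vectorised way to get the same per-row count is rowSums(is.na(...)). The minimal sketch below rebuilds the example data from the original post and checks the count against the posted numMiss column; the name na_count is made up for illustration.

```r
# Example data from the original post
df <- data.frame(id = c("id1", "id1", "id1", "id2", "id2", "id2"),
                 A = c(11905, 11907, 11907, 11829, 11829, 11829),
                 v1 = c(NA, 3, NA, 1, 2, NA),
                 v2 = c(NA, 2, NA, 2, NA, NA),
                 v3 = c(NA, 1, NA, 1, NA, NA),
                 v4 = c("N", "Y", "N", "Y", "N", "N"),
                 v5 = c(0, 0, 0, 1, 0, 0),
                 numMiss = c(3, 0, 3, 0, 2, 3),
                 stringsAsFactors = FALSE)

# is.na() on a data frame returns a logical matrix;
# rowSums() then counts the TRUE entries in each row.
# 'na_count' is a hypothetical name, not from the thread.
na_count <- unname(rowSums(is.na(df)))

# Compare with the numMiss column supplied by the poster
print(data.frame(na_count = na_count, numMiss = df$numMiss))
```

On this data the two columns agree, so either numMiss or a freshly computed count could serve as the selection criterion.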
My solution:

# split by id and A, then keep the row with the smallest numMiss in each group
SP <- split(df, df[, 1:2])
minner <- function(x, col = 'numMiss') {
    x[which.min(unlist(x[, col])), , drop = FALSE]
}
NEW <- do.call('rbind', lapply(SP, minner))
# split the survivors by id and keep the row with the smallest A
SP2 <- split(NEW, NEW[, 'id'])
do.call('rbind', lapply(SP2, function(x) minner(x, 'A')))

Cheers,
Tyler

> Date: Sat, 14 Apr 2012 12:03:36 -0700
> From: francy.casalino at gmail.com
> To: r-help at r-project.org
> Subject: [R] Choose between duplicated rows
>
> [...]
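An alternative that avoids split() altogether is to sort the rows and then drop duplicates. This is only a sketch on the example data from the original post, not something proposed in the thread; the intermediate names ord, step1 and step2 are made up for illustration, and it assumes that ties can be broken by whichever row sorts first.

```r
# Example data from the original post
df <- data.frame(id = c("id1", "id1", "id1", "id2", "id2", "id2"),
                 A = c(11905, 11907, 11907, 11829, 11829, 11829),
                 v1 = c(NA, 3, NA, 1, 2, NA),
                 v2 = c(NA, 2, NA, 2, NA, NA),
                 v3 = c(NA, 1, NA, 1, NA, NA),
                 v4 = c("N", "Y", "N", "Y", "N", "N"),
                 v5 = c(0, 0, 0, 1, 0, 0),
                 numMiss = c(3, 0, 3, 0, 2, 3),
                 stringsAsFactors = FALSE)

# Sort so that, within each id, the smallest A comes first and,
# within each (id, A) pair, the row with the fewest NAs comes first.
ord <- df[order(df$id, df$A, df$numMiss), ]

# Step 1: keep the first (least-missing) row of every (id, A) pair
step1 <- ord[!duplicated(ord[, c("id", "A")]), ]

# Step 2: keep the first (smallest-A) row of every id
step2 <- step1[!duplicated(step1$id), ]

print(step1)
print(step2)
```

Because duplicated() marks every occurrence after the first, the sort order decides which row survives, so the two selection rules are expressed entirely through the arguments to order().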
I also tried using Jim's code, but it doesn't work as expected with my real dataset. This is what I did:

Best.na <- do.call(rbind, lapply(split(x, x$A), function(.grp){
    best <- which.min(apply(.grp, 1, function(a) sum(is.na(a))))
    .grp[best, ]
}))

df.split <- split(Best.na, Best.na$id)
Best.date <- lapply(df.split, function(x){
    # Select by given criterion
    y <- x[which(max(x$A) == x$A), ]
    y
})
Best.date <- do.call(rbind, Best.date)

Thank you again for your help.

--
View this message in context: http://r.789695.n4.nabble.com/Choose-between-duplicated-rows-tp4557833p4559792.html
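The real dataset is not shown, so the cause can only be guessed at. One possibility is that split(x, x$A) groups on A alone, so rows from different ids that happen to share an A value are pooled and only one of them survives. The sketch below uses hypothetical toy data (not from the thread) in which id1 and id2 share A = 100, and compares grouping on A alone with grouping on id and A together. The helper pick_min and the names toy, by_A, by_id_A and result are all made up for illustration.

```r
# Hypothetical toy data: id1 and id2 both have rows with A == 100
toy <- data.frame(id = c("id1", "id1", "id2", "id2"),
                  A = c(100, 100, 100, 200),
                  v1 = c(NA, 1, NA, 2),
                  stringsAsFactors = FALSE)
toy$numMiss <- rowSums(is.na(toy[, "v1", drop = FALSE]))

# Hypothetical helper: the row of g with the smallest value in column 'col'
pick_min <- function(g, col) g[which.min(g[[col]]), , drop = FALSE]

# Grouping on A alone: the A == 100 group mixes id1 and id2,
# so id2 loses its A == 100 row entirely.
by_A <- do.call(rbind, lapply(split(toy, toy$A), pick_min, col = "numMiss"))

# Grouping on id and A together; drop = TRUE discards empty combinations.
by_id_A <- do.call(rbind,
                   lapply(split(toy, list(toy$id, toy$A), drop = TRUE),
                          pick_min, col = "numMiss"))

# Second step: within each id, keep the row with the smallest A
result <- do.call(rbind, lapply(split(by_id_A, by_id_A$id), pick_min, col = "A"))

print(by_A)
print(result)
```

If the real data never repeats an A value across ids, this is not the explanation. Note also that the follow-up code selects max(x$A) while the original post asked for the least value of A, which may or may not be intended.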