Adrian Johnson
2012-Nov-04 19:25 UTC
[R] select duplicate identifier with higher mean across sample columns
Hi Group:

I searched R groups before posting this question. I could not find an appropriate answer, and I do not have a clear understanding of how to do this in R.

I have a data frame with duplicated row identifiers but with different values across columns. I want to select the identifier with the higher inter-quartile range or mean.

id <- c("A", "A", "C", "D", "E", "F")
year <- c(2000, 2001, 2001, 2002, 2003, 2004)
samp1 <- c(100, 120, 101, 110, 132, 123)
samp2 <- c(110, 130, 131, 150, 122, 143)
mdf <- data.frame(id, samp1, samp2, samp2a)

> mdf
  id samp1 samp2 samp2a
1  A   100   110    110
2  A   120   130    150
3  C   101   131    151
4  D   110   150    130
5  E   132   122    122
6  F   123   143    143

There are two A ids in this df. I want to select the row with the higher mean.

How can I do this?

Thanks
Adrian
jim holtman
2012-Nov-04 19:39 UTC
[R] select duplicate identifier with higher mean across sample columns
Is this what you want:

> mdf <- read.table(text = "  id samp1 samp2 samp2a
+ 1  A   100   110    110
+ 2  A   120   130    150
+ 3  C   101   131    151
+ 4  D   110   150    130
+ 5  E   132   122    122
+ 6  F   123   143    143", header = TRUE)
> result <- do.call(rbind, lapply(split(mdf, mdf$id), function(.id){
+     maxIndx <- which.max(rowMeans(.id[, -1L]))
+     .id[maxIndx, ]
+ }))
>
> result
  id samp1 samp2 samp2a
A  A   120   130    150
C  C   101   131    151
D  D   110   150    130
E  E   132   122    122
F  F   123   143    143

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
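The original question also mentions the inter-quartile range, which the reply above does not cover. As a rough sketch only, the same split/lapply idea can take the row statistic as an argument. The helper name pick_by_stat and its arguments are hypothetical and do not come from the thread; the data frame is rebuilt from the table printed in the question.

```r
# Data as printed in the question (samp2a values taken from the printed table).
mdf <- data.frame(
  id     = c("A", "A", "C", "D", "E", "F"),
  samp1  = c(100, 120, 101, 110, 132, 123),
  samp2  = c(110, 130, 131, 150, 122, 143),
  samp2a = c(110, 150, 151, 130, 122, 143)
)

# Hypothetical helper: for each id, keep the row whose per-row statistic
# (mean, IQR, ...) over the non-id columns is largest.
pick_by_stat <- function(df, id_col, stat = mean) {
  value_cols <- setdiff(names(df), id_col)
  pieces <- lapply(split(df, df[[id_col]]), function(grp) {
    # One score per row of this group.
    score <- apply(as.matrix(grp[, value_cols, drop = FALSE]), 1, stat)
    # which.max() returns the first maximum, so ties resolve to the first row.
    grp[which.max(score), , drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

by_mean <- pick_by_stat(mdf, "id", mean)
by_iqr  <- pick_by_stat(mdf, "id", IQR)
```

For this data both statistics select the second A row (120, 130, 150). With other data the two choices may differ, so the statistic should be picked deliberately.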
Rui Barradas
2012-Nov-04 19:40 UTC
[R] select duplicate identifier with higher mean across sample columns
Hello,

Thanks for the data example. (You forgot samp2a.)
Try the following.

mdf <- read.table(text="
  id samp1 samp2 samp2a
1  A   100   110    110
2  A   120   130    150
3  C   101   131    151
4  D   110   150    130
5  E   132   122    122
6  F   123   143    143
", header=TRUE)

idx <- ave(rowMeans(mdf[,-1]), mdf$id, FUN = function(x) x == max(x))
mdf[as.logical(idx), ]

Hope this helps,

Rui Barradas
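One point about the ave() approach above may be worth illustrating: the comparison x == max(x) is TRUE for every row that reaches the group maximum, so tied rows are all kept. The sketch below uses made-up data (the data frame tied is hypothetical, not from the thread) to show that behaviour, along with one possible way to keep only the first maximum per id.

```r
# Hypothetical data with a tie: both "A" rows have a row mean of 110.
tied <- data.frame(
  id    = c("A", "A", "B"),
  samp1 = c(100, 120, 90),
  samp2 = c(120, 100, 95)
)
score <- rowMeans(tied[, -1])

# As in the reply above: every row equal to its group maximum is flagged,
# so both "A" rows are kept.
keep_all <- as.logical(ave(score, tied$id, FUN = function(x) x == max(x)))

# Variant: flag only the first maximum within each id.
keep_first <- as.logical(ave(score, tied$id,
                             FUN = function(x) seq_along(x) == which.max(x)))

tied[keep_all, ]    # rows 1, 2 and 3
tied[keep_first, ]  # rows 1 and 3
```

Whether tied rows should all be kept or reduced to one depends on what the duplicates mean in the data; the thread does not say which is wanted.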
arun
2012-Nov-04 21:05 UTC
[R] select duplicate identifier with higher mean across sample columns
Hi,

Try this:

mdf[unlist(tapply(rowMeans(mdf[,-1]),mdf$id,FUN=function(x) x%in%max(x))),]
#  id samp1 samp2 samp2a
#2  A   120   130    150
#3  C   101   131    151
#4  D   110   150    130
#5  E   132   122    122
#6  F   123   143    143

A.K.
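A caveat about the tapply() one-liner above: tapply() returns its results grouped by id, so the unlisted flags line up with the rows only when the data frame is already sorted by id, as it is in the thread. The sketch below uses a hypothetical reordering of some of the thread's rows (mdf2 is not from the thread) to show the mismatch, followed by one order-independent alternative based on order() and duplicated().

```r
# Hypothetical unsorted data: the two "A" rows are not adjacent.
mdf2 <- data.frame(
  id     = c("A", "C", "A", "D"),
  samp1  = c(120, 101, 100, 110),
  samp2  = c(130, 131, 110, 150),
  samp2a = c(150, 151, 110, 130)
)
score <- rowMeans(mdf2[, -1])

# tapply() groups by id, so the flags come back in the order A, A, C, D,
# not in the row order A, C, A, D.
flags_grouped <- unname(unlist(
  tapply(score, mdf2$id, FUN = function(x) x %in% max(x))
))
wrong <- mdf2[flags_grouped, ]  # keeps both "A" rows and drops "C"

# Order-independent alternative: sort by id and descending score,
# then keep the first row seen for each id.
ord  <- order(mdf2$id, -score)
best <- mdf2[ord[!duplicated(mdf2$id[ord])], ]
```

Sorting the data frame by id before applying the one-liner would be another way to avoid the mismatch.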
Adrian Johnson
2012-Nov-05 15:47 UTC
[R] select duplicate identifier with higher mean across sample columns
Thanks a lot for the help.

-Adrian