Dear R-Helpers,

I have a dataframe (g10df) formatted like this:

    GENE   PVAL
1   KCTD12 4.06904e-22
2   UNC93A 9.91852e-22
3   CDKN3  1.24695e-21
4   CLEC2B 4.71759e-21
5   DAB2   1.12062e-20

The rows are ranked in ascending order by PVAL, and I need to end up with the same relative order. There are duplicate entries for genes in the first column with corresponding p-values in the second, but the p-values are unique. I had intended to use the plyr package to remove these duplicates:

    ddply(g10df, "GENE", summarise, PVAL = mean(PVAL))

But it occurred to me that instead of averaging the p-values for each set of duplicates, I should instead select one duplicate at random, and remove the rest.

I am relatively new to R, and I have not been able to find a way to do this, with plyr or otherwise. Any help would be greatly appreciated.

Thanks and best regards,

Jeff

--
View this message in context: http://n4.nabble.com/Choosing-and-preserving-a-random-duplicate-tp1746091p1746091.html
Sent from the R help mailing list archive at Nabble.com.

Tena koe Jeff

If I understand you correctly, one approach would be to randomly order your dataframe, remove the duplicates, and then reorder the resulting dataframe back into the original order:

    g10dfA <- g10df[sample(nrow(g10df)),]
    g10dfA <- g10dfA[!duplicated(g10dfA$GENE),]
    g10dfA <- g10dfA[order(g10dfA$PVAL),]

All untested.

HTH ....

Peter Alspach

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of jeff.m.ewers
> Sent: Wednesday, 31 March 2010 12:33 p.m.
> To: r-help at r-project.org
> Subject: [R] Choosing and preserving a random duplicate
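
The shuffle, de-duplicate, reorder idea above can be tried end to end on a small toy dataframe. This is only a sketch: the gene names are taken from the thread, but the duplicate rows and p-values are made up for illustration, and the intermediate names `shuffled` and `picked` are hypothetical.

```r
# Toy data in the same shape as g10df. The duplicated genes and the
# p-values are invented for this example, not taken from the real data.
set.seed(1)  # only so that the random choice is repeatable
g10df <- data.frame(
  GENE = c("KCTD12", "UNC93A", "KCTD12", "CDKN3", "UNC93A", "DAB2"),
  PVAL = c(4.1e-22, 9.9e-22, 1.2e-21, 4.7e-21, 1.1e-20, 3.0e-20),
  stringsAsFactors = FALSE
)

# 1. Shuffle the rows, so that which duplicate comes first is random.
shuffled <- g10df[sample(nrow(g10df)), ]

# 2. Keep only the first occurrence of each gene in the shuffled order.
picked <- shuffled[!duplicated(shuffled$GENE), ]

# 3. Restore the ascending p-value ranking.
picked <- picked[order(picked$PVAL), ]
picked
```

Because the p-values are unique and the original rows were already ranked by PVAL, sorting on PVAL in the last step should recover the original relative order of the rows that survive.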

Hi Jeff,

Here is a suggestion using aggregate():

    # some data
    set.seed(123)
    genes <- sample(c("UNC93A", "CLEC2B", "KCTD12", "CDKN3", "DAB2"), 20, replace = TRUE)
    pfake <- runif(20, 0, 10e-21)
    gd10df <- data.frame(genes, pfake)
    gd10df
    gd10df[order(gd10df$genes),]

    # selecting one p-value randomly
    with(gd10df, aggregate(pfake, list(genes), function(x) sample(x, 1)))

HTH,
Jorge
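
One thing to watch with the aggregate() route is that the result comes back grouped by gene, not ranked by p-value, so a final sort is still needed to get the order the original question asks for. A minimal sketch, assuming the GENE and PVAL column names from the question and made-up toy values; the name `res` is hypothetical, and the formula interface to aggregate() is used here in place of the with()/list() call:

```r
# Toy data: gene names from the thread, duplicates and p-values invented.
set.seed(1)  # only so that the random choice is repeatable
g10df <- data.frame(
  GENE = c("KCTD12", "UNC93A", "KCTD12", "CDKN3", "UNC93A", "DAB2"),
  PVAL = c(4.1e-22, 9.9e-22, 1.2e-21, 4.7e-21, 1.1e-20, 3.0e-20),
  stringsAsFactors = FALSE
)

# Pick one p-value at random within each gene. Indexing with sample.int()
# avoids sample()'s special case for a single number >= 1, which is
# harmless for p-values but can surprise with other kinds of data.
res <- aggregate(PVAL ~ GENE, data = g10df,
                 FUN = function(p) p[sample.int(length(p), 1)])

# aggregate() returns one row per gene in group order, so sort by PVAL
# to get back to the ascending ranking.
res <- res[order(res$PVAL), ]
res
```

This keeps the GENE and PVAL column names in the result, which the with()/list() form does not do by default.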