Dear R-Helpers,
I have a dataframe (g10df) formatted like this:
    GENE        PVAL
1 KCTD12 4.06904e-22
2 UNC93A 9.91852e-22
3  CDKN3 1.24695e-21
4 CLEC2B 4.71759e-21
5   DAB2 1.12062e-20
The rows are ranked in ascending order by PVAL, and I need to end up with
the same relative order. There are duplicate entries for genes in the first
column with corresponding p-values in the second, but the p-values are
unique. I had intended to use the plyr package to remove these duplicates:
ddply(g10df, "GENE", summarise, PVAL = mean(PVAL))
But it occurred to me that instead of averaging the p-values for each set of
duplicates, I should instead select one duplicate at random, and remove the
rest.
I am relatively new to R, and I have not been able to find a way to do this,
with plyr or otherwise. Any help would be greatly appreciated.
Thanks and best regards,
Jeff
--
View this message in context:
http://n4.nabble.com/Choosing-and-preserving-a-random-duplicate-tp1746091p1746091.html
Sent from the R help mailing list archive at Nabble.com.
Hello Jeff,

If I understand you correctly, one approach would be to randomly order your dataframe, remove the duplicates, and then reorder the resulting dataframe back into the original order:

g10dfA <- g10df[sample(nrow(g10df)),]
g10dfA <- g10dfA[!duplicated(g10dfA$GENE),]
g10dfA <- g10dfA[order(g10dfA$PVAL),]

All untested.

HTH ....

Peter Alspach

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of jeff.m.ewers
> Sent: Wednesday, 31 March 2010 12:33 p.m.
> To: r-help at r-project.org
> Subject: [R] Choosing and preserving a random duplicate
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
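The shuffle, de-duplicate, re-sort idea above can be tried on a small data frame. This is only a sketch: the toy values below are made up for illustration, and the intermediate names `shuffled` and `picked` are hypothetical. Because the rows are put in random order first, the "first occurrence" that `duplicated()` keeps for each gene is a randomly chosen one.

```r
# Toy data (made up for illustration): ascending PVAL, some genes repeated
g10df <- data.frame(
  GENE = c("KCTD12", "UNC93A", "KCTD12", "CDKN3", "UNC93A", "DAB2"),
  PVAL = c(4.1e-22, 9.9e-22, 1.2e-21, 4.7e-21, 1.1e-20, 2.3e-20),
  stringsAsFactors = FALSE
)

set.seed(1)  # only so that the random choice is repeatable

# 1. shuffle the rows, so that "first occurrence" means a random one
shuffled <- g10df[sample(nrow(g10df)), ]

# 2. keep only the first occurrence of each gene in the shuffled order
picked <- shuffled[!duplicated(shuffled$GENE), ]

# 3. restore ascending p-value order (valid because the p-values are unique)
picked <- picked[order(picked$PVAL), ]
picked
```

Each gene ends up with exactly one row, and the surviving rows keep the same relative order they had in the original data frame.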
Hi Jeff,
Here is a suggestion using aggregate():
# some data
set.seed(123)
genes <- sample(c("UNC93A", "CLEC2B", "KCTD12", "CDKN3", "DAB2"),
                20, replace = TRUE)
pfake <- runif(20, 0, 10e-21)
gd10df <- data.frame(genes, pfake)
gd10df
gd10df[order(gd10df$genes), ]  # same data, grouped by gene for inspection

# selecting one p-value randomly per gene
with(gd10df, aggregate(pfake, list(genes), function(x) sample(x, 1)))
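Note that aggregate() returns one row per gene, sorted by gene name, with default column names such as Group.1 and x. A possible way to get back to the GENE/PVAL layout in ascending p-value order is sketched below. The helper `pick_one` and the result name `picked` are hypothetical, and the data are the same made-up values as above. Indexing with sample.int() is a defensive choice: sample(x, 1) treats a single number >= 1 as the range 1:x, which does not affect p-values below 1 but is easy to avoid.

```r
set.seed(123)  # only so that the example is repeatable
genes <- sample(c("UNC93A", "CLEC2B", "KCTD12", "CDKN3", "DAB2"),
                20, replace = TRUE)
pfake <- runif(20, 0, 10e-21)
gd10df <- data.frame(genes, pfake)

# pick one element of x at random, by position
pick_one <- function(x) x[sample.int(length(x), 1)]

# one randomly chosen p-value per gene; name the grouping column GENE
picked <- with(gd10df, aggregate(pfake, list(GENE = genes), pick_one))
names(picked)[2] <- "PVAL"

# back to ascending p-value order
picked <- picked[order(picked$PVAL), ]
picked
```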
HTH,
Jorge
On Tue, Mar 30, 2010 at 7:33 PM, jeff.m.ewers wrote: