thr3ads.net - R help - [R] Choosing and preserving a random duplicate [Mar 2010]

jeff.m.ewers

2010-Mar-30 23:33 UTC

[R] Choosing and preserving a random duplicate

Dear R-Helpers,

I have a dataframe (g10df) formatted like this:

    GENE             PVAL
1 KCTD12      4.06904e-22
2 UNC93A      9.91852e-22
3  CDKN3      1.24695e-21
4 CLEC2B      4.71759e-21
5   DAB2      1.12062e-20

The rows are ranked in ascending order by PVAL, and I need to end up with
the same relative order. There are duplicate entries for genes in the first
column with corresponding p-values in the second, but the p-values are
unique. I had intended to use the plyr package to remove these duplicates:

ddply(g10df, "GENE", summarise, PVAL = mean(PVAL))

But it occurred to me that instead of averaging the p-values for each set of
duplicates, I should instead select one duplicate at random, and remove the
rest. 

I am relatively new to R, and I have not been able to find a way to do this,
with plyr or otherwise. Any help would be greatly appreciated.

Thanks and best regards,

Jeff





Peter Alspach

2010-Mar-30 23:57 UTC

Tena koe Jeff

If I understand you correctly, one approach would be to randomly order
your dataframe, remove the duplicates, and then reorder the resulting
dataframe back into the original order:

g10dfA <- g10df[sample(nrow(g10df)),]
g10dfA <- g10dfA[!duplicated(g10dfA$GENE),]
g10dfA <- g10dfA[order(g10dfA$PVAL),]

All untested.
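
A minimal sketch for checking the three steps above on a small data frame. The gene/p-value rows are made up for illustration (they are not the poster's data), and the intermediate names `shuffled` and `kept` are hypothetical:

```r
# Toy data (invented for this example): GENE contains duplicates,
# PVAL values are unique and already in ascending order.
g10df <- data.frame(
  GENE = c("KCTD12", "UNC93A", "KCTD12", "CDKN3", "UNC93A", "DAB2"),
  PVAL = c(1e-22, 2e-22, 3e-22, 4e-22, 5e-22, 6e-22),
  stringsAsFactors = FALSE
)

set.seed(1)  # only so that the example is reproducible

# 1. Put the rows in random order.
shuffled <- g10df[sample(nrow(g10df)), ]

# 2. Keep the first row seen for each gene; after shuffling,
#    "first" is a random pick among that gene's duplicates.
kept <- shuffled[!duplicated(shuffled$GENE), ]

# 3. Restore ascending p-value order.
kept <- kept[order(kept$PVAL), ]

kept
```

Because the p-values are unique and the original rows were sorted by PVAL, sorting on PVAL again recovers the original relative order of the surviving rows.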

HTH ....

Peter Alspach

Jorge Ivan Velez

2010-Mar-31 02:44 UTC


Hi Jeff,

Here is a suggestion using aggregate():

# some data
set.seed(123)
genes <- sample(c("UNC93A", "CLEC2B", "KCTD12", "CDKN3", "DAB2"), 20,
                replace = TRUE)
pfake <- runif(20, 0, 10e-21)
gd10df <- data.frame(genes, pfake)
gd10df
gd10df[order(gd10df$genes),]

# selecting one p-value randomly
with(gd10df, aggregate(pfake, list(genes), function(x) sample(x, 1)))
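
The aggregate() call above returns its result grouped by gene, with the generic column names Group.1 and x, so the p-value ranking asked for in the original post is not kept. A possible variant, sketched here as an assumption and not as part of the original suggestion, uses the formula interface to keep the column names and then sorts by p-value. The name `picked` is hypothetical. Selecting by index also sidesteps the documented behaviour of sample(), which treats a single numeric x >= 1 as 1:x; that should not affect p-values below 1, but the index form does not rely on it.

```r
# Same simulated data as in the message above.
set.seed(123)
genes <- sample(c("UNC93A", "CLEC2B", "KCTD12", "CDKN3", "DAB2"), 20,
                replace = TRUE)
pfake <- runif(20, 0, 10e-21)
gd10df <- data.frame(genes, pfake)

# Pick one p-value per gene at random, choosing by position so that
# a group holding a single value is returned unchanged.
picked <- aggregate(pfake ~ genes, data = gd10df,
                    FUN = function(x) x[sample.int(length(x), 1)])

# Restore the ascending p-value ranking.
picked <- picked[order(picked$pfake), ]

picked
```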


HTH,
Jorge


