thr3ads.net - R help - [R] Using apply function on duplicates in a data.frame [Jan 2010]

Sunny Srivastava

2010-Jan-31 23:05 UTC

[R] Using apply function on duplicates in a data.frame

Dear R-Helpers,
I have a data.frame (df) and the head of data.frame looks like

     ProbeUID ControlType     ProbeName GeneName SystematicName
1665     1577           0 pSysX_50_22_1 pSysX_50       pSysX_50
5422     5147           0  pSysX_49_8_1 pSysX_49       pSysX_49
4042     3843           0 pSysX_51_18_1 pSysX_51       pSysX_51
3646     3466           0   sll1514_0_2  sll1514        sll1514
2946     2807           0   sll1514_0_1  sll1514        sll1514
624       582           0  pSysX_49_8_2 pSysX_49       pSysX_49

     Description   logFC AveExpr       t    P.Value  adj.P.Val
1665     Unknown  4.3887  9.5662  61.038 1.0938e-08 9.4449e-05
5422     Unknown -3.5251  6.9103 -35.908 1.7596e-07 3.5912e-04
4042     Unknown  2.5302  8.7497  35.112 1.9786e-07 3.5912e-04
3646     Unknown  2.3457 11.1678  33.962 2.3549e-07 3.5912e-04
2946     Unknown  2.3151 11.3153  32.689 2.8751e-07 3.5912e-04
624      Unknown -3.6256  6.8986 -31.777 3.3333e-07 3.5912e-04
          B
1665 9.8342
5422 8.1650
4042 8.0758
3646 7.9408
2946 7.7822
624  7.6622

I want to "collapse" this data frame into a new data.frame in which
df$GeneName contains no duplicates (e.g. sll1514 appears only once) and
df$logFC holds the average of the logFC values from all rows that shared
that GeneName.

I am aware of an inefficient strategy using loops, but I believe there
should be a way to do this with the apply family of functions, or maybe plyr?

I cannot think of one at the moment. Can you please help me?

Any help is appreciated!


Thanks and Best Regards,
S.

David Winsemius

2010-Jan-31 23:22 UTC

On Jan 31, 2010, at 6:05 PM, Sunny Srivastava wrote:
> Dear R-Helpers,
> I have a data.frame (df) and the head of data.frame looks like
>
>     ProbeUID ControlType     ProbeName GeneName SystematicName
> 1665     1577           0 pSysX_50_22_1 pSysX_50       pSysX_50
> 5422     5147           0  pSysX_49_8_1 pSysX_49       pSysX_49
> 4042     3843           0 pSysX_51_18_1 pSysX_51       pSysX_51
> 3646     3466           0   sll1514_0_2  sll1514        sll1514
> 2946     2807           0   sll1514_0_1  sll1514        sll1514
> 624       582           0  pSysX_49_8_2 pSysX_49       pSysX_49
>
>     Description   logFC AveExpr       t    P.Value  adj.P.Val
> 1665     Unknown  4.3887  9.5662  61.038 1.0938e-08 9.4449e-05
> 5422     Unknown -3.5251  6.9103 -35.908 1.7596e-07 3.5912e-04
> 4042     Unknown  2.5302  8.7497  35.112 1.9786e-07 3.5912e-04
> 3646     Unknown  2.3457 11.1678  33.962 2.3549e-07 3.5912e-04
> 2946     Unknown  2.3151 11.3153  32.689 2.8751e-07 3.5912e-04
> 624      Unknown -3.6256  6.8986 -31.777 3.3333e-07 3.5912e-04
>          B
> 1665 9.8342
> 5422 8.1650
> 4042 8.0758
> 3646 7.9408
> 2946 7.7822
> 624  7.6622
>
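# Mean logFC per GeneName (tapply returns a named array, one entry per gene)
tdf <- tapply(df$logFC, df$GeneName, mean)
# Collapsed result: one row per gene with its mean logFC
ndf <- data.frame(Gnames = names(tdf), mn.logFC = tdf)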
> I want to "collapse" this data frame into a new data.frame so that the
> df$GeneName contains no duplicate GeneNames (for eg: sll1514) AND the
> df$logFC contains the average of df$logFC corresponding to these GeneNames
> (which had duplicate genenames).
>
> I am aware of an inefficient strategy using loops, but I believe that there
> should be a way using Apply functions or may be plyr?
>
> I am not able to think of one at the moment.  Can you please help me?

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
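
For anyone who wants to try the tapply approach without the full data set, here is a minimal sketch that rebuilds only the GeneName and logFC columns from the six rows quoted above. The as.vector() call is an addition of this sketch (it drops the array attributes before building the data frame), not something from the reply itself.

```r
# Rebuild just the two relevant columns from the six rows shown in the thread.
df <- data.frame(
  GeneName = c("pSysX_50", "pSysX_49", "pSysX_51",
               "sll1514", "sll1514", "pSysX_49"),
  logFC    = c(4.3887, -3.5251, 2.5302, 2.3457, 2.3151, -3.6256),
  stringsAsFactors = FALSE
)

# One mean per distinct GeneName; the result is a named 1-d array.
tdf <- tapply(df$logFC, df$GeneName, mean)

# Two-column data frame: gene name and its mean logFC.
ndf <- data.frame(Gnames = names(tdf), mn.logFC = as.vector(tdf))
print(ndf)
```

On these six rows the result has four rows, and the two sll1514 probes collapse to a mean logFC of 2.3304.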

hadley wickham

2010-Feb-01 03:43 UTC

On Sun, Jan 31, 2010 at 5:05 PM, Sunny Srivastava
<research.baba at gmail.com> wrote:
> Dear R-Helpers,
> I have a data.frame (df) and the head of data.frame looks like
>
>      ProbeUID ControlType     ProbeName GeneName SystematicName
> 1665     1577           0 pSysX_50_22_1 pSysX_50       pSysX_50
> 5422     5147           0  pSysX_49_8_1 pSysX_49       pSysX_49
> 4042     3843           0 pSysX_51_18_1 pSysX_51       pSysX_51
> 3646     3466           0   sll1514_0_2  sll1514        sll1514
> 2946     2807           0   sll1514_0_1  sll1514        sll1514
> 624       582           0  pSysX_49_8_2 pSysX_49       pSysX_49
>
>      Description   logFC AveExpr       t    P.Value  adj.P.Val
> 1665     Unknown  4.3887  9.5662  61.038 1.0938e-08 9.4449e-05
> 5422     Unknown -3.5251  6.9103 -35.908 1.7596e-07 3.5912e-04
> 4042     Unknown  2.5302  8.7497  35.112 1.9786e-07 3.5912e-04
> 3646     Unknown  2.3457 11.1678  33.962 2.3549e-07 3.5912e-04
> 2946     Unknown  2.3151 11.3153  32.689 2.8751e-07 3.5912e-04
> 624      Unknown -3.6256  6.8986 -31.777 3.3333e-07 3.5912e-04
>           B
> 1665 9.8342
> 5422 8.1650
> 4042 8.0758
> 3646 7.9408
> 2946 7.7822
> 624  7.6622
>
> I want to "collapse" this data frame into a new data.frame so that the
> df$GeneName contains no duplicate GeneNames (for eg: sll1514) AND the
> df$logFC contains the average of df$logFC corresponding to these GeneNames
> (which had duplicate genenames).

library(plyr)
# One row per GeneName, with logFC replaced by the group mean
ddply(df, "GeneName", summarise, logFC = mean(logFC))

Hadley


-- 
http://had.co.nz/
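
If plyr is not installed, the same collapse can probably be done in base R with aggregate(). This is a swapped-in alternative, not something proposed in the thread; the sketch below again uses only the GeneName and logFC columns from the six rows shown in the question.

```r
# Rebuild the two relevant columns from the six rows shown in the thread.
df <- data.frame(
  GeneName = c("pSysX_50", "pSysX_49", "pSysX_51",
               "sll1514", "sll1514", "pSysX_49"),
  logFC    = c(4.3887, -3.5251, 2.5302, 2.3457, 2.3151, -3.6256),
  stringsAsFactors = FALSE
)

# Formula interface: average logFC within each GeneName.
# The result is a data frame with one row per gene and columns GeneName, logFC.
res <- aggregate(logFC ~ GeneName, data = df, FUN = mean)
print(res)
```

Like the tapply and ddply answers, this keeps only the grouping column and the averaged column; how to carry over the remaining columns (ProbeName, P.Value and so on) for duplicated genes is a separate decision that the thread does not address.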
