Sunny Srivastava
2010-Jan-31 23:05 UTC
[R] Using apply function on duplicates in a data.frame
Dear R-Helpers,

I have a data.frame (df), and the head of the data.frame looks like this:

     ProbeUID ControlType     ProbeName GeneName SystematicName
1665     1577           0 pSysX_50_22_1 pSysX_50       pSysX_50
5422     5147           0  pSysX_49_8_1 pSysX_49       pSysX_49
4042     3843           0 pSysX_51_18_1 pSysX_51       pSysX_51
3646     3466           0   sll1514_0_2  sll1514        sll1514
2946     2807           0   sll1514_0_1  sll1514        sll1514
624       582           0  pSysX_49_8_2 pSysX_49       pSysX_49

     Description   logFC AveExpr       t    P.Value  adj.P.Val
1665     Unknown  4.3887  9.5662  61.038 1.0938e-08 9.4449e-05
5422     Unknown -3.5251  6.9103 -35.908 1.7596e-07 3.5912e-04
4042     Unknown  2.5302  8.7497  35.112 1.9786e-07 3.5912e-04
3646     Unknown  2.3457 11.1678  33.962 2.3549e-07 3.5912e-04
2946     Unknown  2.3151 11.3153  32.689 2.8751e-07 3.5912e-04
624      Unknown -3.6256  6.8986 -31.777 3.3333e-07 3.5912e-04

          B
1665 9.8342
5422 8.1650
4042 8.0758
3646 7.9408
2946 7.7822
624  7.6622

I want to "collapse" this data frame into a new data.frame so that df$GeneName contains no duplicate GeneNames (e.g. sll1514) AND df$logFC contains the average of the df$logFC values for each GeneName that was duplicated.

I am aware of an inefficient strategy using loops, but I believe there should be a way to do this with the apply functions, or maybe plyr? I am not able to think of one at the moment.

Can you please help me? Any help is appreciated!

Thanks and Best Regards,
S.
David Winsemius
2010-Jan-31 23:22 UTC
[R] Using apply function on duplicates in a data.frame
On Jan 31, 2010, at 6:05 PM, Sunny Srivastava wrote:

> Dear R-Helpers,
> I have a data.frame (df) and the head of data.frame looks like
>
>      ProbeUID ControlType     ProbeName GeneName SystematicName
> 1665     1577           0 pSysX_50_22_1 pSysX_50       pSysX_50
> 5422     5147           0  pSysX_49_8_1 pSysX_49       pSysX_49
> 4042     3843           0 pSysX_51_18_1 pSysX_51       pSysX_51
> 3646     3466           0   sll1514_0_2  sll1514        sll1514
> 2946     2807           0   sll1514_0_1  sll1514        sll1514
> 624       582           0  pSysX_49_8_2 pSysX_49       pSysX_49
>
>      Description   logFC AveExpr       t    P.Value  adj.P.Val
> 1665     Unknown  4.3887  9.5662  61.038 1.0938e-08 9.4449e-05
> 5422     Unknown -3.5251  6.9103 -35.908 1.7596e-07 3.5912e-04
> 4042     Unknown  2.5302  8.7497  35.112 1.9786e-07 3.5912e-04
> 3646     Unknown  2.3457 11.1678  33.962 2.3549e-07 3.5912e-04
> 2946     Unknown  2.3151 11.3153  32.689 2.8751e-07 3.5912e-04
> 624      Unknown -3.6256  6.8986 -31.777 3.3333e-07 3.5912e-04
>
>           B
> 1665 9.8342
> 5422 8.1650
> 4042 8.0758
> 3646 7.9408
> 2946 7.7822
> 624  7.6622
>

    tdf <- tapply(df$logFC, df$GeneName, mean)
    ndf <- data.frame(Gnames = names(tdf), mn.logFC = tdf)

> I want to "collapse" this data frame into a new data.frame so that the
> df$GeneName contains no duplicate GeneNames (for eg: sll1514) AND the
> df$logFC contains the average of df$logFC corresponding to these
> GeneNames (which had duplicate genenames).
>
> I am aware of an inefficient strategy using loops, but I believe that
> there should be a way using Apply functions or may be plyr?
>
> I am not able to think of one at the moment. Can you please help me?

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
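
For anyone who wants to try the tapply() approach on the sample data, here is a minimal sketch. It rebuilds only the GeneName and logFC columns from the six rows printed in the question, so the other columns are deliberately left out. The as.vector() call is an addition for illustration: it stores the means as a plain numeric column instead of a 1-d array.

```r
# The six rows shown in the question, reduced to the two relevant columns.
df <- data.frame(
  GeneName = c("pSysX_50", "pSysX_49", "pSysX_51",
               "sll1514", "sll1514", "pSysX_49"),
  logFC    = c(4.3887, -3.5251, 2.5302, 2.3457, 2.3151, -3.6256),
  stringsAsFactors = FALSE
)

# One mean per distinct GeneName; the result is a named 1-d array.
tdf <- tapply(df$logFC, df$GeneName, mean)

# Turn the named array into a two-column data frame
# (as.vector() is an illustrative addition, see the note above).
ndf <- data.frame(Gnames = names(tdf), mn.logFC = as.vector(tdf),
                  stringsAsFactors = FALSE)
print(ndf)
```

On these six rows the result has four rows, one per gene: the two sll1514 probes average to 2.3304 and the two pSysX_49 probes to -3.57535. The genes that appear once keep their single logFC value.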
hadley wickham
2010-Feb-01 03:43 UTC
[R] Using apply function on duplicates in a data.frame
On Sun, Jan 31, 2010 at 5:05 PM, Sunny Srivastava <research.baba at gmail.com> wrote:

> Dear R-Helpers,
> I have a data.frame (df) and the head of data.frame looks like
>
>      ProbeUID ControlType     ProbeName GeneName SystematicName
> 1665     1577           0 pSysX_50_22_1 pSysX_50       pSysX_50
> 5422     5147           0  pSysX_49_8_1 pSysX_49       pSysX_49
> 4042     3843           0 pSysX_51_18_1 pSysX_51       pSysX_51
> 3646     3466           0   sll1514_0_2  sll1514        sll1514
> 2946     2807           0   sll1514_0_1  sll1514        sll1514
> 624       582           0  pSysX_49_8_2 pSysX_49       pSysX_49
>
>      Description   logFC AveExpr       t    P.Value  adj.P.Val
> 1665     Unknown  4.3887  9.5662  61.038 1.0938e-08 9.4449e-05
> 5422     Unknown -3.5251  6.9103 -35.908 1.7596e-07 3.5912e-04
> 4042     Unknown  2.5302  8.7497  35.112 1.9786e-07 3.5912e-04
> 3646     Unknown  2.3457 11.1678  33.962 2.3549e-07 3.5912e-04
> 2946     Unknown  2.3151 11.3153  32.689 2.8751e-07 3.5912e-04
> 624      Unknown -3.6256  6.8986 -31.777 3.3333e-07 3.5912e-04
>
>           B
> 1665 9.8342
> 5422 8.1650
> 4042 8.0758
> 3646 7.9408
> 2946 7.7822
> 624  7.6622
>
> I want to "collapse" this data frame into a new data.frame so that the
> df$GeneName contains no duplicate GeneNames (for eg: sll1514) AND the
> df$logFC contains the average of df$logFC corresponding to these GeneNames
> (which had duplicate genenames).

    library(plyr)
    ddply(df, "GeneName", summarise, logFC = mean(logFC))

Hadley

-- 
http://had.co.nz/
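
To see what the ddply() call returns, here is a small sketch on the same six rows from the question, again keeping only the GeneName and logFC columns. The requireNamespace() guard and the name `collapsed` are additions for illustration, on the assumption that plyr may not be installed on every machine. They are not part of the suggestion above.

```r
# The six rows shown in the question, reduced to the two relevant columns.
df <- data.frame(
  GeneName = c("pSysX_50", "pSysX_49", "pSysX_51",
               "sll1514", "sll1514", "pSysX_49"),
  logFC    = c(4.3887, -3.5251, 2.5302, 2.3457, 2.3151, -3.6256),
  stringsAsFactors = FALSE
)

# Only run the plyr step when the package is available (illustrative guard).
if (requireNamespace("plyr", quietly = TRUE)) {
  # Split df by GeneName, compute the mean logFC of each piece,
  # and bind the per-gene results back into one data frame.
  collapsed <- plyr::ddply(df, "GeneName", plyr::summarise,
                           logFC = mean(logFC))
  print(collapsed)
}
```

The result has one row per GeneName with columns GeneName and logFC, so it is already a data frame and needs no separate conversion step. More summary columns can be added as further named arguments, for example `AveExpr = mean(AveExpr)`, provided those columns are present in df.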