I need some help summarizing complex data frames (small example below):

   m1_1 m2_1 m3_1 m1_2 m2_2 m3_2
i1    1    1    1    2    2    2
i1    2    1    1    2    2    2
i2    2    2    1    2    2    2

For an arbitrary number of columns (say m1 ... m199) where the column names have variable patterns, and such that each set of columns is repeated (with potentially unique data) an arbitrary number of times (say _1 ... _1000), I would like to summarize by row the mean values of (m1, m2, m3, ... m199) over all replicates (_1, _2, _3, ... _1000). I need to do this with a large number of dataframes of variable nrow, ncolumn, and colnames.

I've tried various loops creating new dataframes and reassigning cell values in loops or using rbind and bind, but run into trouble in each case.

Any ideas?

Thanks,
Chris
Hi,

On Wed, Jan 11, 2012 at 3:55 PM, Christopher G Oakley <coakley at bio.fsu.edu> wrote:
> I need some help summarizing complex data frames (small example below):
>
>    m1_1 m2_1 m3_1 m1_2 m2_2 m3_2
> i1    1    1    1    2    2    2
> i1    2    1    1    2    2    2
> i2    2    2    1    2    2    2
>
> For an arbitrary number of columns (say m1 ... m199) where the column names have variable patterns,
>
> and such that each set of columns is repeated (with potentially unique data) an arbitrary number of times (say _1 ... _1000),
[snip]

Perhaps your job would be easier if you change the layout of your data frame, for instance you can have "experiment.name" and "replicate" columns, so your "clean" data.frame would look like:

experiment.name  replicate  region  count
m1               1          i1      1
m2               1          i1      1
m3               1          i1      1
...

You can use the reshape (or reshape2) package to help you whip your old table into a new one using a formula interface, if you like.

You can then use your favorite split-apply-combine[1] method (via plyr, data.table, sqldf, or even base::tapply) to calculate summary statistics over the values of interest in each group/subgroup, whatever.

HTH,
-steve

[1] The Split-Apply-Combine Strategy for Data Analysis: http://www.jstatsoft.org/v40/i01

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
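The layout change and split-apply-combine step suggested above can be sketched in base R alone, without the reshape package. This is only an illustration on the toy data from the original post: the column names `experiment.name`, `replicate`, `region` and `count` follow the layout shown above, while `row` is a hypothetical helper column added here so that the two "i1" rows stay distinct.

```r
# Toy data from the original post. `region` holds the i1/i1/i2 labels;
# the post does not give that column a name, so this is an assumption.
wide <- data.frame(
  region = c("i1", "i1", "i2"),
  m1_1 = c(1, 2, 2), m2_1 = c(1, 1, 2), m3_1 = c(1, 1, 1),
  m1_2 = c(2, 2, 2), m2_2 = c(2, 2, 2), m3_2 = c(2, 2, 2)
)

# Measurement columns are assumed to end in "_<digits>".
meas <- grep("_[0-9]+$", names(wide), value = TRUE)

# stack() concatenates the measurement columns into `values` and records
# the source column name in `ind`.
stacked <- stack(wide[meas])
key <- as.character(stacked$ind)

long <- data.frame(
  experiment.name = sub("_[0-9]+$", "", key),          # "m1_2" -> "m1"
  replicate       = as.integer(sub("^.*_", "", key)),  # "m1_2" -> 2
  region          = rep(wide$region, times = length(meas)),
  row             = rep(seq_len(nrow(wide)), times = length(meas)),
  count           = stacked$values
)

# Split-apply-combine: mean count per original row and experiment,
# taken over all replicates.
means <- aggregate(count ~ row + region + experiment.name,
                   data = long, FUN = mean)
print(means)
```

Once the data are in this long form, the same `aggregate()` call works regardless of how many experiments or replicates a given data frame contains.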
Well, if I understand what you want to do, it's straightforward: ?"[" (pay attention to the use of column names) and ?grep would pick out the columns you want, and you could then use mapply or maybe rowMeans or whatever to get your summaries.

HOWEVER ... I think what you really should do is use a more appropriate data structure. What seems more natural to me is to convert from "wide" to "long" format so that you would end up with 3 columns: Result, ID, Rep. The Result would be the value, the ID your m1, m2, etc. and Rep your _1, _2, _3, etc.

Again, this appears to be easy: ?unlist would get you the vector of Results and either ?strsplit or grep would get you all the IDs and reps, each of which just has to be repped the number of rows of your frame. Alternatively, ?reshape in base R or the reshape package can probably do it for you.

Once you have a more R friendly data structure, it will be much easier for you to work with your data.

Finally, you may wish to post your query on a more relevant list (e.g. geo or ecology or whatever your data are) as folks there may have better ideas for what a more "R friendly data structure" should be.

Cheers,
Bert

On Wed, Jan 11, 2012 at 12:55 PM, Christopher G Oakley <coakley at bio.fsu.edu> wrote:
> I need some help summarizing complex data frames (small example below):
>
>    m1_1 m2_1 m3_1 m1_2 m2_2 m3_2
> i1    1    1    1    2    2    2
> i1    2    1    1    2    2    2
> i2    2    2    1    2    2    2
>
> For an arbitrary number of columns (say m1 ... m199) where the column names have variable patterns,
>
> and such that each set of columns is repeated (with potentially unique data) an arbitrary number of times (say _1 ... _1000),
>
> I would like to summarize by row the mean values of (m1, m2, m3, ... m199) over all replicates (_1, _2, _3, ... _1000). I need to do this with a large number of dataframes of variable nrow, ncolumn, and colnames.
>
> I've tried various loops creating new dataframes and reassigning cell values in loops or using rbind and bind, but run into trouble in each case.
>
> Any ideas?
>
> Thanks,
>
> Chris
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Bert Gunter
Genentech Nonclinical Biostatistics
Internal Contact Info: Phone: 467-7374
Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
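The unlist/strsplit route to a long Result, ID, Rep structure described in Bert's reply might look roughly like the sketch below. It assumes that every column is a measurement column and that each name contains exactly one underscore separating the ID from the replicate number. The `Row` column is not part of Bert's description; it is added here as an assumption so that means can still be taken per original row.

```r
# Toy data from the original post (measurement columns only).
dat <- data.frame(
  m1_1 = c(1, 2, 2), m2_1 = c(1, 1, 2), m3_1 = c(1, 1, 1),
  m1_2 = c(2, 2, 2), m2_2 = c(2, 2, 2), m3_2 = c(2, 2, 2)
)

# Split each column name at the underscore: "m1_2" -> c("m1", "2").
parts  <- strsplit(names(dat), "_", fixed = TRUE)
id     <- sapply(parts, `[`, 1)
rep_no <- as.integer(sapply(parts, `[`, 2))

n <- nrow(dat)
long <- data.frame(
  Result = unlist(dat, use.names = FALSE),     # values, column by column
  ID     = rep(id, each = n),                  # repeated once per row
  Rep    = rep(rep_no, each = n),
  Row    = rep(seq_len(n), times = ncol(dat))  # assumption: original row index
)

# Mean of Result for each original row and each ID, over all Rep values.
means <- tapply(long$Result, list(Row = long$Row, ID = long$ID), mean)
print(means)
```

The result is a matrix with one row per original row and one column per ID, which is the by-row summary asked for in the original post.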
On Jan 11, 2012, at 3:55 PM, Christopher G Oakley wrote:

> I need some help summarizing complex data frames (small example below):
>
>    m1_1 m2_1 m3_1 m1_2 m2_2 m3_2
> i1    1    1    1    2    2    2
> i1    2    1    1    2    2    2
> i2    2    2    1    2    2    2
>
> For an arbitrary number of columns (say m1 ... m199) where the column names have variable patterns,
>
> and such that each set of columns is repeated (with potentially unique data) an arbitrary number of times (say _1 ... _1000),
>
> I would like to summarize by row the mean values of (m1, m2, m3, ... m199) over all replicates (_1, _2, _3, ... _1000). I need to do this with a large number of dataframes of variable nrow, ncolumn, and colnames.

Something along the lines of this untested code:

    # The pattern is anchored as "^<stem>_" so that "m1" does not also
    # pick up "m10_1", "m11_1", and so on.
    sapply(unique(sub("_.+$", "", names(dfrm))),
           function(x) rowMeans(
             dfrm[ , grep(paste0("^", x, "_"), names(dfrm)), drop = FALSE] ) )

Post a reproducible example and we can test it.

> I've tried various loops creating new dataframes and reassigning cell values in loops or using rbind and bind, but run into trouble in each case.
>
> Any ideas?
>
> Thanks,
>
> Chris

David Winsemius, MD
West Hartford, CT
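For anyone who wants to try the sapply/rowMeans idea on the toy data from the original post, a self-contained sketch follows. The `id` column name is an assumption, since the post shows the labels i1, i1, i2 without a header. Column stems are compared for exact equality rather than with a bare grep(), which matters once names such as m1 and m10 coexist.

```r
# Toy data from the original post; `id` is a hypothetical column name.
dfrm <- data.frame(
  id   = c("i1", "i1", "i2"),
  m1_1 = c(1, 2, 2), m2_1 = c(1, 1, 2), m3_1 = c(1, 1, 1),
  m1_2 = c(2, 2, 2), m2_2 = c(2, 2, 2), m3_2 = c(2, 2, 2)
)

# Measurement columns are assumed to end in "_<digits>".
meas  <- grep("_[0-9]+$", names(dfrm), value = TRUE)
stems <- sub("_[0-9]+$", "", meas)   # "m1_2" -> "m1"

# For each distinct stem, average its replicate columns row by row.
row_means <- sapply(unique(stems), function(s) {
  cols <- meas[stems == s]           # exact match on the stem
  rowMeans(dfrm[, cols, drop = FALSE])
})

result <- data.frame(id = dfrm$id, row_means)
print(result)
```

Wrapping the last few statements in a function would let the same code run over a list of data frames with lapply(), which fits the stated need to process many frames of varying shape.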