Michael R. Head
2008-Aug-12 08:47 UTC
[R] Quickly calculating the mean results over a collection of data sets?
I have a collection of datasets in separate data frames which have 3
independent test parameters (w, x, y) and one dependent variable (z),
together with some additional static test data on each row. What I want
is a data frame which contains the test data, the parameters (w, x, y)
and the mean value of all (z)s in the Z column.

Each dataset has around 6000 rows and around 7 columns, which doesn't
seem outrageously large, so it seems like this shouldn't be too time
consuming, but the way I've been approaching it seems to take way too
long (20 seconds for datasets over 4 runs, longer for my datasets over
10 runs).

My imperative-coding brain led me to use for loops, which seem to be
particularly problematic for R performance. My first attempt at this
looked like the following, which takes roughly 60 seconds to complete. I
rewrote it a little, but the code was much longer and effectively
replaces one of the for loops with an lapply(). I could paste the other
code, but it's much longer and less clear about its intent.

#######################
# Start code snippet
#######################
### inputFiles just a list of paths to the test runs
testRuns <- lapply(inputFiles,
                   function(x) {
                     read.table(x, header=TRUE)})

### W, X, Y have (small) natural values
w <- unique(testRuns[[1]]$W)
x <- unique(testRuns[[1]]$X)
y <- unique(testRuns[[1]]$Y)

### All runs have the same values for all columns
### with the exception of the Z values, so just
### copy the first test run data
testMeans <- data.frame(testRuns[[1]])

for(w0 in w) {
  for(y0 in y) {
    for (x0 in x) {
      row <- which(testMeans$W == w0 &
                   testMeans$Y == y0 &
                   testMeans$X == x0)
      meanValues <- sapply(testRuns,
                           function(r)
                           {mean( subset(r,
                                         r$W == w0 &
                                         r$Y == y0 &
                                         r$X == x0)$Z )})
      testMeans[row,]$Z = mean(meanValues)
    }
  }
}
### I will then want to plot certain values over (X, Z),
### so ultimately, I'm going to subset the data further.
### Code which gives me a list of W tables with mean Z values
### works, too.
#######################
# End code snippet
#######################

Thanks,
mike

--
Michael R. Head <burner at suppressingfire.org>
http://www.cs.binghamton.edu/~mike/
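The post says every run has the same values in all columns except Z. If, in addition, the rows appear in the same order in every run and each (W, X, Y) combination occurs once per run (both are assumptions, not stated in the post), the triple loop reduces to a row-wise mean. A minimal sketch on made-up data; the toy `params` grid and the `keys` vector are illustrative names only:

```r
# Toy stand-in for the poster's testRuns: three runs with identical
# W, X, Y rows in identical order, differing only in Z.
set.seed(42)
params <- expand.grid(W = 1:2, X = 1:3, Y = 1:2)
testRuns <- lapply(1:3, function(i) {
  data.frame(params, Z = rnorm(nrow(params)))
})

# Guard the assumption: every run must match the first run on W, X, Y,
# row by row. If this fails, the shortcut below is not valid.
keys <- c("W", "X", "Y")
aligned <- all(sapply(testRuns, function(r) {
  isTRUE(all.equal(r[keys], testRuns[[1]][keys], check.attributes = FALSE))
}))
stopifnot(aligned)

# One column of Z values per run, then one mean per row.
zMatrix <- sapply(testRuns, function(r) r$Z)
testMeans <- testRuns[[1]]
testMeans$Z <- rowMeans(zMatrix)
```

Because the first run is copied whole, any static columns come along unchanged. If the alignment check fails, a grouping approach that matches rows by W, X and Y is the safer route.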
Dan Davison
2008-Aug-12 11:04 UTC
[R] Quickly calculating the mean results over a collection of data sets?
On Tue, Aug 12, 2008 at 04:47:14AM -0400, Michael R. Head wrote:
> I have a collection of datasets in separate data frames which have 3
> independent test parameters (w, x, y) and one dependent variable (z),
> together with some additional static test data on each row. What I want
> is a data frame which contains the test data, the parameters (w, x, y)
> and the mean value of all (z)s in the Z column.
>
> Each dataset has around 6000 rows and around 7 columns, which doesn't
> seem outrageously large, so it seems like this shouldn't be too time
> consuming, but the way I've been approaching it seems to take way too
> long (20 seconds for datasets over 4 runs, longer for my datasets over
> 10 runs).
>
> My imperative-coding brain led me to use for loops, which seem to be
> particularly problematic for R performance. My first attempt at this
> looked like the following, which takes roughly 60 seconds to complete. I
> rewrote it a little, but the code was much longer and effectively
> replaces one of the for loops with an lapply(). I could paste the other
> code, but it's much longer and less clear about its intent.
>

Hi Michael,

> #######################
> # Start code snippet
> #######################
> ### inputFiles just a list of paths to the test runs
> testRuns <- lapply(inputFiles,
>                    function(x) {
>                      read.table(x, header=TRUE)})

(Just BTW lapply(inputFiles, read.table, header=TRUE) is slightly
nicer to look at)

>
> ### W, X, Y have (small) natural values
> w <- unique(testRuns[[1]]$W)
> x <- unique(testRuns[[1]]$X)
> y <- unique(testRuns[[1]]$Y)
>
> ### All runs have the same values for all columns
> ### with the exception of the Z values, so just
> ### copy the first test run data
> testMeans <- data.frame(testRuns[[1]])

How about rbind()ing all the data frames together, and working with
the combined data frame? Say that testRuns is

> testRuns
[[1]]
  W X Y          Z
1 1 5 5 -0.5251156
2 5 1 3  1.1761139
3 2 4 4 -0.8934380
4 5 1 1  1.4076303
5 5 3 1  0.4679745

[[2]]
  W X Y          Z
1 1 5 5 -0.8556862
2 5 1 3  0.3517671
3 2 4 4 -1.0202064
4 5 1 1  1.2152349
5 5 3 1  0.4340249

> allRuns <- do.call("rbind", testRuns)
> aggregate(allRuns$Z, by=allRuns[c("W","X","Y")], mean)
  W X Y          x
1 5 1 1  1.3114326
2 5 3 1  0.4509997
3 5 1 3  0.7639405
4 2 4 4 -0.9568222
5 1 5 5 -0.6904009

Dan

> for(w0 in w) {
>   for(y0 in y) {
>     for (x0 in x) {
>       row <- which(testMeans$W == w0 &
>                    testMeans$Y == y0 &
>                    testMeans$X == x0)
>       meanValues <- sapply(testRuns,
>                            function(r)
>                            {mean( subset(r,
>                                          r$W == w0 &
>                                          r$Y == y0 &
>                                          r$X == x0)$Z )})
>       testMeans[row,]$Z = mean(meanValues)
>     }
>   }
> }
> ### I will then want to plot certain values over (X, Z),
> ### so ultimately, I'm going to subset the data further.
> ### Code which gives me a list of W tables with mean Z values
> ### works, too.
> #######################
> # End code snippet
> #######################
>
>
> Thanks,
> mike
>
> --
> Michael R. Head <burner at suppressingfire.org>
> http://www.cs.binghamton.edu/~mike/

--
www.stats.ox.ac.uk/~davison
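A self-contained sketch of the rbind()-then-aggregate() idea from the reply above, extended to carry the static columns into the result. The toy data, the `Static` column name and the `makeRun` helper are assumptions for illustration and do not come from the thread. The formula form of aggregate() is used here so the result column is named Z rather than x.

```r
# Toy data standing in for the poster's runs. "Static" is a hypothetical
# column representing the "additional static test data" on each row.
set.seed(1)
params <- expand.grid(W = 1:2, X = 1:3, Y = 1:2)
makeRun <- function() {
  data.frame(params, Static = "cfgA", Z = rnorm(nrow(params)))
}
testRuns <- list(makeRun(), makeRun(), makeRun())

# Stack all runs, then average Z within each (W, X, Y) combination.
allRuns <- do.call("rbind", testRuns)
meanZ <- aggregate(Z ~ W + X + Y, data = allRuns, FUN = mean)

# Re-attach the static columns from the first run (dropping its own Z),
# matching rows on the three parameters.
first <- testRuns[[1]]
testMeans <- merge(first[setdiff(names(first), "Z")], meanZ,
                   by = c("W", "X", "Y"))
```

Note that this averages over all stacked rows, whereas the original loop takes the mean of per-run means. The two agree when every run contributes the same number of rows to each (W, X, Y) combination, which the post suggests is the case; with unbalanced runs they can differ.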