Michael R. Head
2008-Aug-12 08:47 UTC
[R] Quickly calculating the mean results over a collection of data sets?
I have a collection of datasets in separate data frames which have 3
independent test parameters (w, x, y) and one dependent variable (z),
together with some additional static test data on each row. What I want
is a data frame which contains the test data, the parameters (w, x, y)
and the mean value of all (z)s in the Z column.
Each dataset has around 6000 rows and around 7 columns, which doesn't
seem outrageously large, so it seems like this shouldn't be too time
consuming, but the way I've been approaching it seems to take way too
long (20 seconds for datasets over 4 runs, longer for my datasets over
10 runs).
My imperative-coding brain led me to use for loops, which seem to be
particularly problematic for R performance. My first attempt at this
looked like the following, which takes roughly 60 seconds to complete. I
rewrote it a little, but the code was much longer and effectively
replaces one of the for loops with an lapply(). I could paste the other
code, but it's much longer and less clear about its intent.
#######################
# Start code snippet
#######################
### inputFiles just a list of paths to the test runs
testRuns <- lapply(inputFiles,
                   function(x) {
                       read.table(x, header=TRUE)})
### W, X, Y have (small) natural values
w <- unique(testRuns[[1]]$W)
x <- unique(testRuns[[1]]$X)
y <- unique(testRuns[[1]]$Y)
### All runs have the same values for all columns
### with the exception of the Z values, so just
### copy the first test run data
testMeans <- data.frame(testRuns[[1]])
for (w0 in w) {
    for (y0 in y) {
        for (x0 in x) {
            ### Row of testMeans holding this (W, Y, X) combination
            row <- which(testMeans$W == w0 &
                         testMeans$Y == y0 &
                         testMeans$X == x0)
            ### Mean Z for this combination within each run
            meanValues <- sapply(testRuns,
                                 function(r) {
                                     mean(subset(r,
                                                 r$W == w0 &
                                                 r$Y == y0 &
                                                 r$X == x0)$Z)})
            testMeans[row,]$Z = mean(meanValues)
        }
    }
}
### I will then want to plot certain values over (X, Z),
### so ultimately, I'm going to subset the data further.
### Code which gives me a list of W tables with mean Z values
### works, too.
#######################
# End code snippet
#######################
Thanks,
mike
--
Michael R. Head <burner at suppressingfire.org>
http://www.cs.binghamton.edu/~mike/
Dan Davison
2008-Aug-12 11:04 UTC
[R] Quickly calculating the mean results over a collection of data sets?
On Tue, Aug 12, 2008 at 04:47:14AM -0400, Michael R. Head wrote:
> I have a collection of datasets in separate data frames which have 3
> independent test parameters (w, x, y) and one dependent variable (z),
> together with some additional static test data on each row. What I want
> is a data frame which contains the test data, the parameters (w, x, y)
> and the mean value of all (z)s in the Z column.

Hi Michael,

> ### inputFiles just a list of paths to the test runs
> testRuns <- lapply(inputFiles,
>                    function(x) {
>                        read.table(x, header=TRUE)})

(Just BTW, lapply(inputFiles, read.table, header=TRUE) is slightly nicer
to look at.)

> ### W, X, Y have (small) natural values
> w <- unique(testRuns[[1]]$W)
> x <- unique(testRuns[[1]]$X)
> y <- unique(testRuns[[1]]$Y)
>
> ### All runs have the same values for all columns
> ### with the exception of the Z values, so just
> ### copy the first test run data
> testMeans <- data.frame(testRuns[[1]])

How about rbind()ing all the data frames together, and working with the
combined data frame?

Say that testRuns is

> testRuns
[[1]]
  W X Y          Z
1 1 5 5 -0.5251156
2 5 1 3  1.1761139
3 2 4 4 -0.8934380
4 5 1 1  1.4076303
5 5 3 1  0.4679745

[[2]]
  W X Y          Z
1 1 5 5 -0.8556862
2 5 1 3  0.3517671
3 2 4 4 -1.0202064
4 5 1 1  1.2152349
5 5 3 1  0.4340249

> allRuns <- do.call("rbind", testRuns)
> aggregate(allRuns$Z, by=allRuns[c("W","X","Y")], mean)
  W X Y          x
1 5 1 1  1.3114326
2 5 3 1  0.4509997
3 5 1 3  0.7639405
4 2 4 4 -0.9568222
5 1 5 5 -0.6904009

Dan

--
www.stats.ox.ac.uk/~davison
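
For anyone wanting to try the rbind()/aggregate() suggestion end to end, here is a minimal sketch on made-up data. The parameter grid, the static column name "Label", and the number of runs are assumptions for illustration only, not the original poster's data. It stacks the runs, averages Z per (W, X, Y) combination with the formula interface of aggregate(), re-attaches the static columns from the first run with merge(), and finally splits the result by W, since a list of per-W tables was mentioned as acceptable.

```r
# Hypothetical stand-in for the real test runs: 3 runs that share the same
# (W, X, Y) grid and one static column, and differ only in Z.
set.seed(1)
params <- expand.grid(W = 1:2, X = 1:3, Y = 1:2)
params$Label <- paste0("case", seq_len(nrow(params)))  # assumed static data
testRuns <- lapply(1:3, function(i) cbind(params, Z = rnorm(nrow(params))))

# Stack all runs into one data frame, then average Z within each
# (W, X, Y) combination.
allRuns <- do.call(rbind, testRuns)
zMeans <- aggregate(Z ~ W + X + Y, data = allRuns, FUN = mean)

# Re-attach the static columns, taken from the first run (all columns but Z).
static <- testRuns[[1]][, setdiff(names(testRuns[[1]]), "Z")]
testMeans <- merge(static, zMeans, by = c("W", "X", "Y"))

# One table per W value, e.g. for later plots of Z against X.
byW <- split(testMeans, testMeans$W)
```

Note that this pools all rows before averaging. That matches the mean of per-run means from the original loop only when every run contributes the same number of rows to each (W, X, Y) combination, which appears to be the case described in the question.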