thr3ads.net - R help - [R] Quickly calculating the mean results over a collection of data sets? [Aug 2008]

Michael R. Head

2008-Aug-12 08:47 UTC

[R] Quickly calculating the mean results over a collection of data sets?

I have a collection of datasets in separate data frames which have 3
independent test parameters (w, x, y) and one dependent variable (z),
together with some additional static test data on each row. What I want
is a data frame which contains the test data, the parameters (w, x, y)
and the mean value of all (z)s in the Z column.

Each dataset has around 6000 rows and around 7 columns, which doesn't
seem outrageously large, so it seems like this shouldn't be too time
consuming, but the way I've been approaching it seems to take way too
long (20 seconds for datasets over 4 runs, longer for my datasets over
10 runs).

My imperative-coding brain led me to use for loops, which seem to be
particularly problematic for R performance. My first attempt at this
looked like the following, which takes roughly 60 seconds to complete. I
rewrote it a little, but the code was much longer and effectively
replaces one of the for loops with an lapply(). I could paste the other
code, but it's much longer and less clear about its intent.


#######################
# Start code snippet
#######################
### inputFiles just a list of paths to the test runs
testRuns <- lapply(inputFiles, 
		function(x) {
			read.table(x, header=TRUE)})

### W, X, Y have (small) natural values
w <- unique(testRuns[[1]]$W)
x <- unique(testRuns[[1]]$X)
y <- unique(testRuns[[1]]$Y)

### All runs have the same values for all columns
### with the exception of the Z values, so just
### copy the first test run data
testMeans <- data.frame(testRuns[[1]])
for(w0 in w) {
   for(y0 in y) {
     for (x0 in x) {
       row <- which(testMeans$W == w0 &
                    testMeans$Y == y0 &
                    testMeans$X == x0)
       meanValues <- sapply(testRuns,
                            function(r)
                            {mean( subset(r,
                                          r$W == w0 &
                                          r$Y == y0 &
                                          r$X == x0)$Z )})
       testMeans[row,]$Z = mean(meanValues)
     }
   }
 }
### I will then want to plot certain values over (X, Z),
### so ultimately, I'm going to subset the data further.
### Code which gives me a list of W tables with mean Z values
### works, too.
#######################
# End code snippet
#######################


Thanks,
mike

-- 
Michael R. Head <burner at suppressingfire.org>
http://www.cs.binghamton.edu/~mike/

Dan Davison

2008-Aug-12 11:04 UTC

On Tue, Aug 12, 2008 at 04:47:14AM -0400, Michael R. Head wrote:
> I have a collection of datasets in separate data frames which have 3
> independent test parameters (w, x, y) and one dependent variable (z),
> together with some additional static test data on each row. What I want
> is a data frame which contains the test data, the parameters (w, x, y)
> and the mean value of all (z)s in the Z column.
> 
> Each dataset has around 6000 rows and around 7 columns, which doesn't
> seem outrageously large, so it seems like this shouldn't be too time
> consuming, but the way I've been approaching it seems to take way too
> long (20 seconds for datasets over 4 runs, longer for my datasets over
> 10 runs).
> 
> My imperative-coding brain led me to use for loops, which seem to be
> particularly problematic for R performance. My first attempt at this
> looked like the following, which takes roughly 60 seconds to complete. I
> rewrote it a little, but the code was much longer and effectively
> replaces one of the for loops with an lapply(). I could paste the other
> code, but it's much longer and less clear about its intent.
> 
Hi Michael,
> #######################
> # Start code snippet
> #######################
> ### inputFiles just a list of paths to the test runs
> testRuns <- lapply(inputFiles, 
> 		function(x) {
> 			read.table(x, header=TRUE)})
(Just BTW, lapply(inputFiles, read.table, header=TRUE) is slightly nicer
to look at.)
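
To make the shorthand concrete, here is a small sketch comparing the two forms. The two tiny tables and their temporary file paths are made up for illustration; only the name inputFiles and the read.table() call come from the post. Arguments given to lapply() after the function are forwarded to each read.table() call, so both forms should read the same data.

```r
# Hypothetical data: two small runs written to temporary files.
run1 <- data.frame(W = c(1, 5), X = c(5, 1), Y = c(5, 3), Z = c(-0.5, 1.2))
run2 <- data.frame(W = c(1, 5), X = c(5, 1), Y = c(5, 3), Z = c(-0.9, 0.4))
inputFiles <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
write.table(run1, inputFiles[1], row.names = FALSE)
write.table(run2, inputFiles[2], row.names = FALSE)

# Long form, as in the original post.
longForm <- lapply(inputFiles, function(x) {
  read.table(x, header = TRUE)
})

# Short form: header = TRUE is passed through to read.table().
testRuns <- lapply(inputFiles, read.table, header = TRUE)

# Both forms produce the same list of data frames.
sameResult <- identical(longForm, testRuns)
```
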
> 
> ### W, X, Y have (small) natural values
> w <- unique(testRuns[[1]]$W)
> x <- unique(testRuns[[1]]$X)
> y <- unique(testRuns[[1]]$Y)
> 
> ### All runs have the same values for all columns
> ### with the exception of the Z values, so just
> ### copy the first test run data
> testMeans <- data.frame(testRuns[[1]])
How about rbind()ing all the data frames together, and working with
the combined data frame? Say that testRuns is
> testRuns
[[1]]
  W X Y          Z
1 1 5 5 -0.5251156
2 5 1 3  1.1761139
3 2 4 4 -0.8934380
4 5 1 1  1.4076303
5 5 3 1  0.4679745

[[2]]
  W X Y          Z
1 1 5 5 -0.8556862
2 5 1 3  0.3517671
3 2 4 4 -1.0202064
4 5 1 1  1.2152349
5 5 3 1  0.4340249

> allRuns <- do.call("rbind", testRuns)
> aggregate(allRuns$Z, by=allRuns[c("W","X","Y")], mean)
  W X Y          x
1 5 1 1  1.3114326
2 5 3 1  0.4509997
3 5 1 3  0.7639405
4 2 4 4 -0.9568222
5 1 5 5 -0.6904009

Dan
> for(w0 in w) {
>    for(y0 in y) {
>      for (x0 in x) {
>        row <- which(testMeans$W == w0 &
>                     testMeans$Y == y0 &
>                     testMeans$X == x0)
>        meanValues <- sapply(testRuns,
>                             function(r)
>                             {mean( subset(r,
>                                           r$W == w0 &
>                                           r$Y == y0 &
>                                           r$X == x0)$Z )})
>        testMeans[row,]$Z = mean(meanValues)
>      }
>    }
>  }
> ### I will then want to plot certain values over (X, Z),
> ### so ultimately, I'm going to subset the data further.
> ### Code which gives me a list of W tables with mean Z values
> ### works, too.
> #######################
> # End code snippet
> #######################
> 
> 
> Thanks,
> mike
> 
> -- 
> Michael R. Head <burner at suppressingfire.org>
> http://www.cs.binghamton.edu/~mike/
> 
-- 
www.stats.ox.ac.uk/~davison
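
The original question also asked for the static test data to stay alongside the mean Z values. One way this might be done, sketched here on made-up data, is to aggregate as suggested above and then merge() the means back onto the non-Z columns of the first run. The simulated runs, the Label column and the names params, meanZ and static are assumptions for illustration, not part of the thread. Passing allRuns["Z"] (a one-column data frame) instead of allRuns$Z keeps the result column named Z rather than x.

```r
# Made-up runs: identical W, X, Y and static column in each run, different Z.
set.seed(1)
params <- expand.grid(W = 1:2, X = 1:3, Y = 1:2)
params$Label <- paste0("case", seq_len(nrow(params)))  # hypothetical static column
testRuns <- lapply(1:4, function(i) {
  run <- params
  run$Z <- rnorm(nrow(run))
  run
})

# Stack all runs and average Z for each (W, X, Y) combination.
allRuns <- do.call("rbind", testRuns)
meanZ <- aggregate(allRuns["Z"], by = allRuns[c("W", "X", "Y")], FUN = mean)

# Reattach the static columns, taken from the first run without its Z.
static <- testRuns[[1]][, setdiff(names(testRuns[[1]]), "Z")]
testMeans <- merge(static, meanZ, by = c("W", "X", "Y"))
```

With one row per (W, X, Y) combination in every run, the pooled mean computed here equals the mean of the per-run means from the loop version. If runs had different numbers of rows per combination, the two would differ.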
