I'm looking for a way to improve code that's proven to be inefficient.

Suppose that a data source generates the following table every minute:

  Index  Count
  ------------
      0    234
      1    120
      7     11
     30      1

I save the tables in the following CSV format:

  time,index,count
  0,0:1:7:30,234:120:11:1
  1,0:2:3:19,199:110:87:9

That is, each line represents a table, and I have N lines for N minutes
of data collection.

Now, I wrote the following code to get quantiles for each time period:

  library(Hmisc)
  # stringsAsFactors = FALSE keeps the index/count columns as character
  # strings, so strsplit() can work on them directly
  stbl  <- read.csv("data.csv", stringsAsFactors = FALSE)
  index <- lapply(strsplit(stbl$index, ":", fixed = TRUE), as.numeric)
  count <- lapply(strsplit(stbl$count, ":", fixed = TRUE), as.numeric)
  len <- length(index)
  for (i in 1:len) {
    v <- wtd.quantile(index[[i]], count[[i]], c(0, 0.2, 0.5, 0.8, 1))
    stbl$q0[i]  <- v[1]
    stbl$q2[i]  <- v[2]
    stbl$q5[i]  <- v[3]
    stbl$q8[i]  <- v[4]
    stbl$q10[i] <- v[5]
  }

It works fine for a small N, but it quickly becomes inefficient as N
grows: the for-loop takes too long. How could I improve the code or the
data representation so it can run fast?

Thanks,
Seung
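[For reference, a minimal self-contained version of the setup above, not part of the original post: it builds the two example rows in memory instead of reading data.csv, so the suggestions below can be tried directly. The object names match Seung's code.]

  ## In-memory stand-in for data.csv, using the two example rows from the
  ## post, parsed the same way as in the code above
  stbl <- data.frame(time  = c(0, 1),
                     index = c("0:1:7:30", "0:2:3:19"),
                     count = c("234:120:11:1", "199:110:87:9"),
                     stringsAsFactors = FALSE)
  index <- lapply(strsplit(stbl$index, ":", fixed = TRUE), as.numeric)
  count <- lapply(strsplit(stbl$count, ":", fixed = TRUE), as.numeric)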
One thing to do is to use Rprof() on your script so that you can
determine where time is being spent.  My guess is that most of the time
is in the wtd.quantile() function.

If your counts don't get too big, another way is to use quantile()
directly on the expanded data:

  > Index <- c(0, 1, 7, 30)
  > Count <- c(234, 120, 11, 1)
  > rep.int(Index, times = Count)
    [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   [33] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   [65] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   [97] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  [129] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  [161] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  [193] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  [225] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [257] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [289] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [321] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [353] 1 1 7 7 7 7 7 7 7 7 7 7 7 30
  > quantile(rep.int(Index, times = Count), prob = c(0, .2, .5, .8, 1))
    0%  20%  50%  80% 100%
     0    0    0    1   30

Try both solutions and see which is faster.

On Jan 7, 2008 6:49 PM, Seung Jun <seungwjun at gmail.com> wrote:
> I'm looking for a way to improve code that's proven to be inefficient.
> [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
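[A minimal sketch, not from the original reply, of how the rep.int()/quantile() idea could replace the per-row loop. It assumes the 'index' and 'count' lists and the 'stbl' data frame from Seung's code; 'probs' and 'qmat' are just names used here. Results should essentially match wtd.quantile() when the counts are frequencies.]

  probs <- c(0, 0.2, 0.5, 0.8, 1)

  ## expand each row's counts into repeated index values and take plain
  ## quantile(); t() gives one row of quantiles per time period
  qmat <- t(mapply(function(ix, ct)
                     quantile(rep.int(ix, times = ct), probs = probs),
                   index, count))

  stbl[, c("q0", "q2", "q5", "q8", "q10")] <- qmat

  ## Rprof() can confirm where the time actually goes, e.g.
  ##   Rprof("prof.out"); <code to profile>; Rprof(NULL)
  ##   summaryRprof("prof.out")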
On Mon, 7 Jan 2008, Seung Jun wrote:

> I'm looking for a way to improve code that's proven to be inefficient.

Jim was probably right on both counts (use Rprof and expect
wtd.quantile to be the place where the time is being spent).

If following his advice doesn't get you what you need, try vectorizing
the whole lot by stacking the 'index'es and the 'count's. To see how to
do this, look at these plots:

  > index <- 1:4
  > count <- index * 10
  > plot(wtd.quantile(index, count, seq(0, 1, by = 0.001)))
  > plot(rep(index, count))

and now this one, where I 'stack' another table on top of the first one:

  > index.2 <- c(1, 3)
  > count.2 <- c(30, 40)
  > plot(rep(c(index, index.2), c(count, count.2)))

As you can probably see, (for your case) wtd.quantile() is in effect
doing a lookup, plus an interpolation between points in those cases in
which interpolation is needed. The challenge for you is to figure out
how to do the lookup without resorting to approx(), which is what
wtd.quantile() uses. Keeping track of the cumulative number of the
stacked counts with cumsum(), the number of rows in each table, and the
cumulative number of counts for all previous tables should get you
there.

HTH,

Chuck

Charles C. Berry                        (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu
UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/
La Jolla, San Diego 92093-0901
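[A sketch, not from the thread, of the fully vectorized lookup Chuck describes: stack every table's indexes and counts, track cumulative counts with cumsum(), and locate each quantile with one findInterval() call per probability instead of one wtd.quantile() call per row. It computes a step-function ("type 1") quantile, i.e. no interpolation, and assumes the counts are whole numbers and the index values within each table are already sorted increasing (as in the example data). 'index', 'count', and 'stbl' come from Seung's code; the other names are illustrative.]

  probs <- c(0, 0.2, 0.5, 0.8, 1)

  idx_all <- unlist(index)               # stacked index values
  cnt_all <- unlist(count)               # stacked counts
  n_per   <- sapply(index, length)       # rows contributed by each table
  ends    <- cumsum(n_per)               # position of each table's last row

  cum_all  <- cumsum(cnt_all)            # running count over the whole stack
  prev_tot <- c(0, cum_all[ends])[seq_along(ends)]  # counts before each table
  tot      <- cum_all[ends] - prev_tot   # total count per table

  qmat <- sapply(probs, function(p) {
    ## position of the p-quantile within each table's expanded data
    k <- pmax(1, ceiling(p * tot))
    ## target on the running count; the first stacked row whose cumulative
    ## count reaches it holds the quantile.  Subtracting 0.5 turns
    ## "first value >= target" into a findInterval() lookup on integers.
    pos <- findInterval(prev_tot + k - 0.5, cum_all) + 1
    idx_all[pos]
  })

  colnames(qmat) <- c("q0", "q2", "q5", "q8", "q10")
  stbl[, colnames(qmat)] <- qmat

On the example row 0 (counts 234, 120, 11, 1) this gives 0, 0, 0, 1, 30 for probabilities 0, 0.2, 0.5, 0.8, 1, matching the quantile(rep.int(...)) result above, but without ever materializing the expanded vectors.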