The small example below works lighting-fast; however, when I run the same script on my real problem, a 1Gb text file, the for loops have been running for over 24 hrs and I have no idea if the processing is 10% done or 90% done. I have not been able to figure out a betteR way to code up the material within the for loops at the end of the example below. The contents of divChng, the final product, are exactly what I'm after, but I need help formulating more efficient R script, I've got two more 1Gb files to process after the current one finishes, whenever that is... I appreciate any insights/solutions, Eric dat <- read.table(textConnection("ISEG IRCH div gw 1 1 265 229 1 2 260 298 1 3 234 196 54 1 432 485 54 39 467 485 54 40 468 468 54 41 460 381 54 42 489 502 1 1 265 317 1 2 276 225 1 3 217 164 54 1 430 489 54 39 456 495 54 40 507 607 54 41 483 424 54 42 457 404 1 1 265 278 1 2 287 370 1 3 224 274 54 1 412 585 54 39 473 532 54 40 502 595 54 41 497 441 54 42 447 467 1 1 230 258 1 2 251 152 1 3 199 179 54 1 412 415 54 39 439 538 54 40 474 486 54 41 477 484 54 42 413 346 1 1 230 171 1 2 262 171 1 3 217 263 54 1 432 485 54 39 455 482 54 40 493 419 54 41 489 536 54 42 431 504 1 1 1002 1090 1 2 1222 1178 1 3 1198 1177 54 1 1432 1485 54 39 1876 1975 54 40 1565 1646 54 41 1455 1451 54 42 1427 1524 1 1 1002 968 1 2 1246 1306 1 3 1153 1158 54 1 1532 1585 54 39 1790 1889 54 40 1490 1461 54 41 1518 1536 54 42 1486 1585 1 1 1002 1081 1 2 1229 1262 1 3 1142 1241 54 1 1632 1659 54 39 1797 1730 54 40 1517 1466 54 41 1527 1589 54 42 1514 1612"),header=TRUE) dat$seq <- ifelse(dat$ISEG==1 & dat$IRCH==1, 1, 0) tmp <- diff(dat[dat$seq==1,]$div)!=0 dat$idx <- 0 dat[dat$seq==1,][c(TRUE,tmp),]$idx <- 1 dat$ts <- cumsum(dat$idx) dat$iter <- ave(dat$seq, dat$ts,FUN=cumsum) dat$ct <- seq(1:length(dat[,1])) timeStep <- unique(dat$ts) SEG <- unique(dat$ISEG) divChng <- data.frame(ts=NA, ISEG=NA, divChng=NA, gwChng=NA, iter=NA) #Can the following be rescripted for better harnessing R's processing power? for (i in 1:length(timeStep)){ for (j in 1:length(SEG)){ datTS <- subset(dat,ts==timeStep[i] & ISEG==SEG[j] & IRCH==1) datGW <- subset(dat,ts==timeStep[i] & ISEG==SEG[j]) grw <- aggregate(gw ~ iter, datGW, sum) DC <- max(datTS$div)-min(datTS$div) GRW <- max(grw$gw) - min(grw$gw) divChng <- rbind(divChng,c(datTS$ts[1], SEG[j], DC, GRW, max(datTS$iter))) } } divChng <- divChng[!is.na(divChng$ISEG),] [[alternative HTML version deleted]]
You are certainly in Circle 2 of 'The R Inferno', which I suspect is where almost all of the computation time is coming from. Instead of doing: divChng <- rbind(divChng,c(datTS$ts[1], SEG[j], DC, GRW, max(datTS$iter))) it would be much better to create 'divChng' to be the final length and then in the loop do: divChng[count,] <- c(datTS$ts[1], SEG[j], DC, GRW, max(datTS$iter)) count <- count + 1 You may also want to explore 'tapply'. If this is something you will be doing a lot of, then you should probably learn the 'data.table' package. http://www.burns-stat.com/documents/books/the-r-inferno/ Pat On 12/04/2015 14:47, Morway, Eric wrote:> The small example below works lighting-fast; however, when I run the same > script on my real problem, a 1Gb text file, the for loops have been running > for over 24 hrs and I have no idea if the processing is 10% done or 90% > done. I have not been able to figure out a betteR way to code up the > material within the for loops at the end of the example below. The > contents of divChng, the final product, are exactly what I'm after, but I > need help formulating more efficient R script, I've got two more 1Gb files > to process after the current one finishes, whenever that is... > > I appreciate any insights/solutions, Eric > > dat <- read.table(textConnection("ISEG IRCH div gw > 1 1 265 229 > 1 2 260 298 > 1 3 234 196 > 54 1 432 485 > 54 39 467 485 > 54 40 468 468 > 54 41 460 381 > 54 42 489 502 > 1 1 265 317 > 1 2 276 225 > 1 3 217 164 > 54 1 430 489 > 54 39 456 495 > 54 40 507 607 > 54 41 483 424 > 54 42 457 404 > 1 1 265 278 > 1 2 287 370 > 1 3 224 274 > 54 1 412 585 > 54 39 473 532 > 54 40 502 595 > 54 41 497 441 > 54 42 447 467 > 1 1 230 258 > 1 2 251 152 > 1 3 199 179 > 54 1 412 415 > 54 39 439 538 > 54 40 474 486 > 54 41 477 484 > 54 42 413 346 > 1 1 230 171 > 1 2 262 171 > 1 3 217 263 > 54 1 432 485 > 54 39 455 482 > 54 40 493 419 > 54 41 489 536 > 54 42 431 504 > 1 1 1002 1090 > 1 2 1222 1178 > 1 3 1198 1177 > 54 1 1432 1485 > 54 39 1876 1975 > 54 40 1565 1646 > 54 41 1455 1451 > 54 42 1427 1524 > 1 1 1002 968 > 1 2 1246 1306 > 1 3 1153 1158 > 54 1 1532 1585 > 54 39 1790 1889 > 54 40 1490 1461 > 54 41 1518 1536 > 54 42 1486 1585 > 1 1 1002 1081 > 1 2 1229 1262 > 1 3 1142 1241 > 54 1 1632 1659 > 54 39 1797 1730 > 54 40 1517 1466 > 54 41 1527 1589 > 54 42 1514 1612"),header=TRUE) > > dat$seq <- ifelse(dat$ISEG==1 & dat$IRCH==1, 1, 0) > tmp <- diff(dat[dat$seq==1,]$div)!=0 > dat$idx <- 0 > dat[dat$seq==1,][c(TRUE,tmp),]$idx <- 1 > dat$ts <- cumsum(dat$idx) > dat$iter <- ave(dat$seq, dat$ts,FUN=cumsum) > dat$ct <- seq(1:length(dat[,1])) > > timeStep <- unique(dat$ts) > SEG <- unique(dat$ISEG) > divChng <- data.frame(ts=NA, ISEG=NA, divChng=NA, gwChng=NA, iter=NA) > > #Can the following be rescripted for better harnessing R's processing power? > > for (i in 1:length(timeStep)){ > for (j in 1:length(SEG)){ > datTS <- subset(dat,ts==timeStep[i] & ISEG==SEG[j] & IRCH==1) > datGW <- subset(dat,ts==timeStep[i] & ISEG==SEG[j]) > grw <- aggregate(gw ~ iter, datGW, sum) > > DC <- max(datTS$div)-min(datTS$div) > GRW <- max(grw$gw) - min(grw$gw) > divChng <- rbind(divChng,c(datTS$ts[1], SEG[j], DC, GRW, > max(datTS$iter))) > } > } > divChng <- divChng[!is.na(divChng$ISEG),] > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
You don't need loops at all. grw <- aggregate(gw ~ ts + ISEG + iter, data = dat, FUN = sum) GRW <- aggregate(gw ~ ts + ISEG, data = grw, FUN = function(x){max(x) - min(x)}) DC <- aggregate(div ~ ts + ISEG, data = subset(dat, IRCH == 1), FUN function(x){max(x) - min(x)}) iter <- aggregate(iter ~ ts + ISEG, data = subset(dat, IRCH == 1), FUN = max) tmp <- merge(DC, iter) merge(tmp, GRW) another option is to use the plyr package library(plyr) merge( ddply( subset(dat, IRCH == 1), c("ts", "ISEG"), summarize, divChng = max(div) - min(div), max.iter = max(iter) ), ddply( dat, c("ts", "ISEG"), summarize, gwChng = diff(range(ave(gw, iter, FUN = sum))) ) ) Best regards, ir. Thierry Onkelinx Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance Kliniekstraat 25 1070 Anderlecht Belgium To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher The plural of anecdote is not data. ~ Roger Brinner The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey 2015-04-12 15:47 GMT+02:00 Morway, Eric <emorway at usgs.gov>:> The small example below works lighting-fast; however, when I run the same > script on my real problem, a 1Gb text file, the for loops have been running > for over 24 hrs and I have no idea if the processing is 10% done or 90% > done. I have not been able to figure out a betteR way to code up the > material within the for loops at the end of the example below. The > contents of divChng, the final product, are exactly what I'm after, but I > need help formulating more efficient R script, I've got two more 1Gb files > to process after the current one finishes, whenever that is... > > I appreciate any insights/solutions, Eric > > dat <- read.table(textConnection("ISEG IRCH div gw > 1 1 265 229 > 1 2 260 298 > 1 3 234 196 > 54 1 432 485 > 54 39 467 485 > 54 40 468 468 > 54 41 460 381 > 54 42 489 502 > 1 1 265 317 > 1 2 276 225 > 1 3 217 164 > 54 1 430 489 > 54 39 456 495 > 54 40 507 607 > 54 41 483 424 > 54 42 457 404 > 1 1 265 278 > 1 2 287 370 > 1 3 224 274 > 54 1 412 585 > 54 39 473 532 > 54 40 502 595 > 54 41 497 441 > 54 42 447 467 > 1 1 230 258 > 1 2 251 152 > 1 3 199 179 > 54 1 412 415 > 54 39 439 538 > 54 40 474 486 > 54 41 477 484 > 54 42 413 346 > 1 1 230 171 > 1 2 262 171 > 1 3 217 263 > 54 1 432 485 > 54 39 455 482 > 54 40 493 419 > 54 41 489 536 > 54 42 431 504 > 1 1 1002 1090 > 1 2 1222 1178 > 1 3 1198 1177 > 54 1 1432 1485 > 54 39 1876 1975 > 54 40 1565 1646 > 54 41 1455 1451 > 54 42 1427 1524 > 1 1 1002 968 > 1 2 1246 1306 > 1 3 1153 1158 > 54 1 1532 1585 > 54 39 1790 1889 > 54 40 1490 1461 > 54 41 1518 1536 > 54 42 1486 1585 > 1 1 1002 1081 > 1 2 1229 1262 > 1 3 1142 1241 > 54 1 1632 1659 > 54 39 1797 1730 > 54 40 1517 1466 > 54 41 1527 1589 > 54 42 1514 1612"),header=TRUE) > > dat$seq <- ifelse(dat$ISEG==1 & dat$IRCH==1, 1, 0) > tmp <- diff(dat[dat$seq==1,]$div)!=0 > dat$idx <- 0 > dat[dat$seq==1,][c(TRUE,tmp),]$idx <- 1 > dat$ts <- cumsum(dat$idx) > dat$iter <- ave(dat$seq, dat$ts,FUN=cumsum) > dat$ct <- seq(1:length(dat[,1])) > > timeStep <- unique(dat$ts) > SEG <- unique(dat$ISEG) > divChng <- data.frame(ts=NA, ISEG=NA, divChng=NA, gwChng=NA, iter=NA) > > #Can the following be rescripted for better harnessing R's processing > power? > > for (i in 1:length(timeStep)){ > for (j in 1:length(SEG)){ > datTS <- subset(dat,ts==timeStep[i] & ISEG==SEG[j] & IRCH==1) > datGW <- subset(dat,ts==timeStep[i] & ISEG==SEG[j]) > grw <- aggregate(gw ~ iter, datGW, sum) > > DC <- max(datTS$div)-min(datTS$div) > GRW <- max(grw$gw) - min(grw$gw) > divChng <- rbind(divChng,c(datTS$ts[1], SEG[j], DC, GRW, > max(datTS$iter))) > } > } > divChng <- divChng[!is.na(divChng$ISEG),] > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Well, sort of... aggregate() is basically a wrapper for lapply(), which ultimately must loop over the function call at the R interpreter level, as opposed to vectorized functions that loop at the C level and hence can be orders of magnitude faster. As a result, there is often little difference in efficiency between explicit and *smart* (in the sense that Pat Burns has already pointed out of not growing structures at each iteration,among other things) for() looping and apply-type calls. For some of us, the chief advantage of the *apply idioms is that the code is more readable and maintainable, with R handling fussy details of loop indexing, for example.lapply() is also more in keeping with the functional programming paradigm. ?Others? find both these "virtues" to be annoyances, ? ? however, and prefer explicit *smart* looping. Chaque un ? ? ? son go? ?t? . None of which necessarily denies the wisdom of the approach you've suggested, however. It may indeed be considerably faster, but timing will have to tell. I am just trying to correct ?(again) ? the ? ? widely held misperception ?that yo? u ? seem to? express ?.? Cheers, Bert Bert Gunter Genentech Nonclinical Biostatistics (650) 467-7374 "Data is not information. Information is not knowledge. And knowledge is certainly not wisdom." Clifford Stoll On Sun, Apr 12, 2015 at 1:48 PM, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:> You don't need loops at all. > > grw <- aggregate(gw ~ ts + ISEG + iter, data = dat, FUN = sum) > GRW <- aggregate(gw ~ ts + ISEG, data = grw, FUN = function(x){max(x) - > min(x)}) > DC <- aggregate(div ~ ts + ISEG, data = subset(dat, IRCH == 1), FUN > function(x){max(x) - min(x)}) > iter <- aggregate(iter ~ ts + ISEG, data = subset(dat, IRCH == 1), FUN > = max) > tmp <- merge(DC, iter) > merge(tmp, GRW) > > another option is to use the plyr package > > library(plyr) > merge( > ddply( > subset(dat, IRCH == 1), > c("ts", "ISEG"), > summarize, > divChng = max(div) - min(div), > max.iter = max(iter) > ), > ddply( > dat, > c("ts", "ISEG"), > summarize, > gwChng = diff(range(ave(gw, iter, FUN = sum))) > ) > ) > > Best regards, > > ir. Thierry Onkelinx > Instituut voor natuur- en bosonderzoek / Research Institute for Nature and > Forest > team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance > Kliniekstraat 25 > 1070 Anderlecht > Belgium > > To call in the statistician after the experiment is done may be no more > than asking him to perform a post-mortem examination: he may be able to say > what the experiment died of. ~ Sir Ronald Aylmer Fisher > The plural of anecdote is not data. ~ Roger Brinner > The combination of some data and an aching desire for an answer does not > ensure that a reasonable answer can be extracted from a given body of data. > ~ John Tukey > > 2015-04-12 15:47 GMT+02:00 Morway, Eric <emorway at usgs.gov>: > > > The small example below works lighting-fast; however, when I run the same > > script on my real problem, a 1Gb text file, the for loops have been > running > > for over 24 hrs and I have no idea if the processing is 10% done or 90% > > done. I have not been able to figure out a betteR way to code up the > > material within the for loops at the end of the example below. The > > contents of divChng, the final product, are exactly what I'm after, but I > > need help formulating more efficient R script, I've got two more 1Gb > files > > to process after the current one finishes, whenever that is... > > > > I appreciate any insights/solutions, Eric > > > > dat <- read.table(textConnection("ISEG IRCH div gw > > 1 1 265 229 > > 1 2 260 298 > > 1 3 234 196 > > 54 1 432 485 > > 54 39 467 485 > > 54 40 468 468 > > 54 41 460 381 > > 54 42 489 502 > > 1 1 265 317 > > 1 2 276 225 > > 1 3 217 164 > > 54 1 430 489 > > 54 39 456 495 > > 54 40 507 607 > > 54 41 483 424 > > 54 42 457 404 > > 1 1 265 278 > > 1 2 287 370 > > 1 3 224 274 > > 54 1 412 585 > > 54 39 473 532 > > 54 40 502 595 > > 54 41 497 441 > > 54 42 447 467 > > 1 1 230 258 > > 1 2 251 152 > > 1 3 199 179 > > 54 1 412 415 > > 54 39 439 538 > > 54 40 474 486 > > 54 41 477 484 > > 54 42 413 346 > > 1 1 230 171 > > 1 2 262 171 > > 1 3 217 263 > > 54 1 432 485 > > 54 39 455 482 > > 54 40 493 419 > > 54 41 489 536 > > 54 42 431 504 > > 1 1 1002 1090 > > 1 2 1222 1178 > > 1 3 1198 1177 > > 54 1 1432 1485 > > 54 39 1876 1975 > > 54 40 1565 1646 > > 54 41 1455 1451 > > 54 42 1427 1524 > > 1 1 1002 968 > > 1 2 1246 1306 > > 1 3 1153 1158 > > 54 1 1532 1585 > > 54 39 1790 1889 > > 54 40 1490 1461 > > 54 41 1518 1536 > > 54 42 1486 1585 > > 1 1 1002 1081 > > 1 2 1229 1262 > > 1 3 1142 1241 > > 54 1 1632 1659 > > 54 39 1797 1730 > > 54 40 1517 1466 > > 54 41 1527 1589 > > 54 42 1514 1612"),header=TRUE) > > > > dat$seq <- ifelse(dat$ISEG==1 & dat$IRCH==1, 1, 0) > > tmp <- diff(dat[dat$seq==1,]$div)!=0 > > dat$idx <- 0 > > dat[dat$seq==1,][c(TRUE,tmp),]$idx <- 1 > > dat$ts <- cumsum(dat$idx) > > dat$iter <- ave(dat$seq, dat$ts,FUN=cumsum) > > dat$ct <- seq(1:length(dat[,1])) > > > > timeStep <- unique(dat$ts) > > SEG <- unique(dat$ISEG) > > divChng <- data.frame(ts=NA, ISEG=NA, divChng=NA, gwChng=NA, iter=NA) > > > > #Can the following be rescripted for better harnessing R's processing > > power? > > > > for (i in 1:length(timeStep)){ > > for (j in 1:length(SEG)){ > > datTS <- subset(dat,ts==timeStep[i] & ISEG==SEG[j] & IRCH==1) > > datGW <- subset(dat,ts==timeStep[i] & ISEG==SEG[j]) > > grw <- aggregate(gw ~ iter, datGW, sum) > > > > DC <- max(datTS$div)-min(datTS$div) > > GRW <- max(grw$gw) - min(grw$gw) > > divChng <- rbind(divChng,c(datTS$ts[1], SEG[j], DC, GRW, > > max(datTS$iter))) > > } > > } > > divChng <- divChng[!is.na(divChng$ISEG),] > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]