Aleksandre Gavashelishvili
2019-Jan-23 10:17 UTC
[R] Vectorizing a for-loop for cross-validation in R
I'm trying to speed up a script that otherwise takes days to handle larger data sets. So, is there a way to completely vectorize or paralellize the following script: *# k-fold cross validation* df <- trees # a data frame 'trees' from R. df <- df[sample(nrow(df)), ] # randomly shuffles the data. k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross validation. folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates unique numbers for k equally size folds. df$ID <- folds # adds fold IDs. df[paste("pred", 1:3, sep="")] <- NA # adds multiple columns "pred1" "pred2" "pred3" to speed up the following loop. library(mgcv) for(i in 1:k) { # looping for different models: m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i)) m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i)) m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i)) # looping for predictions: df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ], type="response") df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ], type="response") df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ], type="response") } # calculating residuals: df$res1 <- with(df, Volume - pred1) df$res2 <- with(df, Volume - pred2) df$res3 <- with(df, Volume - pred3) Model <- paste("m", 1:3, sep="") # creates a vector of model names. # creating a vector of mean-square errors (MSE): MSE <- with(df, c( sum(res1^2) / nrow(df), sum(res2^2) / nrow(df), sum(res3^2) / nrow(df) )) model.mse <- data.frame(Model, MSE) # creates a data frame of model names and mean-square errors. model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous data frame in order of increasing mean-square errors. I'd appreciate any help. This code takes several days if run on >=30,000 different GAM models and 3 predictors. Could you please help with re-writing the script into sapply() or foreach()/doParallel format? Thanks Lexo [[alternative HTML version deleted]]
See inline.> On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili <aleksandre.gavashelishvili at iliauni.edu.ge> wrote: > > I'm trying to speed up a script that otherwise takes days to handle larger > data sets. So, is there a way to completely vectorize or paralellize the > following script: > > *# k-fold cross validation* > > df <- trees # a data frame 'trees' from R. > df <- df[sample(nrow(df)), ] # randomly shuffles the data. > k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross > validation. > folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates > unique numbers for k equally size folds. > df$ID <- folds # adds fold IDs. > df[paste("pred", 1:3, sep="")] <- NA # adds multiple columns "pred1" > "pred2" "pred3" to speed up the following loop. > > library(mgcv) >Rprof() replicate(100, {> for(i in 1:k) { > # looping for different models: > m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i)) > m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i)) > m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i)) > > # looping for predictions: > df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ], type="response") > df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ], type="response") > df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ], type="response") > } >}) Rprof(NULL) summaryRprof() ## read ?Rprof to get a sense of what it does ## read the summary to determine where time is being spent. ## the result was surprising to me. YMMV. ## there may be redundancies that you can eliminate by ## - doing the setup within gam() one time and saving it ## - calling the worker functions by modifying the setup ## in a loop or function and saving the results> # calculating residuals: > df$res1 <- with(df, Volume - pred1) > df$res2 <- with(df, Volume - pred2) > df$res3 <- with(df, Volume - pred3) > > Model <- paste("m", 1:3, sep="") # creates a vector of model names. > > # creating a vector of mean-square errors (MSE): > MSE <- with(df, c( > sum(res1^2) / nrow(df), > sum(res2^2) / nrow(df), > sum(res3^2) / nrow(df) > )) > > model.mse <- data.frame(Model, MSE) # creates a data frame of model names > and mean-square errors. > model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous > data frame in order of increasing mean-square errors. > > I'd appreciate any help. This code takes several days if run on >=30,000 > different GAM models and 3 predictors. Could you please help with > re-writing the script into sapply() or foreach()/doParallel format? >This is something you should learn to do. It is pretty standard practice. Use the body of your for loop as the body of a function, add arguments, and create a suitable return value. The something like lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...) should work. If it does, then parallel::mclapply(...) should also work. HTH, Chuck
Charles writes about saving execution time by eliminating redundancies. If you see redundancies related to calling a time-consuming function multiple times with the same arguments, a very easy way to speed up your program is to memoise the functions using the package memoise. HTH, Eric On Wed, Jan 23, 2019 at 8:34 PM Berry, Charles <ccberry at ucsd.edu> wrote:> See inline. > > > On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili < > aleksandre.gavashelishvili at iliauni.edu.ge> wrote: > > > > I'm trying to speed up a script that otherwise takes days to handle > larger > > data sets. So, is there a way to completely vectorize or paralellize the > > following script: > > > > *# k-fold cross validation* > > > > df <- trees # a data frame 'trees' from R. > > df <- df[sample(nrow(df)), ] # randomly shuffles the data. > > k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross > > validation. > > folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates > > unique numbers for k equally size folds. > > df$ID <- folds # adds fold IDs. > > df[paste("pred", 1:3, sep="")] <- NA # adds multiple columns "pred1" > > "pred2" "pred3" to speed up the following loop. > > > > library(mgcv) > > > > Rprof() > > replicate(100, { > > > > for(i in 1:k) { > > # looping for different models: > > m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i)) > > m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i)) > > m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i)) > > > > # looping for predictions: > > df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ], type="response") > > df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ], type="response") > > df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ], type="response") > > } > > > > }) > > Rprof(NULL) > > summaryRprof() > > ## read ?Rprof to get a sense of what it does > > ## read the summary to determine where time is being spent. > > ## the result was surprising to me. YMMV. > > ## there may be redundancies that you can eliminate by > ## - doing the setup within gam() one time and saving it > ## - calling the worker functions by modifying the setup > ## in a loop or function and saving the results > > > > # calculating residuals: > > df$res1 <- with(df, Volume - pred1) > > df$res2 <- with(df, Volume - pred2) > > df$res3 <- with(df, Volume - pred3) > > > > Model <- paste("m", 1:3, sep="") # creates a vector of model names. > > > > # creating a vector of mean-square errors (MSE): > > MSE <- with(df, c( > > sum(res1^2) / nrow(df), > > sum(res2^2) / nrow(df), > > sum(res3^2) / nrow(df) > > )) > > > > model.mse <- data.frame(Model, MSE) # creates a data frame of model names > > and mean-square errors. > > model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous > > data frame in order of increasing mean-square errors. > > > > I'd appreciate any help. This code takes several days if run on >=30,000 > > different GAM models and 3 predictors. Could you please help with > > re-writing the script into sapply() or foreach()/doParallel format? > > > > This is something you should learn to do. It is pretty standard practice. > Use the body of your for loop as the body of a function, add arguments, and > create a suitable return value. The something like > > lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...) > > should work. If it does, then parallel::mclapply(...) should also work. > > HTH, > > Chuck > > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]