thr3ads.net - R help - [R] Vectorizing a for-loop for cross-validation in R [Jan 2019]

If this information is useful, please help other people find it:
Share via:

Aleksandre Gavashelishvili

2019-Jan-23 10:17 UTC

[R] Vectorizing a for-loop for cross-validation in R

I'm trying to speed up a script that otherwise takes days to handle larger
data sets. So, is there a way to completely vectorize or paralellize the
following script:

                *# k-fold cross validation*

df <- trees # a data frame 'trees' from R.
df <- df[sample(nrow(df)), ] # randomly shuffles the data.
k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross
validation.
folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
unique numbers for k equally size folds.
df$ID <- folds # adds fold IDs.
df[paste("pred", 1:3, sep="")] <- NA # adds multiple
columns "pred1"
"pred2" "pred3" to speed up the following loop.

library(mgcv)

for(i in 1:k) {
  # looping for different models:
  m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i))
  m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i))
  m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))

  # looping for predictions:
  df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ],
type="response")
  df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ],
type="response")
  df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ],
type="response")
}

# calculating residuals:
df$res1 <- with(df, Volume - pred1)
df$res2 <- with(df, Volume - pred2)
df$res3 <- with(df, Volume - pred3)

Model <- paste("m", 1:3, sep="") # creates a vector of
model names.

# creating a vector of mean-square errors (MSE):
MSE <- with(df, c(
  sum(res1^2) / nrow(df),
  sum(res2^2) / nrow(df),
  sum(res3^2) / nrow(df)
))

model.mse <- data.frame(Model, MSE) # creates a data frame of model names
and mean-square errors.
model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous
data frame in order of increasing mean-square errors.

I'd appreciate any help. This code takes several days if run on >=30,000
different GAM models and 3 predictors. Could you please help with
re-writing the script into sapply() or foreach()/doParallel format?

Thanks
Lexo

	[[alternative HTML version deleted]]

Berry, Charles

2019-Jan-23 18:34 UTC

head link

[R] Vectorizing a for-loop for cross-validation in R

See inline.
> On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili
<aleksandre.gavashelishvili at iliauni.edu.ge> wrote:
> 
> I'm trying to speed up a script that otherwise takes days to handle
larger
> data sets. So, is there a way to completely vectorize or paralellize the
> following script:
> 
>                *# k-fold cross validation*
> 
> df <- trees # a data frame 'trees' from R.
> df <- df[sample(nrow(df)), ] # randomly shuffles the data.
> k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out cross
> validation.
> folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) # creates
> unique numbers for k equally size folds.
> df$ID <- folds # adds fold IDs.
> df[paste("pred", 1:3, sep="")] <- NA # adds multiple
columns "pred1"
> "pred2" "pred3" to speed up the following loop.
> 
> library(mgcv)
> 
Rprof()

replicate(100, {

> for(i in 1:k) {
>  # looping for different models:
>  m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i))
>  m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i))
>  m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID != i))
> 
>  # looping for predictions:
>  df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ],
type="response")
>  df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ],
type="response")
>  df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ],
type="response")
> }
> 
})

Rprof(NULL)

summaryRprof()

## read ?Rprof to get a sense of what it does

## read the summary to determine where time is being spent.

## the result was surprising to me. YMMV.

## there may be redundancies that you can eliminate by 
##  - doing the setup within gam() one time and saving it
##  - calling the worker functions by modifying the setup 
##    in a loop or function and saving the results

> # calculating residuals:
> df$res1 <- with(df, Volume - pred1)
> df$res2 <- with(df, Volume - pred2)
> df$res3 <- with(df, Volume - pred3)
> 
> Model <- paste("m", 1:3, sep="") # creates a vector
of model names.
> 
> # creating a vector of mean-square errors (MSE):
> MSE <- with(df, c(
>  sum(res1^2) / nrow(df),
>  sum(res2^2) / nrow(df),
>  sum(res3^2) / nrow(df)
> ))
> 
> model.mse <- data.frame(Model, MSE) # creates a data frame of model
names
> and mean-square errors.
> model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the previous
> data frame in order of increasing mean-square errors.
> 
> I'd appreciate any help. This code takes several days if run on
>=30,000
> different GAM models and 3 predictors. Could you please help with
> re-writing the script into sapply() or foreach()/doParallel format?
> 
This is something you should learn to do. It is pretty standard practice. Use
the body of your for loop as the body of a function, add arguments, and create a
suitable return value. The something like

	lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...)

should work.  If it does, then parallel::mclapply(...) should also work.

HTH,

Chuck

Eric Berger

2019-Jan-23 19:37 UTC

head link

[R] Vectorizing a for-loop for cross-validation in R

Charles writes about saving execution time by eliminating redundancies.
If you see redundancies related to calling a time-consuming function
multiple times with the same arguments, a very easy way to speed up your
program is to memoise the functions using the package memoise.

HTH,
Eric




On Wed, Jan 23, 2019 at 8:34 PM Berry, Charles <ccberry at ucsd.edu>
wrote:
> See inline.
>
> > On Jan 23, 2019, at 2:17 AM, Aleksandre Gavashelishvili <
> aleksandre.gavashelishvili at iliauni.edu.ge> wrote:
> >
> > I'm trying to speed up a script that otherwise takes days to
handle
> larger
> > data sets. So, is there a way to completely vectorize or paralellize
the
> > following script:
> >
> >                *# k-fold cross validation*
> >
> > df <- trees # a data frame 'trees' from R.
> > df <- df[sample(nrow(df)), ] # randomly shuffles the data.
> > k <- 10 # Number of folds. Note k=nrow(df) in the leave-one-out
cross
> > validation.
> > folds <- cut(seq(from=1, to=nrow(df)), breaks=k, labels=FALSE) #
creates
> > unique numbers for k equally size folds.
> > df$ID <- folds # adds fold IDs.
> > df[paste("pred", 1:3, sep="")] <- NA # adds
multiple columns "pred1"
> > "pred2" "pred3" to speed up the following loop.
> >
> > library(mgcv)
> >
>
> Rprof()
>
> replicate(100, {
>
>
> > for(i in 1:k) {
> >  # looping for different models:
> >  m1 <- gam(Volume ~ s(Height), data=df, subset=(ID != i))
> >  m2 <- gam(Volume ~ s(Girth), data=df, subset=(ID != i))
> >  m3 <- gam(Volume ~ s(Girth) + s(Height), data=df, subset=(ID !=
i))
> >
> >  # looping for predictions:
> >  df[df$ID==i, "pred1"] <- predict(m1, df[df$ID==i, ],
type="response")
> >  df[df$ID==i, "pred2"] <- predict(m2, df[df$ID==i, ],
type="response")
> >  df[df$ID==i, "pred3"] <- predict(m3, df[df$ID==i, ],
type="response")
> > }
> >
>
> })
>
> Rprof(NULL)
>
> summaryRprof()
>
> ## read ?Rprof to get a sense of what it does
>
> ## read the summary to determine where time is being spent.
>
> ## the result was surprising to me. YMMV.
>
> ## there may be redundancies that you can eliminate by
> ##  - doing the setup within gam() one time and saving it
> ##  - calling the worker functions by modifying the setup
> ##    in a loop or function and saving the results
>
>
> > # calculating residuals:
> > df$res1 <- with(df, Volume - pred1)
> > df$res2 <- with(df, Volume - pred2)
> > df$res3 <- with(df, Volume - pred3)
> >
> > Model <- paste("m", 1:3, sep="") # creates a
vector of model names.
> >
> > # creating a vector of mean-square errors (MSE):
> > MSE <- with(df, c(
> >  sum(res1^2) / nrow(df),
> >  sum(res2^2) / nrow(df),
> >  sum(res3^2) / nrow(df)
> > ))
> >
> > model.mse <- data.frame(Model, MSE) # creates a data frame of model
names
> > and mean-square errors.
> > model.mse <- model.mse[order(model.mse$MSE), ] # rearranges the
previous
> > data frame in order of increasing mean-square errors.
> >
> > I'd appreciate any help. This code takes several days if run on
>=30,000
> > different GAM models and 3 predictors. Could you please help with
> > re-writing the script into sapply() or foreach()/doParallel format?
> >
>
> This is something you should learn to do. It is pretty standard practice.
> Use the body of your for loop as the body of a function, add arguments, and
> create a suitable return value. The something like
>
>         lapply( 1:k, your.loop.body.function, other.arg1, other.arg2, ...)
>
> should work.  If it does, then parallel::mclapply(...) should also work.
>
> HTH,
>
> Chuck
>
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

R help - Jan 2019 - Vectorizing a for-loop for cross-validation in R

[R] Vectorizing a for-loop for cross-validation in R

[R] Vectorizing a for-loop for cross-validation in R

[R] Vectorizing a for-loop for cross-validation in R