Hello,

I'm very new to R, so my apologies if I'm making an obvious mistake.

I have a data frame with ~170k rows and 14 numeric variables. The first 2 of those variables (let's call them group1 and group2) are used to define groups: each unique pair of (group1, group2) is a group. There are roughly 50k such unique groups, with sizes varying from 1 to 40 rows each.

My objective is to fit a linear regression within each group and get its mean square error (MSE), so the final output needs to be a collection of 50k MSEs. Regardless of the size of a group, the regression needs to be run on exactly 40 observations: if a group has fewer than 40 observations, I need to pad it to 40 rows, populating all variables with 0's in those extra rows. Here's the function I wrote to do this:

get_MSE <- function(x) {
  rownames(x) <- x$ID            # 'ID' can take on any value from 1 to 40.
  x <- x[as.character(1:40), ]   # pad the group to exactly 40 rows
  x[is.na(x)] <- 0               # fill the padding rows with 0's
  regressionResult <- lm(A ~ B + C + D + E, data = x)  # A-E are some variables in the data frame.
  MSE <- mean((regressionResult$fitted.values - x$A)^2)
  return(MSE)
}

library(plyr)
output <- ddply(dataset, list(dataset$group1, dataset$group2), get_MSE)

The above code takes about 10 minutes to run, but I really need it to be much faster, if at all possible. Is there anything I can do to speed it up?

Thank you very much in advance.

Jose
R. Michael Weylandt
2012-Feb-23 13:12 UTC
[R] Improving performance of split-apply problem
It looks like what you are doing is reasonably efficient. The object returned by lm() has a residuals element, so you could use that directly instead of recomputing the errors from fitted.values, which will be a little more efficient. The bulk of the time is probably being taken up by the lm() call itself, which carries a lot of overhead: you could use fastLm() from the RcppArmadillo package, or call lm.fit() directly, to cut a lot of that out. A rough sketch follows at the end of this message.

Michael

On Wed, Feb 22, 2012 at 9:10 PM, Martin <misenial at gmail.com> wrote:
> [original message quoted above; trimmed]
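To make the lm.fit() suggestion concrete, here is a minimal, untested sketch. It assumes the same padded data frame x and columns A-E as in the original get_MSE(); the name get_MSE_fast is only illustrative, and fastLm() accepts the same design-matrix/response pair if you want to try it instead.

get_MSE_fast <- function(x) {
  rownames(x) <- x$ID
  x <- x[as.character(1:40), ]   # same 40-row padding as before
  x[is.na(x)] <- 0
  # Build the design matrix by hand: an intercept column plus B-E.
  X <- cbind(1, as.matrix(x[, c("B", "C", "D", "E")]))
  fit <- lm.fit(X, x$A)          # skips formula parsing and model.frame() work
  mean(fit$residuals^2)          # residuals come back directly; no recomputation needed
}

This avoids building a model frame and parsing the formula 50k times, which is where most of lm()'s per-call overhead lives; it is worth benchmarking against your current version (e.g. with system.time()) on a subset of the groups before committing to it.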