Hello,

I'm very new to R, so my apologies if I'm making an obvious mistake.

I have a data frame with ~170k rows and 14 numeric variables. The first 2 of those variables (let's call them group1 and group2) are used to define groups: each unique pair of (group1, group2) is a group. There are roughly 50k such unique groups, with sizes varying from 1 to 40 rows each.

My objective is to fit a linear regression within each group and get its mean square error (MSE), so the final output needs to be a collection of 50k MSEs. Regardless of the size of a group, the regression needs to be run on exactly 40 observations: if a group has fewer than 40 observations, I need to pad it to 40 rows, populating all variables with 0's in those extra rows. Here's the function I wrote to do this:

get_MSE <- function(x) {
  rownames(x) <- x$ID            # 'ID' can take on any value from 1 to 40.
  x <- x[as.character(1:40), ]   # pad the group to exactly 40 rows
  x[is.na(x)] <- 0               # fill the padding rows with 0's
  regressionResult <- lm(A ~ B + C + D + E, data = x)  # A-E are some variables in the data frame.
  MSE <- mean((regressionResult$fitted.values - x$A)^2)
  return(MSE)
}

library(plyr)
output <- ddply(dataset, list(dataset$group1, dataset$group2), get_MSE)

The above code takes about 10 minutes to run, but I really need it to be much faster, if at all possible. Is there anything I can do to speed it up?

Thank you very much in advance.

Jose
R. Michael Weylandt
2012-Feb-23 13:12 UTC
[R] Improving performance of split-apply problem
It looks like what you are doing is reasonably efficient. The object returned by lm() has a residuals element, so you could use that directly instead of recomputing the errors from fitted.values, which will be a little more efficient. The bulk of the time is probably being taken up by the lm() call itself, which carries a lot of overhead: you could use fastLm() from the RcppArmadillo package, or call lm.fit() directly, to cut a lot of that out. A rough sketch follows at the end of this message.

Michael

On Wed, Feb 22, 2012 at 9:10 PM, Martin <misenial at gmail.com> wrote:
> [original message quoted above; trimmed]
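To make the lm.fit() suggestion concrete, here is a minimal, untested sketch. It assumes the same padded data frame x and columns A-E as in the original get_MSE(); the name get_MSE_fast is only illustrative, and fastLm() accepts the same design-matrix/response pair if you want to try it instead.

get_MSE_fast <- function(x) {
  rownames(x) <- x$ID
  x <- x[as.character(1:40), ]   # same 40-row padding as before
  x[is.na(x)] <- 0
  # Build the design matrix by hand: an intercept column plus B-E.
  X <- cbind(1, as.matrix(x[, c("B", "C", "D", "E")]))
  fit <- lm.fit(X, x$A)          # skips formula parsing and model.frame() work
  mean(fit$residuals^2)          # residuals come back directly; no recomputation needed
}

This avoids building a model frame and parsing the formula 50k times, which is where most of lm()'s per-call overhead lives; it is worth benchmarking against your current version (e.g. with system.time()) on a subset of the groups before committing to it.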