c.buhtz at posteo.jp
2024-Jun-21  14:38 UTC
[R] Regression performance when using summary() twice
Hello,
I am not a regular R user; I come from Python, but I use R for 
several special tasks.
Doing a regression analysis costs some compute time. I wonder 
when the time-consuming algorithm is executed and whether it is run 
twice in my special case.
It seems that calling "glm()" or similar does not execute the 
time-consuming part of the regression code.
It seems it is done when calling "summary(model)".
Am I right so far?
If this is correct, I would say that in my case the regression is done 
twice with the identical formula and data, which of course is 
inefficient. See this code:
    my_function <- function(formula_string, data) {
        formula <- as.formula(formula_string)
        model <- glm.nb(formula, data = data)
        result <- cbind(summary(model)$coefficients, confint(model))
        result <- as.data.frame(result)
        string_result <- capture.output(summary(model))
        return(list(result, string_result))
    }
I call summary() once to get the "$coefficients" and a second time 
when capturing its output as a string.
If this really results in computing the regression twice, I ask myself 
whether there is an R way to make this more efficient.
Best regards,
Christian Buhtz
Dear Christian,
Without knowing how big your dataset is, it is hard to be sure, but 
confint() can take some time. Have you thought of calling summary() once,
    summ <- summary(model)
and then replacing all subsequent calls to summary() with summ?
Michael
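One way to check where the time goes is to time each stage separately. The following is only a sketch: it assumes the quine data set shipped with MASS and a simple main-effects formula chosen for illustration, not the original poster's data or model. Because coef() already returns estimates before summary() is ever called, it also shows that the fit itself happens inside glm.nb().

```r
library(MASS)  # provides glm.nb() and the quine example data

# Time each stage on its own. The formula is an illustrative assumption.
t_fit     <- system.time(fit  <- glm.nb(Days ~ Eth + Sex + Age + Lrn, data = quine))
t_summary <- system.time(summ <- summary(fit))
# confint() on this model profiles the likelihood, which refits repeatedly.
t_confint <- system.time(ci   <- suppressMessages(confint(fit)))

# The coefficients come from the fit itself, not from summary().
print(coef(fit))

# Elapsed seconds per stage.
timings <- c(fit     = t_fit[["elapsed"]],
             summary = t_summary[["elapsed"]],
             confint = t_confint[["elapsed"]])
print(timings)
```

On a data set this small all three numbers are tiny, so the ranking is more informative than the absolute values; with larger data the same three-way split can be run unchanged.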
Dear Christian,
You're apparently using the glm.nb() function in the MASS package.
Your function is peculiar in several respects. For example, you specify 
the model formula as a character string and then convert it into a 
formula, but you could just pass the formula to the function -- the 
conversion seems unnecessary. Similarly, you compute the summary for the 
model twice rather than just saving it in a local variable in your 
function. And the form of the function output is a bit strange, but I 
suppose you have reasons for that.
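As a rough illustration of those two points, the function could take a formula object directly and store the summary in a local variable. This is a sketch, not a drop-in replacement: the name fit_nb_table, the named list elements, and the quine example call are assumptions made here for illustration.

```r
library(MASS)  # provides glm.nb() and the quine example data

# Hypothetical rewrite: accepts a formula object and computes summary() once.
fit_nb_table <- function(formula, data) {
  model <- glm.nb(formula, data = data)
  summ  <- summary(model)                    # computed once, reused below
  ci    <- suppressMessages(confint(model))  # profile-likelihood intervals
  list(
    table = as.data.frame(cbind(summ$coefficients, ci)),
    text  = capture.output(print(summ))      # printed summary as character lines
  )
}

# Example call on the quine data; the formula is an illustrative choice.
res <- fit_nb_table(Days ~ Eth + Sex + Age + Lrn, data = quine)
print(res$table)
```

Naming the list elements lets callers write res$table and res$text rather than relying on position.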
The primary reason that your function is slow, however, is that the 
confidence intervals computed by confint() profile the likelihood, which 
requires refitting the model a number of times. If you're willing to use 
possibly less accurate Wald-based rather than likelihood-based 
confidence intervals, computed, e.g., by the Confint() function in the 
car package, then you could speed up the computation considerably.
Using a model fit by example(glm.nb),
	library(MASS)
	example(glm.nb)
	microbenchmark::microbenchmark(
	  Wald = car::Confint(quine.nb1, vcov.=vcov(quine.nb1),
	               estimate=FALSE),
	  LR = confint(quine.nb1)
	)
which produces
Unit: microseconds
 expr       min       lq       mean    median       uq        max neval
 Wald   136.366   161.13   222.0872   184.541   283.72    386.466   100
   LR 87223.031 88757.09 95162.8733 95761.568 97672.23 182734.048   100
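If adding the car package is not an option, Wald-type intervals can also be obtained in base R by calling confint.default() directly, which builds them from coef() and vcov() without refitting the model. The sketch below assumes a simple main-effects model on the quine data (chosen here for illustration, not the quine.nb1 model used above) and checks the built-in result against the estimate plus or minus z times the standard error, computed by hand.

```r
library(MASS)  # provides glm.nb() and the quine example data

# Illustrative model; the formula is an assumption for this sketch.
model <- glm.nb(Days ~ Eth + Sex + Age + Lrn, data = quine)

# Wald intervals by hand: estimate +/- z * standard error.
est <- coef(model)
se  <- sqrt(diag(vcov(model)))
z   <- qnorm(0.975)
wald_manual <- cbind(lower = est - z * se, upper = est + z * se)

# The same intervals from the default confint method, called explicitly
# so that the profile-likelihood method is bypassed.
wald_builtin <- confint.default(model)

print(wald_builtin)
print(all.equal(unname(wald_manual), unname(wald_builtin)))
```

These intervals rely on the asymptotic normal approximation, so they can differ noticeably from the profile-likelihood intervals in small samples.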
I hope this helps,
  John
-- 
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
web: https://www.john-fox.ca/
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.