Hello,

A colleague of mine has compared the runtime of a linear model + anova in SAS and S+. He got the same results, but SAS took a bit more than a minute whereas S+ took 17 minutes. I've tried it in R (1.9.0) and it took 15 min. Neither machine ran out of memory, and I assume that all machines have similar hardware, but the S+ and SAS machines are on Windows whereas the R machine is Red Hat Linux 7.2.

My question is whether I'm doing something wrong (technically) in calling the lm routine, or (if not) how I can optimize the call to lm, or even use an alternative to lm. I'd like to run about 12,000 of these models in R (for a gene expression experiment - one model per gene), which would take far too long.

I've run the following code in R (and S+):

> options(contrasts=c('contr.helmert', 'contr.poly'))

The 1st column is the value to be modeled, and the others are factors.

> names(df.gene1data) <- c("Va", "Ba", "Ti", "Do", "Ar", "Pr")
> df[c(1:2,1343:1344),]
           Va    Do  Ti  Ba Ar Pr
1    2.317804 000mM 24h NEW  1  1
2    2.495390 000mM 24h NEW  2  1
8315 2.979641 025mM 04h PRG 83 16
8415 4.505787 000mM 04h PRG 84 16

This is a data frame with 1344 rows.

x <- Sys.time();
wlm <- lm(Va ~ Ba+Ti+Do+Pr+Ba:Ti+Ba:Do+Ba:Pr+Ti:Do+Ti:Pr+Do:Pr+
               Ba:Ti:Do+Ba:Ti:Pr+Ba:Do:Pr+Ti:Do:Pr+Ba:Ti:Do:Pr+
               (Ba:Ti:Do)/Ar, data=df, singular=T);
difftime(Sys.time(), x)

Time difference of 15.33333 mins

> anova(wlm)
Analysis of Variance Table

Response: Va
             Df Sum Sq Mean Sq   F value    Pr(>F)
Ba            2    0.1     0.1    0.4262  0.653133
Ti            1    2.6     2.6   16.5055 5.306e-05 ***
Do            4    6.8     1.7   10.5468 2.431e-08 ***
Pr           15 5007.4   333.8 2081.8439 < 2.2e-16 ***
Ba:Ti         2    3.2     1.6    9.8510 5.904e-05 ***
Ba:Do         7    2.8     0.4    2.5054  0.014943 *
Ba:Pr        30   80.6     2.7   16.7585 < 2.2e-16 ***
Ti:Do         4    8.7     2.2   13.5982 9.537e-11 ***
Ti:Pr        15    2.4     0.2    1.0017  0.450876
Do:Pr        60   10.2     0.2    1.0594  0.358551
Ba:Ti:Do      7    1.4     0.2    1.2064  0.296415
Ba:Ti:Pr     30    5.6     0.2    1.1563  0.259184
Ba:Do:Pr    105   14.2     0.1    0.8445  0.862262
Ti:Do:Pr     60   14.8     0.2    1.5367  0.006713 **
Ba:Ti:Do:Pr 105   15.8     0.2    0.9382  0.653134
Ba:Ti:Do:Ar  56   26.4     0.5    2.9434 2.904e-11 ***
Residuals   840  134.7     0.2

The corresponding SAS program from my colleague is:

proc glm data = "the name of the data set";
class B T D A P;
model V = B T D P B*T B*D B*P T*D T*P D*P B*T*D B*T*P B*D*P T*D*P B*T*D*P A(B*T*D);
run;

Note: V = Va, B = Ba, T = Ti, D = Do, P = Pr, A = Ar of the R example.

kind regards + thanks a lot for your help,

	Arne
The way to time things in R is system.time().

Without knowing much more about your problem we can only guess where R is spending the time. But you can find out by profiling -- see `Writing R Extensions'.

If you want multiple fits with the same design matrix (do you?) you could look at the code of lm and call lm.fit repeatedly yourself.

On Mon, 10 May 2004 Arne.Muller at aventis.com wrote:

> A colleague of mine has compared the runtime of a linear model + anova
> in SAS and S+. He got the same results, but SAS took a bit more than a
> minute whereas S+ took 17 minutes. I've tried it in R (1.9.0) and it
> took 15 min. [...]

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
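A minimal sketch of the lm.fit approach suggested here, assuming the design matrix really is shared across all genes. The matrix `responses` (1344 rows, one gene per column) is hypothetical -- it stands in for however the 12,000 expression profiles are stored:

    ## Build the shared model matrix once; `df` and the formula are those
    ## from the original post.
    X <- model.matrix(~ Ba+Ti+Do+Pr+Ba:Ti+Ba:Do+Ba:Pr+Ti:Do+Ti:Pr+Do:Pr+
                        Ba:Ti:Do+Ba:Ti:Pr+Ba:Do:Pr+Ti:Do:Pr+Ba:Ti:Do:Pr+
                        (Ba:Ti:Do)/Ar, data = df)
    fits <- vector("list", ncol(responses))
    for (i in seq(along = fits)) {
        ## lm.fit skips formula parsing and model-matrix construction;
        ## it returns coefficients, effects, rank, qr, df.residual, etc.,
        ## but not a classed "lm" object.
        fits[[i]] <- lm.fit(X, responses[, i])
    }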
Hello,

thanks for your reply. I've now done the profiling, and I interpret that most of the time is spent in the Fortran routine(s):

Each sample represents 0.02 seconds.
Total run time: 920.219999999453 seconds.

Total seconds: time spent in function and callees.
Self seconds: time spent in function alone.

   %      total      %       self
 total   seconds    self   seconds   name
100.00    920.22    0.02      0.16   "lm"
 99.96    919.88    0.10      0.88   "lm.fit"
 99.74    917.84   99.74    917.84   ".Fortran"
  0.07      0.66    0.02      0.14   "storage.mode<-"
  0.06      0.52    0.00      0.00   "eval"
  0.06      0.52    0.04      0.34   "as.double"
  0.02      0.22    0.02      0.22   "colnames<-"
  0.02      0.20    0.02      0.20   "structure"
  0.02      0.18    0.02      0.18   "model.matrix.default"
  0.02      0.18    0.02      0.18   "as.double.default"
  0.02      0.18    0.00      0.00   "model.matrix"
  0.01      0.08    0.01      0.08   "list"

   %       self      %      total
 self    seconds   total   seconds   name
 99.74    917.84   99.74    917.84   ".Fortran"
  0.10      0.88   99.96    919.88   "lm.fit"
  0.04      0.34    0.06      0.52   "as.double"
  0.02      0.22    0.02      0.22   "colnames<-"
  0.02      0.20    0.02      0.20   "structure"
  0.02      0.18    0.02      0.18   "as.double.default"
  0.02      0.18    0.02      0.18   "model.matrix.default"
  0.02      0.16  100.00    920.22   "lm"
  0.02      0.14    0.07      0.66   "storage.mode<-"
  0.01      0.08    0.01      0.08   "list"

I guess this actually means I cannot do anything about it ... other than maybe splitting the problem into different independent parts (which I actually may be able to do).

Regarding the usage of lm.fit instead of lm, this might be a good idea, since I am using the same model matrix for all fits! However, I'd need to recreate an lm object from the output, because I'd like to run the anova function on it. I'll first do some profiling on lm versus lm.fit for the 12,000 models ...

kind regards + thanks again for your help,

	Arne

-- 
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com

> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> Sent: 11 May 2004 09:08
> To: Muller, Arne PH/FR
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] R versus SAS: lm performance
>
> The way to time things in R is system.time(). [...]
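A profile like the one shown above can be collected with R's sampling profiler; a minimal sketch, using the lm call from the original post (the output file name is arbitrary, and the summary tables are of the kind `R CMD Rprof' prints for such a file):

    Rprof("Rprof.out")    # start sampling (one sample every 0.02 seconds)
    wlm <- lm(Va ~ Ba+Ti+Do+Pr+Ba:Ti+Ba:Do+Ba:Pr+Ti:Do+Ti:Pr+Do:Pr+
                   Ba:Ti:Do+Ba:Ti:Pr+Ba:Do:Pr+Ti:Do:Pr+Ba:Ti:Do:Pr+
                   (Ba:Ti:Do)/Ar, data=df, singular=T)
    Rprof(NULL)           # stop profiling
    ## then summarise at the shell:  R CMD Rprof Rprof.out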
<Arne.Muller at aventis.com> writes:

> A colleague of mine has compared the runtime of a linear model + anova
> in SAS and S+. He got the same results, but SAS took a bit more than a
> minute whereas S+ took 17 minutes. [...]

As Brian Ripley mentioned, you could save the model matrix and use it with each of your responses.

Versions 0.8-1 and later of the Matrix package have a vignette that provides comparative timings of various ways of obtaining the least squares estimates. If you use the classes from the Matrix package and create and save the crossproduct of the model matrix

  mm = as(model.matrix(Va ~ Ba+Ti..., df), "geMatrix")
  cprod = crossprod(mm)

then successive calls to

  coef = solve(cprod, crossprod(mm, df$Va))

will produce the coefficient estimates much faster than will calls to lm, which each do all the work of generating and decomposing the very large model matrix.

Note that this method only produces the coefficient estimates, which may be enough for your purposes. Also, this method will not handle missing data or rank-deficient model matrices in the elegant way that lm does.

If you are doing this 12,000 times it may be worthwhile checking if the sparse matrix formulation

  mmS = as(mm, "cscMatrix")
  cprodS = crossprod(mmS)

is faster.

The dense matrix formulation (but not the sparse) can benefit from installation of optimized BLAS routines such as Atlas or Goto's BLAS.

-- 
Douglas Bates                            bates at stat.wisc.edu
Statistics Department                    608/262-2598
University of Wisconsin - Madison        http://www.stat.wisc.edu/~bates/
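A sketch of how these fragments might be assembled into one pass over all genes, using plain base-R matrices rather than the Matrix classes. The matrix `responses` (one gene per column) is hypothetical, and, as noted above, the normal-equations solve assumes a full-rank model matrix and no missing values:

    form  <- Va ~ Ba+Ti+Do+Pr+Ba:Ti+Ba:Do+Ba:Pr+Ti:Do+Ti:Pr+Do:Pr+
                  Ba:Ti:Do+Ba:Ti:Pr+Ba:Do:Pr+Ti:Do:Pr+Ba:Ti:Do:Pr+
                  (Ba:Ti:Do)/Ar
    mm    <- model.matrix(form, df)   # built once for all genes
    cprod <- crossprod(mm)            # X'X, also computed once
    ## X'Y for every gene at once (responses holds one gene per column);
    ## a single solve then yields the full matrix of coefficient estimates.
    coefs <- solve(cprod, crossprod(mm, responses))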
I tried the following on an Opteron 248, R-1.9.0 w/Goto's BLAS:

> y <- matrix(rnorm(14000*1344), 1344)
> x <- matrix(runif(1344*503), 1344)
> system.time(fit <- lm(y~x))
[1] 106.00  55.60 265.32   0.00   0.00

The resulting fit object is over 600MB. (The coefficient component is a 504 x 14000 matrix.)

If I'm not mistaken, SAS sweeps on the extended cross product matrix to fit regression models. That, I believe, is usually faster than doing QR decomposition on the model matrix itself, but there are trade-offs. You could try what Prof. Bates suggested.

Andy

> From: Arne.Muller at aventis.com
>
> thanks for your reply. I've now done the profiling, and I interpret
> that most of the time is spent in the Fortran routine(s): [...]
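Andy's timing exploits the fact that lm accepts a matrix response, fitting every column against one design in a single decomposition. If the 600MB fit object is the obstacle, the same QR decomposition can be reused directly with base-R helpers; a minimal sketch, reusing the x and y from Andy's example:

    qx    <- qr(x)             # decompose the design matrix once
                               # (use cbind(1, x) if an intercept is wanted)
    coefs <- qr.coef(qx, y)    # coefficients for every response column
    res   <- qr.resid(qx, y)   # residual matrix from the same decomposition
    rss   <- colSums(res^2)    # per-gene residual sums of squares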
Thanks all for your help. There seems to be a lot I can try to speed up the fits.

However, I'd like to go for a much simpler model, which I think is justified by the experiment itself; e.g. I may think about removing the nesting "(Ba:Ti:Do)/Ar". The model matrix has 1344 rows and 2970 columns, and the rank of the matrix is 504. Therefore I think I should reformulate the model. I was just struck by the massive difference in performance when my colleague told me about the difference between SAS and S+.

kind regards,

	Arne

-- 
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com

> -----Original Message-----
> From: Liaw, Andy [mailto:andy_liaw at merck.com]
> Sent: 11 May 2004 14:20
> To: Muller, Arne PH/FR; ripley at stats.ox.ac.uk
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] R versus SAS: lm performance
>
> I tried the following on an Opteron 248, R-1.9.0 w/Goto's BLAS: [...]
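The rank deficiency Arne describes is easy to confirm from the fitted model itself; a small sketch using the wlm object from the original post:

    mm <- model.matrix(wlm)   # the 1344 x 2970 model matrix of the fit
    dim(mm)                   # 1344 rows, 2970 columns
    qr(mm)$rank               # 504: most columns are linearly dependent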