I have been timing a particular model fit using lmer on several different computers and came up with a peculiar result: the model fit is considerably slower on a dual-core Athlon 64 using Goto's multithreaded BLAS than on a single-core processor.

Here is the timing on a single-core Athlon 64 3000+ running under today's R-devel with version 0.995-5 of the Matrix package.

> library(Matrix)
> data(star, package = 'mlmRev')
> system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star, control = list(nit=0,grad=0,msV=1)))
  0 241720.: 1.16440 0.335239 0.00000 1.78732 0.867209 0.382318 0.00000
  1 239722.: 1.94952 5.00000e-10 0.00933767 1.65999 0.858003 0.341520 0.00908757
  2 239580.: 1.95924 0.0884059 0.00933767 1.65308 0.857487 0.339296 0.00954718
  3 239215.: 2.60877 0.0765848 0.0177699 1.45739 0.843562 0.275100 0.0236849
  4 239204.: 2.62582 0.106670 0.0239698 1.41976 0.841086 0.261033 0.0267073
  5 239176.: 2.63149 0.0787924 0.0367185 1.37952 0.838527 0.245076 0.0301134
  6 239141.: 2.64949 0.108534 0.0594586 1.28846 0.832543 0.208404 0.0375456
  7 239049.: 2.64794 0.0789214 0.121782 1.10436 0.819711 0.126101 0.0524965
  8 239004.: 2.66044 0.117957 0.181505 0.932202 0.798982 0.0718116 0.0628958
  9 238944.: 2.66310 0.0819653 0.334477 0.631735 0.740855 0.258359 0.0806590
 10 238893.: 2.72626 0.0975205 0.653432 0.703912 0.666067 0.109922 0.201809
 11 238892.: 2.74381 0.111146 0.666155 0.693544 0.662000 0.104060 0.207591
 12 238888.: 2.75052 0.0990238 0.689199 0.694588 0.655781 0.106516 0.216460
 13 238861.: 2.80325 0.126935 1.05912 0.733914 0.556162 0.159296 0.360938
 14 238832.: 2.82656 0.117617 1.59471 0.607916 0.441371 0.0749944 0.976142
 15 238811.: 2.87350 0.136332 1.59046 0.653141 0.353763 0.226061 1.79285
 16 238810.: 2.87663 0.125135 1.58992 0.656808 0.352605 0.220488 1.79282
 17 238806.: 2.89342 0.141551 1.58607 0.676523 0.344212 0.181833 1.79268
 18 238804.: 2.90080 0.125137 1.56624 0.682921 0.261295 0.180598 1.74325
 19 238802.: 2.91950 0.128569 1.56836 0.680436 0.336051 0.159940 1.80400
 20 238801.: 2.92795 0.134762 1.56597 0.685121 0.331695 0.145547 1.80414
 21 238801.: 2.93741 0.125667 1.56139 0.687827 0.332700 0.138854 1.81495
 22 238800.: 2.94588 0.131757 1.55294 0.687909 0.330414 0.137834 1.82826
 23 238799.: 2.96867 0.129716 1.52943 0.688678 0.323171 0.139912 1.84615
 24 238799.: 2.98994 0.133378 1.52188 0.700038 0.337387 0.131403 1.77623
 25 238799.: 3.00312 0.135308 1.51475 0.697550 0.311750 0.145683 1.78037
 26 238799.: 3.00461 0.129920 1.51083 0.697666 0.306722 0.138745 1.81188
 27 238799.: 3.00504 0.134882 1.50539 0.696745 0.302949 0.135897 1.84405
 28 238799.: 3.00422 0.134013 1.47947 0.698115 0.303243 0.133806 1.86486
 29 238799.: 3.00819 0.134378 1.48185 0.701871 0.307097 0.134637 1.84996
 30 238799.: 3.01313 0.134279 1.49098 0.702883 0.304788 0.133682 1.86254
 31 238799.: 3.01291 0.134253 1.49060 0.701818 0.303155 0.133771 1.84613
 32 238799.: 3.01142 0.134314 1.48921 0.701782 0.303589 0.134439 1.84653
 33 238799.: 3.01174 0.134315 1.48926 0.701641 0.304120 0.134145 1.84635
 34 238799.: 3.01175 0.134304 1.48942 0.701740 0.303762 0.134185 1.84649
 35 238799.: 3.01173 0.134307 1.48937 0.701724 0.303809 0.134206 1.84647
[1] 43.10  3.78 48.41  0.00  0.00

(If you run the timing yourself and don't want to see the iteration output, take the msV=1 out of the control list. I keep it in there so I can monitor the progress.)
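[For reference, a minimal sketch of the same timing with the iteration trace suppressed, exactly as the parenthetical note describes: only msV = 1 is dropped from the control list, and the packages and data are assumed to be available as above.]

## Same model and control settings as above, but without msV = 1,
## so no iteration output is printed while the optimizer runs.
library(Matrix)
data(star, package = 'mlmRev')
system.time(fm1 <- lmer(math ~ gr + sx + eth + cltype +
                          (yrs | id) + (1 | tch) + (yrs | sch),
                        star, control = list(nit = 0, grad = 0)))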
If I time the same model fit on a dual-core Athlon 64 X2 3800+ with the same version of R, BLAS and Matrix package, the timing ends up with something like

90 140 235 0 0

I do see that the multithreading is working on a calculation that is essentially BLAS-bound, such as

> mm <- Matrix(rnorm(1e6), nc = 1e3)
> system.time(crossprod(mm))
[1] 0.57 0.02 0.60 0.00 0.00

On the X2 processor it still takes about 0.6 seconds user time but only 0.3 seconds elapsed time when the machine is otherwise idle and both cores are available for the calculation.

Any suggestions why the dual-core processor is so much slower than the single-core processor?

By the way, I would be interested in accumulating timings of this model fit on other systems. If you do time it, please send me (off-list) a summary of the version of R, the version of the accelerated BLAS if you use them, the processor speed and configuration (i.e. multiprocessor, multicore, etc.) and, if you know it, the memory speed.

This is an example of a complex multilevel model with crossed grouping factors fit to a relatively large data set (30000 observations on 10000 students, 1400 teachers and 80 schools).
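[For anyone contributing timings, a rough sketch of gathering most of the requested details from within R, to accompany the system.time() result from the call above. The /proc/cpuinfo lookup is Linux-specific and an assumption on my part; the BLAS version and the memory speed still have to be reported by hand.]

sessionInfo()                                   # R version, platform, attached packages
if (file.exists("/proc/cpuinfo")) {             # Linux only: processor model and clock speed
    cpu <- readLines("/proc/cpuinfo")
    print(unique(grep("model name|cpu MHz", cpu, value = TRUE)))
}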
Doug,

This is probably not your reason, but I am finding my dual-core Athlon 64 is much slower running 64-bit Linux and R than it was running 32-bit Linux and R. All the programs are bigger. (Some, like the clock applet, are a lot bigger for no obvious reason.) The difference is enough to put my meager 1GB machine into swapping much more, with the result that it is a lot slower.

Paul

Douglas Bates wrote:
> I have been timing a particular model fit using lmer on several
> different computers and came up with a peculiar result - the model fit
> is considerably slower on a dual-core Athlon 64 using Goto's
> multithreaded BLAS than on a single-core processor.
> ....
Paul,

I think what you're seeing is the performance gap between the 64-bit binary and the 32-bit binary on x86_64. I believe Prof. Ripley has mentioned this several times in the past.

I do remember, back when I was playing with optimized BLAS with R on 32-bit Linux, seeing something similar to what Prof. Bates is seeing: the build that used the threaded BLAS took longer than the non-threaded one (on dual Xeon boxes). I never dug into it more, though.

Andy

From: Paul Gilbert
> This is probably not your reason, but I am finding my dual-core Athlon 64
> is much slower running 64-bit Linux and R than it was running 32-bit
> Linux and R.
> ....
On Fri, 3 Mar 2006, Douglas Bates wrote:

> I have been timing a particular model fit using lmer on several
> different computers and came up with a peculiar result - the model fit
> is considerably slower on a dual-core Athlon 64 using Goto's
> multithreaded BLAS than on a single-core processor.

Is there a Goto BLAS tuned for that chip? I can only see one tuned for an (unspecified) Opteron. L1 and L2 cache sizes do sometimes matter a lot for tuned BLAS, and (according to the AMD site I just looked up) the X2 3800+ has only a 512KB per-core L2 cache. Opterons have a 1MB L2 cache.

Also, the very large system time seen in the dual-core run is typical of what I see when pthreads is not working right, and I suggest you try a limit of one thread (see the R-admin manual). On our dual-processor Opteron 248 that ran in 44 secs instead of 328.

> Here is the timing on a single-core Athlon 64 3000+ running under
> today's R-devel with version 0.995-5 of the Matrix package.
>
>> library(Matrix)
>> data(star, package = 'mlmRev')
>> system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star, control = list(nit=0,grad=0,msV=1)))
> [1] 43.10 3.78 48.41 0.00 0.00
>
> (If you run the timing yourself and don't want to see the iteration
> output, take the msV=1 out of the control list. I keep it in there so
> I can monitor the progress.)
>
> If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
> the same version of R, BLAS and Matrix package, the timing ends up
> with something like
>
> 90 140 235 0 0

....

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
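[A minimal sketch of the single-thread suggestion above. GOTO_NUM_THREADS is, as I understand it from the R Installation and Administration manual, the environment variable that Goto's BLAS consults for its thread count; it is read when the BLAS starts up, so the safest route is to set it in the shell before launching R. This has not been verified on the machines discussed in this thread.]

## In the shell, before starting R (assumption: Goto's BLAS honours this):
##   GOTO_NUM_THREADS=1 R --vanilla

## Then check that the BLAS really is running single-threaded: user time and
## elapsed time for a BLAS-bound calculation should now be roughly equal.
library(Matrix)
mm <- Matrix(rnorm(1e6), nc = 1e3)
system.time(crossprod(mm))

## Finally, repeat the lmer timing from the start of the thread and compare
## against the 90/140/235 figures reported for the dual-core machine.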