g.russell at eos-finance.com
2006-Aug-30 12:47 UTC
[R] Antwort: Buying more computer for GLM
Hello,

at the moment I am doing quite a lot of regression, especially logistic regression, on 20000 or more records with 30 or more factors, using the "step" function to search for the model with the smallest AIC. This takes a lot of time on this 1.8 GHz Pentium box. Memory does not seem to be such a big problem; not much swapping is going on and CPU usage is at or close to 100%.

What would be the most cost-effective way to speed this up? The obvious way would be to get a machine with a faster processor (3 GHz plus), but I wonder whether it might instead be better to run a dual-processor machine or something like that; this looks at least like a problem R should be able to parallelise, though I don't know whether it does.

Thanks for your help,

George Russell
g.russell at eos-finance.com writes:

> Hello,
>
> at the moment I am doing quite a lot of regression, especially
> logistic regression, on 20000 or more records with 30 or more
> factors, using the "step" function to search for the model with the
> smallest AIC. This takes a lot of time on this 1.8 GHZ Pentium
> box. Memory does not seem to be such a big problem; not much
> swapping is going on and CPU usage is at or close to 100%. What
> would be the most cost-effective way to speed this up? The
> obvious way would be to get a machine with a faster processor (3GHz
> plus) but I wonder whether it might instead be better to run a dual-
> processor machine or something like that; this looks at least like a
> problem R should be able to parallelise, though I don't know whether it
> does.

Is this floating point bound? (When you say 30 factors, does that mean 30 parameters, or factors representing a much larger number of groups?)

If it is integer bound, I don't think you can do much better than increase CPU speed and - note - memory bandwidth (look for large-cache systems and a fast front-side bus).

To increase floating point performance, you might consider using an optimized BLAS such as ATLAS (see the Windows FAQ 8.2 and/or the "R Installation and Administration" manual); this in turn may be multithreaded and make use of multiple CPUs or multi-core CPUs.

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
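The two questions in the reply above (how many parameters the 30 factors really amount to, and whether the work is floating point bound) can be checked with a short sketch like the one below. The simulated data frame, the 4 levels per factor, and all variable names are assumptions for illustration only; they are not taken from the original poster's data.

```r
# Hypothetical stand-in for the real data: 20000 records and
# 30 factors with 4 levels each (the level count is an assumption).
set.seed(42)
n <- 20000
k <- 30
d <- as.data.frame(lapply(seq_len(k), function(i) {
  factor(sample(letters[1:4], n, replace = TRUE))
}))
names(d) <- paste0("f", seq_len(k))

# "30 factors" versus "30 parameters": the main-effects model matrix
# has one intercept column plus (levels - 1) columns per factor.
X <- model.matrix(~ ., data = d)
n_par <- ncol(X)
print(n_par)

# Rough probe of floating point throughput: crossprod() is handed to
# the BLAS, so its timing should change when an optimized BLAS is
# installed. In recent R versions sessionInfo() reports which BLAS
# library is in use.
t_cp <- system.time(XtX <- crossprod(X))
print(t_cp)
```

If the timing of `crossprod()` barely changes after switching BLAS, the bottleneck is probably elsewhere (for example in building model matrices at each `step` iteration), which is where CPU speed and memory bandwidth matter more.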
Please look at http://boinc.berkeley.edu/

Your problem seems to be similar to the ones for which BOINC is used. I am not sure how to do this with R, though. Maybe other people on this list can help.

Anupam.
George,

Logistic regression with ONLY factors?

In principle this can be solved by casting it as a log-linear model of counts and using iterative proportional fitting.

For sparse data like yours (i.e. a table with 20000 counts and >= 2^31 cells), it will be necessary to use a method that does not explicitly operate on the table of counts, as loglin() does. I would guess that rake() in the survey package would handle this, but I've not looked at the code it uses.

If you are only using a fraction of the factors, then loglm() (in MASS) or loglin() may suffice.

HTH,

Chuck

Charles C. Berry                        (858) 534-2098
                                        Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu        UC San Diego
http://biostat.ucsd.edu/~cberry/        La Jolla, San Diego 92093-0717
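As a small illustration of the log-linear recasting suggested above, the sketch below fits the same main-effects model twice: once as a logistic regression with glm(), and once as a log-linear model with loglin(), which uses iterative proportional fitting. The simulated data, the two factors `a` and `b`, and the coefficient values are assumptions made up for the example; with only two factors the table is tiny, so this shows the equivalence, not the sparse 30-factor case.

```r
# Simulated data (assumed): binary response y, factors a (3 levels)
# and b (2 levels).
set.seed(1)
n <- 2000
d <- data.frame(
  a = factor(sample(c("a1", "a2", "a3"), n, replace = TRUE)),
  b = factor(sample(c("b1", "b2"), n, replace = TRUE))
)
eta <- -0.5 + 0.8 * (d$a == "a2") - 0.4 * (d$a == "a3") + 0.6 * (d$b == "b2")
d$y <- factor(rbinom(n, 1, plogis(eta)))

# Logistic regression with main effects only.
fit_glm <- glm(y ~ a + b, family = binomial, data = d)

# Equivalent log-linear model on the y x a x b table of counts.
# Margins: [a b] (dims 2,3), [y a] (dims 1,2), [y b] (dims 1,3).
tab <- table(y = d$y, a = d$a, b = d$b)
fit_ll <- loglin(tab, margin = list(c(2, 3), c(1, 2), c(1, 3)),
                 fit = TRUE, eps = 1e-8, iter = 200, print = FALSE)

# P(y = 1 | a, b) from the fitted table, as a 3 x 2 matrix (a by b).
p_ll <- fit_ll$fit["1", , ] / (fit_ll$fit["0", , ] + fit_ll$fit["1", , ])

# The same probabilities from the logistic fit; expand.grid() varies
# a fastest, matching the column-major order of p_ll.
grid <- expand.grid(a = levels(d$a), b = levels(d$b))
p_glm <- predict(fit_glm, newdata = grid, type = "response")

max_diff <- max(abs(as.vector(p_ll) - p_glm))
print(max_diff)
```

The fitted probabilities should agree up to the convergence tolerance of the iterative fitting. For the full problem the table of counts would be far too large to store, which is why a method that avoids forming the table is needed there.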