Hello, it is likely that I will have to analyze a rather sizeable dataset: 60,000 records with 10 to 15 variables. I will have to compute descriptive statistics and estimate linear models, GLMs, and maybe a Cox proportional hazards model with time-varying covariates. In theory this is possible in R, but I would like some feedback on the equipment I should get for this. At the moment I have a Pentium III laptop running Windows 2000 with 384 MB of RAM. What CPU speed and/or how much memory should I get?

Thanks for some ideas,
Ruud
"Ruud H. Koning" <info at rhkoning.com> writes:> Hello, it is likely that I will have to analyze a rather sizeable dataset: > 60000 records, 10 to 15 variables. I will have to make descriptive > statistics, and estimate linear models, glm's and maybe Cox proportional > hazard model with time varying covariates. In theory, this is possible in > R, but I would like to get some feedback on the equipment I should get for > this. At this moment, I have a Pentium 3 laptop running windows 2000 with > 384MB ram. What type of cpu-speed and/or how much memory should I get? > Thanks for some ideas, RuudExcept for the time-varying Cox thing, this doesn't seem too hard:> d <- as.data.frame(matrix(rnorm(60000*15),60000,15)) > names(d)[1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12" [13] "V13" "V14" "V15"> system.time(lm(V15~.,data=d))[1] 2.62 0.61 3.24 0.00 0.00> gc()used (Mb) gc trigger (Mb) Ncells 431614 11.6 741108 19.8 Vcells 1079809 8.3 6817351 52.1 That's on the fastest machine I have access to, a 2.8GHz Xeon (Dual, but not with threaded BLAS lib). About three times slower on a 900 MHz PIII. For GLM you'll do similar operations iterated say 5 times, and if you have factors and interactions among your predictors, you'll get essentially an increase proportional to the number of parameters in the model. Time-dependent Cox in full generality has complexity proportional to the square of the data set (one regression computation per death) which could be prohibitive, but there are often simplifications, depending on the nature of the time dependency. -- O__ ---- Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
Ruud -

Use your existing machine. Here's a rough calculation: 60,000 rows x 15 columns x 8 bytes = 7.2 MB per copy of the data set, times 10-20 copies of the data set in memory while you do the calculations = 72-144 MB of memory required. Is it 12 bytes per double instead of 8 in this implementation of the S language? (I think it is 12 for S-PLUS.) Have I missed a factor of 10 somewhere? I think you should be okay with your existing machine. Close other processes when you do the analysis.

- tom blackwell - u michigan medical school - ann arbor -

On Wed, 23 Apr 2003, Ruud H. Koning wrote:

> Hello, it is likely that I will have to analyze a rather sizeable dataset:
> 60000 records, 10 to 15 variables. I will have to make descriptive
> statistics, and estimate linear models, glm's and maybe Cox proportional
> hazard model with time varying covariates. In theory, this is possible in
> R, but I would like to get some feedback on the equipment I should get for
> this. At this moment, I have a Pentium 3 laptop running windows 2000 with
> 384MB ram. What type of cpu-speed and/or how much memory should I get?
> Thanks for some ideas, Ruud
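[As a quick sanity check on that arithmetic, not part of the original exchange, R can report the size of such a data frame directly; the exact figure varies slightly with R version and data-frame overhead.]

## 60,000 rows x 15 doubles x 8 bytes is about 7.2 MB per copy; R's
## fitting functions typically hold several working copies at once.
d <- as.data.frame(matrix(rnorm(60000 * 15), 60000, 15))
print(object.size(d), units = "Mb")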
"Ruud H. Koning" <info at rhkoning.com> writes:> Hello, it is likely that I will have to analyze a rather sizeable dataset: > 60000 records, 10 to 15 variables. I will have to make descriptive > statistics, and estimate linear models, glm's and maybe Cox proportional > hazard model with time varying covariates. In theory, this is possible in > R, but I would like to get some feedback on the equipment I should get for > this. At this moment, I have a Pentium 3 laptop running windows 2000 with > 384MB ram. What type of cpu-speed and/or how much memory should I get? > Thanks for some ideas, RuudIf you are buying a new computer then CPU speed will be less important than memory. Most new computers using Intel or AMD processors have CPU speeds in excess of 1 GHz - frequently 2 GHz or more. You probably wouldn't be able to notice the difference between a 1.8 GHz processor and a 2.8 GHz processor for most work in R so paying extra for a nominally faster processor is not a good bargain. Try to get as much memory as your budget will allow. For a laptop that may be 512 MB. For a desktop computer consider 1 GB. We have fit linear mixed-effects models on 300,000 observations and with 40 or 50 columns in the model matrix in about 5 minutes on a 2.0 GHz machine with 1 GB memory using recent versions of R. Linear models and glms should be faster than this.