hardworker
2012-Jan-03 05:09 UTC
[R] Biglm source code alternatives (E.g. Call to Fortran)
Hi everyone,

I have been looking at the bigglm() command (which fits generalised linear models on big data; it is in the biglm package) and I have done some profiling on it. Fitting a GLM on a 100 MB file (a 9-million-row by 5-column matrix, where most of the entries are randomly generated 0s, 1s or 2s) took about 2 minutes on a Linux machine with 8 GB of RAM and 4 cores. Ideally I want this to run much quicker, in roughly 30 to 60 seconds. The profiling output shows:

> summaryRprof('method2.out')$by.self
                        self.time self.pct total.time total.pct
"model.matrix.default"      24.84     19.4      26.40      20.6
".Call"                     21.00     16.4      21.00      16.4
"as.character"              17.92     14.0      17.92      14.0
"[.data.frame"              14.04     11.0      22.54      17.6
"*"                          6.44      5.0       6.44       5.0
"update.bigqr"               5.34      4.2      15.32      12.0
"-"                          4.52      3.5       4.52       3.5
"anyDuplicated.default"      4.12      3.2       4.12       3.2
"/"                          3.76      2.9       3.76       2.9
"attr"                       3.26      2.5       3.26       2.5
"|"                          2.96      2.3       2.96       2.3
"unclass"                    2.82      2.2       2.82       2.2
"na.omit"                    2.42      1.9      17.18      13.4
"sum"                        2.02      1.6       2.02       1.6

On further investigation, the .Call to Fortran appears to be slow. It is used in the coef.bigqr.R and singcheck.bigqr.R files of the biglm package. Is there an alternative way to implement the call to Fortran? I had thought that matrix inversion and QR/Cholesky decomposition could be done much faster in low-level languages like Fortran, so I was surprised by the 21 seconds spent there.

Furthermore, are there any other packages or platforms I could use to speed up the as.character or model.matrix steps? My expertise in R is very limited, but I realise R also has the ability to do parallel computing. Is that a possible way to run a GLM on a big dataset quickly? Alternatively I could increase memory and add more cores, but this is not really a long-term solution, as I know I will eventually work with bigger datasets.
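One thing I have wondered about, given that model.matrix.default dominates my profile, is whether building the numeric design matrix myself and calling glm.fit() directly would avoid some of that overhead. Here is a small-scale sketch of what I mean; the sizes and variable names are placeholders standing in for my real 9-million-row file, not my actual code:

```r
## Small stand-in for my data: random 0/1/2 predictors, binary response.
## (n is a tiny placeholder for the real 9 million rows.)
set.seed(42)
n <- 10000
X <- cbind(Intercept = 1,
           matrix(sample(0:2, n * 4, replace = TRUE), ncol = 4,
                  dimnames = list(NULL, paste0("x", 1:4))))
y <- rbinom(n, 1, 0.5)

## glm.fit() takes the numeric design matrix directly, bypassing the
## formula/model.matrix machinery that shows up in the profile.
fit <- glm.fit(X, y, family = binomial())
round(fit$coefficients, 3)
```

I do not know whether the bigqr updating in biglm can be fed a pre-built matrix in the same way; if someone can confirm that, it would be very helpful.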
In fact, GLM is such a common tool that I think it would benefit a lot of people in the R community if it could be run more quickly on bigger data using existing packages such as ff, doMC, parallel, biglm and bigmemory.

Your help would be greatly appreciated,

hardworker

--
View this message in context: http://r.789695.n4.nabble.com/Biglm-source-code-alternatives-E-g-Call-to-Fortran-tp4255774p4255774.html
Sent from the R help mailing list archive at Nabble.com.