Pradheep K E
2008-Aug-09 16:06 UTC
[R] Reading large datasets and fitting logistic models in R
Hi R-experts,

Does anyone have experience using R for handling large-scale data (millions of rows, hundreds or thousands of features)?

What is the largest size of data that anyone has used with glm?

Also, is there a library to read data in sparse data format (like SVMlight format)?

Thanks,
Pradheep
Prof Brian Ripley
2008-Aug-10 06:18 UTC
[R] Reading large datasets and fitting logistic models in R
See also bigglm() in package biglm.

On Sat, 9 Aug 2008, Pradheep K E wrote:

> Hi R-experts,
>
> Does anyone have experience using R for handling large scale data (millions
> of rows, hundreds or thousands of features)?
>
> What is the largest size of data that anyone has used with glm?

I've used 700,000 rows and about 100 cols, but it was 4 years ago and we have more memory now. It matters if the 'features' are numeric or categorical, as the latter can expand to many columns in the model matrix. As a rough guide, expect to need 200x as much memory in bytes as nrows x ncols. Using glm.fit will be more efficient (I've just tested 100,000 x 100, which used 1.2 GB).

> Also, is there a library to read data in sparse data format (like SVMlight
> format)?

You mean *store* data in a sparse format when read in? I'm not sure of the relevance, but look at the function method for bigglm for a way to avoid even doing that. If the data are numeric there are at least three sparse-matrix packages on CRAN.

Ultimately R's code such as glm() is designed for flexibility and to do interesting things with the fit: for really large problems you will do better to write a specialized fitting routine. bigglm() is an intermediate position.

There's also the question of whether there are any interesting homogeneous datasets of this sort of size. Often doing analyses on subsets and a meta-analysis is a much more insightful approach (as it was in our problem: we split on one of the categorical explanatory variables).

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
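
For readers who want to try the two concrete suggestions in the reply above (the 200-bytes-per-cell rule of thumb and calling glm.fit directly), here is a minimal sketch using only base R. It is not code from the thread: the helper name `estimate_glm_bytes`, the simulated data, and the problem size are all assumptions chosen for illustration, and the real memory use on any given machine may differ from the estimate.

```r
# Rule of thumb from the reply: roughly 200 bytes of memory per cell of an
# nrows x ncols data set. `estimate_glm_bytes` is a hypothetical helper,
# not a function from R or any package.
estimate_glm_bytes <- function(nrows, ncols, bytes_per_cell = 200) {
  nrows * ncols * bytes_per_cell
}

# 700,000 rows x 100 columns, the size mentioned in the reply.
est_gb <- estimate_glm_bytes(700000, 100) / 1e9   # 14 (GB)

# Small simulated logistic-regression problem (sizes are illustrative only).
set.seed(1)
n <- 2000
p <- 5
X <- cbind(1, matrix(rnorm(n * p), n, p))          # model matrix with intercept
colnames(X) <- c("(Intercept)", paste0("x", seq_len(p)))
beta <- c(-0.5, 1, -1, 0.5, 0, 0)
y <- rbinom(n, size = 1, prob = plogis(drop(X %*% beta)))

# glm.fit() takes the model matrix and response directly, so it skips the
# formula and model-frame handling that glm() does before fitting.
fit_direct <- glm.fit(x = X, y = y, family = binomial())

# The same model through glm(), for comparison.
dat <- data.frame(y = y, X[, -1])
fit_formula <- glm(y ~ ., data = dat, family = binomial())

# Both routes run the same iteratively reweighted least squares fit, so the
# coefficients should agree up to numerical noise.
max_diff <- max(abs(unname(fit_direct$coefficients) -
                    unname(coef(fit_formula))))
```

The saving from glm.fit comes from building the model matrix once yourself rather than having glm() keep the data frame, model frame, and model matrix around together; for categorical predictors you would still need to expand them into columns (for example with model.matrix) before calling it.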