Christoph Lehmann
2004-Dec-13 12:27 UTC
[R] classification for huge datasets: SVM yields memory troubles
Hi, I have a matrix with 30 observations and roughly 30000 variables; each observation belongs to one of two groups. With svm and slda I run into memory trouble ('cannot allocate vector of size', roughly 2G). PCA LDA runs fine. Is there any way to work around the memory issue with SVMs? Or can you recommend another classification method for such huge datasets? P.S. I run SuSE 9.1 on a PIV machine with 2G RAM. Thanks for a hint, Christoph
Andreas
2004-Dec-13 20:56 UTC
[R] classification for huge datasets: SVM yields memory troubles
Hi, I'm a beginner with the SVM module, but I have seen that there is a parameter called cachesize (cache memory in MB, default 40). Please let me know if this parameter solves your problem; I might get the same number of samples in the near future. Regards, Andreas
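For readers who want to try the suggestion, here is a minimal sketch of passing cachesize to svm() from the e1071 package. The data are simulated with the dimensions from the original post, and the linear kernel, scale = FALSE and the value 100 are assumptions made for illustration, not settings anyone in the thread reported. The sketch uses the matrix interface rather than a formula, on the unconfirmed guess that a 30000-term formula is part of the memory cost. With only 30 observations the kernel matrix is at most 30 x 30, so the cache size may well not be the bottleneck.

```r
# Sketch only: simulated data with the dimensions from the original post.
set.seed(1)
n <- 30; p <- 30000
x <- matrix(rnorm(n * p), nrow = n)          # 30 x 30000, about 7 MB of doubles
y <- factor(rep(c("A", "B"), each = n / 2))  # two groups of 15

if (requireNamespace("e1071", quietly = TRUE)) {
  # Matrix interface (x, y) instead of a formula with 30000 terms.
  # kernel, scale and the cachesize value are illustrative assumptions.
  fit <- e1071::svm(x, y, kernel = "linear", scale = FALSE,
                    cachesize = 100)         # kernel cache in MB (default 40)
  print(fit$tot.nSV)                         # total number of support vectors
}
```

If the call still fails with 'cannot allocate vector', that would suggest the memory is being consumed somewhere other than the kernel cache.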
John Maindonald
2004-Dec-14 23:59 UTC
[R] classification for huge datasets: SVM yields memory troubles
While it is true that the large number of variables relative to the number of observations restricts what can be inferred, the situation is not as hopeless as Bert seems to suggest. If it were, attempts at the analysis of expression array data would be a waste of time. Methods developed for that general area may well be relevant to other data where the number of variables is similarly far larger than the number of observations.

See Ambroise, C. and McLachlan, G.J. 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 99: 6562-6566. This discusses some of the literature on the use of SVMs. The selection bias that these authors discuss also affects plots, even principal components and other ordination-based plots, where features have been selected on the basis of their ability to separate the observations into known groups.

I have draft versions of code that addresses this selection bias as it affects the plotting of graphs, which (along with a paper that has been submitted for inclusion in a conference proceedings) I am happy to make available to anyone who wants to experiment.

Another good place to look, as a starting point, may be Gordon Smyth's LIMMA User's Guide. This can be a bit hard to find. With limma installed, type help.start(). After some time a browser window should open. Click on Packages | limma | Overview | LIMMA User's Guide (pdf)

John Maindonald
email: john.maindonald at anu.edu.au
phone: +61 2 (6125)3473
fax: +61 2 (6125)5549
Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
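The selection bias mentioned above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the draft code referred to in the message: the data are pure noise, the number of variables is cut to 2000 to keep it fast, and the variable ranking (a t-like statistic), the nearest-centroid classifier and the function names top_vars, predict_nc and loo_error are all hypothetical choices made for this example. It compares leave-one-out error when variables are selected once using all observations with the error when selection is repeated inside each fold.

```r
set.seed(42)
n <- 30; p <- 2000; k <- 10                  # p reduced from 30000 to keep the sketch fast
x <- matrix(rnorm(n * p), nrow = n)          # pure noise: no real group difference
y <- rep(c(0, 1), each = n / 2)

# Rank variables by an absolute t-like statistic and keep the top k.
top_vars <- function(x, y, k) {
  m1 <- colMeans(x[y == 1, , drop = FALSE])
  m0 <- colMeans(x[y == 0, , drop = FALSE])
  s  <- apply(x, 2, sd)
  order(abs(m1 - m0) / s, decreasing = TRUE)[seq_len(k)]
}

# Nearest-centroid prediction (0 or 1) for one held-out observation.
predict_nc <- function(xtr, ytr, xnew) {
  c1 <- colMeans(xtr[ytr == 1, , drop = FALSE])
  c0 <- colMeans(xtr[ytr == 0, , drop = FALSE])
  as.numeric(sum((xnew - c1)^2) < sum((xnew - c0)^2))
}

# Leave-one-out error, with selection either inside or outside the loop.
loo_error <- function(x, y, k, select_inside) {
  keep_all <- top_vars(x, y, k)              # selection using ALL observations
  wrong <- vapply(seq_len(nrow(x)), function(i) {
    keep <- if (select_inside) {
      top_vars(x[-i, , drop = FALSE], y[-i], k)
    } else {
      keep_all
    }
    predict_nc(x[-i, keep, drop = FALSE], y[-i], x[i, keep]) != y[i]
  }, logical(1))
  mean(wrong)
}

err_outside <- loo_error(x, y, k, select_inside = FALSE)
err_inside  <- loo_error(x, y, k, select_inside = TRUE)
cat("selection outside CV:", err_outside, "\n")
cat("selection inside CV: ", err_inside, "\n")
```

Because the data contain no signal, an honest error estimate should be near 0.5. Selecting variables outside the cross-validation loop typically gives a much lower, optimistic figure, which is the bias Ambroise and McLachlan describe.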
On 14 Dec 2004, at 10:09 PM, r-help-request at stat.math.ethz.ch wrote:
> From: Berton Gunter <gunter.berton at gene.com>
> Date: 14 December 2004 9:23:08 AM
> To: "'Andreas'" <wolf.privat at gmx.de>, <r-help at stat.math.ethz.ch>
> Cc:
> Subject: RE: [R] classification for huge datasets: SVM yields memory troubles
>
> "I have a matrix with 30 observations and roughly 30000 variables, ... <snipped>"
>
> Comment: This is ** not ** a "huge" data set -- it is a tiny one with a
> large number of covariates. The difference is: If it were truly huge, SVM
> and/or LDA or ... might actually be able to produce useful results. With so
> few data and so many variables, it is hard to see how any approach that one
> uses is not simply a fancy random number generator.
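The concern in the quoted comment can be made concrete with a short base R sketch. It is an illustration on simulated noise, not an analysis of the poster's data. The minimum-norm least-squares fit is a stand-in chosen for simplicity and is not an SVM. With 30 observations and 30000 noise variables, a linear rule reproduces randomly assigned labels exactly on the training data, so training accuracy alone says nothing. The fit only needs the 30 x 30 Gram matrix, which is why the number of variables need not drive the memory cost of a linear method.

```r
set.seed(7)
n <- 30; p <- 30000
x <- matrix(rnorm(n * p), nrow = n)          # noise only
y <- sample(rep(c(-1, 1), each = n / 2))     # randomly assigned labels

# Minimum-norm least-squares fit via the n x n Gram matrix:
# w = t(x) %*% solve(x %*% t(x), y), so only a 30 x 30 system is solved.
gram   <- tcrossprod(x)                      # 30 x 30
alpha  <- solve(gram, y)
w      <- crossprod(x, alpha)                # 30000 x 1 coefficient vector
fitted <- drop(x %*% w)

train_acc <- mean(sign(fitted) == y)         # accuracy on the training data

# Fresh noise with fresh random labels: the rule has nothing real to find.
xnew <- matrix(rnorm(n * p), nrow = n)
ynew <- sample(rep(c(-1, 1), each = n / 2))
test_acc <- mean(sign(drop(xnew %*% w)) == ynew)

cat("training accuracy:", train_acc, "\n")
cat("accuracy on new noise:", test_acc, "\n")
```

The training labels are fitted perfectly, while accuracy on new noise can only hover around chance. This is why an honest validation scheme, with any variable selection repeated inside it, matters so much for data of this shape.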