Ranjana Girish
2017-Jul-18 07:37 UTC
[R] Help-Multi class classification for large datasets
Hi all,

We are working on multi-class classification. Currently the ranger package in R can handle up to 1.1 million records. Training on a machine with 128 GB RAM takes 12 days, which is not practically feasible.

In future we will have a dataset of 10 million records, so we are looking for a package or framework that can handle 10 million records with at least 12,000 features.

The package or framework should handle all of the tasks below:

1. Pre-processing of words in the corpus (stopword removal, stemming, removal of special characters)
2. Constructing a document-term matrix
3. Feature selection, e.g. chi-square, information gain, gain ratio
4. Random forest classification, etc.

Kindly let us know of a package or framework that can scale up to 10 million rows and 12,000 columns.
You may get some help here, but you should also do your own homework by looking at the CRAN Machine Learning Task View here:

https://cran.r-project.org/web/views/MachineLearning.html

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
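
For readers of the archive, the first three tasks in the question (pre-processing, document-term matrix, chi-square feature selection) can be sketched at toy scale as below. This is only an illustration under stated assumptions, not a recommendation of a framework and not an answer to the scaling question: it is plain Python with no third-party packages, the stopword list and all function names are made up for the example, stemming and the random forest step are left out, and everything is held in memory, which would not be workable as-is for 10 million rows.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase, strip special characters, drop stopwords (no stemming)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def document_term_matrix(docs):
    """Sparse document-term matrix: one Counter of term counts per document."""
    return [Counter(tokenize(d)) for d in docs]

def chi_square_scores(dtm, labels):
    """Score each term by its largest one-vs-rest chi-square over all classes."""
    n = len(dtm)
    class_sizes = Counter(labels)
    doc_freq = Counter()        # term -> number of documents containing it
    class_doc_freq = Counter()  # (term, class) -> documents containing it
    for row, label in zip(dtm, labels):
        for term in row:
            doc_freq[term] += 1
            class_doc_freq[(term, label)] += 1
    scores = {}
    for term, df in doc_freq.items():
        best = 0.0
        for cls, size in class_sizes.items():
            a = class_doc_freq[(term, cls)]  # in class, has term
            b = df - a                       # other classes, has term
            c = size - a                     # in class, lacks term
            d = n - size - b                 # other classes, lacks term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores

def select_top_terms(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Made-up example corpus with two classes.
docs = [
    "The engine of the car is loud",
    "Car engine repair",
    "The cat is a pet",
    "Pet cat food",
]
labels = ["auto", "auto", "animal", "animal"]

dtm = document_term_matrix(docs)
scores = chi_square_scores(dtm, labels)
print(select_top_terms(scores, 4))
```

The selected terms would then define the columns passed to whatever classifier is used. Storing each document as a sparse mapping rather than a dense row is the point of the sketch: with 12,000 features, most entries of a document-term matrix are zero.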