Dear everyone,

I am new to R and want to do text classification on a very large collection of documents (>500,000) distributed among 300 classes; this collection is my training data. Would someone please be kind enough to tell me which R packages to use and how well they scale in time and space?

I do not yet know which packages are right for this. I started by trying the tm package (http://cran.r-project.org/package=tm) for pre-processing and the FSelector package (http://cran.r-project.org/web/packages/FSelector/index.html) for feature selection, but both are incredibly slow and completely unusable for my task.

So the question is: what are the right packages to use for pre-processing, feature selection, and classification? Please keep in mind that I may be dealing with data of millions of dimensions, which may not even fit in memory.

I have posted on this issue twice (http://r.789695.n4.nabble.com/Entropy-based-feature-selection-in-R-td3708056.html , http://r.789695.n4.nabble.com/R-s-handling-of-high-dimensional-data-td3741758.html) but did not get any response. This is a very critical piece of my research and I have been struggling with it for a long time. Please consider helping me out, either directly or by pointing me to any other software or website that you think may be more appropriate.

Many thanks in advance.

--
View this message in context: http://r.789695.n4.nabble.com/Classifying-large-text-corpora-using-R-tp3786787p3786787.html
Sent from the R help mailing list archive at Nabble.com.
Take a look here: http://www.jstatsoft.org/v25/i05/paper

HTH,
Da.
Daniel Malter wrote:
> Take a look here: http://www.jstatsoft.org/v25/i05/paper

Hi,

Many thanks for your reply. As I mentioned in my e-mail, I have already looked at the tm package. It does not scale well at all. There are also other stages in the pipeline (feature selection, classification, etc.), and I need to find suitable R packages for those as well.

Any other thoughts?

Thanks.
Andy