thr3ads.net - R help - [R] Classifying large text corpora using R [Sep 2011]

andy1234

2011-Sep-02 18:23 UTC

[R] Classifying large text corpora using R

Dear everyone, 

I am new to R, and I would like to do text classification on a large
collection of documents (>500,000) distributed among 300 classes (this is
my training data). Would someone please let me know which R packages to
use and how well they scale in time and space?

I do not know the right packages to use. I started off by trying the tm
package (http://cran.r-project.org/package=tm) for pre-processing and the
FSelector package
(http://cran.r-project.org/web/packages/FSelector/index.html) for feature
selection, but both are incredibly slow and completely unusable for my
task.

So my question is: what are the right packages to use for pre-processing,
feature selection, and classification? Please bear in mind that I may be
dealing with data of millions of dimensions, which may not even fit in
memory.

I posted on this issue twice
(http://r.789695.n4.nabble.com/Entropy-based-feature-selection-in-R-td3708056.html,
http://r.789695.n4.nabble.com/R-s-handling-of-high-dimensional-data-td3741758.html)
but did not get any response. This is a very critical piece of my research
and I have been struggling with this issue for a long time. Please consider
helping me out, directly or by pointing me to any other software/website
that you think may be more appropriate.

Many thanks in advance.

--
View this message in context:
http://r.789695.n4.nabble.com/Classifying-large-text-corpora-using-R-tp3786787p3786787.html
Sent from the R help mailing list archive at Nabble.com.

Daniel Malter

2011-Sep-03 17:34 UTC

Take a look here: http://www.jstatsoft.org/v25/i05/paper

HTH,
Da.



andy1234

2011-Sep-04 01:26 UTC

Hi,

Many thanks for your reply. 

I did in fact mention in my e-mail that I have looked at the tm package. It
does not scale well at all.

Then there are the other stages in the pipeline (feature selection,
classification, etc.), and I need to find suitable R packages for those
as well.

Any other thoughts?

Thanks.
Andy
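
On the memory concern raised in this thread: a document-term matrix for a
corpus of this size is almost entirely zeros, so storing it in sparse form
is one way to keep it manageable. The sketch below is not from any of the
posters. It assumes the Matrix package (a recommended package that ships
with standard R installations), and the three-document corpus and the
variable names (docs, tokens, vocab, dtm) are hypothetical, chosen only for
illustration.

```r
library(Matrix)  # assumption: Matrix is installed (it ships with standard R)

# Hypothetical toy corpus; a real one would be read from disk in chunks.
docs <- c("the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs")

# Lowercase each document and split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab <- sort(unique(unlist(tokens)))

# One (document, term) index pair per token occurrence.
doc_id  <- rep(seq_along(tokens), lengths(tokens))
term_id <- match(unlist(tokens), vocab)

# sparseMatrix() adds up the x values of repeated (i, j) pairs,
# so x = 1 per occurrence yields term counts. Only non-zero cells are stored.
dtm <- sparseMatrix(i = doc_id, j = term_id, x = 1,
                    dims = c(length(docs), length(vocab)),
                    dimnames = list(NULL, vocab))
```

Here dtm has one row per document and one column per term. Whether this
approach is fast enough for 500,000 documents depends on the tokenization
step and on whether the later feature selection and classification code can
work on sparse input directly, which this sketch does not address.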

