search for: reuters21578

Displaying 5 results from an estimated 5 matches for "reuters21578".

2010 Feb 16
0
tm package
Hi, I'm using version 0.5.1 of tm package with R 2.10.1. It looks to me as if after the following reuters21578 <- Corpus(DirSource(corpusDir), readerControl = list(reader = readReut21578XMLasPlain)) reuters21578 <- tm_map(reuters21578, stripWhitespace) reuters21578 <- tm_map(reuters21578, tolower) reuters21578 <- tm_map(reuters21578, removePunctuation) reuters21578 <- tm_map(...
2012 May 29
1
package tm: reading XML files
...e tm for text mining, and have a problem with reading in a corpus from XML files. When I copy the example from "Introduction to the tm package" of the small reuters subset "crude", everything goes well, and I get a corpus with the required meta data. When I read in the entire reuters21578 corpus in XML format however (or a self-created subset thereof) the meta data is lost, and the files are interpreted as plain text. I use the following command, where the indicated directory contains all reuters 21578 documents as separate XML files: > reuters21578 <- Corpus(DirSource("...
2007 Jan 11
0
tm 0.1 uploaded to CRAN
...Corpus Volume 1 dataset, *) Gmane RSS feeds, *) e-mails, and *) several classic file formats (e.g. plain text or CSV text). tm provides easy access to preprocessing and manipulation mechanisms, like *) whitespace removal, *) stemming, or *) conversion between file formats (e.g., Reuters21578 to plain text). Further a generic filter architecture is available in order to *) filter documents for certain criteria, *) or perform fulltext search. The package supports the export from document collections to term-document matrices as frequently used in the text mining literature. Th...
2007 Jan 11
0
tm 0.1 uploaded to CRAN
...Corpus Volume 1 dataset, *) Gmane RSS feeds, *) e-mails, and *) several classic file formats (e.g. plain text or CSV text). tm provides easy access to preprocessing and manipulation mechanisms, like *) whitespace removal, *) stemming, or *) conversion between file formats (e.g., Reuters21578 to plain text). Further a generic filter architecture is available in order to *) filter documents for certain criteria, *) or perform fulltext search. The package supports the export from document collections to term-document matrices as frequently used in the text mining literature. Th...
2006 Nov 04
0
Ferret 0.10.6 released (and some benchmarks)
...has recently been brought to my attention that some people are aware of Ferret but avoid it because they think it is slow. Just to put that myth to rest, here are the outputs for a simple benchmark, indexing the reuters corpus available at: http://www.daviddlewis.com/resources/testcollections/reuters21578/ First Apache Lucene. (Yes Java users, as you can see, I did warm up the JVM (with 6 repetitions of the test) and I used the options -server -Xmx500M -XX:CompileThreshold=100 so this is a fair test). --------------------------------------------------- 1 Secs: 47.09 Docs: 19043 2 Secs: 46.46...