2009 Oct 02
text mining
...us. I have searched the above document as well as related
documentation. Any leads or help would be appreciated. Thanks, everyone.
from document
txt <- system.file("texts", "txt", package = "tm")
(ovid <- Corpus(DirSource(txt),
readerControl = list(reader = readPlain,
language = "la",
load = TRUE)))
my attempt
txt <- system.file("Speeches/speech", "txt", package = "tm")
(ovid <- Corpus(DirSource(txt),
readerControl = list(reader = readPlain,
language = "la",
load = TRUE)))
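One possible reason the attempt above fails: `system.file()` only resolves paths inside an installed package's directory and returns an empty string when nothing matches, so a personal folder such as Speeches/speech will not be found under tm. A minimal base R sketch of that behaviour follows; `speech_dir` and the file `s1.txt` are hypothetical stand-ins created only for illustration, and the `base` package is used so the example runs without tm.

```r
# system.file() searches inside an installed package; a path that does not
# exist there yields "" rather than an error.
missing_path <- system.file("Speeches/speech", "txt", package = "base")
print(missing_path)

# For files outside a package, build an ordinary file-system path instead.
# 'speech_dir' is a hypothetical location created here only for illustration.
speech_dir <- file.path(tempdir(), "Speeches", "speech")
dir.create(speech_dir, recursive = TRUE, showWarnings = FALSE)
writeLines("Four score and seven years ago", file.path(speech_dir, "s1.txt"))

# List the plain text files that a directory-based source would pick up.
txt_files <- list.files(speech_dir, pattern = "\\.txt$", full.names = TRUE)
print(basename(txt_files))
```

With tm loaded, a directory path built this way would presumably be passed straight to `DirSource()`, in place of the `system.file()` result used in the documentation example.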
2011 Jan 24
Extracting information from text data
...data stored in different files, where n is the number of words (say w1, w2, …, wn) and m is the number of documents (say d1, d2, …, dm).
A. Using package tm
I am using package tm to do the job. I have provided the code below:
> my.corpus <- Corpus(DirSource(my.path), readerControl = list (reader=readPlain))
In readLines(y, encoding = x$Encoding) :
incomplete final line found on 'M:\textmine/slr.txt'
> x <- TermDocMatrix(my.corpus)
Error: could not find function "TermDocMatrix"
B. Using package(s) other than tm
Once again, thank you very much for the time you have...
2013 Jan 08
tm: custom reader for readPlain
I have a series of newspaper articles from a Canadian newspaper database (Canadian Newsstand) that look just like below.
I've read through this vignette (http://cran.r-project.org/web/packages/tm/vignettes/extensions.pdf) about creating a custom reader to extract meta-data, but I can't understand how to apply this in the context of a text document, rather than in the tabular format
2019 Feb 12
Reading a txt file in chunks
...what I would like to tell R is "go to where it says time and bring back X lines"
or "go to where it says time and bring back lines until you reach end".
It should actually be fairly easy: all the tables start with time and
finish with end, and they have the same number of rows.
I have been looking at readPlain(), scan(), readfile()... but you can tell them
how many lines to read, not where to start... I think.
Any hint on where I could start looking?
Many thanks.
Jaume Tormo.
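One way to approach the question above is to read the whole file with `readLines()`, locate the marker lines with `grep()`, and parse each block separately. The sketch below uses only base R; the sample file content is invented for illustration, and it assumes every "time" line is matched, in order, by a later "end" line.

```r
# Hypothetical file content: tables that start at a line beginning with "time"
# and finish at a line beginning with "end".
lines <- c("header text", "time a b", "1 10 20", "2 30 40", "end",
           "other text", "time a b", "3 50 60", "4 70 80", "end")
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

all_lines <- readLines(tmp)
starts <- grep("^time", all_lines)   # line numbers where each table begins
ends   <- grep("^end", all_lines)    # line numbers where each table stops

# Parse each block (the "time" header line through the line before "end")
# as a data frame.
tables <- Map(function(s, e) {
  read.table(text = all_lines[s:(e - 1)], header = TRUE)
}, starts, ends)

print(tables[[2]])
```

If every table really has the same number of rows, `all_lines[s:(s + X)]` with a fixed X would serve equally well and makes the `ends` lookup unnecessary.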
2010 Feb 04
How to read HTML or TEXT file with tm package
2009 Jan 15
How to Solve the Error( error:cannot allocate vector of size 1.1 Gb)
...ong increasing a
physical RAM, or doing other recipes, etc?
###### my R Script's Outputs ######
> memory.limit(size = 2000)
> corpus.ko <- Corpus(DirSource("test_konews/"),
+ readerControl = list(reader = readPlain,
+ language = "UTF-8", load = FALSE))
> corpus.ko.nowhite <- tmMap(corpus.ko, stripWhitespace)
> corpus <- tmMap(corpus.ko.nowhite, tmTolower)
> tdm <- TermDocMatrix(corpus)
> findAssocs(tdm, "city", 0.97)
error:cannot allocate vector of size 1.1 Gb
2012 Feb 29
TM reader with text
"<U+FB01>nancier" "<U+FB01>nanci?re" "<U+FB01>nanci?res"
"<U+FB01>nanciers" "<U+FB01>xe"
Some French words are not read correctly by tm with the reader readPlain. I
tried to use reader = readPDF, but it doesn't work, so I had to convert the PDF
text to plain text. Some words are not recognized, so when I use
TermDocumentMatrix a word like inflation disappears. It's a big problem for
me. I have spent a lot of time on this problem; any ideas? Thanks for you...
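The `<U+FB01>` in the output above is the single-character "fi" ligature, which PDF-to-text conversion often leaves in place. A possible workaround, sketched here in base R and not taken from the thread, is to replace the ligatures with plain letters before building the matrix. The word list is a small invented sample.

```r
# "\ufb01" is the single-character ligature "fi" (U+FB01), common in PDF text.
words <- c("\ufb01nancier", "\ufb01xe", "inflation")

# Replace the fi (U+FB01) and fl (U+FB02) ligatures with two plain letters.
from <- c("\ufb01", "\ufb02")
to   <- c("fi", "fl")

clean <- words
for (i in seq_along(from)) {
  clean <- gsub(from[i], to[i], clean, fixed = TRUE)
}
print(clean)
```

The same function could in principle be applied to each document of a corpus before tokenization; whether that also restores the missing terms reported in the thread is not something this sketch can confirm.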
2009 Jan 10
Help needed for Loading "tm" package
...eka.jar", "RWeka.jar"), package =
pkgname, :
Cannot create Java virtual machine (-1)
Error : .onLoad failed in 'loadNamespace' for 'RWeka'
Error: package 'RWeka' could not be loaded
> my.corpurs <-Corpus(DirSource(my.path), readerControl =
Error: could not find function "Corpus"
> my.tdm <- TermDocMatrix(my.corpus)
Error: could not find function "TermDocMatrix"
> my.tdm[1,]
Error: object "my.tdm" not found
2009 Oct 15
Problems with rJava and tm packages
...onLoad failed in 'loadNamespace' for 'rJava'
Error: package/namespace load failed for 'rJava'
> #Set documents directory
> DIR <- "G:/TextSearch/Speeches"
> #Load corpus
> speech <- Corpus(DirSource(DIR), readerControl = list(reader = readPlain,
+ language = "en_US", load = TRUE))
> #Remove stopwords
> speech <- tmMap(speech, stripWhitespace)
> speech
A corpus with 2 text documents
> tdm<-TermDocumentMatrix(speech)
Error in if (!nchar(javahome)) stop("JAVA_HOME is not set and could not be
2009 Mar 30
Help with tm assocation analysis and Rgraphviz installation.
...1’ .
I tried other terms, and no association value is less than 1, which
obviously is wrong.
Could any expert tell me what I did wrong?
My R-code is:
R>my.corpus <- Corpus(DirSource(my.path), readerControl = list
R>tdmO <- TermDocMatrix(my.corpus)
An object of class “TermDocMatrix”
Slot "Data":
2 x 1426 sparse Matrix of class "dgCMatrix"
[[ suppressing 1426 column names ‘000’, ‘0092’, ‘0093’ ... ]]
1 3 1 12 1 1 1 8 1 1 2 1 9 . 2 2 1 518 1 1 1 2 1 1 2 6 1...
2009 Dec 11
readHTML within tm package
...t routine I get an error. When I run
getReaders (below) readHTML isn't listed.
> getReaders()
[1] "readDOC" "readGmane"
[3] "readPDF" "readReut21578XML"
[5] "readReut21578XMLasPlain" "readPlain"
[7] "readRCV1" "readTabular"
Am I missing something? Is there an extra install I'm missing, or has the
routine been removed or replaced?
Thanks, Peter
Oh, yes, running the latest R release on Mac OS 10.6.2
2011 Sep 05
Stemming functions only work on the last word of plain text documents
...n it only stems the last word of each document (the problem occurs with wordStem, and stemDocument does not work at all). An example:
> path <- "c:/path/to/directory" # collection of plain text documents
> corp <- Corpus(DirSource(path), readerControl = list(reader = readPlain, language = "en_US" , load = T))
> inspect(corp)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
running runs runners
2009 Jan 09
[R] how to build TermDocMatrix in tm text mining package of R
Howdy Gurus
I'd like to ask a question about how to build a TermDocMatrix in the tm text
mining package.
It is not clear to me how to import a plain text file and then convert that
text file into a TermDocMatrix.
How can I build a TermDocMatrix of " a plain text document file for text
Or are there any good manuals?
Thank you in advance,
Kum-Hoe Hwang, Ph.D.
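To make concrete what the question above is asking for, here is a base R sketch of what a term-document matrix contains: one row per term, one column per document, and counts in the cells. This is not tm's implementation; the two documents and the simple letter-based tokenizer are assumptions chosen for illustration.

```r
# Two tiny hypothetical documents.
docs <- c(d1 = "the cat sat on the mat", d2 = "the dog sat")

# Tokenize: lower-case, then split on anything that is not a letter.
tokens <- lapply(docs, function(d) {
  w <- strsplit(tolower(d), "[^a-z]+")[[1]]
  w[nzchar(w)]
})
names(tokens) <- names(docs)

# The vocabulary is every distinct term across all documents.
terms <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document; rows = terms, columns = documents.
tdm <- vapply(tokens, function(w) {
  as.vector(table(factor(w, levels = terms)))
}, integer(length(terms)))
rownames(tdm) <- terms

print(tdm)
```

A package such as tm stores the same information in a sparse structure, which matters once the vocabulary reaches thousands of terms.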
2009 Apr 17
question about the Text Mining package tm
...but I inserted a new
line before every occurrence of http.
I ran the following code:
my.path <- 'C:\\dataForR\\textsTweet1\\'
(ovid <- Corpus(DirSource(my.path), readerControl = list(reader = readPlain,
language = "la")))
Response from R:
A text document collection with 3 text documents
Warning message:
In readLines(filename, encoding = encoding) :
incomplete final line found on 'C:\dataForR\textsTweet1\/short.txt'
Then I ran the TermDocMatrix function. It is supposed to tak...
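The "incomplete final line" message in the output above is a warning from `readLines()`, raised when a file does not end with a newline; the text is still read. A small base R sketch, using a temporary file invented for the purpose, reproduces it and shows two ways to avoid it.

```r
tmp <- tempfile(fileext = ".txt")
# Write text WITHOUT a trailing newline to reproduce the situation.
cat("last line has no newline", file = tmp)

# readLines() still returns the full content; it only warns.
warned <- FALSE
x <- withCallingHandlers(
  readLines(tmp),
  warning = function(w) {
    warned <<- TRUE                 # record that a warning was raised
    invokeRestart("muffleWarning")  # then suppress it
  }
)
print(x)

# Option 1: append the missing newline to the file.
cat("\n", file = tmp, append = TRUE)
y <- readLines(tmp)

# Option 2: leave the file alone and silence the check.
z <- readLines(tmp, warn = FALSE)
```

Since the content is read either way, this warning is unlikely to be the cause of any later failure when building the matrix.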
2011 Feb 10
Help using "tm" text mining package - preprocessing
Thanks all for your help. I fear text mining is an abstract little corner of
I have imported 3228 text (.txt) files, each a news story, into R using
textd <- Corpus(DirSource("other/docs"), readerControl = list(reader
I can pre-process each individual document using tolower(textd[[1]])
however, when I try to run tmTolower() I get a no such command error, and
then the Term Document Matrix command gives me a peculiar error:
> other.TDM <- TermDocumentMatrix(textd, control = list(stopwords = TRUE))
2013 Sep 26
R hangs at NGramTokenizer
...ary(tm)))
> invisible(clusterEvalQ(cl, library(RWeka)))
> invisible(clusterEvalQ(cl, library(topicmodels)))
> invisible(clusterEvalQ(cl, library(RTextTools)))
> myCorpus <- Corpus(DirSource("/home/neeph/Test/DMOZ_Business"), encoding="UTF-8", readerControl=list(reader=readPlain))
> removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
> myCorpus <- tm_map(myCorpus, removeURL)
> removeAmp <- function(x) gsub("&", "", x)
> myCorpus <- tm_map(myCorpus, removeAmp)
> removeWWW <- function(x) gsub("...
2009 Oct 13
tm: Why does adding local metadata take so long?
...# Use that vector to create a DirSource object
Dir_3compounds <- DirSource(dirName,
pattern = "_.*\\.txt",
ignore.case = TRUE,
encoding = "latin1")
# Read the .txt files into a volatile corpus
Corpus_3compounds <- Corpus(Dir_3compounds,
readerControl = list(reader = readPlain,
language = "en",
load = TRUE))
I have the metadata for these text documents in an Excel table, which
I have read into Metadata_3compounds as follows:
# Read the metadata into a data frame
Metadata_3compounds <- read.xls("/Volumes/RDR Test Documents/
2009 Jul 17
Help with the text mining package (TM)
Dear all, I am writing to ask the following:
I am working on a text mining project and need to import a series of
texts in order to preprocess them, that is, remove the stopwords, do stemming,
remove punctuation marks, etc. I can do the latter with the
datasets that ship with the TM library. What I cannot manage is to import text
from some medium, even though there are functions
2008 Jan 07
glibc detected *** /usr/lib64/R/bin/exec/R: double free or corruption ???? tm package
[1] "2007"
[1] "11"
[1] "26"
$`svn rev`
[1] "43537"
[1] "R"
[1] "R version 2.6.1 (2007-11-26)"
> test <- TextDocCol(DirSource(getwd()), readerControl = list(reader = readPlain, load = TRUE, language = "nl_BE"))
*** glibc detected *** /usr/lib64/R/bin/exec/R: double free or corruption (!prev): 0x0000000022e20680 ***
======= Backtrace: =========
2009 Jan 15
Interface to open source Reporting tools
...pkgname, :
> > Cannot create Java virtual machine (-1)
> > Error : .onLoad failed in 'loadNamespace' for 'RWeka'
> > Error: package 'RWeka' could not be loaded
> >> my.corpurs <-Corpus(DirSource(my.path), readerControl =
> > list(reader=readPlain))
> > Error: could not find function "Corpus"
> >> my.tdm <- TermDocMatrix(my.corpus)
> > Error: could not find function "TermDocMatrix"
> >> my.tdm[1,]
> > Error: object "my.tdm" not found