Displaying 20 results from an estimated 22 matches for "stripwhitespace".
2004 Jul 16
3
Strange (non-deterministic) problem with strsplit
...chr "a*[square box]"
(square box not reproduced here because copy and pasting it seems to
break my web mail)
Can anyone reproduce the problem and/or suggest any solutions?
parseFormula <- function(formula) {
  splitvars <- function(x) {
    strsplit(x, "\\+|\\*")[[1]]
  }
  stripwhitespace <- function(x) {
    gsub("\\s", "", x, perl=TRUE)
  }
  vars <- stripwhitespace(as.character(formula)[3])
  varsplit <- strsplit(vars, "|", fixed=TRUE)[[1]]
  parts <- list(
    y = stripwhitespace(as.character(formula)[2]),
    x = varsplit[1],
    g = varsplit[2]
  )...
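For readers who want to try the idea in this snippet, here is a minimal, self-contained sketch in base R. It is not the poster's full function, which is truncated in this listing. The names `strip_whitespace` and `parse_formula`, the POSIX character class used in place of `perl=T`, and the example formula are all assumptions made for illustration.

```r
# Hypothetical helper: remove every whitespace character from a string.
strip_whitespace <- function(x) gsub("[[:space:]]+", "", x)

# Hypothetical parser for formulas of the form  y ~ a * b | g
parse_formula <- function(formula) {
  # as.character() on a two-sided formula gives c("~", lhs, rhs)
  parts <- as.character(formula)
  rhs <- strip_whitespace(parts[3])
  # split the right-hand side at the conditioning bar, taken literally
  rhs_split <- strsplit(rhs, "|", fixed = TRUE)[[1]]
  list(
    y = strip_whitespace(parts[2]),
    # split the predictors at "+" or "*"
    x = strsplit(rhs_split[1], "[+*]")[[1]],
    # the grouping variable is optional
    g = if (length(rhs_split) > 1) rhs_split[2] else NA_character_
  )
}

res <- parse_formula(y ~ a * b | g)
str(res)
```

Passing `fixed = TRUE` for the bar matters because an unescaped `|` is the alternation operator in a regular expression.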
2009 Oct 15
1
Problems with rJava and tm packages
...>
> #Set documents directory
> DIR <- "G:/TextSearch/Speeches"
>
> #Load corpus
> speech <- Corpus(DirSource(DIR), readerControl = list(reader = readPlain,
+ language = "en_US", load = TRUE))
>
> #Strip extra whitespace
> speech <- tmMap(speech, stripWhitespace)
> speech
A corpus with 2 text documents
> tdm<-TermDocumentMatrix(speech)
Error in if (!nchar(javahome)) stop("JAVA_HOME is not set and could not be
determined from the registry") :
argument is of length zero
Error: .onLoad failed in 'loadNamespace' for 'rJava...
2009 Nov 12
2
package "tm" fails to remove "the" with remove stopwords
...s!
Mark
require(tm)
myDocument <- c("the rain in Spain", "falls mainly on the plain",
"jack and jill ran up the hill", "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))
#########################
text.corp <- tm_map(text.corp, stripWhitespace)
text.corp <- tm_map(text.corp, removeNumbers)
text.corp <- tm_map(text.corp, removePunctuation)
## text.corp <- tm_map(text.corp, stemDocument)
text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
dtm <- DocumentTermMatrix(text.corp)
dtm
dt...
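A note on step order in pipelines like the one above: tm's `stripWhitespace` collapses runs of whitespace into a single blank, so calling it before a word-removal step can leave gaps behind. The base R sketch below illustrates that ordering on a plain character vector. It does not reproduce tm's internals and is not a diagnosis of the problem reported in this thread. The helper name `remove_words` and the short stopword list are assumptions.

```r
# Hypothetical helper: delete whole words using word-boundary anchors.
# Assumes the words contain no regular-expression metacharacters.
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

docs <- c("the rain in Spain", "falls mainly on the plain")

# Lowercase first so that "The" and "the" are treated alike.
cleaned <- remove_words(tolower(docs), c("the", "in", "on"))

# Collapse the gaps left by the removed words, then trim the ends.
cleaned <- trimws(gsub("[[:space:]]+", " ", cleaned))
print(cleaned)
```

Doing the whitespace clean-up last means the removed words leave no double blanks in the result.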
2009 Jan 15
1
How to Solve the Error( error:cannot allocate vector of size 1.1 Gb)
...Script's Outputs ######
###############################
> memory.limit(size = 2000)
NULL
> corpus.ko <- Corpus(DirSource("test_konews/"),
+ readerControl = list(reader = readPlain,
+ language = "UTF-8", load = FALSE))
> corpus.ko.nowhite <- tmMap(corpus.ko, stripWhitespace)
> corpus <- tmMap(corpus.ko.nowhite, tmTolower)
> tdm <- TermDocMatrix(corpus)
> findAssocs(tdm, "city", 0.97)
error:cannot allocate vector of size 1.1 Gb
-------------------------------------------------------------
>
################################
Thanks for your p...
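As general background to allocation errors of this kind, the sketch below estimates how much memory a term-document matrix would need if it were held as a dense numeric matrix. The term and document counts are hypothetical and are not taken from the poster's corpus. The function name `dense_gib` is an assumption.

```r
# Hypothetical estimate: size in GiB of a dense numeric matrix.
# R stores doubles in 8 bytes each.
dense_gib <- function(n_terms, n_docs, bytes_per_cell = 8) {
  n_terms * n_docs * bytes_per_cell / 2^30
}

# Made-up corpus dimensions, for illustration only.
size <- dense_gib(n_terms = 50000, n_docs = 3000)
print(round(size, 2))
```

A sparse representation stores only the non-zero counts, which is why operations that convert to a dense matrix tend to be the ones that run out of memory.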
2013 Jan 15
0
Function failure in tm
...and class(mycorp[[1]]) returns
"PlainTextDocument" "TextDocument" "character"
But now that I've got a corpus, none of the transformation functions work at all. They all return the following error (with the respective function name):
Error in UseMethod("stripWhitespace", x) :
no applicable method for 'stripWhitespace' applied to an object of class "NULL"
I haven't seen this error reported anywhere in the R-list archives. Does anyone have any suggestions?
Yours, Simon Kiss
P.S. The results of sessionInfo() are
R version 2.15.0 (201...
2014 Jul 22
2
Help: Error in `colnames<-`(`*tmp*`, value = c(
...<-iconv(enc2utf8(d1), sub = "byte")
> d2<-readLines(txt2, encoding="UTF-8")
> d2<-iconv(enc2utf8(d2), sub = "byte")
> df<-c(d1,d2)
> corpus<-Corpus(VectorSource(df))
> d<-tm_map(corpus, content_transformer(tolower))
> d<-tm_map(d, stripWhitespace)
> d<-tm_map(d, removePunctuation)
> sw<-readLines("./StopWords.txt", encoding="UTF-8")
> sw<-iconv(enc2utf8(sw), sub="byte")
> d<-tm_map(d, removeWords, sw)
> d<-tm_map(d, removeWords, stopwords("spanish"))
> tdm<-TermDocu...
2014 Jul 29
2
wordcloud and word table [Making progress]
...informe
2013"), row.names=c("2005", "2013"))
ds<- DataframeSource(tmpText)
ds<- DataframeSource(tmpinformes)
corp = Corpus(ds)
corp = tm_map(corp,removePunctuation)
corp = tm_map(corp,content_transformer(tolower))
corp = tm_map(corp,removeNumbers)
corp = tm_map(corp, stripWhitespace)
corp = tm_map(corp, removeWords, sw)
corp = tm_map(corp, removeWords, stopwords("spanish"))
term.matrix<- TermDocumentMatrix(corp)
term.matrix<- as.matrix(term.matrix)
colnames(term.matrix) <- c("Año2005","Año2013")
png(file="Org2005vs2013.png",heig...
2012 Oct 25
2
Text mining
...e"))
tw.corpus = tm_map(tw.corpus, tolower)
tw.corpus = tm_map(tw.corpus, removePunctuation)
tw.corpus = tm_map(tw.corpus, function(x) removeWords(x, c(stopwords("spanish"),"rt")))
tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)
tw.corpus = tm_map(tw.corpus, stripWhitespace)
sw <- readLines("stopwords.es.txt",encoding="UTF-8")
sw = iconv(sw, to="ASCII//TRANSLIT")
tw.corpus = tm_map(tw.corpus, removeWords, sw)
doc.m = TermDocumentMatrix(tw.corpus, control = list(minWordLength = 2))
dm = as.matrix(doc.m)
# calculate the frequency o...
2012 Dec 13
2
Size of the term matrix and memory. tm package
...F-8")
txt = iconv(txt, to="ASCII//TRANSLIT")
# build a corpus
corpus <- Corpus(VectorSource(txt))
# convert to lowercase
corpus <- tm_map(corpus, tolower)
# strip extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# load the custom Spanish stopword file and convert it to ASCII
sw <- readLines("D:/Publico/Documents/TextMinigSpanishResources/Stopwords.es.txt",encoding="...
2014 Jul 28
2
wordcloud and word table
...rmes, pathname) {
> > info.dir<-sprintf("%s/%s", pathname, informes)
> > info.cor<-Corpus(DirSource(directory=info.dir, encoding="UTF-8"))
> > info.cor.cl<-tm_map(info.cor, content_transformer(tolower))
> > info.cor.cl<-tm_map(info.cor.cl, stripWhitespace)
> > info.cor.cl<-tm_map(info.cor.cl,removePunctuation)
> > sw<-readLines("C:/Users/d_2/Documents/StopWords.txt", encoding="UTF-8")
> > sw<-iconv(enc2utf8(sw), sub = "byte")
> > info.cor.cl<-tm_map(info.cor.cl, removeWords, stopw...
2014 Jun 17
2
It is not a tm problem: your doc.corpus is empty
...tolower)
> corpus <- tm_map(corpus, removePunctuation)
> corpus <- tm_map(corpus, removeNumbers)
> corpus <- tm_map(corpus, removeWords, stopwords("english"))
> inspect(doc.corpus[1:2])
> library(SnowballC)
> corpus <- tm_map(corpus, stemDocument)
> corpus <- tm_map(corpus, stripWhitespace)
> inspect(doc.corpus[1:8])
> TDM <- TermDocumentMatrix(corpus)
> TDM
>
> thanks very much in advance!!!
>
> ruben!
> ------------ next part ------------
> An HTML attachment was scrubbed...
> URL: <https://stat.ethz.ch/pipermail/r-help-es/attachments/2014061...
2014 Jul 25
3
wordcloud and word table
...formes/"
>TDM<-function(informes, pathname) {
info.dir<-sprintf("%s/%s", pathname, informes)
info.cor<-Corpus(DirSource(directory=info.dir, encoding="UTF-8"))
info.cor.cl<-tm_map(info.cor, content_transformer(tolower))
info.cor.cl<-tm_map(info.cor.cl, stripWhitespace)
info.cor.cl<-tm_map(info.cor.cl,removePunctuation)
sw<-readLines("C:/Users/d_2/Documents/StopWords.txt", encoding="UTF-8")
sw<-iconv(enc2utf8(sw), sub = "byte")
info.cor.cl<-tm_map(info.cor.cl, removeWords, stopwords("spanish"))
info.tdm<...
2010 Feb 16
0
tm package
Hi,
I'm using version 0.5.1 of tm package with R 2.10.1. It looks to me
as if after the following
reuters21578 <- Corpus(DirSource(corpusDir), readerControl =
list(reader = readReut21578XMLasPlain))
reuters21578 <- tm_map(reuters21578, stripWhitespace)
reuters21578 <- tm_map(reuters21578, tolower)
reuters21578 <- tm_map(reuters21578, removePunctuation)
reuters21578 <- tm_map(reuters21578, removeNumbers)
reuters21578.dtm <- DocumentTermMatrix(reuters21578)
that reuters21578.dtm does not include terms from the Heading...
2011 Apr 18
0
Help with cleaning a corpus
Hi!
I created a corpus and started cleaning it with this piece of code:
txt <-tm_map(txt,removeWords, stopwords("spanish"))
txt <-tm_map(txt,stripWhitespace)
txt <-tm_map(txt,tolower)
txt <-tm_map(txt,removeNumbers)
txt <-tm_map(txt,removePunctuation)
But something happened: some of the documents in the corpus became empty,
which is a problem when I try to make a document term matrix with tfidf.
Is there any way to eliminate automatically...
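One way to approach the empty-document problem raised in this post is to test each cleaned text for remaining content and keep only the non-empty ones. The base R sketch below works on a plain character vector rather than a tm corpus. The vector `docs` and the variable names are assumptions for illustration.

```r
# Hypothetical cleaned texts; two became empty during cleaning.
docs <- c("text mining", "   ", "", "corpus cleaning")

# A document counts as empty if nothing is left after trimming blanks.
is_empty <- !nzchar(trimws(docs))

# Keep the non-empty documents and record which positions were dropped.
kept <- docs[!is_empty]
dropped <- which(is_empty)

print(kept)
print(dropped)
```

Recording the dropped positions makes it possible to remove the matching rows from any metadata kept alongside the texts.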
2013 Sep 26
0
R hangs at NGramTokenizer
...> myCorpus <- tm_map(myCorpus, removeNumbers)
> myCorpus <- tm_map(myCorpus, removePunctuation)
> myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
> myCorpus <- tm_map(myCorpus, removeWords, stopwords("SMART"))
> myCorpus <- tm_map(myCorpus, stripWhitespace)
> myDtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths = c(1,Inf)))
Everything works fine up to this stage if I do not include tokenizing. However, when I run the code with the following alteration:
> dictCorpus <- myCorpus
> myDtm <- DocumentTermMatrix(myCorpus, control...
2014 Jun 18
2
It is not a tm problem: your doc.corpus is empty
...vePunctuation)
> >> corpus <- tm_map(corpus, removeNumbers)
> >> corpus <- tm_map(corpus, removeWords, stopwords("english"))
> >> inspect(doc.corpus[1:2])
> >> library(SnowballC)
> >> corpus <- tm_map(corpus, stemDocument)
> >> corpus <- tm_map(corpus, stripWhitespace)
> >> inspect(doc.corpus[1:8])
> >> TDM <- TermDocumentMatrix(corpus)
> >> TDM
> >>
> >> thanks very much in advance!!!
> >>
> >> ruben!
> >> ------------ next part ------------
> >> An HTML attachment was scrubbed...
> >&...
2011 Mar 24
2
Problem with Snowball & RWeka
Dear Forum,
when I try to use SnowballStemmer() I get the following error message:
"Could not initialize the GenericPropertiesCreator. This exception was
produced: java.lang.NullPointerException"
It seems to have something to do with either Snowball or RWeka; however, I
can't figure out what to do myself. If you could spend 5 minutes of your
valuable time to help me or give me a
2012 Sep 07
1
[LLVMdev] teaching FileCheck to handle variations in order
On 9/7/2012 12:12 PM, Krzysztof Parzyszek wrote:
> On 9/7/2012 7:20 AM, Matthew Curtis wrote:
>>
>> The attached patch implements one possible solution. It introduces a
>> position stack and a couple of directives:
>>
>> * 'CHECK-PUSH:' pushes the current match position onto the stack.
>> * 'CHECK-POP:' pops the top value off of the stack
2014 Jun 18
3
It is not a tm problem: your doc.corpus is empty
...mbers)
> > > >> corpus <- tm_map(corpus, removeWords, stopwords("english"))
> > > >> inspect(doc.corpus[1:2])
> > > >> library(SnowballC)
> > > >> corpus <- tm_map(corpus, stemDocument)
> > > >> corpus <- tm_map(corpus, stripWhitespace)
> > > >> inspect(doc.corpus[1:8])
> > > >> TDM <- TermDocumentMatrix(corpus)
> > > >> TDM
> > > >>
> > > >> thanks very much in advance!!!
> > > >>
> > > >> ruben!
> > > >> ------------ next part ------------ Se ha bo...