search for: stripwhitespace

Displaying 20 results from an estimated 22 matches for "stripwhitespace".

2004 Jul 16
3
Strange (non-deterministic) problem with strsplit
...chr "a*[square box]" (square box not reproduced here because copy and pasting it seems to break my web mail) Can anyone reproduce the problem and/or suggest any solutions? parseFormula <- function(formula) { splitvars <- function(x) { strsplit(x, "\\+|\\*")[[1]] } stripwhitespace <- function(x) { gsub("\\s", "", x, perl=T) } vars <- stripwhitespace(as.character(formula)[3]) varsplit <- strsplit(vars, "|", fixed=TRUE)[[1]] parts <- list( y = stripwhitespace(as.character(formula)[2]), x = varsplit[1], g = varsplit[2] )...
2009 Oct 15
1
Problems with rJava and tm packages
...> > #Set documents directory > DIR <- "G:/TextSearch/Speeches" > > #Load corpus > speech <- Corpus(DirSource(DIR), readerControl = list(reader = readPlain, + language = "en_US", load = TRUE)) > > #Remove stopwords > speech <- tmMap(speech, stripWhitespace) > speech A corpus with 2 text documents > tdm<-TermDocumentMatrix(speech) Error in if (!nchar(javahome)) stop("JAVA_HOME is not set and could not be determined from the registry") : argument is of length zero Error: .onLoad failed in 'loadNamespace' for 'rJava...
2009 Nov 12
2
package "tm" fails to remove "the" with remove stopwords
...s! Mark require(tm) myDocument <- c("the rain in Spain", "falls mainly on the plain", "jack and jill ran up the hill", "to fetch a pail of water") text.corp <- Corpus(VectorSource(myDocument)) ######################### text.corp <- tm_map(text.corp, stripWhitespace) text.corp <- tm_map(text.corp, removeNumbers) text.corp <- tm_map(text.corp, removePunctuation) ## text.corp <- tm_map(text.corp, stemDocument) text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english"))) dtm <- DocumentTermMatrix(text.corp) dtm dt...
2009 Jan 15
1
How to Solve the Error( error:cannot allocate vector of size 1.1 Gb)
...Script's Outputs ###### ############################### > memory.limit(size = 2000) NULL > corpus.ko <- Corpus(DirSource("test_konews/"), + readerControl = list(reader = readPlain, + language = "UTF-8", load = FALSE)) > corpus.ko.nowhite <- tmMap(corpus.ko, stripWhitespace) > corpus <- tmMap(corpus.ko.nowhite, tmTolower) > tdm <- TermDocMatrix(corpus) > findAssocs(tdm, "city", 0.97) error:cannot allocate vector of size 1.1 Gb ------------------------------------------------------------- > ################################ Thanks for your p...
2013 Jan 15
0
Function failure in tm
...and class(mycorp[[1]]) returns "PlainTextDocument" "TextDocument" "character" But now that I've got a corpus, none of the transformation functions work at all. They all return the following error (with the respective function name) Error in UseMethod("stripWhitespace", x) : no applicable method for 'stripWhitespace' applied to an object of class "NULL" I haven't seen this error reported anywhere in the R-list archives. Does anyone have any suggestions? Yours, Simon Kiss P.S. The results of sessionInfo() are R version 2.15.0 (201...
2014 Jul 22
2
Help: Error in `colnames<-`(`*tmp*`, value = c(
...<-iconv(enc2utf8(d1), sub = "byte") > d2<-readLines(txt2, encoding="UTF-8") > d2<-iconv(enc2utf8(d2), sub = "byte") > df<-c(d1,d2) > corpus<-Corpus(VectorSource(df)) > d<-tm_map(corpus, content_transformer(tolower)) > d<-tm_map(d, stripWhitespace) > d<-tm_map(d, removePunctuation) > sw<-readLines("./StopWords.txt", encoding="UTF-8") > sw<-iconv(enc2utf8(sw), sub="byte") > d<-tm_map(d, removeWords, sw) > d<-tm_map(d, removeWords, stopwords("spanish")) > tdm<-TermDocu...
2014 Jul 29
2
wordcloud and word table [Progressing]
...informe 2013"), row.names=c("2005", "2013")) ds<- DataframeSource(tmpText) ds<- DataframeSource(tmpinformes) corp = Corpus(ds) corp = tm_map(corp,removePunctuation) corp = tm_map(corp,content_transformer(tolower)) corp = tm_map(corp,removeNumbers) corp = tm_map(corp, stripWhitespace) corp = tm_map(corp, removeWords, sw) corp = tm_map(corp, removeWords, stopwords("spanish")) term.matrix<- TermDocumentMatrix(corp) term.matrix<- as.matrix(term.matrix) colnames(term.matrix) <- c("Año2005","Año2013") png(file="Org2005vs2013.png",heig...
2012 Oct 25
2
Text mining
...e")) tw.corpus = tm_map(tw.corpus, tolower) tw.corpus = tm_map(tw.corpus, removePunctuation) tw.corpus = tm_map(tw.corpus, function(x) removeWords(x, c(stopwords("spanish"),"rt"))) tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords) tw.corpus = tm_map(tw.corpus, stripWhitespace) sw <- readLines("stopwords.es.txt",encoding="UTF-8") sw = iconv(sw, to="ASCII//TRANSLIT") tw.corpus = tm_map(tw.corpus, removeWords, sw) doc.m = TermDocumentMatrix(tw.corpus, control = list(minWordLength = 2)) dm = as.matrix(doc.m) # calculate the frequency o...
2012 Dec 13
2
Size of the term matrix and memory. TM package
...F-8") txt = iconv(txt, to="ASCII//TRANSLIT") # build a corpus corpus <- Corpus(VectorSource(txt)) # convert to lowercase corpus <- tm_map(corpus, tolower) # strip whitespace corpus <- tm_map(corpus, stripWhitespace) # remove punctuation corpus <- tm_map(corpus, removePunctuation) # load the custom Spanish stopword file and convert it to ASCII sw <- readLines("D:/Publico/Documents/TextMinigSpanishResources/Stopwords.es.txt",encoding="...
2014 Jul 28
2
wordcloud and word table
...rmes, pathname) { > > info.dir<-sprintf("%s/%s", pathname, informes) > > info.cor<-Corpus(DirSource(directory=info.dir, encoding="UTF-8")) > > info.cor.cl<-tm_map(info.cor, content_transformer(tolower)) > > info.cor.cl<-tm_map(info.cor.cl, stripWhitespace) > > info.cor.cl<-tm_map(info.cor.cl,removePunctuation) > > sw<-readLines("C:/Users/d_2/Documents/StopWords.txt", encoding="UTF-8") > > sw<-iconv(enc2utf8(sw), sub = "byte") > > info.cor.cl<-tm_map(info.cor.cl, removeWords, stopw...
2014 Jun 17
2
Not a tm problem: your doc.corpus is empty
...tolower)corpus <- tm_map(corpus, removePunctuation)corpus <- tm_map(corpus, > removeNumbers)corpus <- tm_map(corpus, removeWords, > stopwords("english"))inspect(doc.corpus[1:2])library(SnowballC)corpus <- > tm_map(corpus, stemDocument)corpus <- tm_map(corpus, > stripWhitespace)inspect(doc.corpus[1:8])TDM <- > TermDocumentMatrix(corpus)TDM* > > Thanks very much in advance!!! > > ruben! > ------------ next part ------------ > An HTML attachment was scrubbed... > URL: <https://stat.ethz.ch/pipermail/r-help-es/attachments/2014061...
2014 Jul 25
3
wordcloud and word table
...formes/" >TDM<-function(informes, pathname) { info.dir<-sprintf("%s/%s", pathname, informes) info.cor<-Corpus(DirSource(directory=info.dir, encoding="UTF-8")) info.cor.cl<-tm_map(info.cor, content_transformer(tolower)) info.cor.cl<-tm_map(info.cor.cl, stripWhitespace) info.cor.cl<-tm_map(info.cor.cl,removePunctuation) sw<-readLines("C:/Users/d_2/Documents/StopWords.txt", encoding="UTF-8") sw<-iconv(enc2utf8(sw), sub = "byte") info.cor.cl<-tm_map(info.cor.cl, removeWords, stopwords("spanish")) info.tdm<...
2010 Feb 16
0
tm package
Hi, I'm using version 0.5.1 of tm package with R 2.10.1. It looks to me as if after the following reuters21578 <- Corpus(DirSource(corpusDir), readerControl = list(reader = readReut21578XMLasPlain)) reuters21578 <- tm_map(reuters21578, stripWhitespace) reuters21578 <- tm_map(reuters21578, tolower) reuters21578 <- tm_map(reuters21578, removePunctuation) reuters21578 <- tm_map(reuters21578, removeNumbers) reuters21578.dtm <- DocumentTermMatrix(reuters21578) that reuters21578.dtm does not include terms from the Heading...
2011 Apr 18
0
Help with cleaning a corpus
Hi! I created a corpus and started to clean it with this piece of code: txt <-tm_map(txt,removeWords, stopwords("spanish")) txt <-tm_map(txt,stripWhitespace) txt <-tm_map(txt,tolower) txt <-tm_map(txt,removeNumbers) txt <-tm_map(txt,removePunctuation) But something happened: some of the documents in the corpus became empty, which is a problem when I try to make a document term matrix with tfidf. Is there any way to eliminate automatically...
2013 Sep 26
0
R hangs at NGramTokenizer
...; myCorpus <- tm_map(myCorpus, removeNumbers)> myCorpus <- tm_map(myCorpus, removePunctuation)> myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))> myCorpus <- tm_map(myCorpus, removeWords, stopwords("SMART"))> myCorpus <- tm_map(myCorpus, stripWhitespace)> myDtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths = c(1,Inf))) Everything works fine upto this stage, if I do not include tokenizing. However, when I run the code with the following alteration:> dictCorpus <- myCorpus> myDtm <- DocumentTermMatrix(myCorpus, control...
2014 Jun 18
2
Not a tm problem: your doc.corpus is empty
...vePunctuation)corpus <- tm_map(corpus, removeNumbers)corpus <- > >> tm_map(corpus, removeWords, > >> > stopwords("english"))inspect(doc.corpus[1:2])library(SnowballC)corpus > >> <- tm_map(corpus, stemDocument)corpus <- tm_map(corpus, > >> stripWhitespace)inspect(doc.corpus[1:8])TDM <- > >> TermDocumentMatrix(corpus)TDM* > >> > >> Thanks very much in advance!!! > >> > >> ruben! > >> ------------ next part ------------ An HTML attachment was scrubbed > >> ... > >&...
2011 Mar 24
2
Problem with Snowball & RWeka
Dear Forum, when I try to use SnowballStemmer() I get the following error message: "Could not initialize the GenericPropertiesCreator. This exception was produced: java.lang.NullPointerException" It seems to have something to do with either Snowball or RWeka; however, I can't figure out what to do myself. If you could spend 5 minutes of your valuable time to help me or give me a
2012 Sep 07
1
[LLVMdev] teaching FileCheck to handle variations in order
On 9/7/2012 12:12 PM, Krzysztof Parzyszek wrote: > On 9/7/2012 7:20 AM, Matthew Curtis wrote: >> >> The attached patch implements one possible solution. It introduces a >> position stack and a couple of directives: >> >> * 'CHECK-PUSH:' pushes the current match position onto the stack. >> * 'CHECK-POP:' pops the top value off of the stack
2014 Jun 18
3
Not a tm problem: your doc.corpus is empty
...mbers)corpus <- > > > >> tm_map(corpus, removeWords, > > > >> > > > stopwords("english"))inspect(doc.corpus[1:2])library(SnowballC)corpus > > > >> <- tm_map(corpus, stemDocument)corpus <- tm_map(corpus, > > > >> stripWhitespace)inspect(doc.corpus[1:8])TDM <- > > > >> TermDocumentMatrix(corpus)TDM* > > > >> > > > >> Thanks very much in advance!!! > > > >> > > > >> ruben! > > > >> ------------ next part ------------ Se ha bo...