I tried making a wordcloud of Obama's State of the Union address using
the tm package to process the text:

    library(tm)
    library(wordcloud)

    sotu <- scan(file="c:/R/data/sotu2012.txt", what="character")
    sotu <- tolower(sotu)
    corp <- Corpus(VectorSource(paste(sotu, collapse=" ")))

    corp <- tm_map(corp, removePunctuation)
    corp <- tm_map(corp, stemDocument)
    corp <- tm_map(corp, function(x) removeWords(x, stopwords()))
    tdm <- TermDocumentMatrix(corp)
    m <- as.matrix(tdm)
    v <- sort(rowSums(m), decreasing=TRUE)
    d <- data.frame(word = names(v), freq=v)

    wordcloud(d$word, d$freq)

I ended up with a large number of contractions that were split at the
"’" character, e.g., "don’t" --> "don'". For example:

    > sotu[grep("’", sotu)]
      [1] "qaeda’s"   "taliban’s" "america’s" "they’re"   "don’t"
      [6] "we’re"     "aren’t"    "we’ve"     "patton’s"  "what’s"
     [11] "let’s"     "weren’t,"  "couldn’t"  "people’s"  "didn’t"
     [16] "we’ve"     "we’ve"     "we’ve"     "i’m"       "that’s"
     [21] "world’s"   "what’s"    "can’t"     "that’s"    "it’s"
     [26] "lock’s"    "let’s"     "you’re"    "shouldn’t" "you’re"
     [31] "you’re"    "it’s"      "i’ll"      "we’re"     "don’t"
     [36] "we’ve"     "it’s"      "it’s"      "it’s"      "they’re"
    ...
    [201] "didn’t"    "bush’s"    "didn’t"    "can’t"     "there’s"
    [206] "i’m"       "other’s"   "we’re"
    >

NB: What appears as the ' character above is actually the character hex 92,
not hex 27, on my Windows system.

This should be a common problem in text processing, but I don't see a
transformation in the tm package that handles this nicely. Is there
something I've missed?

-Michael

--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    Web:   http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA

This may not be the answer to your problem, but you could gsub out the
"pretty apostrophe" for the one tm recognizes. Also note that this may be
due to your use of Word, which automatically uses the "pretty apostrophe".
The default setting in MS Word can be altered to alleviate this.

#==============================
# using gsub
x <- "I didn’t know!"
x <- gsub("’", "'", x)
removePunctuation(x)
#==============================
# You could make that into a function and apply it to the corpus with tm_map
exchanger <- function(x) gsub("’", "'", x)
corp <- tm_map(corp, exchanger)
#==============================

Cheers,
Tyler

----------------------------------------
> Date: Fri, 27 Jan 2012 09:50:51 -0500
> From: friendly at yorku.ca
> To: r-help at r-project.org
> Subject: [R] tm package: handling contractions
>
> [...]
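
A minimal base-R sketch of the substitution suggested above, written with Unicode escapes so that it does not depend on how the script file itself is encoded. It assumes the curly apostrophe in the text is U+2019 (the character that byte hex 92 maps to in Windows-1252); normalize_apostrophes is a hypothetical helper name, not a function from tm.

```r
# Hypothetical helper (not part of tm): map typographic single quotes
# to the plain ASCII apostrophe (hex 27).
normalize_apostrophes <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)  # right single quotation mark
  x <- gsub("\u2018", "'", x, fixed = TRUE)  # left single quotation mark
  x
}

words <- c("don\u2019t", "we\u2019re", "america\u2019s")
normalize_apostrophes(words)
# [1] "don't"     "we're"     "america's"
```

This could be applied to the corpus in the same way as exchanger, i.e. tm_map(corp, normalize_apostrophes), before removePunctuation is run. In more recent tm releases a custom function like this is normally wrapped in content_transformer() first.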

On Friday 27 January 2012 at 09:50 -0500, Michael Friendly wrote:
> [...]
> This should be a common problem in text processing, but I don't see a
> transformation in the tm package that handles this nicely. Is there
> something I've missed?

What result would you expect?

As I see it, ideally, removePunctuation() would remove these apostrophes.
Looks like it doesn't; the code is:

    removePunctuation <- function(x) UseMethod("removePunctuation", x)
    removePunctuation.PlainTextDocument <- function(x)
        gsub("[[:punct:]]+", " ", x)

And ?regexp says:

    ‘[:punct:]’ Punctuation characters:
        ‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~’.

Maybe the ’ apostrophe should be added to the list? (FWIW, it's the "real"
character for apostrophe in Unicode.)

I discussed a related issue about apostrophes with Ingo Feinerer and Kurt
Hornik: in French, we'd need apostrophes (of any type, ' or ’) to mark a
separation between words, instead of concatenating the two parts
surrounding them. The conclusion was that a language-specific processor was
required (languages with non-Latin alphabets have many more diacritic
characters we don't even know about).

In English, I suspect it might be interesting to detect forms like "'re" or
"n't" and replace them with their full equivalents, i.e. "are" and "not";
OTOH, genitive forms would probably be better removed (at least by default).

In the short term, Tyler's solution will work, but beware that "we're" will
become "were" if you remove punctuation ;-). An alternative is to replace
apostrophes with spaces so that suffixes are considered as separate words
(that's what I do in French ATM).

Hope this helps
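
The two ideas above (expanding English contractions, or splitting words at the apostrophe) could be sketched roughly as follows in base R. Both function names are hypothetical, nothing here is part of tm, and the rule list is deliberately tiny and illustrative. In particular, "'s" is simply dropped, although it can stand for a genitive as well as for "is" or "has".

```r
# Hypothetical helpers, not part of tm; the rules are illustrative, not complete.
expand_contractions <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)            # curly -> ASCII apostrophe
  x <- gsub("\\bcan't\\b", "cannot", x, perl = TRUE)   # irregular forms first
  x <- gsub("\\bwon't\\b", "will not", x, perl = TRUE)
  x <- gsub("n't\\b", " not", x, perl = TRUE)          # "didn't" -> "did not"
  x <- gsub("'re\\b", " are", x, perl = TRUE)          # "we're"  -> "we are"
  x <- gsub("'s\\b", "", x, perl = TRUE)               # "america's" -> "america"
  x
}

# Alternative: treat any apostrophe as a word separator.
split_at_apostrophes <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)
  gsub("'", " ", x, fixed = TRUE)
}

expand_contractions(c("we\u2019re", "didn\u2019t", "can\u2019t", "america\u2019s"))
# [1] "we are"  "did not" "cannot"  "america"
split_at_apostrophes("we\u2019re")
# [1] "we re"
```

Either function would have to run before removePunctuation, since the apostrophe is the only cue the rules rely on. With the second approach the leftover suffixes ("re", "t", "s") become separate tokens that a stopword list or a minimum word length can then filter out.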