thr3ads.net - R help - [R] tm package: handling contractions [Jan 2012]

Michael Friendly

2012-Jan-27 14:50 UTC

[R] tm package: handling contractions

I tried making a wordcloud of Obama's State of the Union address, using
the tm package to process the text:

library(tm)         # Corpus, tm_map, TermDocumentMatrix
library(wordcloud)  # wordcloud

sotu <- scan(file="c:/R/data/sotu2012.txt", what="character")
sotu <- tolower(sotu)
corp <- Corpus(VectorSource(paste(sotu, collapse=" ")))

corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, function(x)removeWords(x,stopwords()))
tdm <- TermDocumentMatrix(corp)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

wordcloud(d$word,d$freq)

I ended up with a large number of contractions that were split at the
"’" character, e.g., "don’t" --> "don'".
For example:

 > sotu[grep("’", sotu)]
[1] "qaeda’s" "taliban’s" "america’s" "they’re" "don’t"
[6] "we’re" "aren’t" "we’ve" "patton’s" "what’s"
[11] "let’s" "weren’t," "couldn’t" "people’s" "didn’t"
[16] "we’ve" "we’ve" "we’ve" "i’m" "that’s"
[21] "world’s" "what’s" "can’t" "that’s" "it’s"
[26] "lock’s" "let’s" "you’re" "shouldn’t" "you’re"
[31] "you’re" "it’s" "i’ll" "we’re" "don’t"
[36] "we’ve" "it’s" "it’s" "it’s" "they’re"
...
[201] "didn’t" "bush’s" "didn’t" "can’t" "there’s"
[206] "i’m" "other’s" "we’re"
 >

NB: What appears as the ' character above is actually the character hex 92,
not hex 27, on my Windows system.
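
The difference between the two characters can be checked directly in R. The following is only a minimal sketch: it assumes the local iconv build recognises the encoding name "CP1252", and the variable names are illustrative. The typographic apostrophe is written as the escape "\u2019" so the script does not depend on the encoding of the source file.

```r
# Typographic apostrophe (U+2019) versus ASCII apostrophe (U+0027)
curly    <- "\u2019"
straight <- "'"

# Unicode code points of the two characters
curly_code    <- utf8ToInt(curly)     # 8217, i.e. hex 2019
straight_code <- utf8ToInt(straight)  # 39, i.e. hex 27

# In the Windows-1252 code page the typographic apostrophe is the single
# byte hex 92 (assumes iconv knows the name "CP1252")
cp1252_byte <- iconv(curly, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]

print(curly_code)
print(straight_code)
print(cp1252_byte)
```

Seen this way, hex 92 and hex 27 are simply two different characters, so a pattern written for one will not match the other.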

This should be a common problem in text processing, but I don't see a 
transformation in the tm package that
handles this nicely. Is there something I've missed?

-Michael

-- 
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    Web:   http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA

Tyler Rinker

2012-Jan-27 17:14 UTC

This may not be the answer to your problem, but you could gsub out the
"pretty apostrophe" for the one tm recognizes. Also note that this
may be due to your use of Word, which automatically uses
the "pretty apostrophe". The default setting in MS Word can be
altered to alleviate this.

#==============================
# using gsub
x <- "I didn’t know!"
x <- gsub("’", "'", x)
removePunctuation(x)
#===============================
# You could make that into a function and apply it to the corpus with tm_map
exchanger <- function(x) gsub("’", "'", x)
corp <- tm_map(corp, exchanger)
#==============================

Cheers,
Tyler

Milan Bouchet-Valat

2012-Jan-27 18:07 UTC

On Friday, 27 January 2012 at 09:50 -0500, Michael Friendly wrote:
> This should be a common problem in text processing, but I don't see a
> transformation in the tm package that
> handles this nicely. Is there something I've missed?

What result would you expect? As I see it, ideally, removePunctuation()
would remove these apostrophes. Looks like it doesn't; the code is:

removePunctuation <- function(x) UseMethod("removePunctuation", x)
removePunctuation.PlainTextDocument <- function(x)
    gsub("[[:punct:]]+", " ", x)

And ?regexp says:
     ‘[:punct:]’ Punctuation characters:
          ‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~’.

Maybe the ’ apostrophe should be added to the list? (FWIW, it's the
"real" character for apostrophe in Unicode.)
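
One way to try this without touching tm itself is a small wrapper around gsub. This is only a sketch: remove_punct_and_curly is a hypothetical helper, not a tm function, and it deletes the matched characters outright (replacement "") rather than reproducing tm's own behaviour.

```r
# Hypothetical helper (not part of tm): strip ASCII punctuation plus the
# typographic single quotation marks U+2018 and U+2019.
remove_punct_and_curly <- function(x) {
  gsub("[[:punct:]\u2018\u2019]+", "", x)
}

words   <- c("don\u2019t", "america\u2019s", "weren\u2019t,")
cleaned <- remove_punct_and_curly(words)
print(cleaned)
```

The two code points are listed explicitly because whether [[:punct:]] already matches them depends on the platform and locale.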

I discussed a related issue about apostrophes with Ingo Feinerer and
Kurt Hornik: in French, we'd need apostrophes (of any type, ' or ’) to
mark a separation between words, instead of concatenating the two parts
surrounding it. The conclusion was that a language-specific processor
was required (languages with non-Latin alphabets have many more diacritic
characters we don't even know about).

In English, I suspect it might be interesting to detect forms like "'re"
or "n't" and replace them with their full equivalents, i.e. "are" and
"not"; OTOH, genitive forms would probably be better removed (at least
by default). In the short term, Tyler's solution will work, but beware
that "we're" will become "were" if you remove punctuation ;-). An
alternative is to replace apostrophes with spaces so that suffixes are
considered as separate words (that's what I do in French ATM).
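
Both ideas can be sketched in base R. The helpers below are hypothetical (neither is part of tm), the list of contractions is deliberately tiny, and the special case for "can't" is an added assumption so that it does not become "ca not".

```r
# Hypothetical helper: normalise apostrophes, then expand a few contractions.
expand_contractions <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)     # typographic -> ASCII apostrophe
  x <- gsub("can't", "cannot", x, fixed = TRUE) # irregular form, handled first
  x <- gsub("n't\\b", " not", x, perl = TRUE)   # don't  -> do not
  x <- gsub("'re\\b", " are", x, perl = TRUE)   # we're  -> we are
  x <- gsub("'s\\b", "", x, perl = TRUE)        # genitive (or "is"): dropped
  x
}

# Hypothetical helper for the alternative: turn apostrophes into spaces so
# that suffixes become separate tokens.
apostrophes_to_spaces <- function(x) {
  gsub("['\u2019]", " ", x)
}

expanded <- expand_contractions(
  c("we\u2019re", "don\u2019t", "america\u2019s", "can\u2019t"))
spaced <- apostrophes_to_spaces("we\u2019re")
print(expanded)
print(spaced)
```

Note that dropping "'s" cannot tell a genitive ("america's") from a contracted "is" ("it's"); separating the two would need more than a regular expression.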


Hope this helps
