thr3ads.net - R help - [R] Analyzing texts with tm [Jan 2011]

Michael Weller

2011-Jan-19 10:55 UTC

[R] Analyzing texts with tm

Hey everybody!

I have to use R's tm package to do some text analysis; the first step is
to create a term frequency matrix.
Digging through tm's source code, it seems to use logic like this to
compute term frequencies:

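library(tm)
data("crude")
# Raw text of the first document
(txt <- Content(crude[[1]]))
# Replace runs of non-alphanumeric characters with a space, then split on single spaces
(tokTxt <- unlist(strsplit(gsub("[^[:alnum:]]+", " ", txt), " ", fixed = TRUE)))
table(factor(tokTxt, levels = c('two')))       # counts the token "two"
table(factor(tokTxt, levels = c('two days')))  # always 0: no token contains a space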

As this example demonstrates, tokenizing the input text makes it
impossible to use "a group of words separated by whitespace" as an input
term.

So my question is: How would you create such a term frequency matrix in R?

Here's some Ruby code I once wrote to show what I want:
txt = "some text containing two days\n"
# \s (rather than a literal space) also matches the newline after "two days"
freq = ['two', 'two days'].inject({}) do |h, w|
  h[w] = txt.scan(/\s#{Regexp.escape(w)}\s/).length
  h
end
(Reads as: "Given txt, generate an associative array mapping each word to
its frequency in txt. To count occurrences, do not split the text at
whitespace; instead, use a regular expression to search txt for the word or
group of words surrounded by whitespace.")
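
For comparison, the same counting idea can be sketched in base R without any tokenization. This is only a minimal sketch under stated assumptions: it does not use tm, the helper name count_phrases is made up for illustration, and "surrounded by whitespace" is interpreted as a whitespace character directly before and after the phrase.

```r
# Hypothetical helper (not part of tm): count how often each phrase occurs
# in txt with a whitespace character on both sides.
count_phrases <- function(txt, phrases) {
  vapply(phrases, function(p) {
    # \Q...\E treats the phrase literally; the lookarounds require
    # whitespace on both sides without consuming it, so adjacent
    # occurrences can share a single space.
    pat <- paste0("(?<=\\s)\\Q", p, "\\E(?=\\s)")
    m <- gregexpr(pat, txt, perl = TRUE)[[1]]
    sum(m > 0)  # gregexpr() returns -1 when there is no match
  }, integer(1))
}

txt <- "some text containing two days\n"
freq <- count_phrases(txt, c("two", "two days"))
print(freq)
```

The result is a named integer vector with one entry per phrase. A phrase at the very start of the text is not counted under this reading, because no whitespace precedes it.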

Thanks in advance for any input!