thr3ads.net - R help - [R] Cluster analysis using term frequencies [Mar 2015]

Sun Shine

2015-Mar-24 11:55 UTC

[R] Cluster analysis using term frequencies

Hi list

I am using the 'tm' package to review meeting notes at a school to 
identify terms frequently associated with 'learning', 'sports', and 
'extra-mural' activities, and then to sort any terms according to these 
three headers in a way that could be supported statistically (as opposed 
to, say, my own bias, etc.).

To accomplish this, I have done the following:

(1) After the usual pre-processing of the text data, loading it as a 
corpus and then converting it into a document term matrix (called 
'allTerms'), I have identified the 20 most frequently occurring terms in 
the meeting notes and extracted these into a named vector called 
'freqTerms'. Many of the terms returned have nothing to do with any of 
the three themes of 'learning', 'sports', or 'extra-mural'.

(2) Therefore, I have also manually generated a list of terms and 
synonyms for 'learning' and 'sports', etc. (e.g. 'football', 'soccer', 
'drama', 'chess', etc.) and then tested for the occurrence of each of 
these terms in the corpus, e.g.:

 > allTerms['soccer']

and have come up with a list of some 30 terms together with their 
frequencies. I manually sorted these according to three headers 
'learning', 'sports', and 'extra-mural' and dropped these into a table 
in a word processing document. Some of these terms are also in the 
freqTerms vector.

What I want to do now is to use cluster analysis (hclust, from the 
'stats' package) to plot a dendrogram of the terms I have manually 
checked and put into the table, in order to see how similar the terms 
are and whether they cluster in ways similar to the way I manually 
sorted them under the table column headers of 'learning', 'sports', 
and 'extra-mural'.

To do this, I dropped these manually sorted terms into a data frame 
together with the associated values (which I called 'tes.df') and then 
tried plotting this as follows:

 > dtes <- dist(tes.df, method = 'euclidean')
 > dtesFreq <- hclust(dtes, method = 'ward.D')
 > plot(dtesFreq, labels = names(tes.df))

However, I get an error message when trying to plot this: "Error in 
graphics:::plotHclust(n1, merge, height, order(x$order), hang,  : 
invalid dendrogram input".

I'm clearly screwing something up, either in my source data.frame or in 
my setting hclust up, but don't know which, nor how.

More than just identifying the error, however, I am interested in finding 
a smart (efficient/elegant) way of checking the occurrence and 
frequency of the terms that may be associated with 'sports', 
'learning', and 'extra-mural' and extracting these into a matrix or data 
frame so that I can analyse and plot their clustering to see whether the 
way I associated these terms is actually supported statistically.
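
One way this extraction could be sketched in base R is shown below. The matrix 'm' is a made-up stand-in for as.matrix(allTerms) (rows are documents, columns are terms), and 'myTerms', 'found', 'termFreq' and 'termProfiles' are hypothetical names that do not appear in the thread. The sketch keeps only the hand-picked terms that actually occur, totals their frequencies, and also builds a term-by-document matrix, on the assumption that per-document counts give a clustering more to work with than a single total per term.

```r
# Made-up stand-in for as.matrix(allTerms): rows = documents, columns = terms
m <- matrix(c(2, 1, 0, 3, 1,
              0, 2, 1, 1, 4,
              1, 0, 0, 2, 0,
              3, 1, 2, 0, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("soccer", "football", "chess", "reading", "budget")))

# Hand-made list of candidate terms (hypothetical); 'drama' is not in the matrix
myTerms <- c("soccer", "football", "chess", "drama", "reading")

# Keep only the candidate terms that occur as columns
found <- intersect(myTerms, colnames(m))

# Total frequency of each found term across all documents
termFreq <- colSums(m[, found, drop = FALSE])

# One row per term, one column per document: a possible input for dist()
termProfiles <- t(m[, found, drop = FALSE])

print(termFreq)
print(termProfiles)
```

Terms in 'myTerms' that never occur are simply dropped by intersect(), so there is no need to test each term by hand.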

I'm sure that there must be a way of doing this in R, but I'm obviously 
not going about it correctly. Can anyone shine a light please?

Thanks for any help/ guidance.

Regards,
Sun

Christian Hennig

2015-Mar-24 13:39 UTC


[R] Cluster analysis using term frequencies

Dear Sun Shine,
>> dtes <- dist(tes.df, method = 'euclidean')
>> dtesFreq <- hclust(dtes, method = 'ward.D')
>> plot(dtesFreq, labels = names(tes.df))
>
> However, I get an error message when trying to plot this: "Error in 
> graphics:::plotHclust(n1, merge, height, order(x$order), hang,  : invalid 
> dendrogram input".
I don't see anything wrong with the code, so what I'd do is run
str(dtes) and str(dtesFreq) to see whether these are what they should be 
(or if not, what they are instead).
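
A minimal sketch of that check follows, using a made-up stand-in for 'tes.df', since the real data frame is not shown in the thread. It also illustrates one plausible cause of the error, which is an assumption rather than a confirmed diagnosis: names() on a data frame returns its column names, so the labels vector passed to plot() may not have one entry per clustered row.

```r
# Made-up stand-in for the poster's 'tes.df': terms as row names, one numeric column
tes.df <- data.frame(freq = c(12, 9, 7, 15, 11, 4),
                     row.names = c("soccer", "football", "chess",
                                   "reading", "maths", "drama"))

dtes <- dist(tes.df, method = "euclidean")
dtesFreq <- hclust(dtes, method = "ward.D")

# Inspect both objects as suggested
str(dtes)
str(dtesFreq)

# The dendrogram needs one label per clustered row
nLabels <- length(names(tes.df))   # number of columns in the data frame
nItems  <- length(dtesFreq$order)  # number of rows that were clustered

# With this stand-in, labels of the wrong length make plot() fail
res <- try(plot(dtesFreq, labels = names(tes.df)), silent = TRUE)
failed <- inherits(res, "try-error")

# Row names have the right length, so this call draws the dendrogram
plot(dtesFreq, labels = rownames(tes.df))
```

If the terms are stored in a column rather than as row names, that column would have to be excluded from dist() and passed as the labels instead.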
> I'm clearly screwing something up, either in my source data.frame or in
> my setting hclust up, but don't know which, nor how.
Can't comment on your source data but generally, whatever you do, use 
str() or even print() to see whether the R objects are all right or what 
went wrong.
> More than just identifying the error however, I am interested in finding a 
> smart (efficient/ elegant) way of checking the occurrence and frequency
> value of the terms that may be associated with 'sports', 'learning', and
> 'extra-mural' and extracting these into a matrix or data frame so that I
> can analyse and plot their clustering to see if how I associated these
> terms is actually supported statistically.
The first thing that comes to my mind (not necessarily the best/most 
elegant) is to run...
dtes3 <- cutree(dtesFreq,3)
...and to table dtes3 against your manual classification.
Note that 3 is the most "natural" number of clusters to cut the tree 
here but may not be the best to match your classification (for example, 
you may have a one-point cluster in the 3-cluster solution, so it may 
effectively be a two-cluster solution with an outlier). Your 
dendrogram, if you succeed in plotting it, may give you a hint about that.
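
The comparison could be sketched as follows. The frequencies and the 'manual' grouping are invented for illustration, chosen so that the three groups separate cleanly; real data may well not. 'hc' and 'tab' are hypothetical names.

```r
# Invented frequencies; terms as row names (not the poster's real data)
tes.df <- data.frame(freq = c(30, 28, 27, 12, 11, 10, 3, 2, 1),
                     row.names = c("reading", "maths", "homework",
                                   "soccer", "football", "cricket",
                                   "chess", "drama", "choir"))

# Hypothetical manual classification, in the same row order as tes.df
manual <- factor(rep(c("learning", "sports", "extra-mural"), each = 3))

hc <- hclust(dist(tes.df, method = "euclidean"), method = "ward.D")

# Cut the tree into three clusters
dtes3 <- cutree(hc, k = 3)

# Cross-tabulate the manual headers against the cluster memberships
tab <- table(manual = manual, cluster = dtes3)
print(tab)
```

A table with a single non-zero cell in each row and column would mean the clusters reproduce the manual grouping; counts spread across a row show where the two disagree.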

Hope this helps,
Christian

>
> I'm sure that there must be a way of doing this in R, but I'm obviously
> not going about it correctly. Can anyone shine a light please?
>
> Thanks for any help/ guidance.
>
> Regards,
> Sun
>
*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
c.hennig at ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
