thr3ads.net - R help - [R] Retaining the original document id in #topicmodels in R [Jun 2014]

张伦

2014-Jun-24 10:19 UTC

[R] Retaining the original document id in #topicmodels in R

Hi all,
I am currently using the "topicmodels" package to find the topics of a given
text.
The dataset contains 8523 documents. I would like to see which documents
belong to which topic.

Here is my code:
########################get the documentTermMatrix#########
library ("tm")           # provides DocumentTermMatrix() and nDocs()
library ("slam")         # provides row_sums() and col_sums()
library ("topicmodels")  # provides LDA()

tdm=DocumentTermMatrix(corpus,control)
length(tdm$dimnames$Terms)
dim (tdm)   ##################the dimension of tdm is "[1]  8513 21135"

term_tfidf <-tapply(tdm$v/row_sums(tdm)[tdm$i], tdm$j, mean) *
log2(nDocs(tdm)/col_sums(tdm > 0))
summary(term_tfidf)
summary(col_sums(tdm))

tdm <- tdm[,term_tfidf >= 0.15]
tdm2 <- tdm[row_sums(tdm) > 0,]
dim(tdm2) ######################now the dim of tdm2 is "[1]  8513 10091"

###################topic modeling analysis######################

k <- 30
lda <- LDA(tdm2, k = k, control = list(alpha = 0.1))

###### cell values as posterior topic distribution for each document#####
gammaDF <- as.data.frame(lda@gamma)
names(gammaDF) <- c(1:k)
# inspect...
gammaDF
toptopics <- as.data.frame(cbind(document = row.names(gammaDF),
  topic = apply(gammaDF,1,function(x) names(gammaDF)[which(x==max(x))])))
sapply(toptopics, class)
toptopics<-unlist(toptopics)
write.csv (toptopics, "topicdistribution.csv")


Some of the documents (in this case, 10 documents) were excluded because they
contain only zero entries. Therefore, I cannot match the original document
IDs with the resulting topics.

My question is: how can I retain the original document IDs and match them
with the topics?

ZHANG Lun
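
One possible approach, sketched here with base R only: subsetting a matrix keeps the row names of the surviving rows, so the IDs can be read from the filtered matrix instead of from the row numbers of gammaDF. The toy objects dtm and gamma and the IDs "doc1" to "doc5" are made up for illustration. The sketch assumes that the original DocumentTermMatrix carries the document IDs as row names and that the rows of lda@gamma are in the same order as the rows of tdm2. In topicmodels itself, lda@documents and the names of topics(lda) should hold the same IDs, but that is worth checking against the installed version.

```r
# Toy stand-in for the document-term matrix: 5 documents x 3 terms.
# The row names play the role of the original document IDs.
dtm <- matrix(c(2, 0, 1,
                0, 0, 0,   # "doc2" has no terms left after filtering
                0, 3, 0,
                1, 1, 4,
                0, 0, 0),  # "doc5" is empty as well
              nrow = 5, byrow = TRUE,
              dimnames = list(Docs = paste0("doc", 1:5),
                              Terms = c("a", "b", "c")))

# Same filtering step as in the post: drop all-zero rows.
# The surviving rows keep their original row names.
keep <- rowSums(dtm) > 0
dtm2 <- dtm[keep, , drop = FALSE]

# IDs of the documents that were excluded by the filter.
dropped <- setdiff(rownames(dtm), rownames(dtm2))

# Hypothetical posterior matrix standing in for lda@gamma:
# one row per document in dtm2 (same order), one column per topic.
gamma <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.2, 0.2, 0.6),
                nrow = 3, byrow = TRUE)

# Label the rows with the retained IDs instead of the default 1..n.
rownames(gamma) <- rownames(dtm2)

# Most likely topic per document, keyed by the original ID.
toptopics <- data.frame(document = rownames(gamma),
                        topic = max.col(gamma, ties.method = "first"),
                        stringsAsFactors = FALSE)
print(toptopics)
print(dropped)
```

Applied to the code above, this would amount to using rownames(tdm2) in place of row.names(gammaDF) when building toptopics, and setdiff(rownames(tdm), rownames(tdm2)) to list the excluded documents.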

