Hi all,
I am currently using the "topicmodels" package to find the topics of a
given text.
The dataset contains 8523 documents. I would like to see which documents
belong to which topic.
Here is my code:
# Get the DocumentTermMatrix
library("tm")           # provides DocumentTermMatrix() and nDocs()
library("slam")         # provides row_sums() and col_sums()
library("topicmodels")  # provides LDA()
tdm <- DocumentTermMatrix(corpus, control)  # corpus and control are not shown here
length(tdm$dimnames$Terms)
dim(tdm)  # the dimension of tdm is 8513 x 21135
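# Mean term frequency of each term over the documents containing it,
# multiplied by its inverse document frequency
term_tfidf <- tapply(tdm$v / row_sums(tdm)[tdm$i], tdm$j, mean) *
  log2(nDocs(tdm) / col_sums(tdm > 0))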
summary(term_tfidf)
summary(col_sums(tdm))
tdm <- tdm[, term_tfidf >= 0.15]  # keep terms with a mean tf-idf of at least 0.15
tdm2 <- tdm[row_sums(tdm) > 0, ]  # drop documents that are left with no terms
dim(tdm2)  # the dim of tdm2 is now 8513 x 10091
# Topic modeling analysis
k <- 30
lda <- LDA(tdm2, k = k, control = list(alpha = 0.1))
# Cell values are the posterior topic distribution for each document
gammaDF <- as.data.frame(lda@gamma)
names(gammaDF) <- 1:k
# inspect...
gammaDF
# which.max() picks a single topic per document, even when two topics tie
toptopics <- as.data.frame(cbind(document = row.names(gammaDF),
  topic = apply(gammaDF, 1, function(x) names(gammaDF)[which.max(x)])))
sapply(toptopics, class)
toptopics <- unlist(toptopics)
write.csv(toptopics, "topicdistribution.csv")
Some of the documents (in this case, 10 documents) were excluded because
they contain only zero entries. As a result, I cannot match the original
document IDs with the resulting topics.
My question is: how can I keep the original document IDs and match them
with the topics?
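
To make the goal concrete, here is a minimal sketch in base R on toy data of the kind of matching I am after. It is not a tested solution for the real model. It assumes that the row names of the matrix hold the document IDs and survive row subsetting; m, gamma and the IDs doc1 to doc4 are hypothetical stand-ins for tdm, lda@gamma and my real IDs.

```r
# Toy stand-in for the document-term matrix; row names act as document IDs.
m <- matrix(c(1, 0, 2,
              0, 0, 0,
              0, 3, 1,
              4, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3", "doc4"),
                            c("a", "b", "c")))

# Drop empty documents, but remember which rows were kept.
keep <- rowSums(m) > 0
m2 <- m[keep, , drop = FALSE]

# Toy posterior topic matrix (stand-in for lda@gamma),
# with one row per surviving document and one column per topic.
gamma <- matrix(c(0.7, 0.3,
                  0.2, 0.8,
                  0.6, 0.4),
                nrow = 3, byrow = TRUE)

# Carry the surviving document IDs over to the topic matrix.
rownames(gamma) <- rownames(m2)

# Most likely topic per document, labelled with the original ID.
toptopics <- data.frame(document = rownames(gamma),
                        topic = max.col(gamma, ties.method = "first"),
                        stringsAsFactors = FALSE)

# IDs of the documents that were excluded.
dropped <- rownames(m)[!keep]
```

With the real data, I would guess that tdm2$dimnames$Docs could play the role of rownames(m2), provided the corpus carries the document IDs, but I have not confirmed this.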
ZHANG Lun