Fridolin Wild
2005-Nov-08 23:03 UTC
[R] sorting during xtabs? sorting by "individual" order?
Hey all,

while refactoring a package (before it is released), I ran across the following problem.

I have two directories with different text files. I want to read the first one and construct a document-term matrix from it (every term = word in a row, every file in a column, occurrence frequencies as the values).

The second directory contains different files. It needs to be read in to construct a document-term matrix as well, but in the same "term order", to enable similarity comparisons in a vector space of the same format.

Let's make a (fake) example:

(1) support function

    # directory 1 contains 2 files (F1 & F2):
    F1 = c("word4", "word3", "word2")
    F2 = c("word1", "word4", "word2")

    # directory 2 also contains 2 files (F3 & F4):
    F3 = c("word1", "word2", "bla")
    F4 = c("word1", "word2", "word3")

    # I read in the first directory, file by file, and
    # create triples of the format (file, word, frequency)
    F1tab = sort(table(F1), decreasing = TRUE)
    F2tab = sort(table(F2), decreasing = TRUE)

    # and create a data frame per file
    # (as.vector() drops the table class, leaving a plain count column)
    F1frame = data.frame(docs = "F1", terms = names(F1tab),
                         Freq = as.vector(F1tab), row.names = NULL)
    F2frame = data.frame(docs = "F2", terms = names(F2tab),
                         Freq = as.vector(F2tab), row.names = NULL)

(2) textmatrix function

... to be bound together for every file and converted with xtabs into a document-term matrix:

    dummy = list(F1frame, F2frame)
    dtm = t(xtabs(Freq ~ ., data = do.call("rbind", dummy)))

=>

           docs
    terms   F1 F2
      word2  1  1
      word3  1  0
      word4  1  1
      word1  0  1

Now, when I want to re-use this to construct another document-term matrix from files F3 & F4, with the same terms in exactly the same order, I first need to add

    F3clean = F3[F3 %in% rownames(dtm)]
    F4clean = F4[F4 %in% rownames(dtm)]

to keep "unwanted" terms from getting into the tables.

And here is my problem: I need to reformat the output document-term matrix (as it would be produced by running step 2 again with F3clean and F4clean) so that it corresponds to the order given by rownames(dtm) from the first directory.
How can I do this cheaply (the matrices I have to deal with are usually really big)? Hopefully just by adding something to the xtabs call?

To give an example of what I need: dtm2 should look exactly like this (the document order is not important):

=>

           docs
    terms   F3 F4
      word2  1  1
      word3  0  1
      word4  0  0
      word1  1  1

Can anybody help me?

Best,
Fridolin

--
Fridolin Wild, Institute for Information Systems and New Media,
Vienna University of Economics and Business Administration (WUW),
Augasse 2-6, A-1090 Wien, Austria
fon +43-1-31336-4488, fax +43-1-31336-746
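
Regarding the question above, here is a minimal sketch of one possible approach, not a definitive answer: make the terms a factor whose levels are the row names of the first matrix, so that xtabs lays out the rows in that order and keeps zero-count terms. The helper count_terms and the variable term_order are hypothetical names introduced only for this illustration, and term_order is hard-coded here in place of rownames(dtm) so that the snippet is self-contained.

```r
# Term order fixed by the first directory's matrix
# (assumption: copied by hand from rownames(dtm) in the example)
term_order <- c("word2", "word3", "word4", "word1")

F3 <- c("word1", "word2", "bla")
F4 <- c("word1", "word2", "word3")

# Hypothetical helper: count the terms of one file against a fixed vocabulary.
count_terms <- function(words, doc, term_levels) {
  # factor() with explicit levels turns unknown terms (e.g. "bla") into NA,
  # and table() ignores NA by default, so no separate %in% filter is needed.
  tab <- table(factor(words, levels = term_levels))
  data.frame(docs  = doc,
             terms = factor(names(tab), levels = term_levels),
             Freq  = as.vector(tab))
}

frames <- list(count_terms(F3, "F3", term_order),
               count_terms(F4, "F4", term_order))

# xtabs orders its margins by factor levels, so the rows follow term_order.
dtm2 <- t(xtabs(Freq ~ ., data = do.call("rbind", frames)))
print(dtm2)
```

Because the ordering is carried by the factor levels, no reordering of the finished matrix is needed. Whether this is cheap enough for very large matrices is untested here.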