Huntsinger, Reid
2005-Jun-15 22:27 UTC
[R] coding to generate a matrix to prepare for chi-sqr test for text mining
I would compile a table of all the words in the dataset (maybe you have it already), then create a list where each component is an integer vector of indices of words. That is, replace words by their positions in the table.

From that sparse form you could create binary features to use with standard classification methods, or for example compute the X'X matrix for linear regression directly (you would probably want to throw out infrequently occurring words to keep the matrix small enough to work with in memory).

For your specific question, say "words" is the list of integer vectors as above, "class" is the vector of class labels (1 or 2 to make it a valid index) corresponding to a given vector, and "nwords" is the number of distinct words in the table. Then you can fill in the "present" (==1) parts of the table class x presence x word via

    n <- length(words)
    tab <- array(as.integer(0), dim = c(2, 2, nwords))
    for (i in 1:n) {
      # unique() so that a word repeated within a record is counted once
      for (word in unique(words[[i]]))
        tab[class[i], 1, word] <- tab[class[i], 1, word] + 1L
    }

and the "absent" (==2) parts are then easy:

    tab[1, 2, ] <- sum(class == 1) - tab[1, 1, ]
    tab[2, 2, ] <- sum(class == 2) - tab[2, 1, ]

so now you can use chisq.test on each of the 2 x 2 tables tab[,,i] for i a word index, all at once using apply() if convenient.
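As a minimal sketch of the steps above (not code from the original post), the snippet below parses a few of the records quoted in the question, builds the word table, and fills the class x presence x word array. The names "lines", "tokens", "vocab", "n_words" and "pvals" are illustrative assumptions. It also assumes the 0/1 labels are shifted to 1/2 for indexing, and that a word repeated within a record should count once.

```r
# Sample records from the question: id | words ... | class label
lines <- c(
  "1412|WINDOW|SHATTER|TORN|SOFT|TOP|WATER|RAIN|LAB|AI|BOLL|CAMP|0",
  "1413|PARK|IV|STRUCK|PARK|PUSH|COD|POLICI|CIA|TB|SIC|0",
  "2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1",
  "2414|IV|REAREND|CD|COG|LAB|ADVERS|1",
  "2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0",
  "2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1"
)

fields <- strsplit(lines, "|", fixed = TRUE)

# Drop the id (first field) and the label (last field) to get the words
tokens <- lapply(fields, function(f) f[-c(1, length(f))])
# Labels 0/1 become 1/2 so they can be used as array indices
class <- sapply(fields, function(f) as.integer(f[length(f)])) + 1L

# Word table, and each record as an integer vector of word positions
vocab <- sort(unique(unlist(tokens)))
words <- lapply(tokens, function(w) match(unique(w), vocab))

# Third dimension is indexed by word, so it has one slice per distinct word
n_words <- length(vocab)
tab <- array(0L, dim = c(2, 2, n_words),
             dimnames = list(class = c("0", "1"),
                             presence = c("present", "absent"),
                             word = vocab))
for (i in seq_along(words)) {
  for (w in words[[i]]) {
    tab[class[i], 1, w] <- tab[class[i], 1, w] + 1L
  }
}
tab[1, 2, ] <- sum(class == 1) - tab[1, 1, ]
tab[2, 2, ] <- sum(class == 2) - tab[2, 1, ]

# One p-value per word. With counts this small chisq.test() warns that
# the approximation may be poor, so the warnings are suppressed here.
pvals <- suppressWarnings(
  apply(tab, 3, function(m) chisq.test(m)$p.value)
)
```

With real data, tab[, , "ACCID"] then gives the 2 x 2 table for a single word, and sorting pvals ranks the words. For sparse tables like these, fisher.test() may be a safer choice than the chi-square approximation.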
Reid Huntsinger

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Weiwei Shi
Sent: Wednesday, June 15, 2005 5:10 PM
To: R-help at stat.math.ethz.ch
Subject: [R] coding to generate a matrix to prepare for chi-sqr test for text mining

Hi, there:

I have a dataset like the following:

1412|WINDOW|SHATTER|TORN|SOFT|TOP|WATER|RAIN|LAB|AI|BOLL|CAMP|0
1413|PARK|IV|STRUCK|PARK|PUSH|COD|POLICI|CIA|TB|SIC|0
2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1
2414|IV|REAREND|CD|COG|LAB|ADVERS|1
2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0
2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1
2418|AI|FELL|ASLEEP|WHEEL|VEHICL|RETENT|POND|LAB|ADVERS|1
2419|ACCID|SINGL|VEHICL|TREE|FELL|IV|LIGHTN|STORM|IV|CAMP|CALM|AD|1
2422|THEFT|RECOV|TOTAL|THEFT|0
...

The first column is always id_num, the last one is the class label. I want to do some chi-square tests on the dependency between a word (or, further, a word combination) and the class label. For example, my goal is to build a table like the following, ready for a chi-square test:

class label   ACCID (Yes)   ACCID (No)
1             10            15
0             5             9

The number is the number of lines (observations). Later I want to do word combinations like ACCID & WINDOW (this result was generated from association analysis in another program of mine) instead of ACCID only.

My first question is how to do this automatically in R, building a data structure (data frame) to represent the table above for each word, since I am learning R programming and I don't want to do it using Python. (Don't worry if a word appears twice in one observation; I have another version of the data set which only lists unique words.)

My target is to find a p-value for each word/class label from the chi-square test and evaluate the significance of the feature for later text mining. I am not sure if this is a good idea and I am reading some papers on this.

Thanks,

--
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Huntsinger, Reid
2005-Jun-15 23:05 UTC
[R] coding to generate a matrix to prepare for chi-sqr test for text mining
The data structure I described is a sparse (binary) matrix, and I should have said you could use the sparse matrix classes and methods in the SparseM or Matrix packages to do what I suggested in my previous message. There are linear algebra routines available for them, but for other operations I'm not sure what is supported.
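A rough sketch of the sparse-matrix form, assuming the Matrix package is installed and using a small made-up input: the toy "words" list, "class" vector, and the names "X", "present" and "XtX" are hypothetical, and each record is assumed to list a word index at most once. sparseMatrix(), colSums() and crossprod() are the Matrix functions used.

```r
library(Matrix)

# Hypothetical toy input: each record is an integer vector of word indices
words <- list(c(1L, 2L), c(2L, 3L), c(1L, 3L, 4L), c(2L))
class <- c(1L, 1L, 2L, 2L)   # class labels coded 1 or 2
n_words <- 4L

# Records x words binary matrix stored in sparse form
X <- sparseMatrix(
  i = rep(seq_along(words), lengths(words)),  # record index per entry
  j = unlist(words),                          # word index per entry
  x = 1,
  dims = c(length(words), n_words)
)

# "Present" counts per class: one row per class, one column per word
present <- rbind(colSums(X[class == 1, , drop = FALSE]),
                 colSums(X[class == 2, , drop = FALSE]))

# X'X computed from the sparse representation
XtX <- crossprod(X)
```

The "absent" counts follow as before by subtracting each row of present from the number of records in that class.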
Reid Huntsinger