Weiwei Shi
2005-Jun-15 21:10 UTC
[R] coding to generate a matrix to prepare for chi-sqr test for text mining
Hi there,

I have a dataset like the following:

  1412|WINDOW|SHATTER|TORN|SOFT|TOP|WATER|RAIN|LAB|AI|BOLL|CAMP|0
  1413|PARK|IV|STRUCK|PARK|PUSH|COD|POLICI|CIA|TB|SIC|0
  2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1
  2414|IV|REAREND|CD|COG|LAB|ADVERS|1
  2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0
  2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1
  2418|AI|FELL|ASLEEP|WHEEL|VEHICL|RETENT|POND|LAB|ADVERS|1
  2419|ACCID|SINGL|VEHICL|TREE|FELL|IV|LIGHTN|STORM|IV|CAMP|CALM|AD|1
  2422|THEFT|RECOV|TOTAL|THEFT|0
  ...

The first column is always id_num, and the last one is the class label.

I want to run a chi-square test of the dependence between a word (or, later, a word combination) and the class label. For example, my goal is to build a table like the following, ready for a chi-square test:

  class label   ACCID (Yes)   ACCID (No)
  1             10            15
  0              5             9

Each count is the number of lines (observations). Later I want to do the same for word combinations such as ACCID & WINDOW (these combinations come from an association analysis done in another program of mine) instead of ACCID only.

My first question is: how can I automatically build a data structure in R (a data frame) that represents the table above for each word? I am learning R programming and I don't want to do it in Python. (Don't worry if a word appears twice in one observation; I have another version of the data set which lists only unique words.)

My goal is to get a p-value for each word/class label pair from the chi-square test and use it to evaluate the significance of each feature for later text mining. I am not sure whether this is a good idea, and I am reading some papers on it.

Thanks,

-- 
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III
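For illustration, here is a minimal sketch of one way the table described above could be built in base R. It is not a definitive answer: the object names (lines, words, label, word_table, vocab, pvals) are made up for this example, the nine records are the sample lines from the post (a real run would presumably read the file with readLines()), and a word is counted at most once per line. With counts this small the chi-square approximation is unreliable, so the p-values here only show the mechanics; fisher.test() may be a better fit for sparse tables.

```r
# Sample records copied from the post; a real run would use readLines() on the file.
lines <- c(
  "1412|WINDOW|SHATTER|TORN|SOFT|TOP|WATER|RAIN|LAB|AI|BOLL|CAMP|0",
  "1413|PARK|IV|STRUCK|PARK|PUSH|COD|POLICI|CIA|TB|SIC|0",
  "2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1",
  "2414|IV|REAREND|CD|COG|LAB|ADVERS|1",
  "2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0",
  "2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1",
  "2418|AI|FELL|ASLEEP|WHEEL|VEHICL|RETENT|POND|LAB|ADVERS|1",
  "2419|ACCID|SINGL|VEHICL|TREE|FELL|IV|LIGHTN|STORM|IV|CAMP|CALM|AD|1",
  "2422|THEFT|RECOV|TOTAL|THEFT|0"
)

# Split each record on the literal "|" separator.
fields <- strsplit(lines, "|", fixed = TRUE)

# The last field is the class label; the first is the id; the rest are words.
label <- factor(sapply(fields, function(f) f[length(f)]), levels = c("1", "0"))
words <- lapply(fields, function(f) unique(f[-c(1, length(f))]))

# Build the 2 x 2 table for one word, or for a combination of words
# (a line counts as "Yes" only if it contains every word in `target`).
word_table <- function(target) {
  present <- sapply(words, function(w) all(target %in% w))
  table(label   = label,
        present = factor(present, levels = c(TRUE, FALSE),
                         labels = c("Yes", "No")))
}

tab       <- word_table("ACCID")              # single word
tab_combo <- word_table(c("ACCID", "SINGL"))  # word combination

# One p-value per distinct word; warnings about small expected counts
# are suppressed here only to keep the sketch quiet.
vocab <- sort(unique(unlist(words)))
pvals <- sapply(vocab, function(w)
  suppressWarnings(chisq.test(word_table(w))$p.value))
```

Calling as.data.frame(tab) turns a table into a data frame of label, present, and Freq columns if that layout is preferred, and head(sort(pvals)) lists the words with the smallest p-values.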