thr3ads.net - R help - [R] coding to generate a matrix to prepare for chi-sqr test for text mining [Jun 2005]

Weiwei Shi

2005-Jun-15 21:10 UTC

[R] coding to generate a matrix to prepare for chi-sqr test for text mining

Hi, there:
I have a dataset like the following:

1412|WINDOW|SHATTER|TORN|SOFT|TOP|WATER|RAIN|LAB|AI|BOLL|CAMP|0
1413|PARK|IV|STRUCK|PARK|PUSH|COD|POLICI|CIA|TB|SIC|0
2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1
2414|IV|REAREND|CD|COG|LAB|ADVERS|1
2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0
2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1
2418|AI|FELL|ASLEEP|WHEEL|VEHICL|RETENT|POND|LAB|ADVERS|1
2419|ACCID|SINGL|VEHICL|TREE|FELL|IV|LIGHTN|STORM|IV|CAMP|CALM|AD|1
2422|THEFT|RECOV|TOTAL|THEFT|0
...

The first column is always id_num and the last one is the class label. I
want to run a chi-square test of the dependence between a word (or,
further, a word combination) and the class label.

For example, my goal is to build a table like the following, ready for
a chi-square test:

class label    ACCID (Yes)    ACCID (No)
     1              10             15
     0               5              9
 
Each number is a count of lines (observations).
Later I want to do the same for word combinations such as ACCID & WINDOW
(this combination came out of an association analysis in another program
of mine) instead of ACCID alone.
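
A word combination can be handled as "all of these words occur on the line". The following is only a sketch under that assumption; the helper name combo_table is made up for illustration, and the pair ACCID & SINGL is used because none of the sample lines shown contains both ACCID and WINDOW.

```r
# Five of the sample lines from this post.
lines <- c(
  "2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1",
  "2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0",
  "2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1",
  "2419|ACCID|SINGL|VEHICL|TREE|FELL|IV|LIGHTN|STORM|IV|CAMP|CALM|AD|1",
  "2422|THEFT|RECOV|TOTAL|THEFT|0"
)

# Split on the literal "|"; first field is the id, last field is the label.
fields <- strsplit(lines, "|", fixed = TRUE)
label  <- vapply(fields, function(f) f[length(f)], character(1))
words  <- lapply(fields, function(f) unique(f[-c(1, length(f))]))

# Hypothetical helper: 2 x 2 table of class label against
# "every word in `combo` occurs on the line".
combo_table <- function(combo, words, label) {
  present <- vapply(words, function(w) all(combo %in% w), logical(1))
  table(label = factor(label, levels = c("1", "0")),
        combo = factor(present, levels = c(TRUE, FALSE),
                       labels = c("Yes", "No")))
}

tab <- combo_table(c("ACCID", "SINGL"), words, label)
print(tab)
```

A single word is just a combination of length one, so the same helper covers both cases.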

My first question is how to build, automatically in R, a data structure
(a data frame) that represents the table above for each word. I am
learning R programming and would rather not do it in Python. (Don't
worry if a word appears twice in one observation; I have another version
of the data set that lists only unique words.)
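
One possible way to get such a table in base R, shown here only as a sketch: split each line on "|", treat the first field as the id and the last as the class label, and cross-tabulate the label against whether the word occurs. The helper name word_table is made up for illustration, and the nine sample lines stand in for the real file (which could be read with readLines()).

```r
# The sample lines from this post, in place of the real data file.
lines <- c(
  "1412|WINDOW|SHATTER|TORN|SOFT|TOP|WATER|RAIN|LAB|AI|BOLL|CAMP|0",
  "1413|PARK|IV|STRUCK|PARK|PUSH|COD|POLICI|CIA|TB|SIC|0",
  "2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1",
  "2414|IV|REAREND|CD|COG|LAB|ADVERS|1",
  "2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0",
  "2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1",
  "2418|AI|FELL|ASLEEP|WHEEL|VEHICL|RETENT|POND|LAB|ADVERS|1",
  "2419|ACCID|SINGL|VEHICL|TREE|FELL|IV|LIGHTN|STORM|IV|CAMP|CALM|AD|1",
  "2422|THEFT|RECOV|TOTAL|THEFT|0"
)

# fixed = TRUE makes "|" a literal separator rather than a regex operator.
fields <- strsplit(lines, "|", fixed = TRUE)
label  <- vapply(fields, function(f) f[length(f)], character(1))
# Drop id and label; unique() so a repeated word counts once per line.
words  <- lapply(fields, function(f) unique(f[-c(1, length(f))]))

# Hypothetical helper: 2 x 2 table of class label against word presence.
word_table <- function(word, words, label) {
  present <- vapply(words, function(w) word %in% w, logical(1))
  table(label = factor(label, levels = c("1", "0")),
        word  = factor(present, levels = c(TRUE, FALSE),
                       labels = c("Yes", "No")))
}

tab <- word_table("ACCID", words, label)
print(tab)

# The same counts in long data frame form (columns label, word, Freq).
df <- as.data.frame(tab)
print(df)
```

Fixing the factor levels keeps the table 2 x 2 even when a word never occurs (or always occurs), so every word gives a table of the same shape.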

My target is to find a p-value for each word/class label pair from a
chi-square test and to evaluate the significance of each feature for
later text mining. I am not sure if this is a good idea and I am reading
some papers on this.
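
As a rough sketch of the per-word p-values, one could loop over the vocabulary and call chisq.test() on each 2 x 2 table. The nine sample lines are used again here; with so few observations chisq.test() warns that the approximation may be incorrect, so the warnings are suppressed in this illustration only, and fisher.test() may be the safer choice for small counts.

```r
lines <- c(
  "1412|WINDOW|SHATTER|TORN|SOFT|TOP|WATER|RAIN|LAB|AI|BOLL|CAMP|0",
  "1413|PARK|IV|STRUCK|PARK|PUSH|COD|POLICI|CIA|TB|SIC|0",
  "2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1",
  "2414|IV|REAREND|CD|COG|LAB|ADVERS|1",
  "2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0",
  "2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1",
  "2418|AI|FELL|ASLEEP|WHEEL|VEHICL|RETENT|POND|LAB|ADVERS|1",
  "2419|ACCID|SINGL|VEHICL|TREE|FELL|IV|LIGHTN|STORM|IV|CAMP|CALM|AD|1",
  "2422|THEFT|RECOV|TOTAL|THEFT|0"
)

fields <- strsplit(lines, "|", fixed = TRUE)
label  <- factor(vapply(fields, function(f) f[length(f)], character(1)),
                 levels = c("1", "0"))
words  <- lapply(fields, function(f) unique(f[-c(1, length(f))]))

# Every distinct word in the data set.
vocab <- sort(unique(unlist(words)))

# One p-value per word; chisq.test() applies the Yates continuity
# correction to 2 x 2 tables by default.
p_values <- sapply(vocab, function(w) {
  present <- factor(vapply(words, function(ws) w %in% ws, logical(1)),
                    levels = c(TRUE, FALSE))
  suppressWarnings(chisq.test(table(label, present))$p.value)
})

# Testing many words at once calls for a multiple-testing adjustment.
p_adjusted <- p.adjust(p_values, method = "BH")

print(head(sort(p_values)))
```

A word that occurs on every line or on no line would give a table with an empty column, and its p-value would come back as NaN; such words carry no information about the class and can be dropped beforehand.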

Thanks,

--  
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III
