Weiwei Shi wrote:
>Hi,
>I have a text mining project and am currently working on the feature
>generation/selection part.
>My plan is to select a set of words or word combinations which
>discriminate between the group ids (2 classes in this case) better
>than other words, for a dataset which has 2,000,000
>documents.
>
>One approach is using "contrast-set association rule mining" while the
>other is using chisqr or fisher exact test.
>
>An example with 3 contingency tables for 3 words follows
>(words coded by number):
>
>
>> tab[,,1:3]
>, , 1
>
>      [,1]    [,2]
>[1,] 11266 2151526
>[2,]   125   31734
>
>, , 2
>
>      [,1]    [,2]
>[1,] 43571 2119221
>[2,]    52   31807
>
>, , 3
>
>     [,1]    [,2]
>[1,]  427 2162365
>[2,]    5   31854
>
>
>I have some questions on this:
>1. What's the rule of thumb for using the chisq test instead of Fisher's
>exact test? I have a vague memory that each cell count needs
>to be over 50 for chisq to be used in place of Fisher's exact test.
>In the case of word 3, I think I should use the Fisher test.
>However, running chisq as below works without complaint:
>
>
>> tab[,,3]
>     [,1]    [,2]
>[1,]  427 2162365
>[2,]    5   31854
>
>> chisq.test(tab[,,3])
>
> Pearson's Chi-squared test with Yates' continuity correction
>
>data: tab[, , 3]
>X-squared = 0.0963, df = 1, p-value = 0.7564
>
>but running on the whole set of words (including 14240 words) has the
>following warnings:
>
>
>> p.chisq <- as.double(lapply(1:N, function(i) chisq.test(tab[,,i])$p.value))
>There were 50 or more warnings (use warnings() to see the first 50)
>
>
>> warnings()
>Warning messages:
>1: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>2: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>3: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>4: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
>
>
>2. So, my second question is: is this warning because I am violating the
>assumptions of the chisq test? But then why is word 3 fine? How can I
>trace the warning to see which word caused it?
>
>3. My result looks like this (after mapping the number
>ids back to words; some words are stemmed here, e.g. ACCID is accident):
> > of[1:50,]
> map...2. p.fisher
>21 ACCID 0.000000e+00
>30 CD 0.000000e+00
>67 ROCK 0.000000e+00
>104 CRACK 0.000000e+00
>111 CHIP 0.000000e+00
>179 GLASS 0.000000e+00
>84 BACK 4.199878e-291
>395 DRIVEABL 5.335989e-287
>60 CAP 9.405235e-285
>262 WINDSHIELD 2.691641e-254
>13 IV 3.905186e-245
>110 HZ 2.819713e-210
>11 CAMP 9.086768e-207
>2 SHATTER 5.273994e-202
>297 ALP 1.678521e-177
>162 BED 1.822031e-173
>249 BCD 1.398391e-160
>493 RACK 4.178617e-156
>59 CAUS 7.539031e-147
>
>3.1 Question: Should I use a two-sided test instead of a one-sided one
>for the Fisher test? I read some material which suggests using two-sided.
>
>3.2 A big question: The result looks very promising, since
>this is a case of classifying fraud cases and the words selected by this
>approach make sense. However, I think the p-values here just indicate the
>strength of evidence against the null hypothesis, not the strength of
>association between word and class of document. So, what kind of
>statistic should I use here to evaluate the strength of association?
>Odds ratio?
>
>Any suggestions are welcome!
>
>Thanks!
>
>
You can use chisq.test with sim=TRUE (short for simulate.p.value=TRUE,
via partial argument matching), or call it as usual first, check whether
there is a warning, and then call it again with sim=TRUE.
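
A minimal sketch of the "call it, check for a warning, then call it again" idea follows. It is only an illustration under stated assumptions: `chisq_with_fallback` is a hypothetical helper name, the first three tables are copied from the question, and the fourth table is made up purely so that something triggers the warning. As far as I know, chisq.test issues this warning when some expected cell count is below 5, which would explain why word 3 passes (its smallest expected count is about 6.3).

```r
# First three slices are the tables from the question; the fourth is a
# MADE-UP table with a very rare word, added only to trigger the warning.
tab <- array(c(11266, 125, 2151526, 31734,
               43571,  52, 2119221, 31807,
                 427,   5, 2162365, 31854,
                   3,   1, 2162789, 31858),   # hypothetical word
             dim = c(2, 2, 4))

# Hypothetical helper: run chisq.test(), note whether it warned, and if so
# recompute the p-value by Monte Carlo simulation.
chisq_with_fallback <- function(m, B = 2000) {
  warned <- FALSE
  p <- withCallingHandlers(
    chisq.test(m)$p.value,
    warning = function(w) {
      warned <<- TRUE                 # remember that this table warned
      invokeRestart("muffleWarning")  # and keep the warning off the console
    }
  )
  if (warned) {
    p <- chisq.test(m, simulate.p.value = TRUE, B = B)$p.value
  }
  c(p.value = p, warned = as.numeric(warned))
}

# One row per word: the p-value and a 0/1 flag for "the plain test warned".
res <- t(sapply(seq_len(dim(tab)[3]),
                function(i) chisq_with_fallback(tab[, , i])))

# Indices of the words whose tables raised the warning.
warned_idx <- which(res[, "warned"] == 1)

# Smallest expected cell count for word 3, for comparison with the
# threshold of 5.
min_expected_word3 <- min(chisq.test(tab[, , 3])$expected)
```

The `warned` column answers the "which word caused this warning" question directly. With 14240 words, the simulation only runs for the tables that need it, so the extra cost depends on how many rare words there are.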
Kjetil
--
Kjetil Halvorsen.
Peace is the most effective weapon of mass construction.
-- Mahdi Elmandjra