thr3ads.net - R help - [R] chisq test and fisher exact test [Jun 2005]

Weiwei Shi

2005-Jun-22 15:30 UTC

[R] chisq test and fisher exact test

Hi,
I have a text mining project and am currently working on the feature
generation/selection part.
My plan is to select a set of words or word combinations that
discriminate better than other words between the group ids (2 classes
in this case) in a dataset of 2,000,000 documents.

One approach is "contrast-set association rule mining"; the other is
the chi-squared test or the Fisher exact test.

An example with 3 contingency tables for 3 words follows
(words coded by number):

> tab[,,1:3]
, , 1

      [,1]    [,2]
[1,] 11266 2151526
[2,]   125   31734

, , 2

      [,1]    [,2]
[1,] 43571 2119221
[2,]    52   31807

, , 3

     [,1]    [,2]
[1,]  427 2162365
[2,]    5   31854


I have some questions on this:
1. What's the rule of thumb for using the chisq test instead of the
Fisher exact test? I have a vague memory that each cell count needs
to be over 50 if chisq is to be used instead of the Fisher exact test.
In the case of word 3, I think I should use the Fisher test.
However, running chisq as below works fine:

> tab[,,3]
     [,1]    [,2]
[1,]  427 2162365
[2,]    5   31854
> chisq.test(tab[,,3])

        Pearson's Chi-squared test with Yates' continuity correction

data:  tab[, , 3]
X-squared = 0.0963, df = 1, p-value = 0.7564

but running it on the whole set of words (14240 words) gives the
following warnings:

> p.chisq<-as.double(lapply(1:N, function(i) chisq.test(tab[,,i])$p.value))
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
2: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
3: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
4: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])


2. So my second question is: is this warning because I am violating
the assumptions of the chisq test? But why is word 3 fine? How can I
trace the warnings to see which words caused them?
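
A minimal sketch of how the warnings could be traced, assuming `tab` is a 2 x 2 x N array laid out like the three example tables above. As far as I know, chisq.test() issues "Chi-squared approximation may be incorrect" when any expected count (not observed count) is below 5, so the smallest expected count per word can be checked directly. The names `expected_counts`, `min_expected`, `flagged` and `warned` are made up for this illustration.

```r
# The three example tables from above, as a 2 x 2 x 3 array (column-major fill)
tab <- array(c(11266, 125, 2151526, 31734,
               43571,  52, 2119221, 31807,
                 427,   5, 2162365, 31854),
             dim = c(2, 2, 3))

# Expected counts under independence for a single 2 x 2 table
expected_counts <- function(m) outer(rowSums(m), colSums(m)) / sum(m)

# Smallest expected count for each word
min_expected <- apply(tab, 3, function(m) min(expected_counts(m)))

# Words whose tables have an expected count below 5
flagged <- which(min_expected < 5)

# Alternative: record directly which calls to chisq.test() raise a warning
warned <- vapply(seq_len(dim(tab)[3]), function(i) {
  tryCatch({ chisq.test(tab[, , i]); FALSE }, warning = function(w) TRUE)
}, logical(1))
```

For word 3 the smallest expected count is about 6.27 even though the observed count in that cell is 5, which would explain why that table runs without a warning.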

3. My result looks like this (after mapping the number ids back to
words; some words are stemmed here, e.g. ACCID is accident):
> of[1:50,]
      map...2.      p.fisher
21       ACCID  0.000000e+00
30          CD  0.000000e+00
67        ROCK  0.000000e+00
104      CRACK  0.000000e+00
111       CHIP  0.000000e+00
179      GLASS  0.000000e+00
84        BACK 4.199878e-291
395   DRIVEABL 5.335989e-287
60         CAP 9.405235e-285
262 WINDSHIELD 2.691641e-254
13          IV 3.905186e-245
110         HZ 2.819713e-210
11        CAMP 9.086768e-207
2      SHATTER 5.273994e-202
297        ALP 1.678521e-177
162        BED 1.822031e-173
249        BCD 1.398391e-160
493       RACK 4.178617e-156
59        CAUS 7.539031e-147

3.1 Question: Should I use a two-sided test instead of a one-sided
one for the Fisher test? I read some material that suggests using
two-sided.

3.2 A big question: The result looks very promising, since this is a
case of classifying fraud cases and the words selected by this
approach make sense. However, I think the p-values here only indicate
the strength of evidence against the null hypothesis, not the strength
of association between word and document class. So what kind of
statistic should I use to evaluate the strength of association? The
odds ratio?

Any suggestions are welcome!

Thanks!
-- 
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

Kjetil Brinchmann Halvorsen

2005-Jun-22 18:50 UTC

You can use chisq.test with sim=TRUE (short for simulate.p.value=TRUE),
or call it as usual first, see if there is a warning, and then call it
again with sim=TRUE.
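
A rough sketch of the second variant: try the usual test first and fall back to the simulated p-value only when a warning is raised. The helper `chisq_p` and the table `small` are hypothetical names and data made up for this example; `B` is the number of Monte Carlo replicates passed to chisq.test().

```r
# Return the chi-squared p-value, re-running with simulation if a warning occurs
chisq_p <- function(m, B = 2000) {
  tryCatch(
    chisq.test(m)$p.value,
    warning = function(w) {
      chisq.test(m, simulate.p.value = TRUE, B = B)$p.value
    }
  )
}

# Word 3 from the thread: no warning, so the usual approximation is used
word3 <- matrix(c(427, 5, 2162365, 31854), nrow = 2)
p_word3 <- chisq_p(word3)

# A small made-up table with expected counts of 2, which triggers the fallback
set.seed(1)  # the simulated p-value is random, so fix the seed
small <- matrix(c(3, 1, 1, 3), nrow = 2)
p_small <- chisq_p(small)
```

Note that a simulated p-value cannot be smaller than 1/(B + 1), so it will not separate words whose p-values are extremely small.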

Kjetil

-- 

Kjetil Halvorsen.

Peace is the most effective weapon of mass construction.
               --  Mahdi Elmandjra



