Madhavi Bhave
2008-Aug-29 10:02 UTC
[R] Problem with Poisson - Chi Square Goodness of Fit Test - New Mail
Dear R-help,

Chi Square Test for Goodness of Fit

I have discrete data, given in the R script below. I am trying to fit a Poisson distribution to this data using R.

My R script is as under:

________________________________________________________

# R SCRIPT for Fitting Poisson Distribution

No_of_Frauds <- c(1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,2,1,1,2,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,5,1,2,1,1,1,1,1,1,1,3,2,1,1,1,2,1,1,2,1,1,1,1,1,2,1,3,1,2,1,2,14,2,1,1,38,3,3,2,44,1,4,1,4,1,2,2,1,3)

N       <- length(No_of_Frauds)
Average <- mean(No_of_Frauds)
Lambda  <- Average

i   <- c(0:(N-1))
pmf <- dpois(i, Lambda, log = FALSE)

# ----------------------------------------------------------------------------

# Ho: The data follow Poisson Distribution Vs H1: Not Ho

# observed frequencies (Oi)

variable.cnts     <- table(No_of_Frauds)
# Poisson probability of each distinct observed value
variable.cnts.prs <- dpois(as.numeric(names(variable.cnts)), Lambda)
# extra class with zero observations for all values not seen in the data
variable.cnts     <- c(variable.cnts, 0)

variable.cnts.prs <- c(variable.cnts.prs, 1-sum(variable.cnts.prs))
tst               <- chisq.test(variable.cnts, p=variable.cnts.prs)

chi_squared <- as.numeric(unclass(tst)$statistic)
p_value     <- as.numeric(unclass(tst)$p.value)
df          <- tst[2]$parameter

# critical values at the 1%, 5% and 10% levels of significance
cv1 <- qchisq(p=.01, df=tst[2]$parameter, lower.tail = FALSE, log.p = FALSE)

cv2 <- qchisq(p=.05, df=tst[2]$parameter, lower.tail = FALSE, log.p = FALSE)

cv3 <- qchisq(p=.1, df=tst[2]$parameter, lower.tail = FALSE, log.p = FALSE)

#-----------------------------------------------------------------------------

# Expected value

# variable.cnts.prs * sum(variable.cnts)

# if chi_squared > cv reject Ho at the alpha level of significance (los)

#-----------------------------------------------------------------------------

if(chi_squared > cv1) {
  Conclusion1 <- 'Sample does not come from the postulated probability distribution at 1% los'
} else {
  Conclusion1 <- 'Sample comes from postulated prob. distribution at 1% los'
}

if(chi_squared > cv2) {
  Conclusion2 <- 'Sample does not come from the postulated probability distribution at 5% los'
} else {
  Conclusion2 <- 'Sample comes from postulated prob. distribution at 5% los'
}

if(chi_squared > cv3) {
  Conclusion3 <- 'Sample does not come from the postulated probability distribution at 10% los'
} else {
  Conclusion3 <- 'Sample comes from postulated prob. distribution at 10% los'
}

#-----------------------------------------------------------------------------

# Printing RESULTS

print(chi_squared)
print(p_value)
print(df)
print(cv1)
print(cv2)
print(cv3)
print(Conclusion1)
print(Conclusion2)
print(Conclusion3)

##### End of R Script ########

________________________________________________________

Problem Faced:

When I run this script in the R console, I get a Chi Square statistic as high as "6.95753e+37".

When I did the same calculations in Excel, I got a Chi Square statistic of 138.34.

Although it is clear that the sample data doesn't follow a Poisson distribution, and I will have to look for another discrete distribution, my problem is the HIGH value of the Chi Square test statistic. When I analyzed further, I understood the problem.

(A) By convention, if the expected frequency of a class is less than 5, we put such classes together to form a new class whose expected frequency is greater than 5, and adjust the observed frequencies accordingly.

X        Oi     Ei    ((Oi - Ei)^2)/Ei
0         0     10        9.96
1        72     23      103.79
2        17     27        3.54
3         5     21       11.85
4         3     12        6.71
5         4      9        2.51
Total   101    101      138.34

When I apply this logic in Excel, I get a reasonable result (i.e. 138.34); however, in Excel also, if I don't apply this logic, my Chi Square test statistic is as high as 4.70043E+37.

My question is: how do I modify my R script so that the logic mentioned in (A), i.e. pooling classes so that the expected frequency (and accordingly the observed frequency) of each class becomes greater than 5, is applied, thereby giving a reasonable value of the Chi Square test statistic?

I am also attaching the xls file for ready reference.

I sincerely apologize for taking the liberty of writing such a long mail. Since I am very new to the "R language", can someone help me out?

Thanking you in advance for your kind co-operation.

Ashok (Mumbai, India)
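
The pooling rule in (A) could be sketched in R roughly as follows. This is only a minimal illustration, not a definitive answer: the function name pooled_poisson_gof and its argument min_expected are hypothetical, it pools only the upper tail (values of k or more become one open class) and assumes the lower classes already have expected counts of at least 5, and it takes one extra degree of freedom off for the estimated Lambda, which chisq.test does not do. The vector x is rebuilt from the counts in the table above (72 ones, 17 twos, 5 threes, 3 fours, plus 5, 14, 38 and 44) rather than typed out in full.

```r
# Sketch: chi-square goodness of fit for a Poisson model, pooling the upper
# tail so that the open class "k or more" has an expected count >= min_expected.
# pooled_poisson_gof and min_expected are illustrative names, not an R API.
pooled_poisson_gof <- function(x, min_expected = 5) {
  n      <- length(x)
  lambda <- mean(x)                       # estimate of the Poisson mean

  # Find the largest k such that n * P(X >= k) >= min_expected.
  # ppois(k, lambda, lower.tail = FALSE) is P(X > k) = P(X >= k + 1).
  k <- 0
  while (n * ppois(k, lambda, lower.tail = FALSE) >= min_expected) {
    k <- k + 1
  }
  if (k < 2) stop("too few classes left after pooling")

  # Classes are 0, 1, ..., k-1 and the pooled class "k or more".
  observed <- tabulate(pmin(x, k) + 1, nbins = k + 1)
  probs    <- c(dpois(seq_len(k) - 1, lambda),
                ppois(k - 1, lambda, lower.tail = FALSE))
  expected <- n * probs

  # Assumption: only the upper tail needed pooling; warn if that is not so.
  if (any(expected < min_expected)) {
    warning("some lower classes still have expected counts below min_expected")
  }

  statistic <- sum((observed - expected)^2 / expected)
  df        <- length(observed) - 1 - 1   # one more df lost for estimating lambda
  p_value   <- pchisq(statistic, df, lower.tail = FALSE)

  list(classes   = c(as.character(seq_len(k) - 1), paste0(k, "+")),
       observed  = observed,
       expected  = expected,
       statistic = statistic,
       df        = df,
       p_value   = p_value)
}

# Data rebuilt from the frequency table in the mail.
x   <- c(rep(1, 72), rep(2, 17), rep(3, 5), rep(4, 3), 5, 14, 38, 44)
res <- pooled_poisson_gof(x)
print(data.frame(class = res$classes, Oi = res$observed, Ei = round(res$expected, 2)))
print(res$statistic)
print(res$df)
print(res$p_value)
```

With these data the pooled class should start at 5, and the statistic should come out close to the 138.34 obtained in Excel. The very large value from the unpooled calculation comes from classes such as 38 and 44, whose expected counts under the fitted Poisson are so close to zero that (Oi - Ei)^2/Ei becomes enormous.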