saggak
2008-Aug-29 09:13 UTC
[R] Poisson Distribution - problem with Chi Square Goodness of Fit test
Chi Square Test for Goodness of Fit

I have discrete data (the No_of_Frauds vector in the R script below) and I am trying to fit a Poisson distribution to it using R. My R script is as follows:

________________________________________________________
# R SCRIPT for Fitting Poisson Distribution

No_of_Frauds<-c(1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,2,2,2,1,1,2,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,5,1,2,1,1,1,1,1,1,1,3,2,1,1,1,2,1,1,2,1,1,1,1,1,2,1,3,1,2,1,2,14,2,1,1,38,3,3,2,44,1,4,1,4,1,2,2,1,3)

N       <- length(No_of_Frauds)
Average <- mean(No_of_Frauds)
Lambda  <- Average            # Poisson parameter estimated by the sample mean
i       <- 0:(N-1)
pmf     <- dpois(i, Lambda, log = FALSE)

# ----------------------------------------------------------------------------
# Ho: The data follow Poisson Distribution  Vs  H1: Not Ho

# observed frequencies (Oi)
variable.cnts <- table(No_of_Frauds)

# Poisson probabilities of the observed values (note: Lambda, not lambda)
variable.cnts.prs <- dpois(as.numeric(names(variable.cnts)), Lambda)

# extra class with zero count that takes the remaining probability
variable.cnts     <- c(variable.cnts, 0)
variable.cnts.prs <- c(variable.cnts.prs, 1 - sum(variable.cnts.prs))

tst <- chisq.test(variable.cnts, p = variable.cnts.prs)

chi_squared <- as.numeric(tst$statistic)
p_value     <- as.numeric(tst$p.value)
df          <- tst$parameter

# critical values at the 1%, 5% and 10% levels of significance
cv1 <- qchisq(p = .01, df = df, lower.tail = FALSE, log.p = FALSE)
cv2 <- qchisq(p = .05, df = df, lower.tail = FALSE, log.p = FALSE)
cv3 <- qchisq(p = .1,  df = df, lower.tail = FALSE, log.p = FALSE)

#-----------------------------------------------------------------------------
# Expected value
# variable.cnts.prs * sum(variable.cnts)

# if chi_squared > cv, reject Ho at the alpha level of significance
#-----------------------------------------------------------------------------

if (chi_squared > cv1) {
  Conclusion1 <- 'Sample does not come from the postulated probability distribution at 1% los'
} else {
  Conclusion1 <- 'Sample comes from postulated prob. distribution at 1% los'
}

if (chi_squared > cv2) {
  Conclusion2 <- 'Sample does not come from the postulated probability distribution at 5% los'
} else {
  Conclusion2 <- 'Sample comes from postulated prob. distribution at 5% los'
}

if (chi_squared > cv3) {
  Conclusion3 <- 'Sample does not come from the postulated probability distribution at 10% los'
} else {
  Conclusion3 <- 'Sample comes from postulated prob. distribution at 10% los'
}

#-----------------------------------------------------------------------------
# Printing RESULTS

print(chi_squared)
print(p_value)
print(df)

print(cv1)
print(cv2)
print(cv3)

print(Conclusion1)
print(Conclusion2)
print(Conclusion3)

##### End of R Script ########
________________________________________________________

Problem faced:

When I run this script in the R console, I get a Chi-Square statistic as high as “6.95753e+37”. When I did the same calculation in Excel, I got a Chi-Square statistic of 138.34.

It is clear that the sample data do not follow a Poisson distribution, and I will have to look for another discrete distribution. My problem is the HIGH value of the Chi-Square test statistic.

When I analyzed further, I understood the problem.

(A) By convention, if the expected frequency of a class is less than 5, such classes are pooled into a new class so that its expected frequency is greater than 5, and the observed frequencies are pooled accordingly.

X        Oi     Ei    ((Oi - Ei)^2)/Ei
0         0     10        9.96
1        72     23      103.79
2        17     27        3.54
3         5     21       11.85
4         3     12        6.71
5         4      9        2.51
Total   101    101      138.34

When I apply this logic in Excel, I get a reasonable result (i.e. 138.34). However, in Excel too, if I don’t apply this logic, the Chi-Square test statistic is as high as 4.70043E+37.

My question is: how do I modify my R script so that the logic in (A) is applied, i.e. the expected frequencies (and accordingly the observed frequencies) are pooled until the expected frequency of every class is greater than 5, giving a reasonable value of the Chi-Square test statistic?

I am also attaching the xls file for ready reference.

I sincerely apologize for writing such a long mail. I am very new to the R language; can someone help me out?

Thanking you in advance for your kind co-operation.

Ashok (Mumbai, India)
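
The pooling rule described in (A) could be sketched in R roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: the function name `pooled_poisson_gof` and its `min_expected` argument are hypothetical, classes are pooled greedily from 0 upward, the last class is treated as open-ended ("k or more"), and the degrees of freedom are taken as (number of pooled classes - 2) because lambda is estimated from the data. The example call uses simulated data, not the fraud counts.

```r
# Hypothetical helper: chi-square goodness-of-fit test for a Poisson model,
# pooling adjacent classes until each expected frequency is >= min_expected.
pooled_poisson_gof <- function(x, min_expected = 5) {
  n <- length(x)
  lambda_hat <- mean(x)                  # lambda estimated from the sample
  k <- 0:max(x)                          # classes 0, 1, ..., max(x)

  observed <- as.vector(table(factor(x, levels = k)))
  prob <- dpois(k, lambda_hat)
  # Assumption: the last class is open-ended ("max(x) or more"),
  # so that the class probabilities sum to 1.
  prob[length(prob)] <- ppois(max(x) - 1, lambda_hat, lower.tail = FALSE)
  expected <- n * prob

  # Greedy pooling from the lowest class upward.
  obs_pooled <- numeric(0)
  exp_pooled <- numeric(0)
  o_acc <- 0
  e_acc <- 0
  for (j in seq_along(k)) {
    o_acc <- o_acc + observed[j]
    e_acc <- e_acc + expected[j]
    if (e_acc >= min_expected) {
      obs_pooled <- c(obs_pooled, o_acc)
      exp_pooled <- c(exp_pooled, e_acc)
      o_acc <- 0
      e_acc <- 0
    }
  }
  # Whatever is left over in the upper tail is merged into the last class.
  if (length(exp_pooled) > 0) {
    last <- length(exp_pooled)
    obs_pooled[last] <- obs_pooled[last] + o_acc
    exp_pooled[last] <- exp_pooled[last] + e_acc
  }

  # Assumption: one degree of freedom is lost for the estimated lambda.
  df <- length(obs_pooled) - 2
  if (df < 1) stop("too few classes left after pooling")

  statistic <- sum((obs_pooled - exp_pooled)^2 / exp_pooled)
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)

  list(lambda = lambda_hat, observed = obs_pooled, expected = exp_pooled,
       statistic = statistic, df = df, p_value = p_value)
}

# Example call on simulated data (for illustration only)
set.seed(1)
sim <- rpois(200, lambda = 2)
fit <- pooled_poisson_gof(sim)
print(fit$statistic)
print(fit$df)
print(fit$p_value)
```

With the data from the script, the call would be `pooled_poisson_gof(No_of_Frauds)`, and `cbind(fit$observed, fit$expected)` would show the pooled classes. Note that `chisq.test()` itself uses (number of classes - 1) degrees of freedom, so its p-value does not account for the estimated lambda.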