d.s.robinson at dur.ac.uk
2007-Mar-01 16:16 UTC
[R] Creating a vector of variable bin widths
Dear R users,

I am having a little trouble with grouping data.

-----------Detailed explanation (summary below)------------

A small sample of my data is below (it has already been rounded and grouped a little from the raw data for clarity). I am sampling data from an unknown game which, according to my null hypothesis, follows a binomial distribution. The game can supposedly be played with a range of probabilities of success (the independent variable); 0.0-0.3 are shown below, although my full data set goes all the way up to 0.99. The number of observations for each probability of success and the actual proportion of wins in the sample (the dependent variable) are also shown. By the CLT, the sample winning proportion (the dependent variable) should be an unbiased estimator of the population proportion (the independent variable).

I want to perform a significance test at each probability level to see if the null hypothesis can be rejected. The problem is in defining those probability levels. At the moment, some probabilities of success have a very low number of observations, whilst others have very many. Leaving the data as it is gives statistically meaningless results at the low and high levels of success. Further grouping the data using fixed group widths results in very few observations per group at high and low probabilities, and a few groups in the middle with a very high number of observations.

The way around this (I think) is to use variable bin widths. Each bin should be wide enough that (again, I think this is a reasonable idea) the variance of the sample estimate (using the normal approximation to the binomial), p(1-p)/n, is less than a certain value, say (2%)^2 = 0.0004. I presume I also need to make sure that for each group np >= 5 and n(1-p) >= 5, or can this simply replace the variance test?
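As a rough illustration of the two criteria just described, the sketch below computes the smallest n that satisfies each one at a few probability levels. The helper names `min_n_var` and `min_n_np` are made up for this example, and the thresholds (a standard error of 0.02 and an expected count of 5) are simply the values mentioned above, not recommendations.

```r
# Hypothetical helpers, for illustration only.

# Smallest n with p(1-p)/n <= max_se^2 (the variance criterion)
min_n_var <- function(p, max_se = 0.02) {
  ceiling(p * (1 - p) / max_se^2)
}

# Smallest n with n*p >= min_expected and n*(1-p) >= min_expected
# (the common rule of thumb for the normal approximation)
min_n_np <- function(p, min_expected = 5) {
  ceiling(min_expected / pmin(p, 1 - p))
}

p <- c(0.01, 0.05, 0.10, 0.20, 0.30)
tab <- data.frame(p = p,
                  min_n_var = min_n_var(p),
                  min_n_np  = min_n_np(p))
print(tab)
```

With these thresholds, at p = 0.01 the variance criterion is met with about 25 observations whereas the expected-count rule needs about 500, and at larger p the ordering reverses. So, under these assumptions, the two conditions bind in different parts of the range and neither obviously replaces the other.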
IndependentVar  Observations  DependentVar
--------------------------------------------
0.01             1            0.000
0.03             5            0.000
0.04            11            0.000
0.05             9            0.000
0.06            19            0.000
0.07            12            0.000
0.08            18            0.056
0.09            10            0.200
0.10            13            0.077
0.11            17            0.118
0.12            17            0.059
0.13            18            0.056
0.14            21            0.000
0.15            25            0.160
0.16            23            0.000
0.17            35            0.314
0.18            26            0.231
0.19            31            0.226
0.20            27            0.148
0.21            26            0.462
0.22            21            0.286
0.23            29            0.207
0.24            38            0.289
0.25            38            0.132
0.26            27            0.259
0.27            52            0.308
0.28            62            0.194
0.29            82            0.232
0.30            97            0.278

------------------Summary---------------------------

So, how can I write a function that creates a vector of variable break values for, say, cut()? It should iteratively make bins wider until a condition is met that depends on the value to be binned (the probability of success) and on a second value, the number of observations (assuming you agree with my method of restricting the variance, the rationale for which is outlined above).

I would appreciate any comments on either the reasoning (I am fairly new to this sort of statistics) or how I can write the R code to achieve the proposed goal. I hope I have explained this clearly enough to merit a response.

Regards,
DR
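One way the requested function might look is sketched below. It is only a greedy illustration under stated assumptions, not a vetted method: `variable_breaks` is a hypothetical name, the probabilities are assumed to be sorted and strictly greater than the `lower` bound, the pooled probability of a bin is taken as the observation-weighted mean of its nominal probabilities, and any levels left over at the top that never meet the criteria are folded into the last bin.

```r
# Greedy variable-width binning (illustrative sketch, hypothetical helper).
# p: sorted nominal probabilities of success; n: observations at each p.
# Returns break points suitable for cut(p, breaks = ...).
variable_breaks <- function(p, n, max_var = 0.02^2, min_expected = 5,
                            lower = 0) {
  stopifnot(length(p) == length(n), !is.unsorted(p), all(p > lower))
  breaks <- numeric(0)
  n_bin  <- 0   # observations accumulated in the current bin
  np_bin <- 0   # sum of n * p in the current bin
  for (i in seq_along(p)) {
    n_bin  <- n_bin + n[i]
    np_bin <- np_bin + n[i] * p[i]
    p_bar  <- np_bin / n_bin            # pooled nominal probability
    ok <- p_bar * (1 - p_bar) / n_bin <= max_var &&
      np_bin >= min_expected &&
      (n_bin - np_bin) >= min_expected
    if (ok) {
      breaks <- c(breaks, p[i])         # close the bin at its upper edge
      n_bin  <- 0
      np_bin <- 0
    }
  }
  if (n_bin > 0) {
    # Leftover levels: extend the last bin (or make a single bin) to max(p)
    if (length(breaks) > 0) {
      breaks[length(breaks)] <- p[length(p)]
    } else {
      breaks <- p[length(p)]
    }
  }
  c(lower, breaks)
}

# Sample data from the table above
dat <- data.frame(
  p = c(0.01, seq(0.03, 0.30, by = 0.01)),
  n = c(1, 5, 11, 9, 19, 12, 18, 10, 13, 17, 17, 18, 21, 25, 23, 35, 26,
        31, 27, 26, 21, 29, 38, 38, 27, 52, 62, 82, 97)
)
b <- variable_breaks(dat$p, dat$n)
dat$bin <- cut(dat$p, breaks = b)
print(aggregate(n ~ bin, data = dat, FUN = sum))
```

Because the breaks are taken from the probability values themselves and cut() uses right-closed intervals by default, each level falls in the bin that it closed. Loosening `max_var` should give narrower bins; whether pooling neighbouring probability levels is statistically sound for the intended significance test is a separate question that this sketch does not address.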