Sorry for the dumb question, but I can't work out how to do this.

Quick version:

How can I re-bin a given frequency distribution using new breaks, without reference to the original data? The given distribution has integer-valued bins.

Long version:

I am loading a frequency table into R from a file. The original data is very large, and it is very simple to get a frequency distribution from an SQL database, so this is a convenient method for me. The point is that I don't start with 'raw' data.

The data looks like this...

    > dat
        COUNT FREQUENCY
    1       1      5734
    2       2      1625
    3       3       793
    4       4       480
    5       5       294
    6       6       237
    7       7       205
    8       8       200
    9       9       123
    10     10       108
    11     11        90
    12     12        62
    13     13        60
    14     14        68
    15     15        64
    16     16        56
    17     17        68
    18     18        45
    19     19        38
    20     20        37
    21     21        29
    22     22        39
    23     23        35
    24     24        33
    25     25        36
    ...
    148   153         5
    149   156         2
    150   157         3
    151   158         2
    152   159         2
    153   162         1
    154   163         3
    155   164         3
    156   165         2
    157   166         1
    158   168         2
    159   169         4
    160   170         1
    ...
    354  2106         1
    355  2189         1
    356  2194         1
    357  2217         1
    358  2246         1
    359  2474         1
    360  2801         1
    361  3697         1
    362  3702         1
    363  7353         1
    364  8738         1
    365  9442         1
    366 12280         1

This is a typical 'count / frequency' distribution in biology, where low counts of a certain property are very frequent (across genomes, proteins, ecosystems, etc.) and high counts of a certain property are very rare.

In the above example a certain property occurs 12280 times with a frequency of 1, and another property occurs 9442 times with the same frequency. At the other extreme, a certain property occurs once with a frequency of 5734, and another property occurs twice with a frequency of 1625.

This kind of distribution is variously known as a "zipf", a "power law", a "Pareto", "scale free", "heavy tailed" or an "80:20" distribution, or colloquially "the dominance of the few over the many". The term I choose is a "log linear" distribution, because that makes no assumptions about the underlying cause of the overall shape.

People typically quote the curve in the form y ~ Cx^(-a).

I want to use the binning method of parameter estimation given here...

http://www.ece.uc.edu/~annexste/Courses/cs690/Zipf,%20Power-law,%20Pareto%20-%20a%20ranking%20tutorial.htm

(bin the data with exponentially increasing bin widths within the data range).

But I can't work out how to re-bin my existing frequency data.

Sorry for the long question,
all the best
Dan.
(Ted Harding)
2004-Nov-21 16:47 UTC
[R] Analysis of pre-calculated frequency distribution?
On 21-Nov-04 Dan Bolser wrote:
>
> Sorry for the dumb question, but I can't work out how to do this.
>
> Quick version,
>
> How can I re-bin a given frequency distribution using new breaks
> without reference to the original data? Given distribution has
> integer valued bins.
>
>
> Long version,
>
> I am loading a frequency table into R from a file. The original
> data is very large, and it is a very simple process to get a
> frequency distribution from an SQL database, so in all this is
> a convenient method for me. Point being I don't start with 'raw' data.
>
> The data looks like this...
>
> > dat
>     COUNT FREQUENCY
> 1       1      5734
> 2       2      1625
> [...]
> 365  9442         1
> 366 12280         1
>
> [...]
>
> People typically quote the curve in the form of y ~ Cx^(-a).
> I want to use the binning method of parameter estimation given here...
>
> http://www.ece.uc.edu/~annexste/Courses/cs690/Zipf,%20Power-law,%20Pareto%20-%20a%20ranking%20tutorial.htm
>
> (bin the data with exponentially increasing bin widths within the data
> range).
>
> But I can't work out how to re-bin my existing frequency data.

Hi Dan,

Your starting point can be the fact that the number of cases with property i ("in class i") is

    COUNT_i * FREQUENCY_i

So if you construct a vector with these numbers in it you have in effect reconstructed the original data. I.e.

    N[i] <- COUNT[i]*FREQUENCY[i]

which can be done in one stroke with

    N <- COUNT*FREQUENCY

One way (and maybe others can suggest better) to bin these classes non-uniformly could be:

Say you have k "upper" breakpoints for your k bins, say BP, so that e.g. if BP[1] = 2 then there are N[1]+N[2] cases with class <= 2, and if BP[2] = 5 then there are N[3] + N[4] + N[5] cases with class > 2 and class <= 5, and so on. In your case BP[k] = 366.

Let

    csN <- cumsum(N)

Then (if I've not overlooked something)

    diff(c(0,csN[BP]))

will give you the counts in your new bins. E.g. (just to show it should work):

    > N<-rep(1,31)
    > BP<-c(1,3,7,15,31)
    > csN <- cumsum(N)
    > diff(c(0,csN[BP]))
    [1]  1  2  4  8 16
    > BP<-c(2,3,5,9,17,31)
    > diff(c(0,csN[BP]))
    [1]  2  1  2  4  8 14

I hope this matches the sort of thing you have in mind!

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861 [NB: New number!]
Date: 21-Nov-04 Time: 16:47:05
------------------------------ XFMail ------------------------------
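
One point that may be worth checking when applying the cumsum() recipe to a table like `dat`: `csN[BP]` picks elements by position, which coincides with the class value only while COUNT runs 1, 2, 3, ... without gaps. In the table shown, row 148 already holds COUNT 153. The following is a minimal sketch, not a definitive implementation, that bins by the COUNT values themselves using cut() and tapply(). The toy numbers and the names `upper`, `bin`, `binned`, `Nfull` and `ted` are invented for illustration, and the doubling breakpoints are only one possible choice of exponentially increasing bin widths.

```r
# Toy table in the same shape as `dat` (values invented for illustration,
# with gaps in COUNT as in the real table).
dat <- data.frame(
  COUNT     = c(1, 2, 3, 4, 5, 8, 13, 40),
  FREQUENCY = c(50, 20, 10, 6, 4, 2, 1, 1)
)

# Upper breakpoints that double each time: 1, 2, 4, 8, ...
# The last one is the first power of two >= max(COUNT).
upper <- 2^(0:ceiling(log2(max(dat$COUNT))))

# Assign each row to a bin by its COUNT value, not by its row number.
# cut() uses intervals (lower, upper], so COUNT == 2 lands in the bin ending at 2.
bin <- cut(dat$COUNT, breaks = c(0, upper))

# Cases per class, as in N <- COUNT*FREQUENCY above.
N <- dat$COUNT * dat$FREQUENCY

# Sum N within each bin; tapply() returns NA for empty bins, so set those to 0.
binned <- tapply(N, bin, sum)
binned[is.na(binned)] <- 0

# Cross-check against the cumsum()/diff() recipe: pad N with zeros so that
# position i really does correspond to class i, then cap the breakpoints
# at the largest class present.
Nfull <- numeric(max(dat$COUNT))
Nfull[dat$COUNT] <- N
BP  <- pmin(upper, max(dat$COUNT))
ted <- diff(c(0, cumsum(Nfull)[BP]))

print(binned)
print(ted)
```

With zero padding the two routes agree, so the recipe itself carries over; the padding vector just has to be as long as the largest COUNT (12280 for the table in the question). Summing `dat$FREQUENCY` instead of `N` in the tapply() call would give the number of classes per bin rather than the number of cases; which of the two is wanted depends on the estimation method being followed.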