Is ?cut what you need?

Sean

On 6/16/04 6:52 AM, "Dan Bolser" <dmb at mrc-dunn.cam.ac.uk> wrote:
> [snip]
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
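A minimal sketch of how the ?cut suggestion could be applied to the two counting questions in the original post. The seed, the power-of-ten break points, and the small integer vector `k` are assumptions made for illustration; none of them come from the thread.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
x3 <- runif(10000)^3              # toy data as in the original post

# Exponentially growing bins: break points at powers of 10 (an assumed choice).
lo     <- floor(log10(min(x3)))   # lowest decade that contains data
breaks <- 10^seq(lo, 0)           # 10^lo, ..., 0.1, 1
bin    <- cut(x3, breaks = breaks, include.lowest = TRUE)

# table() counts how many values fall into each class.
counts <- table(bin)
print(counts)

# The same function counts integer-valued data directly.
k <- c(1, 1, 2, 1, 3, 2, 1)       # made-up integer data
k_counts <- table(k)
print(k_counts)
```

cut() returns a factor, so table() reports every bin, including empty ones.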
First, thanks to everyone who helped me get to grips with R in (x)emacs
(I get confused easily). Special thanks to Stephen Eglen for continued
support.

My question is about non-linear binning, or density functions over
distributions governed by a power law ...

    y ~ mu*x**lambda    # In one of its forms
                        # (can't find Pareto in the online help)

Looking at the following should show my problem....

    x3 <- runif(10000)**3    # Probably a better (correct) way to do this

    plot(density(x3, cut=0, bw=0.1))
    plot(density(x3, cut=0, bw=0.01))
    plot(density(x3, cut=0, bw=0.001))

    plot(density(x3, cut=0, bw=0.1),   log='xy')
    plot(density(x3, cut=0, bw=0.01),  log='xy')
    plot(density(x3, cut=0, bw=0.001), log='xy')

The upper three plots show that the bw has a big effect on the appearance
of the graph by rescaling based on the initial density at low values of x,
which is very high.

The lower plots show (I think) an error in the use of linear bins to view
a non-linear trend. I would expect this curve to be linear on log-log
scales (from experience), and you can see the expected behavior in the
tails of these plots.

If you play with drawing these curves on top of each other they look OK
apart from at the beginning. However, changing the bandwidth to 0.0001 has
a radical effect on these plots, and they begin to show a different trend
(they look like they are being governed by a different power).

Hmmm....

    x3log <- -log(x3)

    plot(density(x3log, cut=0, bw=0.5), log='y', col=1)

    # lines() adds to the existing log-scaled plot; it takes no 'log' argument
    lines(density(x3log, cut=0, bw=0.2),  col=2)
    lines(density(x3log, cut=0, bw=0.1),  col=3)
    lines(density(x3log, cut=0, bw=0.01), col=4)

Sorry...

'Real' data of this form is usually discrete, with the value of 1 being
the most frequent (minimum) event, and higher values occurring less
frequently according to a power (power-law). This data can be easily
grouped into discrete bins, and frequency plotted on log scales. The
continuous data generated above requires some form of density estimation
or rescaling into discrete values (make the smallest value equal to 1 and
round everything else to an integer).

I see the aggregate function, but which function lets me simply count the
number of values in a class (integer bin)?

The analysis of even the discretized data is made more accurate by the use
of exponentially growing bins. This way you don't need to plot the data on
log scales, and the increasing variance associated with lower probability
events is handled by the increasing bin size (giving good accuracy of
power fitting). How can I easily (ignorantly) implement exponentially
increasing bin sizes?

Thanks for any feedback,

Dan.
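For the toy data in the post, the expected shape can be worked out by hand. This is a short derivation, assuming only that U is uniform on (0,1) and X = U^3, as in the `runif(10000)**3` line:

```latex
\begin{align*}
U &\sim \mathrm{Uniform}(0,1), \qquad X = U^{3},\\
F_X(x) &= P\!\left(U \le x^{1/3}\right) = x^{1/3}, \qquad 0 < x < 1,\\
f_X(x) &= \tfrac{1}{3}\,x^{-2/3}
  \;\Longrightarrow\; \log f_X(x) = -\log 3 - \tfrac{2}{3}\,\log x,\\
Y &= -\log X = -3\log U, \qquad f_Y(y) = \tfrac{1}{3}\,e^{-y/3}, \quad y > 0 .
\end{align*}
```

So the true density is a straight line with slope -2/3 on log-log axes and is unbounded as x approaches 0. A kernel estimate with a fixed bandwidth probably cannot follow that spike for x much smaller than bw, which would be consistent with the plots looking linear only in the tails.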
You can use hist(x, br, plot = FALSE)$counts.

-roger

Dan Bolser wrote:
> [snip]

-- 
Roger D. Peng
http://www.biostat.jhsph.edu/~rpeng
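A rough sketch of the hist() suggestion with exponentially growing bins. The seed, the choice of four bins per decade, and the use of geometric bin midpoints are assumptions for illustration, not something stated in the thread.

```r
set.seed(1)                            # arbitrary seed, only for reproducibility
x3 <- runif(10000)^3                   # toy data as in the original post

# Break points equally spaced in log10(x): 4 bins per decade (assumed choice).
lo <- floor(log10(min(x3)))
br <- 10^seq(lo, 0, by = 0.25)

h <- hist(x3, breaks = br, plot = FALSE)
h$counts                               # raw number of values per bin

# h$density is count / (n * bin width), so unequal widths are accounted for.
gmid <- sqrt(br[-1] * br[-length(br)]) # geometric midpoint of each bin
ok   <- h$counts > 0                   # drop empty bins before taking logs
plot(gmid[ok], h$density[ok], log = "xy", xlab = "x", ylab = "density")
```

Dividing by the bin width matters here: the raw counts in h$counts grow with the bin size, so only h$density is comparable across bins.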
Why not try to avoid binning (and density plots) altogether? An alternative
could be a qqplot (as a log-log plot), e.g.

    plot(ppoints(length(x4)), x4[order(x4)], log="xy")
    # log10 in the fit so the intercept matches the log10 plot axes
    abline(lm(log10(x4[order(x4)]) ~ log10(ppoints(length(x4)))), col="red")

If the assumptions of a uniform distribution and a power transformation
y=a*x**b are true, the slope coefficient from lm estimates the exponent b.

Herwig

-- 
Dr. Herwig Meschke
Wissenschaftliche Beratung
Hagsbucher Weg 27
D-89150 Laichingen
phone +49 7333 210 417 / fax +49 7333 210 418
email HerwigMeschke at t-online.de
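A self-contained sketch of the quantile approach on simulated data, to check that the slope recovers a known exponent. The seed, the exponent `b = 3`, and the variable names are assumptions chosen for illustration.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
b  <- 3                          # exponent assumed for the simulation
x  <- runif(10000)^b             # uniform data under a power transformation

p  <- ppoints(length(x))         # plotting positions in (0, 1)
xs <- sort(x)                    # empirical quantiles

# Regress log10 quantiles on log10 plotting positions; the slope estimates b.
fit   <- lm(log10(xs) ~ log10(p))
slope <- unname(coef(fit)[2])
print(slope)

plot(p, xs, log = "xy", xlab = "plotting position", ylab = "sorted x")
abline(fit, col = "red")         # drawn in log10 coordinates on log axes
```

The estimate should land near the simulated exponent, though the smallest order statistics are noisy and have high leverage in log space, so some deviation is expected.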