Hello all:

I'm sure this is a trivial request, but I'm still a beginner at this, and haven't been able to find it. I need to create simulated data based on some empirical distributions of a single variable. I've found R functions to help me simulate data based on analytical distributions, or to make simulations based on correlation matrices, but nothing so simple as what I need.

What I have is twelve bins of data, and the population in each bin. The top bin is open-ended, and the whole distribution is more or less Poisson-ish.

I can think of a couple of ways to fake this OK, but is there a real R way to do it?

Many thanks,

-tom

--
tomfool at as220 dot org
http://sgouros.com
http://whatcheer.net
Tom Sgouros <tomfool <at> as220.org> writes:

> What I have is twelve bins of data, and the population
> in each bin. The top bin is open-ended, and the whole distribution is
> more or less poisson-ish.
>
> I can think of a couple of ways to fake this ok, but is there a real R
> way to do it?

    hist(rpois(100, 10))   # to plot
    table(rpois(100, 10))  # to get the histogram

The only problem I have is with your "twelve bins of data". If you really mean the "ish", you could truncate the above to the 12 that you love.

Dieter
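One possible reading of "truncate the above to the 12 that you love" is to fold every value at or above the eleventh count into a single open-ended top bin. The sketch below assumes that reading; the rate `lambda = 4`, the sample size, and the seed are made-up values chosen only for illustration.

```r
set.seed(1)                    # arbitrary seed, only so the run is repeatable
x <- rpois(1000, lambda = 4)   # lambda = 4 is an assumed, illustrative rate

# Fold everything at or above 11 into one open-ended top bin,
# which leaves exactly twelve bins: 0, 1, ..., 10, and "11+".
binned <- pmin(x, 11)
counts <- table(factor(binned, levels = 0:11, labels = c(0:10, "11+")))
counts
```

Using `factor()` with explicit levels keeps empty bins in the table, so the result always has twelve entries.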
On Wed, 2007-10-10 at 10:13 -0400, Tom Sgouros wrote:

> Hello all:
>
> I'm sure this is a trivial request, but I'm still a beginner at this,
> and haven't been able to find it. I need to create simulated data based
> on some empirical distributions of a single variable. I've found R
> functions to help me simulate data based on analytical distributions, or
> to make simulations based on correlation matrices, but nothing so simple
> as what I need. What I have is twelve bins of data, and the population
> in each bin. The top bin is open-ended, and the whole distribution is
> more or less poisson-ish.

If you have a bin with n items in it, you can generate n uniform random numbers within the range of that bin to "reconstruct" your sample (I'm assuming that you don't have a sample, just the histogram). For the open-ended bin, you could generate something like an exponentially distributed random number, shifted so that it starts at the lower edge of that bin.

Now you'll have a sample whose distribution is very similar to your histogram. You can generate bootstrap samples by resampling this sample with replacement. You can also smooth these bootstrap samples by sampling with replacement and then adding a small Gaussian random noise to each resampled value, such as

    sample(mysample, size = 100, replace = TRUE) + rnorm(100, 0, 0.01)

You may want to make the standard deviation of the normal smoothing proportional to the size of your bins (perhaps 1/2 or 1/4 of the bin width).

Also, once you have a sample, you can fit a Poisson distribution to it and then use the fitted parameter to generate Poisson random numbers, which may approximate your distribution well.
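The steps above could be sketched roughly as follows. Everything numeric here is hypothetical: the bin `edges`, the `counts`, the exponential rate `top_rate` for the open-ended bin, and the choice of a quarter of the mean bin width as the smoothing standard deviation are assumptions made for the example, not values from the thread.

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable

# Hypothetical histogram: finite bin edges plus one count per bin.
# The last count belongs to the open-ended bin above the final edge.
edges  <- c(0, 10, 25, 50, 75, 100)
counts <- c(120, 200, 260, 150, 90, 40)

n_closed <- length(edges) - 1
lower <- edges[-length(edges)]
upper <- edges[-1]

# Step 1: n uniform draws inside each closed bin.
closed <- unlist(Map(function(n, lo, hi) runif(n, lo, hi),
                     counts[seq_len(n_closed)], lower, upper))

# Step 2: shifted exponential draws for the open-ended top bin.
top_rate <- 1 / 25   # assumed rate; in practice this needs calibrating
open <- max(edges) + rexp(counts[length(counts)], rate = top_rate)

mysample <- c(closed, open)

# Step 3: smoothed bootstrap, with noise scaled to the bin width.
noise_sd <- mean(upper - lower) / 4
boot <- sample(mysample, size = 100, replace = TRUE) +
  rnorm(100, mean = 0, sd = noise_sd)

# Step 4: crude Poisson fit. For count data the maximum-likelihood
# estimate of the Poisson rate is the sample mean.
lambda_hat <- mean(mysample)
sim <- rpois(100, lambda_hat)
```

Re-binning `mysample` with `table(cut(mysample, c(edges, Inf), right = FALSE))` should reproduce `counts`, which is a quick check that the reconstruction matches the histogram. The Poisson step only makes sense if the variable really behaves like a count.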
Hello all:

Many thanks to the people who have responded to my question, on and off-list. My problem isn't completely solved, though, and perhaps you can help again.

The problem, again, is that I have what is essentially a histogram, but not the underlying data, and I want to simulate data that would have created that histogram. That is, I have counts for the number of data points in a dozen bins. The bins are not of uniform size. (It's income data, reported as incomes from 0-10k, 10k-25k, 25k-50k, and so on.)

The best suggestion I had yesterday was to simulate the data with uniform distributions in each bin, and an exponential one on the rightmost bin. I did that, and superficially it looks good. Unfortunately, now that I am trying to calibrate the model, I have discovered a high bias. The way the bins are chosen, I would expect 9 out of 12 bins to have a downward slope, meaning that approximating them with a flat top puts too many points near the upper border of the bin. I currently suspect that this is at least part of the bias.

Is there a way to ask for a not-quite-uniform distribution of random data? I imagine a density function with a linear, but not flat, top. I admit that the standard selection of distributions in R is more than I am familiar with, but I can't find one that does what I think I need.

Any advice (R advice or statistics advice) is welcome.

Thanks again,

-tom
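As far as I know, base R has no ready-made generator for a density with a sloped linear top, but one can be written by inverting the CDF. The sketch below is one way to do it, not a settled answer: `rlinear` is a hypothetical helper name, and the relative heights `h_a` and `h_b` at the two bin edges are assumed inputs that would have to be chosen somehow (for example from the densities of neighbouring bins).

```r
# Draw n values on [a, b] from a density that is linear across the bin,
# with relative height h_a at the lower edge and h_b at the upper edge.
# Only the ratio of the heights matters; they are normalised internally.
rlinear <- function(n, a, b, h_a, h_b) {
  stopifnot(a < b, h_a >= 0, h_b >= 0, h_a + h_b > 0)
  u <- runif(n)
  if (isTRUE(all.equal(h_a, h_b))) {
    pos <- u   # flat top: plain uniform
  } else {
    # Solve F(pos) = u, where on the unit interval
    # F(pos) = (2 * h_a * pos + (h_b - h_a) * pos^2) / (h_a + h_b)
    pos <- (sqrt(h_a^2 + u * (h_b^2 - h_a^2)) - h_a) / (h_b - h_a)
  }
  a + (b - a) * pos
}

set.seed(7)  # arbitrary seed, only so the run is repeatable

# Example: the 25k-50k bin, with the density assumed twice as high
# at the lower edge as at the upper edge.
x <- rlinear(10000, a = 25, b = 50, h_a = 2, h_b = 1)
mean(x)  # falls below the bin midpoint of 37.5 because the density slopes down
```

With a downward slope the simulated values shift toward the lower edge of each bin, which is the direction needed to reduce the high bias described above. How much of the bias this removes depends on how the heights are chosen.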