thr3ads.net - R help - [R] simulated data using empirical distribution [Oct 2007]

Tom Sgouros

2007-Oct-10 14:13 UTC

[R] simulated data using empirical distribution

Hello all:

I'm sure this is a trivial request, but I'm still a beginner at this,
and haven't been able to find it.  I need to create simulated data based
on some empirical distributions of a single variable.  I've found R
functions to help me simulate data based on analytical distributions, or
to make simulations based on correlation matrices, but nothing so simple
as what I need.  What I have is twelve bins of data, and the population
in each bin.  The top bin is open-ended, and the whole distribution is
more or less poisson-ish.

I can think of a couple of ways to fake this ok, but is there a real R
way to do it?

Many thanks,

 -tom

-- 
 ------------------------
 tomfool at as220 dot org
 http://sgouros.com  
 http://whatcheer.net

Dieter Menne

2007-Oct-10 14:44 UTC

Tom Sgouros <tomfool <at> as220.org> writes:
> What I have is twelve bins of data, and the population
> in each bin.  The top bin is open-ended, and the whole distribution is
> more or less poisson-ish.
> 
> I can think of a couple of ways to fake this ok, but is there a real R
> way to do it?
> 

x <- rpois(100, 10)  # draw one sample so the plot and the table describe the same data
hist(x)              # to plot
table(x)             # to get the counts behind the histogram

I only have a problem with your "twelve bins of data". If you really mean the
"ish", you could truncate the above to the 12 that you love.

Dieter
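
One possible reading of "truncate the above to the 12" is to fold every value
above a cut-off into a single open-ended top bin. The sketch below assumes
that reading; the cut-off `top` and the variable names are illustrative and
not from the post.

```r
set.seed(5)
x <- rpois(100, 10)   # lambda = 10, as in the post above

# Assumed cut-off: bins for 0..10 plus one open-ended "11 or more" bin = 12 bins.
top <- 11
binned <- factor(pmin(x, top), levels = 0:top)

# Counts per bin; empty bins are kept because the factor levels are fixed.
counts <- table(binned)
counts
```

The last entry of `counts` then plays the role of the open-ended bin.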

Daniel Lakeland

2007-Oct-10 14:44 UTC

On Wed, 2007-10-10 at 10:13 -0400, Tom Sgouros wrote:
> Hello all:
> 
> I'm sure this is a trivial request, but I'm still a beginner at this,
> and haven't been able to find it.  I need to create simulated data based
> on some empirical distributions of a single variable.  I've found R
> functions to help me simulate data based on analytical distributions, or
> to make simulations based on correlation matrices, but nothing so simple
> as what I need.  What I have is twelve bins of data, and the population
> in each bin.  The top bin is open-ended, and the whole distribution is
> more or less poisson-ish.

If you have a bin with n items in it, you can generate n uniform random
numbers within the range of that bin to "reconstruct" your sample (I'm
assuming that you don't have a sample, just the histogram).

For the open ended bin, you could generate something like an
exponentially distributed random number with a shift to fit it into the
bin.
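
A minimal sketch of that reconstruction, assuming made-up bin edges and counts
(the values of `breaks`, `counts` and `open_rate` are hypothetical, not the
poster's data):

```r
set.seed(1)

# Hypothetical edges: five closed bins, then an open-ended bin starting at 100.
breaks <- c(0, 10, 25, 50, 75, 100)
counts <- c(30, 45, 60, 40, 20, 5)   # one count per bin; the last is the open bin

n_closed <- length(breaks) - 1
lower <- breaks[seq_len(n_closed)]
upper <- breaks[-1]

# Closed bins: draw counts[i] uniform values between each pair of edges.
closed <- unlist(mapply(function(n, lo, hi) runif(n, lo, hi),
                        counts[seq_len(n_closed)], lower, upper,
                        SIMPLIFY = FALSE))

# Open bin: an exponential shifted to start at the last edge.
# The rate is an assumed tuning value, not estimated from anything.
open_rate <- 1 / 25
open <- breaks[length(breaks)] + rexp(counts[length(counts)], rate = open_rate)

mysample <- c(closed, open)
```

Re-binning `mysample` with the same edges should return the original counts,
which is a quick check that the reconstruction matches the histogram.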

Now you'll have a sample which has a very similar distribution to your
histogram. You can generate bootstrap samples by simply resampling this
sample using replacement. You can also smooth these bootstrap samples by
sampling with replacement and then adding a small amount of Gaussian random
noise to each draw, such as

sample(mysample, size=100, replace=TRUE) + rnorm(100, 0, 0.01)

You may want to make the standard deviation of the normal smoothing
proportional to the size of your bins (perhaps 1/2 or 1/4 the width of
the bin).
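
As a rough sketch of noise scaled to the bin width: the edges, the stand-in
sample and the 1/4 fraction below are all assumptions chosen for illustration.

```r
set.seed(2)

breaks <- c(0, 10, 25, 50, 75, 100)   # hypothetical bin edges
mysample <- runif(200, 0, 100)        # stand-in for the reconstructed sample

# Plain bootstrap draw.
boot <- sample(mysample, size = 100, replace = TRUE)

# Look up which bin each draw falls in, and that bin's width.
bin <- findInterval(boot, breaks, rightmost.closed = TRUE)
width <- diff(breaks)[bin]

# Gaussian noise with sd equal to a quarter of the bin width (assumed fraction).
smoothed <- boot + rnorm(length(boot), mean = 0, sd = width / 4)
```

Note that noise this large can push values across bin edges, or below zero
near the lowest edge, so the smoothed draws may need clipping depending on
what the variable represents.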

Also, once you have a sample, you can fit a Poisson distribution to your
sample and then use the fitted parameter to generate Poisson random
numbers which may approximate your distribution well.
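
For a Poisson distribution the maximum-likelihood estimate of the rate is just
the sample mean, so the fit can be sketched in two lines. The stand-in sample
below is hypothetical count data; a Poisson fit only makes sense if the
variable really is a non-negative count.

```r
set.seed(3)

mysample <- rpois(500, lambda = 4)   # stand-in for a reconstructed count sample

# Maximum-likelihood estimate of the Poisson rate is the sample mean.
lambda_hat <- mean(mysample)

# Simulate new data from the fitted distribution.
simulated <- rpois(1000, lambda = lambda_hat)
```

Comparing `table(simulated)` against the original bin counts gives a quick
sense of whether the Poisson shape is a reasonable approximation.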

Tom Sgouros

2007-Oct-11 11:30 UTC

Hello all:

Many thanks to the people who have responded to my question, on and
off-list.  My problem isn't completely solved, though, and perhaps you
can help again.

The problem, again, is that I have what is essentially a histogram, but
not the underlying data, and I want to simulate data that would have
created that histogram.  That is, I have counts for the number of data
points in a dozen bins.  The bins are not of uniform size.  (It's income
data, reported as incomes from 0-10k, 10k-25k, 25k-50k, and so on.)

The best suggestion I had yesterday was to simulate the data with
uniform distributions in each bin, and an exponential one on the
rightmost bin, and I did that and superficially it looks good.
Unfortunately, now that I am trying to calibrate the model, I have
discovered a high bias.  The way the bins are chosen, I would expect
that 9 out of 12 bins have a downward slope, meaning that approximating
them with a flat top puts too much mass near the high border of the bin,
and I currently suspect that this is at least part of the bias.

Is there a way to ask for a not-quite uniform distribution of random
data?  I imagine a density function with a linear, but not flat, top.  I
admit that the standard selection of distributions in R is more than I
am familiar with, but I can't find one that does what I think I need.

Any advice (R advice or statistics advice) is welcome.  Thanks again,

 -tom

-- 
 ------------------------
 tomfool at as220 dot org
 http://sgouros.com  
 http://whatcheer.net
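
The "linear, but not flat, top" asked about above can be sampled by inverting
the CDF of a density that changes linearly across the bin. Base R has no
ready-made function for this as far as I know, so the helper `rlinear` below
is a hypothetical name, and the relative end heights `h_lo` and `h_hi` are
inputs you would have to choose yourself (for example from the densities of
the neighbouring bins).

```r
# Draw n values on [lo, hi] from a density that changes linearly from
# relative height h_lo at lo to h_hi at hi.
# Both heights must be non-negative and not both zero.
rlinear <- function(n, lo, hi, h_lo, h_hi) {
  u <- runif(n)
  if (isTRUE(all.equal(h_lo, h_hi))) {
    t <- u   # flat top: plain uniform
  } else {
    # Inverse CDF of a density proportional to h_lo + (h_hi - h_lo) * t on [0, 1].
    t <- (-h_lo + sqrt(h_lo^2 + u * (h_hi^2 - h_lo^2))) / (h_hi - h_lo)
  }
  lo + t * (hi - lo)
}

set.seed(4)
# A bin from 25 to 50 whose density is three times higher at the low edge.
x <- rlinear(10000, lo = 25, hi = 50, h_lo = 3, h_hi = 1)
mean(x)
```

With a downward slope the mean of the draws falls below the bin midpoint,
which works against the upward bias that a flat top introduces.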
