xin wei
2010-Jul-26 18:36 UTC
[R] how to generate a random data from a empirical distribition
Hi, this is more a statistical question than an R question, but I do want to know how to implement this in R. I have 10,000 data points. Is there any way to generate an empirical probability distribution from them? (The problem is that I do not know what distribution the data follow: normal, beta?) My ultimate goal is to generate an additional 20,000 data points from this empirical distribution created from the existing 10,000 data points. Thank you all in advance.

-- View this message in context: http://r.789695.n4.nabble.com/how-to-generate-a-random-data-from-a-empirical-distribition-tp2302716p2302716.html
Sent from the R help mailing list archive at Nabble.com.
David Winsemius
2010-Jul-26 22:56 UTC
[R] how to generate a random data from a empirical distribition
On Jul 26, 2010, at 2:36 PM, xin wei wrote:
> I have 10,000 data points. Is there any way to generate an empirical
> probability distribution from it (the problem is that I do not know
> what exactly this distribution follows, normal, beta?).

?ecdf

--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
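A minimal sketch of what the ?ecdf pointer leads to, assuming the data sit in a numeric vector. The name olddata and the gamma-distributed stand-in values are hypothetical, used only so the example runs. The ECDF itself only evaluates probabilities; to draw new values from it, one option is to push uniform random numbers through its inverse, which quantile() with type = 1 provides.

```r
# Hypothetical stand-in for the 10,000 observed values.
set.seed(1)
olddata <- rgamma(10000, shape = 2, rate = 0.5)

# ecdf() returns a step function: Fn(q) is the share of observations <= q.
Fn <- ecdf(olddata)
Fn(median(olddata))  # share of the data at or below the median

# Inverse-transform sampling: quantile(type = 1) is the inverse of the
# empirical CDF, so every draw is one of the observed values.
newdata <- unname(quantile(olddata, probs = runif(20000), type = 1))
```

Drawing this way should be equivalent in distribution to sampling the original values with replacement; it mainly makes the link between the ECDF and the generated values explicit.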
Nordlund, Dan (DSHS/RDA)
2010-Jul-26 23:18 UTC
[R] how to generate a random data from a empirical distribition
Without knowing more than what you have stated in your email, I can only suggest that you look at ?sample. You may be able to do something as simple as

    newdata <- olddata[sample(1:10000,size=20000,replace=TRUE)]

If you need more help, you need to tell us more about your data and what you are trying to do.

Hope this is helpful,

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
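The one-liner suggested above can be written so that it does not hard-code the sample size. This is only a sketch: olddata is filled with hypothetical normal values so the code runs on its own.

```r
# Hypothetical stand-in for the 10,000 observed values.
set.seed(42)
olddata <- rnorm(10000, mean = 50, sd = 10)

# Draw 20,000 row indices with replacement, then pick those observations.
idx <- sample.int(length(olddata), size = 20000, replace = TRUE)
newdata <- olddata[idx]

# Every new value is one of the old ones, so repeated values are expected.
length(unique(newdata))
```

Calling sample(olddata, 20000, replace = TRUE) directly gives the same kind of result; the index form avoids the documented quirk where sample() treats a single number x >= 1 as the range 1:x.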
Dennis Murphy
2010-Jul-27 08:37 UTC
[R] how to generate a random data from a empirical distribition
Hi:

The problem, it seems to me, is the leap of faith you're taking that the empirical distribution of your manifest sample will serve as a useful data-generating mechanism for the 20,000 future observations you want to take. I would think that, if you intend to take a sample of 20,000 from ANY distribution, you would want some confidence in the specification of said distribution.

Even if you don't know exactly what type of population distribution you're dealing with, there are ways to narrow down the set of possibilities. What is the domain/support of the distribution? For example, the Normal is defined on all of R (as in the real numbers, not our favorite statistical programming language), whereas the lognormal, Gamma and Weibull distributions, among others, are defined on the nonnegative reals. The beta distribution is defined on [0, 1]. Therefore, knowledge of the domain is useful in and of itself. Is it plausible that the distribution is symmetric, or should it have a distinct left or right skew? (Similar comments apply to discrete distributions.) Is censoring or truncation a relevant concern? If there is a random process that well describes how the data you observe are generated, that will certainly narrow down the class of potential data-generating mechanisms/distributions.
Once you've narrowed down the class of possible distributions as much as possible, you could look into the fitdistr() function in MASS or the fitdistrplus package on CRAN to test out which candidates seem plausible wrt your existing sample and which are not. You are not likely to be able to narrow it down to one family of distributions, but you should have a much better idea about the characteristics of the distribution that gave rise to your sample of 10,000 (assuming, of course, that it is a *random* sample) after going through this exercise, which you can apply to the generation of the next 20,000 observations. OTOH, if your existing 10,000 observations were not produced by some random process, all bets are off.

HTH,
Dennis
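A rough sketch of the fitdistr() route described above, under the assumption that the data are strictly positive so that the lognormal, gamma and Weibull families are all candidates. The vector olddata and its simulated values are hypothetical, and comparing the fits by AIC is just one possible screening choice, not something the thread prescribes.

```r
library(MASS)  # provides fitdistr()

# Hypothetical positive-valued stand-in for the 10,000 observed values.
set.seed(7)
olddata <- rgamma(10000, shape = 3, rate = 2)

# Maximum-likelihood fits for three families supported on positive values.
fits <- list(
  lognormal = fitdistr(olddata, "lognormal"),
  gamma     = fitdistr(olddata, "gamma"),
  weibull   = fitdistr(olddata, "weibull")
)

# Lower AIC suggests a better trade-off between fit and complexity.
aic <- sapply(fits, AIC)
aic

# Generate the 20,000 new observations from one fitted candidate (gamma here).
est <- fits$gamma$estimate
newdata <- rgamma(20000, shape = est[["shape"]], rate = est[["rate"]])
```

The optimizer may emit warnings while it explores invalid parameter values. A quantile-quantile plot of olddata against each fitted family is a useful visual complement to the AIC table.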
Greg Snow
2010-Jul-27 22:39 UTC
[R] how to generate a random data from a empirical distribition
Another option for fitting a smooth distribution to data (and generating future observations from the smooth distribution) is to use the logspline package.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
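A short sketch of the logspline approach, assuming the logspline package from CRAN is installed. The vector olddata is a hypothetical stand-in, and setting lbound = 0 reflects an assumption that the data cannot be negative; drop it if the support is unbounded.

```r
library(logspline)  # CRAN package; install.packages("logspline") if missing

# Hypothetical nonnegative stand-in for the 10,000 observed values.
set.seed(3)
olddata <- rgamma(10000, shape = 2, rate = 1)

# Fit a smooth density; lbound = 0 encodes the assumed lower limit.
fit <- logspline(olddata, lbound = 0)

# Draw 20,000 values from the fitted smooth distribution.
newdata <- rlogspline(20000, fit)

# The fitted CDF is smooth rather than a step function.
plogspline(median(olddata), fit)
```

Unlike resampling, the generated values are generally not copies of the observed ones, so ties become rare.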
Frank Harrell
2010-Jul-27 22:54 UTC
[R] how to generate a random data from a empirical distribition
Easiest thing is to sample with replacement from the original data. This is the idea behind the bootstrap, which is sampling from the empirical CDF.

Frank E Harrell Jr
Professor and Chairman, School of Medicine
Department of Biostatistics, Vanderbilt University
Greg Snow
2010-Jul-27 23:36 UTC
[R] how to generate a random data from a empirical distribition
If they want to generate directly from the empirical distribution, then sampling with replacement is the best choice (others had already suggested that). But the reference in the original post to the normal and beta distributions suggested to me that the original poster may have wanted a smooth approximation to the empirical distribution rather than the step function (but not locked to a specific distribution).

The logspline package has functions for doing things like this. It has the advantage that it can give a smooth (non-step) plot of the estimated cdf as well as generate points that are based on the observed data but that can fall outside the original range of the data and have fewer ties. Whether these "advantages" make any difference depends on what they want to do with the observations (for many applications the difference is probably negligible and using sample is the simplest/best). But there may be some uses for which these "advantages" are beneficial. (Using sample and then adding a small random "error" to each value is another option, but I like the logspline option better.)
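The "sample, then add a small random error" option mentioned in this message can be sketched as follows. The vector olddata is hypothetical, and both the Gaussian noise and the bw.nrd0() bandwidth are assumptions made for illustration; the message does not say how large the error should be.

```r
# Hypothetical stand-in for the 10,000 observed values.
set.seed(11)
olddata <- rnorm(10000, mean = 100, sd = 15)
n_new <- 20000

# Step 1: plain resampling with replacement.
resampled <- sample(olddata, size = n_new, replace = TRUE)

# Step 2: add Gaussian noise whose standard deviation is a kernel bandwidth.
# bw.nrd0() is the rule-of-thumb bandwidth that density() uses by default.
h <- bw.nrd0(olddata)
newdata <- resampled + rnorm(n_new, mean = 0, sd = h)
```

With these choices the result amounts to drawing from a Gaussian kernel density estimate of the data, so values can fall slightly outside the observed range and exact ties largely disappear.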
weix1
2010-Jul-28 01:29 UTC
[R] how to generate a random data from a empirical distribition
Dennis: points well taken. It seems to be important to investigate the nature of the distribution. I might be too naive to assume that an "empirical probability distribution" can simply be calculated from a cloud of data points.
xin wei
2010-Jul-28 01:43 UTC
[R] how to generate a random data from a empirical distribition
Hi, Dennis: points well taken. It seems to be important to investigate the nature of the distribution. I may be too naive to assume that an "empirical probability distribution" would be computed from a cloud of data points.
xin wei
2010-Jul-28 01:47 UTC
[R] how to generate a random data from a empirical distribition
Good point. It seems to be important to investigate the nature of the distribution. I might be too naive to assume that an "empirical probability distribution" would be automatically generated from a cloud of data points.
xin wei
2010-Jul-28 02:06 UTC
[R] how to generate a random data from a empirical distribition
This is very insightful. It sounds exactly like what I want to do. Thanks, Frank.
xin wei
2010-Jul-28 16:44 UTC
[R] how to generate a random data from a empirical distribition
Hi, Frank: how can we make sure the randomly sampled data follow the same distribution as the original dataset? I assume each data point has the same probability of being selected in a simple random sampling scheme.

Thanks
Frank Harrell
2010-Jul-28 19:42 UTC
[R] how to generate a random data from a empirical distribition
This is true by definition. Read about the bootstrap, which may give you some good background information.
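For anyone who wants to see the "true by definition" point numerically, here is a small check, with hypothetical exponential data standing in for the original sample. Each draw picks any observed value with probability 1/n, so the resampled values have the original ECDF as their population distribution, and the two ECDFs should differ only by sampling noise.

```r
# Hypothetical stand-in for the 10,000 observed values.
set.seed(5)
olddata <- rexp(10000, rate = 0.2)
newdata <- sample(olddata, size = 20000, replace = TRUE)

# Largest vertical gap between the two ECDFs, evaluated at the observed values.
grid <- sort(unique(olddata))
max_gap <- max(abs(ecdf(olddata)(grid) - ecdf(newdata)(grid)))
max_gap

# Selected quantiles side by side.
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
rbind(old = quantile(olddata, probs), new = quantile(newdata, probs))
```

The gap shrinks as the number of resampled values grows. Note that this says nothing about how well the original 10,000 points represent the underlying population, which is the concern raised earlier in the thread.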