Hi,

This is OT, but I need it for my simulation in R.

I have a special case of sampling with replacement: instead of sampling once and replacing the item immediately, I sample n times and then replace all n items.

So:

N entities
x samples with replacement
each sample consists of n sub-samples WITHOUT replacement, which are all replaced before the next sample is drawn

My question is: which distribution can I use to describe how often each entity of the N has been sampled?

Thanks for your help,

Rainer

-- 
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)
Centre of Excellence for Invasion Biology
Stellenbosch University, South Africa
G'day Rainer,

On Sat, 25 Sep 2010 16:24:17 +0200 Rainer M Krug <r.m.krug at gmail.com> wrote:

> This is OT, but I need it for my simulation in R.
>
> I have a special case for sampling with replacement: instead of
> sampling once and replacing it immediately, I sample n times, and
> then replace all n items.
>
> So:
>
> N entities
> x samples with replacement
> each sample consists of n sub-samples WITHOUT replacement, which are
> all replaced before the next sample is drawn
>
> My question is: which distribution can I use to describe how often
> each entity of the N has been sampled?

Surely, unless I am missing something, any given entity would have (marginally) a binomial distribution:

A sample of size n either contains the entity or it does not. The probability that a sample contains the entity is a function of N and n alone (by symmetry it is p = n/N, since every entity is equally likely to be among the n drawn).

x such samples are drawn (with everything replaced in between), so the number of times that an entity has been sampled is the number of samples in which it appears. This is given by the binomial distribution with parameters x and p, where p is the probability determined in the previous paragraph.

I guess the fun starts if you try to determine the joint distribution of two (or more) entities...

HTH.

Cheers,

Berwin

-- 
Berwin A Turlach
School of Maths and Stats (M019)
The University of Western Australia
http://www.maths.uwa.edu.au/~berwin
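[Editorial sketch] The marginal argument above can be checked by simulation. The following is only a minimal sketch, not part of the original message: the values of N, n and x, the number of replicates n_rep, and all variable names are illustrative assumptions. It draws x samples of size n without replacement, counts how often entity 1 turns up, and compares the empirical distribution of that count with Binomial(x, n/N).

```r
# Sketch: compare one entity's count with Binomial(x, n/N).
# N, n, x and n_rep are illustrative values, not taken from the thread.
set.seed(1)
N <- 20        # number of entities
n <- 5         # sub-samples per sample, drawn WITHOUT replacement
x <- 30        # number of samples (all items are replaced in between)
n_rep <- 5000  # number of simulation replicates

# For each replicate: in how many of the x samples does entity 1 appear?
count_entity1 <- replicate(n_rep, {
  sum(replicate(x, 1 %in% sample.int(N, n)))
})

p <- n / N  # probability that a given entity is in one sample

# Empirical and theoretical probability mass functions on 0..x
emp  <- as.vector(table(factor(count_entity1, levels = 0:x))) / n_rep
theo <- dbinom(0:x, size = x, prob = p)

# Largest pointwise discrepancy; should be small if the argument holds
max_abs_diff <- max(abs(emp - theo))
```

Inspecting `max_abs_diff`, or plotting `emp` against `theo`, gives a quick visual check; the agreement is only up to simulation noise.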
On 09/25/2010 04:24 PM, Rainer M Krug wrote:

> Hi
>
> This is OT, but I need it for my simulation in R.
>
> I have a special case for sampling with replacement: instead of sampling
> once and replacing it immediately, I sample n times, and then replace all n
> items.
>
> So:
>
> N entities
> x samples with replacement
> each sample consists of n sub-samples WITHOUT replacement, which are all
> replaced before the next sample is drawn
>
> My question is: which distribution can I use to describe how often each
> entity of the N has been sampled?
>
> Thanks for your help,
>
> Rainer

How did you know I was in the middle of preparing lectures on the variance of the hypergeometric distribution and such? ;-)

If you look at a single item, the answer is of course that you have a binomial with size=x and prob=n/N. The problem is that these binomials are correlated between items.

If you can make do with a second-order approximation, then the covariance between the indicators for two items being selected is easily found from the symmetry and the fact that if you sum all N indicators you get the constant n. I.e., with p = n/N, the variance of each indicator is p(1-p) and the covariance between two of them is -p(1-p)/(N-1). For sums over repeated samples, just multiply everything by the number x of samples.

If you intend to just count the frequency of a particular feature in each of your n-samples, i.e., you have x replications of a hypergeometric experiment, then you can do somewhat better by computing the explicit convolution of x hypergeometrics (convolve(x, rev(y), type="o") and Reduce() are your friends; in that call x and y are the two probability vectors being convolved, not the number of samples). I'm not sure this is actually worth the trouble, but it should be doable for decent-sized N and x.

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
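[Editorial sketch] Both suggestions above can be tried out in a few lines of R. This is a hedged sketch rather than Peter's own code: the values of N, n, x and n_rep, the number K of entities assumed to carry the "feature", and the helper names one_rep and conv2 are all illustrative assumptions. The first part compares simulated variances and covariances of the per-entity counts with x*p*(1-p) and -x*p*(1-p)/(N-1); the second builds the distribution of the total feature count as an x-fold convolution of hypergeometric probabilities, using convolve() and Reduce().

```r
# Illustrative values, not taken from the thread
set.seed(2)
N <- 10; n <- 3; x <- 8; n_rep <- 4000
p <- n / N

## Part 1: second-order moments of the per-entity counts

# One replicate: how often each of the N entities is drawn over x samples
one_rep <- function() {
  drawn <- unlist(replicate(x, sample.int(N, n), simplify = FALSE))
  tabulate(drawn, nbins = N)
}
counts <- t(replicate(n_rep, one_rep()))  # n_rep x N matrix of counts

theo_var <- x * p * (1 - p)               # variance of one entity's count
theo_cov <- -x * p * (1 - p) / (N - 1)    # covariance between two entities

emp     <- cov(counts)
emp_var <- mean(diag(emp))                # average empirical variance
emp_cov <- mean(emp[upper.tri(emp)])      # average empirical covariance

## Part 2: total feature count as a convolution of x hypergeometrics

K <- 4  # hypothetical number of entities that carry the feature
h <- dhyper(0:n, m = K, n = N - K, k = n)  # pmf of the count in one sample

# Ordinary convolution of two probability vectors
conv2 <- function(a, b) convolve(a, rev(b), type = "o")

total_pmf <- Reduce(conv2, rep(list(h), x))  # pmf on 0, 1, ..., x * n
total_pmf <- pmax(total_pmf, 0)              # clip tiny negative FFT round-off
```

Because convolve() works via the FFT, the result is exact only up to floating-point error, hence the clipping step. For large x it may be worth checking that `sum(total_pmf)` stays close to 1.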