Hello there,

I have a problem concerning bootstrapping in R, specifically the resampling part of it. I will try to sum it up in a simplified way so as not to confuse anybody.

I have a small database consisting of 20 observations (basically the numbers from 1 to 20: 1, 2, 3, 4, 5, ..., 18, 19, 20).

I would like to resample this database many times for the bootstrap process, under the following conditions. Firstly, every resampled database should also contain 20 observations. Secondly, the numbers are selected from the 20 numbers above with replacement. Now comes the difficult part: any one number can be selected at most 5 times. To make this clear, here are a couple of examples of what the resampled databases might look like:

(1st database) 1,2,1,2,1,2,1,2,1,2,3,3,3,3,3,4,4,4,4,4
Four different numbers are chosen (1, 2, 3, 4), each selected the maximum possible 5 times.

(2nd database) 1,8,8,6,8,8,8,2,3,4,5,6,6,6,6,7,19,1,1,1
Two numbers (8 and 6) are selected 5 times (the maximum possible), number 1 is selected 4 times, and the others are selected fewer than 4 times.

(3rd database) 1,1,2,2,3,3,4,4,9,9,9,10,10,13,10,9,3,9,2,1
Number 9 is chosen the maximum possible 5 times, numbers 10, 3, 2 and 1 are chosen 3 times each, number 4 is selected twice and number 13 only once.

...

Does anybody know how to implement my "tricky" condition, that one number can be selected at most 5 times, with one of the R functions? Are the 'boot' and 'bootstrap' packages capable of managing this? I guess they are, I just couldn't figure it out yet...

Thanks very much! Best regards,
Laszlo Bodnar
Hi:

On Tue, Mar 1, 2011 at 8:22 AM, Bodnar Laszlo EB_HU <Laszlo.Bodnar@erstebank.hu> wrote:

> Hello there,
>
> I have a problem concerning bootstrapping in R - especially focusing on the
> resampling part of it. I try to sum it up in a simplified way so that I
> would not confuse anybody.
>
> I have a small database consisting of 20 observations (basically numbers
> from 1 to 20, I mean: 1, 2, 3, 4, 5, ... 18, 19, 20).

To check how often an ordinary bootstrap sample contains some number five or more times, I ran the following:

    bootmat <- matrix(sample(1:20, 200000, replace = TRUE), nrow = 10000)
    sum(apply(bootmat, 1, function(x) any(table(x) >= 5)))
    [1] 492

That is a proportion of about 0.05. A quick-and-dirty 'solution' would be to oversample by at least 5% (let's do 10% just to be on the safe side) and then pick out the first B samples that pass. In the above example, we could draw 11000 samples instead and pick out the first 10000 that meet the criterion:

    bootmat <- matrix(sample(1:20, 220000, replace = TRUE), nrow = 11000)
    # TRUE for rows in which some number occurs five or more times
    badsamps <- apply(bootmat, 1, function(x) any(tabulate(x) >= 5))
    # badsamps is logical, so drop the bad rows with !, not with -
    bootfin <- bootmat[!badsamps, ][1:10000, ]

Time:

       user  system elapsed
       0.28    0.00    0.28

(Note 1: Using table instead of tabulate took 4.22 seconds on my machine - tabulate is much faster.)
(Note 2: In the call above, there were 539 bad samples, so the 5% ballpark estimate seems plausible.)
(Note 3: The test >= 5 also rejects samples in which a number occurs exactly five times, which your rule allows. Use > 5 to keep those; fewer samples are then rejected.)

This is a simple application of the accept-reject criterion. I don't know how large 'many' is to you, but 10,000 seems to be a reasonable starting point. I ran it again for 1,000,000 such samples, and the completion time was

       user  system elapsed
      36.74    0.31   37.15

so the processing time grows a bit faster than linearly in the number of samples. If your simulations are of this magnitude and are to be run repeatedly, you probably need to write a function to improve the speed and to get rid of the waste produced by a rejection sampling approach. If this is a one-off deal, perhaps the above is sufficient.
HTH,
Dennis

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
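The accept-reject idea in Dennis's reply can be wrapped in a small function that keeps drawing until exactly B valid samples are available. The following is only a sketch under stated assumptions: the name capped_boot and its arguments are hypothetical (they do not come from the thread or from any package), and the cap is read as "at most 5 occurrences", so a sample is rejected only when some number occurs more than max_rep times.

```r
# Hypothetical helper (not from the thread or any package): draw B bootstrap
# samples of size n from 1:n, rejecting every sample in which some value
# occurs more than max_rep times.
capped_boot <- function(B, n = 20, max_rep = 5, oversample = 1.1) {
  keep <- matrix(integer(0), nrow = 0, ncol = n)
  while (nrow(keep) < B) {
    need <- B - nrow(keep)
    m <- ceiling(need * oversample) + 1          # candidates for this round
    cand <- matrix(sample.int(n, m * n, replace = TRUE), nrow = m)
    # a row is acceptable if its largest count does not exceed max_rep
    ok <- apply(cand, 1, function(x) max(tabulate(x, nbins = n)) <= max_rep)
    keep <- rbind(keep, cand[ok, , drop = FALSE])
  }
  keep[seq_len(B), , drop = FALSE]               # first B accepted rows
}

set.seed(1)
samp <- capped_boot(1000)
dim(samp)                                        # 1000 rows, 20 columns
# largest count of any value within any row; never above max_rep
max(apply(samp, 1, function(x) max(tabulate(x, nbins = 20))))
```

Each row of the result holds the values (or row indices into the original data) of one resample. Because the candidate rows are independent, taking the first B accepted rows does not favour any particular valid sample.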
On 2011-03-02, at 4:00 AM, r-help-request at r-project.org wrote:

> The difficult part comes now: one number can be selected only maximum 5 times.

Laszlo,

Create a vector consisting of 5 copies of each number (100 items in all). Then, for each sample, scramble the order of the items in the vector and select the first 20.

--
Please avoid sending me Word or PowerPoint attachments.
See <http://www.gnu.org/philosophy/no-word-attachments.html> -Dr. John R. Vokey
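John's suggestion of shuffling a pool that holds five copies of each number can be sketched as follows. The function name pool_boot and its arguments are hypothetical, chosen here for illustration only. Drawing 20 items from the pool without replacement is the same as scrambling the pool and taking the first 20.

```r
# Hypothetical helper (not from the thread or any package): build a pool with
# max_rep copies of each value, then draw n items without replacement.
pool_boot <- function(B, n = 20, max_rep = 5) {
  pool <- rep(seq_len(n), each = max_rep)        # 5 copies of 1..20 by default
  # replicate() returns an n x B matrix; transpose to one sample per row
  t(replicate(B, sample(pool, n)))
}

set.seed(1)
samp2 <- pool_boot(1000)
dim(samp2)                                       # 1000 rows, 20 columns
```

No value can appear more than 5 times because the pool contains only 5 copies of it, so nothing is rejected. One caveat worth keeping in mind: this draws from a different distribution than the accept-reject approach. Shuffling the pool makes samples with many repeated values less likely than they are among ordinary bootstrap samples that happen to satisfy the cap, so the two methods are not interchangeable if the exact resampling distribution matters.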