jasper slingsby
2009-Apr-06 15:39 UTC
[R] how to subsample all possible combinations of n species taken 1:n at a time?
Hello I apologise for the length of this entry but please bear with me. In short: I need a way of subsampling communities from all possible communities of n taxa taken 1:n at a time without having to calculate all possible combinations (because this gives me a memory error - using combn() or expand.grid() at least). Does anyone know of a function? Or can you help me edit the combn or expand.grid functions to generate subsamples? In long: I have been creating all possible communities of n taxa taken 1:n at a time to get a presence/absence matrix of species occurrence in communities as below... Rows are samples, columns are species: A B C D . . . . 1 0 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 0 ...but the number of possible communities increases exponentially with each added taxon. n<-11 #number of taxa sum(for (i in 0:n) choose(i, k = 0:i)) #number of combos So all possible combinations of 11 taxa taken 1:11 at a time is 2048, all combos of 12 taken 1:12 is 4096, 13 taken 1:13 = 8192...etc etc such that when I reach about 25 taken 1:25 the number of combos is 33554432 and I get a memory error. I have found that the number of combos of x taxa taken from a pool of n creates a very kurtotic unimodal distribution,... x<-vector("integer",20) for (i in 1:20) {x[i]<-choose(20,i)} plot(x) ...but have found that limiting the number of samples for any community size to 1000 is good enough for the further analyses I wish to do. My problem lies in sampling all possible combos without having to calculate all possible combos. I have tried two methods but both give memory errors at about 25 taxa. The expand.grid() method: n <- 11 toto <- vector("list",n) titi <- lapply(toto,function(x) c(0,1)) tutu <- expand.grid(titi) The combn() method (a slightly lengthlier function): samplecommunityD<- function(n,numsamples) { super<-mat.or.vec(,n) for (numspploop in 1:n) { minor<-t(combn(n,numspploop)) if (dim(minor)[1]<numsamples) { minot<-mat.or.vec(dim(minor)[1],n) for (loopi in 1:dim(minor)[1]) { for (loopbi in 1:dim(minor)[2]) { minot[loopi,minor[loopi,loopbi]] <- 1 } } super<-rbind(super,minot) rm(minot) } else { minot<-mat.or.vec(numsamples,n) for (loopii in 1:numsamples) { thousand<-sample(dim(minor)[1],numsamples) for (loopbii in 1:dim(minor)[2]) { minot[loopii,minor[thousand[loopii],loopbii]] <- 1 } } super<-rbind(super,minot) rm(minot) } } super<-super[!rowSums(super)>n-1&!rowSums(super)<2,] return(super) } samplecommunityD(11,1000) So unless anyone knows of another function I could try my next step would be to modify the combn or expand.grid functions to generate subsamples, but their coding beyond me at this stage (I'm a 3.5 month newbie). Can anyone identify where in the code I would need to introduce a sampling term or skipping sequence? Thanks for your time Jasper -- View this message in context: http://www.nabble.com/how-to-subsample-all-possible-combinations-of-n-species-taken-1%3An-at-a-time--tp22911399p22911399.html Sent from the R help mailing list archive at Nabble.com.
jim holtman
2009-Apr-06 19:57 UTC
[R] how to subsample all possible combinations of n species taken 1:n at a time?
Are you just trying to obtain a combination from 25 possible terms? If so, then just sample the number you want and convert the number to binary:> sample(33554432,100)[1] 6911360 5924262 23052661 12888381 25831589 16700013 24079278 33282839 12751862 26086726 31363494 7118320 21866536 4212929 [15] 8966435 12955834 449305 12830805 29181967 11420211 16175915 20118079 16560488 6248422 27762022 22430005 26650247 3621985 [29] 24283690 13800068 27546362 21711718 26270840 18556802 17774422 26486373 782865 16013167 24572344 23244187 16026237 28897360 [43] 14700082 8214024 2371593 3337527 10612303 17402454 22213173 13650936 30630988 9851680 15403666 11153297 21839554 8657593 [57] 16057288 25713076 2826853 29370859 11377380 28166893 11632747 11199608 15983665 29937151 29002363 13085852 26082502 32232925 [71] 14584722 23907975 13421556 10916983 25403574 6801209 23861215 4083294 8237209 4808486 8040610 1977505 21551566 29402643 [85] 26135975 26753178 15276437 13760103 27208220 20298140 21968831 11851302 9068401 33308858 21256448 7154058 4341004 16042933 [99] 31006704 20091025 This is a 100 samples and you can convert each of the numbers to binary and the bits will tell you might elements to combine. On Mon, Apr 6, 2009 at 11:39 AM, jasper slingsby <jslingsby at gmail.com> wrote:> > Hello > > I apologise for the length of this entry but please bear with me. > > In short: > I need a way of subsampling communities from all possible communities of n > taxa taken 1:n at a time without having to calculate all possible > combinations (because this gives me a memory error - using > combn() or expand.grid() at least). Does anyone know of a function? Or can > you help me edit the > combn > or > expand.grid > functions to generate subsamples? > > In long: > I have been creating all possible communities of n taxa taken 1:n at a time > to get a presence/absence matrix of species occurrence in communities as > below... > > Rows are samples, columns are species: > > ? ?A ? ?B ? ?C ? D ? ? . ? ? . ? ?. ? ?. > ? ?1 ? ?0 ? ?1 ? ?1 ? ?1 ? ?0 ? ?0 ? ?0 ? ?1 ? ? 1 ? ? 1 ? ? 1 ? ? 0 ? ? 0 > 0 ? ? 0 > ? ?0 ? ?1 ? ?1 ? ?1 ? ?1 ? ?0 ? ?0 ? ?0 ? ?1 ? ? 1 ? ? 1 ? ? 1 ? ? 0 ? ? 0 > 0 ? ? 0 > ? ?1 ? ?1 ? ?1 ? ?1 ? ?1 ? ?0 ? ?0 ? ?0 ? ?1 ? ? 1 ? ? 1 ? ? 1 ? ? 0 ? ? 0 > 0 ? ? 0 > ? ?0 ? ?0 ? ?0 ? ?0 ? ?0 ? ?1 ? ?0 ? ?0 ? ?1 ? ? 1 ? ? 1 ? ? 1 ? ? 0 ? ? 0 > 0 ? ? 0 > ? ?1 ? ?0 ? ?0 ? ?0 ? ?0 ? ?1 ? ?0 ? ?0 ? ?1 ? ? 1 ? ? 1 ? ? 1 ? ? 0 ? ? 0 > 0 ? ? 0 > ? ?0 ? ?1 ? ?0 ? ?0 ? ?0 ? ?1 ? ?0 ? ?0 ? ?1 ? ? 1 ? ? 1 ? ? 1 ? ? 0 ? ? 0 > 0 ? ? 0 > ? ?1 ? ?1 ? ?0 ? ?0 ? ?0 ? ?1 ? ?0 ? ?0 ? ?1 ? ? 1 ? ? 1 ? ? 1 ? ? 0 ? ? 0 > 0 ? ? 0 > ? ?0 ? ?0 ? ?1 ? ?0 ? ?0 ? ?1 ? ?0 ? ?0 ? ?1 ? ? 1 ? ? 1 ? ? 1 ? ? 0 ? ? 0 > 0 ? ? 0 > > ...but the number of possible communities increases exponentially with each > added taxon. > > n<-11 ? ? #number of taxa > sum(for (i in 0:n) choose(i, k = 0:i)) #number of combos > > So all possible combinations of 11 taxa taken 1:11 at a time is 2048, all > combos of 12 taken 1:12 is 4096, 13 taken 1:13 = 8192...etc etc such that > when I reach about 25 taken 1:25 the number of combos is 33554432 and I get > a memory error. > > I have found that the number of combos of x taxa taken from a pool of n > creates a very kurtotic unimodal distribution,... > > x<-vector("integer",20) > for (i in 1:20) {x[i]<-choose(20,i)} > plot(x) > > ...but have found that limiting the number of samples for any community size > to 1000 is good enough for the further analyses I wish to do. > My problem lies in sampling all possible combos without having to calculate > all possible combos. I have tried two methods but both give memory errors at > about 25 taxa. > > The expand.grid() method: > > n <- 11 > toto <- vector("list",n) > titi <- lapply(toto,function(x) c(0,1)) > tutu <- expand.grid(titi) > > The combn() method (a slightly lengthlier function): > > samplecommunityD<- function(n,numsamples) > { > super<-mat.or.vec(,n) > for (numspploop in 1:n) > { > ?minor<-t(combn(n,numspploop)) > ?if (dim(minor)[1]<numsamples) > ?{ > ? ?minot<-mat.or.vec(dim(minor)[1],n) > ? ?for (loopi in 1:dim(minor)[1]) > ? ?{ > ? ? ?for (loopbi in 1:dim(minor)[2]) > ? ? ?{ > ? ? ? ?minot[loopi,minor[loopi,loopbi]] <- 1 > ? ? ?} > ? ?} > ? ?super<-rbind(super,minot) > ? ?rm(minot) > ?} > ?else > ?{ > ? minot<-mat.or.vec(numsamples,n) > ? for (loopii in 1:numsamples) > ? { > ? ? thousand<-sample(dim(minor)[1],numsamples) > ? ? ? for (loopbii in 1:dim(minor)[2]) > ? ? ? { > ? ? ? minot[loopii,minor[thousand[loopii],loopbii]] <- 1 > ? ? ? } > ? } > ? super<-rbind(super,minot) > ? rm(minot) > ?} > } > super<-super[!rowSums(super)>n-1&!rowSums(super)<2,] > return(super) > } > > samplecommunityD(11,1000) > > > So unless anyone knows of another function I could try my next step would be > to modify the combn or expand.grid functions to generate subsamples, but > their coding beyond me at this stage (I'm a 3.5 month newbie). Can anyone > identify where in the code I would need to introduce a sampling term or > skipping sequence? > > Thanks for your time > Jasper > > -- > View this message in context: http://www.nabble.com/how-to-subsample-all-possible-combinations-of-n-species-taken-1%3An-at-a-time--tp22911399p22911399.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve?
Eik Vettorazzi
2009-Apr-06 20:02 UTC
[R] how to subsample all possible combinations of n species taken 1:n at a time?
Hi Jasper, maybe its not a very R'ish solution but the following could be a starting point: First, notice that every combination you are looking for can be represented as an integer in binary notation where each bit stands for a specific community. So looping through all combinations is the same as looping through 0:(2^n-1) , eg. 7=2^0+2^2, so community 1 and 3 (natural numbering) are "set" in the corresponding subset. #some auxillary functions are needed to extract the bits set in a given combination #get bit returns presence/absence vector of species for given subset x get.bit<-function(x,n) x%/%(2^(0:(n-1)))%%2 #index of data columns included in subset x get.index<-function(x,n) x%/%(2^(0:(n-1)))%%2==1 n<-15 # number of communities dta<-matrix(1:(12*n),ncol=n) # some sample data #looping thru *all* possible combinations of n communities. #will work, but it is in fact *very*! time consuming for n=25 for (i in 0:(2^n-1)) { sub.index<-get.index(i,n) # subsetting your dataset using this index and do subset analysis # dta[,sub.index] } hth jasper slingsby schrieb:> Hello > > I apologise for the length of this entry but please bear with me. > > In short: > I need a way of subsampling communities from all possible communities of n > taxa taken 1:n at a time without having to calculate all possible > combinations (because this gives me a memory error - using > combn() or expand.grid() at least). Does anyone know of a function? Or can > you help me edit the > combn > or > expand.grid > functions to generate subsamples? > > In long: > I have been creating all possible communities of n taxa taken 1:n at a time > to get a presence/absence matrix of species occurrence in communities as > below... > > Rows are samples, columns are species: > > A B C D . . . . > 1 0 1 1 1 0 0 0 1 1 1 1 0 0 > 0 0 > 0 1 1 1 1 0 0 0 1 1 1 1 0 0 > 0 0 > 1 1 1 1 1 0 0 0 1 1 1 1 0 0 > 0 0 > 0 0 0 0 0 1 0 0 1 1 1 1 0 0 > 0 0 > 1 0 0 0 0 1 0 0 1 1 1 1 0 0 > 0 0 > 0 1 0 0 0 1 0 0 1 1 1 1 0 0 > 0 0 > 1 1 0 0 0 1 0 0 1 1 1 1 0 0 > 0 0 > 0 0 1 0 0 1 0 0 1 1 1 1 0 0 > 0 0 > > ...but the number of possible communities increases exponentially with each > added taxon. > > n<-11 #number of taxa > sum(for (i in 0:n) choose(i, k = 0:i)) #number of combos > > So all possible combinations of 11 taxa taken 1:11 at a time is 2048, all > combos of 12 taken 1:12 is 4096, 13 taken 1:13 = 8192...etc etc such that > when I reach about 25 taken 1:25 the number of combos is 33554432 and I get > a memory error. > > I have found that the number of combos of x taxa taken from a pool of n > creates a very kurtotic unimodal distribution,... > > x<-vector("integer",20) > for (i in 1:20) {x[i]<-choose(20,i)} > plot(x) > > ...but have found that limiting the number of samples for any community size > to 1000 is good enough for the further analyses I wish to do. > My problem lies in sampling all possible combos without having to calculate > all possible combos. I have tried two methods but both give memory errors at > about 25 taxa. > > The expand.grid() method: > > n <- 11 > toto <- vector("list",n) > titi <- lapply(toto,function(x) c(0,1)) > tutu <- expand.grid(titi) > > The combn() method (a slightly lengthlier function): > > samplecommunityD<- function(n,numsamples) > { > super<-mat.or.vec(,n) > for (numspploop in 1:n) > { > minor<-t(combn(n,numspploop)) > if (dim(minor)[1]<numsamples) > { > minot<-mat.or.vec(dim(minor)[1],n) > for (loopi in 1:dim(minor)[1]) > { > for (loopbi in 1:dim(minor)[2]) > { > minot[loopi,minor[loopi,loopbi]] <- 1 > } > } > super<-rbind(super,minot) > rm(minot) > } > else > { > minot<-mat.or.vec(numsamples,n) > for (loopii in 1:numsamples) > { > thousand<-sample(dim(minor)[1],numsamples) > for (loopbii in 1:dim(minor)[2]) > { > minot[loopii,minor[thousand[loopii],loopbii]] <- 1 > } > } > super<-rbind(super,minot) > rm(minot) > } > } > super<-super[!rowSums(super)>n-1&!rowSums(super)<2,] > return(super) > } > > samplecommunityD(11,1000) > > > So unless anyone knows of another function I could try my next step would be > to modify the combn or expand.grid functions to generate subsamples, but > their coding beyond me at this stage (I'm a 3.5 month newbie). Can anyone > identify where in the code I would need to introduce a sampling term or > skipping sequence? > > Thanks for your time > Jasper > >
David Katz
2009-Apr-06 23:17 UTC
[R] how to subsample all possible combinations of n species taken 1:n at a time?
If I understand your problem properly, you could just note that selecting 1:n of n objects is the same as deciding separately whether each one is included or not. (exclude the case where none are selected). Take 1000 of these and you are there- except some are duplicates - so generate extras and eliminate the duplicates, discard the extras. Something like this (not tested): p <- 2^(n-1) / (2^n - 1) #all combinations have equal probability - removing rows with all zeros result <- matrix(0,1200*n,nrow=1200) #plenty of extras for duplicates for(i in 1:1200) result[i,] <- rbinom(n,1,p) result <- subset(result,apply(result,1,sum) > 0) #cases which have at least 1 species result <- unique(result)[1:1000,] Might be interesting to see the effect of varying p on the rest of your analysis. Further memory might be saved by using sparse matrices - see the Matrix package. David Katz www.davidkatzconsulting.com jasper slingsby wrote:> > Hello > > I apologise for the length of this entry but please bear with me. > > In short: > I need a way of subsampling communities from all possible communities of n > taxa taken 1:n at a time without having to calculate all possible > combinations (because this gives me a memory error - using > combn() or expand.grid() at least). Does anyone know of a function? Or can > you help me edit the > combn > or > expand.grid > functions to generate subsamples? > >-- View this message in context: http://www.nabble.com/how-to-subsample-all-possible-combinations-of-n-species-taken-1%3An-at-a-time--tp22911399p22919388.html Sent from the R help mailing list archive at Nabble.com.
Reasonably Related Threads
- Skipping of sample in ogg writing
- Skipping of sample in ogg writing
- Re-post data format question (apologies)
- Replace Function (How to replace numbers in a data frame with a specific number)
- split character string in matrix into character vector and numeric vector