Hello everybody!

I’m trying to determine the optimal number of surveys needed to detect the highest number of species within a monitoring season/session.

To do this I want to run all the possible combinations of a set of samples and calculate the total number of species for each combination of 2, 3, 4 …n sample events, so that in the end I can identify the lowest number of samples I need to obtain the best result.

I’ve already done this operation manually, just to see if it works, but some of my datasets have more than 30 samples and more than 35 species, so the number of combinations will be HUGE!

So here is the question: I need to find a way for R to make all possible combinations of samples automatically, and then to automatically return the total number of species in every combination.

I’ve tried to search for a loop script, or something like that. However, I’m relatively new to R and I don’t know what I need to do… Can anyone help me?

Here I’ve written a simple example of the operations I need to do, just to make my problem clearer.

My dataset (matrix) has sample events by rows (U1, U2, U3) and detected species by columns.

  U<-read.table("C:\\Documents \\tre_usc.txt",header=T,row.names=1,sep="\t",dec = ",")

  U   # global matrix with 3 samples

  SPECIE Aadi Aagl Apap Aage Bdia Beup Crub Carc Cpam
  U1        0    0    0    0    7    0    5    0    1
  U2        0    0    0    0    4    2    1    0    0
  U3        0    0    0    0    0    0    0    0   14

First, I’ve created from this matrix all the subsets based on single samples,

  U1 <- U [c(1), ]
  U2 <- U [c(2), ]
  U3 <- U [c(3), ]

  U1

  SPECIE Aadi Aagl Apap Aage Bdia Beup Crub Carc Cpam
  U1        0    0    0    0    7    0    5    0    1

  U2

  SPECIE Aadi Aagl Apap Aage Bdia Beup Crub Carc Cpam
  U2        0    0    0    0    4    2    1    0    0

Etc…

Then I’ve combined them, each time summing the values of the chosen rows (total n° of combinations = 4).

  U12<-U1+U2
  U13<-U1+U3
  U23<-U2+U3
  U123<-U1+U2+U3

  U12

  SPECIE Aadi Aagl Apap Aage Bdia Beup Crub Carc Cpam
  U12       0    0    0    0   11    2    6    0    1

Etc…

Then I’ve applied the command “length” to find the number of species for every new combination.

  length(U12[U12>0])
  [1] 4

  length(U13[U13>0])
  [1] 3

etc…

Now I need to do this with 10 and 32 sample events…….: (

Thanks for your attention!

Serena Corezzola
Centro Nazionale per lo Studio e la Conservazione della Biodiversità Forestale, “Bosco Fontana” di Verona
Strada Mantova 29
I-46045 MARMIROLO (MN)
Italy
On Thu, Jan 27, 2011 at 11:30:37AM +0100, Serena Corezzola wrote:
> Hello everybody!
>
> I’m trying to define the optimal number of surveys to detect the highest
> number of species within a monitoring season/session.
[...]
> My dataset (matrix) has sample events by rows (U1,U2,U3) and detected
> species by columns.
>
> U<-read.table("C:\\Documents
> \\tre_usc.txt",header=T,row.names=1,sep="\t",dec = ",")

Hello:

For simplicity of preparing a reply, let me include your data as an R command.
  U <- structure(list(
      Aadi = c(0L, 0L, 0L), Aagl = c(0L, 0L, 0L), Apap = c(0L, 0L, 0L),
      Aage = c(0L, 0L, 0L), Bdia = c(7L, 4L, 0L), Beup = c(0L, 2L, 0L),
      Crub = c(5L, 1L, 0L), Carc = c(0L, 0L, 0L), Cpam = c(1L, 0L, 14L)),
    .Names = c("Aadi", "Aagl", "Apap", "Aage", "Bdia", "Beup", "Crub", "Carc", "Cpam"),
    class = "data.frame",
    row.names = c("U1", "U2", "U3"))

     Aadi Aagl Apap Aage Bdia Beup Crub Carc Cpam
  U1    0    0    0    0    7    0    5    0    1
  U2    0    0    0    0    4    2    1    0    0
  U3    0    0    0    0    0    0    0    0   14

> First, I’ve created from this matrix all the subsets based on single
> samples,
>
> U1 <- U [c(1), ]
> U2 <- U [c(2), ]
> U3 <- U [c(3), ]
[...]
> then I’ve combined them summing each time the values of the chosen lines
> (total n° of combination = 4).
>
> U12<-U1+U2
> U13<-U1+U3
> U23<-U2+U3
> U123<-U1+U2+U3
[...]
> Then I’ve applied the command “length” to find the number of species for
> every new combination.
>
> length(U12[U12>0])
> [1] 4
>
> length(U13[U13>0])
> [1] 3

This can be partially automated as follows

  UM <- as.matrix(U)
  # Each row of A selects one subset of samples (1 = included, 0 = excluded).
  A <- rbind(
      c(1, 0, 0),
      c(0, 1, 0),
      c(0, 0, 1),
      c(1, 1, 0),
      c(1, 0, 1),
      c(0, 1, 1),
      c(1, 1, 1))
  # Build row names such as "U12" from the indices of the selected samples.
  rownam <- rep("U", times=nrow(A))
  for (i in 1:3) {
      rownam[A[, i] == 1] <- paste(rownam[A[, i] == 1], i, sep="")
  }
  dimnames(A) <- list(rownam, NULL)
  # Row r of C is the sum of the samples selected by row r of A.
  C <- A %*% UM
  C

       Aadi Aagl Apap Aage Bdia Beup Crub Carc Cpam
  U1      0    0    0    0    7    0    5    0    1
  U2      0    0    0    0    4    2    1    0    0
  U3      0    0    0    0    0    0    0    0   14
  U12     0    0    0    0   11    2    6    0    1
  U13     0    0    0    0    7    0    5    0   15
  U23     0    0    0    0    4    2    1    0   14
  U123    0    0    0    0   11    2    6    0   15

  rowSums(C != 0)

    U1   U2   U3  U12  U13  U23 U123
     3    3    1    4    3    4    4

> Now I need to do this with 10 and 32 sample events…….: (

If I understand you correctly, your real table U has 32 rows and you want to consider all subsets of at most 10 rows. If this is so, then the number of combinations is

  sum(choose(32, 1:10))
  # [1] 107594212

A numeric matrix with this number of rows and 35 columns requires about 30 GB of memory (8 bytes per entry). How do you want to summarize the results?
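As a quick check of the figures above, the arithmetic can be reproduced directly. This is only a sketch: it assumes the result matrix is stored as double precision (8 bytes per entry), which is what `A %*% UM` produces for a numeric `A`, and it ignores any overhead. The variable names are illustrative and not from the thread.

```r
# Number of subsets of at most 10 rows out of 32.
n_subsets <- sum(choose(32, 1:10))
n_subsets                      # 107594212

# Assumed storage: one 8-byte double per entry, 35 species columns.
bytes <- n_subsets * 35 * 8
bytes / 1e9                    # roughly 30 (decimal GB)
```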
There may be a more efficient way to compute the required parameters. For example, the average number of species contained in the sum of a random selection of k rows can be computed easily: each column (species) can be considered individually, and for each column the probability of getting a nonzero sum can be computed without actually constructing all the subsets.

If you need a parameter that is harder to compute than the average, simulation is an option. In that case, not all subsets are generated, only a smaller number of randomly chosen subsets of k rows for a given k.

Petr Savicky.
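The per-column idea in the reply above could be sketched as follows. The thread does not give a formula, so this is one reading of it: if a species occurs in m of the n samples, a random subset of k samples misses it with probability choose(n - m, k) / choose(n, k), and the expected number of species is the sum of the detection probabilities over columns. The helper name `expected_richness` is hypothetical, and the data are the three-sample example from the thread.

```r
# Example data from the thread: rows are sample events, columns are species.
U <- matrix(c(0, 0, 0, 0, 7, 0, 5, 0,  1,
              0, 0, 0, 0, 4, 2, 1, 0,  0,
              0, 0, 0, 0, 0, 0, 0, 0, 14),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("U1", "U2", "U3"),
                            c("Aadi", "Aagl", "Apap", "Aage", "Bdia",
                              "Beup", "Crub", "Carc", "Cpam")))

# Hypothetical helper (not from the thread): expected number of species
# detected in a uniformly random subset of k rows, without enumerating subsets.
expected_richness <- function(U, k) {
  n <- nrow(U)
  m <- colSums(U != 0)                           # samples containing each species
  p_detect <- 1 - choose(n - m, k) / choose(n, k)  # choose() is 0 when k > n - m
  sum(p_detect)
}

sapply(1:nrow(U), function(k) expected_richness(U, k))
# [1] 2.333333 3.666667 4.000000
```

For k = 2 the value 3.667 agrees with the mean of the exact counts 4, 3, 4 for U12, U13 and U23 in the thread.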
On Thu, Jan 27, 2011 at 05:30:15PM +0100, Petr Savicky wrote:
> On Thu, Jan 27, 2011 at 11:30:37AM +0100, Serena Corezzola wrote:
> > Hello everybody!
> >
> > I’m trying to define the optimal number of surveys to detect the highest
> > number of species within a monitoring season/session.
[...]
> UM <- as.matrix(U)
[...]
> C <- A %*% UM
[...]
> rowSums(C != 0)
>
>   U1   U2   U3  U12  U13  U23 U123
>    3    3    1    4    3    4    4
>
> > Now I need to do this with 10 and 32 sample events…….: (

Hello.

In a previous email, I suggested the code above. However, it works only for a fixed matrix U. To test the procedure on a larger matrix U, the matrix A has to be generated differently. For a fixed k, A should have choose(nrow(U), k) rows and nrow(U) columns, and its rows should be all 0/1 vectors with exactly k ones. The following code may be used, although better ways of computing A probably exist.

  n <- nrow(U)
  k <- 2
  # Each column of cmb is one subset of k row indices.
  cmb <- combn(n, k)
  A <- matrix(0, nrow=ncol(cmb), ncol=n)
  # ind holds (row, column) pairs used to set entries of A to 1.
  ind <- cbind(1:nrow(A), 0L)
  for (i in seq_len(k)) {
      ind[, 2] <- cmb[i, ]
      A[ind] <- 1
  }
  A

       [,1] [,2] [,3]
  [1,]    1    1    0
  [2,]    1    0    1
  [3,]    0    1    1

  C <- A %*% as.matrix(U)
  rowSums(C != 0)

  [1] 4 3 4

This output corresponds to U12, U13, U23.

If n = 32, then the above may be used for computing the required counts exactly for a few small values of k. For k up to 10, an approximation may be more suitable. For example, simulation may be used, where random subsets are generated using sample(n, k).

Petr Savicky.
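A minimal sketch of the simulation suggested above, using sample(n, k) to draw random subsets of rows instead of enumerating all of them. The helper name `simulate_richness`, the number of replicates `n_sim`, and the seed are assumptions made for illustration and are not from the thread. The data are again the three-sample example.

```r
# Example data from the thread: rows are sample events, columns are species.
U <- matrix(c(0, 0, 0, 0, 7, 0, 5, 0,  1,
              0, 0, 0, 0, 4, 2, 1, 0,  0,
              0, 0, 0, 0, 0, 0, 0, 0, 14),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("U1", "U2", "U3"),
                            c("Aadi", "Aagl", "Apap", "Aage", "Bdia",
                              "Beup", "Crub", "Carc", "Cpam")))

# Hypothetical helper (not from the thread): species counts for n_sim
# randomly chosen subsets of k rows.
simulate_richness <- function(U, k, n_sim = 1000) {
  n <- nrow(U)
  present <- U != 0                      # logical presence/absence matrix
  replicate(n_sim, {
    rows <- sample(n, k)                 # random subset of k sample events
    # A species is counted if it is present in at least one chosen row.
    sum(colSums(present[rows, , drop = FALSE]) > 0)
  })
}

set.seed(1)                              # arbitrary seed, for reproducibility only
sim <- simulate_richness(U, k = 2)
mean(sim)                                # estimate of the average species count
table(sim)                               # distribution over the sampled subsets
```

With the real 32-row table, the same call with k = 10 would give an approximate distribution of species counts without building the full matrix of combinations. The accuracy depends on `n_sim`.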