My data looks like this:

> data
  name G_hat_0_0 G_hat_1_0 G_hat_2_0 G_0 G_hat_0_1 G_hat_1_1 G_hat_2_1 G_1
1  rs0  0.488000  0.448625  0.063375   1  0.480875  0.454500  0.064625   1
2  rs1  0.002375  0.955375  0.042250   1  0.000000  0.062875  0.937125   2
3  rs2  0.050375  0.835875  0.113750   1  0.877250  0.115875  0.006875   0
4  rs3  0.000000  0.074750  0.925250   2  0.897750  0.102000  0.000250   0
5  rs4  0.000125  0.052375  0.947500   2  0.261500  0.724125  0.014375   1
6  rs5  0.003750  0.092125  0.904125   2  0.023000  0.738125  0.238875   1

And my task is:
For each individual (X) on each row, to find the index corresponding to the max of G_hat_X_0, G_hat_X_1, G_hat_X_2 and then increment the cell of the confusion matrix with the row corresponding to that index and the column corresponding to G_X.

For example, in the first row and the first individual, the index with the max value (0.488000) is 0 and the G_0 value is 1, so I would increment matrix index of the first row and second column. (Note that the ranges between rows and columns are one off. That is accounted for in the code.)

In reality the data will be much bigger, containing 10000 rows and a variable number of columns (inds) between 10 and 500.

The correct result is:

> cmat
        tru_rr tru_rv tru_vv
call_rr      2      2      0
call_rv      0      4      0
call_vv      0      0      4

I am not sure what the best way to do this is. I implemented it once using two for loops. Then I tried to use lapply and came up with a nested lapply solution, but it was slower than the simple loops. I still think that there is a better way and I was hoping for some advice. Perhaps something with pmax....
#### DATA PREP ##########

data = data.frame(name=c("rs0","rs1","rs2","rs3","rs4","rs5"),
  G_hat_0_0=c(0.488,0.002375,0.050375,0,0.000125,0.00375),
  G_hat_1_0=c(0.448625,0.955375,0.835875,0.07475,0.052375,0.092125),
  G_hat_2_0=c(0.063375,0.04225,0.11375,0.92525,0.9475,0.904125),
  G_0=c(1,1,1,2,2,2),
  G_hat_0_1=c(0.480875,0,0.87725,0.89775,0.2615,0.023),
  G_hat_1_1=c(0.4545,0.062875,0.115875,0.102,0.724125,0.738125),
  G_hat_2_1=c(0.064625,0.937125,0.006875,0.00025,0.014375,0.238875),
  G_1=c(1,2,0,0,1,1))

# get list of inds in file (e.g. G_0,G_1,...,G_100)
inds = grep("G_[0-9]+",names(data),perl=T,value=T)

# get total number of inds
nind = length(inds)

# create an empty "confusion" table
cmat = matrix(rep(0,9), nrow=3, ncol=3)
colnames(cmat) = c("tru_rr", "tru_rv", "tru_vv")
rownames(cmat) = c("call_rr","call_rv","call_vv")

## APPROACH 1: Nested For Loop ####

# Nested Loop Approach
for (row in (1:nrow(data))) {
  for (i in (0:(nind-1))) {

    Gmax = which.max(c( data[[paste("G_hat_0_",i,sep="")]][row],
                        data[[paste("G_hat_1_",i,sep="")]][row],
                        data[[paste("G_hat_2_",i,sep="")]][row] ))

    Gtru = data[[paste("G_",i,sep="")]][row] + 1 # add 1 to match Gmax range

    cmat[Gmax,Gtru] = cmat[Gmax,Gtru] + 1
  }
}

## APPROACH 2: Nested lapply ####

# This routine finds the geno w/ highest prob from the erg.avgs.
# and compares it to the true geno. Result is tallied by
# incrementing the appropriate index of the confusion matrix

add2cmat <- function(ind,locus) {

  Gmax = which.max(c( data[[paste("G_hat_0_",ind,sep="")]][locus],
                      data[[paste("G_hat_1_",ind,sep="")]][locus],
                      data[[paste("G_hat_2_",ind,sep="")]][locus] ))

  Gtru = data[[paste("G_",ind,sep="")]][locus] + 1 # add 1 to match Gmax range

  cmat[Gmax,Gtru] <<- cmat[Gmax,Gtru] + 1 # use double arrow to modify global env.
}

# Run add2cmat for all individuals on a given locus

add_locus2cmat <- function(locus) {
  lapply(0:(nind-1),add2cmat,locus)
}

junk = lapply((1:nrow(data)),add_locus2cmat) # don't need return value

--
View this message in context: http://r.789695.n4.nabble.com/Efficiency-Question-Nested-lapply-or-nested-for-loop-tp2968553p2968553.html
Sent from the R help mailing list archive at Nabble.com.
David Winsemius
2010-Oct-08 16:28 UTC
[R] Efficiency Question - Nested lapply or nested for loop
You are losing a lot of time by repeatedly calculating character indices with paste() in every iteration. Two options:

-- 1) calculate these once outside the loop and then refer to them by index

idx.names <- vector(mode="list", length=nind)
for (i in (0:(nind-1))) {
  idx.names[[i+1]] <-      # need the offset
    c(paste("G_hat_0_",i,sep=""),
      paste("G_hat_1_",i,sep=""),
      paste("G_hat_2_",i,sep=""),
      paste("G_",i,sep="") )
}

Then the inner loop would be:

for (i in (0:(nind-1))) {
  nms = idx.names[[i+1]]   # the four column names for individual i
  Gmax = which.max(c(data[[ nms[1] ]][row],
                     data[[ nms[2] ]][row],
                     data[[ nms[3] ]][row] ))
  Gtru = data[[ nms[4] ]][row] + 1 # add 1 to match Gmax range
  cmat[Gmax,Gtru] = cmat[Gmax,Gtru] + 1
}

And as has been said many times before,...

require(fortunes)
fortune("dog")

-- 2) probably even faster to pre-calculate (or just construct by inspection) those column indices as a numeric vector and then access them with data[row, numidxs[i] ]

The for-loop is generally going to be faster than an lapply solution. The fastest solution would be a fully indexed strategy, which might become more apparent (it's not yet so to me) after you implement the second option above.

-- David.

On Oct 8, 2010, at 11:35 AM, epowell wrote:
> My data looks like this:
> [...]

David Winsemius, MD
West Hartford, CT
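For option 2, here is a minimal sketch of what pre-computed numeric column indices might look like. This is only one possible reading of the suggestion, not code from the thread. It goes a step further than the loop above by handling all rows of one individual at once with max.col(), so the only remaining loop is over individuals. The names col.idx, probs and cell are made up for illustration. It assumes a version of R that has paste0(), and that the columns follow the G_hat_0_i, G_hat_1_i, G_hat_2_i, G_i naming from the original post. Note that max.col() treats values within a small relative tolerance as ties, so it could differ from which.max() on near-equal probabilities.

```r
# Toy data copied from the original post.
data <- data.frame(name = c("rs0","rs1","rs2","rs3","rs4","rs5"),
  G_hat_0_0 = c(0.488,0.002375,0.050375,0,0.000125,0.00375),
  G_hat_1_0 = c(0.448625,0.955375,0.835875,0.07475,0.052375,0.092125),
  G_hat_2_0 = c(0.063375,0.04225,0.11375,0.92525,0.9475,0.904125),
  G_0       = c(1,1,1,2,2,2),
  G_hat_0_1 = c(0.480875,0,0.87725,0.89775,0.2615,0.023),
  G_hat_1_1 = c(0.4545,0.062875,0.115875,0.102,0.724125,0.738125),
  G_hat_2_1 = c(0.064625,0.937125,0.006875,0.00025,0.014375,0.238875),
  G_1       = c(1,2,0,0,1,1))

nind <- length(grep("^G_[0-9]+$", names(data)))

# col.idx (hypothetical name): one row per individual holding the numeric
# positions of G_hat_0_i, G_hat_1_i, G_hat_2_i and G_i, looked up only once.
col.idx <- t(vapply(0:(nind - 1), function(i)
  match(c(paste0("G_hat_", 0:2, "_", i), paste0("G_", i)), names(data)),
  integer(4)))

cmat <- matrix(0, 3, 3,
               dimnames = list(c("call_rr", "call_rv", "call_vv"),
                               c("tru_rr", "tru_rv", "tru_vv")))

for (i in seq_len(nind)) {
  probs <- as.matrix(data[, col.idx[i, 1:3]])      # all loci x 3 genotype probs
  Gmax  <- max.col(probs, ties.method = "first")   # called genotype, 1..3
  Gtru  <- data[[col.idx[i, 4]]] + 1               # true genotype, shifted to 1..3
  cell  <- (Gtru - 1) * 3 + Gmax                   # column-major cell number in cmat
  cmat  <- cmat + tabulate(cell, nbins = 9)        # count every cell in one step
}

print(cmat)
```

On the toy data this gives the same counts as the cmat shown in the original post. tabulate() is used rather than cmat[cbind(Gmax, Gtru)] <- ... + 1 because an indexed assignment would only add 1 once to a cell that appears several times.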
Gabor Grothendieck
2010-Oct-08 16:47 UTC
[R] Efficiency Question - Nested lapply or nested for loop
On Fri, Oct 8, 2010 at 11:35 AM, epowell <EPowell1 at med.miami.edu> wrote:
> My data looks like this:
> [...]

If we reshape data into a 3d array, arr, it can be vectorized like this, where the two args of table correspond to Gmax and Gtru. The dimensions c(4, 2, 6) are the 4 columns per individual (three G_hat columns plus the true genotype), the 2 individuals and the 6 rows of data:

arr <- array(t(data[-1]), c(4, 2, 6))
table(apply(arr[-4,,], 2:3, which.max), arr[4,,] + 1)

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
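The array dimensions above are hard-coded for the six-row, two-individual example. The sketch below shows how they might be derived from the data instead. It is an illustration under stated assumptions, not part of the original answer: it assumes every column after name belongs to a block of four (G_hat_0_i, G_hat_1_i, G_hat_2_i, G_i) in that order. Wrapping both arguments in factor() with fixed levels is an addition of mine, so that the table stays 3 x 3 even if some genotype never occurs in a given data set.

```r
# Toy data copied from the original post.
data <- data.frame(name = c("rs0","rs1","rs2","rs3","rs4","rs5"),
  G_hat_0_0 = c(0.488,0.002375,0.050375,0,0.000125,0.00375),
  G_hat_1_0 = c(0.448625,0.955375,0.835875,0.07475,0.052375,0.092125),
  G_hat_2_0 = c(0.063375,0.04225,0.11375,0.92525,0.9475,0.904125),
  G_0       = c(1,1,1,2,2,2),
  G_hat_0_1 = c(0.480875,0,0.87725,0.89775,0.2615,0.023),
  G_hat_1_1 = c(0.4545,0.062875,0.115875,0.102,0.724125,0.738125),
  G_hat_2_1 = c(0.064625,0.937125,0.006875,0.00025,0.014375,0.238875),
  G_1       = c(1,2,0,0,1,1))

# Assumption: 4 columns per individual, in the order used in the post.
nind  <- (ncol(data) - 1) / 4
nloci <- nrow(data)

# Dimensions: 4 values per individual x individuals x loci.
arr <- array(t(as.matrix(data[-1])), c(4, nind, nloci))

# drop = FALSE keeps the array 3-dimensional even when nind is 1.
Gmax <- apply(arr[-4, , , drop = FALSE], 2:3, which.max)
Gtru <- arr[4, , ] + 1

# Fixed levels keep the result 3 x 3 whatever genotypes actually occur.
tab <- table(Gmax = factor(as.vector(Gmax), levels = 1:3),
             Gtru = factor(as.vector(Gtru), levels = 1:3))
print(tab)
```

On the toy data the counts agree with the cmat from the original post.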
Gabor Grothendieck
2010-Oct-08 22:28 UTC
[R] Efficiency Question - Nested lapply or nested for loop
On Fri, Oct 8, 2010 at 12:47 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> If we reform data into a 3d array, arr, it can be vectorized like this
> where the two args of table correspond to Gmax and Gtru:
>
> arr <- array(t(data[-1]), c(4, 2, 6))
> table(apply(arr[-4,,], 2:3, which.max), arr[4,,] + 1)

A couple of further improvements are that we can replace the array, arr, with a matrix, mat, and also we can add dimension names in the table() call:

mat <- matrix(t(data[-1]), 4)
table(Gmax = apply(mat[-4,], 2, which.max), Gtru = mat[4,] + 1)

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
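Before switching to the matrix version on real data, it may be worth checking that it agrees with the nested loop on a larger input. The sketch below does that on simulated data. The data frame sim, its sizes and the random values are invented purely for illustration, with the same column layout as in the original post. The sizes are kept small so it runs quickly; raising nloci and nind towards the real sizes and wrapping each part in system.time() would give a rough timing comparison.

```r
set.seed(1)
nloci <- 200   # hypothetical sizes, far smaller than the real data
nind  <- 5

# Simulated data with the same column layout as in the post:
# name, then G_hat_0_i, G_hat_1_i, G_hat_2_i, G_i for each individual i.
sim <- data.frame(name = paste0("rs", seq_len(nloci) - 1))
for (i in 0:(nind - 1)) {
  p <- matrix(runif(nloci * 3), ncol = 3)
  p <- p / rowSums(p)                      # each row sums to 1
  sim[[paste0("G_hat_0_", i)]] <- p[, 1]
  sim[[paste0("G_hat_1_", i)]] <- p[, 2]
  sim[[paste0("G_hat_2_", i)]] <- p[, 3]
  sim[[paste0("G_", i)]] <- sample(0:2, nloci, replace = TRUE)
}

# Reference result: the nested for loop from the original post.
cmat <- matrix(0, 3, 3)
for (row in seq_len(nloci)) {
  for (i in 0:(nind - 1)) {
    Gmax <- which.max(c(sim[[paste0("G_hat_0_", i)]][row],
                        sim[[paste0("G_hat_1_", i)]][row],
                        sim[[paste0("G_hat_2_", i)]][row]))
    Gtru <- sim[[paste0("G_", i)]][row] + 1
    cmat[Gmax, Gtru] <- cmat[Gmax, Gtru] + 1
  }
}

# Matrix version; fixed factor levels keep the table 3 x 3.
mat <- matrix(t(sim[-1]), 4)
tab <- table(Gmax = factor(apply(mat[-4, ], 2, which.max), levels = 1:3),
             Gtru = factor(mat[4, ] + 1, levels = 1:3))

# Both approaches should tally exactly the same cells.
stopifnot(all(tab == cmat))
```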