ivo welch
2012-Mar-26 22:28 UTC
[R] assigning vector or matrix sparsely (for use with mclapply)
Dear R wizards---

I have a wrapper on mclapply() that makes it a little easier for me to do multiprocessing. (Posting this may make life easier for other googlers.) I pass a data frame, a vector that tells me which rows should be recomputed, and the function; and I get back a vector or matrix of answers.

    d <- data.frame( id=1:6, val=11:16 )
    loc <- c(TRUE,TRUE,FALSE,TRUE,FALSE,TRUE)
    v1 <- mc.byselectrows( d, loc, function(x) x[,2]^2 )
    v2 <- mc.byselectrows( d, loc, function(x) cbind(x[,2]^2,x[,2]^3) )

    mc.byselectrows <- function(data.in, recalclist, FUN, ...) {

      data.notdone <- data.in[recalclist,]
      cat.stderr("[mc.byselectrows: ", nrow(data.notdone),
                 "rows to be recomputed out of", nrow(data.in), "]\n")

      FUN.ON.ROWS <- function(.index, ...)
        as.matrix(FUN(data.notdone[.index,], ...))
      soln <- mclapply( as.list(1:nrow(data.notdone)), FUN.ON.ROWS, ... )
      rv <- do.call("rbind", soln)  ## omits naming.
      if (ncol(rv)==1) rv <- as.vector(rv)
      rv
    }

This works fine, except that I want to get NAs in the return positions that were not recalculated. Then I can write

    newdata$y <- ifelse( is.na(olddata$y),
                         mc.byselectrows( olddata, is.na(olddata$y), fun.calc.y ),
                         olddata$y )

I can do this very inelegantly, of course. I can merge recalclist into data.in and then write a loop that substitutes for the do.call to rbind. Yikes. Or I could do the recalclist contingency inside FUN.ON.ROWS, but this is costly in terms of execution time. Are there obvious solutions? Advice appreciated.

regards,

/iaw
----
Ivo Welch (ivo.welch at gmail.com)
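One way to get the NA padding asked about here, as a minimal sketch rather than a definitive answer: pre-allocate an all-NA result with one row per input row, then assign the recomputed values only into the selected positions. The function name mc.byselectrows.padded and the mc.cores argument are hypothetical additions, not from the post. The sketch assumes FUN returns numeric values with the same number of columns for every row, and it loads mclapply() from the parallel package.

```r
library(parallel)  # provides mclapply()

# Hypothetical variant of the wrapper: rows that are not recomputed come back as NA.
# Assumes FUN returns numeric output with a constant number of columns.
mc.byselectrows.padded <- function(data.in, recalclist, FUN, ..., mc.cores = 1L) {
  todo <- which(recalclist)                      # positions of rows to recompute
  if (length(todo) == 0) return(rep(NA_real_, nrow(data.in)))

  soln <- mclapply(todo,
                   function(i, ...) as.matrix(FUN(data.in[i, , drop = FALSE], ...)),
                   ..., mc.cores = mc.cores)
  computed <- do.call("rbind", soln)             # one row per recomputed row

  # Pre-allocate NA for every input row, then fill only the recomputed ones.
  rv <- matrix(NA_real_, nrow = nrow(data.in), ncol = ncol(computed))
  rv[todo, ] <- computed
  if (ncol(rv) == 1) rv <- as.vector(rv)
  rv
}

d   <- data.frame(id = 1:6, val = 11:16)
loc <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
v1  <- mc.byselectrows.padded(d, loc, function(x) x[, 2]^2)
v2  <- mc.byselectrows.padded(d, loc, function(x) cbind(x[, 2]^2, x[, 2]^3))
```

Because the result has the same length (or number of rows) as the input, it lines up position by position with an existing column, so the ifelse() call from the post can be applied to it directly.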
ilai
2012-Mar-28 02:27 UTC
[R] assigning vector or matrix sparsely (for use with mclapply)
It is (at least for me) really unclear what the problem is, or how it's related to mclapply. You say

> this works fine, except that what I want to get NA's in the return
> positions that were not recalculated. then, I can write
>
> newdata$y <- ifelse ( is.na(olddata$y), mc.byselectrows( olddata,
> is.na(olddata$y), fun.calc.y ), olddata$y )

Why? Are you applying the function twice? Then why not simply

    v1.1 <- mc.byselectrows( d, loc<1, function(x) x[,2]^2 )

the second time? If the problem is keeping track of which rows got calculated, why not reattach the row names that are dropped after mclapply (probably a good idea anyway):

    FUN.ON.ROWS <- function(.index, ...)
      as.matrix(FUN(data.notdone[.index,], ...))
    soln <- mclapply( as.list(1:nrow(data.notdone)), FUN.ON.ROWS, ... )
    rv <- do.call("rbind", soln)  ## omits naming.
    if (ncol(rv)==1) {
      rv <- as.vector(rv)
      names(rv) <- row.names(data.notdone)
    } else rownames(rv) <- row.names(data.notdone)
    rv
    }

And finally, you don't even need row.names for

    c(v1, d[loc<1,2])

Or am I missing something here?

BTW your code uses cat.stderr (which is local?) instead of cat, and has no call to multicore.

Cheers
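The row-name suggestion above can be sketched as follows: label the recomputed values with the row names of the rows they came from, then use those names to place them into a full-length vector. This is only an illustration under stated assumptions. The variable names sub, res, and y are hypothetical, mclapply() is taken from the parallel package, and mc.cores = 1L is used so the snippet also runs where forking is unavailable.

```r
library(parallel)  # provides mclapply()

d   <- data.frame(id = 1:6, val = 11:16)
loc <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)

sub <- d[loc, ]                                   # rows selected for recomputation
res <- unlist(mclapply(seq_len(nrow(sub)),
                       function(i) sub[i, 2]^2,
                       mc.cores = 1L))
names(res) <- row.names(sub)                      # keeps the original row labels

# Full-length NA vector, indexed by the row names of the original data frame.
y <- rep(NA_real_, nrow(d))
names(y) <- row.names(d)
y[names(res)] <- res                              # fill only the recomputed rows
```

Indexing by name avoids having to carry the logical selector around alongside the results; the cost is that it relies on the data frame having unique row names.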