The following program is whittled down from a much larger program that always works on Intel, and always works on AMD's Threadripper with lapply but not mclapply. With mclapply on AMD, all processes go into "suspend" mode and the program then hangs. This bug is replicable on an AMD Ryzen Threadripper 2950X 16-Core Processor (128GB RAM), running the latest Ubuntu 18.04. The R version is 3.5.3 (2019-03-11) -- "Great Truth", invoked with --vanilla. I hope this helps... it took quite a while to get it to this stage. I sure hope that I am not reporting an old bug...

options("mc.cores"=4)
library(data.table)
library(parallel)

if (!file.exists("bugsample.csv")) {
  NR <- 64833330
  notused <- data.frame(v1=1:NR, v2=1:NR, v3=1:NR, x1=log(1:NR), x2=log(1:NR))
  fwrite(notused, file="bugsample.csv")
  stop("you can quit now and restart the program")
}

if (!exists("notused")) notused <- fread("bugsample.csv", nrows= Inf)  ## needed! Inf cannot be replaced by actual NR

sample <- data.frame( groupidentifier=c( rep(11111,2000), rep(22222, 4500 ) ) )
sample$yvar <- sin(1:nrow(sample))
sample$xvar <- 1:nrow(sample)

testfun <- function(dl) {
  with(dl, message("Working: ", first(groupidentifier), " with ", nrow(dl)))

  lapply( 1:nrow(dl), FUN=function(onedayindex) {
    if ((onedayindex %% 500) != 0) return(NULL)
    with(dl[1:onedayindex,],
         c( tryCatch( coef(lm( yvar ~ xvar, data=dl[1:onedayindex,] ))[2], error = function(e) NA ) ) )
  })
}

message("starting --- replicable hang with mclapply, but not lapply")

o <- mclapply(split( 1:nrow(sample), sample$groupidentifier ),
              FUN=function(.index) testfun( sample[.index, , drop=FALSE] ))

message("never gets here with mclapply")

print( do.call("c", o[[1]]) )
print( do.call("c", o[[2]]) )

--
Ivo Welch (ivo.welch at ucla.edu)
Dirk Eddelbuettel
2019-Apr-05 11:28 UTC
[Rd] Deep Replicable Bug With AMD Threadripper MultiCore
On 4 April 2019 at 17:28, ivo welch wrote:
| The following program is whittled down from a much larger program that
| always works on Intel, and always works on AMD's threadripper with
| lapply but not mclapply. With mclapply on AMD, all processes go into
| "suspend" mode and the program then hangs. This bug is replicable on an
| AMD Ryzen Threadripper 2950X 16-Core Processor (128GB RAM), running
| latest ubuntu 18.04. The R version 3.5.3 (2019-03-11) -- "Great Truth" ,
| invoked with --vanilla. I hope this helps...it took quite a while to get
| it to this stage. I sure hope that I am not reporting an old bug...
|
| options("mc.cores"=4)
| library(data.table)
| library(parallel)

Just as you set mc.cores to 4 for parallel::mclapply, I would try throttling data.table, which in its current version goes for all cores. So do, say,

    setDTthreads(4)

and see if that helps. Try lower and lower values to see if you get by. While there may well be a different race condition in mclapply, it may help to not overschedule.

(FWIW, the next version of data.table, in queue at CRAN, is less aggressive and has additional options for fine tuning.)

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
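A minimal sketch of the throttling idea, assuming data.table is installed and a Unix-like system (mclapply relies on forking). The thread count of 1 and the small stand-in data frame `d` are illustrative choices, not values from the thread, and nothing here is claimed to cure the reported hang; it only shows where the call would go relative to mclapply.

```r
library(parallel)
library(data.table)  # assumption: data.table is installed

n_workers <- 2L                 # illustrative value, not a recommendation
options(mc.cores = n_workers)

# Limit data.table's own OpenMP threads before any forking happens.
# 1 is the most conservative setting; the reply suggests starting at 4
# and trying lower values.
setDTthreads(1L)
dt_threads <- getDTthreads()

# Small stand-in data set (hypothetical; not the 64-million-row file).
d <- data.frame(g = rep(c("a", "b"), each = 1000))
d$x <- seq_len(nrow(d))
d$y <- sin(d$x)

# One forked task per group; each fits a regression and returns the slope.
slopes <- mclapply(split(seq_len(nrow(d)), d$g), function(idx) {
  unname(coef(lm(y ~ x, data = d[idx, , drop = FALSE]))[2])
})
```

Calling getDTthreads() afterwards is a cheap way to confirm that the setting took effect before starting a long run.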
Tomas Kalibera
2019-Apr-05 12:10 UTC
[Rd] Deep Replicable Bug With AMD Threadripper MultiCore
In addition, you can also try to use a PSOCK cluster (see makeCluster, parLapply) to avoid the problem. It should help if the problem is somehow related to forking in mclapply().

The problem you are seeing may be in base R, in data.table, or in the interaction between the two (mclapply() from base R uses forking directly, data.table uses OpenMP). If you think the bug is in base R, it would be much better if you could find a reproducible example that only uses packages shipped directly with R; otherwise it might be best to contact the maintainer of data.table.

Please also make sure to use the latest version of R 3.5 (or R-devel). The implementation of forking in the parallel package, and hence also in mclapply, has been rewritten since R 3.4.

Best
Tomas
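A rough sketch of the PSOCK alternative, using only packages shipped with R. The data frame is renamed `sample_df` (a hypothetical name, chosen to avoid masking base::sample), the worker count of 2 is arbitrary, and `testfun` is simplified to visit only every 500th row directly; those are assumptions of this sketch rather than the original poster's code.

```r
library(parallel)

# Same shape as the data in the original report.
sample_df <- data.frame(groupidentifier = c(rep(11111, 2000), rep(22222, 4500)))
sample_df$yvar <- sin(seq_len(nrow(sample_df)))
sample_df$xvar <- seq_len(nrow(sample_df))

# Slope of yvar ~ xvar on the first i rows, for i = 500, 1000, ...
testfun <- function(dl) {
  idx <- seq(500, nrow(dl), by = 500)
  vapply(idx, function(i) {
    unname(coef(lm(yvar ~ xvar, data = dl[seq_len(i), , drop = FALSE]))[2])
  }, numeric(1))
}

# PSOCK workers are separate R processes started fresh, not forks.
cl <- makeCluster(2L)  # type = "PSOCK" is the default
groups <- split(sample_df, sample_df$groupidentifier)

# stopCluster() runs even if parLapply() fails.
o <- tryCatch(parLapply(cl, groups, testfun), finally = stopCluster(cl))
```

Because PSOCK workers do not inherit the parent's memory, each group's rows are serialized and sent to a worker. That costs more than forking, but it sidesteps any interaction between fork() and threads already started in the parent process.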