Dear R gurus,

I have a very embarrassingly parallelizable job that I am trying to speed up with snow on our local cluster. Basically, I am doing ~50,000 t-tests for a series of microarray experiments, one gene at a time, so I can easily spread the load across multiple processors and nodes.

I have a master list object that tells me which rows to pick up for each gene's t-test from a series of microarray experiments containing ~500,000 rows and x columns per experiment.

While trying to optimize my function using parLapply(), I quickly realized that I was not gaining any speed, because every time a test was done on one of the items in the list, the 500,000-row by x-column matrix had to be shipped along with the item, and the transfer time was actually longer than the computing time. However, I do get a speedup if I first export the 500,000-row object across the spawned processes, as in this mock script:

    cl <- makeCluster(nnodes, method)
    mArrayData <- getData(experiments)
    clusterExport(cl, 'mArrayData')

    Results <- parLapply(cl, theMapList, function(x) t.testFnc(x))

with a function that uses mArrayData as the default for its array argument:

    t.testFnc <- function(probeList, array = mArrayData){
        x <- array[probeList$A, ]
        y <- array[probeList$B, ]
        res <- doSomeTest(x, y)
        return(res)
    }

Using this strategy, I was able to take full advantage of my cluster and reduce the analysis time by a factor of the number of nodes in our cluster. The large data matrix was resident in each process and didn't have to travel over the network every time an item from the list was passed to t.testFnc().

However, I quickly realized that this works (the call to clusterExport()) only when I run the script one line at a time. When the process is enclosed in a function, the object mArrayData is not exported, presumably because it's not a global object in the master process.

So, what is the alternative for pushing the content of an object to the slaves? The documentation in the snow package is a bit light and I couldn't find good examples out there. I don't want getData() evaluated on each node, because the arguments to that function are humongous and shipping them would cause way too much network traffic. I want the result of getData(), the object mArrayData, propagated to the cluster only once and available to downstream functions.

Hope this is clear and that a solution will be possible.

Many thanks

Marco

--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.
Kansas City, MO 64110

Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018
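A minimal sketch of the pattern Marco is after, assuming a snow build (or the later parallel package) whose clusterExport() accepts an envir argument -- older snow releases only consult the master's global environment, which is exactly the failure described above. runTests() is a hypothetical wrapper; getData(), theMapList, and t.testFnc() are the names from the post:

    library(snow)

    runTests <- function(experiments, theMapList, nnodes = 2) {
        cl <- makeCluster(nnodes, type = "SOCK")
        on.exit(stopCluster(cl))             # clean up even on error
        mArrayData <- getData(experiments)   # local, not global
        ## export from the function's own frame instead of .GlobalEnv;
        ## the 'envir' argument is the assumption here
        clusterExport(cl, "mArrayData", envir = environment())
        parLapply(cl, theMapList, t.testFnc)
    }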
Hi Marco --

Do you know about Bioconductor, http://bioconductor.org ? The rowttests function in the genefilter package will do what you want efficiently and on a single node.

    > # install the package
    > source('http://bioconductor.org/biocLite.R')
    > biocLite('genefilter')
    > # do 500k t-tests
    > library(genefilter)
    > m <- matrix(runif(500000*20), ncol=20)
    > f <- factor(rep(c("A", "B"), each=10))
    > system.time(rowttests(m, f))
       user  system elapsed
      0.964   0.128   1.095

A package like limma, with its great vignette, is an excellent introduction to statistical analyses that make better use of this type of data. See the links to Bioconductor packages at http://bioconductor.org/packages/release/Software.html

A little more below...

Martin

"Blanchette, Marco" <MAB at stowers-institute.org> writes:

> [...]
> However, if I export the 500,000-row object first across the spawned
> processes as in this mock script
>
>     cl <- makeCluster(nnodes, method)
>     mArrayData <- getData(experiments)
>     clusterExport(cl, 'mArrayData')
>
>     Results <- parLapply(cl, theMapList, function(x) t.testFnc(x))

try writing this in a more 'functional' style, so all variables used by the function in parLapply are passed to the function:

    parLapply(cl, theMapList, function(probeList, bigArray) {
        x <- bigArray[probeList$A, ]
        y <- bigArray[probeList$B, ]
        doSomeTest(x, y)
    }, bigArray = mArrayData)

snow will see to distributing bigArray in an appropriate way.

> [...]
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024
Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
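Martin's suggestion can be exercised end to end with stand-in data. In the following toy sketch, bigMatrix, mapList, and the t.test() call are made-up stand-ins for mArrayData, theMapList, and doSomeTest(); it should run as-is wherever snow is installed:

    library(snow)

    cl <- makeCluster(2, type = "SOCK")

    bigMatrix <- matrix(rnorm(1000 * 20), nrow = 1000)   # stand-in for mArrayData
    mapList <- lapply(1:50, function(i)                  # stand-in for theMapList
        list(A = sample(1000, 10), B = sample(1000, 10)))

    ## extra arguments to parLapply() are passed through to the function,
    ## so snow itself handles shipping bigMatrix to the workers
    results <- parLapply(cl, mapList, function(probeList, bigArray) {
        x <- bigArray[probeList$A, ]
        y <- bigArray[probeList$B, ]
        t.test(colMeans(x), colMeans(y))$p.value         # stand-in for doSomeTest()
    }, bigArray = bigMatrix)

    stopCluster(cl)

One caveat: snow serializes the extra argument to the workers for each parLapply() call, so this style suits a single large call; for many successive calls over the same matrix, exporting it once (as in the surrounding posts) avoids re-sending it.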
I think I found a solution. I do not like to use global variables, for fear of unpredictable side effects, but I think that in this case I don't have too much choice. Here is a mock function that pushes the content of a variable evaluated within a function to the nodes on the cluster, does some computation on the nodes using that variable, and then returns the result after cleaning up the newly created global variable. Let me know what you people think:

    aTest <- function(x, n.nodes=2){
        library(snow)
        # initialize a cluster
        cl <- makeCluster(rep('localhost', n.nodes), type='SOCK')
        # create a global variable
        y <<- x
        # export the variable to the cluster
        clusterExport(cl, 'y')
        # do some computation on the cluster
        res <- clusterEvalQ(cl, y+2)
        # remove the variable from the global environment
        rm(y, envir=.GlobalEnv)
        # stop the cluster
        stopCluster(cl)
        # exit and return the computation
        return(res)
    }

On 11/29/08 6:59 PM, "Marco Blanchette" <MAB at Stowers-Institute.org> wrote:

> [...]
--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.
Kansas City, MO 64110

Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018
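A variant of Marco's workaround that skips the master-side global entirely is to push the object straight into each worker's global environment with clusterCall() and assign(): only the workers end up holding a global, and no rm() bookkeeping is needed on the master. pushToCluster() below is a hypothetical helper, not part of snow:

    library(snow)

    ## hypothetical helper: assign 'value' under 'name' in each
    ## worker's global environment
    pushToCluster <- function(cl, name, value) {
        clusterCall(cl, function(n, v) {
            assign(n, v, envir = .GlobalEnv)   # runs on each worker
            NULL                               # don't ship the value back
        }, name, value)
        invisible(NULL)
    }

    ## usage, following the mock script from the first message:
    ## cl <- makeCluster(2, type = "SOCK")
    ## pushToCluster(cl, "mArrayData", getData(experiments))
    ## results <- parLapply(cl, theMapList, t.testFnc)
    ## stopCluster(cl)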