Cook, Malcolm
2013-Mar-04 21:13 UTC
[Rd] enabling reproducible research & R package management & install.package.version & BiocLite
Hi,

In support of reproducible research at my Institute, I seek an approach to re-creating the R environments in which an analysis has been conducted.

By which I mean, the exact version of R and the exact version of all packages used in a particular R session.

I am seeking comments/criticism of this as a goal, and of the following outline of an approach:

=== When all the steps of a workflow have been finalized ===
* re-run the workflow from beginning to end
* save the results of sessionInfo() into an RDS file named after the current date and time.

=== Later, when desirous of exactly recreating this analysis ===
* read the (old) sessionInfo() into an R session
* exit with failure if the running version of R doesn't match
* compare the old sessionInfo to the currently installed packages (e.g. using packageVersion)
* where there are discrepancies, install the required version of the package (without dependencies) into a new library (named after the old sessionInfo RDS file)

Then the analyst should be able to put the new library at the front of .libPaths() and run the analysis, confident that the same versions of the packages are in use.

I have in the past used install-package-version.R to revert to previous versions of R packages successfully (https://gist.github.com/1503736). And there is a similar tool in Hadley Wickham's devtools.

But I don't know whether I need something special for (Bioconductor) packages that have been installed using biocLite, and I seek advice here.

I do understand that the R environment is not sufficient to guarantee reproducibility. Some of my colleagues have suggested saving a virtual machine with all your software/libraries/data installed.

So, I am also in general interested in what other people are doing to this end. But I am most interested in:

* is this a good idea
* is there a worked out solution
* does biocLite introduce special cases
* where do the dragons lurk

... and the like

Any tips?

Thanks,

~ Malcolm Cook
Stowers Institute / Computation Biology / Shilatifard Lab
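For concreteness, a minimal sketch of the outline above in R. The helper names (record_session, restore_session) and the use of devtools::install_version are illustrative assumptions, not code from the post:

    ## === When all the steps of the workflow have been finalized ===
    record_session <- function(dir = ".") {
      file <- file.path(dir, format(Sys.time(), "sessionInfo-%Y-%m-%d-%H%M%S.rds"))
      saveRDS(sessionInfo(), file)   # exact versions of R and all attached/loaded packages
      file
    }

    ## === Later, when exactly recreating the analysis ===
    restore_session <- function(rds) {
      old <- readRDS(rds)
      ## exit with failure if the running version of R doesn't match
      if (!identical(old$R.version$version.string, R.version$version.string))
        stop("R version mismatch: analysis was run under ", old$R.version$version.string)
      ## new library named after the old sessionInfo RDS file
      lib <- sub("\\.rds$", "", rds)
      dir.create(lib, showWarnings = FALSE)
      ## compare the old sessionInfo to what is currently installed
      pkgs   <- c(old$otherPkgs, old$loadedOnly)
      wanted <- vapply(pkgs, function(p) p$Version, character(1))
      for (pkg in names(wanted)) {
        have <- tryCatch(as.character(packageVersion(pkg)),
                         error = function(e) NA_character_)
        if (is.na(have) || have != wanted[pkg]) {
          ## where there are discrepancies, install the required version
          ## (without dependencies); devtools::install_version is one option
          ## for CRAN packages -- Bioconductor packages may need separate handling
          devtools::install_version(pkg, version = wanted[pkg],
                                    dependencies = FALSE, lib = lib)
        }
      }
      .libPaths(c(lib, .libPaths()))  # put the new library at the front of the search path
    }

    ## usage (illustrative):
    ## rds <- record_session()   # at the end of the finalized workflow
    ## restore_session(rds)      # later, before re-running the analysis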
Aaron Mackey
2013-Mar-04 21:28 UTC
[Rd] [BioC] enabling reproducible research & R package management & install.package.version & BiocLite
On Mon, Mar 4, 2013 at 4:13 PM, Cook, Malcolm <MEC@stowers.org> wrote:

> * where do the dragons lurk

Webs of interconnected dynamically loaded libraries, identical versions of R compiled with different BLAS/LAPACK options, etc. Go with the VM if you really, truly, want this level of exact reproducibility.

An alternative (and arguably more useful) strategy would be to cache the results of each computational step, and report when results differ upon re-execution with identical inputs. If you cache sessionInfo along with each result, you can identify which package(s) changed and begin to hunt down why the change occurred (possibly for the better). Couple this with the practice of keeping both code *and* results in version control, and you can move forward with a (re)analysis without being crippled by out-of-date software.

-Aaron

--
Aaron J. Mackey, PhD
Assistant Professor
Center for Public Health Genomics
University of Virginia
amackey@virginia.edu
http://www.cphg.virginia.edu/mackey
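A rough sketch of this caching strategy, assuming the digest package for hashing inputs; cached_step and the cache layout are hypothetical names, not an existing API:

    library(digest)   # hash the inputs so identical inputs map to the same cache file

    cached_step <- function(step_name, inputs, fun, cache_dir = "cache") {
      dir.create(cache_dir, showWarnings = FALSE)
      key  <- digest(list(step_name, inputs))
      file <- file.path(cache_dir, paste0(key, ".rds"))

      result <- fun(inputs)

      if (file.exists(file)) {
        old <- readRDS(file)
        ## report when results differ upon re-execution with identical inputs
        if (!isTRUE(all.equal(old$result, result)))
          warning(step_name, ": result differs from the cached run; compare the ",
                  "stored sessionInfo against sessionInfo() to see which packages changed")
      } else {
        ## cache the result together with the sessionInfo that produced it
        saveRDS(list(result = result, session = sessionInfo()), file)
      }
      result
    }

Keeping the cache directory under version control alongside the code, as suggested, gives a record of both the results and the environments that produced them.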
Steve Lianoglou
2013-Mar-04 22:15 UTC
[Rd] [BioC] enabling reproducible research & R package management & install.package.version & BiocLite
On Mon, Mar 4, 2013 at 4:28 PM, Aaron Mackey <amackey at virginia.edu> wrote:

> On Mon, Mar 4, 2013 at 4:13 PM, Cook, Malcolm <MEC at stowers.org> wrote:
>
>> * where do the dragons lurk
>
> webs of interconnected dynamically loaded libraries, identical versions of
> R compiled with different BLAS/LAPACK options, etc. Go with the VM if you
> really, truly, want this level of exact reproducibility.

Sounds like the best bet -- maybe tools like Vagrant might be useful here:

  http://www.vagrantup.com

... or maybe they're overkill? I haven't really checked them out myself too much, but my impression is that these tools (vagrant, chef, puppet) are built to handle such cases. I'd imagine you'd probably need a location from which you can grab the precise (versioned) packages for the things you are specifying, but ...

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
  | Memorial Sloan-Kettering Cancer Center
  | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
Yihui Xie
2013-Mar-04 23:04 UTC
[Rd] enabling reproducible research & R package management & install.package.version & BiocLite
Just my 2 cents: it may not be a good idea to restrict software versions to gain reproducibility. To me, this kind of reproducibility is "dead" reproducibility (what if the old software has a fatal bug? do we want to reproduce the same **wrong** results?). Software packages are continuously evolving, and our research should adapt as well.

How to achieve this? I think this paper by Robert Gentleman and Duncan Temple Lang has given a nice answer: http://biostats.bepress.com/bioconductor/paper2/

With R 3.0.0 coming, it will be easy to achieve what they have outlined, because R 3.0.0 allows custom vignette builders. Basically, your research paper can be built with 'R CMD build' and checked with 'R CMD check' if you provide an appropriate builder. An R package has the great potential of becoming the ideal tool for reproducible research due to its wonderful infrastructure: functions, datasets, examples, unit tests, vignettes, dependency structure, and so on.

With the help of version control, you can easily spot the changes after you upgrade the packages. With an R package, you can automate a lot of things; e.g. install.packages() will take care of dependencies, and R CMD build can rebuild your paper. Just as Bioc has a devel version, you can continuously check your results in a devel version, so that you know what is going to break if you upgrade to new versions of other packages.

Is developing a research paper too different from developing a software package (in the context of computing)? Probably not. Long live reproducible research!

Regards,
Yihui

--
Yihui Xie <xieyihui at gmail.com>
Phone: 515-294-2465 Web: http://yihui.name
Department of Statistics, Iowa State University
2215 Snedecor Hall, Ames, IA
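As a concrete example of the paper-as-package setup Yihui describes, here is a minimal sketch; the package name, field values, and the choice of knitr as the vignette builder are placeholders, not a prescription:

    ## DESCRIPTION (excerpt)
    Package: mypaper
    Version: 0.1.0
    Depends: R (>= 3.0.0)
    Suggests: knitr
    VignetteBuilder: knitr

    ## vignettes/paper.Rnw (first lines; declares the custom vignette engine)
    %\VignetteEngine{knitr::knitr}
    %\VignetteIndexEntry{My reproducible paper}

    ## from the shell: rebuild and check the paper like any other package
    R CMD build mypaper
    R CMD check mypaper_0.1.0.tar.gz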
Mike Marchywka
2013-Mar-05 11:23 UTC
[Rd] [BioC] enabling reproducible research & R package management & install.package.version & BiocLite
I hate to ask what got this thread started, but it sounds like someone was counting on exact numeric reproducibility, or was there a bug in a specific release?

In actual fact, the best way to determine reproducibility is to run the code in a variety of packages. Alternatively, you can do everything in Java and not assume that calculations commute or associate as the code is modified, but that seems pointless. Sensitivity determination would seem to lead to more reproducible results than trying to keep a specific set of code quirks. I also seem to recall that the FPU may produce random lower-order bits in some cases, so the same code/data can give different results. Always assume FP is stochastic and plan on analyzing the "noise."
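A tiny illustration (added here, not from the post) of why bit-for-bit floating-point equality is fragile and a tolerance-based comparison is the safer check:

    (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
    ## [1] FALSE -- floating-point addition does not associate
    isTRUE(all.equal((0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3)))
    ## [1] TRUE  -- all.equal compares with a small tolerance (~1.5e-8 by default)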