Cook, Malcolm
2013-Mar-04 21:13 UTC
[Rd] enabling reproducible research & R package management & install.package.version & BiocLite
Hi,

In support of reproducible research at my Institute, I seek an approach to re-creating the R environments in which an analysis has been conducted.

By which I mean, the exact version of R and the exact version of all packages used in a particular R session.

I am seeking comments/criticism of this as a goal, and of the following outline of an approach:

=== When all the steps of a workflow have been finalized ===
* re-run the workflow from beginning to end
* save the results of sessionInfo() into an RDS file named after the current date and time.

=== Later, when desirous of exactly recreating this analysis ===
* read the (old) sessionInfo() into an R session
* exit with failure if the running version of R doesn't match
* compare the old sessionInfo to the currently installed packages (e.g. using packageVersion)
* where there are discrepancies, install the required version of the package (without dependencies) into a new library (named after the old sessionInfo RDS file)

Then the analyst should be able to put the new library at the front of .libPaths() and run the analysis, confident that the same versions of the packages are in use.

I have in the past used install-package-version.R to revert to previous versions of R packages successfully (https://gist.github.com/1503736). And there is a similar tool in Hadley Wickham's devtools.

But I don't know whether I need something special for (Bioconductor) packages that have been installed using biocLite, and I seek advice here.

I do understand that the R environment is not sufficient to guarantee reproducibility. Some of my colleagues have suggested saving a virtual machine with all your software/libraries/data installed.

So, I am also in general interested in what other people are doing to this end. But I am most interested in:

* is this a good idea
* is there a worked out solution
* does biocLite introduce special cases
* where do the dragons lurk

... and the like

Any tips?

Thanks,

~ Malcolm Cook
Stowers Institute / Computation Biology / Shilatifard Lab
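For concreteness, a minimal sketch of the outline above in R. The helper names (record_session, restore_session) and the use of devtools::install_version are illustrative assumptions, not code from the post:

    ## === When all the steps of the workflow have been finalized ===
    record_session <- function(dir = ".") {
      file <- file.path(dir, format(Sys.time(), "sessionInfo-%Y-%m-%d-%H%M%S.rds"))
      saveRDS(sessionInfo(), file)   # exact versions of R and all attached/loaded packages
      file
    }

    ## === Later, when exactly recreating the analysis ===
    restore_session <- function(rds) {
      old <- readRDS(rds)
      ## exit with failure if the running version of R doesn't match
      if (!identical(old$R.version$version.string, R.version$version.string))
        stop("R version mismatch: analysis was run under ", old$R.version$version.string)
      ## new library named after the old sessionInfo RDS file
      lib <- sub("\\.rds$", "", rds)
      dir.create(lib, showWarnings = FALSE)
      ## compare the old sessionInfo to what is currently installed
      pkgs   <- c(old$otherPkgs, old$loadedOnly)
      wanted <- vapply(pkgs, function(p) p$Version, character(1))
      for (pkg in names(wanted)) {
        have <- tryCatch(as.character(packageVersion(pkg)),
                         error = function(e) NA_character_)
        if (is.na(have) || have != wanted[pkg]) {
          ## where there are discrepancies, install the required version
          ## (without dependencies); devtools::install_version is one option
          ## for CRAN packages -- Bioconductor packages may need separate handling
          devtools::install_version(pkg, version = wanted[pkg],
                                    dependencies = FALSE, lib = lib)
        }
      }
      .libPaths(c(lib, .libPaths()))  # put the new library at the front of the search path
    }

    ## usage (illustrative):
    ## rds <- record_session()   # at the end of the finalized workflow
    ## restore_session(rds)      # later, before re-running the analysis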
Aaron Mackey
2013-Mar-04 21:28 UTC
[Rd] [BioC] enabling reproducible research & R package management & install.package.version & BiocLite
On Mon, Mar 4, 2013 at 4:13 PM, Cook, Malcolm <MEC@stowers.org> wrote:

> * where do the dragons lurk

Webs of interconnected dynamically loaded libraries, identical versions of R compiled with different BLAS/LAPACK options, etc. Go with the VM if you really, truly, want this level of exact reproducibility.

An alternative (and arguably more useful) strategy would be to cache the results of each computational step, and report when results differ upon re-execution with identical inputs. If you cache sessionInfo along with each result, you can identify which package(s) changed and begin to hunt down why the change occurred (possibly for the better). Couple this with the practice of keeping both code *and* results in version control, and you can move forward with a (re)analysis without being crippled by out-of-date software.

-Aaron

--
Aaron J. Mackey, PhD
Assistant Professor
Center for Public Health Genomics
University of Virginia
amackey@virginia.edu
http://www.cphg.virginia.edu/mackey
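A rough sketch of this caching strategy, assuming the digest package for hashing inputs; cached_step and the cache layout are hypothetical names, not an existing API:

    library(digest)   # hash the inputs so identical inputs map to the same cache file

    cached_step <- function(step_name, inputs, fun, cache_dir = "cache") {
      dir.create(cache_dir, showWarnings = FALSE)
      key  <- digest(list(step_name, inputs))
      file <- file.path(cache_dir, paste0(key, ".rds"))

      result <- fun(inputs)

      if (file.exists(file)) {
        old <- readRDS(file)
        ## report when results differ upon re-execution with identical inputs
        if (!isTRUE(all.equal(old$result, result)))
          warning(step_name, ": result differs from the cached run; compare the ",
                  "stored sessionInfo against sessionInfo() to see which packages changed")
      } else {
        ## cache the result together with the sessionInfo that produced it
        saveRDS(list(result = result, session = sessionInfo()), file)
      }
      result
    }

Keeping the cache directory under version control alongside the code, as suggested, gives a record of both the results and the environments that produced them.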
Steve Lianoglou
2013-Mar-04 22:15 UTC
[Rd] [BioC] enabling reproducible research & R package management & install.package.version & BiocLite
On Mon, Mar 4, 2013 at 4:28 PM, Aaron Mackey <amackey at virginia.edu> wrote:

> On Mon, Mar 4, 2013 at 4:13 PM, Cook, Malcolm <MEC at stowers.org> wrote:
>
>> * where do the dragons lurk
>
> webs of interconnected dynamically loaded libraries, identical versions of
> R compiled with different BLAS/LAPACK options, etc. Go with the VM if you
> really, truly, want this level of exact reproducibility.

Sounds like the best bet -- maybe tools like Vagrant might be useful here:

  http://www.vagrantup.com

... or maybe they're overkill? I haven't really checked them out myself too much, but my impression is that these tools (vagrant, chef, puppet) are built to handle such cases. I'd imagine you'd probably need a location from which you can grab the precise (versioned) packages for the things you are specifying, but ...

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
  | Memorial Sloan-Kettering Cancer Center
  | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
Yihui Xie
2013-Mar-04 23:04 UTC
[Rd] enabling reproducible research & R package management & install.package.version & BiocLite
Just my 2 cents: it may not be a good idea to restrict software versions to gain reproducibility. To me, this kind of reproducibility is "dead" reproducibility (what if the old software has a fatal bug? do we want to reproduce the same **wrong** results?). Software packages are continuously evolving, and our research should adapt as well.

How to achieve this? I think this paper by Robert Gentleman and Duncan Temple Lang has given a nice answer: http://biostats.bepress.com/bioconductor/paper2/

With R 3.0.0 coming, it will be easy to achieve what they have outlined, because R 3.0.0 allows custom vignette builders. Basically, your research paper can be built with 'R CMD build' and checked with 'R CMD check' if you provide an appropriate builder. An R package has the great potential of becoming the ideal tool for reproducible research due to its wonderful infrastructure: functions, datasets, examples, unit tests, vignettes, dependency structure, and so on.

With the help of version control, you can easily spot the changes after you upgrade the packages. With an R package, you can automate a lot of things; e.g. install.packages() will take care of dependencies, and R CMD build can rebuild your paper. Just as Bioc has a devel version, you can continuously check your results in a devel version, so that you know what is going to break if you upgrade to new versions of other packages.

Is developing a research paper too different from developing a software package (in the context of computing)? Probably not. Long live reproducible research!

Regards,
Yihui

--
Yihui Xie <xieyihui at gmail.com>
Phone: 515-294-2465 Web: http://yihui.name
Department of Statistics, Iowa State University
2215 Snedecor Hall, Ames, IA
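As a concrete example of the paper-as-package setup Yihui describes, here is a minimal sketch; the package name, field values, and the choice of knitr as the vignette builder are placeholders, not a prescription:

    ## DESCRIPTION (excerpt)
    Package: mypaper
    Version: 0.1.0
    Depends: R (>= 3.0.0)
    Suggests: knitr
    VignetteBuilder: knitr

    ## vignettes/paper.Rnw (first lines; declares the custom vignette engine)
    %\VignetteEngine{knitr::knitr}
    %\VignetteIndexEntry{My reproducible paper}

    ## from the shell: rebuild and check the paper like any other package
    R CMD build mypaper
    R CMD check mypaper_0.1.0.tar.gz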
Mike Marchywka
2013-Mar-05 11:23 UTC
[Rd] [BioC] enabling reproducible research & R package management & install.package.version & BiocLite
I hate to ask what got this thread started, but it sounds like someone was counting on exact numeric reproducibility, or was there a bug in a specific release?

In actual fact, the best way to determine reproducibility is to run the code in a variety of packages. Alternatively, you can do everything in Java and not assume that calculations commute or associate as the code is modified, but that seems pointless. Sensitivity determination would seem to lead to more reproducible results than trying to keep a specific set of code quirks. I also seem to recall that the FPU may produce random lower-order bits in some cases, so the same code/data can give different results. Always assume FP is stochastic and plan on analyzing the "noise."
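A tiny illustration (added here, not from the post) of why bit-for-bit floating-point equality is fragile and a tolerance-based comparison is the safer check:

    (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
    ## [1] FALSE -- floating-point addition does not associate
    isTRUE(all.equal((0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3)))
    ## [1] TRUE  -- all.equal compares with a small tolerance (~1.5e-8 by default)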