On Jan 3, 2013, at 8:33 AM, Mario Bourgoin <mob at media.mit.edu> wrote:

> Dear Sir or Madam,
>
> The group of people with whom I work is now convinced of the usefulness of
> using R and its packages to meet our needs for statistical analysis. It
> has become important that R programs and scripts we create today can be run
> by someone else tomorrow, so need to use version-control. For this to work
> well, we need to version-control not just our code, but also R and the CRAN
> packages we use. (We only use CRAN for now.) Fortunately, R is under
> Subversion, and many CRAN packages are under Subversion in R-Forge.
> However, many CRAN packages do not appear to be available from R-Forge.
>
> 1- Are all CRAN packages available from some repository under version
> control? (My guess is ``no.'')
> 2- Is there an identifier on CRAN that flags a package as under version
> control in a repository? (My guess is ``no.'')
> 3- How does CRAN do version control for non-repository packages? (My guess
> is ``through the generosity of volunteer administrators'' though I would
> prefer that some version control software be involved.)
> 4- Should we decide to create a local source repository to meet our needs?
> (My guess is ``that depends.'')
> 5- Where might I find examples of groups creating and maintaining local
> source repositories for R and its packages?
>
> Sincerely,
> --
> Mario Bourgoin
I suspect that you will get various responses, so let me offer my ten cents:
1. The old versions of CRAN packages are typically, but possibly not always,
available via an "Old Sources" link on each package's page on
CRAN. You could use that approach to obtain old source versions of packages.
However, locally compiling and using the archived source version of a package
(e.g. where you previously used a precompiled binary on OSX, Windows or even
Linux in some cases) could yield behavioral changes over time. Hardware, OS,
compiler and other environmental changes (bugs, 32 versus 64 bit, differing
compiler options, etc.) could introduce subtle differences that prevent you
from exactly replicating results from previous work. These are especially
important to consider for CRAN packages that are not "pure R" (e.g. they
include C, C++, FORTRAN, etc.).
2. The old versions of contributed CRAN packages that are physically on CRAN are
not under a true file level source version control system there. It is up to
each package maintainer/author to elect to use such a tool themselves outside of
CRAN. R-Forge and GitHub are perhaps the two most popular online platforms, but
others may be used and yet others may use local offline repos that you do not
have access to. Some may not use a true version control system at all. There is
no requirement for or any enforcement of a particular development process for
contributed CRAN packages.
3. While R itself is under SVN control, that is not likely to be helpful to you
unless you are compiling R from source and keeping track of SVN rev numbers. If
you typically install precompiled binary versions of R, you will want to archive
the OS-specific R binaries that you use.
4. As noted above, it is conceivable that running code today versus running that
same code five years from now using the same versions of R and CRAN packages
that you used today can be problematic. It is not only R and the CRAN packages
that are changing, but your hardware, OS, compilers and possibly other relevant
tools that are highly likely to change as well. All of these factors can
contribute to your ability or inability to exactly replicate results over time.
Only you can determine just how much of today's R/CRAN installation and
computing environment you need to be able to replicate in the future.
5. If you have datasets that you will be using and need to replicate the same
results five years from now on the same dataset that you used today, you will
need to maintain your datasets (not just your code) in a version control system
as well.
6. You might also want to look into "Reproducible Research".
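As an illustration of point 1, the location of an archived source tarball can be built from the package name and version. This is only a sketch: the function name archive_url is made up for this example, and it assumes the usual CRAN layout of src/contrib/Archive/<package>/<package>_<version>.tar.gz. The currently published version of a package normally sits in src/contrib rather than in the archive, so check the package's page if a download fails.

```r
# Hypothetical helper: build the URL of an archived CRAN source tarball.
# Assumes the standard CRAN archive layout; 'mirror' can point at any mirror.
archive_url <- function(pkg, version, mirror = "https://cran.r-project.org") {
  sprintf("%s/src/contrib/Archive/%s/%s_%s.tar.gz", mirror, pkg, pkg, version)
}

# Example with placeholder values (not a real package):
# download.file(archive_url("somepkg", "1.0"), destfile = "somepkg_1.0.tar.gz")
```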
Bottom line, you have defined or are in the process of defining your own local
requirements and perhaps SOPs. Thus, take control of your own risk mitigation
process. Implement your own version control system locally, that includes, if
you use them, precompiled binaries of R and any CRAN packages that you may use,
so that you can replicate the state of an R installation to your own
requirements, notwithstanding hardware and OS level changes that will occur.
You will of course want to document the version of R and any third party
packages that you use when performing an analysis, so that you can track such
information for future use.
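One possible way to capture that information is to write it to a text file that is committed alongside the analysis. The sketch below uses only base R; the function name record_environment and the output format are assumptions made for this example, not an established convention. Calling sessionInfo() at the end of a script serves a similar purpose for the packages actually loaded.

```r
# Hypothetical helper: write the R version, platform and the versions of all
# installed packages to a plain-text file.
record_environment <- function(path) {
  pkgs <- installed.packages()[, c("Package", "Version"), drop = FALSE]
  lines <- c(
    paste("R version:", R.version.string),
    paste("Platform:", R.version$platform),
    paste(pkgs[, "Package"], pkgs[, "Version"])
  )
  writeLines(lines, path)
  invisible(lines)
}

# Example: record_environment("environment.txt")
```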
If you compile and install source versions of R and CRAN packages, then I would
keep source level tarballs of each in said version control system so that you
can reasonably ensure access to them when you need it, even though they may also
be available via CRAN.
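If you go that route, a checksum manifest stored next to the tarballs lets you confirm later that the archived files are unchanged. The following is a minimal sketch using tools::md5sum(); the function name checksum_manifest and the directory layout are assumptions for illustration only.

```r
# Hypothetical helper: list the *.tar.gz files in a directory together with
# their MD5 checksums, so the manifest can be stored with the tarballs.
checksum_manifest <- function(dir) {
  files <- list.files(dir, pattern = "\\.tar\\.gz$", full.names = TRUE)
  sums <- tools::md5sum(files)
  data.frame(file = basename(files), md5 = unname(sums),
             stringsAsFactors = FALSE)
}

# An archived tarball can later be installed from the local copy, e.g.
# (placeholder file name):
# install.packages("archive/somepkg_1.0.tar.gz", repos = NULL, type = "source")
```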
I would make sure that such a repo (or more likely, content/project specific
repos) is stored on a central server that is backed up offline with sufficient
frequency and redundancy to mitigate the risk of loss.
The two most popular VC tools these days are SVN and Git. There are significant
differences in the implementation models of both, so you will need to take time
to consider your own functional and operational requirements, which may
lead you in one direction or the other. That being said, I made the switch from
SVN to Git last year, even though I don't need true distributed version
control myself. There are various reasons for that switch, which are beyond the
scope of this discussion, so I won't get into details here.
I hope that the above is helpful.
Regards,
Marc Schwartz