Alexander Ploner
2005-Oct-11 08:04 UTC
[R] Q: Suggestions for long-term data/program storage policy?
Dear list,

we are a statistical/epidemiological department that - after a few years of rapid growth - finally is getting around to formulate a general data storage and retention policy - mainly to ensure that we can reproduce results from published papers/theses easier in the future, but also with the hope that we get more synergy between related projects.

We have formulated what we feel is a reasonable draft, requiring basically that the raw data, all programs to create derived data sets, and the analysis programs are stored and documented in a uniform manner, regardless of the analysis software used. The minimum data retention we are aiming for is 10 years, and the format for the raw data is quite sane (either flat ASCII or real

Given the rapid development cycle of R, this suggests that at the very least all non-base packages used in the analysis are stored together with each project. I have basically two questions:

1) Are old R versions (binaries/sources) going to be available on CRAN indefinitely?

2) Is .RData a reasonable file format for long term storage?

I would also be very grateful for any other suggestions, comments or links for setting up and implementing such a storage policy (R-specific or otherwise).

Thank you for your time,

alexander

Alexander.Ploner@meb.ki.se
Medical Epidemiology & Biostatistics
Karolinska Institutet, Stockholm
Tel: ++46-8-524-82329
Fax: ++46-8-31 49 75
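As a rough illustration of keeping package information together with each project, the sketch below writes the R version and the versions of all currently loaded packages to a plain-text manifest that can be archived next to the analysis programs. The helper name `record_environment` and the manifest layout are assumptions made for this example, not part of any proposed policy, and a manifest only records versions; it does not archive the package sources themselves.

```r
# Hypothetical helper (name and file layout are assumptions for illustration):
# write the R version and the versions of all loaded packages to a text file.
record_environment <- function(path) {
  pkgs <- loadedNamespaces()
  # Look up the installed version of each loaded package as a character string
  versions <- vapply(
    pkgs,
    function(p) as.character(utils::packageVersion(p)),
    character(1)
  )
  manifest <- data.frame(
    package = pkgs,
    version = versions,
    stringsAsFactors = FALSE
  )
  manifest <- manifest[order(manifest$package), ]
  # First line: R version; following lines: "<package> <version>"
  writeLines(
    c(paste("R:", R.version.string),
      paste(manifest$package, manifest$version)),
    path
  )
  invisible(manifest)
}

# Usage: call at the end of an analysis script, after all packages are loaded
record_environment(file.path(tempdir(), "R_environment.txt"))
```

Calling the helper at the end of each analysis script means the manifest reflects the packages that were actually loaded for that run.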
sosman
2005-Oct-11 09:02 UTC
[R] Q: Suggestions for long-term data/program storage policy?
Alexander Ploner wrote:
> I would also be very grateful for any other suggestions, comments or
> links for setting up and implementing such a storage policy
> (R-specific or otherwise).

I am coming more from a software development angle, but you might want to take a look at Subversion for versioning your projects. For non-geeky types, TortoiseSVN has a point-and-click interface. Subversion handles binary files efficiently, and you can easily go back and get earlier versions of your projects.

http://subversion.tigris.org/
Duncan Murdoch
2005-Oct-11 10:54 UTC
[R] Q: Suggestions for long-term data/program storage policy?
Alexander Ploner wrote:
> Given the rapid development cycle of R, this suggests that at the very
> least all non-base packages used in the analysis are stored together
> with each project. I have basically two questions:
>
> 1) Are old R versions (binaries/sources) going to be available on
> CRAN indefinitely?

I think sources will be, binaries much less reliably. (I just discovered that one or two of the old Windows binaries are corrupted; I'm not sure I'll be able to find good copies.)

> 2) Is .RData a reasonable file format for long term storage?

I think the intention is that it will be supported in future versions of R, but storing data in a binary format is risky. What if you don't use R in 5 years? You would find it a lot easier to decode text format files in another package than .RData format.

The other advantage of text format is that it works very well with version control systems like Subversion or CVS. You can see several versions of the file, see comments on why changes were made, etc.

Duncan Murdoch
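A minimal sketch of the text-format suggestion above, assuming a small made-up data frame in place of a real derived data set. It writes the data as CSV (readable by almost any package) and as a `dput()` text representation (R-specific, but plain text that diffs cleanly under version control), then reads both back. The object and file names are hypothetical.

```r
# Made-up example data standing in for a derived analysis data set
dat <- data.frame(
  id      = 1:3,
  age     = c(34.5, 51.2, 47.8),
  exposed = c(TRUE, FALSE, TRUE)
)

# Plain CSV: software-neutral, line-oriented, friendly to version control
csv_file <- file.path(tempdir(), "derived_data.csv")
write.csv(dat, csv_file, row.names = FALSE)

# dput(): a text representation of the R object, including column types
dput_file <- file.path(tempdir(), "derived_data.R")
dput(dat, file = dput_file)

# Read both copies back in
restored_csv  <- read.csv(csv_file)
restored_dput <- dget(dput_file)

# Compare the restored copies with the original
all.equal(dat, restored_csv)
all.equal(dat, restored_dput)
```

CSV loses R-specific attributes such as factor levels, so a short text description of each column is worth storing alongside it.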
Prof Brian Ripley
2005-Oct-11 11:04 UTC
[R] Q: Suggestions for long-term data/program storage policy?
On Tue, 11 Oct 2005, Alexander Ploner wrote:

> we are a statistical/epidemiological department that - after a few
> years of rapid growth - finally is getting around to formulate a
> general data storage and retention policy - mainly to ensure that we
> can reproduce results from published papers/theses easier in the
> future, but also with the hope that we get more synergy between
> related projects.
>
> We have formulated what we feel is a reasonable draft, requiring
> basically that the raw data, all programs to create derived data
> sets, and the analysis programs are stored and documented in a
> uniform manner, regardless of the analysis software used. The minimum
> data retention we are aiming for is 10 years, and the format for the
> raw data is quite sane (either flat ASCII or real

You are intending to retain copies of the OS used and hardware too? The results depend far more on those than you apparently realize.

> Given the rapid development cycle of R,

I think you will find your OS changes as fast: all those security updates potentially affect your results.

> this suggests that at the very least all non-base packages used in the
> analysis are stored together with each project. I have basically two
> questions:
>
> 1) Are old R versions (binaries/sources) going to be available on
> CRAN indefinitely?

Not binaries. The intention is that source files be available, but they could become corrupted (as it seems the Windows binary has for a past version).

> 2) Is .RData a reasonable file format for long term storage?

I would say not, as it is almost impossible to recover from any corruption in such a file. We like to have long-term data in a human-readable printout, with a print copy, and also store some checksums.

> I would also be very grateful for any other suggestions, comments or
> links for setting up and implementing such a storage policy
> (R-specific or otherwise).

You need to consider the medium on which you are going to store the archive. We currently use CD-R (and not tapes, as those are less compatible across drives -- we have two identical drives currently but do not expect either to last 10 years), and check them annually -- I guess we will re-write to another medium after much less than 10 years.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
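The checksum idea mentioned above could be sketched as follows, using `tools::md5sum()` from the tools package that ships with R. The helper names `write_checksums` and `verify_checksums`, the tab-separated manifest layout, and the choice of MD5 are assumptions for illustration only; the message does not say which checksum or tooling is actually used.

```r
# Hypothetical helpers (names and manifest layout are assumptions):
# store MD5 checksums next to archived files and re-check them later.

write_checksums <- function(files, manifest) {
  sums <- tools::md5sum(files)
  # One row per file: checksum and file name, tab-separated plain text
  write.table(
    data.frame(md5 = unname(sums), file = basename(files)),
    manifest,
    sep = "\t", quote = FALSE, row.names = FALSE
  )
  invisible(sums)
}

verify_checksums <- function(dir, manifest) {
  expected <- read.table(
    manifest,
    header = TRUE, sep = "\t", quote = "", comment.char = "",
    colClasses = "character"
  )
  # md5sum() returns NA for files that are missing or unreadable
  actual <- unname(tools::md5sum(file.path(dir, expected$file)))
  ok <- !is.na(actual) & actual == expected$md5
  setNames(ok, expected$file)
}

# Usage on a throwaway directory
archive_dir <- tempfile("archive")
dir.create(archive_dir)
raw_file <- file.path(archive_dir, "raw.txt")
writeLines("1,2,3", raw_file)
manifest_file <- file.path(archive_dir, "MD5SUMS.tsv")
write_checksums(raw_file, manifest_file)
verify_checksums(archive_dir, manifest_file)
```

Running the verification step as part of an annual media check would flag files that have changed or gone missing since the manifest was written.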
Sean Davis
2005-Oct-11 11:11 UTC
[R] Q: Suggestions for long-term data/program storage policy?
On 10/11/05 6:54 AM, "Duncan Murdoch" <murdoch at stats.uwo.ca> wrote:

> Alexander Ploner wrote:
>> I would also be very grateful for any other suggestions, comments or
>> links for setting up and implementing such a storage policy
>> (R-specific or otherwise).

I would also consider a relational database (such as MySQL or PostgreSQL) for your data warehousing. These products (particularly PostgreSQL) are designed with data integrity first and foremost. Data formats can change over time, but the data can be easily extracted from the database to match the needs at hand. Data generated at different times can be easily mined and combined as needed. The data backup process is fairly straightforward.

R already integrates with several relational database systems, so an integrated solution can be defined if one so desires. Look at RMySQL, Rdbi, and RdbiPgSQL for how to integrate R with MySQL and PostgreSQL.

Sean