Hello,

we are collecting information on the subject of research data management, in German, on the web platform:

www.forschungsdaten.info

One of the topics we are writing about is how to *archive* data. Unfortunately, none of us in the project is an expert in R, so I would like to ask the list what you recommend. A related question concerns the sharing of data. We have already asked some academics, who basically replied that they don't really know, other than to strongly recommend a plain text format.

We would also like to know whether members of the list recommend converting formats from commercial software such as S-Plus, TERR, SPSS etc. to an R-compatible format for long-term archiving. Are there any general rules and best practices when it comes to archiving (and sharing) statistical data and statistical programs?

Any comments would be much appreciated!
Joe

--
B 1003
Kommunikations-, Informations-, Medienzentrum (KIM)
Universitaet Konstanz

t: ++49-7531-883234
e: joe.gain at uni-konstanz.de
Joe:

1. This may be the wrong forum for this question, as this list is about R programming issues. However, I don't know what the right forum would be. You might consider stats.stackexchange.com. Some IT forum might be better (but which???).

2. A Google search on "data formats for archiving" (or similar) seemed to produce many useful hits. I think you'd learn more from that than you would here.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Wed, Mar 29, 2017 at 1:44 AM, Joe Gain <joe.gain at uni-konstanz.de> wrote:
> Hello,
>
> we are collecting information on the subject of research data management,
> in German, on the web platform:
>
> www.forschungsdaten.info
>
> One of the topics we are writing about is how to *archive* data.
> Unfortunately, none of us in the project is an expert in R, so I would
> like to ask the list what you recommend. A related question concerns the
> sharing of data. We have already asked some academics, who basically
> replied that they don't really know, other than to strongly recommend a
> plain text format.
>
> We would also like to know whether members of the list recommend
> converting formats from commercial software such as S-Plus, TERR, SPSS
> etc. to an R-compatible format for long-term archiving. Are there any
> general rules and best practices when it comes to archiving (and sharing)
> statistical data and statistical programs?
>
> Any comments would be much appreciated!
> Joe
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Dear Joe,

I'd choose a plain text format. Plain text files can be read and parsed with a very wide range of software. That is IMHO a much more important factor for long-term archiving than file size or the ease of reading the data with specific software.

The choice between tab-delimited, comma-separated values, XML, JSON, ... will depend upon the data (and the metadata).

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey
The relevance to R (and therefore R-help) of this question is marginal at best. R might not be the language of choice when you go to retrieve the data.

Also, this question seems dangerously close to a troll, because the obvious answer is that the data should be in an open format. But if you are not currently working with data in an open format, then you increase the cost of archiving and risk losing information up front by extracting it from a proprietary format, and balancing those concerns is more political than technical.

Note that open binary formats exist, and the goals of your archiving task and the nature of the data would have to be considered in deciding which of the many to use. My own experience has been that plain text survives time best, but YMMV.

--
Sent from my phone. Please excuse my brevity.
On 29.03.2017 17:36, Jeff Newmiller wrote:
> The relevance to R (and therefore R-help) of this question is marginal at
> best. R might not be the language of choice when you go to retrieve the
> data.
>
> Also, this question seems dangerously close to a troll, because the
> obvious answer is that the data should be in an open format. But if you
> are not currently working with data in an open format, then you increase
> the cost of archiving and risk losing information up front by extracting
> it from a proprietary format, and balancing those concerns is more
> political than technical.
>
> Note that open binary formats exist, and the goals of your archiving task
> and the nature of the data would have to be considered in deciding which
> of the many to use. My own experience has been that plain text survives
> time best, but YMMV.

Well, I didn't mean to troll the list. We have a small section on R, and in response to a question that we got from a user, we thought it would be a good idea to check with some actual R users.

I think the responses are pretty much in line with what we expected. There's unsurprisingly no simple solution. A text format is advantageous because of the many options that a user has for working with text data. Your point about the format of the source data is valid: it can be a clear constraint (other constraints are, for example, of a legal nature). I'm not trying to advocate for open formats per se, just trying to gather information so as to be able to make a recommendation.

I think we need to restructure the information on our web platform to clearly differentiate between data and the source code, scripts etc. which are used to process the data ("algorithms"). There is a big problem with data that has been archived but nobody knows what it is or was for. Archiving, sharing and reproducibility are important subjects, and we are interested in the experience of statisticians in dealing with these problems.

Thanks for the replies!
Joe
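To make two of the points raised so far concrete (plain text for the data, and a record of what the data is for), here is a minimal sketch and not code from any poster in this thread. It is written in Python rather than R only to underline that plain text is not tied to one tool. It stores a small table as CSV next to a JSON "sidecar" file describing the columns. The file names, column names, sample values and the layout of the metadata are all hypothetical; a real archive would more likely follow a metadata standard used in its discipline.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical example data: names, units and values are invented for illustration.
ROWS = [
    {"plot_id": "A1", "height_cm": "12.5", "measured_on": "2017-03-29"},
    {"plot_id": "A2", "height_cm": "", "measured_on": "2017-03-30"},
]

# Hypothetical data dictionary; this layout is an assumption, not a standard.
METADATA = {
    "encoding": "utf-8",
    "delimiter": ",",
    "missing_value": "",
    "columns": {
        "plot_id": "Plot identifier (text)",
        "height_cm": "Plant height in centimetres (decimal point, not comma)",
        "measured_on": "Measurement date, ISO 8601 (YYYY-MM-DD)",
    },
}


def write_archive(directory, rows, metadata):
    """Write the table as CSV plus a JSON sidecar file describing it."""
    directory = Path(directory)
    data_path = directory / "measurements.csv"
    meta_path = directory / "measurements.metadata.json"
    with open(data_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(metadata["columns"]))
        writer.writeheader()
        writer.writerows(rows)
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return data_path, meta_path


def read_archive(data_path, meta_path):
    """Read the table back using only what the sidecar file documents."""
    metadata = json.loads(Path(meta_path).read_text(encoding="utf-8"))
    with open(data_path, newline="", encoding=metadata["encoding"]) as handle:
        reader = csv.DictReader(handle, delimiter=metadata["delimiter"])
        return list(reader), metadata


# Round trip in a temporary directory: write both files, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    data_file, meta_file = write_archive(tmp, ROWS, METADATA)
    restored, restored_meta = read_archive(data_file, meta_file)
    print(restored == ROWS)
```

Every value is kept as text on purpose, so the round trip loses nothing; how a column should be interpreted (number, date, missing-value code) is documented in the sidecar file instead of being implied by a particular program.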
Hi Joe,

I have read your question with great interest. I am a little astonished to read about your project. There is a big national institute in Germany called GESIS (https://de.wikipedia.org/wiki/GESIS_%E2%80%93_Leibniz-Institut_f%C3%BCr_Sozialwissenschaften), which has been doing the job you are trying to set up since 1986. You could try to exchange ideas with them.

Your subject is very complex with regard to reproducible research. You might want to have a look at

(1) https://cran.r-project.org/web/views/ReproducibleResearch.html
(2) Gandrud, Christopher: Reproducible Research with R and RStudio (https://www.amazon.com/Reproducible-Research-Studio-Second-Chapman/dp/1498715370)

Kind regards

Georg

> Sent: Wednesday, 29 March 2017 at 10:44
> From: "Joe Gain" <joe.gain at uni-konstanz.de>
> To: R-help at r-project.org
> Cc: bwfdm-info at lists.kit.edu
> Subject: [R] Archive format
Hi Georg,

On 08.04.2017 09:04, G.Maubach at gmx.de wrote:
> Hi Joe,
>
> I have read your question with great interest. I am a little astonished
> to read about your project. There is a big national institute in Germany
> called GESIS
> (https://de.wikipedia.org/wiki/GESIS_%E2%80%93_Leibniz-Institut_f%C3%BCr_Sozialwissenschaften),
> which has been doing the job you are trying to set up since 1986. You
> could try to exchange ideas with them.

We've already had some contact with GESIS. I agree that it would be a good idea to communicate and cooperate more with GESIS, although there are many interesting organisations, all doing their own thing, and it's not always easy to do so. We organised a conference in Heidelberg, "The E-Science Tage", and I was at the GESIS presentation, which was very good.

> Your subject is very complex with regard to reproducible research. You
> might want to have a look at
> (1) https://cran.r-project.org/web/views/ReproducibleResearch.html
> (2) Gandrud, Christopher: Reproducible Research with R and RStudio
> (https://www.amazon.com/Reproducible-Research-Studio-Second-Chapman/dp/1498715370)

Thanks for the useful links. (There's a whole book about R and reproducible research!)

The general goal of the web platform is to increase researchers' awareness of research data management. The topic _is_ very complicated, and it's difficult to write up a general approach, especially when you consider the different research disciplines etc. Nevertheless, that is what we are trying to do. Where possible, and when the information becomes too specific, we will include links to further resources (such as those you have recommended above). Also, the project is to some extent dependent on feedback from users, especially when they are able to provide us with information that improves the content of the web platform.

> Kind regards
>
> Georg

Thanks for taking the time to reply to my question.

All the best,
Joe