Mark Wardle
2006-Oct-26 21:00 UTC
[R] Organisation of medium/large projects with multiple analyses
Dear all,

I'm still new to R, but have fair experience with general programming. All of my data is stored in PostgreSQL, and I have a number of R files that generate tables, results, graphs etc., which are then available to be imported into PowerPoint/LaTeX etc.

I'm using version control (Subversion), and as with most small projects, I now have an ever-increasing number of R scripts, each with fairly specific features. As any project grows, there are always issues around interdependencies, shared commonality (e.g. accessing the same data store), and old scripts breaking because of changes made elsewhere (e.g. to the data schema). For example, I might have specific inclusion and exclusion criteria for patients, and the corresponding SQL query may have to appear in a number of analyses. I'm tempted to factor this out into a project-specific data access library (a sketch of what I mean is in the postscript below), but is that over the top?

This is a long-winded way of asking how people organise medium-sized projects. Do people create their own "libraries" for specific projects for shared functionality, or do people just liberally use source() for this kind of thing? What about namespaces? I've got unwieldy-sounding functions like ataxia.repeats.plot.alleles(), and often these functions are not particularly generic and are only called three or four times, but they do save repetition. Do you go to the effort of creating a library that solves your particular problem, or reserve that only for more generic functionality?

Do people keep all of their R scripts for a specific project separate, or in one big file? I can see advantages (knowing it all works) and disadvantages (time for it all to run after minor changes) in both approaches, but it is unclear to me which is "better". I do know that I've set up a variety of analyses, moved on to other things, only to find later that old scripts have stopped working because I've changed some interdependency. Does anyone go as far as using test suites to check for sane output, apart from doing things manually? (A second postscript sketches the kind of check I mean.) Note I'm not asking about how to run R on all these scripts, as people have already suggested makefiles.

I realise these are vague, high-level questions, and there won't be any "right" or "wrong" answers, but I'm grateful to hear about different strategies for organising R analyses/files and how people solve these problems. I've not seen this kind of thing covered in any of the textbooks. Apologies for being so verbose!

Best wishes,

Mark

--
Dr. Mark Wardle
Clinical research fellow and Specialist Registrar in Neurology,
C2-B2 link, Cardiff University, Heath Park, CARDIFF, CF14 4XN, UK
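P.S. To make the data-access idea concrete, here is a minimal sketch of the shared file I have in mind, assuming the DBI and RPostgreSQL packages; the table and criteria are placeholders rather than my real schema:

    # patient-data.R -- project-specific data access, kept in one place.
    # Assumes the DBI and RPostgreSQL packages are installed.
    library(RPostgreSQL)   # attaches DBI as a dependency

    ataxia.connect <- function() {
        dbConnect(PostgreSQL(), dbname = "ataxia")
    }

    # The inclusion/exclusion criteria live here and nowhere else,
    # so every analysis sees exactly the same cohort.
    ataxia.patients <- function(con) {
        dbGetQuery(con, "SELECT * FROM patient
                          WHERE diagnosis = 'ataxia'
                            AND consented AND NOT excluded")
    }

Each analysis script would then begin with source("patient-data.R") and call these functions instead of embedding its own copy of the SQL.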
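P.P.S. On the test-suite question, the kind of check I mean is no more than a script of stopifnot() calls, run by the same makefile after the analyses; any failure stops R with an error, so the makefile reports a failed target. The column names and thresholds below are invented for illustration:

    # sanity-checks.R -- crude regression tests for the analyses.
    source("patient-data.R")

    con <- ataxia.connect()
    patients <- ataxia.patients(con)

    stopifnot(nrow(patients) > 0,                     # cohort is not empty
              !any(duplicated(patients$patient_id)),  # invented id column
              all(patients$age >= 18))                # invented criterion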
Daniel Elliott
2006-Oct-29 05:39 UTC
[R] Organisation of medium/large projects with multiple analyses
Mark,

It sounds like your data/experiment storage and organization needs are more complicated than mine, but I'll share my methodology...

> I'm still new to R, but have fair experience with general programming.
> All of my data is stored in PostgreSQL, and I have a number of R files
> that generate tables, results, graphs etc., which are then available to
> be imported into PowerPoint/LaTeX etc.
>
> I'm using version control (Subversion), and as with most small projects,
> I now have an ever-increasing number of R scripts, each with fairly
> specific features.

I only use version control for generic code. For me, generic code lives not at the experiment level but at the "algorithm" level: it is only code that others would find useful - code that I hope to release to the R community.

I use object-oriented programming to simplify the more specific, experiment-level scripts that I will describe later. These objects handle plotting and data import/export, among other things (a sketch follows in a postscript).

Like you, many of my experiments are variations on the same theme. I have attempted general functions that can run many different experiments with changes only to parameters, but I found this far too cumbersome. I am now resigned to storing all code, input data, and generated output and graphs together in a single directory per experiment, with the exception of my general libraries. In practice this means copying the scripts that ran earlier experiments into a new directory, where they are (hopefully only slightly) modified to fit the new experiment. I wish I had a cooler way to handle all of this, but it does make everything very easy to rerun. I even create new files, though not necessarily new directories, for scripts that differ only in the parameters they pass to my library functions (see the second postscript).

> Do you go to the effort of creating a library that solves your
> particular problem, or reserve that only for more generic functionality?

I only use libraries and classes for code that is generic enough to be usable by the rest of the R community.

> Do people keep all of their R scripts for a specific project separate,
> or in one big file?

Files for a particular project are kept in many different directories with little structure. Experiment logs (like informal lab reports) are used if I need to revisit or rerun an experiment. By the way, I back all of this up onto tape or DVD.

> I can see advantages (knowing it all works) and disadvantages (time for
> it all to run after minor changes) in both approaches, but it is unclear
> to me which is "better". I do know that I've set up a variety of
> analyses, moved on to other things, only to find later that old scripts
> have stopped working because I've changed some interdependency. Does
> anyone go as far as using test suites to check for sane output, apart
> from doing things manually? Note I'm not asking about how to run R on
> all these scripts, as people have already suggested makefiles.

I try really, really hard never to change my libraries. If I need to modify one of the algorithms in a library, I create a new method within the same library instead. Since you use version control (which is totally awesome - do you use it for your writing as well?), you should hopefully be able to figure out quickly why an old script no longer works; in theory, breakage should only be caused by function name changes.
> I realise these are vague, high-level questions, and there won't be any
> "right" or "wrong" answers, but I'm grateful to hear about different
> strategies for organising R analyses/files and how people solve these
> problems. I've not seen this kind of thing covered in any of the
> textbooks. Apologies for being so verbose!

Not sure one could be TOO verbose here! I am constantly looking for bulletproof ways to manage these complex issues - sadly, in the past, perhaps to a fault. I feel that the use of version control for generic code and formal writing is very important.

Hope this helps. Maybe we could someday come up with a metalanguage for describing our experiments.

- dan elliott
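P.S. A minimal sketch of what I mean by those experiment-level objects, using S3 classes; every name here is invented for illustration rather than taken from a released package:

    # Constructor for a simple "experiment" object.
    experiment <- function(id, data, params) {
        structure(list(id = id, data = data, params = params),
                  class = "experiment")
    }

    # One shared plotting convention, dispatched via the plot() generic.
    plot.experiment <- function(x, ...) {
        plot(x$data, main = x$id, ...)
    }

    # A quick summary, dispatched via the print() generic.
    print.experiment <- function(x, ...) {
        cat("Experiment", x$id, ":", nrow(x$data), "rows,",
            length(x$params), "parameters\n")
    }

A per-experiment script then only has to construct one of these and call plot() and print() on it, so every experiment shares the same conventions.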
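P.P.S. And the parameter-only scripts end up as short as this; run.experiment() is a stand-in for whichever of my library functions does the real work:

    # exp-042.R -- differs from exp-041.R only in these settings.
    source("../lib/experiments.R")   # hypothetical; defines run.experiment()

    params <- list(k = 5, iterations = 1000, seed = 42)
    result <- run.experiment(params, output.dir = "exp-042")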