Ben Ganzfried
2011-Jun-07 16:41 UTC
[R] Creating a file with reusable functions accessible throughout a computational biology cancer project
Hi,

My project is set up the following way:
root directory contains the following folders:
folders: "Breast_Cancer" AND "Colorectal_Cancer" AND "Lung_Cancer" AND
"Prostate_Cancer"

I want to create a file, call it: "repeating_functions.R" and place it in
the root directory such that I can call these functions from within the
sub-folders in each type of cancer. My confusion is that I'm not sure of
the syntax to make this happen. For example:

Within the "Prostate_Cancer" folder, I have the following folders:
"curated" AND "src" AND "uncurated"

Within "uncurated" I have a ton of files, one of which could be:
PMID5377_fullpdata.csv

within "src" I have my R scripts, the one corresponding to the above
"uncurated" file would be:
PMID5377_curation.R

Here's the problem I'm trying to address:
Many of the uncurated files will require the same R code to curate them and
I find myself spending a lot of time copying and pasting the same code over
and over. I've spent at least 40 hours copying code I've already written and
pasting it into a new dataset. There has simply got to be a better way to
do this.

A common example of the code I'll write in an "uncurated" file is the
following (let's call the following snippet of code UNCURATED_EXAMPLE1):

    ##characteristics_ch1.2 -> G
    tmp <- uncurated$characteristics_ch1.2
    tmp <- sub("grade: ","",tmp,fixed=TRUE)
    tmp[tmp=="I"] <- "low"
    tmp[tmp=="II"] <- "low"
    tmp[tmp=="III"] <- "high"
    curated$G <- tmp

The thing that changes depending on the dataset is *typically* the column
header (ie "uncurated$characteristics_ch1.2" might be
"uncurated$description" or "uncurated_characteristics_ch1.7" depending on
the dataset), although sometimes I want to substitute different words (ie
"grade" can be referred to in many different ways).

What's the easiest way to automate this? I'd like, at a minimum, to make
UNCURATED_EXAMPLE1 look like the following:

    tmp <- uncurated$characteristics_ch1.2
    insert_call_to_repeating_functions.R_and_access_("grade")_function
    curated$G <- tmp

It would be even better if I could say, for Prostate_Cancer, write one R
script that standardizes all the "uncurated" datasets; rather than writing
100 different R scripts. Although I don't know how feasible this is.

I'm sorry if this sounds confusing. Basically, I have thousands of
"uncurated" datasets with clinical information and I'm trying to standardize
all the datasets via R scripts so that all the information is standardized
for statistical analysis. Not all of the datasets contain the same
information, but many of them do contain similar data (ie age, stage, grade,
days_to_recurrence, and many others). Furthermore, in many cases the
standardization code is very similar across datasets (ie I'll want to delete
the words "Age: " before the actual number). But this is not always the
case (ie sometimes a dataset will not put the different patient data (ie
age, stage, grade) in separate columns, instead putting it all in one
column, so I have to write a different function to split it by the ";" and
make a new table that is separated by column). Anyway, I would be forever
grateful for any advice to make this quicker and am happy to provide any
clarifications.

Thank you very much.

Ben
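For the single-column case described above, where age, stage, grade and so on are packed into one column separated by ";", a reusable helper could look roughly like the sketch below. It is only an illustration: the function name split_packed_column() and the "key: value" layout of each field are assumptions, not details taken from the post.

```r
# Hypothetical helper (not from the original post): split one packed column
# such as "age: 61; stage: II; grade: III" into one column per field.
# Assumes every field has the form "key: value".
split_packed_column <- function(x, sep = ";") {
  # Break every entry into its "key: value" pieces.
  pieces <- strsplit(as.character(x), sep, fixed = TRUE)
  rows <- lapply(pieces, function(p) {
    p <- trimws(p)
    p <- p[nzchar(p)]
    keys <- trimws(sub(":.*$", "", p))
    values <- trimws(sub("^[^:]*:", "", p))
    setNames(values, keys)
  })
  # Use the union of all keys so a patient with a missing field gets NA.
  all_keys <- unique(unlist(lapply(rows, names)))
  out <- lapply(all_keys, function(k) {
    vapply(rows,
           function(r) if (k %in% names(r)) r[[k]] else NA_character_,
           character(1))
  })
  names(out) <- all_keys
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Made-up example values, for illustration only.
packed <- c("age: 61; stage: II; grade: III", "age: 54; grade: I")
split_packed_column(packed)
```

All resulting columns are character vectors, so numeric fields such as age would still need as.numeric() afterwards.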
Duncan Murdoch
2011-Jun-07 16:53 UTC
[R] Creating a file with reusable functions accessible throughout a computational biology cancer project
On 07/06/2011 12:41 PM, Ben Ganzfried wrote:
> Hi,
>
> My project is set up the following way:
> root directory contains the following folders:
> folders: "Breast_Cancer" AND "Colorectal_Cancer" AND "Lung_Cancer" AND
> "Prostate_Cancer"
>
> I want to create a file, call it: "repeating_functions.R" and place it in
> the root directory such that I can call these functions from within the
> sub-folders in each type of cancer. My confusion is that I'm not sure of
> the syntax to make this happen. For example:
>
> Within the "Prostate_Cancer" folder, I have the following folders:
> "curated" AND "src" AND "uncurated"
>
> Within "uncurated" I have a ton of files, one of which could be:
> PMID5377_fullpdata.csv
>
> within "src" I have my R scripts, the one corresponding to the above
> "uncurated" file would be:
> PMID5377_curation.R
>
> Here's the problem I'm trying to address:
> Many of the uncurated files will require the same R code to curate them and
> I find myself spending a lot of time copying and pasting the same code over
> and over. I've spent at least 40 hours copying code I've already written and
> pasting it into a new dataset. There has simply got to be a better way to
> do this.

There is: you should put your common functions in a package. Packages are a good way to organize your own code, you don't need to publish them. (You will get a warning if you put "Not for distribution" into the License field in the DESCRIPTION file, but it's just a warning.)

You can also put datasets in a package; this makes sense if they are relatively static.
If you get new data every day you probably wouldn't.

> A common example of the code I'll write in an "uncurated" file is the
> following (let's call the following snippet of code UNCURATED_EXAMPLE1):
> ##characteristics_ch1.2 -> G
> tmp<- uncurated$characteristics_ch1.2
> tmp<- sub("grade: ","",tmp,fixed=TRUE)
> tmp[tmp=="I"]<- "low"
> tmp[tmp=="II"]<- "low"
> tmp[tmp=="III"]<- "high"
> curated$G<- tmp
>
> The thing that changes depending on the dataset is *typically* the column
> header (ie "uncurated$characteristics_ch1.2" might be
> "uncurated$description" or "uncurated_characteristics_ch1.7" depending on
> the dataset), although sometimes I want to substitute different words (ie
> "grade" can be referred to in many different ways).
>
> What's the easiest way to automate this? I'd like, at a minimum, to make
> UNCURATED_EXAMPLE1 look like the following:
> tmp<- uncurated$characteristics_ch1.2
> insert_call_to_repeating_functions.R_and_access_("grade")_function
> curated$G<- tmp
>
> It would be even better if I could say, for Prostate_Cancer, write one R
> script that standardizes all the "uncurated" datasets; rather than writing
> 100 different R scripts. Although I don't know how feasible this is.

Both of those sound very easy. For example,

    curate <- function(characteristic, word="grade: ") {
      tmp <- sub(word, "", characteristic, fixed=TRUE)
      tmp[tmp=="I"] <- "low"
      tmp[tmp=="II"] <- "low"
      tmp[tmp=="III"] <- "high"
      tmp
    }

Then your script would just need one line

    curated$G <- curate(uncurated$characteristics_ch1.2)

I don't know where you'll find the names of all the datasets, but if you can get them into a vector, it's pretty easy to write a loop that calls curate() for each one.

Deciding how much goes in the package and how much is one-off code that stays with a particular dataset is a judgment call. I'd guess based on your description that curate() belongs in the package but the rest doesn't, but you know a lot more about the details than I do.
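The loop mentioned above might look something like the following sketch. Everything beyond curate() itself is an assumption made for illustration: the lookup table spec, the driver curate_all(), the demo file names demo_A.csv and demo_B.csv, and the "Grade=" prefix are all invented, and the files are written to a temporary directory so the example runs anywhere.

```r
# curate() as given in the reply above.
curate <- function(characteristic, word = "grade: ") {
  tmp <- sub(word, "", characteristic, fixed = TRUE)
  tmp[tmp == "I"] <- "low"
  tmp[tmp == "II"] <- "low"
  tmp[tmp == "III"] <- "high"
  tmp
}

# Hypothetical driver: for each row of `spec`, read the file, curate the
# grade column named there, and write the result to `out_dir`.
curate_all <- function(spec, in_dir, out_dir) {
  for (i in seq_len(nrow(spec))) {
    uncurated <- read.csv(file.path(in_dir, spec$file[i]),
                          stringsAsFactors = FALSE)
    curated <- data.frame(
      G = curate(uncurated[[spec$column[i]]], spec$word[i]),
      stringsAsFactors = FALSE
    )
    write.csv(curated, file.path(out_dir, spec$file[i]), row.names = FALSE)
  }
  invisible(spec$file)
}

# Throwaway demo data standing in for the real "uncurated" folder.
in_dir <- file.path(tempdir(), "uncurated")
out_dir <- file.path(tempdir(), "curated")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
write.csv(data.frame(characteristics_ch1.2 = c("grade: I", "grade: III")),
          file.path(in_dir, "demo_A.csv"), row.names = FALSE)
write.csv(data.frame(description = c("Grade=II", "Grade=III")),
          file.path(in_dir, "demo_B.csv"), row.names = FALSE)

# One row per dataset: which column holds the grade, and which prefix to strip.
spec <- data.frame(
  file   = c("demo_A.csv", "demo_B.csv"),
  column = c("characteristics_ch1.2", "description"),
  word   = c("grade: ", "Grade="),
  stringsAsFactors = FALSE
)
curate_all(spec, in_dir, out_dir)
```

Keeping the per-dataset differences in a table like spec means a new dataset needs a new row rather than a new script, at least for the columns that follow the common pattern.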
Duncan Murdoch
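As for the literal syntax question in the original post, the simplest route short of a package is base R's source(), called with a path relative to the script's working directory. The sketch below rebuilds the described folder layout in a temporary directory so it can be run as is; the helper strip_prefix() is a made-up placeholder for whatever ends up in repeating_functions.R, and the sketch assumes the working directory is the "src" folder when the curation script runs.

```r
# Build a throwaway copy of the layout described in the thread.
root <- file.path(tempdir(), "project_root")
src_dir <- file.path(root, "Prostate_Cancer", "src")
dir.create(src_dir, recursive = TRUE, showWarnings = FALSE)

# repeating_functions.R sits in the root directory and holds shared helpers.
# strip_prefix() is a hypothetical example of such a helper.
writeLines(c(
  "strip_prefix <- function(x, prefix) {",
  "  sub(prefix, \"\", x, fixed = TRUE)",
  "}"
), file.path(root, "repeating_functions.R"))

# From a script in Prostate_Cancer/src, with the working directory set to
# that folder, the shared file is two levels up.
old_wd <- setwd(src_dir)
source(file.path("..", "..", "repeating_functions.R"))
setwd(old_wd)

# The shared function is now available to the curation script.
strip_prefix("Age: 61", "Age: ")
```

The relative path only resolves if the working directory is what the script expects, which is one reason a package tends to be more robust once the set of shared functions grows.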