thr3ads.net - R help - [R] Creating a file with reusable functions accessible throughout a computational biology cancer project [Jun 2011]

Ben Ganzfried

2011-Jun-07 16:41 UTC

[R] Creating a file with reusable functions accessible throughout a computational biology cancer project

Hi,

My project is set up the following way:
The root directory contains the following folders:
  "Breast_Cancer", "Colorectal_Cancer", "Lung_Cancer", and "Prostate_Cancer"

I want to create a file, call it "repeating_functions.R", and place it in
the root directory so that I can call its functions from within the
sub-folders for each type of cancer.  My confusion is that I'm not sure of
the syntax to make this happen.  For example:

Within the "Prostate_Cancer" folder, I have the following folders:
"curated" AND "src" AND "uncurated"

Within "uncurated" I have a ton of files, one of which could be:
PMID5377_fullpdata.csv

within "src" I have my R scripts, the one corresponding to the above
"uncurated" file would be:
PMID5377_curation.R

Here's the problem I'm trying to address:
Many of the uncurated files will require the same R code to curate them, and
I find myself spending a lot of time copying and pasting the same code over
and over. I've spent at least 40 hours copying code I've already written and
pasting it into a new dataset.  There has simply got to be a better way to
do this.

A common example of the code I'll write for an "uncurated" file is the
following (let's call the following snippet of code UNCURATED_EXAMPLE1):
##characteristics_ch1.2 -> G
tmp <- uncurated$characteristics_ch1.2
tmp <- sub("grade: ","",tmp,fixed=TRUE)
tmp[tmp=="I"] <- "low"
tmp[tmp=="II"] <- "low"
tmp[tmp=="III"] <- "high"
curated$G <- tmp

The thing that changes depending on the dataset is *typically* the column
header (ie "uncurated$characteristics_ch1.2" might be
"uncurated$description" or "uncurated$characteristics_ch1.7" depending on
the dataset), although sometimes I want to substitute different words (ie
"grade" can be referred to in many different ways).

What's the easiest way to automate this?  I'd like, at a minimum, to make
UNCURATED_EXAMPLE1 look like the following:
tmp <- uncurated$characteristics_ch1.2
insert_call_to_repeating_functions.R_and_access_("grade")_function
curated$G <- tmp

It would be even better if I could, say, for Prostate_Cancer, write one R
script that standardizes all the "uncurated" datasets, rather than writing
100 different R scripts, although I don't know how feasible this is.

I'm sorry if this sounds confusing.  Basically, I have thousands of
"uncurated" datasets with clinical information and I'm trying to standardize
all the datasets via R scripts so that all the information is standardized
for statistical analysis.  Not all of the datasets contain the same
information, but many of them do contain similar data (ie age, stage, grade,
days_to_recurrence, and many others).  Furthermore, in many cases the
standardization code is very similar across datasets (ie I'll want to delete
the words "Age: " before the actual number).  But this is not always the
case (ie sometimes a dataset will not put the different patient data (ie
age, stage, grade) in separate columns, instead putting it all in one
column, so I have to write a different function to split it by the ";" and
make a new table that is separated by column).  Anyway, I would be forever
grateful for any advice to make this quicker and am happy to provide any
clarifications.

Thank you very much.

Ben


Duncan Murdoch

2011-Jun-07 16:53 UTC

On 07/06/2011 12:41 PM, Ben Ganzfried wrote:
> Hi,
>
> My project is set up the following way:
> root directory contains the following folders:
>    folders: "Breast_Cancer" AND "Colorectal_Cancer" AND "Lung_Cancer" AND
> "Prostate_Cancer"
>
> I want to create a file, call it: "repeating_functions.R" and place it in
> the root directory such that I can call these functions from within the
> sub-folders in each type of cancer.  My confusion is that I'm not sure of
> the syntax to make this happen.  For example:
>
> Within the "Prostate_Cancer" folder, I have the following folders:
> "curated" AND "src" AND "uncurated"
>
> Within "uncurated" I have a ton of files, one of which could be:
> PMID5377_fullpdata.csv
>
> within "src" I have my R scripts, the one corresponding to the above
> "uncurated" file would be:
> PMID5377_curation.R
>
> Here's the problem I'm trying to address:
> Many of the uncurated files will require the same R code to curate them and
> I find myself spending a lot of time copying and pasting the same code over
> and over. I've spent at least 40 hours copying code I've already written and
> pasting it into a new dataset.  There has simply got to be a better way to
> do this.

There is: you should put your common functions in a package.  Packages
are a good way to organize your own code; you don't need to publish
them.  (You will get a warning if you put "Not for distribution" into
the License field in the DESCRIPTION file, but it's just a warning.)
You can also put datasets in a package; this makes sense if they are
relatively static.  If you get new data every day you probably
wouldn't.

> A common example of the code I'll write in an "uncurated" file is the
> following (let's call the following snippet of code UNCURATED_EXAMPLE1):
> ##characteristics_ch1.2 ->  G
> tmp<- uncurated$characteristics_ch1.2
> tmp<- sub("grade: ","",tmp,fixed=TRUE)
> tmp[tmp=="I"]<- "low"
> tmp[tmp=="II"]<- "low"
> tmp[tmp=="III"]<- "high"
> curated$G<- tmp
>
> The thing that changes depending on the dataset is *typically* the column
> header (ie "uncurated$characteristics_ch1.2" might be
> "uncurated$description" or "uncurated$characteristics_ch1.7" depending on
> the dataset), although sometimes I want to substitute different words (ie
> "grade" can be referred to in many different ways).
>
> What's the easiest way to automate this?  I'd like, at a minimum, to make
> UNCURATED_EXAMPLE1 look like the following:
> tmp<- uncurated$characteristics_ch1.2
> insert_call_to_repeating_functions.R_and_access_("grade")_function
> curated$G<- tmp
>
> It would be even better if I could say, for Prostate_Cancer, write one R
> script that standardizes all the "uncurated" datasets; rather than writing
> 100 different R scripts.  Although I don't know how feasible this is.

Both of those sound very easy.  For example,

curate <- function(characteristic, word="grade: ") {
   tmp <- sub(word, "", characteristic, fixed=TRUE)
   tmp[tmp=="I"] <- "low"
   tmp[tmp=="II"] <- "low"
   tmp[tmp=="III"] <- "high"
   tmp
}

Then your script would just need one line

curated$G <- curate(uncurated$characteristics_ch1.2)
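
To make the effect of the word argument concrete, here is a minimal sketch that applies curate() to two columns with different prefixes. The small uncurated data frame, the "tumor grade: " prefix, and the G_alt column are made up for illustration; only curate() itself comes from the reply above (it is repeated so the example runs on its own).

```r
# curate() as defined above, repeated so this example is self-contained
curate <- function(characteristic, word = "grade: ") {
  tmp <- sub(word, "", characteristic, fixed = TRUE)
  tmp[tmp == "I"] <- "low"
  tmp[tmp == "II"] <- "low"
  tmp[tmp == "III"] <- "high"
  tmp
}

# Hypothetical uncurated data: the same information under two column
# headers, each with a different text prefix
uncurated <- data.frame(
  characteristics_ch1.2 = c("grade: I", "grade: III", "grade: II"),
  description = c("tumor grade: II", "tumor grade: I", "tumor grade: III"),
  stringsAsFactors = FALSE
)

# Empty data frame with one row per sample, to be filled column by column
curated <- data.frame(row.names = seq_len(nrow(uncurated)))

# Default prefix
curated$G <- curate(uncurated$characteristics_ch1.2)

# Different column and different prefix: only the arguments change
curated$G_alt <- curate(uncurated$description, word = "tumor grade: ")
```

Because the column and the prefix are both arguments, a dataset that labels grade differently needs a different call, not a different copy of the code.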

I don't know where you'll find the names of all the datasets, but if you
can get them into a vector, it's pretty easy to write a loop that calls 
curate() for each one.
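
A loop of that kind might look roughly like the sketch below. It assumes the uncurated files share a "_fullpdata.csv" suffix and all keep grade in a column named characteristics_ch1.2; the file names PMID0001 and PMID0002, the temporary directory, and the curated_list variable are hypothetical and stand in for a real "uncurated" folder.

```r
# curate() as defined above, repeated so this example is self-contained
curate <- function(characteristic, word = "grade: ") {
  tmp <- sub(word, "", characteristic, fixed = TRUE)
  tmp[tmp == "I"] <- "low"
  tmp[tmp == "II"] <- "low"
  tmp[tmp == "III"] <- "high"
  tmp
}

# Hypothetical stand-in for an "uncurated" folder, built in a temp directory
uncurated_dir <- tempfile("uncurated")
dir.create(uncurated_dir)
write.csv(data.frame(characteristics_ch1.2 = c("grade: I", "grade: III")),
          file.path(uncurated_dir, "PMID0001_fullpdata.csv"),
          row.names = FALSE)
write.csv(data.frame(characteristics_ch1.2 = c("grade: III", "grade: II")),
          file.path(uncurated_dir, "PMID0002_fullpdata.csv"),
          row.names = FALSE)

# Get the dataset names into a vector
files <- list.files(uncurated_dir, pattern = "_fullpdata\\.csv$",
                    full.names = TRUE)

# Curate each dataset in turn, keeping the results in a named list
curated_list <- list()
for (f in files) {
  uncurated <- read.csv(f, stringsAsFactors = FALSE)
  curated <- data.frame(G = curate(uncurated$characteristics_ch1.2),
                        stringsAsFactors = FALSE)
  curated_list[[basename(f)]] <- curated
}
```

Datasets whose grade lives in a different column would need a per-dataset lookup of the column name (for example a named vector keyed by file name), which is left out of this sketch.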

Deciding how much goes in the package and how much is one-off code that 
stays with a particular dataset is a judgment call.  I'd guess based on 
your description that curate() belongs in the package but the rest 
doesn't, but you know a lot more about the details than I do.
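
On the literal question of syntax: a lighter-weight alternative to a package (not the approach recommended above, and with no documentation or checking) is to source() the shared file by relative path. The sketch below assumes the script runs with Prostate_Cancer/src as its working directory; the temporary project root is hypothetical and exists only so the example runs on its own. In a package, the same function definition would instead live in a file under the package's R/ directory.

```r
# Hypothetical project layout, built in a temporary directory for illustration
root <- tempfile("project_root")
src_dir <- file.path(root, "Prostate_Cancer", "src")
dir.create(src_dir, recursive = TRUE)

# Write the shared file into the project root; its contents are curate()
# from the reply above
writeLines(c(
  'curate <- function(characteristic, word = "grade: ") {',
  '  tmp <- sub(word, "", characteristic, fixed = TRUE)',
  '  tmp[tmp == "I"] <- "low"',
  '  tmp[tmp == "II"] <- "low"',
  '  tmp[tmp == "III"] <- "high"',
  '  tmp',
  '}'
), file.path(root, "repeating_functions.R"))

# A script whose working directory is Prostate_Cancer/src can reach the
# project root two levels up with a relative path
old_wd <- setwd(src_dir)
source(file.path("..", "..", "repeating_functions.R"))
setwd(old_wd)

# curate() is now available to the calling script
grades <- curate(c("grade: I", "grade: III"))
```

The relative path only works if every script is run from its own src folder, which is one reason a package tends to be more robust as the project grows.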

Duncan Murdoch

> I'm sorry if this sounds confusing.  Basically, I have thousands of
> "uncurated" datasets with clinical information and I'm trying to standardize
> all the datasets via R scripts so that all the information is standardized
> for statistical analysis.  Not all of the datasets contain the same
> information, but many of them do contain similar data (ie age, stage, grade,
> days_to_recurrence, and many others).  Furthermore, in many cases the
> standardization code is very similar across datasets (ie I'll want to delete
> the words "Age: " before the actual number).  But this is not always the
> case (ie sometimes a dataset will not put the different patient data (ie
> age, stage, grade) in separate columns, instead putting it all in one
> column, so I have to write a different function to split it by the ";" and
> make a new table that is separated by column).  Anyway, I would be forever
> grateful for any advice to make this quicker and am happy to provide any
> clarifications.
>
> Thank you very much.
>
> Ben
