My goal is simple: calcuate GC content of each sequence in a list of nucleotide sequences. I have figured out how to vectorize, but all my attempts at memoization failed. Can you show me how to properly memoize my function? There is a StackOverflow post on the subject of memoization, but it does not help me: http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r I haven't been able to find any other discussions on this subject. Searching for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching for those keywords at http://r-project.markmail.org/ does not return helpful discussions. Here's my data: seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC") Some sequences are missing, so they're blank `""`. I have a function for calculating GC content: > GC <- function(s) { if (!is.character(s)) return(NA) n <- nchar(s) if (n == 0) return(NA) m <- gregexpr('[GCSgcs]', s)[[1]] if (m[1] < 1) return(0) return(100.0 * length(m) / n) } It works: > GC('') [1] NA > GC('G') [1] 100 > GC('GAG') [1] 66.66667 > sapply(seqs, GC) G C CCC T TTCCT C NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 NA 100.00000 CTC 66.66667 I want to memoize it. Then, I want to vectorize it. Should be easy, right? Apparently, I must have the wrong mindset for using the `memoise` or `R.cache` R packages: > system.time(dummy <- sapply(rep(seqs,100), GC)) user system elapsed 0.044 0.000 0.054 > > library(memoise) > GCm1 <- memoise(GC) > system.time(dummy <- sapply(rep(seqs,100), GCm1)) user system elapsed 0.164 0.000 0.173 > > library(R.cache) > GCm2 <- addMemoization(GC) > system.time(dummy <- sapply(rep(seqs,100), GCm2)) user system elapsed 10.601 0.252 10.926 Notice that the memoized functions are several orders of magnitude slower. I tried the `hash` package, but things seem to be happening behind the scenes and I don't understand the output: > cache <- hash() > GCc <- function(s) { if (!is.character(s) || nchar(s) == 0) { return(NA) } if(exists(s, cache)) { return(cache[[s]]) } result <- GC(s) cache[[s]] <<- result return(result) } > sapply(seqs,GCc) [[1]] [1] NA $G [1] 100 $C NULL $CCC [1] 100 $T NULL [[6]] [1] NA $TTCCT [1] 40 [[8]] [1] NA $C NULL $CTC [1] 66.66667 At least I figured out how to vectorize: > GCv <- Vectorize(GC) > GCv(seqs) G C CCC T TTCCT C 0.00000 100.00000 100.00000 100.00000 0.00000 0.00000 40.00000 0.00000 100.00000 CTC 66.66667 [[alternative HTML version deleted]]
On 04/26/2012 03:21 PM, Kamil Slowikowski wrote:> My goal is simple: calcuate GC content of each sequence in a list of > nucleotide > sequences. I have figured out how to vectorize, but all my attempts at > memoization failed. > > Can you show me how to properly memoize my function? > > There is a StackOverflow post on the subject of memoization, but it does not > help me: > http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r > > I haven't been able to find any other discussions on this subject. Searching > for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching > for those keywords at http://r-project.markmail.org/ does not return helpful > discussions. > > Here's my data: > > seqs<- c("","G","C","CCC","T","","TTCCT","","C","CTC") > > Some sequences are missing, so they're blank `""`. > > I have a function for calculating GC content: > > > GC<- function(s) { > if (!is.character(s)) return(NA) > n<- nchar(s) > if (n == 0) return(NA) > m<- gregexpr('[GCSgcs]', s)[[1]] > if (m[1]< 1) return(0) > return(100.0 * length(m) / n) > } > > It works: > > > GC('') > [1] NA > > GC('G') > [1] 100 > > GC('GAG') > [1] 66.66667 > > sapply(seqs, GC) > G C CCC T TTCCT > C > NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 > NA 100.00000 > CTC > 66.66667 > > I want to memoize it. Then, I want to vectorize it. Should be easy, right?Hi Kamil -- Not really an answer to your question, but looking at http://bioconductor.org/packages/2.10/bioc/html/Biostrings.html will tell you to install Biostrings with source("http://bioconductor.org/biocLite.R") biocLite("Biostrings") and then library(Biostrings) dna = DNAStringSet(c("","G","C","CCC","T","","TTCCT","","C","CTC")) alf = alphabetFrequency(dna, as.prob=TRUE, baseOnly=TRUE) rowSums(alf[,c("G", "C")]) will give you GC content of each string. > rowSums(alf[,c("G", "C")]) [1] NaN 1.0000000 1.0000000 1.0000000 0.0000000 NaN 0.4000000 [8] NaN 1.0000000 0.6666667 this will be fast and scalable; Biostrings and other Bioconductor (http://bioconductor.org) packages have many useful functions for working with DNA. See the Bioconductor mailing list for more help if this is a promising direction. http://bioconductor.org/help/mailing-list/ Martin> > Apparently, I must have the wrong mindset for using the `memoise` or > `R.cache` > R packages: > > > system.time(dummy<- sapply(rep(seqs,100), GC)) > user system elapsed > 0.044 0.000 0.054 > > > > library(memoise) > > GCm1<- memoise(GC) > > system.time(dummy<- sapply(rep(seqs,100), GCm1)) > user system elapsed > 0.164 0.000 0.173 > > > > library(R.cache) > > GCm2<- addMemoization(GC) > > system.time(dummy<- sapply(rep(seqs,100), GCm2)) > user system elapsed > 10.601 0.252 10.926 > > Notice that the memoized functions are several orders of magnitude slower. > > I tried the `hash` package, but things seem to be happening behind the > scenes > and I don't understand the output: > > > cache<- hash() > > GCc<- function(s) { > if (!is.character(s) || nchar(s) == 0) { > return(NA) > } > if(exists(s, cache)) { > return(cache[[s]]) > } > result<- GC(s) > cache[[s]]<<- result > return(result) > } > > sapply(seqs,GCc) > [[1]] > [1] NA > > $G > [1] 100 > > $C > NULL > > $CCC > [1] 100 > > $T > NULL > > [[6]] > [1] NA > > $TTCCT > [1] 40 > > [[8]] > [1] NA > > $C > NULL > > $CTC > [1] 66.66667 > > At least I figured out how to vectorize: > > > GCv<- Vectorize(GC) > > GCv(seqs) > G C CCC T TTCCT > C > 0.00000 100.00000 100.00000 100.00000 0.00000 0.00000 40.00000 > 0.00000 100.00000 > CTC > 66.66667 > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
On Thu, Apr 26, 2012 at 3:21 PM, Kamil Slowikowski <kslowikowski at gmail.com> wrote:> My goal is simple: calcuate GC content of each sequence in a list of > nucleotide > sequences. I have figured out how to vectorize, but all my attempts at > memoization failed. > > Can you show me how to properly memoize my function? > > There is a StackOverflow post on the subject of memoization, but it does not > help me: > http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r > > I haven't been able to find any other discussions on this subject. Searching > for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching > for those keywords at http://r-project.markmail.org/ does not return helpful > discussions. > > Here's my data: > > ? ?seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC") > > Some sequences are missing, so they're blank `""`. > > I have a function for calculating GC content: > > ? ?> GC <- function(s) { > ? ? ? ?if (!is.character(s)) return(NA) > ? ? ? ?n <- nchar(s) > ? ? ? ?if (n == 0) return(NA) > ? ? ? ?m <- gregexpr('[GCSgcs]', s)[[1]] > ? ? ? ?if (m[1] < 1) return(0) > ? ? ? ?return(100.0 * length(m) / n) > ? ?} > > It works: > > ? ?> GC('') > ? ?[1] NA > ? ?> GC('G') > ? ?[1] 100 > ? ?> GC('GAG') > ? ?[1] 66.66667 > ? ?> sapply(seqs, GC) > ? ? ? ? ? ? ? ? ? ? ?G ? ? ? ? C ? ? ? CCC ? ? ? ? T ? ? ? ? ? ? ? TTCCT > ? ? ? ? ? ? ? ?C > ? ? ? ? ? NA 100.00000 100.00000 100.00000 ? 0.00000 ? ? ? ?NA ?40.00000 > ? ? NA 100.00000 > ? ? ? ? ?CTC > ? ? 66.66667 > > I want to memoize it. Then, I want to vectorize it. Should be easy, right? > > Apparently, I must have the wrong mindset for using the `memoise` or > `R.cache` > R packages: > > ? ?> system.time(dummy <- sapply(rep(seqs,100), GC)) > ? ? ? user ?system elapsed > ? ? ?0.044 ? 0.000 ? 0.054 > ? ?> > ? ?> library(memoise) > ? ?> GCm1 <- memoise(GC) > ? ?> system.time(dummy <- sapply(rep(seqs,100), GCm1)) > ? ? ? user ?system elapsed > ? ? ?0.164 ? 0.000 ? 0.173 > ? ?> > ? ?> library(R.cache) > ? ?> GCm2 <- addMemoization(GC) > ? ?> system.time(dummy <- sapply(rep(seqs,100), GCm2)) > ? ? ? user ?system elapsed > ? ? 10.601 ? 0.252 ?10.926 > > Notice that the memoized functions are several orders of magnitude slower.About R.cache: All memoization by R.cache is currently done toward the file system. In other words, it is designed for larger objects (so you cannot hold all of the cache in memory) and more computationally expensive tasks. /Henrik> > I tried the `hash` package, but things seem to be happening behind the > scenes > and I don't understand the output: > > ? ?> cache <- hash() > ? ?> GCc <- function(s) { > ? ? ? ?if (!is.character(s) || nchar(s) == 0) { > ? ? ? ? ? ?return(NA) > ? ? ? ?} > ? ? ? ?if(exists(s, cache)) { > ? ? ? ? ? ?return(cache[[s]]) > ? ? ? ?} > ? ? ? ?result <- GC(s) > ? ? ? ?cache[[s]] <<- result > ? ? ? ?return(result) > ? ?} > ? ?> sapply(seqs,GCc) > ? ?[[1]] > ? ?[1] NA > > ? ?$G > ? ?[1] 100 > > ? ?$C > ? ?NULL > > ? ?$CCC > ? ?[1] 100 > > ? ?$T > ? ?NULL > > ? ?[[6]] > ? ?[1] NA > > ? ?$TTCCT > ? ?[1] 40 > > ? ?[[8]] > ? ?[1] NA > > ? ?$C > ? ?NULL > > ? ?$CTC > ? ?[1] 66.66667 > > At least I figured out how to vectorize: > > ? ?> GCv <- Vectorize(GC) > ? ?> GCv(seqs) > ? ? ? ? ? ? ? ? ? ? ?G ? ? ? ? C ? ? ? CCC ? ? ? ? T ? ? ? ? ? ? ? TTCCT > ? ? ? ? ? ? ? ?C > ? ? ?0.00000 100.00000 100.00000 100.00000 ? 0.00000 ? 0.00000 ?40.00000 > 0.00000 100.00000 > ? ? ? ? ?CTC > ? ? 66.66667 > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.