My goal is simple: calcuate GC content of each sequence in a list of
nucleotide
sequences. I have figured out how to vectorize, but all my attempts at
memoization failed.
Can you show me how to properly memoize my function?
There is a StackOverflow post on the subject of memoization, but it does not
help me:
http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r
I haven't been able to find any other discussions on this subject. Searching
for "memoise" or "memoize" on r-bloggers.com returns zero
results. Searching
for those keywords at http://r-project.markmail.org/ does not return helpful
discussions.
Here's my data:
seqs <-
c("","G","C","CCC","T","","TTCCT","","C","CTC")
Some sequences are missing, so they're blank `""`.
I have a function for calculating GC content:
> GC <- function(s) {
if (!is.character(s)) return(NA)
n <- nchar(s)
if (n == 0) return(NA)
m <- gregexpr('[GCSgcs]', s)[[1]]
if (m[1] < 1) return(0)
return(100.0 * length(m) / n)
}
It works:
> GC('')
[1] NA
> GC('G')
[1] 100
> GC('GAG')
[1] 66.66667
> sapply(seqs, GC)
G C CCC T TTCCT
C
NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000
NA 100.00000
CTC
66.66667
I want to memoize it. Then, I want to vectorize it. Should be easy, right?
Apparently, I must have the wrong mindset for using the `memoise` or
`R.cache`
R packages:
> system.time(dummy <- sapply(rep(seqs,100), GC))
user system elapsed
0.044 0.000 0.054
>
> library(memoise)
> GCm1 <- memoise(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm1))
user system elapsed
0.164 0.000 0.173
>
> library(R.cache)
> GCm2 <- addMemoization(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm2))
user system elapsed
10.601 0.252 10.926
Notice that the memoized functions are several orders of magnitude slower.
I tried the `hash` package, but things seem to be happening behind the
scenes
and I don't understand the output:
> cache <- hash()
> GCc <- function(s) {
if (!is.character(s) || nchar(s) == 0) {
return(NA)
}
if(exists(s, cache)) {
return(cache[[s]])
}
result <- GC(s)
cache[[s]] <<- result
return(result)
}
> sapply(seqs,GCc)
[[1]]
[1] NA
$G
[1] 100
$C
NULL
$CCC
[1] 100
$T
NULL
[[6]]
[1] NA
$TTCCT
[1] 40
[[8]]
[1] NA
$C
NULL
$CTC
[1] 66.66667
At least I figured out how to vectorize:
> GCv <- Vectorize(GC)
> GCv(seqs)
G C CCC T TTCCT
C
0.00000 100.00000 100.00000 100.00000 0.00000 0.00000 40.00000
0.00000 100.00000
CTC
66.66667
[[alternative HTML version deleted]]
On 04/26/2012 03:21 PM, Kamil Slowikowski wrote:> My goal is simple: calcuate GC content of each sequence in a list of > nucleotide > sequences. I have figured out how to vectorize, but all my attempts at > memoization failed. > > Can you show me how to properly memoize my function? > > There is a StackOverflow post on the subject of memoization, but it does not > help me: > http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r > > I haven't been able to find any other discussions on this subject. Searching > for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching > for those keywords at http://r-project.markmail.org/ does not return helpful > discussions. > > Here's my data: > > seqs<- c("","G","C","CCC","T","","TTCCT","","C","CTC") > > Some sequences are missing, so they're blank `""`. > > I have a function for calculating GC content: > > > GC<- function(s) { > if (!is.character(s)) return(NA) > n<- nchar(s) > if (n == 0) return(NA) > m<- gregexpr('[GCSgcs]', s)[[1]] > if (m[1]< 1) return(0) > return(100.0 * length(m) / n) > } > > It works: > > > GC('') > [1] NA > > GC('G') > [1] 100 > > GC('GAG') > [1] 66.66667 > > sapply(seqs, GC) > G C CCC T TTCCT > C > NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 > NA 100.00000 > CTC > 66.66667 > > I want to memoize it. Then, I want to vectorize it. Should be easy, right?Hi Kamil -- Not really an answer to your question, but looking at http://bioconductor.org/packages/2.10/bioc/html/Biostrings.html will tell you to install Biostrings with source("http://bioconductor.org/biocLite.R") biocLite("Biostrings") and then library(Biostrings) dna = DNAStringSet(c("","G","C","CCC","T","","TTCCT","","C","CTC")) alf = alphabetFrequency(dna, as.prob=TRUE, baseOnly=TRUE) rowSums(alf[,c("G", "C")]) will give you GC content of each string. > rowSums(alf[,c("G", "C")]) [1] NaN 1.0000000 1.0000000 1.0000000 0.0000000 NaN 0.4000000 [8] NaN 1.0000000 0.6666667 this will be fast and scalable; Biostrings and other Bioconductor (http://bioconductor.org) packages have many useful functions for working with DNA. See the Bioconductor mailing list for more help if this is a promising direction. http://bioconductor.org/help/mailing-list/ Martin> > Apparently, I must have the wrong mindset for using the `memoise` or > `R.cache` > R packages: > > > system.time(dummy<- sapply(rep(seqs,100), GC)) > user system elapsed > 0.044 0.000 0.054 > > > > library(memoise) > > GCm1<- memoise(GC) > > system.time(dummy<- sapply(rep(seqs,100), GCm1)) > user system elapsed > 0.164 0.000 0.173 > > > > library(R.cache) > > GCm2<- addMemoization(GC) > > system.time(dummy<- sapply(rep(seqs,100), GCm2)) > user system elapsed > 10.601 0.252 10.926 > > Notice that the memoized functions are several orders of magnitude slower. > > I tried the `hash` package, but things seem to be happening behind the > scenes > and I don't understand the output: > > > cache<- hash() > > GCc<- function(s) { > if (!is.character(s) || nchar(s) == 0) { > return(NA) > } > if(exists(s, cache)) { > return(cache[[s]]) > } > result<- GC(s) > cache[[s]]<<- result > return(result) > } > > sapply(seqs,GCc) > [[1]] > [1] NA > > $G > [1] 100 > > $C > NULL > > $CCC > [1] 100 > > $T > NULL > > [[6]] > [1] NA > > $TTCCT > [1] 40 > > [[8]] > [1] NA > > $C > NULL > > $CTC > [1] 66.66667 > > At least I figured out how to vectorize: > > > GCv<- Vectorize(GC) > > GCv(seqs) > G C CCC T TTCCT > C > 0.00000 100.00000 100.00000 100.00000 0.00000 0.00000 40.00000 > 0.00000 100.00000 > CTC > 66.66667 > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
On Thu, Apr 26, 2012 at 3:21 PM, Kamil Slowikowski <kslowikowski at gmail.com> wrote:> My goal is simple: calcuate GC content of each sequence in a list of > nucleotide > sequences. I have figured out how to vectorize, but all my attempts at > memoization failed. > > Can you show me how to properly memoize my function? > > There is a StackOverflow post on the subject of memoization, but it does not > help me: > http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r > > I haven't been able to find any other discussions on this subject. Searching > for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching > for those keywords at http://r-project.markmail.org/ does not return helpful > discussions. > > Here's my data: > > ? ?seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC") > > Some sequences are missing, so they're blank `""`. > > I have a function for calculating GC content: > > ? ?> GC <- function(s) { > ? ? ? ?if (!is.character(s)) return(NA) > ? ? ? ?n <- nchar(s) > ? ? ? ?if (n == 0) return(NA) > ? ? ? ?m <- gregexpr('[GCSgcs]', s)[[1]] > ? ? ? ?if (m[1] < 1) return(0) > ? ? ? ?return(100.0 * length(m) / n) > ? ?} > > It works: > > ? ?> GC('') > ? ?[1] NA > ? ?> GC('G') > ? ?[1] 100 > ? ?> GC('GAG') > ? ?[1] 66.66667 > ? ?> sapply(seqs, GC) > ? ? ? ? ? ? ? ? ? ? ?G ? ? ? ? C ? ? ? CCC ? ? ? ? T ? ? ? ? ? ? ? TTCCT > ? ? ? ? ? ? ? ?C > ? ? ? ? ? NA 100.00000 100.00000 100.00000 ? 0.00000 ? ? ? ?NA ?40.00000 > ? ? NA 100.00000 > ? ? ? ? ?CTC > ? ? 66.66667 > > I want to memoize it. Then, I want to vectorize it. Should be easy, right? > > Apparently, I must have the wrong mindset for using the `memoise` or > `R.cache` > R packages: > > ? ?> system.time(dummy <- sapply(rep(seqs,100), GC)) > ? ? ? user ?system elapsed > ? ? ?0.044 ? 0.000 ? 0.054 > ? ?> > ? ?> library(memoise) > ? ?> GCm1 <- memoise(GC) > ? ?> system.time(dummy <- sapply(rep(seqs,100), GCm1)) > ? ? ? user ?system elapsed > ? ? ?0.164 ? 0.000 ? 0.173 > ? ?> > ? ?> library(R.cache) > ? ?> GCm2 <- addMemoization(GC) > ? ?> system.time(dummy <- sapply(rep(seqs,100), GCm2)) > ? ? ? user ?system elapsed > ? ? 10.601 ? 0.252 ?10.926 > > Notice that the memoized functions are several orders of magnitude slower.About R.cache: All memoization by R.cache is currently done toward the file system. In other words, it is designed for larger objects (so you cannot hold all of the cache in memory) and more computationally expensive tasks. /Henrik> > I tried the `hash` package, but things seem to be happening behind the > scenes > and I don't understand the output: > > ? ?> cache <- hash() > ? ?> GCc <- function(s) { > ? ? ? ?if (!is.character(s) || nchar(s) == 0) { > ? ? ? ? ? ?return(NA) > ? ? ? ?} > ? ? ? ?if(exists(s, cache)) { > ? ? ? ? ? ?return(cache[[s]]) > ? ? ? ?} > ? ? ? ?result <- GC(s) > ? ? ? ?cache[[s]] <<- result > ? ? ? ?return(result) > ? ?} > ? ?> sapply(seqs,GCc) > ? ?[[1]] > ? ?[1] NA > > ? ?$G > ? ?[1] 100 > > ? ?$C > ? ?NULL > > ? ?$CCC > ? ?[1] 100 > > ? ?$T > ? ?NULL > > ? ?[[6]] > ? ?[1] NA > > ? ?$TTCCT > ? ?[1] 40 > > ? ?[[8]] > ? ?[1] NA > > ? ?$C > ? ?NULL > > ? ?$CTC > ? ?[1] 66.66667 > > At least I figured out how to vectorize: > > ? ?> GCv <- Vectorize(GC) > ? ?> GCv(seqs) > ? ? ? ? ? ? ? ? ? ? ?G ? ? ? ? C ? ? ? CCC ? ? ? ? T ? ? ? ? ? ? ? TTCCT > ? ? ? ? ? ? ? ?C > ? ? ?0.00000 100.00000 100.00000 100.00000 ? 0.00000 ? 0.00000 ?40.00000 > 0.00000 100.00000 > ? ? ? ? ?CTC > ? ? 66.66667 > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.