Since you only have 4 distinct characters, you can create a table of all
the 4-character combinations (4^4 = 256 of them), so each group of 4
characters reduces to one byte instead of 4. This is fine if you just
want to store them.
> x <- expand.grid(c("A", "C", "G", "T"),
+                  c("A", "C", "G", "T"),
+                  c("A", "C", "G", "T"),
+                  c("A", "C", "G", "T"))
> gene.table <- apply(x, 1, paste, collapse='')
> # convert the string (right now its length must be a multiple of 4;
> # more logic is needed if it is not)
> gene <- "ACGATACGGCGACCACCGAGATCTACACTCTTCCCC"
> # break into 4 character strings
> start <- seq(1, by=4, to=nchar(gene))
> strings <- mapply(substr, gene, start, start+3)
> # create new compressed string
> comp <- as.raw(match(strings, gene.table) - 1)
> # convert back
> paste(gene.table[as.integer(comp) + 1], collapse='')
[1] "ACGATACGGCGACCACCGAGATCTACACTCTTCCCC"
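
For strings whose length is not a multiple of 4, one possible approach (a rough sketch, not the only way) is to pad the string up to the next multiple of 4 and keep the original length alongside the bytes so the padding can be dropped on the way back. The helper names pack.gene and unpack.gene are made up for illustration, and the code assumes a non-empty string containing only upper-case A, C, G and T.

```r
# Lookup table of all 4^4 = 256 four-letter words over A, C, G, T
bases <- c("A", "C", "G", "T")
gene.table <- apply(expand.grid(bases, bases, bases, bases), 1,
                    paste, collapse = "")

# pack.gene and unpack.gene are hypothetical helper names.
# Assumes 'gene' is one non-empty string of upper-case A/C/G/T only.
pack.gene <- function(gene) {
  n <- nchar(gene)
  pad <- (4 - n %% 4) %% 4                 # characters needed to reach a multiple of 4
  padded <- paste0(gene, strrep("A", pad)) # pad with an arbitrary valid base
  start <- seq(1, nchar(padded), by = 4)
  words <- substring(padded, start, start + 3)
  idx <- match(words, gene.table)
  stopifnot(!anyNA(idx))                   # fail early on unexpected characters
  # keep the true length so the padding can be removed when unpacking
  list(len = n, bytes = as.raw(idx - 1L))
}

unpack.gene <- function(packed) {
  full <- paste(gene.table[as.integer(packed$bytes) + 1], collapse = "")
  substr(full, 1, packed$len)              # drop the padding
}

gene <- "ACGATACGGCGACCACCGAGATCTACACTCTTCC"  # 34 characters, from the question
packed <- pack.gene(gene)
length(packed$bytes)                   # number of bytes used: ceiling(34 / 4)
identical(unpack.gene(packed), gene)   # round trip check
```

If the goal is to save disk space, the raw vector in packed$bytes could be written to a binary connection with writeBin(); the stored length would have to be written along with it.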

On Wed, Dec 24, 2008 at 10:26 AM, Gundala Viswanath <gundalav at gmail.com> wrote:
> Dear all,
>
> What's the R way to compress the string into smaller 2~3 char/digit
> length.
> In particular I want to compress string of length >=30 characters,
> e.g. ACGATACGGCGACCACCGAGATCTACACTCTTCC
>
> The reason I want to do that is that there are billions
> of such strings I want to print out, and I need to save disk space.
>
> - Gundala Viswanath
> Jakarta - Indonesia
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?