thr3ads.net - R help - [R] Deleting subsequences from a string sequence [Jan 2014]

arun

2014-Jan-23 19:04 UTC

[R] Deleting subsequences from a string sequence

Hi,
Try:
CDS1 <- read.table("CDS coordinates.txt",header=FALSE)
CDS2 <- split(CDS1[,1],as.numeric(as.character(gl(nrow(CDS1),2,length=nrow(CDS1)))))
eya4 <- readChar("eya4_lagan_HM_cp.txt",file.info("eya4_lagan_HM_cp.txt")$size)
eyaSpl <- head(strsplit(eya4,"")[[1]],-1)
length(eyaSpl)
#[1] 311522

eyaSpl1 <- eyaSpl
##1
for(i in seq_along(CDS2)){
eyaSpl1[seq(CDS2[[i]][1],CDS2[[i]][2],by=1)] <- "#"
eyaSpl1}

##2
eyaSpl2 <- rep("#",sum(length(eyaSpl),length(CDS1[,1])))
vec1 <- unlist(lapply(CDS2,function(x) c(x[1]-1,x[2]+1)),use.names=FALSE)
eyaSpl2[-vec1] <- eyaSpl
eyaSpl2New <- paste(eyaSpl2,collapse="")

A.K.
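
A note on ##1 above: the loop overwrites every position inside a range with "#", so the 33 characters from 44184 to 44216 become 33 "#" characters rather than one. If a single "#" per range is wanted, as the question describes, one possible base R sketch follows. It uses toy data, and the names `seq_chr`, `starts` and `ends` are illustrative and not from the thread. It also assumes the ranges are sorted and do not overlap.

```r
# Toy stand-ins for the real sequence and coordinate pairs (hypothetical data).
seq_chr <- strsplit("123456789012345678901234567890", "")[[1]]
starts  <- c(3, 12, 18)
ends    <- c(7, 13, 25)

# Put one "#" at the first position of each range ...
out <- seq_chr
out[starts] <- "#"

# ... then drop the remaining positions of each range (none if the range has length 1).
drop <- unlist(Map(function(s, e) if (e > s) seq(s + 1, e) else integer(0),
                   starts, ends))
if (length(drop) > 0) out <- out[-drop]

query1 <- paste(out, collapse = "")
query1
# [1] "12#8901#4567#67890"
```

With the real data, `starts` and `ends` would be the odd and even rows of the coordinate file, and `seq_chr` would be `eyaSpl`.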


I have a data file here, which is imported into R by: 

    eya4_lagan_HM_cp <- "E:/blahblah/eya4_lagan_HM_cp.txt"

    eya4_lagan_HM_cp <- readChar(eya4_lagan_HM_cp, file.info(eya4_lagan_HM_cp)$size)


Label the first character with position "1" and the last character
as position "311,522" (the sequence contains 311,522 characters
in total). I have two queries which are closely related.

**Query 1)** 

Now I have a data file with a list of positions here. The positions are read in
"pairs", that is, take the first pair 44184
and 44216 as an example. I wish to delete the subsequence from position 
44184 (inclusive) to position 44216 (inclusive) from the previous 
sequence `eya4_lagan_HM_cp` and in its place, insert the character #. In other
words, substitute the subsequence from 44184 to 44216 with #. I
would like to do this with the rest of the pairs, that is, for 151795 
and 151844, I want to delete from position 151795 (inclusive) to 151844 
(inclusive) in `eya4_lagan_HM_cp` and replace it with #, and so on. 


**Query 2)** 

Now I would like to do something slightly different with the
data file with the list of positions. Take the first pair as an example
again. I would like to insert a # right before position 44184, in other words,
insert a # between positions 44183 and 44184 in
`eya4_lagan_HM_cp` and then I would like to insert a # right after position
44216, i.e., insert a # between positions 44216 and 44217. I would like to
repeat this procedure for all position pairs. So for the next pair, I would like
a # right before 151795 and a # right after 151844.

Thank you.

Hervé Pagès

2014-Jan-23 21:03 UTC

[R] Deleting subsequences from a string sequence

Hi,

This is easy to do with the Biostrings package (from Bioconductor).

Let's say you've managed to load the string from your
eya4_lagan_HM_cp.txt file:

   my_seq <- "123456789012345678901234567890"

What you call "pairs of positions" are "ranges". Let's say you've
managed to load the ranges from your data file:

   my_ranges <- rbind(c(3, 7), c(12, 13), c(18, 25))

Then:

   library(Biostrings)
   my_seq <- BString(my_seq)
   my_ranges <- IRanges(my_ranges[ ,1], my_ranges[ ,2])

Query 1:

   > replaceAt(my_seq, at=my_ranges, value="#")
     18-letter "BString" instance
   seq: 12#8901#4567#67890

Query 2:

   > replaceAt(my_seq, at=my_ranges,
               value=paste0("#", extractAt(my_seq, my_ranges), "#"))
     36-letter "BString" instance
   seq: 12#34567#8901#23#4567#89012345#67890

   ## Or, equivalently (but more efficiently):

   > replaceAt(my_seq, at=c(start(my_ranges), end(my_ranges) + 1), value="#")
     36-letter "BString" instance
   seq: 12#34567#8901#23#4567#89012345#67890

You can turn the BString objects back into ordinary strings with
as.character().

To install the Biostrings package:

   source("http://bioconductor.org/biocLite.R")
   biocLite("Biostrings")
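
Where installing Biostrings is not an option, Query 2 can also be sketched in base R on the same toy example. This is only an illustration: the names `seq_chr`, `starts` and `ends` are assumptions, and the ranges are taken to be non-overlapping. Inserting from the rightmost position to the leftmost means earlier insertions never shift the positions still to be processed.

```r
# Same toy sequence and ranges as in the Biostrings example above.
seq_chr <- strsplit("123456789012345678901234567890", "")[[1]]
starts  <- c(3, 12, 18)
ends    <- c(7, 13, 25)

# A "#" right before `start` goes after position start - 1;
# a "#" right after `end` goes after position end.
after <- sort(c(starts - 1, ends), decreasing = TRUE)

out <- seq_chr
for (p in after) {
  out <- append(out, "#", after = p)  # after = 0 would prepend
}

query2 <- paste(out, collapse = "")
query2
# [1] "12#34567#8901#23#4567#89012345#67890"
```

The loop copies the vector on every insertion, so for a few dozen ranges on a 311,522-character sequence it should be acceptable, but the `replaceAt()` call is likely the better choice for many ranges.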

Cheers,
H.


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

arun

2014-Jan-24 19:16 UTC

[R] Deleting subsequences from a string sequence

Hi,
I am sorry. I didn't test it properly.
Check if this works. (But you already got Hervé's solution.)
For ##2
eyaSpl2 <- rep("#",sum(length(eyaSpl),length(CDS1[,1]))) ##as in previous code

indx <- CDS1[,1]+rep(seq(0,length(CDS1[,1]),by=2),each=2)[-c(1,40)]
eyaSpl2[-indx] <- eyaSpl

###testing
indx2 <- which(eyaSpl2=="#")
lst1 <- lapply(split(CDS1[,1],((seq_along(CDS1[,1])-1)%/%2)+1),function(x) paste(eyaSpl[(x[1]-1):(x[2]+1)],collapse=""))
lst2 <- lapply(split(indx2,((seq_along(indx2)-1)%/%2)+1),function(x) paste(eyaSpl2[(x[1]-1):(x[2]+1)],collapse=""))

lst1[[1]]
#[1] "fapkkaakafmfffakkannpaaapkacfaapfdk"
lst2[[1]]
#[1] "f#apkkaakafmfffakkannpaaapkacfaapfd#k"
####
lst1[[2]]
#[1] "kkpaaakaaaafkpkfbfakaaofakapkpppfcgaanfpfakaappffakk"
lst2[[2]]
#[1] "k#kpaaakaaaafkpkfbfakaaofakapkpppfcgaanfpfakaappffak#k"
####
lst1[[19]]
#[1] "kfafaafapkfffpphpkkakkapapfeaknfafpfckaffpfhpkkfpfpefahfaakfafpkkaappakakpapppkpaaf"
lst2[[19]]
#[1] "k#fafaafapkfffpphpkkakkapapfeaknfafpfckaffpfhpkkfpfpefahfaakfafpkkaappakakpapppkpaa#f"



A.K.
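
The `[-c(1,40)]` in the `indx` line above is tied to a coordinate file with exactly 38 positions (19 pairs). A possible generalisation that derives the same offsets from the number of pairs is sketched below on toy data. The names `pos`, `seq_chr`, `k` and `is_end` are illustrative and not from the thread, and the positions are assumed to be ordered as start1, end1, start2, end2, and so on.

```r
# Toy stand-ins (hypothetical data): positions ordered start1, end1, start2, end2, ...
seq_chr <- strsplit("123456789012345678901234567890", "")[[1]]
pos     <- c(3, 7, 12, 13, 18, 25)

n_pairs <- length(pos) / 2
k       <- rep(seq_len(n_pairs), each = 2)        # pair number of each position
is_end  <- rep(c(FALSE, TRUE), times = n_pairs)   # TRUE for end positions

# In the output, the "#" before start_k sits at start_k + 2*(k - 1)
# and the "#" after end_k sits at end_k + 2*k.
indx <- pos + 2 * (k - 1) + ifelse(is_end, 2, 0)

out <- rep("#", length(seq_chr) + length(pos))
out[-indx] <- seq_chr                             # fill every slot that is not a "#"

query2_vec <- paste(out, collapse = "")
query2_vec
# [1] "12#34567#8901#23#4567#89012345#67890"
```

For 19 pairs this yields the offsets 0, 2, 2, 4, ..., 36, 36, 38, which is the vector the hard-coded version builds.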


Hi A.K., thanks for your help. I have some follow up queries. 

For ##2, the code doesn't seem to get exactly what I was after. For example,
for the first position pair, the code generates:

a#fapkkaakafmfffakkannpaaapkacfaapf#dk 

whereas the # signs should be around this: 

af#apkkaakafmfffakkannpaaapkacfaapfd#k 

The positions of # are also slightly off for the latter position pairs. 





