Hi,

Try:

    CDS1 <- read.table("CDS coordinates.txt", header=FALSE)
    CDS2 <- split(CDS1[,1], as.numeric(as.character(gl(nrow(CDS1), 2, length=nrow(CDS1)))))
    eya4 <- readChar("eya4_lagan_HM_cp.txt", file.info("eya4_lagan_HM_cp.txt")$size)
    eyaSpl <- head(strsplit(eya4, "")[[1]], -1)
    length(eyaSpl)
    #[1] 311522

    eyaSpl1 <- eyaSpl
    ##1
    for(i in seq_along(CDS2)){
      eyaSpl1[seq(CDS2[[i]][1], CDS2[[i]][2], by=1)] <- "#"
      eyaSpl1}

    ##2
    eyaSpl2 <- rep("#", sum(length(eyaSpl), length(CDS1[,1])))
    vec1 <- unlist(lapply(CDS2, function(x) c(x[1]-1, x[2]+1)), use.names=FALSE)
    eyaSpl2[-vec1] <- eyaSpl
    eyaSpl2New <- paste(eyaSpl2, collapse="")

A.K.


I have a data file here, which is imported into R by:

    eya4_lagan_HM_cp <- "E:/blahblah/eya4_lagan_HM_cp.txt"
    eya4_lagan_HM_cp <- readChar(eya4_lagan_HM_cp, file.info(eya4_lagan_HM_cp)$size)

Label the first character as position 1 and the last character as position 311,522 (the sequence contains 311,522 characters in total). I have two closely related queries.

**Query 1)**

Now I have a data file with a list of positions here. The positions are read in "pairs"; take the first pair, 44184 and 44216, as an example. I wish to delete the subsequence from position 44184 (inclusive) to position 44216 (inclusive) from the previous sequence `eya4_lagan_HM_cp` and, in its place, insert the character #. In other words, substitute the subsequence from 44184 to 44216 with #. I would like to do this with the rest of the pairs: for 151795 and 151844, I want to delete from position 151795 (inclusive) to 151844 (inclusive) in `eya4_lagan_HM_cp` and replace it with #, and so on.

**Query 2)**

Now I would like to do something slightly different with the data file with the list of positions. Take the first pair as an example again. I would like to insert a # right before position 44184, in other words, insert a # between positions 44183 and 44184 in `eya4_lagan_HM_cp`, and then insert a # right after position 44216, i.e., insert a # between positions 44216 and 44217. I would like to repeat this procedure for all position pairs. So for the next pair, I would like a # right before 151795 and a # right after 151844.

Thank you.
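For reference, Query 1 as worded asks for each range to collapse to a single #, whereas an element-wise assignment like the loop in the reply writes one # per character in the range. A minimal base-R sketch of the single-# reading follows. It runs on a short toy string rather than the real data; the names `seq_chr` and `pos` and their values are made up for illustration, and the ranges are assumed to be sorted and non-overlapping.

```r
# Toy stand-ins (hypothetical values) for the sequence and the position file.
seq_chr <- "123456789012345678901234567890"
pos <- c(3, 7, 12, 13, 18, 25)      # start, end, start, end, ...

starts <- pos[c(TRUE, FALSE)]       # 1st, 3rd, 5th, ... entries
ends   <- pos[c(FALSE, TRUE)]       # 2nd, 4th, 6th, ... entries

# Pieces of the sequence that lie outside the ranges:
# before the first range, between consecutive ranges, after the last range.
keep_from <- c(1, ends + 1)
keep_to   <- c(starts - 1, nchar(seq_chr))
pieces <- substring(seq_chr, keep_from, keep_to)

# Joining the kept pieces with "#" leaves exactly one "#" where each range was.
query1 <- paste(pieces, collapse = "#")
query1
# [1] "12#8901#4567#67890"
```

With the real data, `seq_chr` would presumably be the string returned by `readChar()` (minus any trailing newline) and `pos` the first column of the coordinates table.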
Hi,

This is easy to do with the Biostrings package (from Bioconductor). Let's say you've managed to load the string from your eya4_lagan_HM_cp.txt file:

    my_seq <- "123456789012345678901234567890"

What you call "pairs of positions" are "ranges". Let's say you've managed to load the ranges from your data file:

    my_ranges <- rbind(c(3, 7), c(12, 13), c(18, 25))

Then:

    library(Biostrings)
    my_seq <- BString(my_seq)
    my_ranges <- IRanges(my_ranges[ ,1], my_ranges[ ,2])

Query 1:

    > replaceAt(my_seq, at=my_ranges, value="#")
      18-letter "BString" instance
    seq: 12#8901#4567#67890

Query 2:

    > replaceAt(my_seq, at=my_ranges, value=paste0("#", extractAt(my_seq, my_ranges), "#"))
      36-letter "BString" instance
    seq: 12#34567#8901#23#4567#89012345#67890

    ## Or, equivalently (but more efficiently):
    > replaceAt(my_seq, at=c(start(my_ranges), end(my_ranges) + 1), value="#")
      36-letter "BString" instance
    seq: 12#34567#8901#23#4567#89012345#67890

You can turn the BString objects back into ordinary strings with as.character().

To install the Biostrings package (on current Bioconductor releases, BiocManager replaces the older biocLite.R script):

    install.packages("BiocManager")
    BiocManager::install("Biostrings")

Cheers,
H.

On 01/23/2014 11:04 AM, arun wrote:
> [...]

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
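The reply above takes `my_ranges` as already loaded. One way to get from a flat column of positions to that two-column shape is sketched below. It assumes the coordinates file holds one position per line with starts and ends alternating, which is what the earlier `CDS1[,1]` code suggests but is not confirmed here; the vector `pos` is a made-up stand-in for that column.

```r
# Hypothetical flat vector of positions, e.g. what
# read.table("CDS coordinates.txt", header = FALSE)[, 1] might return.
pos <- c(3, 7, 12, 13, 18, 25)

# Positions must come in start/end pairs.
stopifnot(length(pos) %% 2 == 0)

# Fill row by row so that each row is one (start, end) pair.
my_ranges <- matrix(pos, ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("start", "end")))

# Sanity check: no range ends before it starts.
stopifnot(all(my_ranges[, "start"] <= my_ranges[, "end"]))

my_ranges
```

The two columns can then be passed to `IRanges()` as in the reply, e.g. `IRanges(my_ranges[, "start"], my_ranges[, "end"])`.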
Hi,

I am sorry. I didn't test it properly. Check if this works. (But you already got Hervé's solution.)

For ##2:

    eyaSpl2 <- rep("#", sum(length(eyaSpl), length(CDS1[,1]))) ##as in previous code
    indx <- CDS1[,1] + rep(seq(0, length(CDS1[,1]), by=2), each=2)[-c(1,40)]
    eyaSpl2[-indx] <- eyaSpl

    ###testing
    indx2 <- which(eyaSpl2=="#")
    lst1 <- lapply(split(CDS1[,1], ((seq_along(CDS1[,1])-1)%/%2)+1), function(x) paste(eyaSpl[(x[1]-1):(x[2]+1)], collapse=""))
    lst2 <- lapply(split(indx2, ((seq_along(indx2)-1)%/%2)+1), function(x) paste(eyaSpl2[(x[1]-1):(x[2]+1)], collapse=""))

    lst1[[1]]
    #[1] "fapkkaakafmfffakkannpaaapkacfaapfdk"
    lst2[[1]]
    #[1] "f#apkkaakafmfffakkannpaaapkacfaapfd#k"
    ####
    lst1[[2]]
    #[1] "kkpaaakaaaafkpkfbfakaaofakapkpppfcgaanfpfakaappffakk"
    lst2[[2]]
    #[1] "k#kpaaakaaaafkpkfbfakaaofakapkpppfcgaanfpfakaappffak#k"
    ####
    lst1[[19]]
    #[1] "kfafaafapkfffpphpkkakkapapfeaknfafpfckaffpfhpkkfpfpefahfaakfafpkkaappakakpapppkpaaf"
    lst2[[19]]
    #[1] "k#fafaafapkfffpphpkkakkapapfeaknfafpfckaffpfhpkkfpfpefahfaakfafpkkaappakakpapppkpaa#f"

A.K.


Hi A.K., thanks for your help. I have some follow-up queries.

For ##2, the code doesn't seem to get exactly what I was after. For example, for the first position pair, the code generates:

    a#fapkkaakafmfffakkannpaaapkacfaapf#dk

whereas the # signs should be around this:

    af#apkkaakafmfffakkannpaaapkacfaapfd#k

The positions of # are also slightly off for the later position pairs.
On Thursday, January 23, 2014 2:04 PM, arun <smartpink111 at yahoo.com> wrote:
> [...]
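The corrected index for ##2 drops elements 1 and 40 of the offset vector, which appears to tie it to a file with 38 positions (19 pairs). A sketch of the same offset idea without the hard-coded length is given below, again on a toy string. The names `seq_chr`, `pos`, and `marker_idx` are illustrative only, and the positions are assumed to be sorted with non-overlapping ranges.

```r
# Toy stand-ins (hypothetical values) for the sequence and the position file.
seq_chr <- "123456789012345678901234567890"
pos <- c(3, 7, 12, 13, 18, 25)      # start, end, start, end, ...

chars <- strsplit(seq_chr, "")[[1]]
is_start <- rep(c(TRUE, FALSE), length(pos) / 2)

# The k-th marker has k - 1 markers to its left in the output vector.
# A marker placed before a start s therefore lands at index s + (k - 1);
# a marker placed after an end e lands at index (e + 1) + (k - 1) = e + k.
k <- seq_along(pos)
marker_idx <- ifelse(is_start, pos + k - 1, pos + k)

# Start from all "#", then write the original characters into every slot
# that is not reserved for a marker.
out <- rep("#", length(chars) + length(pos))
out[-marker_idx] <- chars

query2 <- paste(out, collapse = "")
query2
# [1] "12#34567#8901#23#4567#89012345#67890"
```

On the toy input this gives the same string as the Biostrings `replaceAt()` call for Query 2 earlier in the thread, which is a quick way to cross-check either approach.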