Hi,

I would like to find a pattern within a given sequence. There might be multiple occurrences of the pattern. I would like to know the number of occurrences and, if possible, their positions. "grep" can tell me whether a pattern exists but can't give me the information I need. Does anyone know of a function I can use, or how to do what I intend?

Thanks a lot.

Sean
Sean Liang wrote:
> Hi, I would like to find a pattern within a given sequence. There might be
> multiple occurrences of the pattern. I would like to know the number of
> occurrences and the positions if possible. "grep" can tell me if a
> pattern exists but can't give me the information I need. Does anyone
> know any function that I can use or know how to do what I intend?

???

    gr <- grep("a", c("a", "b", "a"))

grep tells you that the positions are 1 and 3. length(gr) tells you there are two occurrences.

Uwe Ligges

> Thanks a lot.
>
> Sean
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Uwe:

Thanks for getting back to me. My situation is a bit more complicated. I have a vector of sequences, each of which might contain any number of occurrences of a given pattern, e.g.:

    > pat = c("ATCGTTTGCTAC", "GGCTAATGCATTGC")
    > grep("TGC", pat)
    [1] 1 2

grep only tells me the position of the first occurrence in each element, whereas the second element contains two "TGC"s.

>>> Uwe Ligges <ligges at statistik.uni-dortmund.de> 10/29/2004 2:27:27 PM >>>
[...]
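For reference, a minimal sketch of what base R returns for this example. grep() reports the indices of the elements that contain the pattern, while regexpr() reports where the first match starts inside each element, so neither on its own counts repeated matches. The variable name `first` is introduced here only for illustration.

```r
pat <- c("ATCGTTTGCTAC", "GGCTAATGCATTGC")

# grep() returns the indices of the elements that contain the pattern
hits <- grep("TGC", pat)

# regexpr() returns the start of the first match within each element
# (-1 when an element has no match); as.integer() drops the attributes
first <- as.integer(regexpr("TGC", pat, fixed = TRUE))

hits   # both elements match
first  # the first "TGC" starts at character 7 in both strings
```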
Bert:

I thought about that but was a bit concerned about the number of strings (10^4) in my vector. Each of these strings can contain any number of occurrences of a given pattern. I will test it to see whether it is computationally efficient. Thanks.

>>> Berton Gunter <gunter.berton at gene.com> 10/29/2004 2:16:53 PM >>>
Use regexpr() in a loop that deletes each successive occurrence of the pattern via substring() and keeps count.

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Sean Liang
> Sent: Friday, October 29, 2004 11:06 AM
> To: R-help at stat.math.ethz.ch
> Subject: [R] pattern search
>
> [...]
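One way the loop Bert describes might look, as a sketch only: the helper name `find_all` is hypothetical, the pattern is treated as a fixed, non-empty string, and matches are taken to be non-overlapping. Each pass finds the next match with regexpr(), records its position in the original string, and cuts off everything up to the end of that match with substring().

```r
# Hypothetical helper: start positions of all non-overlapping matches in one string
find_all <- function(pattern, s) {
  stopifnot(nzchar(pattern))          # an empty pattern would never advance
  pos <- integer(0)
  offset <- 0L                        # characters already cut from the front
  repeat {
    m <- regexpr(pattern, s, fixed = TRUE)
    if (m == -1L) break               # no further match
    start <- as.integer(m)
    pos <- c(pos, offset + start)     # position in the original string
    consumed <- start + attr(m, "match.length") - 1L
    offset <- offset + consumed
    s <- substring(s, consumed + 1L)  # drop everything up to the end of the match
  }
  pos
}

pat <- c("ATCGTTTGCTAC", "GGCTAATGCATTGC")
positions <- lapply(pat, function(s) find_all("TGC", s))
counts <- sapply(positions, length)
```

For 10^4 strings this runs one R-level loop per string, so timing it on real data, as Sean suggests, is the sensible check.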
On Fri, 29 Oct 2004, Sean Liang wrote:

> Thanks for getting back to me. My situation is a bit more complicated.
> I have a vector of sequences each of which might contain any number of a
> given pattern (e.g.
>
> >pat=c("ATCGTTTGCTAC", "GGCTAATGCATTGC");
> > grep ("TGC", pat)
> [1] 1 2
>
> grep only tells me the position of first occurrence in each element
> whereas the second element contains two "TGC"s.

    sapply(strsplit(paste(1, pat, 2, sep=""), "TGC"), length) - 1

and you can also figure out the positions. You can also do it recursively with regexpr (on the same page as grep, no less).

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
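The remark that the positions can also be recovered from the split pieces could be worked out roughly as follows. This is a sketch under the assumption that the pattern is a fixed string; the names `target`, `pieces`, `counts` and `positions` are introduced here for illustration and are not from the post.

```r
pat <- c("ATCGTTTGCTAC", "GGCTAATGCATTGC")
target <- "TGC"

# Pad each string so that a match at the very start or end still yields a piece
pieces <- strsplit(paste(1, pat, 2, sep = ""), target, fixed = TRUE)

# Number of matches = number of pieces - 1
counts <- sapply(pieces, length) - 1

# In the padded string, match i starts one character after the first i pieces
# and the i - 1 earlier matches; removing the one-character "1" prefix cancels
# that "+ 1", so the position in the original string is simply their total length
positions <- lapply(pieces, function(p) {
  n <- length(p) - 1
  if (n == 0) return(integer(0))
  i <- seq_len(n)
  as.integer(cumsum(nchar(p[i])) + (i - 1) * nchar(target))
})
```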
Sean Liang <SLiang <at> wyeth.com> writes:

> I have a vector of sequences each of which might contain any number of a
> given pattern (e.g.
>
> >pat=c("ATCGTTTGCTAC", "GGCTAATGCATTGC");
> > grep ("TGC", pat)
> [1] 1 2
>
> grep only tells me the position of first occurrence in each element
> whereas the second element contains two "TGC"s.
[...]
> I would like to know the number of
> occurrences and the positions if possible.

The following creates v, a list of the same length as pat, whose elements are character vectors holding the pieces of each pat element split along the boundaries of TGC. lapply then calculates the starting position of each piece and selects those that correspond to TGC. The sapply at the end calculates the number of matches for each element of pat. The positions are stored in pos rather than read back from .Last.value, which is only set after a top-level interactive evaluation.

    pat <- c("ATCGTTTGCTAC", "GGCTAATGCATTGC")

    # pat split along TGC boundaries
    v <- strsplit(gsub("(TGC)", ":\\1:", pat), split = ":+")

    # starting positions
    pos <- lapply(v, function(x) (cumsum(nchar(x)) - nchar("TGC") + 1)[grep("TGC", x)])

    # number of matches
    sapply(pos, length)
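For comparison, current versions of base R also provide gregexpr(), which is not mentioned anywhere in this thread and returns every non-overlapping match per element directly. A minimal sketch, assuming the pattern is a fixed string (the names `m`, `positions` and `counts` are illustrative):

```r
pat <- c("ATCGTTTGCTAC", "GGCTAATGCATTGC")

# One list entry per element of pat, each holding all match start positions
m <- gregexpr("TGC", pat, fixed = TRUE)

# gregexpr() signals "no match" with -1, so keep only real positions
positions <- lapply(m, function(x) as.integer(x[x > 0]))

# Number of matches per element
counts <- sapply(positions, length)
```

Because this is vectorised over pat, it avoids an explicit R-level loop per match, which may matter for a vector of 10^4 strings.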