Hello all,

For some work I am doing on RNA, I want to use R to do string parsing that (I think) is like a simplistic HTML parsing.

For example, let's say we have the following two variables:

    Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
    Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

Say that I want to parse "Seq" according to "Str", using the legend here:

Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |        Stem 1            Stem 2                 Stem 3         |
        |                                                                |
        +----------------------------------------------------------------+
                                      Stem 0

Assume that we always have 4 stems (0 to 3), but that the number of letters before and after each of them can vary.

The output should be something like the following list structure:

    list(
      "Stem 0 opening" = "GCCTCGA",
      "before Stem 1" = "TA",
      "Stem 1" = list(opening = "GCTC",
                      inside = "AGTTGGGA",
                      closing = "GAGC"
                      ),
      "between Stem 1 and 2" = "G",
      "Stem 2" = list(opening = "TACGA",
                      inside = "CTGAAGA",
                      closing = "TCGTA"
                      ),
      "between Stem 2 and 3" = "AGGtC",
      "Stem 3" = list(opening = "ACCAG",
                      inside = "TTCGATC",
                      closing = "CTGGT"
                      ),
      "After Stem 3" = "",
      "Stem 0 closing" = "TCGGGGC"
    )

I don't have any experience with programming a parser, and would like advice on what strategy to use when programming something like this (and any recommended R commands to use).

What I was thinking of is to first get rid of "Stem 0", then go through the inner string with a recursive function (let's call it "seperate.stem") that each time will split the string into:
1. before stem
2. opening stem
3. inside stem
4. closing stem
5. after stem

The "after stem" part would then be passed recursively to the same function ("seperate.stem").

The thing is that I am not sure how to do this coding without using a loop.

Any advice would be most welcome.

----------------Contact Details:----------------
Contact me: Tal.Galili@gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
------------------------------------------------
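The recursive strategy described above can be sketched in base R without an explicit loop. This is only an illustration under stated assumptions: Stem 0 closes with as many "<" as it opens with ">", only dots may follow it, and the inner stems are not nested inside each other. The name separate_stem and the layout of its return value are hypothetical, not something proposed in the thread.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Hypothetical helper: split off one stem, then recurse on what follows it.
separate_stem <- function(Str, Seq) {
  # capture groups: before, opening, inside, closing, after
  pat <- "^([.]*)(>+)([.]*)(<+)(.*)$"
  m <- regmatches(Str, regexec(pat, Str, perl = TRUE))[[1]]
  if (length(m) == 0) return(list(rest = Seq))   # no stem left in this part
  len <- nchar(m[-1])                            # lengths of the five groups
  end <- cumsum(len)
  part <- substring(Seq, end - len + 1, end)     # zero-length groups give ""
  stem <- list(before = part[1], opening = part[2],
               inside = part[3], closing = part[4])
  c(list(stem), separate_stem(m[6], part[5]))
}

# Strip Stem 0 first (assumption: k opening ">" are closed by k "<").
k      <- attr(regexpr("^>+", Str), "match.length")    # leading ">" run
n_tail <- attr(regexpr("[.]*$", Str), "match.length")  # dots after Stem 0
last   <- nchar(Str) - n_tail                          # last "<" of Stem 0

stem0 <- c(opening = substr(Seq, 1, k),
           closing = substr(Seq, last - k + 1, last))
res <- separate_stem(substr(Str, k + 1, last - k),
                     substr(Seq, k + 1, last - k))
str(res)
```

Each recursive call handles one stem and hands the remainder to the next call; the recursion stops when the remaining structure string no longer contains a stem, and whatever sequence is left is returned as "rest".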
How are you supposed to interpret the string that is doing the parsing? Does each sequence have the same number of ">>>>" for the opening sequence as it does "<<<<" for the closing sequence? That is what it appears to be, looking at the way stem 3 is parsed. You will have to provide a little more insight on how to interpret the symbols.

Does the parsing always start with a partial stem 0, as your example shows? Is there a way of making sure you have the right sequences when you start? Is there a chance of an error in the middle of the string that you have to restart from?

How long are the strings that you want to parse? Is each one a self-contained sequence like the one in your example, or do they go on for thousands of characters?

Is there always at least one '.' between stems?

A full set of rules as to how the parsing should be done would be useful. Do you have the BNF syntax for parsing?

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
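Some of the questions above can be checked mechanically before any parsing is attempted. The following is a minimal sketch, not a rule set confirmed in the thread: check_structure is a hypothetical name, and the conditions it tests (the two strings have equal length, only the symbols ">", "<" and "." occur, there are as many ">" as "<", and no "<" appears before a ">" it could close) are assumptions about the notation.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Hypothetical sanity checks on a structure string and its sequence.
check_structure <- function(Str, Seq) {
  ch <- strsplit(Str, "")[[1]]
  # running count of stems that are open at each position
  depth <- cumsum((ch == ">") - (ch == "<"))
  c(same_length    = nchar(Str) == nchar(Seq),
    known_symbols  = all(ch %in% c(">", "<", ".")),
    balanced       = sum(ch == ">") == sum(ch == "<"),
    never_negative = all(depth >= 0))
}

check_structure(Str, Seq)
```

For the sample input all four checks are TRUE; a string such as "<>" would fail the never_negative check because it closes a stem before opening one.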
Gabor Grothendieck
2010-Mar-16 14:24 UTC
[R] How to parse a string (by a "new" markup) with R ?
We show how to use the gsubfn package to parse this.

The rules are not entirely clear so we will assume the following:

- there is a fixed template for the output which is the same as your output but possibly with different character strings filled in. This implies, for example, that there are exactly Stem0, Stem1, Stem2 and Stem3 and no fewer or more stems.

- the sequence always starts with the open of Stem0, at least one dot and the open of Stem1. There are no dots prior to the open of Stem0. This seems to be implicit in your sample output since there is no zero length string in your sample output corresponding to dots prior to Stem0.

- Stem0 closes with the same number of < as there are > to open it

You can modify this yourself to take into account the actual rules whatever they are.

We first calculate k, the number of leading >'s, using strapply.

Then we replace the leading k >'s with }'s and the trailing k <'s with {'s, giving us Str3:

"}}}}}}}..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<{{{{{{{."

We again use strapply, this time to get the lengths of the runs. Note that zero length runs are possible so we cannot, for example, use rle for this. For example, there is a zero length run of dots between the last < and the first {. read.fwf is used to actually parse out the strings using the lengths we just calculated.

Finally we fill in the template using relist.

# inputs

Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

template <- list(
  "Stem 0 opening" = "",
  "before Stem 1" = "",
  "Stem 1" = list(opening = "",
                  inside = "",
                  closing = ""
                  ),
  "between Stem 1 and 2" = "",
  "Stem 2" = list(opening = "",
                  inside = "",
                  closing = ""
                  ),
  "between Stem 2 and 3" = "",
  "Stem 3" = list(opening = "",
                  inside = "",
                  closing = ""
                  ),
  "After Stem 3" = "",
  "Stem 0 closing" = ""
)

# processing

# create string made by repeating string s k times followed by more
reps <- function(s, k, more = "") {
  paste(paste(rep(s, k), collapse = ""), more, sep = "")
}

library(gsubfn)

# k is the number of leading >'s, i.e. the length of the Stem0 opening
k <- nchar(strapply(Str, "^>+", c)[[1]])

# relabel the Stem0 opening as }'s and the Stem0 closing as {'s
Str2 <- sub("^>+", reps("}", k), Str)
Str3 <- sub(reps("<", k, "([^<]*)$"), reps("{", k, "\\1"), Str2)

# one capture group per field; any group may match a zero length run
pat <- "^(}*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)({*)([.]*)$"
lens <- sapply(strapply(Str3, pat, c)[[1]], nchar)

# cut Seq into fields of those widths; zero width fields come back as NA
tokens <- unlist(read.fwf(textConnection(Seq), lens, as.is = TRUE))
closeAllConnections()
tokens[is.na(tokens)] <- ""

out <- relist(tokens, template)
out

Here is the str of the output for your sample input:

> str(out)
List of 9
 $ Stem 0 opening      : chr "GCCTCGA"
 $ before Stem 1       : chr "TA"
 $ Stem 1              :List of 3
  ..$ opening: chr "GCTC"
  ..$ inside : chr "AGTTGGGA"
  ..$ closing: chr "GAGC"
 $ between Stem 1 and 2: chr "G"
 $ Stem 2              :List of 3
  ..$ opening: chr "TACGA"
  ..$ inside : chr "CTGAAGA"
  ..$ closing: chr "TCGTA"
 $ between Stem 2 and 3: chr "AGGtC"
 $ Stem 3              :List of 3
  ..$ opening: chr "ACCAG"
  ..$ inside : chr "TTCGATC"
  ..$ closing: chr "CTGGT"
 $ After Stem 3        : chr ""
 $ Stem 0 closing      : chr "TCGGGGC"
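The remark that rle cannot be used here can be illustrated with a small base R sketch (an illustration only, not part of the solution above). rle reports just the runs that are actually present in Str, so the 12 consecutive "<" that close Stem 3 and Stem 0 come out as a single run, and the empty "After Stem 3" field between them does not appear at all.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Run-length encode the structure string, one character per element.
runs <- rle(strsplit(Str, "")[[1]])

# Cut Seq at the run boundaries.
end <- cumsum(runs$lengths)
pieces <- substring(Seq, end - runs$lengths + 1, end)
names(pieces) <- runs$values

length(pieces)   # 14 runs, whereas the pattern above captures 16 fields
pieces[13]       # "CTGGTTCGGGGC": Stem 3 closing and Stem 0 closing merged
```

This is why the approach above first relabels the outer stem with different characters and then uses a pattern whose groups are allowed to match nothing.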
Wow,
Thank you very much Andrej!

Tal

2010/3/17 Andrej Blejec <Andrej.Blejec@nib.si>:
> A version using regular expressions, with a lot of regexpr() and substr()
> functions, is attached.
> Finally, everything is packed into the splitSeq() function.
>
> Andrej
>
> --
> Andrej Blejec
> National Institute of Biology
> Vecna pot 111 POB 141
> SI-1000 Ljubljana
> SLOVENIA
> e-mail: andrej.blejec@nib.si
> URL: http://ablejec.nib.si
> tel: + 386 (0)59 232 789
> fax: + 386 1 241 29 80
> --------------------------
> Local Organizer of ICOTS-8
> International Conference on Teaching Statistics
> http://icots8.org