Hi everybody,

I have some questions about the way sub() works. I hope that someone has the answer:

1/ Why does the second example not return an empty string? There is no match.

subtext <- "-1980-"
sub(".*(1980).*", "\\1", subtext) # return 1980
sub(".*(1981).*", "\\1", subtext) # return -1980-

2/ According to the sub documentation, it replaces the first occurrence of a pattern: why does it not return 1980?

subtext <- " 1980 1981 "
sub(".*(198[01]).*", "\\1", subtext) # return 1981

3/ I want to extract a year from text; I use:

subtext <- "bla 1980 bla"
sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 1980
subtext <- "bla 2010 bla"
sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 2010

but

subtext <- "bla 1010 bla"
sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 1010

I would like to exclude 1010 and other cases like it.

The solution would be:

18[0-9][0-9] or 19[0-9][0-9] or 200[0-9] or 201[0-9]

Is there a way to write such a pattern in grep?

Thanks a lot

Marc
I will answer my own third point. This pattern is better for getting a year:

pattern.year <- ".*\\b(18|19|20)([0-9][0-9])\\b.*"

subtext <- "bla 1880 bla"
sub(pattern.year, "\\1\\2", subtext) # return 1880
subtext <- "bla 1980 bla"
sub(pattern.year, "\\1\\2", subtext) # return 1980
subtext <- "bla 2010 bla"
sub(pattern.year, "\\1\\2", subtext) # return 2010

subtext <- "bla 1010 bla"
sub(pattern.year, "\\1\\2", subtext) # return bla 1010 bla
subtext <- "bla 3010 bla"
sub(pattern.year, "\\1\\2", subtext) # return bla 3010 bla

Marc

-- 
Marc Girondot, Pr
Laboratoire Ecologie, Systématique et Evolution
Equipe de Conservation des Populations et des Communautés
CNRS, AgroParisTech et Université Paris-Sud 11, UMR 8079
Bâtiment 362
91405 Orsay Cedex, France
Tel: 33 1 (0)1.69.15.72.30 Fax: 33 1 (0)1.69.15.73.53
e-mail: marc.girondot at u-psud.fr
Web: http://www.ese.u-psud.fr/epc/conservation/Marc.html
Skype: girondot

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
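Since sub() returns its input unchanged when nothing matches, a caller cannot easily tell "no year found" from a real result. The sketch below is one possible way around that, not something proposed in the thread: it matches only the year with regexpr() and pulls it out with regmatches(), returning NA where there is no match. The function name extract_year is hypothetical, and perl = TRUE is an assumption made here so that \\b is handled by PCRE. Note that the pattern accepts any year from 1800 to 2099, which is slightly wider than the 18xx/19xx/200x/201x range asked for in the question.

```r
# Hypothetical helper (not from the thread): returns the first year
# between 1800 and 2099 found in each element of x, or NA if there is none.
extract_year <- function(x) {
  pattern <- "\\b(18|19|20)[0-9][0-9]\\b"
  # regexpr() gives the position of the first match, or -1 if there is none.
  # perl = TRUE is an assumption of this sketch (PCRE word boundaries).
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  # regmatches() only returns text for the elements that matched.
  out[m != -1] <- regmatches(x, m)
  out
}

years <- extract_year(c("bla 1880 bla", "bla 2010 bla",
                        "bla 1010 bla", "bla 3010 bla"))
years
## [1] "1880" "2010" NA     NA
```

Because the pattern no longer has to swallow the whole string with .* on both sides, the same regular expression can also be reused with grepl() or gregexpr().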
Hi Marc.

For question 1: I know that in Perl, captured groups stay saved until they are overwritten. \\1 is the capture variable in your R examples.

So the 2nd regular expression does not match, but \\1 still holds the 1980 captured by the previous expression, hence the result.

Maybe if you restart R and try your 2nd expression first, \\1 will be empty or give a no-match result.

Just speculation :)

John
So there is probably a command that resets the capture variables as I call them. No doubt someone will write what it is.
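The speculation above is easy to check. To my knowledge, sub() in R keeps no capture state between calls: \\1 only has a meaning inside the replacement string of the call being evaluated, so there is nothing to reset. The small sketch below runs the non-matching pattern on its own and then shows one possible way to get an empty string on no match, by testing with grepl() first. The variable names no_match, pattern and result are only illustrative.

```r
subtext <- "-1980-"

# The non-matching pattern on its own, with no earlier sub() call to "remember":
no_match <- sub(".*(1981).*", "\\1", subtext)
no_match
## [1] "-1980-"

# grepl() reports whether the pattern matched at all, so an empty string
# can be returned explicitly when it did not.
pattern <- ".*(1981).*"
result <- if (grepl(pattern, subtext)) sub(pattern, "\\1", subtext) else ""
result
## [1] ""
```

The first result is the original string, not a leftover "1980", which agrees with the statement in ?sub that elements which are not substituted are returned unchanged.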
Quoting Marc Girondot via R-help <r-help at r-project.org>:

> Hi everybody,
>
> I have some questions about the way that sub is working. I hope that
> someone has the answer:
>
> 1/ Why the second example does not return an empty string ? There is
> no match.
>
> subtext <- "-1980-"
> sub(".*(1980).*", "\\1", subtext) # return 1980
> sub(".*(1981).*", "\\1", subtext) # return -1980-

This is as documented in ?sub:

  "Elements of character vectors x which are not substituted will be returned unchanged"

> 2/ Based on sub documentation, it replaces the first occurence of a
> pattern: why it does not return 1980 ?
>
> subtext <- " 1980 1981 "
> sub(".*(198[01]).*", "\\1", subtext) # return 1981

Because the pattern matches the whole string, not just the year:

  regexpr(".*(198[01]).*", subtext)
  ## [1] 1
  ## attr(,"match.length")
  ## [1] 11
  ## attr(,"useBytes")
  ## [1] TRUE

From this match, the RE engine will give you the last backreference-match, which is "1981". If you want to _extract_ the first year, use a non-greedy RE instead:

  sub(".*?(198[01]).*", "\\1", subtext)
  ## [1] "1980"

I say _extract_ because you may _replace_ the pattern, as expected:

  sub("198[01]", "YYYY", subtext)
  ## [1] " YYYY 1981 "

That is because the pattern does not match the whole string. Perhaps this example makes it clearer:

  test <- "1 2 3 4 5"
  sub("([0-9])", "\\1\\1", test)
  ## [1] "11 2 3 4 5"
  sub(".*([0-9]).*", "\\1\\1", test)
  ## [1] "55"
  sub(".*?([0-9]).*", "\\1\\1", test)
  ## [1] "11"

> 3/ I want extract year from text; I use:
>
> subtext <- "bla 1980 bla"
> sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 1980
> subtext <- "bla 2010 bla"
> sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 2010
>
> but
>
> subtext <- "bla 1010 bla"
> sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 1010
>
> I would like exclude the case 1010 and other like this.
>
> The solution would be:
>
> 18[0-9][0-9] or 19[0-9][0-9] or 200[0-9] or 201[0-9]
>
> Is there a solution to write such a pattern in grep ?

You answered this yourself, I think.

> Thanks a lot
>
> Marc

-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
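To put the greedy and non-greedy behaviour side by side, here is a short sketch built on the same string as question 2. The last line goes one step beyond the thread: instead of matching the whole string and keeping one group, it matches only the years and collects all of them with gregexpr() and regmatches(). The variable names greedy, lazy and all_years are illustrative only.

```r
subtext <- " 1980 1981 "

# Greedy: the leading .* takes as much as it can, so the group lands on
# the last year in the string.
greedy <- sub(".*(198[01]).*", "\\1", subtext)
greedy
## [1] "1981"

# Non-greedy: the leading .*? takes as little as it can, so the group lands
# on the first year in the string.
lazy <- sub(".*?(198[01]).*", "\\1", subtext)
lazy
## [1] "1980"

# Matching only the years (no surrounding .*) returns every occurrence.
all_years <- regmatches(subtext, gregexpr("198[01]", subtext))[[1]]
all_years
## [1] "1980" "1981"
```

When every year in a text is wanted rather than just the first or the last, the gregexpr() form avoids having to reason about greediness at all.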