Hello

I know that R's string functions are not as extensive as those of Unix but I need to do some text handling totally within an R environment because the target is a Windows system which will not have the corresponding shell utilities, sed, awk etc.

Can anyone explain the following gsub phenomenon to me:

> dates<-c("73","74","02","1973","1974","2002")

I want to take just the last two digits where it is a 4-digit year and both digits when it is a 2-digit year. I should be able to use substr but measurement from the string end (with a negative counter or something) is not implemented:

> substr(dates,3,4)
[1] "" "" "" "73" "74" "02"
> substr(dates,-2,4)
[1] "73" "74" "02" "1973" "1974" "2002"
> substr(dates,4,-2)
[1] "" "" "" "" "" ""

So I tried gsub:

> gsub("[19|20]([0-9][0-9])","\\1",dates)
[1] "73" "74" "02" "973" "974" "002"

As I understand it (and comparing with sed), the \\1 should take the first bracketed string but clearly this doesn't work. If I try what should also work:

> gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
[1] "73" "74" "02" "973" "974" "002"

On the other hand the following does work:

> gsub("[19|20]([0-9])([0-9])","\\2",dates)
[1] "73" "74" "02" "73" "74" "02"

So it appears that the substitution takes one character extra to the left but the following indicates that the lower limit of the selected range is also at fault:

> s<-c("1","12","123","1234","12345","123456")
> gsub("[12]([4-6]*)","",s)
[1] "" "" "3" "34" "345" "3456"

Probably more elegant examples could be constructed that could home in on the issue.

The version is R 2.0.1 on Linux so perhaps it is a little old now.

Questions:

1) Am I misunderstanding the gsub use?

2) Was it a bug that has since been corrected?

3) Is it still a bug in the latest version?

TIA

John

John Logsdon                               "Try to make things as simple
Quantex Research Ltd, Manchester UK         as possible but not simpler"
j.logsdon at quantex-research.com           a.einstein at relativity.org
+44(0)161 445 4951/G:+44(0)7717758675       www.quantex-research.com
You could use something like:

dates <- c("73", "74", "02", "1973", "1974", "2002")
###############
nd <- nchar(dates)
substr(dates, ifelse(nd == 2, 1, 3), nd)

I hope it helps.

Best,
Dimitris

----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://www.med.kuleuven.be/biostat/
     http://www.student.kuleuven.be/~m0390867/dimitris.htm

----- Original Message -----
From: "John Logsdon" <j.logsdon at quantex-research.com>
To: <r-help at stat.math.ethz.ch>
Sent: Sunday, November 27, 2005 11:04 AM
Subject: [R] gsub syntax

> I want to take just the last two digits where it is a 4-digit year
> and both digits when it is a 2-digit year. I should be able to use
> substr but measurement from the string end (with a negative counter
> or something) is not implemented:
>
>> substr(dates,3,4)
> [1] "" "" "" "73" "74" "02"
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
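The nchar/substr idea above extends to "the last n characters" of any string. A minimal sketch, assuming a helper called last_chars (a hypothetical name, not part of base R) built only on nchar, pmax and substr:

```r
# Hypothetical helper (not in base R): return the last n characters of each
# element of x, or the whole element when it is shorter than n.
last_chars <- function(x, n = 2) {
  nd <- nchar(x)
  # pmax() keeps the start index at 1 for strings shorter than n
  substr(x, pmax(nd - n + 1, 1), nd)
}

dates <- c("73", "74", "02", "1973", "1974", "2002")
last_chars(dates)
```

On the example data this should return "73" "74" "02" "73" "74" "02", the same as the ifelse version, without hard-coding the two string lengths.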
John Logsdon wrote:
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year.
>
> So I tried gsub:
>
>>gsub("[19|20]([0-9][0-9])","\\1",dates)
>
> [1] "73" "74" "02" "973" "974" "002"
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?
>
> 2) Was it a bug that has since been corrected?
>
> 3) Is it still a bug in the latest version?
>
> TIA
>
> John

Hi, John,

I cannot comment on your questions since I'm no regexpr guru. However, it seems to me you can do the following instead:

gsub(".*([0-9][0-9])", "\\1", dates)

This works fine on Linux & Windows, R-2.2.0.

HTH,

--sundar
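One way to read the pattern above: ".*" is greedy, so it consumes as much of the string as it can while still leaving two digits for the group, and "\\1" puts only those two digits back. The sketch below checks this on the example data; the "1973 AD" string and the "$" anchor are illustrative assumptions, not part of the original suggestion.

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# Greedy ".*" swallows everything before the final two digits,
# which the group captures and "\\1" puts back.
sub(".*([0-9][0-9])", "\\1", dates)

# Without "$" the match ends at the last pair of digits, so any
# trailing text is left in place.
sub(".*([0-9][0-9])", "\\1", "1973 AD")

# With "$" the two digits must end the string; here nothing matches
# and the input comes back unchanged.
sub(".*([0-9][0-9])$", "\\1", "1973 AD")
```

Since each string needs only one substitution, sub is enough here; gsub gives the same result on these inputs.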
On 11/27/05, John Logsdon <j.logsdon at quantex-research.com> wrote:
> Hello
>
> I know that R's string functions are not as extensive as those of Unix but

I don't think this statement is true, although I have seen it repeated.

> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.

Free versions of these utilities are available for Windows, although they don't come with Windows; e.g. Google for gawk.

> Can anyone explain the following gsub phenomenon to me:
>
> > dates<-c("73","74","02","1973","1974","2002")
>
> > gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> [1] "73" "74" "02" "973" "974" "002"
>
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?
>
> 2) Was it a bug that has since been corrected?
>
> 3) Is it still a bug in the latest version?

It works the same on my system, which is R 2.2.0 patched (2005-10-24) on Windows. At first I too thought it was a bug, but I noticed it works the same in perl, so now I am not sure. The following perl program, run with perl 5.8.6 on Windows, gives 002 as the answer too:

$_ = "2002";
s/[19|20]([0-9])([0-9])/\1\2/g;
print;

In any case, it could be done like this:

sub(".*(..)$", "\\1", dates)

or

substring(dates, nchar(dates)-1)

or the following, which appends -01-01 to the year, converts it to Date class, implicitly converts it back to character and then extracts the 3rd to 4th character of the result:

substring(as.Date(sprintf("%s-01-01", dates)), 3, 4)

or
R is blameless here: it works as documented and in the same way as POSIX tools. It agrees with 'sed' using the same syntax (modulo the shell-specific quoting rules), e.g. in csh

% echo 1973 | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
973
% echo 1973 | sed 's/\([19|20]\)\([0-9][0-9]\)/-\1-\2-/g'
-1-97-3
% echo "73 74 02 1973 1974 2002" | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
73 74 02 973 974 002

so what happened when you were 'comparing with sed'?

"[19|20]" is a character class (containing five characters) matching one character, not a match for two characters as you seem to imagine. It does not mean the same as "19|20", which is what you seem to have intended (and you seem only to want to do the substitution once on each string, so why use gsub?):

> sub("19|20([0-9][0-9])", "\\1", dates)
[1] "73" "74" "02" "73" "74" "02"

A more direct way which would work e.g. for 1837 would be

sub(".*([0-9]{2}$)", "\\1", dates)

or even better (locale-independent)

sub(".*([[:digit:]]{2}$)", "\\1", dates)

Current versions of R have a help page ?regexp explaining what regexps are. Even 2.0.1 did, although you were asked to update *before* posting (see the posting guide). It was unambiguous:

  A _character class_ is a list of characters enclosed by '[' and ']'
  matches any single character in that list ...
              ^^^^^^
  ...
  Note that alternation does not work inside character classes, where
  \code{|} has its literal meaning.

On Sun, 27 Nov 2005, John Logsdon wrote:

> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year. I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:

Why 'should' it work in a different way to that documented?

> So I tried gsub:
>
>> gsub("[19|20]([0-9][0-9])","\\1",dates)
> [1] "73" "74" "02" "973" "974" "002"
>
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?

Yes.

> 2) Was it a bug that has since been corrected?

Unfortunately the bug reported two years ago in

> library(fortunes); fortune("WTFM")

still seems extant. See the posting guide for advice on how to correct it.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
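The difference between a character class and alternation can be checked directly in R. The sketch below is illustrative rather than part of the reply above: it uses only grepl and sub from base R, and the explicit parentheses around the alternation are an assumption about the intended pattern. Without them, "|" splits the whole expression into "19" and "20([0-9][0-9])", which happens to give the same result on these six strings.

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")
probe <- c("1", "9", "|", "2", "0", "19", "20")

# "[19|20]" is a class of five characters (1, 9, |, 2, 0); it matches
# exactly one of them, including a literal "|".
grepl("^[19|20]$", probe)

# "(19|20)" is alternation: it matches the two-character string "19" or "20".
grepl("^(19|20)$", probe)

# With the century in its own group, the year digits are group 2.
# Two-digit inputs do not match and are returned unchanged.
sub("^(19|20)([0-9][0-9])$", "\\2", dates)
```

The first call is TRUE only for the five single characters, the second only for "19" and "20", and the last returns "73" "74" "02" "73" "74" "02".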