Hello

I know that R's string functions are not as extensive as those of Unix but
I need to do some text handling totally within an R environment because
the target is a Windows system which will not have the corresponding shell
utilities, sed, awk etc.

Can anyone explain the following gsub phenomenon to me:

> dates<-c("73","74","02","1973","1974","2002")

I want to take just the last two digits where it is a 4-digit year and
both digits when it is a 2-digit year. I should be able to use substr but
measurement from the string end (with a negative counter or something) is
not implemented:

> substr(dates,3,4)
[1] "" "" "" "73" "74" "02"
> substr(dates,-2,4)
[1] "73" "74" "02" "1973" "1974" "2002"
> substr(dates,4,-2)
[1] "" "" "" "" "" ""

So I tried gsub:

> gsub("[19|20]([0-9][0-9])","\\1",dates)
[1] "73" "74" "02" "973" "974" "002"

As I understand it (and comparing with sed), the \\1 should take the first
bracketed string but clearly this doesn't work. If I try what should also
work:

> gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
[1] "73" "74" "02" "973" "974" "002"

On the other hand the following does work:

> gsub("[19|20]([0-9])([0-9])","\\2",dates)
[1] "73" "74" "02" "73" "74" "02"

So it appears that the substitution takes one character extra to the left
but the following indicates that the lower limit of the selected range is
also at fault:

> s<-c("1","12","123","1234","12345","123456")
> gsub("[12]([4-6]*)","",s)
[1] "" "" "3" "34" "345" "3456"

Probably more elegant examples could be constructed that could home in on
the issue.

The version is R 2.0.1 on Linux so perhaps it is a little old now.

Questions:

1) Am I misunderstanding the gsub use?

2) Was it a bug that has since been corrected?

3) Is it still a bug in the latest version?

TIA

John

John Logsdon                               "Try to make things as simple
Quantex Research Ltd, Manchester UK         as possible but not simpler"
j.logsdon at quantex-research.com           a.einstein at relativity.org
+44(0)161 445 4951/G:+44(0)7717758675       www.quantex-research.com
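You could use something like:

dates <- c("73", "74", "02", "1973", "1974", "2002")

# number of characters in each string
nd <- nchar(dates)
# start at character 1 for 2-digit years and at character 3 for 4-digit years
substr(dates, ifelse(nd == 2, 1, 3), nd)
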
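The substr() call above can be generalised to "the last n characters". What follows is only a sketch: the helper name last_chars and its argument n are made up for illustration and do not come from the thread. It uses nothing beyond nchar(), substr() and pmax() from base R.

```r
# Hypothetical helper (not from the thread): last n characters of each string.
last_chars <- function(x, n = 2) {
  len <- nchar(x)
  # pmax() keeps the start index at 1 for strings shorter than n characters
  substr(x, pmax(len - n + 1, 1), len)
}

dates <- c("73", "74", "02", "1973", "1974", "2002")
last_chars(dates)
# [1] "73" "74" "02" "73" "74" "02"
```

Because the start position is computed from nchar(), this does not need substr() to accept negative indices.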
I hope it helps.
Best,
Dimitris
----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven
Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://www.med.kuleuven.be/biostat/
http://www.student.kuleuven.be/~m0390867/dimitris.htm
----- Original Message -----
From: "John Logsdon" <j.logsdon at quantex-research.com>
To: <r-help at stat.math.ethz.ch>
Sent: Sunday, November 27, 2005 11:04 AM
Subject: [R] gsub syntax
John Logsdon wrote:
> Can anyone explain the following gsub phenomenon to me:
>
> >>dates<-c("73","74","02","1973","1974","2002")
>
> [...]
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?
>
> 2) Was it a bug that has since been corrected?
>
> 3) Is it still a bug in the latest version?
>
> TIA
>
> JOhn
>

Hi, John,

I cannot comment on your questions since I'm no regexpr guru. However, it
seems to me you can do the following instead:

gsub(".*([0-9][0-9])", "\\1", dates)

This works fine on Linux & Windows, R-2.2.0.

HTH,

--sundar
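A short sketch of why the ".*" pattern above picks out the last two digits. The input "1973 AD" is an invented example, not from the thread, used to show what happens when text follows the year.

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# ".*" is greedy: it takes as much as it can while still leaving two
# digits for the parenthesised group, so the group is the LAST digit pair.
sub(".*([0-9][0-9])", "\\1", dates)
# [1] "73" "74" "02" "73" "74" "02"

# Without a "$" anchor, text after the last digit pair is outside the match
# and is therefore left in place.
sub(".*([0-9][0-9])", "\\1", "1973 AD")
# [1] "73 AD"

# With "$" the pattern must end at the end of the string; here it cannot,
# so nothing is replaced.
sub(".*([0-9][0-9])$", "\\1", "1973 AD")
# [1] "1973 AD"
```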
On 11/27/05, John Logsdon <j.logsdon at quantex-research.com> wrote:
> Hello
>
> I know that R's string functions are not as extensive as those of Unix but

I don't think this statement is true although I have seen it repeated.

> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.

Free versions of these utilities are available for Windows although they
don't come with Windows, e.g. Google for gawk.

> Can anyone explain the following gsub phenomenon to me:
>
> > dates<-c("73","74","02","1973","1974","2002")
>
> [...]
>
> > gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> [1] "73" "74" "02" "973" "974" "002"
>
> [...]
>
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?
>
> 2) Was it a bug that has since been corrected?
>
> 3) Is it still a bug in the latest version?
>

It works the same on my system, which is 2.2.0 Windows patched
(2005-10-24). At first I too thought it was a bug but I noticed it works
the same in perl so now I am not sure. The following perl program, using
perl 5.8.6 on Windows, gives 002 as the answer too:

$_ = "2002";
s/[19|20]([0-9])([0-9])/\1\2/g;
print;

In any case, it could be done like this:

sub(".*(..)$", "\\1", dates)

or

substring(dates, nchar(dates)-1)

or the following, which appends -01-01 to the year, converts it to Date
class, implicitly converts it back to character and then extracts the 3rd
to 4th character of the result:

substring(as.Date(sprintf("%s-01-01", dates)), 3, 4)

or
R is blameless here: it works as documented and in the same way as
POSIX tools. It agrees with 'sed' using the same syntax (modulo the
shell-specific quoting rules) e.g. in csh
% echo 1973 | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
973
% echo 1973 | sed 's/\([19|20]\)\([0-9][0-9]\)/-\1-\2-/g'
-1-97-3
% echo "73 74 02 1973 1974 2002" | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
73 74 02 973 974 002
so what happened when you were 'comparing with sed'?
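For comparison, the first two sed runs above can be written as R calls. This is a sketch assuming R's default (extended) regular expressions, where groups are written ( ) rather than \( \) and the backslash in the replacement is doubled inside an R string.

```r
# Same pattern as the first sed command: the class consumes one character
# ("1"), the group captures "97", and the trailing "3" is never matched.
gsub("[19|20]([0-9][0-9])", "\\1", "1973")
# [1] "973"

# Same as the second sed command: printing both groups shows that the
# class matched the single character "1", not "19".
gsub("([19|20])([0-9][0-9])", "-\\1-\\2-", "1973")
# [1] "-1-97-3"
```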
"[19|20]" is a character class (containing five characters) matching one
character, not a match for two characters as you seem to imagine. It does
not mean the same as "19|20", which is what you seem to have intended (and
you seem only to want to do the substitution once on each string, so why
use gsub?):
> sub("19|20([0-9][0-9])", "\\1", dates)
[1] "73" "74" "02" "73" "74" "02"
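One detail worth spelling out about the pattern just shown: alternation has the lowest precedence, so "19|20([0-9][0-9])" reads as "19" OR "20 followed by two digits". The sketch below contrasts it with a grouped and anchored variant; that variant and the extra input "19" are illustrations added here, not taken from the thread.

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# Ungrouped, as in the thread. For "1973" the first alternative matches
# "19", group 1 is unset, and the match is replaced by nothing.
sub("19|20([0-9][0-9])", "\\1", dates)
# [1] "73" "74" "02" "73" "74" "02"

# Grouped and anchored (an assumption about what was intended): a century
# prefix followed by exactly two digits, keeping the second group.
sub("^(19|20)([0-9][0-9])$", "\\2", dates)
# [1] "73" "74" "02" "73" "74" "02"

# The two differ on a 2-digit year that happens to be "19".
sub("19|20([0-9][0-9])", "\\1", "19")
# [1] ""
sub("^(19|20)([0-9][0-9])$", "\\2", "19")
# [1] "19"
```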
A more direct way which would work e.g. for 1837 would be
sub(".*([0-9]{2}$)", "\\1", dates)
or even better (locale-independent)
sub(".*([[:digit:]]{2}$)", "\\1", dates)
Current versions of R have a help page ?regexp explaining what regexps
are. Even 2.0.1 did, although you were asked to update *before* posting
(see the posting guide). It was unambiguous:
     A _character class_ is a list of characters enclosed by '[' and
     ']' matches any single character in that list ...
                     ^^^^^^
     ... Note that alternation does not work inside character classes,
     where \code{|} has its literal meaning.
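The quoted help text can be checked directly. This is a sketch using regmatches() and gregexpr() from base R (available in current versions of R, not in 2.0.1); the test string "19|20" is chosen here only so that every member of the class appears once.

```r
# Every member of the class matches as a single character, including "|".
regmatches("19|20", gregexpr("[19|20]", "19|20"))[[1]]
# [1] "1" "9" "|" "2" "0"

# The same reading explains the second example in the original post:
# [12] removes one "1" or "2", and ([4-6]*) may match zero characters,
# so each "1" and "2" is deleted and the rest is kept.
s <- c("1", "12", "123", "1234", "12345", "123456")
gsub("[12]([4-6]*)", "", s)
# [1] ""     ""     "3"    "34"   "345"  "3456"
```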
On Sun, 27 Nov 2005, John Logsdon wrote:
> Hello
>
> I know that R's string functions are not as extensive as those of Unix but
> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.
>
> Can anyone explain the following gsub phenomenon to me:
>
>> dates<-c("73","74","02","1973","1974","2002")
>
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year. I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:

Why 'should' it work in a different way to that documented?

>> substr(dates,3,4)
> [1] "" "" "" "73" "74" "02"
>> substr(dates,-2,4)
> [1] "73" "74" "02" "1973" "1974" "2002"
>> substr(dates,4,-2)
> [1] "" "" "" "" "" ""
>
> So I tried gsub:
>
>> gsub("[19|20]([0-9][0-9])","\\1",dates)
> [1] "73" "74" "02" "973" "974" "002"
>
> As I understand it (and comparing with sed), the \\1 should take the first
> bracketed string but clearly this doesn't work.
> If I try what should also work:
>
>> gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> [1] "73" "74" "02" "973" "974" "002"
>
> On the other hand the following does work:
>
>> gsub("[19|20]([0-9])([0-9])","\\2",dates)
> [1] "73" "74" "02" "73" "74" "02"
>
> So it appears that the substitution takes one character extra to the left
> but the following indicates that the lower limit of the selected range is
> also at fault:
>
>> s<-c("1","12","123","1234","12345","123456")
>> gsub("[12]([4-6]*)","",s)
> [1] "" "" "3" "34" "345" "3456"
>
> Probably more elegant examples could be constructed that could home in on
> the issue.
>
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?

Yes.
> 2) Was it a bug that has since been corrected?
Unfortunately the bug reported two years ago in
> library(fortunes); fortune("WTFM")
still seems extant. See the posting guide for advice on how to correct it.
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595