thr3ads.net - R help - [R] gsub syntax [Nov 2005]

John Logsdon

2005-Nov-27 10:04 UTC

[R] gsub syntax

Hello

I know that R's string functions are not as extensive as those of Unix but
I need to do some text handling totally within an R environment because
the target is a Windows system which will not have the corresponding shell
utilities, sed, awk etc.

Can anyone explain the following gsub phenomenon to me:
> dates<-c("73","74","02","1973","1974","2002")

I want to take just the last two digits where it is a 4-digit year and
both digits when it is a 2-digit year.  I should be able to use substr but
measurement from the string end (with a negative counter or something) is
not implemented:
> substr(dates,3,4)
[1] ""   ""   ""   "73" "74" "02"
> substr(dates,-2,4)
[1] "73"   "74"   "02"   "1973" "1974" "2002"
> substr(dates,4,-2)
[1] "" "" "" "" "" ""

So I tried gsub:
> gsub("[19|20]([0-9][0-9])","\\1",dates)
[1] "73"  "74"  "02"  "973" "974" "002"

As I understand it (and comparing with sed), the \\1 should take the first
bracketed string but clearly this doesn't work.  If I try what should also
work:
> gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
[1] "73"  "74"  "02"  "973" "974" "002"

On the other hand the following does work:
> gsub("[19|20]([0-9])([0-9])","\\2",dates)
[1] "73" "74" "02" "73" "74" "02"

So it appears that the substitution takes one character extra to the left
but the following indicates that the lower limit of the selected range is
also at fault:
> s<-c("1","12","123","1234","12345","123456")
> gsub("[12]([4-6]*)","",s)
[1] ""     ""     "3"    "34"   "345"  "3456"

Probably more elegant examples could be constructed that could home in on
the issue.

The version is R 2.0.1 on Linux so perhaps it is a little old now.

Questions:

1) Am I misunderstanding the gsub use?

2) Was it a bug that has since been corrected?

3) Is it still a bug in the latest version?

TIA

John

John Logsdon                               "Try to make things as simple
Quantex Research Ltd, Manchester UK         as possible but not simpler"
j.logsdon at quantex-research.com              a.einstein at relativity.org
+44(0)161 445 4951/G:+44(0)7717758675       www.quantex-research.com

Dimitris Rizopoulos

2005-Nov-27 10:20 UTC

you could use something like:

dates <- c("73", "74", "02", "1973", "1974", "2002")
###############
nd <- nchar(dates)
substr(dates, ifelse(nd == 2, 1, 3), nd)
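
The same idea generalises to "the last n characters of each string". The following is only a sketch: `last_chars` is a hypothetical helper name, not a base R function, and it assumes a character vector as input.

```r
# Hypothetical helper (not part of base R): last n characters of each string.
last_chars <- function(x, n = 2) {
  len <- nchar(x)
  # pmax() keeps the start position at 1 for strings shorter than n
  substr(x, pmax(len - n + 1, 1), len)
}

dates <- c("73", "74", "02", "1973", "1974", "2002")
last_chars(dates)
# [1] "73" "74" "02" "73" "74" "02"
```

Because the start position is computed from `nchar()`, no special case for 2-digit versus 4-digit years is needed.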


I hope it helps.

Best,
Dimitris

----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://www.med.kuleuven.be/biostat/
     http://www.student.kuleuven.be/~m0390867/dimitris.htm



Sundar Dorai-Raj

2005-Nov-27 10:37 UTC

Hi, John,

I cannot comment on your questions since I'm no regexpr guru. However, 
it seems to me you can do the following instead:

gsub(".*([0-9][0-9])", "\\1", dates)

This works fine on Linux & Windows, R-2.2.0.
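
A short sketch of why this pattern does the job, assuming the default regular expression engine: `.*` is greedy, so it consumes as many leading characters as it can while still leaving two digits for the group. The input `"1973AD"` below is an illustrative value that does not appear in the thread.

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# ".*" takes everything up to the last two consecutive digits;
# the group keeps those two digits.
res <- sub(".*([0-9][0-9])", "\\1", dates)
res
# [1] "73" "74" "02" "73" "74" "02"

# The pattern is not anchored at the end of the string, so any
# trailing non-digit text is left in place.
trailing <- sub(".*([0-9][0-9])", "\\1", "1973AD")
trailing
# [1] "73AD"
```

For 2-digit inputs `.*` matches the empty string, so the result is unchanged.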

HTH,

--sundar

Gabor Grothendieck

2005-Nov-27 14:50 UTC

On 11/27/05, John Logsdon <j.logsdon at quantex-research.com> wrote:
> Hello
>
> I know that R's string functions are not as extensive as those of Unix but

I don't think this statement is true although I have seen it repeated.

> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.

Free versions of these utilities are available for Windows although they
don't come with Windows, e.g. Google for gawk.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?
>
> 2) Was it a bug that has since been corrected?
>
> 3) Is it still a bug in the latest version?
>
It works the same on my system, which is R 2.2.0 patched for Windows
(2005-10-24). At first I too thought it was a bug, but I noticed it
works the same in Perl, so now I am not sure. The following Perl
program, run with perl 5.8.6 on Windows, also gives 002 as the answer:

   $_ = "2002";
   s/[19|20]([0-9])([0-9])/\1\2/g;
   print;

In any case, it could be done like this:

   sub(".*(..)$", "\\1", dates)

or

   substring(dates, nchar(dates)-1)

or the following which appends -01-01 to the year, converts it to Date
class, implicitly converts it back to character and then extracts
the 3rd to 4th character of the result:

   substring(as.Date(sprintf("%s-01-01", dates)), 3, 4)

or

Prof Brian Ripley

2005-Nov-27 16:41 UTC

R is blameless here: it works as documented and in the same way as 
POSIX tools.  It agrees with 'sed' using the same syntax (modulo the 
shell-specific quoting rules) e.g. in csh

    % echo 1973 | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
    973
    % echo 1973 | sed 's/\([19|20]\)\([0-9][0-9]\)/-\1-\2-/g'
    -1-97-3
    % echo "73 74 02 1973 1974 2002" | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
    73 74 02 973 974 002

so what happened when you were 'comparing with sed'?

"[19|20]" is a character class (containing five characters) matching one
character, not a match for two characters as you seem to imagine.  It does
not mean the same as "19|20", which is what you seem to have intended (and
you seem only to want to do the substitution once on each string, so why
use gsub?):

> sub("19|20([0-9][0-9])", "\\1", dates)
[1] "73" "74" "02" "73" "74" "02"

A more direct way which would work e.g. for 1837 would be

sub(".*([0-9]{2}$)", "\\1", dates)

or even better (locale-independent)

sub(".*([[:digit:]]{2}$)", "\\1", dates)
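
A small sketch of the difference, using the year 1837 mentioned above. The vector `years` is illustrative and does not appear in the original post.

```r
years <- c("1837", "37", "2002")

# The alternation pattern only knows about the prefixes 19 and 20,
# so "1837" passes through unchanged.
by_prefix <- sub("19|20([0-9][0-9])", "\\1", years)
by_prefix
# [1] "1837" "37"   "02"

# Anchoring on the end of the string keeps the last two digits
# regardless of the century.
by_suffix <- sub(".*([[:digit:]]{2}$)", "\\1", years)
by_suffix
# [1] "37" "37" "02"
```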

Current versions of R have a help page ?regexp explaining what regexps 
are.  Even 2.0.1 did, although you were asked to update *before* posting 
(see the posting guide).  It was unambiguous:

    A _character class_ is a list of characters enclosed by '[' and
    ']' matches any single character in that list ...
                    ^^^^^^
    ...  Note that alternation does not work inside character classes,
    where \code{|} has its literal meaning.
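
The point of that passage can be checked directly. This is a sketch only: the `^` and `$` anchors and the grouped form `(19|20)` are additions for illustration and are not patterns from the original post.

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# "[19|20]" is a set of single characters: 1, 9, |, 2 and 0.
# Each one-character string matches; the two-character strings do not.
single <- grepl("^[19|20]$", c("1", "9", "|", "2", "0", "19", "20"))
single
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

# Alternation belongs in parentheses, which also limit the scope of "|".
# Group 1 is the century prefix, group 2 the two digits to keep.
grouped <- sub("^(19|20)([0-9][0-9])$", "\\2", dates)
grouped
# [1] "73" "74" "02" "73" "74" "02"
```

The 2-digit entries do not match the anchored pattern at all, so `sub` returns them unchanged.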


On Sun, 27 Nov 2005, John Logsdon wrote:
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year.  I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:
Why 'should' it work in a different way to that documented?
> Questions:
>
> 1) Am I misunderstanding the gsub use?
Yes.
> 2) Was it a bug that has since been corrected?
Unfortunately the bug reported two years ago in
> library(fortunes); fortune("WTFM")
still seems extant.  See the posting guide for advice on how to correct 
it.


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
