thr3ads.net - R devel - [Rd] Change in grep behavior from 1.9.0 to R-patched [Jun 2004]

Roger D. Peng

2004-Jun-11 16:44 UTC

[Rd] Change in grep behavior from 1.9.0 to R-patched

I've noticed a change in the way grep() behaves between the 1.9.0 
release and a recent R-patched.  On 1.9.0 I get the following output:

 > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
 > length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
[1] 84

And on R-patched (2004-06-11) I get

 > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
 > length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
[1] 13

I can't come up with a simpler example which is why I've posted my 
actual character vector on the web (please let me know if there are 
problems downloading it).

I didn't find anything in the NEWS file that would indicate a change 
and another problem is that I'm not sure which behavior is correct. 
My knowledge of regular expressions is limited.

-roger

Martin Maechler

2004-Jun-11 17:21 UTC

>>>>> "Roger" == Roger D Peng <rpeng@jhsph.edu>
>>>>>     on Fri, 11 Jun 2004 10:43:57 -0400 writes:
    Roger> I've noticed a change in the way grep() behaves between the 1.9.0
    Roger> release and a recent R-patched.  On 1.9.0 I get the following output:

    >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
    >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
    Roger> [1] 84

    Roger> And on R-patched (2004-06-11) I get

    >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
    >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
    Roger> [1] 13

I can reproduce this exactly.

    <....>

    Roger> I didn't find anything in the NEWS file that would indicate a change

Yes: The src/extras/pcre/ (Perl Compatible Regular Expressions)
     library was upgraded, and since we assumed that wouldn't
     have any effect --- as we now see, too optimistically ---
     it wasn't documented in NEWS.

    Roger> and another problem is that I'm not sure which behavior is correct.
    Roger> My knowledge of regular expressions is limited.

The first one is correct, I think: '\w' means word constituents
(see below), and for 1.9.0 you get

 > grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
  [1] "l1pm10tmean"  "l1pm25tmean"  "l1cotmean"    "l1no2tmean"   "l1so2tmean"
  [6] "l1o3tmean"    "l2pm10tmean"  "l2pm25tmean"  "l2cotmean"    "l2no2tmean"
 [11] "l2so2tmean"   "l2o3tmean"    "l3pm10tmean"  "l3pm25tmean"  "l3cotmean"
 [16] "l3no2tmean"   "l3so2tmean"   "l3o3tmean"    "l4pm10tmean"  "l4pm25tmean"
 [21] "l4cotmean"    "l4no2tmean"   "l4so2tmean"   "l4o3tmean"    "l5pm10tmean"
 [26] "l5pm25tmean"  "l5cotmean"    "l5no2tmean"   "l5so2tmean"   "l5o3tmean"
 [31] "l6pm10tmean"  "l6pm25tmean"  "l6cotmean"    "l6no2tmean"   "l6so2tmean"
 [36] "l6o3tmean"    "l7pm10tmean"  "l7pm25tmean"  "l7cotmean"    "l7no2tmean"
 [41] "l7so2tmean"   "l7o3tmean"    "lm1pm10tmean" "lm1pm25tmean" "lm1cotmean"
 [46] "lm1no2tmean"  "lm1so2tmean"  "lm1o3tmean"   "lm2pm10tmean" "lm2pm25tmean"
 [51] "lm2cotmean"   "lm2no2tmean"  "lm2so2tmean"  "lm2o3tmean"   "lm3pm10tmean"
 [56] "lm3pm25tmean" "lm3cotmean"   "lm3no2tmean"  "lm3so2tmean"  "lm3o3tmean"
 [61] "lm4pm10tmean" "lm4pm25tmean" "lm4cotmean"   "lm4no2tmean"  "lm4so2tmean"
 [66] "lm4o3tmean"   "lm5pm10tmean" "lm5pm25tmean" "lm5cotmean"   "lm5no2tmean"
 [71] "lm5so2tmean"  "lm5o3tmean"   "lm6pm10tmean" "lm6pm25tmean" "lm6cotmean"
 [76] "lm6no2tmean"  "lm6so2tmean"  "lm6o3tmean"   "lm7pm10tmean" "lm7pm25tmean"
 [81] "lm7cotmean"   "lm7no2tmean"  "lm7so2tmean"  "lm7o3tmean"
 > 

which is correct AFAICS and shouldn't be shortened to only the 13 elements

 > grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
 [1] "l1pm10tmean" "l1pm25tmean" "l1cotmean"   "l1no2tmean"  "l1so2tmean"
 [6] "l1o3tmean"   "l2pm10tmean" "l2pm25tmean" "l2cotmean"   "l2no2tmean"
[11] "l2so2tmean"  "l2o3tmean"   "l3pm10tmean"

in R-patched.

------------

For me,  'man perlre' contains
>>         \w  Match a "word" character (alphanumeric plus "_")
         <......>
>>     A "\w" matches a single alphanumeric character or "_", not a whole
>>     word.  Use "\w+" to match a string of Perl-identifier characters (which
>>     isn't the same as matching an English word).  If "use locale" is in
>>     effect, the list of alphabetic characters generated by "\w" is taken
>>     from the current locale.  See the perllocale manpage. .......
so it may well be connected to locale problems.  But I don't
think any locale should have "l2pm25tmean" matched by '^l\w+tmean'
and yet fail to match "lm5pm25tmean".

[If making a difference between these two, it should rather be
 the other way round].

Martin Maechler

Prof Brian Ripley

2004-Jun-11 17:28 UTC

This is actually PCRE.  Something is wrong with your build of R-patched
(1.9.1 alpha, I assume): I get 84 everywhere.  You are asking for a first
character l, then one or more characters of `word' then tmean.  In your
example this is the same as (in a suitable locale, including C)

length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))

which each give 84.
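
A minimal sketch of this comparison, assuming a small hypothetical vector `x_small` in place of the real names. One name containing "_" is added to show that the equivalence holds only for data like that in the example: the plain [A-Za-z0-9] class has no underscore, so the patterns can disagree on other input.

```r
# x_small is a hypothetical stand-in for the names in the thread, plus one
# name containing "_" to show where the three patterns stop being equivalent.
x_small <- c("l1pm10tmean", "lm7o3tmean", "tmean", "l1pm10", "l_tmean")

patterns <- c(
  word  = "^l\\w+tmean",           # \w: letter, digit or "_"
  ascii = "^l[A-Za-z0-9]+tmean",   # no "_" in this class
  posix = "^l[[:alnum:]_]+tmean"   # POSIX class plus "_"
)

# Count the matching elements for each pattern
counts <- sapply(patterns, function(p) length(grep(p, x_small, perl = TRUE)))
print(counts)

# Names matched by \w+ but not by the plain ASCII class
only_word <- setdiff(
  grep(patterns[["word"]],  x_small, perl = TRUE, value = TRUE),
  grep(patterns[["ascii"]], x_small, perl = TRUE, value = TRUE)
)
print(only_word)
```

On names made up only of letters and digits, as in the posted vector, all three counts should agree.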

One issue: PCRE is locale-dependent.  Did you use the same locale for 
each?  What happens if you force LANG=C?
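
One way to try this from inside a running session is sketched below. It assumes that switching LC_CTYPE with Sys.setlocale() is a reasonable stand-in for starting R with LANG=C, which may not hold on every platform or build; `x_small` and `count_hits` are hypothetical names used only for illustration.

```r
# Hypothetical stand-in for the real character vector
x_small <- c("l1pm10tmean", "lm7o3tmean", "tmean")

# Helper (hypothetical): number of elements matching the pattern from the thread
count_hits <- function(x) length(grep("^l\\w+tmean", x, perl = TRUE))

# Count in whatever locale the session started with
old_ctype <- Sys.getlocale("LC_CTYPE")
n_current <- count_hits(x_small)

# Switch character classification to the C locale and count again
invisible(Sys.setlocale("LC_CTYPE", "C"))
n_c <- count_hits(x_small)

# Put the original setting back
invisible(Sys.setlocale("LC_CTYPE", old_ctype))

print(c(current = n_current, C = n_c))
```

If the two counts differ on the real data, the locale is a likely suspect; if they agree, the cause probably lies elsewhere.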

(I've just checked an R-devel Solaris system.  This gave 13 on a build 
from Weds, and 84 when remade today.  The result with 13 seems truncated, 
as they are the first 13.  Might be coincidental, of course.)
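
The truncation idea can be checked mechanically by testing whether the short result is exactly the leading run of the full one. A minimal sketch, assuming hypothetical vectors `full_hits` and `short_hits` standing in for the two grep() results:

```r
# Hypothetical stand-ins for the full (84) and short (13) results
full_hits  <- c("l1pm10tmean", "l1pm25tmean", "l1cotmean",
                "l1no2tmean", "l1so2tmean")
short_hits <- full_hits[1:3]

# TRUE when `short` equals the first length(short) elements of `full`
is_leading_run <- function(short, full) {
  length(short) <= length(full) &&
    identical(short, full[seq_along(short)])
}

print(is_leading_run(short_hits, full_hits))
# A result that skips an element is not a leading run
print(is_leading_run(c("l1pm10tmean", "l1cotmean"), full_hits))
```

A TRUE here is consistent with truncation but does not prove it, since the matches could still stop at that point for some other reason.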

On Fri, 11 Jun 2004, Roger D. Peng wrote:
> I've noticed a change in the way grep() behaves between the 1.9.0 
> release and a recent R-patched.  On 1.9.0 I get the following output:
> 
>  > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>  > length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
> [1] 84
> 
> And on R-patched (2004-06-11) I get
> 
>  > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>  > length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
> [1] 13
> 
> I can't come up with a simpler example which is why I've posted my 
> actual character vector on the web (please let me know if there are 
> problems downloading it).
> 
> I didn't find anything in the NEWS file that would indicate a change 
No change is intended and the underlying C code is unchanged.
> and another problem is that I'm not sure which behavior is correct. 
> My knowledge of regular expressions is limited.
-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Martin Maechler

2004-Jun-11 17:28 UTC

I forgot to add

  Thank you very much for
  - starting to use R-patched and hence testing it
  - providing a nicely reproducible example

Everyone else: do follow Roger!

Thanks again!
Martin
