I've noticed a change in the way grep() behaves between the 1.9.0 release and a recent R-patched. On 1.9.0 I get the following output:

> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
[1] 84

And on R-patched (2004-06-11) I get

> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
[1] 13

I can't come up with a simpler example, which is why I've posted my actual character vector on the web (please let me know if there are problems downloading it).

I didn't find anything in the NEWS file that would indicate a change, and another problem is that I'm not sure which behavior is correct. My knowledge of regular expressions is limited.

-roger
Martin Maechler
2004-Jun-11 17:21 UTC
[Rd] Change in grep behavior from 1.9.0 to R-patched
>>>>> "Roger" == Roger D Peng <rpeng@jhsph.edu>
>>>>>     on Fri, 11 Jun 2004 10:43:57 -0400 writes:

    Roger> I've noticed a change in the way grep() behaves between the 1.9.0
    Roger> release and a recent R-patched. On 1.9.0 I get the following output:

    >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
    >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
    Roger> [1] 84

    Roger> And on R-patched (2004-06-11) I get

    >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
    >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
    Roger> [1] 13

I can reproduce this exactly.

<....>

    Roger> I didn't find anything in the NEWS file that would indicate a change

Yes: the src/extras/pcre/ (Perl Compatible Regular Expressions) library was upgraded, and since we assumed that wouldn't have any effect (too optimistically, as we now see), it wasn't documented in NEWS.

    Roger> and another problem is that I'm not sure which behavior is correct.
    Roger> My knowledge of regular expressions is limited.
The first one is correct, I think: '\w' means word constituents (see below), and for 1.9.0 you get

> grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
 [1] "l1pm10tmean" "l1pm25tmean" "l1cotmean" "l1no2tmean" "l1so2tmean"
 [6] "l1o3tmean" "l2pm10tmean" "l2pm25tmean" "l2cotmean" "l2no2tmean"
[11] "l2so2tmean" "l2o3tmean" "l3pm10tmean" "l3pm25tmean" "l3cotmean"
[16] "l3no2tmean" "l3so2tmean" "l3o3tmean" "l4pm10tmean" "l4pm25tmean"
[21] "l4cotmean" "l4no2tmean" "l4so2tmean" "l4o3tmean" "l5pm10tmean"
[26] "l5pm25tmean" "l5cotmean" "l5no2tmean" "l5so2tmean" "l5o3tmean"
[31] "l6pm10tmean" "l6pm25tmean" "l6cotmean" "l6no2tmean" "l6so2tmean"
[36] "l6o3tmean" "l7pm10tmean" "l7pm25tmean" "l7cotmean" "l7no2tmean"
[41] "l7so2tmean" "l7o3tmean" "lm1pm10tmean" "lm1pm25tmean" "lm1cotmean"
[46] "lm1no2tmean" "lm1so2tmean" "lm1o3tmean" "lm2pm10tmean" "lm2pm25tmean"
[51] "lm2cotmean" "lm2no2tmean" "lm2so2tmean" "lm2o3tmean" "lm3pm10tmean"
[56] "lm3pm25tmean" "lm3cotmean" "lm3no2tmean" "lm3so2tmean" "lm3o3tmean"
[61] "lm4pm10tmean" "lm4pm25tmean" "lm4cotmean" "lm4no2tmean" "lm4so2tmean"
[66] "lm4o3tmean" "lm5pm10tmean" "lm5pm25tmean" "lm5cotmean" "lm5no2tmean"
[71] "lm5so2tmean" "lm5o3tmean" "lm6pm10tmean" "lm6pm25tmean" "lm6cotmean"
[76] "lm6no2tmean" "lm6so2tmean" "lm6o3tmean" "lm7pm10tmean" "lm7pm25tmean"
[81] "lm7cotmean" "lm7no2tmean" "lm7so2tmean" "lm7o3tmean"
>

which is correct AFAICS and shouldn't be shortened to only the 13 elements

> grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
 [1] "l1pm10tmean" "l1pm25tmean" "l1cotmean" "l1no2tmean" "l1so2tmean"
 [6] "l1o3tmean" "l2pm10tmean" "l2pm25tmean" "l2cotmean" "l2no2tmean"
[11] "l2so2tmean" "l2o3tmean" "l3pm10tmean"

in R-patched.

------------

For me, 'man perlre' contains

>> \w  Match a "word" character (alphanumeric plus "_")

<......>

>> A "\w" matches a single alphanumeric character or "_", not a whole
>> word. Use "\w+" to match a string of Perl-identifier characters (which
>> isn't the same as matching an English word). If "use locale" is in
>> effect, the list of alphabetic characters generated by "\w" is taken
>> from the current locale. See the perllocale manpage.

.......

so it may well be connected to locale problems. But I don't think any locale should have "l2pm25tmean" matched by '^l\w+tmean' but not match "lm5pm25tmean". [If it made a difference between these two, it should rather be the other way round.]

Martin Maechler
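As a quick check that does not depend on downloading the full vector, the point about '\w+' can be tried on a handful of names. This is only a minimal sketch: the first three names are copied from the output quoted above, while "ltmean" is a hypothetical name added here for contrast (it has nothing between the leading "l" and "tmean"), and `sample_names` and `hits` are illustrative variable names, not part of the original report.

```r
# Three names taken from the output quoted in this thread, plus one
# hypothetical name ("ltmean") that should NOT match, added for contrast.
sample_names <- c("l2pm25tmean", "lm5pm25tmean", "l1cotmean", "ltmean")

# "\\w+" requires at least one word character (alphanumeric or "_")
# between the leading "l" and the trailing "tmean".
hits <- grep("^l\\w+tmean", sample_names, perl = TRUE, value = TRUE)
print(hits)
```

On a correctly working build, both "l2pm25tmean" and "lm5pm25tmean" should be returned, and "ltmean" should not.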
Prof Brian Ripley
2004-Jun-11 17:28 UTC
[Rd] Change in grep behavior from 1.9.0 to R-patched
This is actually PCRE.

Something is wrong with your build of R-patched (1.9.1 alpha, I assume): I get 84 everywhere.

You are asking for a first character l, then one or more characters of `word', then tmean. In your example this is the same as (in a suitable locale, including C)

length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))

which each give 84.

One issue: PCRE is locale-dependent. Did you use the same locale for each? What happens if you force LANG=C?

(I've just checked an R-devel Solaris system. This gave 13 on a build from Weds, and 84 when remade today. The result with 13 seems truncated, as they are the first 13. Might be coincidental, of course.)

On Fri, 11 Jun 2004, Roger D. Peng wrote:

> I've noticed a change in the way grep() behaves between the 1.9.0
> release and a recent R-patched. On 1.9.0 I get the following output:
>
> > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> > length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
> [1] 84
>
> And on R-patched (2004-06-11) I get
>
> > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> > length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
> [1] 13
>
> I can't come up with a simpler example which is why I've posted my
> actual character vector on the web (please let me know if there are
> problems downloading it).
>
> I didn't find anything in the NEWS file that would indicate a change

No change is intended and the underlying C code is unchanged.

> and another problem is that I'm not sure which behavior is correct.
> My knowledge of regular expressions is limited.

--
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
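The equivalence of the three patterns, and the suggestion to rule out locale effects, can be sketched on a small sample as follows. This is an illustration under stated assumptions, not a diagnosis of the reported build: the first three names are copied from the output earlier in the thread, "pm10tmean" is a hypothetical non-matching name, and the C locale is forced from inside R with Sys.setlocale() rather than by setting LANG=C in the environment before starting R.

```r
# Three names from the thread plus one hypothetical name ("pm10tmean")
# that lacks the leading "l" and so should not match any pattern.
sample_names <- c("l1pm10tmean", "lm7o3tmean", "l3no2tmean", "pm10tmean")

# The original pattern and the two spellings suggested as equivalent
# for this data (none of the names contains "_").
patterns <- c("^l\\w+tmean",
              "^l[A-Za-z0-9]+tmean",
              "^l[[:alnum:]_]+tmean")

# Force the C locale for character classification, remembering the old one.
old_ctype <- Sys.getlocale("LC_CTYPE")
Sys.setlocale("LC_CTYPE", "C")

# Count the matches for each pattern; the counts should agree.
counts <- vapply(patterns,
                 function(p) length(grep(p, sample_names, perl = TRUE)),
                 integer(1))
print(counts)

# Restore the previous locale setting.
Sys.setlocale("LC_CTYPE", old_ctype)
```

If the three counts differ on a given build, or change when the locale is switched, that would point towards the regular expression library or the locale handling rather than the pattern itself.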