> version
               _
platform       x86_64-unknown-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          2
minor          2.1
year           2005
month          12
day            20
svn rev        36812
language       R

> grep("[W-Z]", LETTERS, value = TRUE)
[1] "W" "X" "Y" "Z"

That's what I'd have expected.

> grep("[W-Z]", letters, value = TRUE)
[1] "x" "y" "z"

Not what I'd have thought. However,

> grep("[B-D]", letters, value = TRUE, perl = TRUE)
character(0)

So what is it that standard regular expressions use that's different
from Perl-type ones?

The help file for grep refers to POSIX 1003.2, which looked a bit
daunting to delve into. From my limited reading, it seems there are
different regex "Engine Types", which is getting somewhat tangential
to what I was working on. I could probably avoid problems if I always
set perl=TRUE, but it would be good to know what basic and extended
regular expressions do that's different. If someone has a quick line
or two describing it, I'd be interested to know.

Thanks

-- 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
   ___          Patrick Connolly
 {~._.~}        Great minds discuss ideas
 _( Y )_        Middle minds discuss events
(:_~*~_:)       Small minds discuss people
 (_)-(_)        ..... Anon
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
I get the same thing on "Version 2.3.1 Patched (2006-06-04 r38279)"
but on "R version 2.2.1, 2005-12-20" it gives character(0), as
expected, so there is some change between versions of R. I am on
Windows XP.

On 6/9/06, Patrick Connolly <p_connolly at ihug.co.nz> wrote:
> > grep("[W-Z]", LETTERS, value = TRUE)
> [1] "W" "X" "Y" "Z"
>
> That's what I'd have expected.
>
> > grep("[W-Z]", letters, value = TRUE)
> [1] "x" "y" "z"
>
> Not what I'd have thought. However,
>
> > grep("[B-D]", letters, value = TRUE, perl = TRUE)
> character(0)
?regex does describe this:

     A range of characters may be specified by giving the first and
     last characters, separated by a hyphen.  (Character ranges are
     interpreted in the collation order of the current locale.)

You did not tell us your locale, but based on questions from you in
the past I would guess en_NZ.utf8. In that locale the collation order
is wWxXyYzZ, so your surprise is explained. (It seems the PCRE code is
not using the same ordering in that locale.)

You may find it useful to set LC_COLLATE to C as I do:

> strsplit(Sys.getlocale(), ";")[[1]]
 [1] "LC_CTYPE=en_GB"       "LC_NUMERIC=C"         "LC_TIME=en_GB"
 [4] "LC_COLLATE=C"         "LC_MONETARY=en_GB"    "LC_MESSAGES=en_GB"
 [7] "LC_PAPER=en_GB"       "LC_NAME=C"            "LC_ADDRESS=C"
[10] "LC_TELEPHONE=C"       "LC_MEASUREMENT=en_GB" "LC_IDENTIFICATION=C"

On Sat, 10 Jun 2006, Patrick Connolly wrote:

>> grep("[W-Z]", letters, value = TRUE)
> [1] "x" "y" "z"
>
> Not what I'd have thought. However,
>
>> grep("[B-D]", letters, value = TRUE, perl = TRUE)
> character(0)
>
> So what is it that standard regular expressions use that's different
> from Perl-type ones?

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
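
For anyone who wants a pattern that does not depend on how the session
collates, here is a minimal sketch of three ways around the range
problem described above. It is not taken from the thread: the helper
name `matches` and the variable names are made up for illustration,
and how the default engine treats a range can differ between R
versions and platforms, so the third option is shown without a
promised result.

```r
# Hypothetical helper, not from the thread: return the elements of x that match.
matches <- function(pattern, x, ...) grep(pattern, x, value = TRUE, ...)

# 1. Enumerate the characters, so no range has to be interpreted at all.
explicit_lower <- matches("[WXYZ]", letters)   # lower-case input
explicit_upper <- matches("[WXYZ]", LETTERS)   # upper-case input

# 2. Ask for the PCRE engine, as in the original post.
perl_lower <- matches("[W-Z]", letters, perl = TRUE)

# 3. Set LC_COLLATE to "C" around the call, then restore the old setting.
old_collate <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
c_locale_lower <- matches("[W-Z]", letters)
invisible(Sys.setlocale("LC_COLLATE", old_collate))

print(explicit_lower)   # character(0)
print(explicit_upper)   # "W" "X" "Y" "Z"
print(perl_lower)       # character(0)
print(c_locale_lower)   # depends on the engine; see the note above
```

Enumerating the characters is the most portable of the three, since
it leaves nothing for the locale to decide. Changing LC_COLLATE also
affects sort() and order(), so restoring it afterwards is worthwhile.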