thr3ads.net - R help - [R] Regex engine types [Jun 2006]

Patrick Connolly

2006-Jun-10 02:46 UTC

[R] Regex engine types

> version
         _
platform x86_64-unknown-linux-gnu
arch     x86_64
os       linux-gnu
system   x86_64, linux-gnu
status
major    2
minor    2.1
year     2005
month    12
day      20
svn rev  36812
language R

> grep("[W-Z]", LETTERS, value = TRUE)
[1] "W" "X" "Y" "Z"

That's what I'd have expected.
> grep("[W-Z]", letters, value = TRUE)
[1] "x" "y" "z"

Not what I'd have thought.  However,
> grep("[B-D]", letters, value = TRUE, perl = TRUE)
character(0)

So what is it that standard regular expressions use that's different
from Perl-type ones?

The help file for grep refers to POSIX 1003.2 which looked a bit
daunting to delve into.  From my limited reading, it seems there are
different regex "Engine Types", which seems to be getting somewhat
tangential to what I was working on.  I could probably avoid problems
if I always set perl=TRUE, but it would be good to know what basic and
extended regular expressions do that's different.  If someone has a
quick line or two describing it, I'd be interested to know.

Thanks

-- 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.   
   ___    Patrick Connolly   
 {~._.~}          		 Great minds discuss ideas    
 _( Y )_  	  	        Middle minds discuss events 
(:_~*~_:) 	       		 Small minds discuss people  
 (_)-(_)  	                           ..... Anon
	  
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.

Gabor Grothendieck

2006-Jun-10 05:40 UTC

[R] Regex engine types

I get the same thing on "Version 2.3.1 Patched (2006-06-04 r38279)"
but on "R version 2.2.1, 2005-12-20" it gives character(0), as
expected, so there is some change between versions of R.  I am
on Windows XP.


Prof Brian Ripley

2006-Jun-10 06:47 UTC

[R] Regex engine types

?regex does describe this:

      A range of characters may be specified by giving the first and last
      characters, separated by a hyphen.  (Character ranges are
      interpreted in the collation order of the current locale.)

You did not tell us your locale, but based on questions from you in the 
past I would guess en_NZ.utf8.  In that locale the collation order is 
wWxXyYzZ, so your surprise is explained.  (It seems the PCRE code is not 
using the same ordering in that locale.)
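
As a rough illustration of that ordering, the sketch below models only the tail of the alphabet in the interleaved order wWxXyYzZ and lists what a range from "W" to "Z" would cover. The helper name in_range and the eight-element vector are made up for illustration; this is not how the regex engine is implemented.

```r
# Tail of the alphabet in the interleaved collation order described above.
# Only these eight characters are modelled; the rest is assumed to follow
# the same lower/upper interleaving.
collation <- c("w", "W", "x", "X", "y", "Y", "z", "Z")

# Hypothetical helper: every character lying between 'first' and 'last'
# (inclusive) in the given ordering.
in_range <- function(first, last, ordering) {
  ordering[match(first, ordering):match(last, ordering)]
}

# What a range written as W-Z would cover under that ordering.
covered <- in_range("W", "Z", collation)

# Lower-case letters that fall inside the range.
lower_hits <- intersect(letters, covered)
print(covered)
print(lower_hits)
```

Because "w" sorts just before "W" in this ordering, it falls outside the range, while "x", "y" and "z" fall inside it. That is consistent with the grep() result reported in the original question.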

You may find it useful to set LC_COLLATE to C as I do:
> strsplit(Sys.getlocale(), ";")[[1]]
 [1] "LC_CTYPE=en_GB"       "LC_NUMERIC=C"         "LC_TIME=en_GB"
 [4] "LC_COLLATE=C"         "LC_MONETARY=en_GB"    "LC_MESSAGES=en_GB"
 [7] "LC_PAPER=en_GB"       "LC_NAME=C"            "LC_ADDRESS=C"
[10] "LC_TELEPHONE=C"       "LC_MEASUREMENT=en_GB" "LC_IDENTIFICATION=C"

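For a single session, one way to get that setting is Sys.setlocale(). The following is a minimal sketch, assuming you only want the C collation temporarily and want the previous value restored afterwards; the variable names are illustrative. Whether ranges follow the collation order at all depends on the R version and regex engine in use, so listing the wanted characters explicitly is the more portable choice.

```r
# Remember the current collation setting so it can be restored later.
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch collation to the C locale, where upper- and lower-case letters
# are not interleaved.
Sys.setlocale("LC_COLLATE", "C")

# The range from the original question, applied to both letter sets.
res_upper <- grep("[W-Z]", LETTERS, value = TRUE)
res_lower <- grep("[W-Z]", letters, value = TRUE)

# Locale-independent alternative: list the characters explicitly.
res_explicit <- grep("[WXYZ]", letters, value = TRUE)

print(res_upper)
print(res_lower)
print(res_explicit)

# Put the previous collation setting back.
Sys.setlocale("LC_COLLATE", old_collate)
```

To make the setting permanent instead, LC_COLLATE can be set to C in the environment before R starts, which is what the listing above reflects.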

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
