axel.klenk at actelion.com
2010-Jan-21 16:36 UTC
[R] unexpected behaviour of R-2.10.1 regular expression in UTF-8 locale
Dear R-helpers,

I have encountered the following unexpected behaviour of R-2.10.1, but not R-2.9.0, on both RHEL 4 and Ubuntu Karmic (precompiled via synaptic or built from source).

I have a character vector from which I want to extract a certain pattern that is surrounded by junk, as in:

> nn <- sprintf("junk_%02d_junk", 1:2)
> nn
[1] "junk_01_junk" "junk_02_junk"
> sub("^.*([[:digit:]]{2}).*$", "\\1", nn)
[1] "nk" "nk" # oops?

however:

> sub("^.*([[:digit:]]{2}).*$", "\\1", nn, perl = TRUE)
[1] "01" "02" # as expected,

and also

> Sys.setlocale("LC_ALL", "C")
[1] "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
> sub("^.*([[:digit:]]{2}).*$", "\\1", nn)
[1] "01" "02"

Is there something wrong with my regex syntax, or am I missing something else? Obviously I have at least two workarounds, but I'd like to report this since it is breaking code that used to run in R-2.9.0.

Thanks in advance for any help or insight,

- axel

$ R --vanilla

R version 2.10.1 (2009-12-14)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> sessionInfo()
R version 2.10.1 (2009-12-14)
x86_64-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Axel Klenk
Research Informatician
Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil / Switzerland
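
For anyone who wants to reproduce the comparison, the two workarounds mentioned in the post can be wrapped up roughly as follows. This is only a sketch: the helper names extract_two_digits() and extract_two_digits_c() are hypothetical and not part of the original report, and the second helper assumes that LC_CTYPE is the locale category that matters, whereas the session in the post switched LC_ALL.

```r
# Pattern from the post: greedy prefix, a captured run of two digits, any suffix.
pattern <- "^.*([[:digit:]]{2}).*$"

# Workaround 1 (hypothetical helper): use the PCRE engine via perl = TRUE,
# which returned the expected result in the reported session.
extract_two_digits <- function(x) {
  sub(pattern, "\\1", x, perl = TRUE)
}

# Workaround 2 (hypothetical helper): temporarily switch LC_CTYPE to "C",
# run the default engine, and restore the previous setting on exit.
# Assumption: changing LC_CTYPE alone is enough; the post changed LC_ALL.
extract_two_digits_c <- function(x) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old))
  Sys.setlocale("LC_CTYPE", "C")
  sub(pattern, "\\1", x)
}

nn <- sprintf("junk_%02d_junk", 1:2)
res_perl <- extract_two_digits(nn)
res_c <- extract_two_digits_c(nn)
print(res_perl)
print(res_c)
```

Comparing both results with the plain sub() call in a UTF-8 locale shows whether a given R build is affected. Restoring the locale with on.exit() keeps the change from leaking into the rest of the session.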