thr3ads.net - R devel - [Rd] gsub, utf-8 replacements and the C-locale [Nov 2011]

Hadley Wickham

2011-Nov-23 23:48 UTC

[Rd] gsub, utf-8 replacements and the C-locale

Hi all,

I'd like to discuss an infelicity/possible bug with gsub.  Take the
following function:

f <- function(x) {
  gsub("\u{A0}", " ", gsub(" ", "\u{A0}", x))
}

As you might expect, in utf-8 locales it is idempotent:

Sys.setlocale("LC_ALL", "UTF-8")
f("x y")
# [1] "x y"

But in the C locale it is not:

Sys.setlocale("LC_ALL", "C")
f("x y")
# [1] "x\302\240y"
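
To see where the two locales diverge, it may help to inspect the intermediate string. This is only a sketch: the names `step1` and `step2` are made up here, and what `Encoding()` reports depends on the R version and locale, so no output is shown.

```r
# Run in whichever locale is being investigated.
step1 <- gsub(" ", "\u{A0}", "x y")   # ASCII input, UTF-8 replacement

Encoding(step1)    # how R has marked the intermediate result
charToRaw(step1)   # the underlying bytes; c2 a0 is U+00A0 in UTF-8

step2 <- gsub("\u{A0}", " ", step1)   # the inverse substitution
identical(step2, "x y")               # TRUE only if the round trip works
```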

This seems weird to me. (It also caused a bug in a package, because I
didn't realise some Windows users have a non-UTF-8 locale.)

I'm not sure what the correct resolution is.  Should the encoding of
the output of gsub be utf-8 if either the input or replacement is
utf-8?  In non-utf-8 locales should the encoding of "\u{A0}" be bytes?

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Simon Urbanek

2011-Nov-24 00:06 UTC

[Rd] gsub, utf-8 replacements and the C-locale

On Nov 23, 2011, at 6:48 PM, Hadley Wickham wrote:
> Hi all,
> 
> I'd like to discuss an infelicity/possible bug with gsub.  Take the
> following function:
> 
> f <- function(x) {
>   gsub("\u{A0}", " ", gsub(" ", "\u{A0}", x))
> }
> 
> As you might expect, in utf-8 locales it is idempotent:
> 
> Sys.setlocale("LC_ALL", "UTF-8")
> f("x y")
> # [1] "x y"
> 
> But in the C locale it is not:
> 
> Sys.setlocale("LC_ALL", "C")
> f("x y")
> # [1] "x\302\240y"
> 
> This seems weird to me. (And caused a bug in a package because I
> didn't realise some windows users have a non-utf8 locale)
> 
> I'm not sure what the correct resolution is.  Should the encoding of
> the output of gsub be utf-8 if either the input or replacement is
> utf-8?

It is if the input is UTF-8, but only then; that is what is causing the
asymmetry. Part of the problem is that you cannot declare a 7-bit string
as UTF-8 (even though it is valid), so you can't work around it by
forcing the encoding.
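
A quick way to see this, sketched with a made-up variable `x`: an attempt to mark an ASCII-only string as UTF-8 is silently ignored.

```r
x <- "x y"                 # 7-bit (ASCII-only) string
Encoding(x) <- "UTF-8"     # attempt to declare it as UTF-8
Encoding(x)                # still "unknown": ASCII strings are never marked

Encoding(enc2utf8(x))      # also "unknown", for the same reason
```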

>  In non-utf-8 locales should the encoding of "\u{A0}" be bytes?
> 
No, because the whole point of the encoding is to define the content.
"\ua0" defines one Unicode character whereas "\302\240" defines two
bytes with unknown meaning. The meaning of UTF-8 encoded strings is
still valid in non-UTF-8 locales, which is the reason why you can work
with UTF-8 strings in R irrespective of the locale (a very useful thing).
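
The distinction can be checked directly. A minimal sketch (the names `u` and `b` are only for illustration): both literals hold the same two bytes, but only the first is declared as UTF-8 through its \u escape, so R can count it as a single character regardless of the locale.

```r
u <- "\ua0"        # one Unicode character, U+00A0
b <- "\302\240"    # two bytes given as octal escapes

identical(charToRaw(u), charToRaw(b))   # TRUE: the stored bytes are the same
Encoding(u)                             # "UTF-8"
nchar(u, type = "chars")                # 1
nchar(u, type = "bytes")                # 2
Encoding(b)                             # may be "unknown": nothing was declared
```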

I would suggest handling the special case of 7-bit input and UTF-8
replacement such that it results in UTF-8 output (as opposed to the
bytes output which happens now). The relevant code is somewhat
convoluted (and more so in R-devel) so I'm not volunteering to do it,
though.
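
Until something like that is implemented, one possible user-level workaround is to mark the result rather than the input. This is only a sketch under assumptions: `to_nbsp()` and `from_nbsp()` are hypothetical helper names, the input is assumed to be ASCII or convertible by `enc2utf8()`, and the gsub output is assumed to contain the replacement's UTF-8 bytes (as in the C-locale result shown in this thread).

```r
# Hypothetical helper: replace spaces with U+00A0 and declare the result UTF-8.
to_nbsp <- function(x) {
  x <- enc2utf8(x)                            # leaves ASCII input unchanged
  out <- gsub(" ", "\u00a0", x, fixed = TRUE)
  Encoding(out) <- "UTF-8"                    # ignored for ASCII-only elements
  out
}

# Hypothetical inverse: relies on its input being marked as UTF-8.
from_nbsp <- function(x) {
  gsub("\u00a0", " ", x, fixed = TRUE)
}

from_nbsp(to_nbsp("x y"))   # intended to round-trip to "x y" in any locale
```

It would not be safe where gsub re-encodes the replacement into a non-UTF-8 native encoding, since the bytes would then be mislabelled.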

Just to make things more clear - the current result (in C locale):
> gsub(" ", "\ua0", "foo bar")
[1] "foo\302\240bar"

Possibly desired result:
> gsub(" ", "\ua0", "foo bar")
[1] "foo<U+00A0>bar"

Cheers,
Simon
