thr3ads.net - R devel - [Rd] Windows, format.POSIXct and character encodings [May 2013]

Hadley Wickham

2013-May-01 14:06 UTC

[Rd] Windows, format.POSIXct and character encodings

Hi all,

In what encoding does format.POSIXct return its output? It doesn't
seem to be utf-8:

Sys.setlocale("LC_ALL", "Japanese_Japan.932")

times <- c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")
ampm <- format(as.POSIXct(times), format = "%p")
x <- gsub(">", "*", paste(ampm, collapse = "+>"))

y <- "午前+*午後"
identical(x, y)
# [1] TRUE

# But, confusingly, ...

charToRaw(x)
# [1] e5 8d 88 e5 89 8d 2b 2a e5 8d 88 e5 be 8c

charToRaw(y)
# [1] 8c df 91 4f 2b 2a 8c df 8c e3

# So there's at least a small bug with identical

# And this causes a problem when you attempt to do
# stuff with the string

gsub("+", "*", x, fixed = T)
# Error in gsub("+", "*", x, fixed = T) :
#  invalid multibyte string at '<8c>'
gsub("+", "*", y, fixed = T)
# [1] "午前**午後"


My session info is

R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Japanese_Japan.932  LC_CTYPE=Japanese_Japan.932
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
[5] LC_TIME=Japanese_Japan.932

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_3.0.0

Any ideas? Thanks!

Hadley

--
Chief Scientist, RStudio
http://had.co.nz/

Simon Urbanek

2013-May-01 21:33 UTC

[Rd] Windows, format.POSIXct and character encodings

On May 1, 2013, at 10:06 AM, Hadley Wickham wrote:
> Hi all,
> 
> In what encoding does format.POSIXct return its output? It doesn't
> seem to be utf-8:
> 
> Sys.setlocale("LC_ALL", "Japanese_Japan.932")
> 
> times <- c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")
> ampm <- format(as.POSIXct(times), format = "%p")
> x <- gsub(">", "*", paste(ampm, collapse = "+>"))
> 
> y <- "午前+*午後"
> identical(x, y)
> # [1] TRUE
> 
> # But, confusingly, ...
> 
> charToRaw(x)
> # [1] e5 8d 88 e5 89 8d 2b 2a e5 8d 88 e5 be 8c
> 
> charToRaw(y)
> # [1] 8c df 91 4f 2b 2a 8c df 8c e3
> 
That's not confusing at all:
> Encoding(x)
[1] "UTF-8"
> Encoding(y)
[1] "unknown"

The first string is marked as UTF-8; the second is in the native locale
encoding (here CP932).
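
To look at the two byte sequences without depending on the session locale, the same text can be built from Unicode escapes and converted explicitly. This is only a sketch: it assumes the garbled literals in this thread are 午前 and 午後 (which is what the byte dumps above decode to) and that the platform's iconv accepts the encoding name "CP932".

```r
# The text from the thread, written with \u escapes so that the result
# does not depend on how this file or the console is encoded.
x <- "\u5348\u524d+*\u5348\u5f8c"

# Strings created from \u escapes are marked as UTF-8.
enc_x <- Encoding(x)

# UTF-8 bytes, matching charToRaw(x) in the original report.
utf8_bytes <- charToRaw(x)

# CP932 bytes, matching charToRaw(y) in the original report.
# toRaw = TRUE returns raw vectors, so no string that is invalid in the
# current locale is ever created.
# Assumption: "CP932" is a name the platform's iconv accepts.
cp932_bytes <- iconv(x, from = "UTF-8", to = "CP932", toRaw = TRUE)[[1]]

print(enc_x)
print(utf8_bytes)
print(cp932_bytes)
```

The two raw vectors differ in length and content even though they encode the same characters, which is why charToRaw() disagrees for x and y.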

> # So there's at least a small bug with identical
> 
Nope: ?identical
"Character strings are regarded as identical if they are in different
marked encodings but would agree when translated to UTF-8."
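
The documented behaviour can be checked in any locale with a pair of marked encodings. The sketch below uses a latin1/UTF-8 pair rather than the CP932 strings from this thread, purely so that it runs the same way everywhere; the variable names are illustrative.

```r
# The same text in two different marked encodings.
a <- "caf\u00e9"                              # marked as UTF-8
b <- iconv(a, from = "UTF-8", to = "latin1")  # marked as latin1

# identical() compares the strings after translation to UTF-8 ...
same_text <- identical(a, b)

# ... whereas the underlying bytes are different.
same_bytes <- identical(charToRaw(a), charToRaw(b))

print(c(Encoding(a), Encoding(b)))
print(same_text)
print(same_bytes)
```

Here same_text is TRUE and same_bytes is FALSE, which mirrors what identical(x, y) and charToRaw() show in the report.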

> # And this causes a problem when you attempt to do
> # stuff with the string
> 
> gsub("+", "*", x, fixed = T)
> # Error in gsub("+", "*", x, fixed = T) :
> #  invalid multibyte string at '<8c>'
> gsub("+", "*", y, fixed = T)
> # [1] "午前**午後"
> 
This is where the problem lies - and it has nothing to do with format:
> z=enc2utf8("午前+*午後")
> gsub("+", "*", z, fixed = T)
Error in gsub("+", "*", z, fixed = T) :
  invalid multibyte string at '<8c>'

The cause is that fgrep_one() gives higher precedence to mbcslocale than to
use_UTF8, so the matching is actually done in the MBCS locale encoding and
not in UTF-8. Consequently, you'll see this only in multi-byte locales other
than UTF-8, so on, say, OS X you can reproduce it with
> x="午前+*午後"
> gsub("+", "*", x, fixed = T)
Error in gsub("+", "*", x, fixed = T) :
  invalid multibyte string at '<8c>'

Inverting the precedence would fix this issue, but I'm not sure if it would
have unwanted side-effects on MBCS locales ...

Cheers,
Simon
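
A possible workaround, which nobody in this thread proposed and which has not been tested against R 3.0.0 on Windows, is to ask for byte-wise matching with useBytes = TRUE. The assumption is that byte-wise matching sidesteps the choice between the MBCS locale and UTF-8 entirely; it is safe for this pattern because "+" is a single ASCII byte that never occurs inside a UTF-8 multi-byte sequence. The sketch again assumes the garbled literal in this thread is 午前+*午後.

```r
# UTF-8 marked input, built from escapes so it is locale independent.
x <- "\u5348\u524d+*\u5348\u5f8c"

# Byte-wise fixed replacement: no multibyte interpretation of x is needed.
res <- gsub("+", "*", x, fixed = TRUE, useBytes = TRUE)

# The encoding mark of a substituted result is not guaranteed with
# useBytes = TRUE, so inspect the bytes rather than the printed string.
res_bytes <- charToRaw(res)
print(res_bytes)
```

The bytes are the UTF-8 input with 2b replaced by 2a. Converting the string to the native encoding first (for example with enc2native()) is another option, in line with the report above that gsub() works on y.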

> 
> My session info is
> 
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> 
> locale:
> [1] LC_COLLATE=Japanese_Japan.932  LC_CTYPE=Japanese_Japan.932
> [3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
> [5] LC_TIME=Japanese_Japan.932
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> loaded via a namespace (and not attached):
> [1] tools_3.0.0
> 
> Any ideas? Thanks!
> 
> Hadley
> 
> --
> Chief Scientist, RStudio
> http://had.co.nz/
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
