thr3ads.net - R help - [R] nchar(type = "chars") of "latin1" string in C locale [Feb 2017]

Mikko Korpela

2017-Feb-24 14:57 UTC

[R] nchar(type = "chars") of "latin1" string in C locale

When running R in an ASCII locale (export LC_ALL=C) on Linux, is the
following behavior expected?

foo <- "\xe4"
Encoding(foo) <- "latin1"
foo
# [1] "<e4>"
nchar(foo)
# [1] 4
nchar(foo, type = "bytes")
# [1] 1
nchar(foo, type = "width")
# [1] 4

That is, the number of characters reported for the default 'type = 
"chars"' is the number of characters (4) used for printing the
unknown byte.

Obviously, one byte is one character in the single-byte ISO-8859-1 
"latin1" encoding. Therefore I think the result of 4 characters for 1 
byte is wrong, or unintuitive.
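
For comparison, here is a minimal sketch of one way to get the count I would expect, assuming it is acceptable to convert the string to UTF-8 first. The variable name foo_utf8 is only for illustration, and this is a possible workaround, not a statement of how nchar() is meant to be used.

```r
# A "latin1"-marked string holding the single byte 0xE4 (a-umlaut)
foo <- "\xe4"
Encoding(foo) <- "latin1"

# The byte count does not depend on the locale
n_bytes <- nchar(foo, type = "bytes")

# Assumption: converting to UTF-8 before counting is acceptable.
# enc2utf8() re-encodes the latin1 byte as a two-byte UTF-8 sequence.
foo_utf8 <- enc2utf8(foo)
n_chars <- nchar(foo_utf8, type = "chars")

n_bytes
n_chars
```

After the conversion, the string is marked as UTF-8, so the character count no longer depends on how the unknown byte would be printed.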

If this is as expected, maybe it should be mentioned in the '?nchar' 
manual as a special case. Yes, I did try to read the manual for an 
explanation. According to the manual, the result should be "The number 
of human-readable characters", but there is the note that:

      This does *not* by default give the number of characters that will
      be used to 'print()' the string.  Use 'encodeString' to find that.

For UTF-8 strings, nchar() does work correctly (as documented) even in 
the C locale.

foo2 <- "\xc3\xa4"
Encoding(foo2) <- "UTF-8"
foo2
# [1] "<U+00E4>"
nchar(foo2)
# [1] 1
nchar(foo2, type = "bytes")
# [1] 2
nchar(foo2, type = "width")
# [1] 1

But, confusingly, encodeString() does not agree with print(), contrary 
to the documentation in '?encodeString':

encodeString(foo2)
# [1] "\\u00e4"
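
The disagreement can be checked directly with a small sketch like the one below. The names printed and encoded are only illustrative, and the result depends on the locale and R version, so no particular outcome is assumed here.

```r
foo2 <- "\xc3\xa4"
Encoding(foo2) <- "UTF-8"

# What print() emits, captured as text (locale-dependent)
printed <- capture.output(print(foo2))

# What encodeString() returns, with double quotes added to mimic print()
encoded <- encodeString(foo2, quote = '"')

printed
encoded

# TRUE only if the two representations agree in the current locale
identical(sub("^\\[1\\] ", "", printed), encoded)
```

The quote argument is used so that both strings include the surrounding double quotes and can be compared like for like.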

I was using "R Under development (unstable) (2017-02-23 r72248)".

-- 
Mikko Korpela
Department of Geosciences and Geography
University of Helsinki
