Mikko Korpela
2017-Feb-24 14:57 UTC
[R] nchar(type = "chars") of "latin1" string in C locale
When running R in an ASCII locale (export LC_ALL=C) on Linux, is this
expected?

    foo <- "\xe4"
    Encoding(foo) <- "latin1"
    foo
    # [1] "<e4>"
    nchar(foo)
    # [1] 4
    nchar(foo, type = "bytes")
    # [1] 1
    nchar(foo, type = "width")
    # [1] 4

That is, the number of characters reported for the default
'type = "chars"' is the number of characters (4) used for printing the
unknown byte. Obviously, one byte is one character in the single-byte
ISO-8859-1 "latin1" encoding. Therefore I think the result of 4
characters for 1 byte is wrong, or at least unintuitive. If this is as
expected, maybe it should be mentioned in the '?nchar' manual as a
special case.

Yes, I did try to read the manual for an explanation. According to the
manual, the result should be "The number of human-readable characters",
but there is the note that:

    This does *not* by default give the number of characters that will
    be used to 'print()' the string. Use 'encodeString' to find that.

For UTF-8 strings, nchar() does work correctly (as documented) even in
the C locale.

    foo2 <- "\xc3\xa4"
    Encoding(foo2) <- "UTF-8"
    foo2
    # [1] "<U+00E4>"
    nchar(foo2)
    # [1] 1
    nchar(foo2, type = "bytes")
    # [1] 2
    nchar(foo2, type = "width")
    # [1] 1

But, confusingly, encodeString() does not agree with print(), contrary
to the documentation in '?encodeString':

    encodeString(foo2)
    # [1] "\\u00e4"

I was using "R Under development (unstable) (2017-02-23 r72248)".

-- 
Mikko Korpela
Department of Geosciences and Geography
University of Helsinki
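
Since nchar() gives the documented answer for the UTF-8 string above, one possible workaround for the "latin1" case is to convert to UTF-8 before counting. The following is only a sketch under that assumption: nchar_via_utf8() is a hypothetical helper name, not a base R function, and it relies on enc2utf8() honouring the declared "latin1" encoding.

```r
# Hypothetical helper (not part of base R): count characters after
# re-encoding to UTF-8, so that a string declared as "latin1" is not
# counted by its escaped "<e4>" representation in the C locale.
nchar_via_utf8 <- function(x) {
  # enc2utf8() converts according to the declared encoding of x
  nchar(enc2utf8(x), type = "chars")
}

foo <- "\xe4"
Encoding(foo) <- "latin1"

nchar(foo, type = "bytes")   # byte count of the original string
nchar_via_utf8(foo)          # character count after conversion to UTF-8
```

This does not answer whether nchar(foo) returning 4 is intended; it only sidesteps the question by counting on a UTF-8 copy of the string.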