thr3ads.net - R devel - [Rd] Windows iconv() "failure" in certain locales [Jun 2017]

Martin Maechler

2017-Jun-27 15:36 UTC

[Rd] Windows iconv() "failure" in certain locales

This is a continuation of the R-devel thread with subject
 "suggestion to fix packageDescription() for Windows users" :

As I said there, a patch should address the underlying
problem in packageDescription() rather than add a kludgy workaround
for citation().
(For that same reason, Ben Marwick proposed to fix
 packageDescription() rather than the symptom seen in citation().)

It's not hard to see that the problem is that iconv() on
Windows does not always succeed in translating from "UTF-8" to the
"current locale", as in the case mentioned there.

I'm giving some easier reproducible examples:  no need to install
half of tidyverse just to get citation("readr") :
> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
> Encoding(x1) <- "latin1"
> xU <- iconv(x1, "latin1", "UTF-8")

> Sys.setlocale("LC_CTYPE", "Chinese")
[1] "Chinese (Simplified)_People's Republic of China.936"
> 
> iconv(x1, "latin1", "") # NA NA NA
[1] NA NA NA
> iconv(xU, "UTF-8", "") # NA NA NA
[1] NA NA NA
> iconv(xU, "UTF-8", "//TRANSLIT")
[1] "Ekstr?m"         "J?reskog"        "bi?chen Z?rcher"
> iconv(xU, "UTF-8", "", sub = "byte")
[1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Z??rcher"

> Sys.setlocale("LC_CTYPE", "Arabic")
[1] "Arabic_Saudi Arabia.1256"
> iconv(x1, "latin1", "")  # NA NA NA
[1] NA NA NA
> iconv(xU, "UTF-8", "")  # NA NA NA
[1] NA NA NA
> iconv(xU, "UTF-8", "//TRANSLIT")
[1] "Ekstr\370m"         "J\366reskog"        "bi?chen Z?rcher"
> iconv(xU, "UTF-8", "", sub="byte")
[1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Z?rcher"
> iconv(xU, "UTF-8", "", sub="?")
[1] "Ekstr??m"         "J??reskog"        "bi??chen Z?rcher"

Etc... .  As the above is typically garbled between e-mail
transfer agents, I append both the iconv-Windows.R R script and
the corresponding iconv-Windows.Rout  R transcript to this
e-mail (using MIME type text/plain (easy using emacs for mail..)),
and they contain a bit more than the above.

Note that the above shows that using 'sub = *' or
"//TRANSLIT" after a previous NA result helps quite a bit,
in the sense that it is much more informative to see
  "J?reskog"  instead of  NA.

I'm considering updating  packageDescription() to try these in
case it first returns NA.   This would make the citation() hack
unnecessary.
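
What 'sub = *' adds over a plain NA result can also be seen on platforms other than Windows. The following is only an illustrative sketch: "ASCII" is assumed here as a stand-in for a locale encoding that lacks these letters, the name `cmp` is made up for the example, and the "//TRANSLIT" column depends on the iconv implementation R was built against.

```r
## Same strings as above, declared latin1 and re-encoded to UTF-8
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")

## "ASCII" stands in for a locale encoding lacking these letters (assumption)
cmp <- data.frame(
  plain    = iconv(xU, "UTF-8", "ASCII"),                # NA where conversion fails
  translit = iconv(xU, "UTF-8", "ASCII//TRANSLIT"),      # platform-dependent result
  sub_byte = iconv(xU, "UTF-8", "ASCII", sub = "byte"),  # <xx> per offending byte
  sub_qm   = iconv(xU, "UTF-8", "ASCII", sub = "?"),     # "?" per offending byte
  stringsAsFactors = FALSE
)
print(cmp)
```

The `plain` column loses each string entirely, while the two `sub` columns keep the convertible part of every name.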

Martin

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: iconv-Windows.R
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170627/1e5f6924/attachment.ksh>
-------------- next part --------------

R Under development (unstable) (2017-06-25 r72854) -- "Unsuffered Consequences"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> #### iconv() behavior depending on Locales  LC_CTYPE  in Windows
> #### =======                       =============================
> ###
> ### In a *shell* in Windows (emacs), after doing R.home() in R, use that to do something like
> ###   c:/PROGRA~1/R/R-devel/bin/R CMD BATCH iconv-Windows.R
> ###   ^^^^^^^^^^^^^^^^^^^^^^^^^^= === ===== ===============  ==> producing  iconv-Windows.Rout
> ###
> sessionInfo() ## does not matter so much
R Under development (unstable) (2017-06-25 r72854)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.0
> ## -- should be Windows to exhibit the problems
> 
> ## From  help(iconv) 's  example : Using "latin1" European language letters:
> x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
> Encoding(x1) <- "latin1"
> xU <- iconv(x1, "latin1", "UTF-8")
> 
> 
> ## 2 locales that do not work well : ---------------------------------
> Sys.setlocale("LC_CTYPE", "Chinese")
[1] "Chinese (Simplified)_People's Republic of China.936"
> 
> iconv(x1, "latin1", "") # NA NA NA
[1] NA NA NA
> iconv(x1, "latin1", "//TRANSLIT") # perfect for Chinese
[1] "Ekstr?m"         "J?reskog"        "bi?chen Z?rcher"
> iconv(x1, "latin1", "", sub = "byte")
[1] "Ekstr<f8>m"         "J<f6>reskog"        "bi<df>chen Z??rcher"
> iconv(xU, "UTF-8", "") # NA NA NA
[1] NA NA NA
> iconv(xU, "UTF-8", "//TRANSLIT")
[1] "Ekstr?m"         "J?reskog"        "bi?chen Z?rcher"
> iconv(xU, "UTF-8", "", sub = "byte")
[1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Z??rcher"
> ##--
> Sys.setlocale("LC_CTYPE", "Arabic")
[1] "Arabic_Saudi Arabia.1256"
> iconv(x1, "latin1", "")  # NA NA NA
[1] NA NA NA
> iconv(x1, "latin1", "//TRANSLIT") # not bad, but not perfect
[1] "Ekstr\370m"         "J\366reskog"        "bi?chen Z?rcher"
> iconv(x1, "latin1", "", sub="byte")
[1] "Ekstr<f8>m"         "J<f6>reskog"        "bi<df>chen Z?rcher"
> iconv(x1, "latin1", "", sub="?")
[1] "Ekstr?m"         "J?reskog"        "bi?chen Z?rcher"
> iconv(xU, "UTF-8", "")  # NA NA NA
[1] NA NA NA
> iconv(xU, "UTF-8", "//TRANSLIT")
[1] "Ekstr\370m"         "J\366reskog"        "bi?chen Z?rcher"
> iconv(xU, "UTF-8", "", sub="byte")
[1] "Ekstr<c3><b8>m"         "J<c3><b6>reskog"        "bi<c3><9f>chen Z?rcher"
> iconv(xU, "UTF-8", "", sub="?")
[1] "Ekstr??m"         "J??reskog"        "bi??chen Z?rcher"
> 
> ## 2 locales that work well for these examples (no wonder) -----------
> 
> Sys.setlocale("LC_CTYPE", "German_Switzerland")
[1] "German_Switzerland.1252"
> iconv(x1, "latin1", "")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> iconv(x1, "latin1", "//TRANSLIT")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> iconv(x1, "latin1", "", sub="?")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> iconv(xU, "UTF-8", "")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> iconv(xU, "UTF-8", "//TRANSLIT")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> iconv(xU, "UTF-8", "", sub="?")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> ##--
> Sys.setlocale("LC_CTYPE", "English")
[1] "English_United States.1252"
> iconv(x1, "latin1", "")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> iconv(x1, "latin1", "//TRANSLIT")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> iconv(x1, "latin1", "", sub="?")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> iconv(xU, "UTF-8", "")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> iconv(xU, "UTF-8", "//TRANSLIT")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> iconv(xU, "UTF-8", "", sub="?")
[1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"
> 
> proc.time()
   user  system elapsed 
   0.18    0.14    0.98

Duncan Murdoch

2017-Jun-28 10:32 UTC

On 27/06/2017 11:36 AM, Martin Maechler wrote:
> [...]
>
> I'm considering updating  packageDescription() to try these in
> case it first returns NA.   This would make the citation() hack
> unnecessary.
I agree with the general sentiment (fix the underlying problem).  I 
haven't traced through this one, but the usual cause of problems like 
this is that we too frequently convert to the local encoding even when 
that loses information.
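
The kind of information loss meant here can be illustrated without Windows. This is a minimal sketch, not a trace of the actual problem: "ASCII" is assumed as a stand-in for a native encoding that lacks the needed characters, and the variable names are invented for the example.

```r
x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x1) <- "latin1"
xU <- enc2utf8(x1)                      # keep the text in UTF-8

## Lossy detour through a narrower encoding ("ASCII" is an illustrative stand-in)
lossy <- iconv(xU, "UTF-8", "ASCII", sub = "?")
back  <- iconv(lossy, "ASCII", "UTF-8")

## Lossless round trip between UTF-8 and latin1
round_trip <- iconv(xU, "UTF-8", "latin1")

lost <- !identical(back, xU)                    # the letters cannot be recovered
kept <- identical(enc2utf8(round_trip), xU)     # nothing was lost
print(c(lost = lost, kept = kept))
```

Once a string has passed through an encoding that cannot represent it, converting back does not restore the original, which is why the conversion is best deferred as long as possible.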

Kirill Müller and I are gradually working through internal code and 
fixing these issues.  I don't know if this one will be fixed sooner or 
later, but I would hope it would be fixed by 3.5.0.

So in order that we don't hide it, I'd ask you not to apply the patch in
R-devel.

Duncan Murdoch

Uwe Ligges

2017-Jun-28 16:45 UTC

On 27.06.2017 17:36, Martin Maechler wrote:
> [...]
>
>> Sys.setlocale("LC_CTYPE", "Chinese")
> [1] "Chinese (Simplified)_People's Republic of China.936"
>>
>> iconv(x1, "latin1", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "") # NA NA NA
> [1] NA NA NA
>> iconv(xU, "UTF-8", "//TRANSLIT")
> [1] "Ekstr?m"         "J?reskog"        "bi?chen Z?rcher"
Interesting, I get Chinese characters here.

Besides the comments from Duncan Murdoch:
iconv(x1, "latin1", "", sub="?")
etc. would be an alternative in case some characters really cannot be 
converted into the target encoding, and should perhaps be considered for 
the time after Duncan commits the fix for the underlying problem.

Best,
Uwe







> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Martin Maechler

2017-Jun-29 10:27 UTC

>>>>> Uwe Ligges <ligges at statistik.tu-dortmund.de>
>>>>>     on Wed, 28 Jun 2017 18:45:59 +0200 writes:
    > On 27.06.2017 17:36, Martin Maechler wrote:
    >> [...]
    >> 
    >>> Sys.setlocale("LC_CTYPE", "Chinese")
    >> [1] "Chinese (Simplified)_People's Republic of China.936"
    >>> 
    >>> iconv(x1, "latin1", "") # NA NA NA
    >> [1] NA NA NA
    >>> iconv(xU, "UTF-8", "") # NA NA NA
    >> [1] NA NA NA
    >>> iconv(xU, "UTF-8", "//TRANSLIT")
    >> [1] "Ekstr?m"         "J?reskog"        "bi?chen Z?rcher"

    > Interesting, I get chinese characters here.

For which one of the above cases? Can you show them?
 (They may survive e-mail servers; we had other
  Chinese R strings on R-help / R-devel recently, right?)

In any case, I think that is even worse, isn't it?
Even in a Chinese locale you'd want explicitly-latin1 text to
be shown in something that looks like latin-1 (I know from a master's
 student that Windows+Chinese can well show latin-1-like
 letters interspersed in the Chinese text),
no?


    > Beside the comments from Duncan Murdoch:

    > iconv(x1, "latin1", "", sub="?")
    > etc. would be an alternative in case some characters really cannot be 
    > converted into the target encoding and should perhaps be considered for
    > the time after Duncan commits the fix for the underlying problem.

Yes, I had the same idea; that's why I used it in the code I
sent along.

So,

1)  we definitely won't commit the workaround patch for citation().

2) I have a "workaround patch" for packageDescription() which is
   more useful in the sense that only if iconv() produces NA's, it
   tries alternatives, notably "//TRANSLIT", "ASCII//TRANSLIT"
   (the latter Ben also mentioned, but my patch would only use it
    in the NA case) and also the same 'sub="?"' that you mention
    above, Uwe.

   That patch is not Windows-specific and will automatically
   also help in other cases / platforms where the iconv()
   re-encoding leads to partial NAs.
   
  @Duncan M: would you _not_ want me to commit that either?
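
The fallback described in 2) could look roughly like the sketch below. It is not the actual patch: the helper name `iconv_fallback` and its arguments are hypothetical, and "ASCII" is used as the target in the usage lines only so that the NA case can be triggered on any platform.

```r
## Hypothetical helper (not the actual patch): retry only those elements
## for which the first iconv() call returned NA.
iconv_fallback <- function(x, from, to = "") {
  res <- iconv(x, from, to)
  ## Alternatives named in this thread, tried in order
  alts <- unique(c(paste0(to, "//TRANSLIT"), "ASCII//TRANSLIT"))
  for (alt in alts) {
    bad <- is.na(res) & !is.na(x)
    if (!any(bad)) return(res)
    ## An unsupported target makes iconv() signal an error; treat that as NA
    res[bad] <- tryCatch(iconv(x[bad], from, alt),
                         error = function(e) rep(NA_character_, sum(bad)))
  }
  ## Last resort: substitute "?" for every non-convertible byte
  bad <- is.na(res) & !is.na(x)
  if (any(bad)) res[bad] <- iconv(x[bad], from, to, sub = "?")
  res
}

x1 <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher", "plain")
Encoding(x1) <- "latin1"
xU <- iconv(x1, "latin1", "UTF-8")
out <- iconv_fallback(xU, "UTF-8", "ASCII")
print(out)
```

Elements that convert cleanly on the first attempt are never touched, so the alternatives only affect strings that would otherwise come back as NA.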

Martin
