thr3ads.net - R help - [R] Encoding problems. [Nov 2009]

Gérald Jean

2009-Nov-24 16:56 UTC

[R] Encoding problems.

Hello,

I use:

R version 2.9.2 (2009-08-24)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

on Ubuntu 9.10, I usually run R from ESS (5.4 on current Ubuntu) from
Emacs-22.2.1.  But I also tried the following from the console and it
gave the same results.

I have a data file containing lots of European characters, French,
German, Italian and so on.  I can read it ok in R but I can't display
the characters correctly.

I searched the archives and following professor Ripley's advice I read
my data the following way:

> con <- file("/home/gerald/Vins/ListeVin091123.csv", open = "r",
+             encoding = "UTF-8")
> isOpen(con)
[1] TRUE
> ttt <- read.table(file = con, header = TRUE, sep = ";", quote = "\"'",
+                 dec = ",",   # row.names, col.names,
+                 na.strings = "", colClasses = NA, nrows = -1,
+                 skip = 0, check.names = TRUE,
+                 strip.white = FALSE, blank.lines.skip = TRUE,
+                 comment.char = "#",
+                 allowEscapes = FALSE, flush = FALSE,
+                 stringsAsFactors = FALSE)
> close(con)

It seems that R does recognize the locale, since it tries to report
errors in French.  Here is a simple example:

> ttt.g <- "gérald"
Erreur : caractères multioctets incorrects dans l'analyse de code
(parser) à la ligne 1
[English: "Error: invalid multibyte character in parser at line 1"]

Outputting the colnames of my data set I get:

> names(ttt)
 [1] "ID"              "Domaine"         "Nom"             "Mill??????.sime" "Pays"
 [6] "R??????.gion"    "Appellation"     "Vignoble"        "Couleur"         "Alcool"
[11] "Classement"      "Cuve"            "mois"            "Bio"             "C??????.page..1"
[16] "X."              "C??????.page..2" "X..1"            "C??????.page..3" "X..2"
[21] "C??????.page..4" "X..3"            "C??????.page..5" "X..4"            "Prix"
[26] "Quantit??????."  "Internet"

sessionInfo yields the following:

> sessionInfo()
R version 2.9.2 (2009-08-24) 
i486-pc-linux-gnu 

locale:
LC_CTYPE=fr_CA.UTF-8;LC_NUMERIC=C;LC_TIME=fr_CA.UTF-8;LC_COLLATE=fr_CA.UTF-8;LC_MONETARY=C;
LC_MESSAGES=fr_CA.UTF-8;LC_PAPER=fr_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;
LC_MEASUREMENT=fr_CA.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Revobase_0.2-1

I tried to play with Emacs' coding systems with no luck!  Any idea on
how to handle this?

My ultimate goal is to clean up and sort this data set and then export
it in a LaTeX compatible format.

By the way, if I open the file with OpenOffice Calc it asks me to
confirm that the encoding is Unicode UTF-8, I do, change the default
delimiter to ";" and press enter.  All the accented characters display
OK.

Thanks for any insights,

Gérald Jean

Peter Dalgaard

2009-Nov-24 17:29 UTC

[R] Encoding problems.

Gérald Jean wrote:
> Hello,
> 
> I use:
> 
> R version 2.9.2 (2009-08-24)
> Copyright (C) 2009 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
> 
> on Ubuntu 9.10, I usually run R from ESS (5.4 on current Ubuntu) from
> Emacs-22.2.1.  But I also tried the following from the console and it
> gave the same results.
> 
> I have a data file containing lots of European characters, French,
> German, Italian and so on.  I can read it ok in R but I can't display
> the characters correctly.
> 
> I searched the archives and following professor Ripley's advice I read
> my data the following way:
> 
>> con <- file("/home/gerald/Vins/ListeVin091123.csv", open = "r",
> +             encoding = "UTF-8")
>> isOpen(con)
> [1] TRUE
>> ttt <- read.table(file = con, header = TRUE, sep = ";", quote = "\"'",
> +                 dec = ",",   # row.names, col.names,
> +                 na.strings = "", colClasses = NA, nrows = -1,
> +                 skip = 0, check.names = TRUE,
> +                 strip.white = FALSE, blank.lines.skip = TRUE,
> +                 comment.char = "#",
> +                 allowEscapes = FALSE, flush = FALSE,
> +                 stringsAsFactors = FALSE)
>> close(con)
> 
> It seems that R does recognize the locales since it tries to report
> errors in French here is a simple example:
> 
>> ttt.g <- "gérald"
> Erreur : caractères multioctets incorrects dans l'analyse de code
> (parser) à la ligne 1

Looks like R is speaking UTF-8 and you're not. Or rather, your console
isn't. You may need to poke around to change that -- I think most
terminal emulators these days allow you to set the encoding from their
menu bar.
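
One way to test this diagnosis from inside R is sketched below. It is only a rough check, not part of the original advice: it prints what the session believes about its locale, then emits an "é" built from its code point, so the script itself stays pure ASCII and does not depend on the editor's encoding. The variable names are illustrative.

```r
# What does the running R session believe about its locale?
ctype <- Sys.getlocale("LC_CTYPE")   # for the session in this thread: fr_CA.UTF-8
info  <- l10n_info()                 # list with MBCS, UTF-8 and Latin-1 flags

cat("LC_CTYPE:", ctype, "\n")
cat("R expects UTF-8:", info[["UTF-8"]], "\n")

# Build the test character from its Unicode code point (U+00E9) instead of
# typing it, so the result does not depend on how this script was saved.
e_acute <- intToUtf8(233L)

# The UTF-8 form of U+00E9 is the two bytes c3 a9.
print(charToRaw(e_acute))

# Send those two bytes to the terminal and look at what it draws.
cat("Test character:", e_acute, "\n")
```

If the last line shows a single é, the terminal is decoding UTF-8. If it shows two characters such as Ã©, the terminal is interpreting the bytes as Latin-1 and its encoding setting needs changing.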

However, the printout below doesn't quite look like UTF-8, more like one
of the older ISO646 mechanisms, so you may still have some work to do.
Then again, if OO can read the original file, maybe I am worrying too
soon....
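
To settle whether the file really is UTF-8, it can help to look at the raw bytes rather than at how they are displayed. The sketch below assumes R 3.3.0 or later, because `validUTF8()` does not exist in the R 2.9.2 used in this thread. It works on a small temporary file standing in for the real CSV; the header "Région" and all variable names are made up for illustration.

```r
# Stand-in for the real CSV: one header line, "Région", written as UTF-8 bytes.
utf8_bytes <- as.raw(c(0x52, 0xc3, 0xa9, 0x67, 0x69, 0x6f, 0x6e))
path <- tempfile(fileext = ".csv")
writeBin(c(utf8_bytes, as.raw(0x0a)), path)   # append a newline

# Read the first line without any re-encoding and inspect its bytes.
first <- readLines(path, n = 1L, warn = FALSE)
print(charToRaw(first))            # an accented letter in UTF-8 starts with c3

# validUTF8() checks the bytes themselves, independent of the console.
utf8_ok <- validUTF8(first)

# The same word as Latin-1 bytes (a lone e9) is not valid UTF-8 ...
latin1_name <- rawToChar(as.raw(c(0x52, 0xe9, 0x67, 0x69, 0x6f, 0x6e)))
latin1_ok   <- validUTF8(latin1_name)

# ... but iconv() can convert it once the true source encoding is known.
fixed <- iconv(latin1_name, from = "latin1", to = "UTF-8")

cat("UTF-8 header valid:  ", utf8_ok, "\n")
cat("Latin-1 header valid:", latin1_ok, "\n")
unlink(path)
```

If the header of the real file passes this check, the data are fine and the remaining problem lies in how the console or Emacs displays them.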

-p

> outputting the colnames of my data set I get:
> 
>> names(ttt)
>  [1] "ID"              "Domaine"         "Nom"             "Mill??????.sime" "Pays"
>  [6] "R??????.gion"    "Appellation"     "Vignoble"        "Couleur"         "Alcool"
> [11] "Classement"      "Cuve"            "mois"            "Bio"             "C??????.page..1"
> [16] "X."              "C??????.page..2" "X..1"            "C??????.page..3" "X..2"
> [21] "C??????.page..4" "X..3"            "C??????.page..5" "X..4"            "Prix"
> [26] "Quantit??????."  "Internet"
> 
> sessionInfo yields the following:
> 
>> sessionInfo()
> R version 2.9.2 (2009-08-24) 
> i486-pc-linux-gnu 
> 
> locale:
> LC_CTYPE=fr_CA.UTF-8;LC_NUMERIC=C;LC_TIME=fr_CA.UTF-8;LC_COLLATE=fr_CA.UTF-8;LC_MONETARY=C;
> LC_MESSAGES=fr_CA.UTF-8;LC_PAPER=fr_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;
> LC_MEASUREMENT=fr_CA.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] Revobase_0.2-1
> 
> I tried to play with Emacs' coding systems with no luck!  Any idea on
> how to handle this?
> 
> My ultimate goal is to clean up and sort this data set and then export
> it in a LaTeX compatible format.
> 
> By the way, if I open the file with OpenOffice Calc it asks me to
> confirm that the encoding is Unicode UTF-8, I do, change the default
> delimiter to ";" and press enter.  All the accented characters display
> OK.
> 
> Thanks for any insights,
> 
> Gérald Jean
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
