Emmanuel Charpentier
2008-Sep-07 22:02 UTC
[R] Request for advice on character set conversions (those damn Excel files, again ...)
Dear list,

I have to read a not-so-small bunch of not-so-small Excel files, which seem to have traversed Windows 3.1, Windows 95 and Windows NT versions of the thing (with maybe a Mac or two thrown in for good measure...). The problem is that 1) I need to read strings, and 2) those strings may have various encodings. In the same sheet of the same file, some cells may be latin1, some UTF-8 and some CP437 (!).

read.xls() allows me to read those things into sets of data frames. My problem is to convert the encodings to UTF-8 without clobbering those that already are (or look like) UTF-8.

I came to the following solution:

foo <- function(d, from = "latin1", to = "UTF-8") {
  # Semi-smart conversion of a data frame between charsets.
  # Needed to ease use of those [@!] Excel files
  # that have survived the Win3.1 --> Win95 --> NT transition,
  # usually in poor shape.
  conv1 <- function(v, from, to) {
    condconv <- function(v, from, to) {
      cnv <- is.na(iconv(v, to, to))
      v[cnv] <- iconv(v[cnv], from, to)
      return(v)
    }
    if (is.factor(v)) {
      l <- condconv(levels(v), from, to)
      levels(v) <- l
      return(v)
    }
    else if (is.character(v)) return(condconv(v, from, to))
    else return(v)
  }
  for (i in names(d)) d[, i] <- conv1(d[, i], from, to)
  return(d)
}

Any advice for enhancement is welcome...

Sincerely yours,

Emmanuel Charpentier
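[The heart of foo() is the test inside condconv(): iconv(v, "UTF-8", "UTF-8") returns NA exactly for strings that are not already valid UTF-8, and only those cells are re-converted from latin1. A minimal self-contained sketch of that idea; the sample strings are illustrative additions, assuming a UTF-8 locale:

    # Two strings: the first contains a raw latin1 byte (0xE9, e-acute),
    # the second is already valid UTF-8.
    v <- c("caf\xe9", "d\u00e9j\u00e0 vu")

    # Round-tripping UTF-8 -> UTF-8 yields NA exactly for the invalid cells...
    bad <- is.na(iconv(v, "UTF-8", "UTF-8"))

    # ...so only those are converted from latin1, leaving valid UTF-8 alone.
    v[bad] <- iconv(v[bad], "latin1", "UTF-8")

After this, every element of v is valid UTF-8.]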
Peter Dalgaard
2008-Sep-07 23:45 UTC
[R] Request for advice on character set conversions (those damn Excel files, again ...)
Emmanuel Charpentier wrote:
> [...]
> Any advice for enhancement is welcome...

This looks reasonably sane, I think. The last loop could be

  d[] <- lapply(d, conv1, from, to)

but I think that is cosmetic.

You can't really do much better, because there is no simple way of distinguishing between the various 8-bit character sets. You could presumably set up some heuristics, like the fact that the occurrence of 0x82 or 0x8a probably indicates cp437, but it gets tricky. (At least, in French, you don't have the Danish/Norwegian peculiarity that upper/lowercase o-slash were missing in cp437 and therefore often replaced the yen and cent symbols in matrix-printer ROMs. We still get the occasional parcel addressed to "Øster Farimagsgade".)
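[Peter's heuristic hint can be sketched as follows. The helper name looks_like_cp437 and the byte test are my own illustration, not from the thread; the byte values come from the cp437 code chart, where 0x82 is e-acute and 0x8A is e-grave, both common in French text but control positions in latin1:

    # Rough heuristic: if a string contains byte 0x82 or 0x8A, it is more
    # likely cp437 than latin1 (where those bytes are C1 control codes).
    looks_like_cp437 <- function(x) {
      any(grepl("[\x82\x8a]", x, useBytes = TRUE))
    }

    looks_like_cp437("caf\x82")   # TRUE:  0x82 is cp437 e-acute
    looks_like_cp437("caf\xe9")   # FALSE: 0xE9 is latin1 e-acute

As Peter says, this gets tricky quickly; a byte test like this is only a hint, not proof of the encoding.]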
--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)            FAX: (+45) 35327907