thr3ads.net - R help - [R] Umlaut read from csv-file [Nov 2008]


Heinz Tuechler

2008-Nov-06 20:39 UTC

[R] Umlaut read from csv-file

Dear All!

Reading character strings containing an umlaut from a csv-file, I find a
(to me) surprising behaviour in R 2.8.0 that I did not notice in R 2.7.2.
A comparison by "==" results in FALSE, while grep does find the agreement.
See the example below.
The crucial line is x=="div 1-2 Veränderungen", with the result
[1] FALSE in R 2.8.0 but [1] TRUE in R 2.7.2.

Thank you in advance for your help

Heinz Tüchler

##### in R 2.8.0 patched

x0 <- "div 1-2 Veränderungen" # define a character string

write.csv(x0, 'chr.csv', row.names=FALSE) # write a csv-file with one line
rm(x0)

x <- read.csv('chr.csv', skip=0, header=TRUE, as.is=TRUE)$x # read in csv-file
x
x=="div 1-2 Veränderungen"
 > [1] FALSE
grep("div 1-2 Veränderungen", x)
 > [1] 1
grep("div 1-2 Veränderungen", x, value=TRUE)
 > [1] "div 1-2 Veränderungen"

unlink('chr.csv') # delete file

Version:
  platform = i386-pc-mingw32
  arch = i386
  os = mingw32
  system = i386, mingw32
  status = Patched
  major = 2
  minor = 8.0
  year = 2008
  month = 11
  day = 04
  svn rev = 46830
  language = R
  version.string = R version 2.8.0 Patched (2008-11-04 r46830)

Windows XP (build 2600) Service Pack 2

Locale:
LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252

Search Path:
  .GlobalEnv, package:stats, package:graphics, 
package:grDevices, package:utils, 
package:datasets, package:methods, Autoloads, package:base


##### in R 2.7.2 patched


x0 <- "div 1-2 Veränderungen" # define a character string

write.csv(x0, 'chr.csv', row.names=FALSE) # write a csv-file with one line
rm(x0)

x <- read.csv('chr.csv', skip=0, header=TRUE, as.is=TRUE)$x # read in csv-file
x
x=="div 1-2 Veränderungen"
 > [1] TRUE
grep("div 1-2 Veränderungen", x)
 > [1] 1
grep("div 1-2 Veränderungen", x, value=TRUE)
 > [1] "div 1-2 Veränderungen"

unlink('chr.csv') # delete file

Version:
  platform = i386-pc-mingw32
  arch = i386
  os = mingw32
  system = i386, mingw32
  status = Patched
  major = 2
  minor = 7.2
  year = 2008
  month = 09
  day = 02
  svn rev = 46486
  language = R
  version.string = R version 2.7.2 Patched (2008-09-02 r46486)

Windows XP (build 2600) Service Pack 2

Locale:
LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252

Search Path:
  .GlobalEnv, package:stats, package:graphics, 
package:grDevices, package:utils, 
package:datasets, package:methods, Autoloads, package:base

Prof Brian Ripley

2008-Nov-06 22:51 UTC

[R] Umlaut read from csv-file

Look at Encoding() on your two strings.  The results are different, and
this seems to be the root of the problem.  Adding encoding="latin1" to the
read.csv call is a workaround.

It looks like there is a problem in the use of the CHARSXP cache: if I 
save the session then x0 == x becomes true when I reload it, even though 
the encodings remain different.

I've found the immediate cause and will change this in R-patched shortly.
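
As a rough illustration of the check and workaround described above, the following sketch writes the example string to a file as Latin-1 bytes, reads it back with encoding="latin1", and looks at Encoding() on both strings. The temporary file and the variable names are assumptions made for the illustration, and the behaviour of "==" shown here is that of a current R, not of the 2.8.0 build discussed in this thread.

```r
# The example string, written with a \u escape so that this script is pure ASCII
x0 <- "div 1-2 Ver\u00e4nderungen"

# Build a one-column csv file whose bytes are Latin-1 (as on the poster's system)
tmp <- tempfile(fileext = ".csv")
latin1_bytes <- iconv(enc2utf8(x0), from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]
writeBin(c(charToRaw("\"x\"\n\""), latin1_bytes, charToRaw("\"\n")), tmp)

# The workaround: declare the file's encoding so the strings are marked on input
x <- read.csv(tmp, header = TRUE, as.is = TRUE, encoding = "latin1")$x

Encoding(x0)   # encoding mark of the literal
Encoding(x)    # encoding mark of the string read from the file
x == x0        # compares the text, not the encoding marks

unlink(tmp)    # delete file
```

Without the encoding argument the string read from the file carries no mark, which is the situation in which the two versions of R behaved differently.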

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Heinz Tuechler

2008-Nov-06 23:50 UTC

[R] Umlaut read from csv-file

Dear Prof. Ripley!

Thank you very much for your attention. In the given example, Encoding() or
the encoding parameter of read.csv solves the problem. I hope your patch will
also solve the problem when I read an SPSS file by spss.get(), since this
function has no encoding parameter and my real problem originated there.

Many thanks

Heinz Tüchler


Prof Brian Ripley

2008-Nov-09 05:25 UTC

[R] Umlaut read from csv-file

On Sat, 8 Nov 2008, Heinz Tuechler wrote:
> At 08:01 08.11.2008, Prof Brian Ripley wrote:
>> We have no idea what you understood (you didn't tell us), but the help says
>>
>> encoding: character vector.  The encoding(s) to be assumed when 'file'
>>           is a character string: see 'file'.  A possible value is
>>           '"unknown"': see the 'Details'.
>>
>> ...
>>      This paragraph applies if 'file' is a filename (rather than a
>>      connection).  If 'encoding = "unknown"', an attempt is made to
>>      guess the encoding.  The result of 'localeToCharset()' is used as
>>      a guide.  If 'encoding' has two or more elements, they are tried
>>      in turn until the file/URL can be read without error in the trial
>>      encoding.
>>
>> So source(encoding="latin1") says the file is encoded in Latin-1 and should
>> be re-encoded if necessary (e.g. in a UTF-8 locale).
>>
>> Setting the Encoding of parsed character strings is not mentioned.
>>
>> You could have written out a data frame with write.csv() and re-read it
>> with read.csv(encoding = "latin1"): that was the workaround you were given
>> earlier (not to use source).
>
> Thank you for this explanation. I felt that I did not understand the help
> page of source() and I hoped encoding='latin1' would have the same effect as
> in read.csv(), but rethinking it, I see that it would conflict with the
> primary functionality of source().
> Earlier I tried writing the data.frame with write.csv and re-reading it. This
> works, but additional information like labels() I have to transfer in a
> second step.
> The best way I could imagine would be some function which marks every
> character string in the whole structure of a data.frame, including all
> attributes, as latin1.
I think it is possible that

con <- file("foo")
source(con, encoding="latin1")
close(con)

will also do what you want, although that's an undocumented side effect.

But all of this should be unnecessary in R-patched (although it is 
possible that there are other quirks with unmarked strings lurking in the 
shadows, there are no other obvious changes from 2.7.2).
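
The kind of function asked for above, one that marks every character string in a data.frame and its attributes as latin1, could be sketched roughly as follows. The name mark_latin1 is hypothetical and not part of R or of any package mentioned in this thread. Note that Encoding<- only declares an encoding and does not convert any bytes, so the sketch is only appropriate when the strings really are Latin-1.

```r
# Hypothetical helper: recursively mark character data as Latin-1.
# It declares the encoding only; the underlying bytes are left unchanged.
mark_latin1 <- function(obj) {
  if (is.character(obj)) {
    Encoding(obj) <- "latin1"            # affects non-ASCII elements only
  } else if (is.list(obj)) {
    obj[] <- lapply(obj, mark_latin1)    # columns of a data.frame, list elements
  }
  # Attributes (names, levels, labels, ...) may hold character strings as well
  for (nm in names(attributes(obj))) {
    attr(obj, nm) <- mark_latin1(attr(obj, nm, exact = TRUE))
  }
  obj
}

# Usage: a string holding raw Latin-1 bytes with no encoding mark
s <- rawToChar(as.raw(c(0x56, 0x65, 0x72, 0xe4)))   # "Ver" followed by a-umlaut
df <- data.frame(x = s, stringsAsFactors = FALSE)
attr(df$x, "label") <- s                            # extra information on the column

marked <- mark_latin1(df)
Encoding(marked$x)
Encoding(attr(marked$x, "label"))
```

Factor levels are reached through the attribute loop, since levels are stored as an attribute.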
>
>> On Sat, 8 Nov 2008, Heinz Tuechler wrote:
>>
>>> At 16:52 07.11.2008, Prof Brian Ripley wrote:
>>>> On Fri, 7 Nov 2008, Peter Dalgaard wrote:
>>>>
>>>>> Heinz Tuechler wrote:
>>>>>> Dear Prof. Ripley!
>>>>>> Thank you very much for your attention. In the given example,
>>>>>> Encoding() or the encoding parameter of read.csv solves the
>>>>>> problem. I hope your patch will also solve the problem when I
>>>>>> read an SPSS file by spss.get(), since this function has no
>>>>>> encoding parameter and my real problem originated there.
>>>>> read.spss() (package foreign) does have a reencode argument, though; and
>>>>> this is called by spss.get(), so it looks like an easy hack to add it
>>>>> there.
>>>> Yes, older software like spss.get needs to get updated for the
>>>> internationalization age.  Modifying it to have a ... argument passed to
>>>> read.spss would be a good idea (and future-proofing).
>>>> In cases like this it is likely that the SPSS file does contain its
>>>> encoding (although sometimes it does not and occasionally it is wrong),
>>>> so it is helpful to make use of the info if it is there.  However, the
>>>> default is read.spss(reencode=NA) because the problems of assuming
>>>> that the info is correct when it is not are worse.
>>>
>>> The reason I tried the example below was to solve the encoding problem by
>>> dumping and then re-sourcing a data.frame with the encoding parameter set
>>> to latin1. As you can see, source(x, encoding='latin1') does not have the
>>> effect I expected. Unfortunately I do not have any idea what I
>>> misunderstood regarding the meaning of encoding='latin1'.
>>> 
>>> Heinz Tüchler
>>>
>>>
>>> us <- c("a", "b", "c", "ä", "ö", "ü")
>>> Encoding(us)
>>> [1] "unknown" "unknown" "unknown" "latin1"  "latin1"  "latin1"
>>> dump('us', 'us_dump.txt')
>>> rm(us)
>>> source('us_dump.txt', encoding='latin1')
>>> us
>>> [1] "a" "b" "c" "ä" "ö" "ü"
>>> Encoding(us)
>>> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
>>> unlink('us_dump.txt')
