Dear All!

Reading character strings containing an "umlaut" from a csv file, I find a (to me) surprising behaviour in R 2.8.0 that I did not notice in R 2.7.2.
A comparison by "==" results in FALSE, while grep does find the agreement.
See the example below.
The crucial line is x=="div 1-2 Veränderungen", with the result [1] FALSE in R 2.8.0 but [1] TRUE in R 2.7.2.

Thank you in advance for your help

Heinz Tüchler

##### in R 2.8.0 patched

x0 <- "div 1-2 Veränderungen" # define a character string

write.csv(x0, 'chr.csv', row.names=FALSE) # write a csv-file with one line
rm(x0)

x <- read.csv('chr.csv', skip=0, header=TRUE, as.is=TRUE)$x # read in csv-file
x
x=="div 1-2 Veränderungen"
> [1] FALSE
grep("div 1-2 Veränderungen", x)
> [1] 1
grep("div 1-2 Veränderungen", x, value=TRUE)
> [1] "div 1-2 Veränderungen"

unlink('chr.csv') # delete file

Version:
 platform = i386-pc-mingw32
 arch = i386
 os = mingw32
 system = i386, mingw32
 status = Patched
 major = 2
 minor = 8.0
 year = 2008
 month = 11
 day = 04
 svn rev = 46830
 language = R
 version.string = R version 2.8.0 Patched (2008-11-04 r46830)

Windows XP (build 2600) Service Pack 2

Locale:
LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252

Search Path:
 .GlobalEnv, package:stats, package:graphics, package:grDevices, package:utils, package:datasets, package:methods, Autoloads, package:base


##### in R 2.7.2 patched

x0 <- "div 1-2 Veränderungen" # define a character string

write.csv(x0, 'chr.csv', row.names=FALSE) # write a csv-file with one line
rm(x0)

x <- read.csv('chr.csv', skip=0, header=TRUE, as.is=TRUE)$x # read in csv-file
x
x=="div 1-2 Veränderungen"
> [1] TRUE
grep("div 1-2 Veränderungen", x)
> [1] 1
grep("div 1-2 Veränderungen", x, value=TRUE)
> [1] "div 1-2 Veränderungen"

unlink('chr.csv') # delete file

Version:
 platform = i386-pc-mingw32
 arch = i386
 os = mingw32
 system = i386, mingw32
 status = Patched
 major = 2
 minor = 7.2
 year = 2008
 month = 09
 day = 02
 svn rev = 46486
 language = R
 version.string = R version 2.7.2 Patched (2008-09-02 r46486)

Windows XP (build 2600) Service Pack 2

Locale:
LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252

Search Path:
 .GlobalEnv, package:stats, package:graphics, package:grDevices, package:utils, package:datasets, package:methods, Autoloads, package:base

Look at Encoding() on your two strings. The results are different, and this seems to be the root of the problem. Adding encoding="latin1" to the read.csv call is a workaround.

It looks like there is a problem in the use of the CHARSXP cache: if I save the session then x0 == x becomes true when I reload it, even though the encodings remain different.

I've found the immediate cause and will change this in R-patched shortly.

On Thu, 6 Nov 2008, Heinz Tuechler wrote:

> Dear All!
>
> Reading character strings containing an "umlaut" from a csv-file I find a (to
> me) surprising behaviour in R 2.8.0, that I did not notice in R 2.7.2.
> [...]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
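
The workaround described in this reply can be sketched roughly as follows. This is only an illustration, not the fix that went into R-patched: the temporary file and the way its bytes are written are assumptions made so that the example does not depend on the locale it is run in. It inspects Encoding() on both strings and reads the file with encoding="latin1" so that both carry the same mark before they are compared.

```r
# Write a small CSV whose data line contains the Latin-1 byte 0xE4
# (a-umlaut) byte for byte, independent of the current locale.
# The temporary file is an assumption of this sketch.
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw('"x"\n"div 1-2 Ver'), as.raw(0xe4),
           charToRaw('nderungen"\n')), tmp)

# Reference string: the same bytes, explicitly marked as Latin-1.
x0 <- "div 1-2 Ver\xe4nderungen"
Encoding(x0) <- "latin1"

# Workaround from the reply: tell read.csv how to mark what it reads.
x <- read.csv(tmp, as.is = TRUE, encoding = "latin1")$x

Encoding(x0)   # mark on the reference string
Encoding(x)    # mark on the string read from the file
x == x0        # comparison once both strings carry the same mark

unlink(tmp)    # delete file
```

Note that Encoding() only reports or declares how the bytes are to be interpreted; it does not convert them.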

Dear Prof. Ripley!

Thank you very much for your attention. In the given example, Encoding() or the encoding parameter of read.csv solves the problem. I hope your patch will also solve the problem when I read an SPSS file with spss.get(), since this function has no encoding parameter and my real problem originated there.

Many thanks

Heinz Tüchler

At 23:51 06.11.2008, you wrote:
>Look at Encoding() on your two strings. The
>results are different, and this seems to be the
>root of the problem.
>[...]

On Sat, 8 Nov 2008, Heinz Tuechler wrote:

> At 08:01 08.11.2008, Prof Brian Ripley wrote:
>> We have no idea what you understood (you didn't tell us), but the help says
>>
>> encoding: character vector.  The encoding(s) to be assumed when 'file'
>>           is a character string: see 'file'.  A possible value is
>>           '"unknown"': see the 'Details'.
>>
>> ...
>>      This paragraph applies if 'file' is a filename (rather than a
>>      connection).  If 'encoding = "unknown"', an attempt is made to
>>      guess the encoding.  The result of 'localeToCharset()' is used as
>>      a guide.  If 'encoding' has two or more elements, they are tried
>>      in turn until the file/URL can be read without error in the trial
>>      encoding.
>>
>> So source(encoding="latin1") says the file is encoded in Latin-1 and should
>> be re-encoded if necessary (e.g. in UTF-8 locale).
>>
>> Setting the Encoding of parsed character strings is not mentioned.
>>
>> You could have written out a data frame with write.csv() and re-read it
>> with read.csv(encoding = "latin1"): that was the workaround you were given
>> earlier (not to use source).
>
> Thank you for this explanation. I felt that I did not understand the help
> page of source(), and I hoped encoding='latin1' would have the same effect as
> in read.csv(), but rethinking it, I see that it would conflict with the
> primary functionality of source().
> Earlier I tried writing the data.frame with write.csv and re-reading it. This
> works, but additional information like labels() I have to transfer in a
> second step.
> The best way I could imagine would be some function which marks every
> character string in the whole structure of a data.frame, including all
> attributes, as latin1.

I think it is possible that

con <- file("foo")
source(con, encoding="latin1")
close(con)

will also do what you want, although that's an undocumented side effect.

But all of this should be unnecessary in R-patched (although it is possible that there are other quirks with unmarked strings lurking in the shadows, there are no other obvious changes from 2.7.2).

>
>> On Sat, 8 Nov 2008, Heinz Tuechler wrote:
>>
>>> At 16:52 07.11.2008, Prof Brian Ripley wrote:
>>>> On Fri, 7 Nov 2008, Peter Dalgaard wrote:
>>>>
>>>>> Heinz Tuechler wrote:
>>>>>> [...] I hope your
>>>>>> patch will solve also the problem, when I read a spss file by
>>>>>> spss.get(), since this function has no encoding parameter and my real
>>>>>> problem originated there.
>>>>> read.spss() (package foreign) does have a reencode argument, though; and
>>>>> this is called by spss.get(), so it looks like an easy hack to add it
>>>>> there.
>>>> Yes, older software like spss.get needs to get updated for the
>>>> internationalization age. Modifying it to have a ... argument passed to
>>>> read.spss would be a good idea (and future-proofing).
>>>> In cases like this it is likely that the SPSS file does contain its
>>>> encoding (although sometimes it does not and occasionally it is wrong),
>>>> so it is helpful to make use of the info if it is there. However, the
>>>> default is read.spss(reencode=NA) because the problems of assuming
>>>> that the info is correct when it is not are worse.
>>>
>>> The reason I tried the example below was to solve the encoding problem by
>>> dumping and then re-sourcing a data.frame with the encoding parameter set
>>> to latin1. As you can see, source(x, encoding='latin1') does not have the
>>> effect I expected. Unfortunately I do not have any idea what I understood
>>> wrong regarding the meaning of encoding='latin1'.
>>>
>>> Heinz Tüchler
>>>
>>>
>>> us <- c("a", "b", "c", "??", "??", "??")
>>> Encoding(us)
>>> [1] "unknown" "unknown" "unknown" "latin1" "latin1" "latin1"
>>> dump('us', 'us_dump.txt')
>>> rm(us)
>>> source('us_dump.txt', encoding='latin1')
>>> us
>>> [1] "a" "b" "c" "??" "??" "??"
>>> Encoding(us)
>>> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
>>> unlink('us_dump.txt')
>>>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
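
The function wished for in the quoted message (one that marks every character string in a data.frame, including all attributes, as latin1) might look roughly like the sketch below. The name mark_latin1 is hypothetical and not part of R or of any package mentioned in this thread. The sketch assumes that all non-ASCII strings in the object really are Latin-1 bytes: it only declares the encoding via Encoding<- and converts nothing, and it has not been checked against classed list objects with unusual internals.

```r
# Hypothetical helper (not part of R): recursively mark all character data
# in an object, including its attributes, as Latin-1.
mark_latin1 <- function(obj) {
  a <- attributes(obj)              # names, class, labels, factor levels, ...
  if (is.character(obj)) {
    Encoding(obj) <- "latin1"       # declares the encoding; bytes are unchanged
  } else if (is.list(obj)) {
    # recurse into columns / list elements
    obj <- lapply(unclass(obj), mark_latin1)
  }
  # restore the attributes, marking any strings inside them as well
  if (!is.null(a)) attributes(obj) <- lapply(a, mark_latin1)
  obj
}

# Usage on a small data frame with an unmarked string and a label attribute
# (example data made up for this sketch).
df <- data.frame(s = "Ver\xe4nderungen", stringsAsFactors = FALSE)
attr(df$s, "label") <- "Ver\xe4nderungen (label)"

out <- mark_latin1(df)
Encoding(out$s)                  # mark on the column
Encoding(attr(out$s, "label"))   # mark on the attribute
```

Pure ASCII strings stay marked as "unknown", since Encoding<- has no effect on them.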