David Byrne
2019-Feb-07 13:33 UTC
[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8
I can confirm that it doesn't happen on Ubuntu 18.04.1, so Peter is most likely correct; it looks like it's Windows-specific.

On Thu, 7 Feb 2019 at 12:55, peter dalgaard <pdalgd at gmail.com> wrote:
> This doesn't seem to be happening on macOS, neither in Terminal nor RStudio (R 3.5.1, R-devel, R-patched). So probably Windows-specific.
>
> -pd
>
> > On 7 Feb 2019, at 11:17, David Byrne <david.byrne222 at gmail.com> wrote:
> >
> > Bug
> > Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
> > file containing the infinity symbol ('∞') results in the infinity
> > symbol being imported as the number 8. Other Unicode characters seem
> > unaffected, for example, Zhe: ж
> >
> > Expected Behavior:
> > The imported data.frame should represent the infinity symbol as the
> > expected 'Inf' so that normal mathematical operations can be processed.
> >
> > Stack Overflow Post:
> > I created a question on Stack Overflow where one other member was able
> > to reproduce the same issue I was having. This question can be found
> > at:
> > https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
> >
> > Method to Reproduce - 1:
> > A simple method to reproduce this issue is to use RStudio. In the
> > console, type the following:
> > > read.table(text="∞", encoding="UTF-8")
> >
> > The result should be a data.frame with a single value of '8'.
> >
> > Repeating the same with ж results in the correct, expected behavior.
> >
> > Method to Reproduce - 2:
> > Create a .csv file containing the infinity and Zhe characters (I have
> > attached the file for convenience; hopefully it is not rejected by your
> > email service). Launch an interactive session using
> >
> > > r --vanilla
> >
> > Enter the following statement, taking care to replace
> > <path-to-file> with the appropriate one:
> >
> > > read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")
> >
> > This should result in a two-element data.frame; the first being the
> > incorrect value of 8 with an additional <U+FEFF> and the second the
> > correct value of Zhe.
> >
> > Note the additional <U+FEFF> prefixed to the front of the '8'. This
> > appears to be a hidden character for the purposes of letting editors
> > know the encoding. The following link has some explanation; however, it
> > states this is caused by Excel. The file I created was made using
> > Notepad, not Excel.
> >
> > https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
> >
> > System Details:
> > OS:
> > > Windows 10.0.17134 Build 17134
> >
> > R Version:
> > > platform x86_64-w64-mingw32
> > > arch x86_64
> > > os mingw32
> > > system x86_64, mingw32
> > > status
> > > major 3
> > > minor 4.1
> > > year 2017
> > > month 06
> > > day 30
> > > svn rev 72865
> > > language R
> > > version.string R version 3.4.1 (2017-06-30)
> > > nickname Single Candle
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
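Regarding the "Expected Behavior" in the report quoted above: one possible workaround, if numeric infinities are wanted from such data, is to read the values as text and map the symbol to "Inf" explicitly before type conversion. The following is only a sketch. infinity_to_inf is a hypothetical helper name (not part of R), and it assumes the strings reach R intact as UTF-8, which per this thread may not hold on Windows with a non-UTF-8 native encoding.

```r
# Hypothetical helper (not part of R): replace the Unicode infinity sign
# U+221E with the token "Inf", then let type.convert() pick the column type.
infinity_to_inf <- function(x) {
  x <- enc2utf8(x)                              # work on UTF-8 strings
  x <- gsub("\u221E", "Inf", x, fixed = TRUE)   # "∞" -> "Inf", "-∞" -> "-Inf"
  type.convert(x, as.is = TRUE)
}

vals <- c("1.5", "\u221E", "-\u221E")
res <- infinity_to_inf(vals)
print(res)
```

The "\u221E" escape is used instead of a literal symbol so that the script itself does not depend on how the editor or console encodes the character.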
Daniel Possenriede
2019-Feb-07 14:10 UTC
[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8
There seems to be something odd with "∞" on Windows (and not only with read.table).

In native encoding (cp-1252 in my case), "∞" gets converted to "8":

x <- "∞"
Encoding(x)
#> [1] "unknown"
print(x)
#> [1] "8"
charToRaw(x)
#> [1] 38

"∞" is indeed "8":

identical(x, "8")
#> [1] TRUE

Everything seems fine if "∞" is UTF-8 encoded:

y <- "\u221E"
Encoding(y)
#> [1] "UTF-8"
print(y)
#> [1] "∞"
charToRaw(y)
#> [1] e2 88 9e

Unless the string is converted back to native encoding:

format(y)
#> [1] "8"

This ought to be "<U+221E>", equivalently to

format("∝")
#> [1] "<U+221D>"

Session Info:

si <- sessionInfo()
si$running
#> [1] "Windows 10 x64 (build 17134)"
si$R.version$version.string
#> [1] "R version 3.5.2 (2018-12-20)"
si$locale
#> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
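Since print() and format() may pass the string through the native encoding, a way to check what a string really contains without relying on how it is displayed is to look at its code points and bytes directly. A minimal sketch follows; code_points is a hypothetical helper name, and it assumes a single string as input.

```r
# Hypothetical helper: list the Unicode code points of one string as "U+XXXX".
code_points <- function(x) {
  sprintf("U+%04X", utf8ToInt(enc2utf8(x)))
}

y <- "\u221E"               # infinity, written as an escape
cp_inf   <- code_points(y)  # "U+221E"
cp_eight <- code_points("8")  # "U+0038"
bytes    <- charToRaw(y)    # e2 88 9e, the UTF-8 bytes of U+221E
print(cp_inf)
print(cp_eight)
print(bytes)
```

If a value that was typed or pasted as "∞" reports U+0038 here, the substitution already happened before the string reached this code.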
Paul McQuesten
2019-Feb-07 14:38 UTC
[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8
Windows Notepad prefixes UTF-8 files with a Byte Order Mark (\UFEFF). Per https://en.wikipedia.org/wiki/Byte_order_mark, this is permitted in UTF-8, but not required. I suppose that there are other Windows programs which do likewise (in addition to Excel and Notepad).

"The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work. The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature.""
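For files that carry such a BOM, R's connections accept the encoding name "UTF-8-BOM" for reading, which drops the mark if it is present. The sketch below writes a small BOM-prefixed file to a temporary path and reads it back with fileEncoding = "UTF-8-BOM". The file content is an illustrative assumption (plain ASCII digits rather than the original attachment), so it shows only the BOM handling, not the infinity substitution.

```r
# Write a tiny CSV that starts with the UTF-8 byte order mark (EF BB BF),
# similar to what Notepad produces when saving as UTF-8.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("1,2\n")), tmp)

# The first three bytes on disk are the BOM.
first_bytes <- readBin(tmp, what = "raw", n = 3L)

# "UTF-8-BOM" tells the connection to skip a leading BOM when reading.
df <- read.table(tmp, sep = ",", fileEncoding = "UTF-8-BOM")
print(df)

unlink(tmp)
```

Note that fileEncoding (re-encoding while reading) is a different argument from encoding (which only marks the strings that were read), so the two are not interchangeable.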
Tomas Kalibera
2019-Feb-08 12:07 UTC
[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8
I can reproduce this behavior on my Windows 10 system in RGui (cp1252): when I paste the Unicode infinity symbol into the console, it is treated as the number 8. This is caused by the Windows "best fit" default behavior in conversion of Unicode characters to characters in the current native encoding: at some point in the past, 8 was chosen as a good fit for infinity in Windows. In my scenario, the conversion is invoked by RGui before returning the input to the main R loop, even before the input gets to the parser.

In principle, we could change this particular conversion in RGui to avoid the substitution. RGui uses "\uxxxx" escapes to pass characters that cannot be represented, which is why e.g. the Cyrillic Zhe \u0436 worked. So we could tell Windows not to do the substitution and pass "\u221e" for infinity; the string, after being processed by the parser, would then be represented in UTF-8 inside R and could be e.g. printed by the RGui console. That is something that could be considered, but it will not solve the main problem, and it may actually cause trouble to users who are used to such substitutions (especially when the substitutions are more intuitive, but that may be a matter of opinion).

The main problem is that in normal use, sooner or later R will get to the point where it needs to do the conversion to native encoding, in some context where "\uxxxx" escapes are not possible. One cannot reliably work with strings in R that cannot be represented in the current native encoding (except when one knows precisely how to avoid the conversion in some specific task, but that may be brittle; so the best-fit substitution might in principle help here).

This problem does not exist on Unix/macOS systems, where the current native encoding is UTF-8 these days, so today it only exists on Windows, where UTF-8 cannot be the current native encoding. As has been discussed before, even though we could in principle rewrite all calls to the Windows API to use Unicode and have all strings in UTF-8 in R, we would still have problems when interfacing with packages that assume strings are in the current native encoding (without checking), so this problem won't be easy to fix.

Best,
Tomas
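To illustrate the "\uxxxx" escape idea mentioned above from the user's side: a script can keep non-ASCII characters as escapes so that the source text stays pure ASCII and does not depend on a native-encoding conversion before parsing. The sketch below is not how RGui implements it; escape_non_ascii is a hypothetical helper, it expects a single string, and it does not escape quotes or backslashes.

```r
# Hypothetical helper: rewrite every non-ASCII character of one string as an
# R escape ("\uxxxx", or "\Uxxxxxxxx" beyond the Basic Multilingual Plane).
escape_non_ascii <- function(x) {
  cps   <- utf8ToInt(enc2utf8(x))             # one code point per character
  chars <- intToUtf8(cps, multiple = TRUE)    # the same characters, one each
  esc   <- ifelse(cps > 0xFFFF,
                  sprintf("\\U%08x", cps),
                  sprintf("\\u%04x", cps))
  paste(ifelse(cps > 127L, esc, chars), collapse = "")
}

escaped <- escape_non_ascii("x <- \u221E")
cat(escaped, "\n")   # x <- \u221e
```

This only protects the source text on its way to the parser. As noted above, a later conversion to the native encoding (for example when formatting or printing) can still substitute the character.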