David Byrne
2019-Feb-07 10:17 UTC
[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8
Bug

Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded file containing the infinity symbol ('∞') results in the infinity symbol being imported as the number 8. Other Unicode characters seem unaffected, for example Zhe: ж

Expected Behavior:
The imported data.frame should represent the infinity symbol as 'Inf' so that normal mathematical operations can be performed.

Stack Overflow Post:
I created a question on Stack Overflow where one other member was able to reproduce the same issue. The question can be found at:
https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int

Method to Reproduce - 1:
A simple way to reproduce this issue is to use RStudio. In the console, type the following:

> read.table(text=" ∞", encoding="UTF-8")

The result should be a data.frame with a single value of '8'.

Repeating the same with ж results in the correct, expected behavior.

Method to Reproduce - 2:
Create a .csv file containing the infinity and Zhe characters (I have attached the file for convenience; hopefully it is not rejected by your email service). Launch an interactive session using

> r --vanilla

Enter the following statement, taking care to replace <path-to-file> with the appropriate one:

> read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")

This should result in a two-element data.frame; the first element is the incorrect value of 8 with an additional <U+FEFF>, and the second is the correct value of Zhe.

Note the additional <U+FEFF> prefixed to the '8'. This appears to be a hidden character for the purposes of letting editors know the encoding. The following link has some explanation; however, it states this is caused by Excel. The file I created was made using Notepad, not Excel.

https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7

System Details:
OS:
> Windows 10.0.17134 Build 17134

R Version:
> platform       x86_64-w64-mingw32
> arch           x86_64
> os             mingw32
> system         x86_64, mingw32
> status
> major          3
> minor          4.1
> year           2017
> month          06
> day            30
> svn rev        72865
> language       R
> version.string R version 3.4.1 (2017-06-30)
> nickname       Single Candle
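For readers who want to separate the <U+FEFF> prefix from the infinity-symbol problem, here is a minimal sketch in R. It rests on assumptions: the file contents below (a UTF-8 byte order mark, the infinity symbol, a comma, Zhe, a newline) are a guess at what the attached unicode_chars.csv held, and the names path, bytes, has_bom, txt and df are purely illustrative. The bytes are written explicitly so the example does not depend on how the console or editor encodes typed characters, and the three BOM bytes are dropped before the text reaches read.table.

```r
# Write a tiny UTF-8 file that starts with a byte order mark (BOM),
# similar to what Notepad produces when saving as UTF-8.
path <- tempfile(fileext = ".csv")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF,   # BOM, shown by R as <U+FEFF>
                  0xE2, 0x88, 0x9E,   # infinity symbol, U+221E
                  0x2C,               # comma
                  0xD0, 0xB6,         # Zhe, U+0436
                  0x0A)),             # newline
         path)

# Read the raw bytes back and drop the BOM if it is present.
bytes <- readBin(path, what = "raw", n = file.size(path))
has_bom <- length(bytes) >= 3 &&
  identical(bytes[1:3], as.raw(c(0xEF, 0xBB, 0xBF)))
if (has_bom) bytes <- bytes[-(1:3)]

# Declare the remaining text as UTF-8 and parse it.
txt <- rawToChar(bytes)
Encoding(txt) <- "UTF-8"
df <- read.table(text = txt, sep = ",", encoding = "UTF-8",
                 stringsAsFactors = FALSE)

# Check the stored code points instead of relying on printed output.
utf8ToInt(df$V1)
utf8ToInt(df$V2)
```

Comparing code points with utf8ToInt() shows what is actually stored in the data.frame, independently of how a given console renders it.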
peter dalgaard
2019-Feb-07 12:55 UTC
[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8
This doesn't seem to be happening on MacOS, neither in Terminal nor RStudio, (R 3.5.1, R-devel, R-patched). So probably Windows specific.

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
David Byrne
2019-Feb-07 13:33 UTC
[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8
I can confirm that it doesn't happen on Ubuntu 18.04.1, so Peter is most likely correct; it looks like it's Windows-specific.
Tomas Kalibera
2019-Feb-08 16:23 UTC
[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8
I can reproduce with read.table(encoding="UTF-8") in RGui on Windows 10, reading a file containing the two UTF-8 characters. The table is read correctly into R as documented (both characters are represented in UTF-8 and marked as such), but the conversion of Infinity to 8 and of Zhe to <U+0436> happens later, during printing with print.data.frame(). For instance, it currently does not happen during print(as.matrix()).

As I wrote in more detail in another email in this thread, R sometimes needs to convert strings to the current native encoding. Windows converts Infinity to 8 by default as a "best fit", but fails to convert Zhe, so R displays <U+0436>.

It is easiest to only use input files in the current native encoding, so one could convert them before passing them to R (making sure the conversion does not have similar problems), or use R on a non-Windows platform. Relying on which R functions/packages can work with non-native encodings may be brittle, but of course any R function that is documented to work with non-native encodings (like read.table(encoding=)) should do so. If not, it will be fixed following a bug report.

I am not sure if that is what you had in mind, but conversion of character (string) to double is a different matter. as.double() now, as documented in ?as.double, returns NA for "∞" (on Linux).

Best
Tomas
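The two points in this reply (the stored value differs from what is printed, and string-to-double conversion is a separate step) can be sketched in R as follows. This is only an illustration under stated assumptions: to_number is a hypothetical helper, not part of R, and the choice to map the symbol to the text "Inf" before conversion is one possible approach rather than anything R does itself.

```r
# Build the infinity symbol from its code point so the example does not
# depend on how the console encodes typed characters.
inf_sym <- intToUtf8(0x221E)
x <- c(inf_sym, paste0("-", inf_sym), "1.5")

# Inspect what is stored, independent of how the console prints it.
stored <- utf8ToInt(x[1])

# Direct conversion does not treat the symbol as a number.
direct <- suppressWarnings(as.numeric(inf_sym))

# Hypothetical helper: replace the symbol with "Inf", then convert.
to_number <- function(s) {
  s <- gsub(inf_sym, "Inf", enc2utf8(s), fixed = TRUE)
  as.numeric(s)
}
vals <- to_number(x)
```

Doing the replacement explicitly keeps the conversion under the caller's control, so it does not rely on any platform-specific "best fit" behaviour.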