Hi,

This is my first email to r-help, and as I am not sure about the issue, I wanted to ask for help first.

The comments under this thread <https://github.com/ropensci/stats19/pull/83> outline a particular string from a dataset which seems to be read by R on Windows differently from R on Linux and macOS, and also from bash on Ubuntu Bionic.

There seem to be some weird and unidentifiable (to me) characters in front of the `Accident_Index` column name, causing the length to be 17 rather than 14 characters.

I have inspected the string as best as I could and cannot see why we get this output on a Windows machine.

Is it an issue in `read.table()`?

Thanks
---
Layik Hama
Research Fellow
Leeds Institute for Data Analytics
Room 11.70, Worsley Building, University of Leeds
On Thu, 17 Jan 2019 14:55:18 +0000
Layik Hama <L.Hama at leeds.ac.uk> wrote:

> There seem to be some weird and unidentifiable (to me) characters in
> front of the `Accident_Index` column name, causing the length
> to be 17 rather than 14 characters.

Repeating the reproduction steps described at the linked pull request,

$ curl -o acc2017.zip http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
$ unzip acc2017.zip
$ head -n 1 Acc.csv | hd | head -n 2
00000000  ef bb bf 41 63 63 69 64  65 6e 74 5f 49 6e 64 65  |...Accident_Inde|
00000010  78 2c 4c 6f 63 61 74 69  6f 6e 5f 45 61 73 74 69  |x,Location_Easti|

The document begins with a U+FEFF BYTE ORDER MARK, encoded in UTF-8. Not sure which encoding R chooses on Windows by default, but explicitly passing encoding="UTF-8" (or is it fileEncoding?) might help decode it as such.

(Sorry, cannot test my advice on Windows right now.)

-- 
Best regards,
Ivan
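A minimal R sketch of the suggestion above, for anyone who wants to try it without downloading the data. It writes a small temporary file that starts with the same three bytes (ef bb bf) and reads it back. The file body, the second column name `value` and the variable names are made up for illustration; only `Accident_Index` comes from the hexdump. As far as ?file documents it, the encoding name "UTF-8-BOM" is accepted for reading and removes a byte order mark if one is present, and in `read.table()` it is `fileEncoding` (not `encoding`) that controls how the file is decoded. This has not been checked against the real Acc.csv.

```r
# Hypothetical stand-in for Acc.csv: a UTF-8 BOM followed by a tiny CSV body.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xef, 0xbb, 0xbf))
body <- charToRaw("Accident_Index,value\nx1,100\n")
writeBin(c(bom, body), tmp)

# The first three bytes of the file are the BOM seen in the hexdump.
first_bytes <- readBin(tmp, what = "raw", n = 3)
print(first_bytes)

# Declaring the file as UTF-8 with a BOM lets the connection drop the mark
# before read.csv() builds the column names.
acc <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
print(names(acc))
print(nchar(names(acc)[1]))

unlink(tmp)
```

What a plain `read.csv(tmp)` returns for the first column name depends on the platform and locale, so it is left out of the sketch.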
Ivan,

Thank you for digging into the string. I can confirm that the `hexdump` shows extra characters on bash, too.

The question would then be: Why would `identical(str, "Accident_Index", ignore.case = TRUE)` behave differently on Linux/MacOS vs Windows?

Thanks
---
Layik Hama

________________________________
From: Ivan Krylov <krylov.r00t at gmail.com>
Sent: 17 January 2019 20:40:32
To: Layik Hama
Cc: r-help at r-project.org
Subject: Re: [R] Potential R bug in identical

[...]
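A rough R sketch of why the comparison can come out differently, offered as an illustration and not as a diagnosis. If the three BOM bytes survive into the column name, the string is 17 bytes long and cannot be identical to the 14-byte "Accident_Index"; if they are dropped on input, the comparison succeeds. As far as I recall from the R NEWS, `readLines()` and `scan()` (and hence `read.table()`) discard a UTF-8 BOM when R runs in a UTF-8 locale, which would explain Linux and macOS behaving differently from a Windows session in a non-UTF-8 locale, but treat that as an assumption to verify. The helper `strip_bom()` and the variable names are hypothetical. Note also that `identical()` as documented has no `ignore.case` argument, so the sketch calls it with two arguments only.

```r
# Rebuild the header field from the hexdump: three BOM bytes, then the name.
bom <- as.raw(c(0xef, 0xbb, 0xbf))
raw_name <- c(bom, charToRaw("Accident_Index"))

# Byte counts do not depend on the locale.
n_with_bom <- length(raw_name)
n_plain <- length(charToRaw("Accident_Index"))

# strip_bom() is a hypothetical helper: drop the leading BOM bytes, if present.
strip_bom <- function(bytes) {
  has_bom <- length(bytes) >= 3 && identical(bytes[1:3], bom)
  if (has_bom) bytes[-(1:3)] else bytes
}

with_bom <- rawToChar(raw_name)          # name as read with the BOM kept
clean <- rawToChar(strip_bom(raw_name))  # name as read with the BOM dropped

print(c(n_with_bom, n_plain))
print(identical(with_bom, "Accident_Index"))
print(identical(clean, "Accident_Index"))
```

On this reading `identical()` is doing its job in both cases; the difference would be in what the reading step hands to it.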