thr3ads.net - R help - [R] Potential R bug in identical [Jan 2019]

Layik Hama

2019-Jan-17 14:55 UTC

[R] Potential R bug in identical

Hi,


This is my first email to r-help, and as I am not sure about the issue, I
wanted to ask for help first.

The comments on this pull request
<https://github.com/ropensci/stats19/pull/83> describe a particular string
from a dataset which R on Windows seems to read differently from R on Linux and
MacOS, and from bash on Ubuntu Bionic. There seem to be some weird and
unidentifiable (to me) characters in front of the `Accident_Index` column name,
causing its length to be 17 rather than 14 characters.


I have inspected the string as best I could and cannot see why a Windows
machine produces this output.


Is it an issue in `read.table()`?


Thanks


---

Layik Hama
Research Fellow

Leeds Institute for Data Analytics
Room 11.70, Worsley Building,
University of Leeds

Ivan Krylov

2019-Jan-17 20:40 UTC

On Thu, 17 Jan 2019 14:55:18 +0000
Layik Hama <L.Hama at leeds.ac.uk> wrote:
> There seem to be some weird and unidentifiable (to me) characters in
> front of the `Accident_Index` column name there causing the length
> to be 17 rather than 14 characters.

Repeating the reproduction steps described at the linked pull request:

$ curl -o acc2017.zip http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
$ unzip acc2017.zip
$ head -n 1 Acc.csv | hd | head -n 2
00000000  ef bb bf 41 63 63 69 64  65 6e 74 5f 49 6e 64 65  |...Accident_Inde|
00000010  78 2c 4c 6f 63 61 74 69  6f 6e 5f 45 61 73 74 69  |x,Location_Easti|

The document begins with a U+FEFF BYTE ORDER MARK, encoded in UTF-8 as
the three bytes ef bb bf, which would account for a length of 17 rather
than 14. I am not sure which encoding R chooses by default on Windows,
but explicitly passing encoding="UTF-8" (or is it fileEncoding?) might
help decode it as such. (Sorry, I cannot test this advice on Windows
right now.)
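
A minimal sketch of the suggestion above, assuming the file is UTF-8 with a
leading BOM. R's documentation for connections lists "UTF-8-BOM" as an
encoding name accepted for reading that removes a byte order mark if present.
The tiny two-column file written here is a hypothetical stand-in for Acc.csv,
not the real data, and the variable names are made up for illustration.

```r
# Write a tiny CSV that starts with the UTF-8 BOM (ef bb bf), as in the hexdump.
# The contents are invented; only the first header name matches Acc.csv.
path <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xef, 0xbb, 0xbf))
writeBin(c(bom, charToRaw("Accident_Index,Value\nA1,1\nA2,2\n")), path)

# Confirm that the BOM really is on disk.
first_bytes <- readBin(path, what = "raw", n = 3L)

# fileEncoding (rather than encoding) declares how the file should be decoded
# while reading; "UTF-8-BOM" asks R to drop a leading BOM if one is present.
acc <- read.csv(path, fileEncoding = "UTF-8-BOM")
first_name <- names(acc)[1]

print(first_name)
print(nchar(first_name))
```

Whether this changes anything on a given platform depends on the locale the
session runs in, so it may be worth comparing `names(acc)[1]` with and without
the `fileEncoding` argument on the Windows machine in question.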

-- 
Best regards,
Ivan

Layik Hama

2019-Jan-17 21:05 UTC

Ivan,

Thank you for digging into the string. I can confirm that the `hexdump` shows
extra characters on bash, too.

The question would then be:

Why would `identical(str, "Accident_Index", ignore.case = TRUE)`
behave differently on Linux/MacOS vs Windows?
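
One way to see what the comparison is up against, sketched under the
assumption that the header reached R with the three BOM bytes still attached.
The variable names and the `strip_bom()` helper are hypothetical, and the
plain two-argument form of `identical()` is used.

```r
# Build the header as it appears on disk: BOM bytes followed by the name.
bom <- as.raw(c(0xef, 0xbb, 0xbf))
with_bom <- rawToChar(c(bom, charToRaw("Accident_Index")))

# Three extra bytes in front of a 14-byte name.
n_bytes <- nchar(with_bom, type = "bytes")

# The strings differ, so the comparison fails whenever the BOM survives reading.
same_before <- identical(with_bom, "Accident_Index")

# Hypothetical helper: drop a leading UTF-8 BOM by working on raw bytes,
# which avoids depending on the session's locale.
strip_bom <- function(x) {
  bytes <- charToRaw(x)
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  rawToChar(bytes)
}

same_after <- identical(strip_bom(with_bom), "Accident_Index")

print(n_bytes)
print(same_before)
print(same_after)
```

If the reader on one platform removes the BOM and the reader on another does
not, the same `identical()` call would be expected to give different answers,
even though `identical()` itself behaves the same way on both.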

Thanks

---

Layik Hama
Research Fellow

Leeds Institute for Data Analytics
Room 11.70, Worsley Building,
University of Leeds