Ivan,

Thank you for digging into the string. I can confirm that the `hexdump` shows extra characters on bash, too.

The question would then be: Why would `identical(str, "Accident_Index", ignore.case = TRUE)` behave differently on Linux/MacOS vs Windows?

Thanks
---
Layik Hama
Research Fellow
Leeds Institute for Data Analytics
Room 11.70, Worsley Building, University of Leeds

________________________________
From: Ivan Krylov <krylov.r00t at gmail.com>
Sent: 17 January 2019 20:40:32
To: Layik Hama
Cc: r-help at r-project.org
Subject: Re: [R] Potential R bug in identical

On Thu, 17 Jan 2019 14:55:18 +0000
Layik Hama <L.Hama at leeds.ac.uk> wrote:

> There seems to be some weird and unidentifiable (to me) characters in
> front of the `Accidents_Index` column name there causing the length
> to be 17 rather than 14 characters.

Repeating the reproduction steps described at the linked pull request,

$ curl -o acc2017.zip http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/dftRoadSafetyData_Accidents_2017.zip
$ unzip acc2017.zip
$ head -n 1 Acc.csv | hd | head -n 2
00000000  ef bb bf 41 63 63 69 64  65 6e 74 5f 49 6e 64 65  |...Accident_Inde|
00000010  78 2c 4c 6f 63 61 74 69  6f 6e 5f 45 61 73 74 69  |x,Location_Easti|

The document begins with a U+FEFF BYTE ORDER MARK, encoded in UTF-8. Not sure which encoding R chooses on Windows by default, but explicitly passing encoding="UTF-8" (or is it fileEncoding?) might help decode it as such.

(Sorry, cannot test my advice on Windows right now.)

--
Best regards,
Ivan
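For anyone who would rather check for the byte order mark from within R than with `hd`, here is a minimal sketch. The helper name `has_utf8_bom` is hypothetical (it is not part of base R), and the demo writes its own temporary file instead of touching the real Acc.csv:

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 encoding
# of U+FEFF (bytes ef bb bf), as seen in the hexdump above.
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Build a small stand-in for Acc.csv: BOM bytes followed by a header line.
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("Accident_Index,Location_Easting\n")),
         tmp)

bom_found <- has_utf8_bom(tmp)
print(bom_found)
```

Reading the raw bytes sidesteps the locale entirely, so the check should give the same answer on Linux, macOS and Windows.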
On Thu, 17 Jan 2019 21:05:07 +0000
Layik Hama <L.Hama at leeds.ac.uk> wrote:

> Why would `identical(str, "Accident_Index", ignore.case = TRUE)`
> behave differently on Linux/MacOS vs Windows?

Because str is different from "Accident_Index" on Windows: it was decoded from bytes to characters according to different rules when the file was read.

The default encoding for files being read is specified by the 'encoding' option. On both Windows and Linux I get:

> options('encoding')
$encoding
[1] "native.enc"

For which ?file says (in section "Encoding"):

>> ‘""’ and ‘"native.enc"’ both mean the ‘native’ encoding, that is the
>> internal encoding of the current locale and hence no translation is
>> done.

The Linux version of R has a UTF-8 locale (AFAIK, macOS does too) and decodes the files as UTF-8 by default:

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

locale:
 [1] LC_CTYPE=ru_RU.utf8       LC_NUMERIC=C
 [3] LC_TIME=ru_RU.utf8        LC_COLLATE=ru_RU.utf8
 [5] LC_MONETARY=ru_RU.utf8    LC_MESSAGES=ru_RU.utf8
 [7] LC_PAPER=ru_RU.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=ru_RU.utf8 LC_IDENTIFICATION=C

While on Windows R uses a single-byte encoding dependent on the locale:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Russian_Russia.1251  LC_CTYPE=Russian_Russia.1251
[3] LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C
[5] LC_TIME=Russian_Russia.1251

> readLines('test.txt')[1]
[1] "???Accident_Index"
> nchar(readLines('test.txt')[1])
[1] 17

Each of the three bytes of the byte order mark is decoded as a separate character here, hence 3 + 14 = 17.

R on Windows can be explicitly told to decode the file as UTF-8:

> nchar(readLines(file('test.txt',encoding='UTF-8'))[1])
[1] 15

The first character of the string is the invisible byte order mark. Thankfully, there is an easy fix for that, too. ?file additionally says:

>> As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted for
>> reading and will remove a Byte Order Mark if present (which it
>> often is for files and webpages generated by Microsoft applications).

So this is how we get the 14-character column name we'd wanted:

> nchar(readLines(file('test.txt',encoding='UTF-8-BOM'))[1])
[1] 14

For our original task, this means:

> names(read.csv('Acc.csv'))[1] # might produce incorrect results
[1] "?.?Accident_Index"
> names(read.csv('Acc.csv', fileEncoding='UTF-8-BOM'))[1] # correct
[1] "Accident_Index"

--
Best regards,
Ivan
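The steps above can be replayed without downloading anything. The following is only a sketch under stated assumptions: the temporary file stands in for test.txt / Acc.csv, and its contents (a BOM, the header `Accident_Index` and one made-up data row) are invented for illustration. The byte count is used for the "before" case because it does not depend on the locale:

```r
# Stand-in for test.txt: UTF-8 BOM, a one-column header, one dummy row.
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("Accident_Index\n1\n")),
         tmp)

# With the default "native.enc" no translation is done, so the three
# BOM bytes stay attached to the header: 3 + 14 = 17 bytes.
raw_header <- readLines(tmp, n = 1L)
bytes_in_header <- nchar(raw_header, type = "bytes")

# "UTF-8-BOM" (accepted for reading since R 3.0.0, per ?file) removes
# the mark before the header is parsed.
fixed_names <- names(read.csv(tmp, fileEncoding = "UTF-8-BOM"))

print(bytes_in_header)
print(fixed_names)
```

Note that `read.csv` takes `fileEncoding` for re-encoding the file on input, whereas `file()` takes `encoding`, which is why the two calls in the message above spell the argument differently.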