Jack Kelley
2017-Apr-29 23:53 UTC
[Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
"R version 3.4.0 (2017-04-21)" on "x86_64-w64-mingw32" platform

I am using CSVs and other text tables, and text in general (including
regular expressions), on Windows 10.
For me, that means dealing with Windows-1252 and UTF-8 encoding, with UTF-16
and UTF-32 as helpful curiosities.

Something as simple as iconv ("\n", to = "UTF-16") causes an error, due to
an embedded nul.

Then there is write.csv (or write.table) with its fileEncoding parameter:
not working correctly for UTF-16 and UTF-32.

Of course, developers are aware of this, for example:

[Rd] iconv to UTF-16 encoding produces error due to embedded nulls
(write.table with fileEncoding param)
https://stat.ethz.ch/pipermail/r-devel/2016-February/072323.html

iconv to UTF-16 encoding produces error due to embedded nulls (write.table
with fileEncoding param)
http://r.789695.n4.nabble.com/iconv-to-UTF-16-encoding-produces-error-due-to-embedded-nulls-write-table-with-fileEncoding-param-td4717481.html

----------------------------------------------------------------------------

Focussing on write.csv and UTF-16LE and UTF-16BE, it seems that a nul
character is omitted in each <CarriageReturn><LineFeed> pair.

TEST SCRIPT
----------------------------------------------------------------------------
remove (list = objects())

print (sessionInfo())
cat ("---------------------------------\n\n")

LE <- data.frame (
  want = c ("0d,00", "0a,00"),
  got  = c ("0d   ", "0a,00")
)

BE <- data.frame (
  want = c ("00,0d", "00,0a"),
  got  = c ("00,0d", "   0a")
)

write.csv (LE, "R_LE.csv", fileEncoding = "UTF-16LE", row.names = FALSE)
write.csv (BE, "R_BE.csv", fileEncoding = "UTF-16BE", row.names = FALSE)

print (readBin ("R_LE.csv", "raw", 1000))
print (LE)
cat ("\n")

print (readBin ("R_BE.csv", "raw", 1000))
print (BE)
cat ("\n")

try (iconv ("\n", to = "UTF-8"))

try (iconv ("\n", to = "UTF-16LE"))
try (iconv ("\n", to = "UTF-16BE"))
try (iconv ("\n", to = "UTF-16"))

try (iconv ("\n", to = "UTF-32LE"))
try (iconv ("\n", to = "UTF-32BE"))
try (iconv ("\n", to = "UTF-32"))
----------------------------------------------------------------------------

TEST SCRIPT OUTPUT

> source ("bug_encoding.R")
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.0
---------------------------------

 [1] 22 00 77 00 61 00 6e 00 74 00 22 00 2c 00 22 00 67 00 6f 00 74 00 22 00 0d
[26] 0a 00 22 00 30 00 64 00 2c 00 30 00 30 00 22 00 2c 00 22 00 30 00 64 00 20
[51] 00 20 00 20 00 22 00 0d 0a 00 22 00 30 00 61 00 2c 00 30 00 30 00 22 00 2c
[76] 00 22 00 30 00 61 00 2c 00 30 00 30 00 22 00 0d 0a 00
   want   got
1 0d,00 0d   
2 0a,00 0a,00

 [1] 00 22 00 77 00 61 00 6e 00 74 00 22 00 2c 00 22 00 67 00 6f 00 74 00 22 00
[26] 0d 0a 00 22 00 30 00 30 00 2c 00 30 00 64 00 22 00 2c 00 22 00 30 00 30 00
[51] 2c 00 30 00 64 00 22 00 0d 0a 00 22 00 30 00 30 00 2c 00 30 00 61 00 22 00
[76] 2c 00 22 00 20 00 20 00 20 00 30 00 61 00 22 00 0d 0a
   want   got
1 00,0d 00,0d
2 00,0a    0a

Error in iconv("\n", to = "UTF-16LE") : embedded nul in string: '\n\0'
Error in iconv("\n", to = "UTF-16BE") : embedded nul in string: '\0\n'
Error in iconv("\n", to = "UTF-16") : embedded nul in string: '??\0\n'
Error in iconv("\n", to = "UTF-32LE") :
  embedded nul in string: '\n\0\0\0'
Error in iconv("\n", to = "UTF-32BE") :
  embedded nul in string: '\0\0\0\n'
Error in iconv("\n", to = "UTF-32") :
  embedded nul in string: '\0\0??\0\0\0\n'
>
----------------------------------------------------------------------------

Cheers -- Jack Kelley
Duncan Murdoch
2017-Apr-30 16:23 UTC
[Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
No, I don't think anyone is working on this.

There's a fairly simple workaround for the UTF-16 and UTF-32 iconv
issues: don't attempt to produce character vectors, produce raw vectors
instead. (The "toRaw" argument to iconv() asks for this.) Raw vectors
can contain embedded nulls. Character vectors can't, because
internally, R is using 8 bit C strings, and the nulls are string
terminators.

I don't know how difficult it would be to fix the write.table problems.

Duncan Murdoch

On 29/04/2017 7:53 PM, Jack Kelley wrote:
> [full copy of the original message and test output trimmed]
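A minimal sketch of the raw-vector workaround described in the reply above, assuming a platform iconv that recognises the encoding names "UTF-16LE" and "UTF-16BE". The variable names are illustrative only. It also shows the four bytes a correctly encoded CR LF pair should occupy in UTF-16LE, for comparison with the three bytes (0d 0a 00) seen in the write.csv dump.

```r
# Ask iconv() for raw bytes rather than a character vector.
# With toRaw = TRUE the result is a list holding one raw vector per input string.
nl_le <- iconv("\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
nl_be <- iconv("\n", from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]

print(nl_le)  # [1] 0a 00
print(nl_be)  # [1] 00 0a

# Expected encoding of a CR LF pair: each character takes two bytes.
crlf_le <- iconv("\r\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(crlf_le)  # [1] 0d 00 0a 00

# A list of raw vectors can be passed back to iconv() to recover the text.
back <- iconv(list(crlf_le), from = "UTF-16LE", to = "UTF-8")
stopifnot(back == "\r\n")
```

The explicit little-endian and big-endian names are used here because plain "UTF-16" may or may not prepend a byte order mark, depending on the iconv implementation.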
Duncan Murdoch
2017-May-01 18:21 UTC
[Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?
On 30/04/2017 12:23 PM, Duncan Murdoch wrote:
> I don't know how difficult it would be to fix the write.table problems.

I've now taken a look, and it appears as if it's not too hard. I'll see
if I can work out a patch that I trust.

Duncan Murdoch
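Until write.table itself is fixed, the raw-vector idea from earlier in the thread can be stretched to cover CSV output. What follows is only a sketch under stated assumptions, not the patch mentioned above: write_csv_utf16le is a hypothetical helper, not part of base R. It renders the CSV in memory, joins the lines with explicit CR LF, converts the text to UTF-16LE bytes, and writes them through a binary connection so that no further translation takes place.

```r
# Hypothetical helper (not in base R): write a data frame as a UTF-16LE CSV.
# Extra arguments in ... are passed on to write.csv(), e.g. row.names = FALSE.
write_csv_utf16le <- function(df, path, ...) {
  # Render the CSV as a character vector, one element per output line.
  lines <- utils::capture.output(utils::write.csv(df, ...))
  # Join with explicit CR LF so the line endings do not depend on the platform.
  text <- enc2utf8(paste0(paste(lines, collapse = "\r\n"), "\r\n"))
  # Convert to raw bytes; raw vectors may contain embedded nuls.
  bytes <- iconv(text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
  # Write in binary mode so the bytes reach the file unchanged.
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeBin(bytes, con)
  invisible(bytes)
}

# Usage: in the dump, every 0d and 0a should be followed by 00.
LE <- data.frame(want = c("0d,00", "0a,00"))
path <- tempfile(fileext = ".csv")
bytes <- write_csv_utf16le(LE, path, row.names = FALSE)
print(readBin(path, "raw", 1000))
```

Two caveats on this sketch: capture.output() goes through the console encoding, so characters outside the native encoding may not survive on a non-UTF-8 locale, and a field containing an embedded newline will have it rewritten as CR LF. No byte order mark is written.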