nospam at altfeld-im.de
2016-Feb-16 17:25 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
If I execute the code from the "?write.table" examples section

    x <- data.frame(a = I("a \" quote"), b = pi)
    # (omitted code)
    write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")

the resulting CSV file has a size of 6 bytes, which is too short (truncated):

    """,3

The problem seems to be the iconv function:

    iconv("foo", to="UTF-16")

produces

    Error in iconv("foo", to = "UTF-16"):
      embedded nul in string: '\xff\xfef\0o\0o\0'

In 2010 a (partial) patch for this problem was submitted:

http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html

Are there chances to fix this problem, since it prevents writing Windows UTF-16LE text files?

PS: This problem can be reproduced on Windows and Linux.

---------------

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] tools_3.2.3
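For readers who want to check the truncation themselves, here is a minimal sketch that repeats the write.csv() call from the report and then reads the file back as raw bytes. It writes to a temporary file rather than "foo.csv" (an assumption made only to avoid cluttering the working directory). No particular output is claimed here, since the result may depend on the R version.

```r
# Repeat the reported call, then look at the bytes that actually reach the disk.
x <- data.frame(a = I("a \" quote"), b = pi)

f <- tempfile(fileext = ".csv")   # temporary file instead of "foo.csv"
write.csv(x, file = f, fileEncoding = "UTF-16LE")

n <- file.size(f)                         # the report above saw 6 bytes
bytes <- readBin(f, what = "raw", n = n)  # read the whole file as raw bytes

print(n)
print(bytes)
```

In a correctly written UTF-16LE file, each ASCII character would show up as its byte followed by 00, so a file much shorter than twice the number of expected characters points to truncation.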
nospam at altfeld-im.de
2016-Feb-22 17:45 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
Dear R developers,

I think I have found a bug that can be reproduced with two lines of code, and I would be very thankful for your first assessment or feedback on my report.

If this is the wrong mailing list or I did something wrong (e.g. using a semi-"anonymous" email address to protect my privacy and guard against unwanted spam), please let me know, since I am new here.

Thank you very much :-)

J. Altfeld
Martin Maechler
2016-Feb-23 09:37 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>     on Mon, 22 Feb 2016 18:45:59 +0100 writes:

    > Dear R developers
    > I think I have found a bug that can be reproduced with two lines of code
    > and I am very thankful to get your first assessment or feed-back on my
    > report.

    > If this is the wrong mailing list or I did something wrong
    > (e. g. semi "anonymous" email address to protect my privacy and defend
    > unwanted spam) please let me know since I am new here.

    > Thank you very much :-)

    > J. Altfeld

Dear J.,
(yes, a bit less anonymity would be very welcomed here!),

You are right, this is a bug, at least in the documentation, but probably "all real", indeed, but read on.

    > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
    >>
    >> If I execute the code from the "?write.table" examples section
    >>
    >> x <- data.frame(a = I("a \" quote"), b = pi)
    >> # (omitted code)
    >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
    >>
    >> the resulting CSV file has a size of 6 bytes which is too short
    >> (truncated):
    >>
    >> """,3

reproducibly, yes.
If you look at what write.csv does and then simplify, you can get a similar wrong result by

    write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")

which results in a file with one line

    """ 3

and if you debug write.table() you see that its building blocks here are

    file <- file(........, encoding = fileEncoding)

a

    writeLines(*, file=file)

for the column headers, and then "deeper down" C code which I did not investigate.
But just looking a bit at such a file() object with writeLines() seems slightly revealing, as e.g. 'eol' does not seem to "work" for this encoding:

    > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
    > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
    > close(ff)
    > file.show(fn)
    CBA|>
    > file.size(fn)
    [1] 5
    >

    >> The problem seems to be the iconv function:
    >>
    >> iconv("foo", to="UTF-16")
    >>
    >> produces
    >>
    >> Error in iconv("foo", to = "UTF-16"):
    >> embedded nul in string: '\xff\xfef\0o\0o\0'

but this works

    > iconv("foo", to="UTF-16", toRaw=TRUE)
    [[1]]
    [1] ff fe 66 00 6f 00 6f 00

(indeed showing the embedded '\0's)

    >> In 2010 a (partial) patch for this problem was submitted:
    >> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html

the patch only related to the iconv() problem of not allowing a 'raw' (instead of character) argument x.
... and it is > 5.5 years old, for an iconv() version that was less featureful than today.
Rather, current iconv(x) allows x to be a list of raw entries.

    >> Are there chances to fix this problem since it prevents writing Windows
    >> UTF-16LE text files?
    >>
    >> PS: This problem can be reproduced on Windows and Linux.

indeed.... also on "R devel of today".

I agree it should be fixed... but as I said not by the patch you mentioned.

Tested patches to fix this are welcome, indeed.
Martin Maechler
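Until the connection-level problem is fixed, one possible workaround follows from the observation above that iconv(..., toRaw=TRUE) does work: convert the text to raw UTF-16LE bytes and write them through a binary connection. The sketch below is only an illustration, not a proposed patch. The helper name write_utf16le is hypothetical (it is not part of base R), and the leading byte order mark and the "\r\n" line ending are assumptions about what Windows programs typically expect.

```r
# Hypothetical helper (not part of base R): write character lines as UTF-16LE
# by converting to raw bytes first and bypassing the encoding= connection.
write_utf16le <- function(lines, path, eol = "\r\n") {
  # Join the lines into one string, terminating every line with 'eol'.
  txt <- enc2utf8(paste0(paste(lines, collapse = eol), eol))

  # toRaw = TRUE returns a list of raw vectors, so embedded nuls are fine.
  bytes <- iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
  if (is.null(bytes)) stop("conversion to UTF-16LE failed")

  con <- file(path, open = "wb")         # binary mode: no re-encoding, no eol translation
  on.exit(close(con))
  writeBin(as.raw(c(0xff, 0xfe)), con)   # assumed: little-endian byte order mark
  writeBin(bytes, con)
  invisible(length(bytes) + 2L)          # number of bytes written
}

# Usage: capture the CSV text that write.csv() would print, then write it out.
x <- data.frame(a = I("a \" quote"), b = pi)
csv_lines <- capture.output(write.csv(x))
out <- tempfile(fileext = ".csv")
write_utf16le(csv_lines, out)
```

This holds the whole file in memory as a single string, which is fine for small tables but would need chunking for large ones.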