Martin Maechler
2016-Feb-23 09:37 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>     on Mon, 22 Feb 2016 18:45:59 +0100 writes:

> Dear R developers
> I think I have found a bug that can be reproduced with two lines of code
> and I am very thankful to get your first assessment or feed-back on my
> report.

> If this is the wrong mailing list or I did something wrong
> (e. g. semi "anonymous" email address to protect my privacy and defend
> unwanted spam) please let me know since I am new here.

> Thank you very much :-)

> J. Altfeld

Dear J.,
(yes, a bit less anonymity would be very welcomed here!),

You are right, this is a bug, at least in the documentation, but
probably "all real", indeed,

but read on.

> On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>
>> If I execute the code from the "?write.table" examples section
>>
>> x <- data.frame(a = I("a \" quote"), b = pi)
>> # (omitted code)
>> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>
>> the resulting CSV file has a size of 6 bytes which is too short
>> (truncated):
>>
>> """,3

reproducibly, yes.
If you look at what write.csv does
and then simplify, you can get a similar wrong result by

  write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")

which results in a file with one line

  """ 3

and if you debug write.table() you see that its building blocks
here are

  file <- file(........, encoding = fileEncoding)

a writeLines(*, file=file) for the column headers,

and then "deeper down" C code which I did not investigate.

But just looking a bit at such a file() object with writeLines()
seems slightly revealing, as e.g., 'eol' does not seem to
"work" for this encoding:

  > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
  > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
  > close(ff)
  > file.show(fn)
  CBA|>
  > file.size(fn)
  [1] 5
  >

>> The problem seems to be the iconv function:
>>
>> iconv("foo", to="UTF-16")
>>
>> produces
>>
>> Error in iconv("foo", to = "UTF-16"):
>> embedded nul in string: '\xff\xfef\0o\0o\0'

but this works

  > iconv("foo", to="UTF-16", toRaw=TRUE)
  [[1]]
  [1] ff fe 66 00 6f 00 6f 00

(indeed showing the embedded '\0's)

>> In 2010 a (partial) patch for this problem was submitted:
>> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html

the patch only related to the iconv() problem not allowing 'raw'
(instead of character) argument x.

... and it is > 5.5 years old, for an iconv() version that was less
featureful than today.
Rather, current iconv(x) allows x to be a list of raw entries.

>> Are there chances to fix this problem since it prevents writing Windows
>> UTF-16LE text files?
>>
>> PS: This problem can be reproduced on Windows and Linux.

indeed.... also on "R devel of today".

I agree it should be fixed... but as I said not by the patch you
mentioned.

Tested patches to fix this are welcome, indeed.

Martin Maechler

>> ---------------
>>
>> > sessionInfo()
>> R version 3.2.3 (2015-12-10)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Ubuntu 14.04.3 LTS
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods
>> base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_3.2.3

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
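The bytes that iconv(..., toRaw=TRUE) prints above can be rebuilt by hand, which may make the "embedded nul" message easier to read. The following is only a sketch under stated assumptions: ascii_to_utf16le() and c_string_length() are hypothetical helpers written for this illustration (they are not part of R or of iconv), and the widening only handles ASCII input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of R: widen an ASCII string to UTF-16LE,
 * optionally preceded by the byte-order mark FF FE.
 * Returns the number of bytes stored in out. */
size_t ascii_to_utf16le(const char *src, int with_bom,
                        unsigned char *out, size_t out_cap)
{
    size_t n = strlen(src);
    size_t k = 0;

    assert(out_cap >= 2 * n + (with_bom ? 2u : 0u));
    if (with_bom) {
        out[k++] = 0xFF;
        out[k++] = 0xFE;
    }
    for (size_t i = 0; i < n; i++) {
        assert((unsigned char)src[i] < 0x80); /* sketch is ASCII only */
        out[k++] = (unsigned char)src[i];     /* low byte first */
        out[k++] = 0x00;                      /* high byte is zero for ASCII */
    }
    return k;
}

/* Hypothetical helper: how many bytes a NUL-terminated (C string) view
 * of the buffer would see, i.e. the offset of the first zero byte. */
size_t c_string_length(const unsigned char *buf, size_t len)
{
    const void *nul = memchr(buf, 0, len);
    return nul ? (size_t)((const unsigned char *)nul - buf) : len;
}
```

For "foo" with a byte-order mark this gives the eight bytes ff fe 66 00 6f 00 6f 00 shown above, while a NUL-terminated view stops after the first three of them. That is the part of the string printed before the first \0 in the error message.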
Mikko Korpela
2016-Feb-23 12:06 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
On 23.02.2016 11:37, Martin Maechler wrote:
> and if you debug write.table() you see that its building blocks
> here are
>    file <- file(........, encoding = fileEncoding)
>
> a writeLines(*, file=file) for the column headers,
>
> and then "deeper down" C code which I did not investigate.

I took a look at connections.c. There is a call to strlen() that gets
confused by null characters. I think the obvious fix is to avoid the
call to strlen() as the size is already known:

Index: src/main/connections.c
===================================================================
--- src/main/connections.c (revision 70213)
+++ src/main/connections.c (working copy)
@@ -369,7 +369,7 @@
        /* is this safe? */
        warning(_("invalid char string in output conversion"));
      *ob = '\0';
-     con->write(outbuf, 1, strlen(outbuf), con);
+     con->write(outbuf, 1, ob - outbuf, con);
    } while(again && inb > 0); /* it seems some iconv signal -1 on
                                  zero-length input */
     } else

> But just looking a bit at such a file() object with writeLines()
> seems slightly revealing, as e.g., 'eol' does not seem to
> "work" for this encoding:
>
>   > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>   > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>   > close(ff)
>   > file.show(fn)
>   CBA|>
>   > file.size(fn)
>   [1] 5

With the patch applied:

  > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
  [1] "C" "B" "A" "|" ">a"
  > file.size(fn)
  [1] 22

- Mikko Korpela
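The two file sizes reported in this message (5 bytes before the patch, 22 bytes after) can be reproduced outside R by comparing the two length computations from the diff. This is a minimal sketch, not the code in connections.c: convert_line() is a hypothetical stand-in for the iconv step that only handles ASCII, and it assumes each line plus its "\n" is converted and written separately.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { OUTBUF_SIZE = 64 };

/* Hypothetical stand-in for the iconv step: widen the ASCII `line` plus a
 * trailing "\n" to UTF-16LE in outbuf. Returns a pointer one past the last
 * converted byte, playing the role of `ob` in the patch. */
char *convert_line(const char *line, char *outbuf)
{
    char *ob = outbuf;
    size_t n = strlen(line);

    assert(2 * (n + 1) + 1 <= OUTBUF_SIZE);
    for (size_t i = 0; i < n; i++) {
        *ob++ = line[i];   /* low byte */
        *ob++ = '\0';      /* high byte, zero for ASCII */
    }
    *ob++ = '\n';
    *ob++ = '\0';
    *ob = '\0';            /* terminator, as in the original code */
    return ob;
}

/* Total number of bytes that would be handed to con->write() for `lines`.
 * use_strlen != 0 selects the pre-patch length, strlen(outbuf);
 * use_strlen == 0 selects the patched length, ob - outbuf. */
size_t bytes_written(const char *const *lines, size_t nlines, int use_strlen)
{
    char outbuf[OUTBUF_SIZE];
    size_t total = 0;

    for (size_t i = 0; i < nlines; i++) {
        char *ob = convert_line(lines[i], outbuf);
        total += use_strlen ? strlen(outbuf) : (size_t)(ob - outbuf);
    }
    return total;
}
```

With the five lines "C", "B", "A", "|" and ">a", the strlen() variant counts one byte per line because the second byte of every UTF-16LE code unit is zero, giving 5 bytes ("CBA|>"). The pointer difference counts 11 characters at 2 bytes each, giving 22.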
nospam at altfeld-im.de
2016-Feb-23 21:53 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
Excellent analysis, thank you both for the quick reply!

Is there anything I can do to get the bug fixed in the next version of R
(e. g. filing a bug report at https://bugs.r-project.org/bugzilla3/)?
Duncan Murdoch
2016-Feb-24 13:47 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
On 23/02/2016 7:06 AM, Mikko Korpela wrote:
> With the patch applied:
>
>   > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>   [1] "C" "B" "A" "|" ">a"
>   > file.size(fn)
>   [1] 22

That may be okay on Unix, but it's not enough on Windows. There the \n
that writeLines adds at the end of each line isn't translated to
UTF-16LE properly, so things get messed up. (I think the \n is
translated, but the \r that Windows wants is not, so you get a mix of 8
bit and 16 bit characters.)

Duncan Murdoch
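One way to picture the "mix of 8 bit and 16 bit characters" described here is to apply a byte-level newline translation to data that is already UTF-16LE. The sketch below is an assumption about the mechanism, not something the thread confirms: expand_newlines_bytewise() is a hypothetical function that imitates a text-mode layer inserting a single 0x0D byte before every 0x0A byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: copy `in` to `out`, inserting one 0x0D byte
 * before every 0x0A byte, the way a byte-oriented text-mode translation
 * would, without any knowledge of the 2-byte UTF-16LE code units.
 * Returns the number of bytes stored in out. */
size_t expand_newlines_bytewise(const unsigned char *in, size_t in_len,
                                unsigned char *out, size_t out_cap)
{
    size_t k = 0;

    for (size_t i = 0; i < in_len; i++) {
        if (in[i] == 0x0A) {
            assert(k < out_cap);
            out[k++] = 0x0D;   /* 8-bit "\r", not the 16-bit 0D 00 */
        }
        assert(k < out_cap);
        out[k++] = in[i];
    }
    return k;
}
```

Applied to the UTF-16LE bytes of "C\n" (43 00 0a 00), this yields 43 00 0d 0a 00: an odd number of bytes in which the "\r" occupies one byte and the "\n" two, so every later code unit is shifted by one byte. A correct UTF-16LE line ending would be 0d 00 0a 00.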
Mikko Korpela
2016-Feb-25 09:31 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
On 23.02.2016 14:06, Mikko Korpela wrote:
> With the patch applied:
>
>   > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>   [1] "C" "B" "A" "|" ">a"
>   > file.size(fn)
>   [1] 22

I just realized that I was misusing the encoding argument of
readLines(). The code above works by accident, but the following would
be more appropriate:

  > ff <- file(fn, open="r", encoding="UTF-16LE")
  > readLines(ff)
  [1] "C" "B" "A" "|" ">a"
  > close(ff)

Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
the patch is incomplete on Windows.)

- Mikko