Duncan Murdoch
2016-Feb-24 13:47 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
On 23/02/2016 7:06 AM, Mikko Korpela wrote:
> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>> on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>
>> > Dear R developers
>> > I think I have found a bug that can be reproduced with two lines of code
>> > and I am very thankful to get your first assessment or feed-back on my
>> > report.
>>
>> > If this is the wrong mailing list or I did something wrong
>> > (e. g. semi "anonymous" email address to protect my privacy and defend
>> > unwanted spam) please let me know since I am new here.
>>
>> > Thank you very much :-)
>>
>> > J. Altfeld
>>
>> Dear J.,
>> (yes, a bit less anonymity would be very welcomed here!),
>>
>> You are right, this is a bug, at least in the documentation, but
>> probably "all real", indeed,
>>
>> but read on.
>>
>> > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>> >>
>> >>
>> >> If I execute the code from the "?write.table" examples section
>> >>
>> >> x <- data.frame(a = I("a \" quote"), b = pi)
>> >> # (ommited code)
>> >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>> >>
>> >> the resulting CSV file has a size of 6 bytes which is too short
>> >> (truncated):
>> >>
>> >> """,3
>>
>> reproducibly, yes.
>> If you look at what write.csv does
>> and then simplify, you can get a similar wrong result by
>>
>> write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>
>> which results in a file with one line
>>
>> """ 3
>>
>> and if you debug write.table() you see that its building blocks
>> here are
>> file <- file(........, encoding = fileEncoding)
>>
>> a writeLines(*, file=file) for the column headers,
>>
>> and then "deeper down" C code which I did not investigate.
>
> I took a look at connections.c. There is a call to strlen() that gets
> confused by null characters. I think the obvious fix is to avoid the
> call to strlen() as the size is already known:
>
> Index: src/main/connections.c
> ==================================================================
> --- src/main/connections.c (revision 70213)
> +++ src/main/connections.c (working copy)
> @@ -369,7 +369,7 @@
> /* is this safe? */
> warning(_("invalid char string in output conversion"));
> *ob = '\0';
> - con->write(outbuf, 1, strlen(outbuf), con);
> + con->write(outbuf, 1, ob - outbuf, con);
> } while(again && inb > 0); /* it seems some iconv signal -1 on
> zero-length input */
> } else
>
>
>>
>> But just looking a bit at such a file() object with writeLines()
>> seems slightly revealing, as e.g., 'eol' does not seem to
>> "work" for this encoding:
>>
>> > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>> > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>> > close(ff)
>> > file.show(fn)
>> CBA|>
>> > file.size(fn)
>> [1] 5
>> >
>
> With the patch applied:
>
> > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
> [1] "C" "B" "A" "|" ">a"
> > file.size(fn)
> [1] 22

That may be okay on Unix, but it's not enough on Windows. There the \n
that writeLines adds at the end of each line isn't translated to
UTF-16LE properly, so things get messed up. (I think the \n is
translated, but the \r that Windows wants is not, so you get a mix of 8
bit and 16 bit characters.)

Duncan Murdoch

> - Mikko Korpela
>
>> >> The problem seems to be the iconv function:
>> >>
>> >> iconv("foo", to="UTF-16")
>> >>
>> >> produces
>> >>
>> >> Error in iconv("foo", to = "UTF-16"):
>> >> embedded nul in string: '\xff\xfef\0o\0o\0'
>>
>> but this works
>>
>> > iconv("foo", to="UTF-16", toRaw=TRUE)
>> [[1]]
>> [1] ff fe 66 00 6f 00 6f 00
>>
>> (indeed showing the embedded '\0's)
>>
>> >> In 2010 a (partial) patch for this problem was submitted:
>> >> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html
>>
>> the patch only related to the iconv() problem not allowing 'raw'
>> (instead of character) argument x.
>>
>> ... and it is > 5.5 years old, for an iconv() version that was less
>> featureful than today.
>> Rather, current iconv(x) allows x to be a list of raw entries.
>>
>> >> Are there chances to fix this problem since it prevents writing Windows
>> >> UTF-16LE text files?
>> >>
>> >> PS: This problem can be reproduced on Windows and Linux.
>>
>> indeed.... also on "R devel of today".
>>
>> I agree it should be fixed... but as I said not by the patch you
>> mentioned.
>>
>> Tested patches to fix this are welcome, indeed.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
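The strlen() problem addressed by the patch can be illustrated outside R. What follows is a minimal C sketch, not the code in connections.c: ascii_to_utf16le(), length_by_strlen() and length_by_pointer() are hypothetical helpers introduced only for illustration, and the encoder assumes plain ASCII input. It shows why a strlen()-based byte count stops at the first 0x00 of a UTF-16LE buffer, while the distance the write pointer has moved (ob - outbuf in the patch) gives the full length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of R): encode an ASCII string as UTF-16LE,
 * i.e. each ASCII byte followed by a 0x00 byte. Returns a pointer one past
 * the last encoded byte, playing the role of `ob` in the patch. */
static char *ascii_to_utf16le(const char *in, char *outbuf, size_t outsize)
{
    char *ob = outbuf;
    assert(in != NULL && outbuf != NULL && outsize > 0);
    /* need two bytes for the character plus one for the terminator */
    while (*in != '\0' && (size_t)(ob - outbuf) + 3 <= outsize) {
        *ob++ = *in++;   /* low byte: the ASCII code */
        *ob++ = '\0';    /* high byte: zero for ASCII */
    }
    *ob = '\0';          /* terminator, as in `*ob = '\0';` in the diff */
    return ob;
}

/* Byte count as computed before the patch: stops at the first 0x00 byte. */
static size_t length_by_strlen(const char *outbuf)
{
    return strlen(outbuf);
}

/* Byte count as computed by the patch: how far the write pointer moved. */
static size_t length_by_pointer(const char *outbuf, const char *ob)
{
    return (size_t)(ob - outbuf);
}
```

For the input "foo" the buffer holds 66 00 6f 00 6f 00, so length_by_strlen() reports 1 and length_by_pointer() reports 6. That is consistent with the 5-byte file shown earlier in the thread, where each line appears to contribute only its first byte.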
Mikko Korpela
2016-Feb-24 14:55 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
On 24.02.2016 15:47, Duncan Murdoch wrote:
> On 23/02/2016 7:06 AM, Mikko Korpela wrote:
>> With the patch applied:
>>
>> > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>> [1] "C" "B" "A" "|" ">a"
>> > file.size(fn)
>> [1] 22
>
> That may be okay on Unix, but it's not enough on Windows. There the \n
> that writeLines adds at the end of each line isn't translated to
> UTF-16LE properly, so things get messed up. (I think the \n is
> translated, but the \r that Windows wants is not, so you get a mix of 8
> bit and 16 bit characters.)

That's unfortunate. I tested my tiny patch on Linux. I don't know what
kind of additional changes would be needed to make this work on Windows.

--
Mikko Korpela
Aalto University School of Science
Department of Computer Science
Duncan Murdoch
2016-Feb-24 16:16 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
On 24/02/2016 9:55 AM, Mikko Korpela wrote:
> On 24.02.2016 15:47, Duncan Murdoch wrote:
>> That may be okay on Unix, but it's not enough on Windows. There the \n
>> that writeLines adds at the end of each line isn't translated to
>> UTF-16LE properly, so things get messed up. (I think the \n is
>> translated, but the \r that Windows wants is not, so you get a mix of 8
>> bit and 16 bit characters.)
>
> That's unfortunate. I tested my tiny patch on Linux. I don't know what
> kind of additional changes would be needed to make this work on Windows.
>

It looks like a big change is needed for a perfect solution:

- Windows does the translation of \n to \r\n. In the R code, Windows is
  never told that the output is UTF-16LE, so it does an 8 bit translation.

- Telling Windows that output is UTF-16LE looks hard: we'd need to
  convert the string to wide chars in R, then write it in wide chars.
  This seems like a lot of work for a rare case.

- It might be easier to do a hack: if the user asks for "UTF-16LE", then
  treat it internally as a text file but tell Windows it's a binary file.
  This means no \n to \r\n translation will be done by Windows. If the
  desired output file needs Windows line endings, the user would have to
  specify sep="\r\n" in writeLines.

Duncan Murdoch
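The line-ending problem described above can be sketched in C without a Windows machine. This is a simulation under stated assumptions, not Windows or R source code: translate_8bit_text_mode() is a hypothetical stand-in for what an 8-bit text-mode write does (insert a single 0x0D byte before every 0x0A byte), and encode_utf16le_crlf() is a hypothetical helper showing what the "binary file" hack would leave to the caller, namely emitting CR and LF as full 16-bit code units. Both handle ASCII text only.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated 8-bit text-mode translation: a lone 0x0D byte is inserted
 * before every 0x0A byte, with no knowledge of the encoding.
 * Returns the number of bytes written to `out`. */
static size_t translate_8bit_text_mode(const unsigned char *in, size_t n,
                                       unsigned char *out, size_t outsize)
{
    size_t j = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0x0A) {
            assert(j < outsize);
            out[j++] = 0x0D;          /* 8-bit CR in a 16-bit stream */
        }
        assert(j < outsize);
        out[j++] = in[i];
    }
    return j;
}

/* Encode ASCII text as UTF-16LE for a file opened in binary mode,
 * expanding each '\n' to the two code units CR LF (0D 00 0A 00).
 * Returns the number of bytes written to `out`. */
static size_t encode_utf16le_crlf(const char *in,
                                  unsigned char *out, size_t outsize)
{
    size_t j = 0;
    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        if (*in == '\n') {
            assert(j + 2 <= outsize);
            out[j++] = 0x0D;          /* CR, low byte */
            out[j++] = 0x00;          /* CR, high byte */
        }
        assert(j + 2 <= outsize);
        out[j++] = (unsigned char)*in; /* low byte: the ASCII code */
        out[j++] = 0x00;               /* high byte: zero for ASCII */
    }
    return j;
}
```

Feeding the UTF-16LE bytes of "C\n" (43 00 0A 00) through translate_8bit_text_mode() yields 43 00 0D 0A 00: five bytes, an odd count, which is the mix of 8 bit and 16 bit characters. encode_utf16le_crlf("C\n", ...) yields 43 00 0D 00 0A 00, which is valid UTF-16LE with a Windows line ending.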