Duncan Murdoch
2016-Feb-24 16:16 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
On 24/02/2016 9:55 AM, Mikko Korpela wrote:
> On 24.02.2016 15:47, Duncan Murdoch wrote:
>> On 23/02/2016 7:06 AM, Mikko Korpela wrote:
>>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>>>     on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>>
>>>>     > Dear R developers
>>>>     > I think I have found a bug that can be reproduced with two lines of code
>>>>     > and I am very thankful to get your first assessment or feed-back on my
>>>>     > report.
>>>>
>>>>     > If this is the wrong mailing list or I did something wrong
>>>>     > (e. g. semi "anonymous" email address to protect my privacy and defend
>>>>     > unwanted spam) please let me know since I am new here.
>>>>
>>>>     > Thank you very much :-)
>>>>
>>>>     > J. Altfeld
>>>>
>>>> Dear J.,
>>>> (yes, a bit less anonymity would be very welcomed here!),
>>>>
>>>> You are right, this is a bug, at least in the documentation, but
>>>> probably "all real", indeed,
>>>>
>>>> but read on.
>>>>
>>>>     > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>>>     >>
>>>>     >> If I execute the code from the "?write.table" examples section
>>>>     >>
>>>>     >> x <- data.frame(a = I("a \" quote"), b = pi)
>>>>     >> # (omitted code)
>>>>     >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>>>     >>
>>>>     >> the resulting CSV file has a size of 6 bytes which is too short
>>>>     >> (truncated):
>>>>     >>
>>>>     >> """,3
>>>>
>>>> reproducibly, yes.
>>>> If you look at what write.csv does
>>>> and then simplify, you can get a similar wrong result by
>>>>
>>>>   write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>>
>>>> which results in a file with one line
>>>>
>>>> """ 3
>>>>
>>>> and if you debug write.table() you see that its building blocks
>>>> here are
>>>>   file <- file(........, encoding = fileEncoding)
>>>>
>>>> a writeLines(*, file=file) for the column headers,
>>>>
>>>> and then "deeper down" C code which I did not investigate.
>>>
>>> I took a look at connections.c. There is a call to strlen() that gets
>>> confused by null characters. I think the obvious fix is to avoid the
>>> call to strlen() as the size is already known:
>>>
>>> Index: src/main/connections.c
>>> ==================================================================
>>> --- src/main/connections.c (revision 70213)
>>> +++ src/main/connections.c (working copy)
>>> @@ -369,7 +369,7 @@
>>>  /* is this safe? */
>>>  warning(_("invalid char string in output conversion"));
>>>  *ob = '\0';
>>> - con->write(outbuf, 1, strlen(outbuf), con);
>>> + con->write(outbuf, 1, ob - outbuf, con);
>>>  } while(again && inb > 0); /* it seems some iconv signal -1 on
>>>  zero-length input */
>>>  } else
>>>
>>>
>>>>
>>>> But just looking a bit at such a file() object with writeLines()
>>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>>> "work" for this encoding:
>>>>
>>>>  > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>>>  > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>>>  > close(ff)
>>>>  > file.show(fn)
>>>> CBA|>
>>>>  > file.size(fn)
>>>> [1] 5
>>>>  >
>>>
>>> With the patch applied:
>>>
>>>  > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>> [1] "C" "B" "A" "|" ">a"
>>>  > file.size(fn)
>>> [1] 22
>>
>> That may be okay on Unix, but it's not enough on Windows. There the \n
>> that writeLines adds at the end of each line isn't translated to
>> UTF-16LE properly, so things get messed up. (I think the \n is
>> translated, but the \r that Windows wants is not, so you get a mix of 8
>> bit and 16 bit characters.)
>
> That's unfortunate. I tested my tiny patch on Linux. I don't know what
> kind of additional changes would be needed to make this work on Windows.
>

It looks like a big change is needed for a perfect solution:

- Windows does the translation of \n to \r\n. In the R code, Windows
is never told that the output is UTF-16LE, so it does an 8 bit translation.

- Telling Windows that output is UTF-16LE looks hard: we'd need to
convert the string to wide chars in R, then write it in wide chars.
This seems like a lot of work for a rare case.

- It might be easier to do a hack: if the user asks for "UTF-16LE",
then treat it internally as a text file but tell Windows it's a binary
file. This means no \n to \r\n translation will be done by Windows. If
the desired output file needs Windows line endings, the user would have
to specify sep="\r\n" in writeLines.

Duncan Murdoch
Duncan Murdoch
2016-Feb-24 20:49 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
On 24/02/2016 11:16 AM, Duncan Murdoch wrote:
> On 24/02/2016 9:55 AM, Mikko Korpela wrote:
>>>>[...]
>>
>> That's unfortunate. I tested my tiny patch on Linux. I don't know what
>> kind of additional changes would be needed to make this work on Windows.
>>
>
> It looks like a big change is needed for a perfect solution:
>
> - Windows does the translation of \n to \r\n. In the R code, Windows
> is never told that the output is UTF-16LE, so it does an 8 bit translation.
>
> - Telling Windows that output is UTF-16LE looks hard: we'd need to
> convert the string to wide chars in R, then write it in wide chars.
> This seems like a lot of work for a rare case.
>
> - It might be easier to do a hack: if the user asks for "UTF-16LE",
> then treat it internally as a text file but tell Windows it's a binary
> file. This means no \n to \r\n translation will be done by Windows. If
> the desired output file needs Windows line endings, the user would have
> to specify sep="\r\n" in writeLines.

A third possibility is to handle the insertion of the \r completely within R. This will have the advantage of making it optional, so it would be a lot easier to write a Unix-style file on Windows.

I think either the first or third possibilities will take too much time for me to attempt them before 3.3.0. I'm not sure about the second one yet.

Duncan Murdoch
peter dalgaard
2016-Feb-25 09:49 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
Aim for 3.3.1 then? It's not like we have hordes of people demanding to have this fixed right here and now, or do we?

(A practical problem is that the version control dynamics dictate that at this stage, commits to r-devel _will_ end up in 3.3.0 on April 14, unless backed out and then inserted in the new r-devel branch to be created on March 17.)

- Peter

On 24 Feb 2016, at 21:49 , Duncan Murdoch <murdoch.duncan at gmail.com> wrote:

> On 24/02/2016 11:16 AM, Duncan Murdoch wrote:
>> On 24/02/2016 9:55 AM, Mikko Korpela wrote:
>>>>[...]
>>>
>>> That's unfortunate. I tested my tiny patch on Linux. I don't know what
>>> kind of additional changes would be needed to make this work on Windows.
>>>
>>
>> It looks like a big change is needed for a perfect solution:
>>
>> - Windows does the translation of \n to \r\n. In the R code, Windows
>> is never told that the output is UTF-16LE, so it does an 8 bit translation.
>>
>> - Telling Windows that output is UTF-16LE looks hard: we'd need to
>> convert the string to wide chars in R, then write it in wide chars.
>> This seems like a lot of work for a rare case.
>>
>> - It might be easier to do a hack: if the user asks for "UTF-16LE",
>> then treat it internally as a text file but tell Windows it's a binary
>> file. This means no \n to \r\n translation will be done by Windows. If
>> the desired output file needs Windows line endings, the user would have
>> to specify sep="\r\n" in writeLines.
>
> A third possibility is to handle the insertion of the \r completely within R. This will have the advantage of making it optional, so it would be a lot easier to write a Unix-style file on Windows.
>
> I think either the first or third possibilities will take too much time for me to attempt them before 3.3.0. I'm not sure about the second one yet.
>
> Duncan Murdoch
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com