Mikko Korpela
2016-Feb-25 09:31 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
On 23.02.2016 14:06, Mikko Korpela wrote:
> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>     on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>
>>     > Dear R developers
>>     > I think I have found a bug that can be reproduced with two lines of code
>>     > and I am very thankful to get your first assessment or feed-back on my
>>     > report.
>>
>>     > If this is the wrong mailing list or I did something wrong
>>     > (e. g. semi "anonymous" email address to protect my privacy and defend
>>     > unwanted spam) please let me know since I am new here.
>>
>>     > Thank you very much :-)
>>
>>     > J. Altfeld
>>
>> Dear J.,
>> (yes, a bit less anonymity would be very welcomed here!),
>>
>> You are right, this is a bug, at least in the documentation, but
>> probably "all real", indeed,
>>
>> but read on.
>>
>>     > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>     >>
>>     >>
>>     >> If I execute the code from the "?write.table" examples section
>>     >>
>>     >> x <- data.frame(a = I("a \" quote"), b = pi)
>>     >> # (ommited code)
>>     >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>     >>
>>     >> the resulting CSV file has a size of 6 bytes which is too short
>>     >> (truncated):
>>     >>
>>     >> """,3
>>
>> reproducibly, yes.
>> If you look at what write.csv does
>> and then simplify, you can get a similar wrong result by
>>
>>     write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>
>> which results in a file with one line
>>
>>     """ 3
>>
>> and if you debug write.table() you see that its building blocks
>> here are
>>     file <- file(........, encoding = fileEncoding)
>>
>>     a writeLines(*, file=file) for the column headers,
>>
>> and then "deeper down" C code which I did not investigate.
>
> I took a look at connections.c. There is a call to strlen() that gets
> confused by null characters. I think the obvious fix is to avoid the
> call to strlen() as the size is already known:
>
> Index: src/main/connections.c
> ==================================================================
> --- src/main/connections.c (revision 70213)
> +++ src/main/connections.c (working copy)
> @@ -369,7 +369,7 @@
>              /* is this safe? */
>              warning(_("invalid char string in output conversion"));
>          *ob = '\0';
> -        con->write(outbuf, 1, strlen(outbuf), con);
> +        con->write(outbuf, 1, ob - outbuf, con);
>      } while(again && inb > 0); /* it seems some iconv signal -1 on
>                                    zero-length input */
>      } else
>
>
>>
>> But just looking a bit at such a file() object with writeLines()
>> seems slightly revealing, as e.g., 'eol' does not seem to
>> "work" for this encoding:
>>
>>     > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>     > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>     > close(ff)
>>     > file.show(fn)
>>     CBA|>
>>     > file.size(fn)
>>     [1] 5
>>     >
>
> With the patch applied:
>
>     > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>     [1] "C" "B" "A" "|" ">a"
>     > file.size(fn)
>     [1] 22

I just realized that I was misusing the encoding argument of
readLines(). The code above works by accident, but the following would
be more appropriate:

    > ff <- file(fn, open="r", encoding="UTF-16LE")
    > readLines(ff)
    [1] "C" "B" "A" "|" ">a"
    > close(ff)

Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
the patch is incomplete on Windows.)

- Mikko
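The file sizes 5 and 22 reported in this message can be reproduced without a patched R build. The following is a minimal sketch and not part of the original post: it encodes the same five lines as UTF-16LE with iconv(toRaw = TRUE) and compares the byte count a strlen()-style scan would report with the real length. It assumes each line is converted together with its newline, one line at a time, which is only an illustrative simplification of what the connection code does. The names lines, utf16, strlen_like and decoded are made up for the example.

```r
lines <- c("C", "B", "A", "|", ">a")

# Encode each line plus its newline as UTF-16LE. toRaw = TRUE returns a
# list of raw vectors, so the embedded zero bytes are preserved.
utf16 <- iconv(paste0(lines, "\n"), from = "UTF-8", to = "UTF-16LE",
               toRaw = TRUE)

# What a strlen()-style scan sees: only the bytes before the first zero byte.
strlen_like <- vapply(utf16, function(b) {
    nul <- which(b == as.raw(0L))
    if (length(nul) > 0L) nul[1L] - 1L else length(b)
}, integer(1L))

sum(strlen_like)     # 5: the truncated file size
sum(lengths(utf16))  # 22: the file size with the full byte count

# Write the full bytes directly, then read them back through a
# connection that declares the encoding (illustrative file name).
fn <- tempfile("ffoo")
writeBin(unlist(utf16), fn)

ff <- file(fn, open = "r", encoding = "UTF-16LE")
decoded <- readLines(ff)
close(ff)
decoded
```

Writing the bytes with writeBin() bypasses the re-encoding connection on output, so the read-back through file(..., encoding = "UTF-16LE") can be checked whether or not the strlen() fix is present.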
Mikko Korpela
2016-Feb-25 10:54 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
On 25.02.2016 11:31, Mikko Korpela wrote:
> On 23.02.2016 14:06, Mikko Korpela wrote:
>> [...]
>>
>> With the patch applied:
>>
>>     > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>     [1] "C" "B" "A" "|" ">a"
>>     > file.size(fn)
>>     [1] 22
>
> I just realized that I was misusing the encoding argument of
> readLines(). The code above works by accident, but the following would
> be more appropriate:
>
>     > ff <- file(fn, open="r", encoding="UTF-16LE")
>     > readLines(ff)
>     [1] "C" "B" "A" "|" ">a"
>     > close(ff)
>
> Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
> the patch is incomplete on Windows.)

Before inspecting the file with readLines() I tried file.show() but it
did not work as expected. On Linux using a UTF-8 locale, the result of
trying to show the truly UTF-16LE encoded file with

    > file.show(fn, encoding="UTF-16LE")

was a pager showing "<43>" (quotes not included) followed by several
empty lines.

With the following patch, the command works correctly (in this case, on
this platform, not tested comprehensively). The idea is to read the
input file "raw" in order to avoid problems with null characters. The
input then needs to be split into lines after iconv(), or it could be
written to the output file with cat() if the style of line termination
characters does not matter. The 'perl = TRUE' is for assumed performance
advantage only. It can be removed, or one might want to test if there is
a significant difference one way or the other.

- Mikko

Index: src/library/base/R/files.R
==================================================================
--- src/library/base/R/files.R (revision 70217)
+++ src/library/base/R/files.R (working copy)
@@ -50,10 +50,13 @@
     for(i in seq_along(files)) {
         f <- files[i]
         tf <- tempfile()
-        tmp <- readLines(f, warn = FALSE)
+        tmp <- list(readBin(f, "raw", file.size(f)))
         tmp2 <- try(iconv(tmp, encoding, "", "byte"))
         if(inherits(tmp2, "try-error")) file.copy(f, tf)
-        else writeLines(tmp2, tf)
+        else {
+            tmp2 <- strsplit(tmp2, "\r\n?|\n", perl = TRUE)[[1L]]
+            writeLines(tmp2, tf)
+        }
         files[i] <- tf
         if(delete.file) unlink(f)
     }
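The read-raw, convert, then split sequence described in this message can be tried on its own, outside file.show(). The following is a minimal sketch and not the code from the patch: it converts to "UTF-8" rather than to the current locale ("" in the patch) so that the result does not depend on the session, and the sample text with mixed line endings is made up for illustration, as are the names sample_text, raw_bytes, converted and shown.

```r
# Illustrative input: the thread's five lines, with mixed line endings
# (CRLF, LF and a lone CR) to exercise the splitting pattern.
sample_text <- "C\r\nB\nA\r|\n>a\n"

fn <- tempfile("ffoo")
writeBin(iconv(sample_text, from = "UTF-8", to = "UTF-16LE",
               toRaw = TRUE)[[1L]], fn)

# Step 1: read the whole file as raw bytes, so zero bytes are harmless.
raw_bytes <- readBin(fn, "raw", file.size(fn))

# Step 2: convert in one go; iconv() accepts a list of raw vectors.
converted <- iconv(list(raw_bytes), from = "UTF-16LE", to = "UTF-8")

# Step 3: split into lines only after conversion, accepting CRLF, CR or LF.
shown <- strsplit(converted, "\r\n?|\n", perl = TRUE)[[1L]]
shown
```

Splitting after conversion matters because in UTF-16LE a newline is the two-byte sequence 0a 00, so a byte-oriented line reader applied before conversion would cut the text in the middle of a character.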
Duncan Murdoch
2016-Feb-29 18:30 UTC
[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)
I have just committed your first patch (the strlen() replacement) to
R-devel, and will soon put it in R-patched as well. I won't have time to
look at this again before the 3.2.4 release, so your file.show() patch
isn't going to make it unless someone else gets to it.

There's still a faint chance that I'll do more in R-devel before 3.3.0,
but I think it's best if there were bug reports about both of these
problems so they don't get forgotten. Since the first one is mainly a
Windows problem, I'll write that one up; I'd appreciate it if you could
write up the file.show() issue, after checking against R-devel rev 70247
or higher.

Duncan Murdoch

On 25/02/2016 5:54 AM, Mikko Korpela wrote:
> [...]