thr3ads.net - R devel - [Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param) [Feb 2016]

Mikko Korpela

2016-Feb-25 09:31 UTC

[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

On 23.02.2016 14:06, Mikko Korpela wrote:
> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>     on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>
>>     > Dear R developers
>>     > I think I have found a bug that can be reproduced with two lines of code
>>     > and I am very thankful to get your first assessment or feed-back on my
>>     > report.
>>
>>     > If this is the wrong mailing list or I did something wrong
>>     > (e. g. semi "anonymous" email address to protect my privacy and defend
>>     > unwanted spam) please let me know since I am new here.
>>
>>     > Thank you very much :-)
>>
>>     > J. Altfeld
>>
>> Dear J.,
>> (yes, a bit less anonymity would be very welcomed here!),
>>
>> You are right, this is a bug, at least in the documentation, but
>> probably "all real", indeed,
>>
>> but read on.
>>
>>     > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>     >> 
>>     >> 
>>     >> If I execute the code from the "?write.table" examples section
>>     >> 
>>     >> x <- data.frame(a = I("a \" quote"), b = pi)
>>     >> # (omitted code)
>>     >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>     >> 
>>     >> the resulting CSV file has a size of 6 bytes which is too short
>>     >> (truncated):
>>     >> 
>>     >> """,3
>>
>> reproducibly, yes.
>> If you look at what write.csv does
>> and then simplify, you can get a similar wrong result by
>>
>>   write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>
>> which results in a file with one line
>>
>> """ 3
>>
>> and if you debug  write.table() you see that its building blocks
>> here are
>> 	 file <- file(........, encoding = fileEncoding)
>>
>> a 	 writeLines(*, file=file)  for the column headers,
>>
>> and then "deeper down" C code which I did not investigate.
> 
> I took a look at connections.c. There is a call to strlen() that gets
> confused by null characters. I think the obvious fix is to avoid the
> call to strlen() as the size is already known:
> 
> Index: src/main/connections.c
> ==================================================================
> --- src/main/connections.c	(revision 70213)
> +++ src/main/connections.c	(working copy)
> @@ -369,7 +369,7 @@
>  		/* is this safe? */
>  		warning(_("invalid char string in output conversion"));
>  	    *ob = '\0';
> -	    con->write(outbuf, 1, strlen(outbuf), con);
> +	    con->write(outbuf, 1, ob - outbuf, con);
>  	} while(again && inb > 0);  /* it seems some iconv signal -1 on
>  				       zero-length input */
>      } else
> 
> 
>>
>> But just looking a bit at such a file() object with writeLines()
>> seems slightly revealing, as e.g., 'eol' does not seem to
>> "work" for this encoding:
>>
>>     > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>     > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>     > close(ff)
>>     > file.show(fn)
>>     CBA|>
>>     > file.size(fn)
>>     [1] 5
>>     > 
> 
> With the patch applied:
> 
>     > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>     [1] "C"  "B"  "A"  "|"  ">a"
>     > file.size(fn)
>     [1] 22

I just realized that I was misusing the encoding argument of
readLines(). The code above works by accident, but the following would
be more appropriate:

    > ff <- file(fn, open="r", encoding="UTF-16LE")
    > readLines(ff)
    [1] "C"  "B"  "A"  "|"  ">a"
    > close(ff)

Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
the patch is incomplete on Windows.)

- Mikko
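
The two file sizes in the session above (5 bytes before the patch, 22 bytes after) follow from how UTF-16LE encodes ASCII text: every second byte is zero, so strlen() stops after the first byte of each converted chunk. The following is a minimal C sketch of that effect, not R's actual code. ascii_to_utf16le() is a hypothetical stand-in for iconv() that only handles 7-bit ASCII, bytes_written() is an illustrative helper, and the sketch assumes each line plus its newline is converted and written as one chunk, which is what the observed output suggests.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for iconv(): widen 7-bit ASCII to UTF-16LE.
   Returns a pointer one past the last byte written, like `ob` in the patch. */
static char *ascii_to_utf16le(const char *in, char *out)
{
    char *ob = out;
    for (; *in != '\0'; in++) {
        *ob++ = *in;   /* low byte: the ASCII code */
        *ob++ = '\0';  /* high byte: always zero for ASCII */
    }
    return ob;
}

/* Total number of bytes that would be written for `n` lines, each followed
   by "\n". A nonzero `use_strlen` mimics the old code, zero the patched code. */
size_t bytes_written(const char *const *lines, size_t n, int use_strlen)
{
    char text[64], outbuf[130];
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(lines[i]);
        if (len > sizeof text - 2)
            len = sizeof text - 2;      /* keep the sketch within its buffers */
        memcpy(text, lines[i], len);
        text[len] = '\n';
        text[len + 1] = '\0';
        char *ob = ascii_to_utf16le(text, outbuf);
        *ob = '\0';                     /* terminator, as in connections.c */
        /* strlen() stops at the first zero byte; ob - outbuf is the true size */
        total += use_strlen ? strlen(outbuf) : (size_t)(ob - outbuf);
    }
    return total;
}
```

Called on the five lines "C", "B", "A", "|" and ">a", bytes_written() returns 5 with use_strlen set and 22 without it, matching the two file.size() results above.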

Mikko Korpela

2016-Feb-25 10:54 UTC

Before inspecting the file with readLines() I tried file.show() but it
did not work as expected. On Linux using a UTF-8 locale, the result of
trying to show the truly UTF-16LE encoded file with

    > file.show(fn, encoding="UTF-16LE")

was a pager showing "<43>" (quotes not included) followed by several
empty lines.

With the following patch, the command works correctly (in this case, on
this platform, not tested comprehensively). The idea is to read the
input file "raw" in order to avoid problems with null characters. The
input then needs to be split into lines after iconv(), or it could be
written to the output file with cat() if the style of line termination
characters does not matter. The 'perl = TRUE' is for assumed performance
advantage only. It can be removed, or one might want to test if there is
a significant difference one way or the other.

- Mikko

Index: src/library/base/R/files.R
==================================================================
--- src/library/base/R/files.R	(revision 70217)
+++ src/library/base/R/files.R	(working copy)
@@ -50,10 +50,13 @@
         for(i in seq_along(files)) {
             f <- files[i]
             tf <- tempfile()
-            tmp <- readLines(f, warn = FALSE)
+            tmp <- list(readBin(f, "raw", file.size(f)))
             tmp2 <- try(iconv(tmp, encoding, "", "byte"))
             if(inherits(tmp2, "try-error")) file.copy(f, tf)
-            else writeLines(tmp2, tf)
+            else {
+                tmp2 <- strsplit(tmp2, "\r\n?|\n", perl = TRUE)[[1L]]
+                writeLines(tmp2, tf)
+            }
             files[i] <- tf
             if(delete.file) unlink(f)
         }
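
The line-splitting step in the patch relies on the regular expression "\r\n?|\n", which accepts CRLF, a lone CR, or LF as a line terminator. As a rough illustration of what that pattern matches, here is a C sketch of the same splitting rule. split_lines() is a hypothetical helper, not part of R, and it assumes the buffer has already been converted to a NUL-terminated string in the native encoding.

```c
#include <stddef.h>

/* Split `buf` in place at "\r\n", "\r" or "\n". Stores up to `max` line
   starts in `lines` and returns the number stored. A terminator at the very
   end of the buffer does not produce a trailing empty line. */
size_t split_lines(char *buf, char **lines, size_t max)
{
    size_t n = 0;
    char *p = buf;
    while (*p != '\0' && n < max) {
        lines[n++] = p;                 /* start of the next line */
        while (*p != '\0' && *p != '\r' && *p != '\n')
            p++;
        if (*p == '\0')
            break;                      /* last line had no terminator */
        char *end = p;
        if (p[0] == '\r' && p[1] == '\n')
            p++;                        /* treat CRLF as a single terminator */
        p++;
        *end = '\0';                    /* cut the line at its terminator */
    }
    return n;
}
```

Splitting in place avoids copying the text; the trade-off is that the input buffer is modified, so a caller that still needs the original bytes would have to work on a copy.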

Duncan Murdoch

2016-Feb-29 18:30 UTC

I have just committed your first patch (the strlen() replacement) to 
R-devel, and will soon put it in R-patched as well.  I won't have time to 
look at this again before the 3.2.4 release, so your file.show() patch 
isn't going to make it unless someone else gets to it.

There's still a faint chance that I'll do more in R-devel before 3.3.0, 
but I think it's best if there were bug reports about both of these 
problems so they don't get forgotten.  Since the first one is mainly a 
Windows problem, I'll write that one up; I'd appreciate it if you could 
write up the file.show() issue, after checking against R-devel rev 70247 
or higher.

Duncan Murdoch

