thr3ads.net - R devel - [Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param) [Feb 2016]

Duncan Murdoch

2016-Feb-24 13:47 UTC

[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

On 23/02/2016 7:06 AM, Mikko Korpela wrote:
> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>      on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>
>>      > Dear R developers
>>      > I think I have found a bug that can be reproduced with two lines of code
>>      > and I am very thankful to get your first assessment or feed-back on my
>>      > report.
>>
>>      > If this is the wrong mailing list or I did something wrong
>>      > (e. g. semi "anonymous" email address to protect my privacy and defend
>>      > unwanted spam) please let me know since I am new here.
>>
>>      > Thank you very much :-)
>>
>>      > J. Altfeld
>>
>> Dear J.,
>> (yes, a bit less anonymity would be very welcomed here!),
>>
>> You are right, this is a bug, at least in the documentation, but
>> probably "all real", indeed,
>>
>> but read on.
>>
>>      > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>      >>
>>      >>
>>      >> If I execute the code from the "?write.table" examples section
>>      >>
>>      >> x <- data.frame(a = I("a \" quote"), b = pi)
>>      >> # (omitted code)
>>      >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>      >>
>>      >> the resulting CSV file has a size of 6 bytes which is too short
>>      >> (truncated):
>>      >>
>>      >> """,3
>>
>> reproducibly, yes.
>> If you look at what write.csv does
>> and then simplify, you can get a similar wrong result by
>>
>>    write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>
>> which results in a file with one line
>>
>> """ 3
>>
>> and if you debug  write.table() you see that its building blocks
>> here are
>> 	 file <- file(........, encoding = fileEncoding)
>>
>> a 	 writeLines(*, file=file)  for the column headers,
>>
>> and then "deeper down" C code which I did not investigate.
>
> I took a look at connections.c. There is a call to strlen() that gets
> confused by null characters. I think the obvious fix is to avoid the
> call to strlen() as the size is already known:
>
> Index: src/main/connections.c
> ==================================================================
> --- src/main/connections.c	(revision 70213)
> +++ src/main/connections.c	(working copy)
> @@ -369,7 +369,7 @@
>   		/* is this safe? */
>   		warning(_("invalid char string in output conversion"));
>   	    *ob = '\0';
> -	    con->write(outbuf, 1, strlen(outbuf), con);
> +	    con->write(outbuf, 1, ob - outbuf, con);
>   	} while(again && inb > 0);  /* it seems some iconv signal -1 on
>   				       zero-length input */
>       } else
>
>
>>
>> But just looking a bit at such a file() object with writeLines()
>> seems slightly revealing, as e.g., 'eol' does not seem to
>> "work" for this encoding:
>>
>>      > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>      > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>      > close(ff)
>>      > file.show(fn)
>>      CBA|>
>>      > file.size(fn)
>>      [1] 5
>>      >
>
> With the patch applied:
>
>      > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>      [1] "C"  "B"  "A"  "|"  ">a"
>      > file.size(fn)
>      [1] 22

That may be okay on Unix, but it's not enough on Windows.  There the \n
that writeLines adds at the end of each line isn't translated to 
UTF-16LE properly, so things get messed up.  (I think the \n is 
translated, but the \r that Windows wants is not, so you get a mix of 8 
bit and 16 bit characters.)
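
The mix of 8-bit and 16-bit units described here can be illustrated without Windows. The C sketch below is only a simulation: `translate_newlines_8bit` is a hypothetical helper, not part of R or of the Windows C runtime, and it assumes the text-mode layer simply inserts a single 0x0D byte before every 0x0A byte it sees, without knowing the data is UTF-16LE.

```c
#include <assert.h>
#include <stddef.h>

/* Simulate a byte-oriented text-mode write: every 0x0A byte in the input
 * is preceded by a single 0x0D byte in the output. The function knows
 * nothing about UTF-16, which is the point of the illustration.
 * Returns the number of bytes stored in `out`. */
size_t translate_newlines_8bit(const unsigned char *in, size_t n,
                               unsigned char *out, size_t outsize)
{
    size_t k = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0x0A) {
            assert(k < outsize);
            out[k++] = 0x0D;   /* 8-bit carriage return, no 0x00 partner */
        }
        assert(k < outsize);
        out[k++] = in[i];      /* original byte is copied unchanged */
    }
    return k;
}
```

Given the UTF-16LE bytes for "A\n" (41 00 0A 00), this produces 41 00 0D 0A 00. The result has five bytes, so it can no longer be read as a sequence of 16-bit units.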

Duncan Murdoch
> - Mikko Korpela
>
>>      >> The problem seems to be the iconv function:
>>      >>
>>      >> iconv("foo", to="UTF-16")
>>      >>
>>      >> produces
>>      >>
>>      >> Error in iconv("foo", to = "UTF-16"):
>>      >> embedded nul in string: '\xff\xfef\0o\0o\0'
>>
>> but this works
>>
>>      > iconv("foo", to="UTF-16", toRaw=TRUE)
>>      [[1]]
>>      [1] ff fe 66 00 6f 00 6f 00
>>
>> (indeed showing the embedded '\0's)
>>
>>      >> In 2010 a (partial) patch for this problem was submitted:
>>      >> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html
>>
>> the patch only related to the iconv() problem not allowing 'raw'
>> (instead of character) argument x.
>>
>> ... and it is > 5.5 years old, for an iconv() version that was less
>> featureful than today.
>> Rather, current iconv(x) allows x to be a list of raw entries.
>>
>>
>>      >> Are there chances to fix this problem since it prevents writing Windows
>>      >> UTF-16LE text files?
>>
>>      >>
>>      >> PS: This problem can be reproduced on Windows and Linux.
>>
>> indeed.... also on "R devel of today".
>>
>> I agree it should be fixed... but as I said not by the patch you
>> mentioned.
>>
>> Tested patches to fix this are welcome, indeed.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Mikko Korpela

2016-Feb-24 14:55 UTC

On 24.02.2016 15:47, Duncan Murdoch wrote:
> 
> That may be okay on Unix, but it's not enough on Windows.  There the \n
> that writeLines adds at the end of each line isn't translated to
> UTF-16LE properly, so things get messed up.  (I think the \n is
> translated, but the \r that Windows wants is not, so you get a mix of 8
> bit and 16 bit characters.)

That's unfortunate. I tested my tiny patch on Linux. I don't know what
kind of additional changes would be needed to make this work on Windows.
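
For readers following along, the effect of the one-line change to connections.c quoted earlier in the thread can be reproduced outside R. The sketch below is not R's code: `encode_ascii_utf16le` and `bytes_written` are hypothetical helpers, the encoder handles ASCII input only, and it assumes each line is converted together with its trailing "\n" in a single call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the iconv() step: encode an ASCII string as
 * UTF-16LE into outbuf and return the cursor one past the last byte
 * written, which plays the role of `ob` in the quoted diff. */
char *encode_ascii_utf16le(const char *ascii, char *outbuf, size_t outsize)
{
    size_t n = strlen(ascii);
    char *ob = outbuf;

    assert(2 * n + 1 <= outsize);  /* data plus the terminating byte */
    for (size_t i = 0; i < n; i++) {
        *ob++ = ascii[i];  /* low byte carries the ASCII code */
        *ob++ = '\0';      /* high byte is zero for ASCII */
    }
    *ob = '\0';            /* mirrors `*ob = '\0';` in the diff */
    return ob;
}

/* Total number of bytes that would be handed to the write call.
 * use_cursor == 0: length taken from strlen(outbuf), as before the patch.
 * use_cursor != 0: length taken from ob - outbuf, as after the patch. */
size_t bytes_written(const char *const *lines, size_t nlines, int use_cursor)
{
    char outbuf[256];
    size_t total = 0;

    for (size_t i = 0; i < nlines; i++) {
        char *ob = encode_ascii_utf16le(lines[i], outbuf, sizeof outbuf);
        total += use_cursor ? (size_t)(ob - outbuf) : strlen(outbuf);
    }
    return total;
}
```

For the five lines of the writeLines() example ("C\n", "B\n", "A\n", "|\n", ">a\n"), the strlen() variant totals 5 bytes and the cursor variant totals 22. These match the two file.size() results reported in the thread, under the stated assumption about how lines are converted.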

-- 
Mikko Korpela
Aalto University School of Science
Department of Computer Science

Duncan Murdoch

2016-Feb-24 16:16 UTC

On 24/02/2016 9:55 AM, Mikko Korpela wrote:
>
> That's unfortunate. I tested my tiny patch on Linux. I don't know what
> kind of additional changes would be needed to make this work on Windows.
>
It looks like a big change is needed for a perfect solution:

  - Windows does the translation of \n to \r\n.  In the R code, Windows 
is never told that the output is UTF-16LE, so it does an 8 bit translation.

  - Telling Windows that output is UTF-16LE looks hard:  we'd need to
convert the string to wide chars in R, then write it in wide chars. 
This seems like a lot of work for a rare case.

  - It might be easier to do a hack:  if the user asks for "UTF-16LE",
then treat it internally as a text file but tell Windows it's a binary 
file.  This means no \n to \r\n translation will be done by Windows.  If 
the desired output file needs Windows line endings, the user would have 
to specify sep="\r\n" in writeLines.
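
The last option amounts to encoding the line ending yourself and writing the bytes in binary mode, so that the runtime performs no newline translation. A minimal C sketch of that idea follows. It is not R's implementation: `encode_line_utf16le_crlf` and `write_lines_binary` are hypothetical names, only ASCII input is handled, and lines are assumed to be at most 254 characters long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Encode one ASCII line as UTF-16LE and append a 16-bit CR LF pair
 * (0D 00 0A 00). Returns the number of bytes stored in `out`. */
size_t encode_line_utf16le_crlf(const char *ascii, unsigned char *out,
                                size_t outsize)
{
    size_t n = strlen(ascii);
    size_t k = 0;

    assert(2 * n + 4 <= outsize);  /* two bytes per char plus CR LF */
    for (size_t i = 0; i < n; i++) {
        out[k++] = (unsigned char)ascii[i];  /* low byte */
        out[k++] = 0x00;                     /* high byte */
    }
    out[k++] = 0x0D; out[k++] = 0x00;        /* 16-bit carriage return */
    out[k++] = 0x0A; out[k++] = 0x00;        /* 16-bit line feed */
    return k;
}

/* Write the lines to `path` in binary mode ("wb"), so the C runtime does
 * not translate newlines. Returns 0 on success and -1 on failure. */
int write_lines_binary(const char *path, const char *const *lines,
                       size_t nlines)
{
    unsigned char buf[512];
    FILE *fp = fopen(path, "wb");

    if (fp == NULL)
        return -1;
    for (size_t i = 0; i < nlines; i++) {
        size_t len = encode_line_utf16le_crlf(lines[i], buf, sizeof buf);
        if (fwrite(buf, 1, len, fp) != len) {
            fclose(fp);
            return -1;
        }
    }
    return fclose(fp) == 0 ? 0 : -1;
}
```

With this approach the line "A" is stored as 41 00 0D 00 0A 00, so every code unit, including the line ending, is 16 bits wide.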

Duncan Murdoch
