thr3ads.net - R devel - [Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param) [Feb 2016]

Duncan Murdoch

2016-Feb-24 16:16 UTC

[Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

On 24/02/2016 9:55 AM, Mikko Korpela wrote:
> On 24.02.2016 15:47, Duncan Murdoch wrote:
>> On 23/02/2016 7:06 AM, Mikko Korpela wrote:
>>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>>> nospam at altfeld-im de <nospam at altfeld-im.de>
>>>>>>>>>       on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>>
>>>>       > Dear R developers
>>>>       > I think I have found a bug that can be reproduced with two lines of code
>>>>       > and I am very thankful to get your first assessment or feed-back on my
>>>>       > report.
>>>>
>>>>       > If this is the wrong mailing list or I did something wrong
>>>>       > (e. g. semi "anonymous" email address to protect my privacy and defend
>>>>       > unwanted spam) please let me know since I am new here.
>>>>
>>>>       > Thank you very much :-)
>>>>
>>>>       > J. Altfeld
>>>>
>>>> Dear J.,
>>>> (yes, a bit less anonymity would be very welcomed here!),
>>>>
>>>> You are right, this is a bug, at least in the documentation, but
>>>> probably "all real", indeed,
>>>>
>>>> but read on.
>>>>
>>>>       > On Tue, 2016-02-16 at 18:25 +0100, nospam at altfeld-im.de wrote:
>>>>       >>
>>>>       >>
>>>>       >> If I execute the code from the "?write.table" examples section
>>>>       >>
>>>>       >> x <- data.frame(a = I("a \" quote"), b = pi)
>>>>       >> # (omitted code)
>>>>       >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>>>       >>
>>>>       >> the resulting CSV file has a size of 6 bytes which is too short
>>>>       >> (truncated):
>>>>       >>
>>>>       >> """,3
>>>>
>>>> reproducibly, yes.
>>>> If you look at what write.csv does
>>>> and then simplify, you can get a similar wrong result by
>>>>
>>>>     write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>>
>>>> which results in a file with one line
>>>>
>>>> """ 3
>>>>
>>>> and if you debug  write.table() you see that its building blocks
>>>> here are
>>>>       file <- file(........, encoding = fileEncoding)
>>>>
>>>> a      writeLines(*, file=file)  for the column headers,
>>>>
>>>> and then "deeper down" C code which I did not investigate.
>>>
>>> I took a look at connections.c. There is a call to strlen() that gets
>>> confused by null characters. I think the obvious fix is to avoid the
>>> call to strlen() as the size is already known:
>>>
>>> Index: src/main/connections.c
>>> ==================================================================
>>> --- src/main/connections.c    (revision 70213)
>>> +++ src/main/connections.c    (working copy)
>>> @@ -369,7 +369,7 @@
>>>            /* is this safe? */
>>>            warning(_("invalid char string in output conversion"));
>>>            *ob = '\0';
>>> -        con->write(outbuf, 1, strlen(outbuf), con);
>>> +        con->write(outbuf, 1, ob - outbuf, con);
>>>        } while(again && inb > 0);  /* it seems some iconv signal -1 on
>>>                           zero-length input */
>>>        } else
>>>
>>>
>>>>
>>>> But just looking a bit at such a file() object with writeLines()
>>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>>> "work" for this encoding:
>>>>
>>>>       > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>>>       > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>>>       > close(ff)
>>>>       > file.show(fn)
>>>>       CBA|>
>>>>       > file.size(fn)
>>>>       [1] 5
>>>>       >
>>>
>>> With the patch applied:
>>>
>>>       > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>>>       [1] "C"  "B"  "A"  "|"  ">a"
>>>       > file.size(fn)
>>>       [1] 22
>>
>> That may be okay on Unix, but it's not enough on Windows.  There the \n
>> that writeLines adds at the end of each line isn't translated to
>> UTF-16LE properly, so things get messed up.  (I think the \n is
>> translated, but the \r that Windows wants is not, so you get a mix of 8
>> bit and 16 bit characters.)
>
> That's unfortunate. I tested my tiny patch on Linux. I don't know what
> kind of additional changes would be needed to make this work on Windows.
>
It looks like a big change is needed for a perfect solution:

  - Windows does the translation of \n to \r\n.  In the R code, Windows 
is never told that the output is UTF-16LE, so it does an 8 bit translation.

  - Telling Windows that output is UTF-16LE looks hard:  we'd need to
convert the string to wide chars in R, then write it in wide chars. 
This seems like a lot of work for a rare case.

  - It might be easier to do a hack:  if the user asks for "UTF-16LE",
then treat it internally as a text file but tell Windows it's a binary 
file.  This means no \n to \r\n translation will be done by Windows.  If 
the desired output file needs Windows line endings, the user would have 
to specify sep="\r\n" in writeLines.

Duncan Murdoch
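The strlen() problem behind the quoted patch can be reproduced outside R. What follows is a minimal C sketch, not code from connections.c: the function names length_via_strlen and length_via_pointers are made up for illustration, and only the two ways of computing the length mirror the "-" and "+" lines of the patch. It assumes UTF-16LE output in which every ASCII character is followed by a 0x00 byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length as the pre-patch line computed it (hypothetical helper).
 * strlen() stops at the first 0x00 byte. In UTF-16LE that byte is the
 * second half of the first ASCII character, so the count is too small.
 * The buffer must contain at least one 0x00 byte. */
size_t length_via_strlen(const char *outbuf)
{
    assert(outbuf != NULL);
    return strlen(outbuf);
}

/* Length as the patched line computes it (hypothetical helper):
 * the distance from the start of the buffer to the current write
 * position, which does not depend on the bytes in between. */
size_t length_via_pointers(const char *outbuf, const char *ob)
{
    assert(outbuf != NULL && ob >= outbuf);
    return (size_t)(ob - outbuf);
}
```

For the four bytes 43 00 0A 00 ("C\n" in UTF-16LE), length_via_strlen() returns 1, while length_via_pointers() with ob pointing just past the last byte returns 4. One byte per converted line is consistent with the 5-byte file shown in the quoted session.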

Duncan Murdoch

2016-Feb-24 20:49 UTC

On 24/02/2016 11:16 AM, Duncan Murdoch wrote:
> On 24/02/2016 9:55 AM, Mikko Korpela wrote:
[...]
>> That's unfortunate. I tested my tiny patch on Linux. I don't know what
>> kind of additional changes would be needed to make this work on Windows.
>>
>
> It looks like a big change is needed for a perfect solution:
>
>    - Windows does the translation of \n to \r\n.  In the R code, Windows
> is never told that the output is UTF-16LE, so it does an 8 bit translation.
>
>    - Telling Windows that output is UTF-16LE looks hard:  we'd need to
> convert the string to wide chars in R, then write it in wide chars.
> This seems like a lot of work for a rare case.
>
>    - It might be easier to do a hack:  if the user asks for "UTF-16LE",
> then treat it internally as a text file but tell Windows it's a binary
> file.  This means no \n to \r\n translation will be done by Windows.  If
> the desired output file needs Windows line endings, the user would have
> to specify sep="\r\n" in writeLines.
A third possibility is to handle the insertion of the \r completely 
within R.  This will have the advantage of making it optional, so it 
would be a lot easier to write a Unix-style file on Windows.

I think either the first or third possibilities will take too much time 
for me to attempt them before 3.3.0.  I'm not sure about the second one yet.

Duncan Murdoch
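The difference between an 8 bit \n to \r\n translation and one done at the character level, as in the third possibility, can be sketched in C. This is an illustration only and not how R or the Windows C runtime is implemented: the names crlf_bytewise and crlf_utf16le are hypothetical, and the first function merely assumes that a text-mode stream unaware of the encoding inserts a single 0x0D byte before each 0x0A byte.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-wise translation (assumed behaviour of an encoding-unaware
 * text-mode stream): each 0x0A byte gets one 0x0D byte in front of it.
 * Returns the number of bytes written; dst needs room for 2 * n bytes. */
size_t crlf_bytewise(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i] == 0x0A)
            dst[out++] = 0x0D;
        dst[out++] = src[i];
    }
    return out;
}

/* Code-unit-wise translation: each UTF-16LE code unit U+000A (bytes 0A 00)
 * gets a full U+000D code unit (bytes 0D 00) in front of it.
 * n must be even; dst needs room for 2 * n bytes. */
size_t crlf_utf16le(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL && n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        if (src[i] == 0x0A && src[i + 1] == 0x00) {
            dst[out++] = 0x0D;
            dst[out++] = 0x00;
        }
        dst[out++] = src[i];
        dst[out++] = src[i + 1];
    }
    return out;
}
```

For the input bytes 41 00 0A 00 ("A\n" in UTF-16LE), crlf_bytewise() produces 41 00 0D 0A 00, five bytes with a lone 8 bit \r that shifts everything after it by one byte. crlf_utf16le() produces 41 00 0D 00 0A 00, which stays valid UTF-16LE. Output prepared this way would have to be written without any further translation by the operating system, which is what the binary-file hack amounts to.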

peter dalgaard

2016-Feb-25 09:49 UTC

Aim for 3.3.1 then? It's not like we have hordes of people demanding to have
this fixed right here and now, or do we?

(A practical problem is that the version control dynamics dictate that at this
stage, commits to r-devel _will_ end up in 3.3.0 on April 14, unless backed out
and then inserted in the new r-devel branch to be created on March 17.)

- Peter


On 24 Feb 2016, at 21:49 , Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 24/02/2016 11:16 AM, Duncan Murdoch wrote:
>> On 24/02/2016 9:55 AM, Mikko Korpela wrote:
>>>>
[...]
>>>
>>> That's unfortunate. I tested my tiny patch on Linux. I don't know what
>>> kind of additional changes would be needed to make this work on Windows.
>>> 
>> 
>> It looks like a big change is needed for a perfect solution:
>> 
>>   - Windows does the translation of \n to \r\n.  In the R code, Windows
>> is never told that the output is UTF-16LE, so it does an 8 bit translation.
>>
>>   - Telling Windows that output is UTF-16LE looks hard:  we'd need to
>> convert the string to wide chars in R, then write it in wide chars.
>> This seems like a lot of work for a rare case.
>>
>>   - It might be easier to do a hack:  if the user asks for "UTF-16LE",
>> then treat it internally as a text file but tell Windows it's a binary
>> file.  This means no \n to \r\n translation will be done by Windows.  If
>> the desired output file needs Windows line endings, the user would have
>> to specify sep="\r\n" in writeLines.
> 
> A third possibility is to handle the insertion of the \r completely within
> R.  This will have the advantage of making it optional, so it would be a lot
> easier to write a Unix-style file on Windows.
>
> I think either the first or third possibilities will take too much time for
> me to attempt them before 3.3.0.  I'm not sure about the second one yet.
> 
> Duncan Murdoch
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
