thr3ads.net - R devel - [Rd] Encoding issues [Feb 2019]

Iñaki Ucar

2019-Feb-18 15:36 UTC

[Rd] Encoding issues

Hi,

We found a behaviour that looks strange to us and might be a bug. First
a little bit of context. The 'units' package allows us to set the unit
using either standard evaluation (SE) or non-standard evaluation (NSE).
E.g., these two calls work in the same way:

units::set_units(1:10, "µm")
#> Units: [µm]
#> [1]  1  2  3  4  5  6  7  8  9 10

units::set_units(1:10, µm)
#> Units: [µm]
#> [1]  1  2  3  4  5  6  7  8  9 10

That's micrometers, and works fine if the session charset is UTF-8.
Now the funny part comes with Windows. The first version, with quotes,
works fine, but the second one fails. This is easy to demonstrate from
Linux:

LC_CTYPE=en_US.iso88591 Rscript -e 'units::set_units(1:10, "µm")'
#> Units: [µm]
#> [1]  1  2  3  4  5  6  7  8  9 10

LC_CTYPE=en_US.iso88591 Rscript -e 'units::set_units(1:10, µm)'
#> Error: unexpected input in "units::set_units(1:10, ?"
#> Execution halted

However, if you use the first version, with quotes, in an example, and
the package is checked on Windows, it fails too (see
https://ci.appveyor.com/project/edzer/units/builds/22440023#L747). The
package declares UTF-8 encoding, so none of these errors should, in
principle, happen. Am I wrong?

Thanks in advance, regards,
Iñaki

Gábor Csárdi

2019-Feb-18 16:26 UTC

From "Writing R Extensions":
"Only ASCII characters (and the control characters tab, formfeed, LF
and CR) should be used in code files."

So I am afraid you cannot use µm.

Gabor


Iñaki Ucar

2019-Feb-18 16:42 UTC

On Mon, 18 Feb 2019 at 17:27, Gábor Csárdi <csardi.gabor at gmail.com> wrote:
>
> From "Writing R Extensions":
>
> "Only ASCII characters (and the control characters tab, formfeed, LF
> and CR) should be used in code files."
>
> So I am afraid you cannot use µm.

Thanks, Gábor, I missed that bit. Then, is an .Rd file considered a
"code file"? Our surprise comes from the fact that the quoted version
works fine in a test file, but not in an example. Anyway, if non-ASCII
characters cause this kind of documented trouble, it seems that the
safest option is to avoid them in the first place.

Iñaki

Tomas Kalibera

2019-Feb-18 16:45 UTC

Hi Iñaki,

If you want to report a bug against R, please try to provide a minimal
reproducible example that only uses base packages (not units), and please
also see WRE sections 1.3 and 1.6.3, including:

"There is a portable way to have arbitrary text in character strings
(only) in your R code, which is to supply them in Unicode as '\uxxxx'
escapes."

"If your package specifies an encoding in its DESCRIPTION file, you
should run these tools in a locale which makes use of that encoding"
(this includes R CMD check).
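
A minimal sketch of the '\uxxxx' escape approach quoted above, using only
base R. The variable names are illustrative and not part of the original
message; the micro sign is assumed to be U+00B5.

```r
# "\u00b5" is the Unicode escape for the micro sign, so the source file
# itself stays pure ASCII.
micro_m <- "\u00b5m"

# Declared encoding of the string (expected to be "UTF-8" for a literal
# written with a \u escape).
Encoding(micro_m)

# Bytes of the UTF-8 form: c2 b5 (micro sign) followed by 6d ("m").
utf8_bytes <- charToRaw(enc2utf8(micro_m))
print(utf8_bytes)

# The source text of the literal contains only ASCII bytes.
literal_source <- '"\\u00b5m"'
source_is_ascii <- all(as.integer(charToRaw(literal_source)) < 128L)
print(source_is_ascii)
```

This only covers string literals; it does not help with unquoted names
such as the NSE form discussed in this thread.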

Even though there are portable ways to write a UTF-8 string literal in
source code that is not representable in the current native encoding
(e.g. using \u escapes), it does not mean that such a string can be
freely used in R. Many operations require conversion to the current
native encoding, which will cause an error or an unexpected result. Such
conversions can happen at any time (except when they are documented not
to happen).
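
To illustrate what a failed conversion can look like, here is a small
sketch in which an explicit iconv() call stands in for the implicit
conversions to the native encoding. ASCII and latin1 are chosen here only
as example target encodings, and the exact results depend on the
platform's iconv implementation.

```r
x <- "\u00b5m"  # UTF-8 string: micro sign followed by "m"

# Target encoding that cannot represent the micro sign: with the default
# sub = NA, iconv() typically yields NA for that element.
to_ascii <- iconv(x, from = "UTF-8", to = "ASCII")

# With sub = "byte", each non-convertible byte is replaced by its hex
# code in angle brackets instead.
to_ascii_sub <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")

# latin1 can represent the micro sign, so this conversion succeeds.
to_latin1 <- iconv(x, from = "UTF-8", to = "latin1")

print(to_ascii)
print(to_ascii_sub)
print(charToRaw(to_latin1))
```

The same string therefore survives in one locale and is lost or mangled
in another, which is the kind of unexpected result described above.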

Implementing an API that will work with such strings in a package would 
be hard to get right, but not impossible. NSE will not work 
(non-representable strings, which are not string constant literals, are 
not supported). One can save a lot of headaches by using only ASCII in 
function APIs.
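
One way to follow the ASCII-only advice could look like the sketch below.
The helper names (is_ascii, unit_labels, unit_label) and the "um" key are
hypothetical and are not taken from the 'units' package: the API accepts
only ASCII names and keeps any non-ASCII text in display labels written
with \u escapes.

```r
# Hypothetical helper: TRUE for each string made only of ASCII bytes.
is_ascii <- function(x) {
  vapply(x, function(s) all(as.integer(charToRaw(s)) < 128L),
         logical(1), USE.NAMES = FALSE)
}

# Hypothetical lookup table: ASCII keys for the API, \u escapes for labels.
unit_labels <- c(um = "\u00b5m", mm = "mm")

# Hypothetical API function: rejects non-ASCII or unknown unit names.
unit_label <- function(name) {
  stopifnot(is.character(name), length(name) == 1L, is_ascii(name))
  if (!name %in% names(unit_labels)) stop("unknown unit: ", name)
  unname(unit_labels[name])
}

label <- unit_label("um")
checks <- is_ascii(c("um", "\u00b5m"))
print(checks)
```

With this layout, user code and examples contain only ASCII, and the
non-ASCII label is needed only when output is displayed.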

Best
Tomas