thr3ads.net - R devel - [Rd] R string comparisons may vary with platform (plain text) [Nov 2014]

Prof Brian Ripley

2014-Nov-23 11:44 UTC

[Rd] R string comparisons may vary with platform (plain text)

On 23/11/2014 09:39, peter dalgaard wrote:
>
>> On 23 Nov 2014, at 01:05, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
>>
>> On Sat, Nov 22, 2014 at 12:42 PM, Duncan Murdoch
>> <murdoch.duncan at gmail.com> wrote:
>>> On 22/11/2014, 2:59 PM, Stuart Ambler wrote:
>>>> A colleague's R program behaved differently when I ran it, and we thought
>>>> we traced it probably to different results from string comparisons as
>>>> below, with different R versions.  However the platforms also differed.  A
>>>> friend ran it on a few machines and found that the comparison behavior
>>>> didn't correlate with R version, but rather with platform.
>>>>
>>>> I wonder if you've seen this.  If it's not some setting I'm unaware of,
>>>> maybe someone should look into it.  Sorry I haven't taken the time to read
>>>> the source code myself.
>>>
>>> Looks like a collation order issue.  See ?Comparison.
>>
>> With the oddity that both platforms use what look like similar locales:
>>
>> LC_COLLATE=en_US.UTF-8
>> LC_COLLATE=en_US.utf8
>
> It's the sort of thing that I've tried to wrap my mind around multiple
> times and failed, but have a look at
>
> http://stackoverflow.com/questions/19967555/postgres-collation-differences-osx-v-ubuntu
>
> which seems to be essentially the same issue, just for Postgres. If you
> have the stamina, also look into the python question that it links to.
>
> As I understand it, there are two potential reasons: Either the two
> platforms are not using the same collation table for en_US, or at least
> one of them is not fully implementing the Unicode Collation Algorithm.

And I have seen both with R.  At the very least, check if ICU is being
used (capabilities("ICU") in current R, maybe not in some of the
obsolete versions seen in this thread).

As a further possibility, there are choices in the UCA (in R, see
?icuSetCollate) and ICU can be compiled with different default choices.
It is not clear to me what (if any) difference ICU versions make, but
in R-devel extSoftVersion() reports that.
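
A minimal sketch of these checks, assuming an R version in which
capabilities() reports "ICU" and extSoftVersion() exists (both are guarded
here, since older versions may lack them). The probe vector is made up for
illustration; its sorted order may differ between platforms.

```r
# Is this build of R using ICU for collation?
caps <- capabilities()
has_icu <- "ICU" %in% names(caps) && isTRUE(unname(caps["ICU"]))
cat("ICU available:", has_icu, "\n")

# The collation locale currently in effect.
cat("LC_COLLATE:", Sys.getlocale("LC_COLLATE"), "\n")

# Versions of external software, including ICU where it is reported.
if (exists("extSoftVersion")) print(extSoftVersion())

# Illustrative probe: the order printed here depends on platform and locale.
probe <- c("b", "A", "a", "B")
print(sort(probe))
```

Running this on each machine and comparing the output is one way to see
whether the two platforms differ in ICU use or only in locale tables.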

> In general, collation is a minefield: Some languages have the same letters
> in different order (e.g. Estonian with Z between S and T); accented
> characters sort with the unaccented counterpart in some languages but as
> separate characters in others; some locales sort ABab, others AaBb, yet
> others aAbB; sometimes punctuation is ignored, sometimes not; sometimes
> multiple characters count as one, etc.
>

As ?Comparison has long said.
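
The locale dependence can be seen with a small sketch that evaluates one
comparison under different LC_COLLATE settings. The helper name compare_in
is hypothetical, and the sketch assumes R_ICU_LOCALE is not set, so that the
"C" locale gives plain byte-order comparison.

```r
# Compare two strings under a given LC_COLLATE setting, then restore the
# previous setting. `compare_in` is a hypothetical helper name.
compare_in <- function(locale, a, b) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old))
  # Sys.setlocale() returns "" (with a warning) if the locale is unavailable.
  res <- suppressWarnings(Sys.setlocale("LC_COLLATE", locale))
  if (!nzchar(res)) return(NA)
  a < b
}

# In the C locale, "B" (byte 66) sorts before "a" (byte 97).
print(compare_in("C", "a", "B"))

# Result depends on the platform; NA if the locale is not installed.
print(compare_in("en_US.UTF-8", "a", "B"))
```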


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK

Martin Morgan

2014-Nov-23 16:15 UTC

For many scientific applications one is really dealing with ASCII characters
and LC_COLLATE="C", even if the user is running in non-C locales. What robust
approaches (if any?) are available to write code that sorts in a
locale-independent way? The Note in ?Sys.setlocale is not overly optimistic
about setting the locale within a session.
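
Two possible approaches, as a hedged sketch rather than an established
answer. The first assumes R 3.3.0 or later, where sort() accepts
method = "radix" and that method sorts in the C locale. The second
temporarily switches LC_COLLATE and so remains subject to the caveats in
the ?Sys.setlocale Note; sort_c is a hypothetical helper name.

```r
x <- c("b", "A", "a", "B", "_x", "Z")

# Option 1 (assumes R >= 3.3.0): the radix method sorts in the C locale,
# regardless of the session's LC_COLLATE.
sorted_radix <- sort(x, method = "radix")

# Option 2: switch LC_COLLATE to "C" only for the duration of the call,
# restoring the previous setting on exit. `sort_c` is a hypothetical helper.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old))
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

print(sorted_radix)  # "A" "B" "Z" "_x" "a" "b" (byte order)
print(sort_c(x))
```

Option 1 avoids touching session state, which is why it may be preferable
where the R version allows it.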

Martin Morgan

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

Mark van der Loo

2014-Nov-24 14:36 UTC

The 'stringi' package claims robust cross-platform performance. It exports
much functionality of the ICU library and will attempt to install it when
not present. The function 'stri_sort' accepts a collation argument that can
be defined with 'stri_opts_collator'.
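
A minimal sketch of that usage, assuming a recent 'stringi' release in which
the collation argument of stri_sort is named opts_collator (the argument
name may differ in older releases). The input vector and the "en_US" locale
are chosen only for illustration, and the block is skipped if the package is
not installed.

```r
# Assumes the 'stringi' package is installed; skipped otherwise.
if (requireNamespace("stringi", quietly = TRUE)) {
  x <- c("b", "A", "a", "B")

  # Request an explicit collation locale instead of relying on the session's.
  opts <- stringi::stri_opts_collator(locale = "en_US")
  sorted <- stringi::stri_sort(x, opts_collator = opts)
  print(sorted)

  # Ask ICU to place uppercase before lowercase among otherwise equal letters.
  opts_upper <- stringi::stri_opts_collator(locale = "en_US",
                                            uppercase_first = TRUE)
  print(stringi::stri_sort(x, opts_collator = opts_upper))
}
```

Because the locale and options are passed explicitly, the result should not
depend on the session's LC_COLLATE, though it may still depend on the ICU
version in use.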




On Sun, Nov 23, 2014 at 5:15 PM, Martin Morgan <mtmorgan at fredhutch.org> wrote:
>
> For many scientific applications one is really dealing with ASCII
> characters and LC_COLLATE="C", even if the user is running in non-C
> locales. What robust approaches (if any?) are available to write code that
> sorts in a locale-independent way? The Note in ?Sys.setlocale is not overly
> optimistic about setting the locale within a session.
>
> Martin Morgan
