While the advice that collation order is not necessarily determined by encoding
is helpful, the advice that charToRaw should always be avoided rings false to
me, since 61 hexadecimal is the same as 97 decimal.

I am hoping someone will come along and offer useful input, such as where to
find the actual collation order implemented by the now-default
Sys.getlocale("LC_COLLATE") value of "C.UTF-8". I was under the impression that
this particular collation was supposed to order strings by the numerical
magnitude of their Unicode code points, but it does not appear to do so.
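
One way to test that impression, as a minimal sketch: compare the order that sort() produces with the order of the code points reported by utf8ToInt(). The helper name codepoint_order_matches is my own invention, and it only handles vectors of distinct single-character strings. The answer depends on the platform and on the LC_COLLATE setting in force.

```r
# Hypothetical helper: TRUE if sort() orders the given single-character
# strings by Unicode code point under the current LC_COLLATE setting.
codepoint_order_matches <- function(chars) {
  cp <- vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)
  by_codepoint <- chars[order(cp)]   # numeric order of the code points
  by_collation <- sort(chars)        # collation order of the locale in use
  identical(by_codepoint, by_collation)
}

Sys.getlocale("LC_COLLATE")
codepoint_order_matches(c("a", "B", "A", "b", "\u00e9", "Z"))
```

A FALSE here would mean that the collation in use is not plain code point order for these characters.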
On April 14, 2022 4:25:17 AM PDT, Richard O'Keefe <raoknz at gmail.com> wrote:
>To the original poster: don't even think about
>charToRaw. For one thing, the integer code that
>corresponds to "a" can be found thus:
>> library(gtools)
>> asc("a")
>97
>and the answer is (predictably) 97, not 61.
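
For what it is worth, 61 and 97 are the same value written in different bases, and base R can show both without an extra package. A small sketch using only base functions:

```r
# charToRaw() returns bytes, and R prints raw bytes in hexadecimal
b <- charToRaw("a")
print(b)                                  # displayed as 61 (hexadecimal)

as.integer(b)                             # the same byte in decimal
utf8ToInt("a")                            # the Unicode code point, base R only
identical(as.integer(b), utf8ToInt("a"))  # both are the integer 97
```

For plain ASCII characters the byte value and the code point coincide; for characters outside ASCII, charToRaw() returns several bytes while utf8ToInt() returns one code point.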
>
>> ?"<"
>...
> Comparison of strings in character vectors is lexicographic within
> the strings using the collating sequence of the locale in use: see
> 'locales'. The collating sequence of locales such as 'en_US' is
> normally different from 'C' (which should use ASCII) and can be
> surprising. Beware of making _any_ assumptions about the
> collation order
>...
>
>In a UNIX environment, the collating order R uses will
>normally match the collating order that the system
>sort(1) command uses. This is also the order that is
>used by the strcoll(3) library function. There is an
>ISO standard, not for how to compare strings, but for
>specifying the rules for how to compare strings. The
>rules can be amazingly elaborate requiring up to seven
>different passes and not all of them in the same direction.
>
>ORIGINALLY the order was lexicographical left to right
>by byte values (like the strcmp(3) library function) but
>in a world of about 6000 languages and an amazing number
>of scripts, that just doesn't match what people actually
>want to do.
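
The difference between the two kinds of ordering can be seen from within R. This sketch relies on the documented behaviour that sort() with method = "radix" orders character vectors as in the C locale, while the default method follows the collation of the locale in use. The example vector is made up.

```r
x <- c("banana", "Apple", "cherry", "apple", "Banana")

# Locale-aware order: depends on Sys.getlocale("LC_COLLATE")
sort(x)

# C-locale order: byte values are compared, so capitals precede lower case
sort(x, method = "radix")
```

In most non-C locales the first result interleaves upper and lower case, which is usually what a human reader expects.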
>
>> icuGetCollate()
>will tell you what collation rules R is following.
>> ?icuGetCollate
>will not so much tell you more than you wanted to know
>about collation as hint at it.
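
A quick way to see what a given session reports, as a sketch; the output differs between platforms and R builds, so none is shown here.

```r
# Was this build of R compiled with ICU support?
capabilities("ICU")

# The collation locale as R sees it
Sys.getlocale("LC_COLLATE")

# The collator ICU reports, which is only informative where ICU is in use
icuGetCollate()
```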
>
>These days, with Unicode and internationalisation,
>text encoding and collation are just insanely complex.
>R goes to a lot of trouble to hide this from you.
>LET IT.
>
>
>
>On Thu, 14 Apr 2022 at 13:38, Ebert,Timothy Aaron <tebert at ufl.edu> wrote:
>
>> https://en.wikipedia.org/wiki/ASCII
>> There is a table towards the end of the document. Some of the other pieces
>> may be of interest and/or relevant.
>>
>> Tim
>>
>> -----Original Message-----
>> From: R-help <r-help-bounces at r-project.org> On Behalf Of Kristjan Kure
>> Sent: Wednesday, April 13, 2022 10:06 AM
>> To: r-help at r-project.org
>> Subject: [R] Symbol/String comparison in R
>>
>> Hi!
>>
>> Sorry, I am a beginner in R.
>>
>> I was not able to find answers to my questions (tried Google, Stack
>> Overflow, etc). Please correct me if anything is wrong here.
>>
>> When comparing symbols/strings in R - raw numeric values are compared
>> symbol by symbol starting from left? If raw numeric values are not used is
>> there an ASCII / Unicode table where symbols have values/ranking/order and
>> R compares those values?
>>
>> *2) Comparing symbols*
>> Letter "a" raw value is 61, letter "b" raw value is 62? Is this correct?
>>
>> # Raw value for "a" = 61
>> a_raw <- charToRaw("a")
>> a_raw
>>
>> # Raw value for "b" = 62
>> b_raw <- charToRaw("b")
>> b_raw
>>
>> # equals TRUE
>> "a" < "b"
>>
>> Ok, so 61 is less than 62 so it's TRUE. Is this correct?
>>
>> *3) Comparing strings #1*
>> "1040" <= "12000"
>>
>> raw_1040 <- charToRaw("1040")
>> raw_1040
>> #31 *30* (comparison happens with the second symbol) 34 30
>>
>> raw_12000 <- charToRaw("12000")
>> raw_12000
>> #31 *32* (comparison happens with the second symbol) 30 30 30
>>
>> The symbol in the second position is 30 and it's less than 32. Equals to
>> true. Is this correct?
>>
>> *4) Comparing strings #2*
>> "1040" <= "10000"
>>
>> raw_1040 <- charToRaw("1040")
>> raw_1040
>> #31 30 *34* (comparison happens with third symbol) 30
>>
>> raw_10000 <- charToRaw("10000")
>> raw_10000
>> #31 30 *30* (comparison happens with third symbol) 30 30
>>
>> The symbol in the third position is 34 is greater than 30. Equals to false.
>> Is this correct?
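
The two string comparisons above can be checked directly. In this sketch the collation is switched to "C" only so that the results are reproducible; that choice is my assumption and not part of the original question. The point is that strings are compared character by character from the left, never as numbers.

```r
old <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))  # assumption: fixed collation for reproducibility

"1040" <= "12000"                    # TRUE:  "0" sorts before "2" at position 2
"1040" <= "10000"                    # FALSE: "4" sorts after "0" at position 3
"9" < "10"                           # FALSE: "9" sorts after "1" at position 1
as.numeric("9") < as.numeric("10")   # TRUE:  numeric comparison after conversion

invisible(Sys.setlocale("LC_COLLATE", old))  # restore the previous setting
```

If numeric order is what is wanted, convert with as.numeric() before comparing.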
>>
>> *5) Problem - Why does this equal FALSE?* *"A" < "a"*
>>
>> 41 < 61 # FALSE?
>>
>> # Raw value for "A" = 41
>> A_raw <- charToRaw("A")
>> A_raw
>>
>> # Raw value for "a" = 61
>> a_raw <- charToRaw("a")
>> a_raw
>>
>> Why is capitalized "A" not less than lowercase "a"? Based on raw values it
>> should be. What am I missing here?
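
The missing piece is the locale: the comparison uses the collating sequence of the locale, not the raw byte values. A small sketch that evaluates the same comparison under the session's own collation and then under "C", restoring the original setting afterwards:

```r
old <- Sys.getlocale("LC_COLLATE")

"A" < "a"    # depends on the collation named in `old`

invisible(Sys.setlocale("LC_COLLATE", "C"))
"A" < "a"    # TRUE in the C locale: byte 0x41 sorts before byte 0x61

invisible(Sys.setlocale("LC_COLLATE", old))  # restore the previous setting
```

The first result is deliberately left without an expected value because it varies between systems.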
>>
>> Thanks
>> Kristjan
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
>
> [[alternative HTML version deleted]]
>
--
Sent from my phone. Please excuse my brevity.