Hello,

1) The best I could find on lower case/upper case is [1]. The Wikipedia page you link to is about a code page, and its collating sequence is the same as ASCII, so no, that's not it.

2) In the cp1252 table "A" < "a"; it follows the numeric order 0x41 < 0x61. But what R is using is the locale's LC_COLLATE setting, not the "C" one.

How to validate the end results? The best way is to check the current setting with Sys.getlocale.

[1] https://books.google.pt/books?id=GkajBQAAQBAJ&pg=PA259&lpg=PA259&dq=collating+sequence+portuguese&source=bl&ots=fVnUYHz0ev&sig=ACfU3U3xjpJfPNcWEfvwb_2nScYb89CeOw&hl=pt-PT&sa=X&ved=2ahUKEwiAoNTW-JP3AhVI1xoKHXT-C4oQ6AF6BAgUEAM#v=onepage&q=collating%20sequence%20portuguese&f=false

Hope this helps,

Rui Barradas

At 16:33 on 14/04/2022, Kristjan Kure wrote:
> Hi Rui
>
> Thank you for the code snippet.
>
> 1) How do you find your "Portuguese_Portugal.1252" symbols table now?
> Is it this https://en.wikipedia.org/wiki/Windows-1252?
>
> 2) What attributes and values do you check to validate the end result?
> I see there is a section "Codepage layout" and I can find "A" and "a"
> symbols.
>
> What values on that table tell you "A" is bigger than "a"?
> "A" < "a" # returns FALSE
> "A" > "a" # returns TRUE
>
> PS! My locale is Estonian_Estonia.1257
>
> Regards,
> Kristjan
>
> On Thu, Apr 14, 2022 at 5:05 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:
>
> > Hello,
> >
> > This is a locale issue, you are counting on the ASCII table codes but
> > that's only valid for the "C" locale.
> >
> > old_loc <- Sys.getlocale("LC_COLLATE")
> >
> > "A" < "a"
> > #> [1] FALSE
> > "A" > "a"
> > #> [1] TRUE
> >
> > Sys.setlocale("LC_COLLATE", locale = "C")
> > #> [1] "C"
> >
> > "A" < "a"
> > #> [1] TRUE
> > "A" > "a"
> > #> [1] FALSE
> >
> > Sys.setlocale("LC_COLLATE", old_loc)
> > #> [1] "Portuguese_Portugal.1252"
> >
> > Hope this helps,
> >
> > Rui Barradas
> >
> > At 15:06 on 13/04/2022, Kristjan Kure wrote:
> > > Hi!
> > >
> > > Sorry, I am a beginner in R.
> > >
> > > I was not able to find answers to my questions (tried Google, Stack
> > > Overflow, etc). Please correct me if anything is wrong here.
> > >
> > > When comparing symbols/strings in R, are raw numeric values compared
> > > symbol by symbol starting from the left? If raw numeric values are not
> > > used, is there an ASCII / Unicode table where symbols have
> > > values/ranking/order and R compares those values?
> > >
> > > *2) Comparing symbols*
> > > Letter "a" raw value is 61, letter "b" raw value is 62? Is this correct?
> > >
> > > # Raw value for "a" = 61
> > > a_raw <- charToRaw("a")
> > > a_raw
> > >
> > > # Raw value for "b" = 62
> > > b_raw <- charToRaw("b")
> > > b_raw
> > >
> > > # equals TRUE
> > > "a" < "b"
> > >
> > > Ok, so 61 is less than 62 so it's TRUE. Is this correct?
> > >
> > > *3) Comparing strings #1*
> > > "1040" <= "12000"
> > >
> > > raw_1040 <- charToRaw("1040")
> > > raw_1040
> > > #31 *30* (comparison happens with the second symbol) 34 30
> > >
> > > raw_12000 <- charToRaw("12000")
> > > raw_12000
> > > #31 *32* (comparison happens with the second symbol) 30 30 30
> > >
> > > The symbol in the second position is 30 and it's less than 32, so the
> > > result is TRUE. Is this correct?
> > >
> > > *4) Comparing strings #2*
> > > "1040" <= "10000"
> > >
> > > raw_1040 <- charToRaw("1040")
> > > raw_1040
> > > #31 30 *34* (comparison happens with the third symbol) 30
> > >
> > > raw_10000 <- charToRaw("10000")
> > > raw_10000
> > > #31 30 *30* (comparison happens with the third symbol) 30 30
> > >
> > > The symbol in the third position is 34, which is greater than 30, so
> > > the result is FALSE. Is this correct?
> > >
> > > *5) Problem - Why does this equal FALSE?*
> > > *"A" < "a"*
> > >
> > > 41 < 61 # FALSE?
> > >
> > > # Raw value for "A" = 41
> > > A_raw <- charToRaw("A")
> > > A_raw
> > >
> > > # Raw value for "a" = 61
> > > a_raw <- charToRaw("a")
> > > a_raw
> > >
> > > Why is capitalized "A" not less than lowercase "a"? Based on raw
> > > values it should be. What am I missing here?
> > >
> > > Thanks
> > > Kristjan
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
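
The left-to-right comparison walked through in the original question (steps 3 to 5) can be sketched in R roughly as below. This is only an illustration of comparing by byte codes, which is what the "C" locale does for plain ASCII strings; it is not how R compares strings under other LC_COLLATE settings. The function name compare_bytes is a hypothetical helper, not part of R.

```r
# Hypothetical helper: compare two strings byte by byte, left to right,
# ignoring the locale. Returns -1 (x first), 0 (equal) or 1 (y first).
compare_bytes <- function(x, y) {
  bytes_x <- as.integer(charToRaw(x))
  bytes_y <- as.integer(charToRaw(y))
  n <- min(length(bytes_x), length(bytes_y))
  # positions, within the common length, where the byte codes differ
  differing <- which(bytes_x[seq_len(n)] != bytes_y[seq_len(n)])
  if (length(differing) > 0) {
    i <- differing[1]
    return(sign(bytes_x[i] - bytes_y[i]))
  }
  # all common bytes are equal: the shorter string comes first
  sign(length(bytes_x) - length(bytes_y))
}

compare_bytes("1040", "12000")  # decided at the second byte: 0x30 vs 0x32
compare_bytes("1040", "10000")  # decided at the third byte: 0x34 vs 0x30
compare_bytes("A", "a")         # 0x41 vs 0x61
```

The three calls return -1, 1 and -1 respectively, whatever the current locale is, because charToRaw() does not depend on LC_COLLATE.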
Thank you, Rui. Not sure I got everything right, but here it is:

current_loc <- Sys.getlocale("LC_COLLATE")
#> [1] "Estonian_Estonia.1257"

"A" < "a"
#41 < 61
#> [1] FALSE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61
# Not OK - should be TRUE (41 is less than 61)

"A" > "a"
#41 > 61
#> [1] TRUE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61
# Not OK - should be FALSE (41 is not bigger than 61)

Sys.setlocale("LC_COLLATE", locale = "C")

"A" < "a"
#41 < 61
#> [1] TRUE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61
# OK - (41 is less than 61)

"A" > "a"
#41 > 61
#> [1] FALSE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61
# OK - (41 is not bigger than 61)

Sys.setlocale("LC_COLLATE", current_loc)

Conclusion: Comparing strings using charToRaw() only works correctly with locale = C?

Regards,
Kristjan
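
If code-order comparison is needed only occasionally, one possible pattern is to switch LC_COLLATE temporarily and restore it afterwards. The sketch below assumes that is acceptable in your session; with_c_collation is a hypothetical helper name, not an R function. It relies on R's lazy evaluation, so the comparison passed in is only evaluated after the locale has been switched.

```r
# Hypothetical helper: evaluate an expression with LC_COLLATE set to "C",
# then restore whatever collation was active before.
with_c_collation <- function(expr) {
  old_loc <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old_loc), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  expr  # the promise is forced here, i.e. under the "C" collation
}

with_c_collation("A" < "a")
#> [1] TRUE

# sort() with method = "radix" uses C-locale ordering for character
# vectors regardless of the current LC_COLLATE setting
sort(c("b", "A", "a", "B"), method = "radix")
#> [1] "A" "B" "a" "b"
```

After the call, Sys.getlocale("LC_COLLATE") reports the same value as before, so the rest of the session keeps the usual language-aware ordering.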
Hello,

Inline.

At 22:09 on 14/04/2022, Kristjan Kure wrote:
> Conclusion: Comparing strings using charToRaw() only works correctly
> with locale = C?

You are still mistaking the locale for the ASCII code (raw).

Windows code pages like 1252 or your 1257 are supersets of ASCII, and the ASCII hex codes make a lot of sense. The upper case and lower case letters are 2^5 == 32 == 0x20 apart, so set bit 5 (counting from 0) to go from upper to lower case:

"A": 0100 0001 == 0x41
"a": 0110 0001 == 0x61
"B": 0100 0010
"b": 0110 0010
etc.

This only relates to human alphabets and languages because it's an attempt to make an electronic code usable to transmit/record/retrieve text in a human readable way. But each language's lexicographic order need not follow this encoding's order, even if the encoding is what is used to record the text electronically.

In the example below you'll see that changing the locale does not change the numeric codes. Comparing strings using charToRaw() only works correctly if what you want is to compare codes, not letters (in the sense of human writing).

old_loc <- Sys.getlocale("LC_COLLATE")

# hexadecimal base integers
raw_A <- charToRaw("A") # 0x41
raw_a <- charToRaw("a") # 0x61
raw_A < raw_a
#> [1] TRUE
raw_A > raw_a
#> [1] FALSE
as.integer(raw_A)
#> [1] 65
as.integer(raw_a)
#> [1] 97

Sys.setlocale("LC_COLLATE", locale = "C")
#> [1] "C"

(C_raw_A <- charToRaw("A")) # 0x41
#> [1] 41
(C_raw_a <- charToRaw("a")) # 0x61
#> [1] 61
C_raw_A < C_raw_a
#> [1] TRUE
C_raw_A > C_raw_a
#> [1] FALSE

identical(raw_A, C_raw_A)
#> [1] TRUE
identical(raw_a, C_raw_a)
#> [1] TRUE

Sys.setlocale("LC_COLLATE", old_loc)
#> [1] "Portuguese_Portugal.1252"

Hope this helps,

Rui Barradas
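
The 0x20 relation between upper and lower case ASCII letters described above can be checked with R's bitwise functions. This is a small sketch only: to_lower_ascii and to_upper_ascii are hypothetical names, and they are meant for single unaccented letters A-Z / a-z, not for general text (use tolower() and toupper() for that).

```r
# Hypothetical helpers: flip case of a single ASCII letter via bit 5 (0x20)
to_lower_ascii <- function(ch) {
  code <- utf8ToInt(ch)            # e.g. "A" -> 65 (0x41)
  intToUtf8(bitwOr(code, 0x20L))   # set bit 5: 0x41 -> 0x61
}

to_upper_ascii <- function(ch) {
  code <- utf8ToInt(ch)            # e.g. "a" -> 97 (0x61)
  intToUtf8(bitwAnd(code, 0xDFL))  # clear bit 5: 0x61 -> 0x41
}

to_lower_ascii("A")
#> [1] "a"
to_upper_ascii("b")
#> [1] "B"
utf8ToInt("a") - utf8ToInt("A")
#> [1] 32
```

These results are the same in every locale, which is the point made above: the codes are fixed, while the ordering used by "<" and sort() depends on LC_COLLATE.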