thr3ads.net - llvm dev - [LLVMdev] [cfe-dev] Unicode path handling on Windows [Oct 2011]

Joachim Durchholz

2011-Oct-03 20:59 UTC

[LLVMdev] [cfe-dev] Unicode path handling on Windows

On 03.10.2011 22:12, Nikola Smiljanic wrote:
> How about this:
>
> for (int i = 0; i != NumWChars; ++i)
>          absPath[i] = std::tolower(absPath[i], std::locale());
>
> seems to be working just fine?

You are making two assumptions here:

Assumption 1: For each lowercase character, there is an equivalent 
uppercase character, and vice versa.
This is not true in half a dozen languages according to
ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt .

Assumption 2: The transformation from lower case to upper case can be 
done for each character individually, without considering context.
This is not true in a couple of languages according to SpecialCasing.txt.

Do not do that. If you get complaints, they will be about scripts that 
you can't type on your keyboard, and that you know nothing about so you 
don't even know what the right behaviour would have been.
Rely on the relevant Unicode library. Which one that would be, and which 
functions to call, depends on what you need that to-lowercase 
transformation for. (It also depends on whether the names you get are 
already normalized or not; I'd want to run a normalization pass on the 
names first just to be on the safe side.)
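
To make the two assumptions concrete, here is a minimal sketch, not a real casing implementation. The helper name lowerSketch is hypothetical, it hard-codes only two rules (the U+0130 mapping listed in SpecialCasing.txt and a crude stand-in for the Final_Sigma condition), and it lowercases nothing else beyond ASCII A-Z. It only illustrates why a per-character `wchar_t -> wchar_t` function cannot express these cases: one input character may produce two, and the result may depend on position.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only. Real code should call a
// Unicode library instead of hard-coding rules like this.
std::u32string lowerSketch(const std::u32string &in) {
  const char32_t kCapitalIWithDot = 0x0130; // LATIN CAPITAL LETTER I WITH DOT ABOVE
  const char32_t kCapitalSigma = 0x03A3;    // GREEK CAPITAL LETTER SIGMA
  const char32_t kSmallSigma = 0x03C3;      // GREEK SMALL LETTER SIGMA
  const char32_t kFinalSigma = 0x03C2;      // GREEK SMALL LETTER FINAL SIGMA

  std::u32string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    const char32_t c = in[i];
    if (c == kCapitalIWithDot) {
      // Assumption 1 fails: one character lowercases to two
      // ('i' followed by COMBINING DOT ABOVE), so the length changes.
      out.push_back(0x0069);
      out.push_back(0x0307);
    } else if (c == kCapitalSigma) {
      // Assumption 2 fails: the result depends on context. This test
      // ("last character, and not the first") is a simplification of
      // the Final_Sigma condition, assumed here for brevity.
      const bool atEnd = (i > 0 && i + 1 == in.size());
      out.push_back(atEnd ? kFinalSigma : kSmallSigma);
    } else if (c >= U'A' && c <= U'Z') {
      out.push_back(c + (U'a' - U'A')); // plain ASCII lowercase
    } else {
      out.push_back(c); // everything else is passed through unchanged
    }
  }
  return out;
}
```

Note that the output string can be longer than the input, so the in-place loop over absPath[i] has no room for the correct result even if the mapping table were complete.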

Regards,
Jo

Török Edwin

2011-Oct-03 21:18 UTC

On 10/03/2011 11:59 PM, Joachim Durchholz wrote:
> On 03.10.2011 22:12, Nikola Smiljanic wrote:
>> How about this:
>>
>> for (int i = 0; i != NumWChars; ++i)
>>          absPath[i] = std::tolower(absPath[i], std::locale());
>>
>> seems to be working just fine?
> 
> You have two assumptions here:
> 
> Assumption 1: For each lowercase character, there is an equivalent 
> uppercase character, and vice versa.
> This is not true in half a dozen languages according to
> ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt .
> 
> Assumption 2: The transformation from lower case to upper case can be 
> done for each character individually, without considering context.
> This is not true in a couple of languages according to SpecialCasing.txt.
> 
> Do not do that. If you get complaints, they will be about scripts that 
> you can't type on your keyboard, and that you know nothing about so you
> don't even know what the right behaviour would have been.
> Rely on the relevant Unicode library. Which one that would be, and which 
> functions to call, depends on what you need that to-lowercase 
> transformation for. (It also depends on whether the names you get are 
> already normalized or not; I'd want to run a normalization pass on the 
> names first just to be on the safe side.)

Does Windows do proper Unicode to-lowercase, or does it just lowercase A-Z?
From reading the article below, I gather that you can create filenames that
would be considered identical under Unicode to-lowercase rules, yet exist as
different files:
https://blogs.msdn.com/b/michkap/archive/2005/10/17/481600.aspx

Best regards,
--Edwin

Joachim Durchholz

2011-Oct-03 21:42 UTC

On 03.10.2011 23:18, Török Edwin wrote:
> I get that you can create filenames that would be considered
> identical under Unicode to-lowercase rules, but yet they exist as different
> files:

Hehe, I can imagine that.
That's why I was proposing to simply ask the filesystem.

Though in hindsight, I may have been too hasty - the question is: what is 
that to-lowercase transformation needed for?
The right course of action definitely depends on that. Unicode is too 
complicated for the simple answers (and there are good reasons for that).

Regards,
Jo
