Locale-specific is not what we want, but I don't believe Windows exposes an alternative API that does what we want. (Does CharLower give a different answer than tolower?) However, looking over the FileManager code a little more I'm not even sure using the path is the best solution, it seems it would be better to use inode like the unix code does. Windows doesn't support inode (the s_ino field in stat), but it does have nFileIndexHigh/nFileIndexLow which are exposed via ::GetFileInformationByHandle and is basically the same thing. This would require refactoring FileSystemStatCache to use a new structure in place of stat that could be shared between Windows and Unix. This would be a lot of small changes but seems like it would be fairly straightforward. On Oct 3, 2011, at 9:21 AM, Nikola Smiljanic wrote:> towlower doesn't seem to work with my test string in Cyrillic. This function does locale-specific conversion, is this what we want? > > Here's the whole thing, with all the calls to ::stat replaced with llvm::sys::fs::Stat. > > On Fri, Sep 30, 2011 at 8:04 PM, Bryce Cogswell <bryceco at gmail.com> wrote: > You can use _iswupper and _towlower instead of CharLowerBuffW. They don't require windows.h and work with /Za. > > On Sep 30, 2011, at 12:24 AM, Nikola Smiljanic wrote: > >> I tried to do the conversion to lowercase inside GetFullPath by adding an additional bool parameter to this function. It's not perfect but seems much better than repeating the whole UTF8 to UTF16 and UTF16 to UTF8 conversion again. The problem I have is with access to CharLowerBuffW. I need Windows.h for this function but when I try to include it I get a bunch of errors because Language Extensions are disabled, /Za switch. Do I just enable them and include Windows.h inside and #ifdef section? >> >> On Thu, Sep 29, 2011 at 7:57 AM, Bryce Cogswell <bryceco at gmail.com> wrote: >> I agree they are broken on all platforms. However, FileManager.cpp already contains an #if WIN32 conditional around the code calling LowercaseString, so you can use MultiByteToWideChar and CharLowerBuffW directly there, and not call the LowercaseString function. I don't think there are any other places where LowercaseString is called with non-ascii data, so you can punt on fixing it for now. >> >> On Sep 28, 2011, at 10:11 PM, Nikola Smiljanic wrote: >> >>> I have a problem with Lowercase and Uppercase functions. These are broken on all platforms, not only Window, so I can't just #ifdef and use CharLowerBuffW. I need a portable way to convert from UTF8 to UTF16. There is set of functions inside clang/Basic/ConvertUTF, but LLVM can't depend on this. What do I do? >>> >>> On Tue, Sep 27, 2011 at 5:09 AM, Bryce Cogswell <bryceco at yahoo.com> wrote: >>> I think the assert you have for _stat64i32 is fine. It is a constant expression so should compile to nothing, and the chance of the definition changing is pretty much zero. >>> >>> LowercaseString appears to be used by ASM printers where the output is expected to be ASCII, and then some WIN32-conditioned code in FileManager.cpp. I hate to say it but you'll probably need to convert the UTF-8 paths to wide char, lower case it using CharLowerBuffW (since it needs to match the casing rules used by NTFS), and then convert back to UTF-8. >>> >>> If you need to be pedantic about recognizing whether two paths are the same on Windows you also need to call GetFullPathName in order to expand any 8.3 path components, but this is an expensive function so I wouldn't do it unless absolutely necessary. >>> >>> There is also a call to _fullpath in there that needs to be changed to _wfullpath. >>> >>> The rest of the patch looks good. >>> >>> On Sep 23, 2011, at 9:57 AM, Nikola Smiljanic wrote: >>> >>>> Here's a new patch that fixes all the issues mentioned before. Note that this isn't final, I didn't want to replace all calls to ::stat so that it's easier to review. >>>> >>>> I have only one more questions: >>>> >>>> 1. _wopen accepts _stat64i32 instead of stat structure. These two are exactly the same, the only difference is that stat uses time_t and _stat64i32 uses __time64_t (time_t is a typedef for __time64_t but this depends on one more macro). I check to see that size of these two structures is the same, but is there something more that I could do? I'd also like to use some kind of static assert instead of runtime assert. >>>> >>>> I've also noticed that function LowercaseString (the same applies to UppercaseString) doesn't work with UTF8 input. It's calling isupper and tolower and these functions can't handle UTF8. This should get a new issue probably? >>>> >>>> On Wed, Sep 21, 2011 at 1:41 AM, Aaron Ballman <aaron at aaronballman.com> wrote: >>>> On Tue, Sep 20, 2011 at 4:15 PM, Nikola Smiljanic <popizdeh at gmail.com> wrote: >>>> > OK since this approach makes sense I'll shoot with my questions :) >>>> > 1. Where should get_utf8_argv go and is the name of this function OK? Right >>>> > now the function is inside llvm::sys::fs namespace because I need access to >>>> > Windows.h, should I leave it there. >>>> >>>> I don't think it belongs there as it has nothing to do with the file >>>> system. However, I'm not familiar enough to know of a better place >>>> for it either. Hopefully someone else can chime in with an idea. >>>> >>>> > 2. get_utf8_argv allocates new char** representing utf8 argv, should I >>>> > deallocate this inside driver.cpp, use std::vector<std::string> instead or >>>> > ignore the allocation completely? >>>> >>>> Perhaps some SmallVectors would make sense here? An inner one to hold >>>> the arg data, and an outer one to hold the list of inner vectors. >>>> Then you can do the cleanup within the call. >>>> >>>> > 3. There is an #ifdef inside driver.cpp right now that I don't like. >>>> > get_utf8_argv works for windows only, but I could change the function to >>>> > accept argv and return it as vector of utf8 strings. It would convert char* >>>> > to std::string on unix and use GetCommandLine + utf8 conversion on windows. >>>> >>>> I have no concrete opinion one way or the other on this. >>>> >>>> > 4. Should I wrap functions like ::fstat and ::close for consistency even >>>> > though they don't work with paths? >>>> >>>> I don't believe so. >>>> >>>> > I'll fix everything that is formatting related. You're right, current patch >>>> > has Open only for windows but I'll add the other one as well. >>>> >>>> Thanks for the help! >>>> >>>> ~Aaron >>>> >>>> <clang.patch><llvm.patch> >>> >>> >> >> > > > <clang.patch><llvm.patch>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111003/f354a1fa/attachment.html>
On Mon, Oct 3, 2011 at 1:43 PM, Bryce Cogswell <bryceco at gmail.com> wrote:> However, looking over the FileManager code a little more I'm not even sure > using the path is the best solution, it seems it would be better to use > inode like the unix code does. Windows doesn't support inode (the s_ino > field in stat), but it does have nFileIndexHigh/nFileIndexLow which are > exposed via ::GetFileInformationByHandle and is basically the same thing.One question I would have about that is whether stat actually opens the file under the hood. GetFileInformationByHandle requires an open file handle, which could be expensive depending on the situation. I'd worry slightly about a performance hit (networks come to mind). If the perf is equivalent, then I would agree that nFileIndexHigh/Low will work well as a Win32 version of the inode. ~Aaron
Nikola Smiljanic
2011-Oct-03 19:19 UTC
[LLVMdev] [cfe-dev] Unicode path handling on Windows
CharLowerW does the right thing. But I still need Windows.h to use it :) On Mon, Oct 3, 2011 at 8:43 PM, Bryce Cogswell <bryceco at gmail.com> wrote:> Locale-specific is not what we want, but I don't believe Windows exposes an > alternative API that does what we want. (Does CharLower give a different > answer than tolower?) > > However, looking over the FileManager code a little more I'm not even sure > using the path is the best solution, it seems it would be better to use > inode like the unix code does. Windows doesn't support inode (the s_ino > field in stat), but it does have nFileIndexHigh/nFileIndexLow which are > exposed via ::GetFileInformationByHandle and is basically the same thing. > This would require refactoring FileSystemStatCache to use a new structure in > place of stat that could be shared between Windows and Unix. This would be a > lot of small changes but seems like it would be fairly straightforward. > > > On Oct 3, 2011, at 9:21 AM, Nikola Smiljanic wrote: > > towlower doesn't seem to work with my test string in Cyrillic. This > function does locale-specific conversion, is this what we want? > > Here's the whole thing, with all the calls to ::stat replaced with > llvm::sys::fs::Stat. > > On Fri, Sep 30, 2011 at 8:04 PM, Bryce Cogswell <bryceco at gmail.com> wrote: > >> You can use _iswupper and _towlower instead of CharLowerBuffW. They don't >> require windows.h and work with /Za. >> >> On Sep 30, 2011, at 12:24 AM, Nikola Smiljanic wrote: >> >> I tried to do the conversion to lowercase inside GetFullPath by adding an >> additional bool parameter to this function. It's not perfect but seems much >> better than repeating the whole UTF8 to UTF16 and UTF16 to UTF8 conversion >> again. The problem I have is with access to CharLowerBuffW. I need Windows.h >> for this function but when I try to include it I get a bunch of errors >> because Language Extensions are disabled, /Za switch. Do I just enable them >> and include Windows.h inside and #ifdef section? >> >> On Thu, Sep 29, 2011 at 7:57 AM, Bryce Cogswell <bryceco at gmail.com>wrote: >> >>> I agree they are broken on all platforms. However, FileManager.cpp >>> already contains an #if WIN32 conditional around the code calling >>> LowercaseString, so you can use MultiByteToWideChar and CharLowerBuffW >>> directly there, and not call the LowercaseString function. I don't think >>> there are any other places where LowercaseString is called with non-ascii >>> data, so you can punt on fixing it for now. >>> >>> On Sep 28, 2011, at 10:11 PM, Nikola Smiljanic wrote: >>> >>> I have a problem with Lowercase and Uppercase functions. These are broken >>> on all platforms, not only Window, so I can't just #ifdef and use CharLowerBuffW. >>> I need a portable way to convert from UTF8 to UTF16. There is set of >>> functions inside clang/Basic/ConvertUTF, but LLVM can't depend on this. What >>> do I do? >>> >>> On Tue, Sep 27, 2011 at 5:09 AM, Bryce Cogswell <bryceco at yahoo.com>wrote: >>> >>>> I think the assert you have for _stat64i32 is fine. It is a constant >>>> expression so should compile to nothing, and the chance of the definition >>>> changing is pretty much zero. >>>> >>>> LowercaseString appears to be used by ASM printers where the output is >>>> expected to be ASCII, and then some WIN32-conditioned code in >>>> FileManager.cpp. I hate to say it but you'll probably need to convert the >>>> UTF-8 paths to wide char, lower case it using CharLowerBuffW (since it needs >>>> to match the casing rules used by NTFS), and then convert back to UTF-8. >>>> >>>> If you need to be pedantic about recognizing whether two paths are the >>>> same on Windows you also need to call GetFullPathName in order to expand any >>>> 8.3 path components, but this is an expensive function so I wouldn't do it >>>> unless absolutely necessary. >>>> >>>> There is also a call to _fullpath in there that needs to be changed to >>>> _wfullpath. >>>> >>>> The rest of the patch looks good. >>>> >>>> On Sep 23, 2011, at 9:57 AM, Nikola Smiljanic wrote: >>>> >>>> Here's a new patch that fixes all the issues mentioned before. Note that >>>> this isn't final, I didn't want to replace all calls to ::stat so that it's >>>> easier to review. >>>> >>>> I have only one more questions: >>>> >>>> 1. _wopen accepts _stat64i32 instead of stat structure. These two are >>>> exactly the same, the only difference is that stat uses time_t >>>> and _stat64i32 uses __time64_t (time_t is a typedef for __time64_t but this >>>> depends on one more macro). I check to see that size of these two structures >>>> is the same, but is there something more that I could do? I'd also like to >>>> use some kind of static assert instead of runtime assert. >>>> >>>> I've also noticed that function LowercaseString (the same applies to >>>> UppercaseString) doesn't work with UTF8 input. It's calling isupper and >>>> tolower and these functions can't handle UTF8. This should get a new issue >>>> probably? >>>> >>>> On Wed, Sep 21, 2011 at 1:41 AM, Aaron Ballman <aaron at aaronballman.com>wrote: >>>> >>>>> On Tue, Sep 20, 2011 at 4:15 PM, Nikola Smiljanic <popizdeh at gmail.com> >>>>> wrote: >>>>> > OK since this approach makes sense I'll shoot with my questions :) >>>>> > 1. Where should get_utf8_argv go and is the name of this function OK? >>>>> Right >>>>> > now the function is inside llvm::sys::fs namespace because I need >>>>> access to >>>>> > Windows.h, should I leave it there. >>>>> >>>>> I don't think it belongs there as it has nothing to do with the file >>>>> system. However, I'm not familiar enough to know of a better place >>>>> for it either. Hopefully someone else can chime in with an idea. >>>>> >>>>> > 2. get_utf8_argv allocates new char** representing utf8 argv, should >>>>> I >>>>> > deallocate this inside driver.cpp, use std::vector<std::string> >>>>> instead or >>>>> > ignore the allocation completely? >>>>> >>>>> Perhaps some SmallVectors would make sense here? An inner one to hold >>>>> the arg data, and an outer one to hold the list of inner vectors. >>>>> Then you can do the cleanup within the call. >>>>> >>>>> > 3. There is an #ifdef inside driver.cpp right now that I don't like. >>>>> > get_utf8_argv works for windows only, but I could change the function >>>>> to >>>>> > accept argv and return it as vector of utf8 strings. It would convert >>>>> char* >>>>> > to std::string on unix and use GetCommandLine + utf8 conversion on >>>>> windows. >>>>> >>>>> I have no concrete opinion one way or the other on this. >>>>> >>>>> > 4. Should I wrap functions like ::fstat and ::close for consistency >>>>> even >>>>> > though they don't work with paths? >>>>> >>>>> I don't believe so. >>>>> >>>>> > I'll fix everything that is formatting related. You're right, current >>>>> patch >>>>> > has Open only for windows but I'll add the other one as well. >>>>> >>>>> Thanks for the help! >>>>> >>>>> ~Aaron >>>>> >>>> >>>> <clang.patch><llvm.patch> >>>> >>>> >>>> >>> >>> >> >> > <clang.patch><llvm.patch> > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111003/0636cfb8/attachment.html>
On Oct 3, 2011, at 11:53 AM, Aaron Ballman wrote:> On Mon, Oct 3, 2011 at 1:43 PM, Bryce Cogswell <bryceco at gmail.com> wrote: >> However, looking over the FileManager code a little more I'm not even sure >> using the path is the best solution, it seems it would be better to use >> inode like the unix code does. Windows doesn't support inode (the s_ino >> field in stat), but it does have nFileIndexHigh/nFileIndexLow which are >> exposed via ::GetFileInformationByHandle and is basically the same thing. > > One question I would have about that is whether stat actually opens > the file under the hood. GetFileInformationByHandle requires an open > file handle, which could be expensive depending on the situation. I'd > worry slightly about a performance hit (networks come to mind). If > the perf is equivalent, then I would agree that nFileIndexHigh/Low > will work well as a Win32 version of the inode. > > ~AaronOur current unix code calls open followed by fstat, so Win32 perf should be similar.
Right, but maybe if you switch to using tolower_l() and pass an appropriate locale you can get it to work the same way. I'm not sure what locale that would have to be, but it needs to match whatever NTFS uses for its $upcase file. On Oct 3, 2011, at 12:19 PM, Nikola Smiljanic wrote:> CharLowerW does the right thing. But I still need Windows.h to use it :) > > On Mon, Oct 3, 2011 at 8:43 PM, Bryce Cogswell <bryceco at gmail.com> wrote: > Locale-specific is not what we want, but I don't believe Windows exposes an alternative API that does what we want. (Does CharLower give a different answer than tolower?) > > However, looking over the FileManager code a little more I'm not even sure using the path is the best solution, it seems it would be better to use inode like the unix code does. Windows doesn't support inode (the s_ino field in stat), but it does have nFileIndexHigh/nFileIndexLow which are exposed via ::GetFileInformationByHandle and is basically the same thing. This would require refactoring FileSystemStatCache to use a new structure in place of stat that could be shared between Windows and Unix. This would be a lot of small changes but seems like it would be fairly straightforward. > > > On Oct 3, 2011, at 9:21 AM, Nikola Smiljanic wrote: > >> towlower doesn't seem to work with my test string in Cyrillic. This function does locale-specific conversion, is this what we want? >> >> Here's the whole thing, with all the calls to ::stat replaced with llvm::sys::fs::Stat. >> >> On Fri, Sep 30, 2011 at 8:04 PM, Bryce Cogswell <bryceco at gmail.com> wrote: >> You can use _iswupper and _towlower instead of CharLowerBuffW. They don't require windows.h and work with /Za. >> >> On Sep 30, 2011, at 12:24 AM, Nikola Smiljanic wrote: >> >>> I tried to do the conversion to lowercase inside GetFullPath by adding an additional bool parameter to this function. It's not perfect but seems much better than repeating the whole UTF8 to UTF16 and UTF16 to UTF8 conversion again. The problem I have is with access to CharLowerBuffW. I need Windows.h for this function but when I try to include it I get a bunch of errors because Language Extensions are disabled, /Za switch. Do I just enable them and include Windows.h inside and #ifdef section? >>> >>> On Thu, Sep 29, 2011 at 7:57 AM, Bryce Cogswell <bryceco at gmail.com> wrote: >>> I agree they are broken on all platforms. However, FileManager.cpp already contains an #if WIN32 conditional around the code calling LowercaseString, so you can use MultiByteToWideChar and CharLowerBuffW directly there, and not call the LowercaseString function. I don't think there are any other places where LowercaseString is called with non-ascii data, so you can punt on fixing it for now. >>> >>> On Sep 28, 2011, at 10:11 PM, Nikola Smiljanic wrote: >>> >>>> I have a problem with Lowercase and Uppercase functions. These are broken on all platforms, not only Window, so I can't just #ifdef and use CharLowerBuffW. I need a portable way to convert from UTF8 to UTF16. There is set of functions inside clang/Basic/ConvertUTF, but LLVM can't depend on this. What do I do? >>>> >>>> On Tue, Sep 27, 2011 at 5:09 AM, Bryce Cogswell <bryceco at yahoo.com> wrote: >>>> I think the assert you have for _stat64i32 is fine. It is a constant expression so should compile to nothing, and the chance of the definition changing is pretty much zero. >>>> >>>> LowercaseString appears to be used by ASM printers where the output is expected to be ASCII, and then some WIN32-conditioned code in FileManager.cpp. I hate to say it but you'll probably need to convert the UTF-8 paths to wide char, lower case it using CharLowerBuffW (since it needs to match the casing rules used by NTFS), and then convert back to UTF-8. >>>> >>>> If you need to be pedantic about recognizing whether two paths are the same on Windows you also need to call GetFullPathName in order to expand any 8.3 path components, but this is an expensive function so I wouldn't do it unless absolutely necessary. >>>> >>>> There is also a call to _fullpath in there that needs to be changed to _wfullpath. >>>> >>>> The rest of the patch looks good. >>>> >>>> On Sep 23, 2011, at 9:57 AM, Nikola Smiljanic wrote: >>>> >>>>> Here's a new patch that fixes all the issues mentioned before. Note that this isn't final, I didn't want to replace all calls to ::stat so that it's easier to review. >>>>> >>>>> I have only one more questions: >>>>> >>>>> 1. _wopen accepts _stat64i32 instead of stat structure. These two are exactly the same, the only difference is that stat uses time_t and _stat64i32 uses __time64_t (time_t is a typedef for __time64_t but this depends on one more macro). I check to see that size of these two structures is the same, but is there something more that I could do? I'd also like to use some kind of static assert instead of runtime assert. >>>>> >>>>> I've also noticed that function LowercaseString (the same applies to UppercaseString) doesn't work with UTF8 input. It's calling isupper and tolower and these functions can't handle UTF8. This should get a new issue probably? >>>>> >>>>> On Wed, Sep 21, 2011 at 1:41 AM, Aaron Ballman <aaron at aaronballman.com> wrote: >>>>> On Tue, Sep 20, 2011 at 4:15 PM, Nikola Smiljanic <popizdeh at gmail.com> wrote: >>>>> > OK since this approach makes sense I'll shoot with my questions :) >>>>> > 1. Where should get_utf8_argv go and is the name of this function OK? Right >>>>> > now the function is inside llvm::sys::fs namespace because I need access to >>>>> > Windows.h, should I leave it there. >>>>> >>>>> I don't think it belongs there as it has nothing to do with the file >>>>> system. However, I'm not familiar enough to know of a better place >>>>> for it either. Hopefully someone else can chime in with an idea. >>>>> >>>>> > 2. get_utf8_argv allocates new char** representing utf8 argv, should I >>>>> > deallocate this inside driver.cpp, use std::vector<std::string> instead or >>>>> > ignore the allocation completely? >>>>> >>>>> Perhaps some SmallVectors would make sense here? An inner one to hold >>>>> the arg data, and an outer one to hold the list of inner vectors. >>>>> Then you can do the cleanup within the call. >>>>> >>>>> > 3. There is an #ifdef inside driver.cpp right now that I don't like. >>>>> > get_utf8_argv works for windows only, but I could change the function to >>>>> > accept argv and return it as vector of utf8 strings. It would convert char* >>>>> > to std::string on unix and use GetCommandLine + utf8 conversion on windows. >>>>> >>>>> I have no concrete opinion one way or the other on this. >>>>> >>>>> > 4. Should I wrap functions like ::fstat and ::close for consistency even >>>>> > though they don't work with paths? >>>>> >>>>> I don't believe so. >>>>> >>>>> > I'll fix everything that is formatting related. You're right, current patch >>>>> > has Open only for windows but I'll add the other one as well. >>>>> >>>>> Thanks for the help! >>>>> >>>>> ~Aaron >>>>> >>>>> <clang.patch><llvm.patch> >>>> >>>> >>> >>> >> >> >> <clang.patch><llvm.patch> > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111003/0dcc48b1/attachment.html>
That should be fine. I don't believe the concern about performing a char-by-char conversion is valid; for example the NTFS-3G driver uses a simplistic upcase table and seems to work fine. I suspect Windows does the same. On Oct 3, 2011, at 1:12 PM, Nikola Smiljanic wrote:> How about this: > > for (int i = 0; i != NumWChars; ++i) > absPath[i] = std::tolower(absPath[i], std::locale()); > > seems to be working just fine? > > On Mon, Oct 3, 2011 at 9:27 PM, Bryce Cogswell <bryceco at gmail.com> wrote: > Right, but maybe if you switch to using tolower_l() and pass an appropriate locale you can get it to work the same way. I'm not sure what locale that would have to be, but it needs to match whatever NTFS uses for its $upcase file.-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111003/95923b63/attachment.html>
Maybe Matching Threads
- [LLVMdev] [cfe-dev] Unicode path handling on Windows
- [LLVMdev] [cfe-dev] Unicode path handling on Windows
- [LLVMdev] [cfe-dev] Unicode path handling on Windows
- [LLVMdev] [cfe-dev] Unicode path handling on Windows
- [LLVMdev] [cfe-dev] Unicode path handling on Windows