NAKAMURA Takumi
2011-Sep-01  12:12 UTC
[LLVMdev] [cfe-dev] Unicode path handling on Windows
Guys, welcome to the weird world of i18n! We Japanese have suffered from multibyte character sets for 20 years.

I have added a comment to http://llvm.org/bugs/show_bug.cgi?id=10348 . Of course, I know it would not be a practical resolution.

FYI, it seems clang can retrieve an MBCS path with s/CP_UTF8/CP_ACP/g.

  E>bin\clang.exe -S なかむら\たくみ.c
  なかむら\たくみ.c:4:2: error: #error
  #error
   ^
  1 error generated.

You should know, though, that MBCS still has an issue:

  E>bin\clang.exe -S 表はダメ文字\表はダメ文字.c
  clang: error: no such file or directory: '表はダメ文字\表はダメ文字.c'
  clang: error: no input files

Note that "表" is represented as "0x95 0x5C" in CP932.

In principle, IMHHHO:

- argv should be treated as a "black box" byte stream.
- Don't assume "wmain(argc, wchar_t **argv)". mingw does not have one, so argv must be presented in the default codepage.
- A few codepages (e.g. cp932, Japanese Shift JIS) might contain 0x5C (\) in the 2nd (trailing) octet.

Win32 ANSI (****A) APIs assume the local codepage.

What we should do in llvm:

- Treat the path string in argv as a black box. Never parse a (char*) path string without knowing its encoding.
- UTF-8 would be useless on win32. Win32 does not handle UTF-8 implicitly anywhere.
- The Path API should hold the path string in API-native form (byte stream on unix, UCS2 wchar_t on win32).
- Paths should be manipulated in API-native form as far as possible.

In the future, we might consider "-finput-charset" and "-fexec-charset" for clang. Please consider a source file:

////////
#include "むすめは/まおちゃん.h"
char const literal[] = "俺です、俺俺";
////////

The include path (#include) should be handled as host-dependent. The literal should be interpreted with input-charset and emitted with exec-charset.

Life is too hard. Would you like to live in Japan? :p

...Takumi

2011/9/1 Nikola Smiljanic <popizdeh at gmail.com>:
> The function available in clang/lib/Basic/ConvertUTF.c deals with unsigned
> shorts, and I need wchar_t?
>
> On Thu, Sep 1, 2011 at 9:36 AM, Jean-Daniel Dupas <devlists at shadowlab.org> wrote:
>>
>> On 31 August 2011 at 21:02, Aaron Ballman wrote:
>>
>> > On Wed, Aug 31, 2011 at 1:17 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
>> >> On Wed, Aug 31, 2011 at 10:58 AM, Nikola Smiljanic <popizdeh at gmail.com> wrote:
>> >>> _wopen expects wchar_t* and the only visible function for conversion to
>> >>> utf16 is ConvertUTF8toUTF32 which converts to unsigned shorts.
>> >>
>> >> If you're in #ifdef WIN32 code, just use ConvertUTF8toUTF16 and
>> >> reinterpret_cast from unsigned short* to wchar_t*.
>> >
>> > I think the problem is that PathV2.inc is part of LLVM, and the
>> > ConvertUTF8ToUTF16 function is in an anonymous namespace. So the
>> > question becomes: raise the function into an accessible namespace,
>> > duplicate code, or find some other mechanism?
>>
>> This function is also available in clang/lib/Basic/ConvertUTF.c
>>
>> >
>> > I don't think it makes sense to raise the function out of the
>> > anonymous namespace unless it's also moved (it has nothing to do with
>> > paths per se). Perhaps it's worth it to move it to StringRef?
>> >
>> > ~Aaron
>>
>> -- Jean-Daniel
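The CP932 failure described in the message above ("表" is 0x95 0x5C) can be reproduced without any Windows API: a byte-wise search for '\' finds the trail byte of a double-byte character and splits the path in the wrong place. The following is a minimal sketch, not code from clang or LLVM. The helper names are hypothetical, and the lead-byte ranges 0x81-0x9F and 0xE0-0xFC are the usual CP932 ranges, stated here as an assumption.

```cpp
#include <cstddef>
#include <string>

// True if `b` can start a two-byte character in CP932 (Shift JIS).
// Assumed lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.
inline bool isCp932LeadByte(unsigned char b) {
  return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive search: treats every 0x5C byte as a separator, including a
// 0x5C that is really the trail byte of a double-byte character.
inline std::size_t naiveLastSeparator(const std::string &path) {
  return path.rfind('\\');
}

// Lead-byte-aware search (hypothetical helper): scans from the front
// and skips the trail byte after every lead byte, so only real
// separators are reported. Returns std::string::npos if none is found.
inline std::size_t cp932LastSeparator(const std::string &path) {
  std::size_t last = std::string::npos;
  for (std::size_t i = 0; i < path.size(); ++i) {
    unsigned char b = static_cast<unsigned char>(path[i]);
    if (isCp932LeadByte(b) && i + 1 < path.size()) {
      ++i; // skip the trail byte, even if it happens to be 0x5C
      continue;
    }
    if (b == '\\' || b == '/')
      last = i;
  }
  return last;
}
```

For the byte string 95 5C 'x' 5C 95 5C '.' 'c' (that is, 表x\表.c in CP932), the naive search reports offset 5, which is inside the second "表", while the lead-byte-aware scan reports the real separator at offset 3. The scan has to start from the beginning of the string, because a byte cannot be classified as lead or trail without knowing what preceded it.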
Ruben Van Boxem
2011-Sep-01  15:44 UTC
[LLVMdev] [cfe-dev] Unicode path handling on Windows
On 1 Sep 2011 at 14:12, "NAKAMURA Takumi" <geek4civic at gmail.com> wrote:
> In principle, IMHHHO;
>
> - argv should be treated as "blackbox" byte stream.
> - Don't assume "wmain(argc, wchar_t **argv)". mingw does not have
> one. Then, argv must be presented as the default codepage.

Correction: I believe MinGW-w64 has a Unicode startup and thus supports wmain (but of course it would be better to move this to strict API functions).

> - A few codepage (eg. cp932 Japanese shift jis) might contain
> 0x5C(\) in 2nd (trailing) octet.
>
> Win32 ANSI (****A) APIs assume local codepage.
>
> We should do in llvm;
>
> - Treat pathstring in argv as blackbox. Never parse
> (char*)pathstring without any knowledge.
> - UTF8 would be useless on win32. Win32 does not manipulate utf8
> implicitly in anywhere.
> - Path API should hold pathstring as API-native form (bytestream on
> unix, UCS2 wchar_t on win32).
> - Path should be manipulated as API-native form as possible.

Isn't it more straightforward to use UTF-8 internally, convert with the functions provided by the win32 API when calling other win32 API functions, and always call the wide versions of the win32 functions? Full compatibility guaranteed, and one encoding internally.

Ruben
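The "UTF-8 internally, UTF-16 at the API boundary" approach proposed above hinges on one conversion step. On Windows that step would normally be done by the Win32 conversion functions. The sketch below swaps in a hand-written, portable decoder so that the idea can be tried on any platform. The function name is hypothetical, and std::u16string stands in for the 16-bit wchar_t buffer that the wide Win32 functions expect.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-16 code units.
// Returns false on malformed input (bad lead or continuation byte,
// truncated or overlong sequence, surrogate, or value above U+10FFFF).
inline bool utf8ToUtf16(const std::string &in, std::u16string &out) {
  // Smallest code point that legitimately needs a sequence of each length.
  static const std::uint32_t minForLen[] = {0, 0, 0x80, 0x800, 0x10000};
  out.clear();
  for (std::size_t i = 0; i < in.size();) {
    unsigned char b0 = static_cast<unsigned char>(in[i]);
    std::uint32_t cp;
    std::size_t len;
    if (b0 < 0x80) { cp = b0; len = 1; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
    else return false; // stray continuation byte or invalid lead byte
    if (i + len > in.size())
      return false; // sequence runs past the end of the input
    for (std::size_t k = 1; k < len; ++k) {
      unsigned char bk = static_cast<unsigned char>(in[i + k]);
      if ((bk & 0xC0) != 0x80)
        return false; // not a continuation byte
      cp = (cp << 6) | (bk & 0x3F);
    }
    if (cp < minForLen[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return false; // overlong form, out of range, or encoded surrogate
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      cp -= 0x10000; // split into a surrogate pair
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    i += len;
  }
  return true;
}
```

With this kind of helper, path handling code can keep a single internal encoding and convert only immediately before a wide API call. The UTF-8 bytes E8 A1 A8 ("表") become the single code unit 0x8868, which contains no 0x5C byte at all, so the CP932 separator problem does not arise in either representation.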
Nikola Smiljanic
2011-Sep-01  20:17 UTC
[LLVMdev] [cfe-dev] Unicode path handling on Windows
AFAIK Clang internals do assume UTF-8, and llvm::sys::path converts strings to UTF-16 on Windows and calls the W API functions.

I would appreciate it if somebody could take a look at my changes and comment on them. Here's a brief explanation of what I did:

- Convert argv to UTF-8 using the current system locale on win32 (this is done as early as possible, inside ExpandArgv). This makes the driver happy, since calls to llvm::sys::path::exists succeed.
- Change calls to ::open (inside FileSystemStatCache and MemoryBuffer) to ::_wopen on win32, converting the path to UTF-16 first.
- In order to do the conversions I had to expose two functions. One of them was already there but wasn't visible; the other one was added by me.

Known issues:

- I should probably use LLVM_ON_WIN32 instead of WIN32, but this macro isn't defined inside FileSystemStatCache and MemoryBuffer for some reason. Both of these files have an #ifdef section that deals with O_BINARY, so maybe these two sections should be consolidated?
- The functions convert_multibyte_to_utf8 and convert_utf8_to_utf16 have definitions only on Windows, so every other platform is currently broken.

On Thu, Sep 1, 2011 at 5:44 PM, Ruben Van Boxem <vanboxem.ruben at gmail.com> wrote:
> Isn't it more straightforward to use utf-8 internally and use the
> conversion functions provided by the win32 API when calling other win32 API
> functions, and always call the wide versions of the win32 functions. Full
> compatibility guaranteed, and one encoding internally.
>
> Ruben

-------------- next part --------------
A non-text attachment was scrubbed...
Name: unicode_path_clang.patch
Type: application/octet-stream
Size: 1811 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20110901/b724988b/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unicode_path_llvm.patch
Type: application/octet-stream
Size: 2973 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20110901/b724988b/attachment-0001.obj>
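The ::open to ::_wopen change described in the message above could look roughly like the following. This is a minimal sketch and not the attached patch: the wrapper name openUtf8PathForRead is hypothetical, it covers only read-only opens, and it calls the Win32 function MultiByteToWideChar directly rather than the convert_utf8_to_utf16 helper named in the message, whose signature is not shown in the thread. On other platforms it simply forwards the bytes to ::open.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <string>
#ifdef _WIN32
#include <io.h>
#include <windows.h>
#endif

// Hypothetical wrapper: opens a file for reading, given a UTF-8 path.
// Returns a file descriptor, or -1 on failure.
inline int openUtf8PathForRead(const std::string &utf8Path) {
#ifdef _WIN32
  if (utf8Path.empty())
    return -1;
  const int inLen = static_cast<int>(utf8Path.size());
  // First call: ask how many wchar_t units the converted path needs.
  int n = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8Path.data(), inLen, nullptr, 0);
  if (n <= 0)
    return -1; // invalid UTF-8 or conversion failure
  std::wstring wide(static_cast<std::size_t>(n), L'\0');
  // Second call: perform the conversion into the sized buffer.
  if (::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8Path.data(),
                            inLen, &wide[0], n) != n)
    return -1;
  // Wide-character open, so the path never passes through the ANSI codepage.
  return ::_wopen(wide.c_str(), _O_RDONLY | _O_BINARY);
#else
  // Assumption: on non-Windows hosts the path bytes are passed through as-is.
  return ::open(utf8Path.c_str(), O_RDONLY);
#endif
}
```

Keeping the conversion inside one wrapper is one way to address the second known issue listed above, since callers such as FileSystemStatCache and MemoryBuffer would then not need their own #ifdef sections, and platforms without the Windows-only helpers would still build.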
NAKAMURA Takumi
2011-Sep-02  03:21 UTC
[LLVMdev] [cfe-dev] Unicode path handling on Windows
2011/9/2 Ruben Van Boxem <vanboxem.ruben at gmail.com>:
>> In principle, IMHHHO;
>>
>> - argv should be treated as "blackbox" byte stream.
>> - Don't assume "wmain(argc, wchar_t **argv)". mingw does not have
>> one. Then, argv must be presented as the default codepage.
>
> Correction: I believe MinGW-w64 has a Unicode startup and thus support for
> wmain (but of course it would be better to shift this to strict API
> functions)

Good to hear. Frankly speaking, though, I have little knowledge of the wmain() scheme...

>> We should do in llvm;
>>
>> - Treat pathstring in argv as blackbox. Never parse
>> (char*)pathstring without any knowledge.
>> - UTF8 would be useless on win32. Win32 does not manipulate utf8
>> implicitly in anywhere.
>> - Path API should hold pathstring as API-native form (bytestream on
>> unix, UCS2 wchar_t on win32).
>> - Path should be manipulated as API-native form as possible.
>
> Isn't it more straightforward to use utf-8 internally and use the conversion
> functions provided by the win32 API when calling other win32 API functions,
> and always call the wide versions of the win32 functions. Full compatibility
> guaranteed, and one encoding internally.

I could propose that if conversion of ansi->utf8 were supported by win32. Having thought about it again, holding UTF-8 internally might be an option.

...Takumi
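On the ansi->utf8 question raised above: to my knowledge Win32 offers no single call for it, but the conversion can be done in two steps through UTF-16, using MultiByteToWideChar with CP_ACP followed by WideCharToMultiByte with CP_UTF8. The following is a minimal sketch under that assumption. The function name ansiToUtf8 is hypothetical, and the non-Windows branch simply copies the input, which assumes argv already arrives in the internal encoding there.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: re-encodes a string from the active ANSI codepage
// (for example CP932 on a Japanese system) to UTF-8. Returns false if the
// input is not valid in that codepage.
inline bool ansiToUtf8(const std::string &ansi, std::string &utf8) {
#ifdef _WIN32
  utf8.clear();
  if (ansi.empty())
    return true;
  const int inLen = static_cast<int>(ansi.size());
  // Step 1: ANSI codepage -> UTF-16 (size query, then conversion).
  int wlen = ::MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, ansi.data(),
                                   inLen, nullptr, 0);
  if (wlen <= 0)
    return false;
  std::wstring wide(static_cast<std::size_t>(wlen), L'\0');
  if (::MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, ansi.data(), inLen,
                            &wide[0], wlen) != wlen)
    return false;
  // Step 2: UTF-16 -> UTF-8. The last two arguments must be null for CP_UTF8.
  int ulen = ::WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, nullptr, 0,
                                   nullptr, nullptr);
  if (ulen <= 0)
    return false;
  utf8.resize(static_cast<std::size_t>(ulen));
  return ::WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen, &utf8[0], ulen,
                               nullptr, nullptr) == ulen;
#else
  // Assumption: no codepage conversion is needed on non-Windows hosts.
  utf8 = ansi;
  return true;
#endif
}
```

One limitation to keep in mind: a file name containing characters outside the active codepage has already been damaged by the time it reaches a narrow argv, so this conversion cannot recover it. Only a wide-character startup, such as the wmain support mentioned above for MinGW-w64, avoids that loss.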