thr3ads.net - llvm dev - [LLVMdev] [cfe-dev] Unicode path handling on Windows [Sep 2011]

NAKAMURA Takumi

2011-Sep-01 12:12 UTC

[LLVMdev] [cfe-dev] Unicode path handling on Windows

Guys, welcome to the all-too-weird world of i18n!
We Japanese have suffered from multibyte charsets for 20 years.

I have added a comment to http://llvm.org/bugs/show_bug.cgi?id=10348 .
Of course, I don't think it would be a practical resolution.

FYI, it seems clang can retrieve an MBCS path after s/CP_UTF8/CP_ACP/g.

E>bin\clang.exe -S なかむら\たくみ.c
なかむら\たくみ.c:4:2: error: #error
#error
 ^
1 error generated.

Note, though, that MBCS still has an issue:

E>bin\clang.exe -S 表はダメ文字\表はダメ文字.c
clang: error: no such file or directory: '表はダメ文字\表はダメ文字.c'
clang: error: no input files

Note, "表" is represented as "0x95 0x5C" in CP932.
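
The clash can be reproduced outside clang. What follows is a minimal sketch in Python (not LLVM code), assuming Python's "cp932" codec matches the Windows ANSI codepage 932 for these characters; the variable names are made up for illustration. It shows that a byte-level search for backslash finds a separator inside the character itself, which is the likely reason the path above is not found.

```python
# Assumption: Python's "cp932" codec stands in for Windows codepage 932.
path = "表はダメ文字\\表はダメ文字.c"
raw = path.encode("cp932")

# The trail byte of this character is 0x5C, the same value as ASCII backslash.
hyou = "表".encode("cp932")

# A naive byte-level split treats every 0x5C as a path separator.
naive_parts = raw.split(b"\\")

# Splitting the decoded text sees only the one real separator.
real_parts = path.split("\\")

print(hyou.hex(" "))
print(len(naive_parts), "naive components vs", len(real_parts), "real ones")
```

The naive split cuts the first component off after the lead byte 0x95, so no component matches a real directory name.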

In principle, IMHHHO:

  - argv should be treated as a "blackbox" byte stream.
  - Don't assume "wmain(argc, wchar_t **argv)". mingw does not have
one, so argv must be presented in the default codepage.
  - A few codepages (e.g. cp932, Japanese Shift JIS) might contain
0x5C (\) in the 2nd (trailing) octet.

Win32 ANSI (****A) APIs assume the local codepage.

What we should do in llvm:

  - Treat a pathstring in argv as a blackbox. Never parse
(char*)pathstring without knowing its encoding.
  - UTF8 would be useless on win32. Win32 does not handle utf8
implicitly anywhere.
  - The Path API should hold a pathstring in API-native form (bytestream on
unix, UCS2 wchar_t on win32).
  - Paths should be manipulated in API-native form as far as possible.
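
The "API-native form" idea in the list above can be sketched as follows. This is an illustration in Python under stated assumptions, not a proposal for the actual llvm::sys::path code: to_native_units and split_native are hypothetical helpers, the "cp932" codec stands in for the ANSI codepage, and the decode step stands in for what MultiByteToWideChar(CP_ACP, ...) would do on Win32.

```python
def to_native_units(raw, is_win32, ansi_codepage="cp932"):
    """Hypothetical helper: turn argv bytes into API-native units."""
    if is_win32:
        # Stand-in for MultiByteToWideChar(CP_ACP, ...): ANSI bytes -> UTF-16.
        utf16 = raw.decode(ansi_codepage).encode("utf-16-le")
        return [int.from_bytes(utf16[i:i + 2], "little")
                for i in range(0, len(utf16), 2)]
    # On Unix the path stays an opaque byte stream.
    return list(raw)


def split_native(units, separator):
    """Split a sequence of native units on one separator unit."""
    parts, current = [], []
    for unit in units:
        if unit == separator:
            parts.append(current)
            current = []
        else:
            current.append(unit)
    parts.append(current)
    return parts


# Win32: split on the UTF-16 code unit 0x005C, never on a raw 0x5C byte.
win_raw = "表はダメ文字\\表はダメ文字.c".encode("cp932")
win_parts = split_native(to_native_units(win_raw, is_win32=True), 0x005C)

# Unix: split the byte stream on '/' (0x2F).
unix_raw = "むすめは/まおちゃん.h".encode("utf-8")
unix_parts = split_native(to_native_units(unix_raw, is_win32=False), 0x2F)

print(len(win_parts), len(unix_parts))
```

Once the path is in 16-bit units, the 0x5C trail byte no longer exists as a separate unit, so the split finds only the real separator.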

In future, we might consider "-finput-charset" and
"-fexec-charset" for clang.
Please consider a source file:

////////
#include "むすめは/まおちゃん.h"
char const literal[] = "俺です、俺俺";
////////

The include path (#include) should be handled as host-dependent. The
literal should be interpreted with input-charset and emitted with
exec-charset.
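
What the literal handling would amount to can be sketched in a few lines. This is a Python illustration only, assuming a CP932 source file and a UTF-8 execution charset; translate_literal is a hypothetical name and says nothing about how clang would implement the options.

```python
def translate_literal(source_bytes, input_charset, exec_charset):
    """Hypothetical helper: re-encode a string literal between charsets."""
    # Interpret the bytes found in the source file using the input charset...
    text = source_bytes.decode(input_charset)
    # ...and emit them into the object file using the execution charset.
    return text.encode(exec_charset)


# The literal as it would be stored in a source file saved as CP932 (assumed).
literal_in_source = "俺です、俺俺".encode("cp932")

# The same literal as emitted with a UTF-8 execution charset (assumed).
literal_in_object = translate_literal(literal_in_source, "cp932", "utf-8")

print(len(literal_in_source), "bytes in,", len(literal_in_object), "bytes out")
```

The include path takes no part in this translation: it is handed to the host file system as-is.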

Too hard the life is. Would you like to live in Japan? :p


...Takumi


2011/9/1 Nikola Smiljanic <popizdeh at gmail.com>:
> The function available in clang/lib/Basic/ConvertUTF.c deals with unsigned
> shorts, and I need wchar_t?
>
> On Thu, Sep 1, 2011 at 9:36 AM, Jean-Daniel Dupas <devlists at shadowlab.org>
> wrote:
>>
>> On 31 Aug 2011 at 21:02, Aaron Ballman wrote:
>>
>> > On Wed, Aug 31, 2011 at 1:17 PM, Eli Friedman <eli.friedman at gmail.com>
>> > wrote:
>> >> On Wed, Aug 31, 2011 at 10:58 AM, Nikola Smiljanic <popizdeh at gmail.com>
>> >> wrote:
>> >>> _wopen expects wchar_t* and the only visible function for conversion
>> >>> to
>> >>> utf16 is ConvertUTF8toUTF32 which converts to unsigned shorts.
>> >>
>> >> If you're in #ifdef WIN32 code, just use ConvertUTF8toUTF16 and
>> >> reinterpret_cast from unsigned short* to wchar_t*.
>> >
>> > I think the problem is that PathV2.inc is part of LLVM, and the
>> > ConvertUTF8ToUTF16 function is in an anonymous namespace.  So the
>> > question becomes: raise the function into an accessible namespace,
>> > duplicate code, or find some other mechanism?
>>
>> This function is also available in clang/lib/Basic/ConvertUTF.c
>>
>> >
>> > I don't think it makes sense to raise the function out of the
>> > anonymous namespace unless it's also moved (it has nothing to do with
>> > paths per se).  Perhaps it's worth it to move it to StringRef?
>> >
>> > ~Aaron
>> >
>> > _______________________________________________
>> > cfe-dev mailing list
>> > cfe-dev at cs.uiuc.edu
>> > http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
>>
>> -- Jean-Daniel

Ruben Van Boxem

2011-Sep-01 15:44 UTC

[LLVMdev] [cfe-dev] Unicode path handling on Windows

On 1 Sep 2011 at 14:12, "NAKAMURA Takumi" <geek4civic at gmail.com> wrote:
>
> Guys, welcome to the too weird i18n world!
> We, Japanese, has got suffered for multibyte charset for 20 years.
>
> I have added a comment in http://llvm.org/bugs/show_bug.cgi?id=10348 .
> Of course I know, I don't think it would be a practical resolution.
>
> FYI, it seems clang can retrieve mbcs path with s/CP_UTF8/CP_ACP/g.
>
> E>bin\clang.exe -S なかむら\たくみ.c
> なかむら\たくみ.c:4:2: error: #error
> #error
>  ^
> 1 error generated.
>
> Though, you should know, MBCS still has an issue;
>
> E>bin\clang.exe -S 表はダメ文字\表はダメ文字.c
> clang: error: no such file or directory: '表はダメ文字\表はダメ文字.c'
> clang: error: no input files
>
> Note, "表" is represented as "0x95 0x5C" in CP932.
>
> In principle, IMHHHO;
>
>  - argv should be treated as "blackbox" byte stream.
>  - Don't assume "wmain(argc, wchar_t **argv)". mingw does not have
> one. Then, argv must be presented as the default codepage.

Correction: I believe MinGW-w64 has a Unicode startup and thus support for
wmain (but of course it would be better to shift this to strict API
functions).

>  - A few codepage (eg. cp932 Japanese shift jis) might contain
> 0x5C(\) in 2nd (leading) octet.
>
> Win32 ANSI (****A) APIs assume local codepage.
>
> We should do in llvm;
>
>  - Treat pathstring in argv as blackbox. Never parse
> (char*)pathstring without any knowledge.
>  - UTF8 would be useless on win32. Win32 does not manipulate utf8
> implicitly in anywhere.
>  - Path API should hold pathstring as API-native form (bytestream on
> unix, UCS2 wchar_t on win32).
>  - Path should be manipulated as API-native form as possible.

Isn't it more straightforward to use UTF-8 internally, convert with the
functions the Win32 API provides whenever calling other Win32 API functions,
and always call the wide versions of those functions? That guarantees full
compatibility with a single internal encoding.

Ruben

>
> In future, we might consider "-finput-charset" and "-fexec-charset" on
> clang.
> Please consider an source file;
>
> ////////
> #include "むすめは/まおちゃん.h"
> char const literal[] = "俺です、俺俺";
> ////////
>
> The include path (#include) should be handled as host-dependent. The
> literal should be interperted with input-charset and be emitted with
> exec-charset.
>
> Too hard the life is. Would you like to live in Japan? :p
>
>
> ...Takumi
>
>

Nikola Smiljanic

2011-Sep-01 20:17 UTC

[LLVMdev] [cfe-dev] Unicode path handling on Windows

AFAIK Clang internals do assume utf8, and llvm::sys::path converts strings
to utf16 on windows and calls W API functions.

I would appreciate it if somebody could take a look at my changes and
comment on them. Here's a brief explanation of what I did:

- Convert argv to utf8 using current system locale for win32 (this is done
as soon as possible inside ExpandArgv). This makes the driver happy since
calls to llvm::sys::path::exists succeed.
- Change calls to ::open (inside FileSystemStatCache and MemoryBuffer) to
::_wopen on win32 by converting the path to utf16.
- In order to do the conversions I had to expose two functions, one of them
was already there but wasn't visible, the other one was added by me
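
The two conversions in the list above can be sketched as follows. This is a Python illustration, not the patch itself: the function names are borrowed from the patch, but the signatures are assumptions, the "cp932" codec stands in for the current system codepage, and the encode/decode calls stand in for the Win32 conversion routines.

```python
def convert_multibyte_to_utf8(raw, codepage):
    """Assumed signature: system-codepage bytes -> UTF-8 bytes."""
    return raw.decode(codepage).encode("utf-8")


def convert_utf8_to_utf16(utf8):
    """Assumed signature: UTF-8 bytes -> UTF-16 (little-endian) bytes."""
    return utf8.decode("utf-8").encode("utf-16-le")


# Step 1: argv arrives in the system codepage (CP932 assumed here) and is
# converted to UTF-8 as early as possible, as ExpandArgv does in the patch.
argv1 = "なかむら\\たくみ.c".encode("cp932")
internal = convert_multibyte_to_utf8(argv1, "cp932")

# Step 2: right before a wide call such as ::_wopen, the UTF-8 path is
# converted to UTF-16.
wide = convert_utf8_to_utf16(internal)

print(len(argv1), len(internal), len(wide))
```

Everything between the two steps sees only UTF-8, which matches what the Clang internals assume according to the message above.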

Known issues:

- I should probably use LLVM_ON_WIN32 instead of WIN32 but this macro isn't
defined inside FileSystemStatCache and MemoryBuffer for some reason. Both of
these files have an #ifdef section that deals with O_BINARY so maybe these
two sections should be consolidated?
- Functions convert_multibyte_to_utf8 and convert_utf8_to_utf16 have
definitions only on windows so every other platform is currently broken.

On Thu, Sep 1, 2011 at 5:44 PM, Ruben Van Boxem <vanboxem.ruben at gmail.com> wrote:
> Isn't it more straightforward to use utf-8 internally and use the
> conversion functions provided by the win32 API when calling other win32 API
> functions, and always call the wide versions of the win32 functions. Full
> compatibility guaranteed, and one encoding internally.
>
> Ruben
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unicode_path_clang.patch
Type: application/octet-stream
Size: 1811 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20110901/b724988b/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unicode_path_llvm.patch
Type: application/octet-stream
Size: 2973 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20110901/b724988b/attachment-0001.obj>

NAKAMURA Takumi

2011-Sep-02 03:21 UTC

[LLVMdev] [cfe-dev] Unicode path handling on Windows

2011/9/2 Ruben Van Boxem <vanboxem.ruben at gmail.com>:
>> In principle, IMHHHO;
>>
>>  - argv should be treated as "blackbox" byte stream.
>>  - Don't assume "wmain(argc, wchar_t **argv)". mingw does not have
>> one. Then, argv must be presented as the default codepage.
>
> Correction: I believe MinGW-w64 has a Unicode startup and thus support for
> wmain (but of course it would be better to shift this to strict API
> functions)

Good to hear. Frankly speaking, though, I have little knowledge
of the wmain() scheme...

>> We should do in llvm;
>>
>>  - Treat pathstring in argv as blackbox. Never parse
>> (char*)pathstring without any knowledge.
>>  - UTF8 would be useless on win32. Win32 does not manipulate utf8
>> implicitly in anywhere.
>>  - Path API should hold pathstring as API-native form (bytestream on
>> unix, UCS2 wchar_t on win32).
>>  - Path should be manipulated as API-native form as possible.
>
> Isn't it more straightforward to use utf-8 internally and use the conversion
> functions provided by the win32 API when calling other win32 API functions,
> and always call the wide versions of the win32 functions. Full compatibility
> guaranteed, and one encoding internally.

I could have proposed that if Win32 supported ANSI->UTF-8 conversion directly.
Having thought it over again, holding UTF-8 internally might be an option.

...Takumi
