Tom Honermann via llvm-dev
2020-Jun-16 14:53 UTC
[llvm-dev] RFC: Adding support for the z/OS platform to LLVM and clang
> -----Original Message-----
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Kai Peter Nacke via llvm-dev
> Sent: Tuesday, June 16, 2020 8:51 AM
> To: Corentin <corentin.jabot at gmail.com>
> Cc: llvm-dev at lists.llvm.org
> Subject: Re: [llvm-dev] RFC: Adding support for the z/OS platform to LLVM and clang
>
> > > 2) Add patches to Clang to allow EBCDIC and ASCII (ISO-8859-1) encoded
> > > input source files. This would be done at the file open time to allow
> > > the rest of Clang to operate as if the source was UTF-8 and so require
> > > no changes downstream. Feedback on this plan is welcome from the Clang
> > > community.
> >
> > Would it be correct to assume that this EBCDIC -> UTF-8 mapping would
> > be as prescribed by UTF-EBCDIC / IBM CDRA, notably for the control
> > characters that do not map exactly?
> > Notably, if the execution encoding is EBCDIC, is '0x06' equivalent to
> > '0086', etc?
> >
> > The question "Is Unicode sufficient to represent all characters
> > present in the input source without using the Private Use Area?" is
> > one that is relevant to both Clang and the C/C++ standard. (I do hope
> > that it is the case!)
>
> The current goal is to make only minimal changes to the frontend to enable
> reading of EBCDIC encoded files. For this, we use the auto-conversion
> service of z/OS UNIX System Services
> (https://www.ibm.com/support/knowledgecenter/SSLTBW_2.4.0/com.ibm.zos.v2r4.bpxb200/xpascii.htm),
> together with file tagging and setting the CCSID for the program and for
> opened files. The auto-conversion service supports round-trip conversion
> between EBCDIC and Enhanced ASCII. With it, boot strapping with EBCDIC
> source files is possible.
> Of course, more complete UTF-8 support is a valid implementation alternative.

Other good references:
- The 'chtag' utility
  https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.bpxa500/chtag.htm
- File tagging overview
  https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbcpx01/cbc1p273.htm

Kai, would use of auto conversion require that users set the _BPXK_AUTOCVT, _BPXK_CCSIDS, and/or _BPXK_PCCSID environment variables? Or do you envision having the clang driver set them before invocation of the compiler? If the latter, that would imply that users (and tests) are responsible for setting them for direct 'clang -cc1' invocations.

Here is another possible direction to consider that would provide a more portable facility. Clang has interfaces for overriding file contents with a memory buffer; see the overrideFileContents() overloads in SourceManager. It should be straightforward, when loading a file, to determine whether a conversion is needed (e.g., by considering file tags, environment variables, command line options, etc.) and, if needed, to transcode the file contents and register the resulting buffer as an override. This would be useful for implementation of -finput-charset and would benefit deployments in Microsoft environments that have source files in ISO-8859 encodings.

Tom.
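The transcoding step in the suggestion above can be illustrated for the ISO-8859-1 case, which needs no conversion tables because each ISO-8859-1 byte value equals its Unicode code point. This is only a minimal sketch: the function name latin1ToUtf8 is hypothetical and not part of Clang, and the sketch deliberately stops before the step of registering the result through the overrideFileContents() overloads in SourceManager. EBCDIC input would need a real code page table or a conversion service instead.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (an assumption for illustration, not a Clang API):
// transcode ISO-8859-1 bytes to UTF-8. Every ISO-8859-1 byte value is also
// the Unicode code point of the character it encodes, so bytes below 0x80
// are copied unchanged and bytes 0x80..0xFF expand to a two-byte sequence.
std::string latin1ToUtf8(std::string_view input) {
  std::string out;
  out.reserve(input.size() * 2); // worst case: every byte expands to two
  for (unsigned char byte : input) {
    if (byte < 0x80) {
      out.push_back(static_cast<char>(byte));
    } else {
      // Two-byte UTF-8 form: 110xxxxx 10xxxxxx
      out.push_back(static_cast<char>(0xC0 | (byte >> 6)));
      out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
    }
  }
  return out;
}
```

In a frontend, the returned string would presumably be wrapped in a memory buffer and registered as the override for the file being loaded, so that later stages only ever see UTF-8.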
Kai Peter Nacke via llvm-dev
2020-Jun-16 15:17 UTC
[llvm-dev] RFC: Adding support for the z/OS platform to LLVM and clang
Tom Honermann <Thomas.Honermann at synopsys.com> wrote on 16.06.2020 16:53:33:

> > > > 2) Add patches to Clang to allow EBCDIC and ASCII (ISO-8859-1) encoded
> > > > input source files. This would be done at the file open time to allow
> > > > the rest of Clang to operate as if the source was UTF-8 and so require
> > > > no changes downstream. Feedback on this plan is welcome from the Clang
> > > > community.
> > >
> > > Would it be correct to assume that this EBCDIC -> UTF-8 mapping would
> > > be as prescribed by UTF-EBCDIC / IBM CDRA, notably for the control
> > > characters that do not map exactly?
> > > Notably, if the execution encoding is EBCDIC, is '0x06' equivalent to
> > > '0086', etc?
> > >
> > > The question "Is Unicode sufficient to represent all characters
> > > present in the input source without using the Private Use Area?" is
> > > one that is relevant to both Clang and the C/C++ standard. (I do hope
> > > that it is the case!)
> >
> > The current goal is to make only minimal changes to the frontend to enable
> > reading of EBCDIC encoded files. For this, we use the auto-conversion
> > service of z/OS UNIX System Services
> > (https://www.ibm.com/support/knowledgecenter/SSLTBW_2.4.0/com.ibm.zos.v2r4.bpxb200/xpascii.htm),
> > together with file tagging and setting the CCSID for the program and for
> > opened files. The auto-conversion service supports round-trip conversion
> > between EBCDIC and Enhanced ASCII. With it, boot strapping with EBCDIC
> > source files is possible.
> > Of course, more complete UTF-8 support is a valid implementation
> > alternative.
>
> Other good references:
> - The 'chtag' utility
>   https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.bpxa500/chtag.htm
> - File tagging overview
>   https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbcpx01/cbc1p273.htm
>
> Kai, would use of auto conversion require that users set the
> _BPXK_AUTOCVT, _BPXK_CCSIDS, and/or _BPXK_PCCSID environment
> variables? Or do you envision having the clang driver set them
> before invocation of the compiler? If the latter, that would imply
> that users (and tests) are responsible for setting them for direct
> 'clang -cc1' invocations.

Hi Tom,
the current approach is to enable auto conversion only if _BPXK_AUTOCVT is set to ON. If the variable is not set, then all input files are treated as EBCDIC. The rationale is that we do not want to outsmart the user. So there is no problem with direct `clang -cc1` invocations. It's a good hint that we need to describe this setup somewhere.

> Here is another possible direction to consider that would provide a
> more portable facility. Clang has interfaces for overriding file
> contents with a memory buffer; see the overrideFileContents()
> overloads in SourceManager. It should be straight forward to, when
> loading a file, make a determination as to whether a conversion is
> needed (e.g., consider file tags, environment variables, command
> line options, etc...) and, if needed, transcode the file contents
> and register the resulting buffer as an override. This would be
> useful for implementation of -finput-charset and would benefit
> deployments in Microsoft environments that have source files in
> ISO-8859 encodings.

That's a good hint. I'll definitely have a look at it, as it sounds like it could remove some problems and complexity. A separate solution would then still be required for LLVM.

> Tom.

Best regards,
Kai Nacke
IT Architect
IBM Deutschland GmbH
Chairman of the Supervisory Board: Sebastian Krause
Management: Gregor Pillen (Chairman), Agnes Heftberger, Norbert Janzen, Markus Koerner, Christian Noll, Nicole Reimer
Registered office: Ehningen / Court of registration: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940
Tom Honermann via llvm-dev
2020-Jun-16 17:09 UTC
[llvm-dev] RFC: Adding support for the z/OS platform to LLVM and clang
> -----Original Message-----
> From: Kai Peter Nacke <kai.nacke at de.ibm.com>
> Sent: Tuesday, June 16, 2020 11:17 AM
> To: Tom Honermann <thonerma at synopsys.com>
> Cc: Corentin <corentin.jabot at gmail.com>; llvm-dev at lists.llvm.org
> Subject: RE: [llvm-dev] RFC: Adding support for the z/OS platform to LLVM and clang
>
> Tom Honermann <Thomas.Honermann at synopsys.com> wrote on 16.06.2020 16:53:33:
>
> > > > > 2) Add patches to Clang to allow EBCDIC and ASCII (ISO-8859-1) encoded
> > > > > input source files. This would be done at the file open time to allow
> > > > > the rest of Clang to operate as if the source was UTF-8 and so require
> > > > > no changes downstream. Feedback on this plan is welcome from the
> > > > > Clang community.
> > > >
> > > > Would it be correct to assume that this EBCDIC -> UTF-8 mapping would
> > > > be as prescribed by UTF-EBCDIC / IBM CDRA, notably for the control
> > > > characters that do not map exactly?
> > > > Notably, if the execution encoding is EBCDIC, is '0x06' equivalent to
> > > > '0086', etc?
> > > >
> > > > The question "Is Unicode sufficient to represent all characters
> > > > present in the input source without using the Private Use Area?"
> > > > is one that is relevant to both Clang and the C/C++ standard. (I do
> > > > hope that it is the case!)
> > >
> > > The current goal is to make only minimal changes to the frontend to enable
> > > reading of EBCDIC encoded files. For this, we use the auto-conversion
> > > service of z/OS UNIX System Services
> > > (https://www.ibm.com/support/knowledgecenter/SSLTBW_2.4.0/com.ibm.zos.v2r4.bpxb200/xpascii.htm),
> > > together with file tagging and setting the CCSID for the program and for
> > > opened files. The auto-conversion service supports round-trip conversion
> > > between EBCDIC and Enhanced ASCII. With it, boot strapping with
> > > EBCDIC source files is possible.
> > > Of course, more complete UTF-8 support is a valid implementation
> > > alternative.
> >
> > Other good references:
> > - The 'chtag' utility
> >   https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.bpxa500/chtag.htm
> > - File tagging overview
> >   https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbcpx01/cbc1p273.htm
> >
> > Kai, would use of auto conversion require that users set the
> > _BPXK_AUTOCVT, _BPXK_CCSIDS, and/or _BPXK_PCCSID environment
> > variables? Or do you envision having the clang driver set them before
> > invocation of the compiler? If the latter, that would imply that
> > users (and tests) are responsible for setting them for direct 'clang
> > -cc1' invocations.
>
> Hi Tom,
> the current approach is to enable auto conversion only if _BPXK_AUTOCVT is set
> to ON. If the variable is not set, then all input files are treated as EBCDIC. The
> rationale is that we do not want to outsmart the user.
> So there is no problem with direct `clang -cc1` invocations. It's a good hint that
> we need to describe this setup somewhere.

That seems reasonable. How would you handle _BPXK_AUTOCVT being set to ALL?

(For anyone following along, the difference between ON and ALL is described at https://www.ibm.com/support/knowledgecenter/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbcpx01/setenv.htm#setenv:

> When _BPXK_AUTOCVT is ON, automatic conversion can only take place between IBM-1047 and ISO8859-1 code sets. Other CCSID pairs are not supported for automatic text conversion. To request automatic conversion for any CCSID pairs that Unicode service supports, set _BPXK_AUTOCVT to ALL.)

Tom.
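The distinction discussed above can be made concrete with a small classifier for the _BPXK_AUTOCVT value. This is a sketch under stated assumptions, not the z/OS port's implementation: the enum and function names are hypothetical, values are matched as exact uppercase strings (an assumption), and the sketch only labels the ALL case. What the compiler should do for ALL is the question left open in this thread.

```cpp
#include <cstdlib>
#include <string_view>

// Hypothetical classification of the _BPXK_AUTOCVT setting (names are
// assumptions for illustration only).
enum class AutoConversion {
  Disabled,      // unset or any other value: per Kai's description, input
                 // files are then treated as EBCDIC
  Ibm1047Latin1, // ON: conversion only between IBM-1047 and ISO8859-1
  AnyCcsidPair   // ALL: any CCSID pair the Unicode service supports
};

// Classify a raw environment value; nullptr means the variable is unset.
// Exact uppercase matching is an assumption of this sketch.
AutoConversion classifyAutoCvt(const char *value) {
  if (value == nullptr)
    return AutoConversion::Disabled;
  std::string_view setting(value);
  if (setting == "ON")
    return AutoConversion::Ibm1047Latin1;
  if (setting == "ALL")
    return AutoConversion::AnyCcsidPair;
  return AutoConversion::Disabled;
}

// Convenience wrapper that reads the process environment.
AutoConversion autoConversionFromEnvironment() {
  return classifyAutoCvt(std::getenv("_BPXK_AUTOCVT"));
}
```

Keeping the classification separate from the environment lookup lets tests, including direct 'clang -cc1' style invocations, exercise each case without modifying the process environment.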