Adrian Prantl via llvm-dev
2018-Jan-30 15:41 UTC
[llvm-dev] [lldb-dev] Adding DWARF5 accelerator table support to llvm
> On Jan 30, 2018, at 7:35 AM, Pavel Labath <labath at google.com> wrote:
>
> Hello all,
>
> I am looking for feedback regarding implementation of the case folding
> algorithm for .debug_names hashes.
>
> Unlike the apple tables, the .debug_names hashes are computed from
> case-folded names (to enable case-insensitive lookups for languages
> where that makes sense). The dwarf5 document specifies that the case
> folding should be done according to the "Caseless matching" Section
> of the Unicode standard (whose implementation is basically a long list
> of special cases). While certainly possible, implementing this would
> be much more complicated (and would probably make the code a bit
> slower) than a simple tolower(3) call. And the benefits of this are
> not really clear to me.

Assuming a UTF-8 encoding, will tolower(3) destroy any non-ASCII characters in the process? In Swift, for example, we allow a wide range of unicode characters in identifiers and I want to make sure that this doesn't cause any problems.

-- adrian

> Do you know if we already make any promises or assumptions about the
> encoding and/or locale of the symbol names (and here I mainly mean the
> names in the debug info metadata, not llvm symbols)?
>
> If we don't already have a policy about this, then I propose to
> implement the case folding via tolower() (which is compatible with the
> full case folding algorithm, as long as one sticks to basic latin
> characters).
>
> What do you think?
Pavel Labath via llvm-dev
2018-Jan-30 15:49 UTC
[llvm-dev] [lldb-dev] Adding DWARF5 accelerator table support to llvm
On 30 January 2018 at 15:41, Adrian Prantl <aprantl at apple.com> wrote:
>
>> On Jan 30, 2018, at 7:35 AM, Pavel Labath <labath at google.com> wrote:
>>
>> Hello all,
>>
>> I am looking for feedback regarding implementation of the case folding
>> algorithm for .debug_names hashes.
>>
>> Unlike the apple tables, the .debug_names hashes are computed from
>> case-folded names (to enable case-insensitive lookups for languages
>> where that makes sense). The dwarf5 document specifies that the case
>> folding should be done according to the "Caseless matching" Section
>> of the Unicode standard (whose implementation is basically a long list
>> of special cases). While certainly possible, implementing this would
>> be much more complicated (and would probably make the code a bit
>> slower) than a simple tolower(3) call. And the benefits of this are
>> not really clear to me.
>
> Assuming a UTF-8 encoding, will tolower(3) destroy any non-ASCII characters in the process? In Swift, for example, we allow a wide range of unicode characters in identifiers and I want to make sure that this doesn't cause any problems.
>

I'm not sure what it will do out-of-the-box, but I could certainly implement it such that it does not touch the fancy characters.

However, if we already have unicode characters in the input, then it may make sense to go all the way and implement the full folding algorithm. Because, once we start producing hashes like this, it will be hard to switch to being fully standard-compliant (as that would invalidate the existing hashes).

But the question then is: can I assume the input names will be unicode (w/utf8 encoding)?
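For illustration, here is a minimal sketch of the "do not touch the fancy characters" variant discussed above. It is not code from LLVM or LLDB: `foldAsciiCase` and `djbHash` are hypothetical names, and the sketch assumes the input is UTF-8 and that the table hash is the DJB function (multiply by 33, seed 5381), which is what the DWARF v5 document describes for .debug_names as far as I can tell. Only the bytes 'A'..'Z' are changed, so the multi-byte UTF-8 sequences (all of whose bytes are >= 0x80) pass through unmodified, and no locale-dependent tolower(3) call is involved.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not an LLVM API): lowercase only ASCII 'A'..'Z'.
// Every byte of a multi-byte UTF-8 sequence is >= 0x80, so it never falls
// in that range and is copied through unchanged.
std::string foldAsciiCase(const std::string &Name) {
  std::string Folded = Name;
  for (char &C : Folded)
    if (C >= 'A' && C <= 'Z')
      C = static_cast<char>(C - 'A' + 'a');
  return Folded;
}

// Hypothetical helper: DJB hash (h = h * 33 + c, seeded with 5381),
// assumed here to be the hash used for .debug_names buckets.
std::uint32_t djbHash(const std::string &S) {
  std::uint32_t H = 5381;
  for (unsigned char C : S)
    H = H * 33 + C;
  return H;
}
```

With this, `djbHash(foldAsciiCase("Main"))` and `djbHash(foldAsciiCase("MAIN"))` are equal, while a name such as "ÜBER" keeps its "Ü" bytes intact and only "BER" is lowered. The trade-off is the one raised in this thread: names that differ only in the case of a non-ASCII letter would hash differently than under full Unicode case folding, so hashes produced this way would not match a fully standard-compliant producer for such names.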
Adrian Prantl via llvm-dev
2018-Jan-30 16:20 UTC
[llvm-dev] [lldb-dev] Adding DWARF5 accelerator table support to llvm
> On Jan 30, 2018, at 7:49 AM, Pavel Labath <labath at google.com> wrote:
>
> On 30 January 2018 at 15:41, Adrian Prantl <aprantl at apple.com> wrote:
>>
>>> On Jan 30, 2018, at 7:35 AM, Pavel Labath <labath at google.com> wrote:
>>>
>>> Hello all,
>>>
>>> I am looking for feedback regarding implementation of the case folding
>>> algorithm for .debug_names hashes.
>>>
>>> Unlike the apple tables, the .debug_names hashes are computed from
>>> case-folded names (to enable case-insensitive lookups for languages
>>> where that makes sense). The dwarf5 document specifies that the case
>>> folding should be done according to the "Caseless matching" Section
>>> of the Unicode standard (whose implementation is basically a long list
>>> of special cases). While certainly possible, implementing this would
>>> be much more complicated (and would probably make the code a bit
>>> slower) than a simple tolower(3) call. And the benefits of this are
>>> not really clear to me.
>>
>> Assuming a UTF-8 encoding, will tolower(3) destroy any non-ASCII characters in the process? In Swift, for example, we allow a wide range of unicode characters in identifiers and I want to make sure that this doesn't cause any problems.
>>
>
> I'm not sure what it will do out-of-the-box, but I could certainly
> implement it such that it does not touch the fancy characters.
>
> However, if we already have unicode characters in the input, then it
> may make sense to go all the way and implement the full folding
> algorithm. Because, once we start producing hashes like this, it will
> be hard to switch to being fully standard-compliant (as that would
> invalidate the existing hashes).
>
> But the question then is: can I assume the input names will be unicode
> (w/utf8 encoding)?

We can make that happen and encode it explicitly in each compile unit:

> 3.1.1 Full and Partial Compilation Unit Entries
> ...
> A DW_AT_use_UTF8 attribute, which is a flag whose presence indicates that all strings (such as the names of declared entities in the source program, or filenames in the line number table) are represented using the UTF-8 representation.

-- adrian