thr3ads.net - llvm dev - [LLVMdev] Format of special case list for sanitizers [Apr 2015]

If this information is useful, please help other people find it:
Share via:

Ryan Govostes

2015-Apr-05 23:47 UTC

[LLVMdev] Format of special case list for sanitizers

The documentation for the sanitizer special case list format[0] says,
> The meanining of * in regular expression for entity names is different - it
is treated as in shell wildcarding.

In SpecialCaseList::parse, we see that this is just replacing * with .*:

    // Replace * with .*
    for (size_t pos = 0; (pos = Regexp.find("*", pos)) !=
std::string::npos;
         pos += strlen(".*")) {
      Regexp.replace(pos, strlen("*"), ".*");
    }

This seems to introduce more problems than it solves, since (i) this doesn’t
really behave like a shell globbing wildcard as advertised, and (ii) if the user
tries to use * as a regex quantifier, this will match incorrectly: A* matches
the empty string and any number of As, while A.* matches all strings that start
with at least one A.

If it’s forgivable to break compatibility here, we should do regular expressions
_or_ shell globbing, and not a hybrid format.

I’d prefer shell globbing for paths in src entities, but that isn’t as useful
for function names. Most filenames will contain periods, which also need to be
escaped properly as regular expressions. (This also limits the usefulness of
treating literals separately.)

(Just a note: the way that regular expressions are concatenated in ::parse
appears to have a bug if a pattern contains a pipe.)

Ryan


0: http://clang.llvm.org/docs/SanitizerSpecialCaseList.html


diff --git a/lib/Support/SpecialCaseList.cpp b/lib/Support/SpecialCaseList.cpp
index c312cc1..2972cb1 100644
--- a/lib/Support/SpecialCaseList.cpp
+++ b/lib/Support/SpecialCaseList.cpp
@@ -133,7 +133,7 @@ bool SpecialCaseList::parse(const MemoryBuffer *MB,
std::string &Error) {
     // Add this regexp into the proper group by its prefix.
     if (!Regexps[Prefix][Category].empty())
       Regexps[Prefix][Category] += "|";
-    Regexps[Prefix][Category] += "^" + Regexp + "$";
+    Regexps[Prefix][Category] += "^(" + Regexp + ")$)";
   }
   return true;
 }

Alexey Samsonov

2015-Apr-13 20:39 UTC

head link

[LLVMdev] Format of special case list for sanitizers

Hi Ryan,

On Sun, Apr 5, 2015 at 4:47 PM, Ryan Govostes <rzg at apple.com> wrote:
> The documentation for the sanitizer special case list format[0] says,
>
> > The meanining of * in regular expression for entity names is different
-
> it is treated as in shell wildcarding.
>
>
> In SpecialCaseList::parse, we see that this is just replacing * with .*:
>
>     // Replace * with .*
>     for (size_t pos = 0; (pos = Regexp.find("*", pos)) !>
std::string::npos;
>          pos += strlen(".*")) {
>       Regexp.replace(pos, strlen("*"), ".*");
>     }
>
> This seems to introduce more problems than it solves, since (i) this
> doesn’t really behave like a shell globbing wildcard as advertised, and
> (ii) if the user tries to use * as a regex quantifier, this will match
> incorrectly: A* matches the empty string and any number of As, while A.*
> matches all strings that start with at least one A.
>
> If it’s forgivable to break compatibility here, we should do regular
> expressions _or_ shell globbing, and not a hybrid format.
>
I agree that the current format description is misleading, e.g.
"foo*bar"
will also match "fooxxx/yyy/zzzbar", which might be unexpected for
user
expecting a shell globbing. For now we should at least change documentation
to reflect that. In retrospective, replacing "*" with ".*"
doesn't look
like a good idea at all, and for simplicity I'd prefer to just use regular
expressions. Adding special case for "file paths" isn't nice:
1) as you mention, we have to do careful escaping
2) at the moment special case list format is generic and is not tied to
"src" or "fun" entities, it's specific sanitizers that
introduce logic on
top of it. I'd prefer to treat all special case list entries in a similar
way.

However, I'm really afraid of breaking compatibility :( I know several
users of blacklist that already use "*" meaning ".*", and it
would be
challenging to migrate them - e.g. you have to keep two different blacklist
files for older and newer Clang...

The only option I see is to introduce versioning to special case list
format. That's unfortunate and extra complexity, but can be useful if we
decide to make further changes.

> I’d prefer shell globbing for paths in src entities, but that isn’t as
> useful for function names. Most filenames will contain periods, which also
> need to be escaped properly as regular expressions. (This also limits the
> usefulness of treating literals separately.)
>
> (Just a note: the way that regular expressions are concatenated in ::parse
> appears to have a bug if a pattern contains a pipe.)
>
Patches welcome :) You can also open a bug against me, but I can't
guarantee I will get to this (or the suggestion above) in the nearest
future.

>
> Ryan
>
>
> 0: http://clang.llvm.org/docs/SanitizerSpecialCaseList.html
>
>
> diff --git a/lib/Support/SpecialCaseList.cpp
> b/lib/Support/SpecialCaseList.cpp
> index c312cc1..2972cb1 100644
> --- a/lib/Support/SpecialCaseList.cpp
> +++ b/lib/Support/SpecialCaseList.cpp
> @@ -133,7 +133,7 @@ bool SpecialCaseList::parse(const MemoryBuffer *MB,
> std::string &Error) {
>      // Add this regexp into the proper group by its prefix.
>      if (!Regexps[Prefix][Category].empty())
>        Regexps[Prefix][Category] += "|";
> -    Regexps[Prefix][Category] += "^" + Regexp + "$";
> +    Regexps[Prefix][Category] += "^(" + Regexp +
")$)";
>    }
>    return true;
>  }
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>


-- 
Alexey Samsonov
vonosmas at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150413/e2a3aaa9/attachment.html>

Justin Bogner

2015-Apr-13 23:38 UTC

head link

[LLVMdev] Format of special case list for sanitizers

Alexey Samsonov <vonosmas at gmail.com> writes:> Hi Ryan,
>
> On Sun, Apr 5, 2015 at 4:47 PM, Ryan Govostes <rzg at apple.com>
wrote:
>> The documentation for the sanitizer special case list format[0] says,
>>
>> > The meanining of * in regular expression for entity names is
different
>> - it is treated as in shell wildcarding.
>>
>> In SpecialCaseList::parse, we see that this is just replacing * with
.*:
>>
>>     // Replace * with .*
>>     for (size_t pos = 0; (pos = Regexp.find("*", pos)) !=
std::string::npos;
>>          pos += strlen(".*")) {
>>       Regexp.replace(pos, strlen("*"), ".*");
>>     }
>>
>> This seems to introduce more problems than it solves, since (i) this
>> doesn’t really behave like a shell globbing wildcard as advertised, and
>> (ii) if the user tries to use * as a regex quantifier, this will match
>> incorrectly: A* matches the empty string and any number of As, while
A.*
>> matches all strings that start with at least one A.
>>
>> If it’s forgivable to break compatibility here, we should do regular
>> expressions _or_ shell globbing, and not a hybrid format.
>
> I agree that the current format description is misleading, e.g.
"foo*bar"
> will also match "fooxxx/yyy/zzzbar", which might be unexpected
for user
> expecting a shell globbing. For now we should at least change documentation
> to reflect that. In retrospective, replacing "*" with
".*" doesn't look like
> a good idea at all, and for simplicity I'd prefer to just use regular
> expressions. Adding special case for "file paths" isn't nice:
> 1) as you mention, we have to do careful escaping
> 2) at the moment special case list format is generic and is not tied to
"src"
> or "fun" entities, it's specific sanitizers that introduce
logic on top of
> it. I'd prefer to treat all special case list entries in a similar way.
>
> However, I'm really afraid of breaking compatibility :( I know several
users
> of blacklist that already use "*" meaning ".*", and it
would be challenging
> to migrate them - e.g. you have to keep two different blacklist files for
> older and newer Clang...
TBH, this doesn't seem like that big of a deal to me. The "*"
behaviour
is strange and confusing, and I'd expect most people to use sanitizers
mostly from their newer compiler if they use multiple - they work
better. Compiling with older compilers is important for a lot of use
cases, but I can't see why people would run sanitizers from two
different versions of the compiler at the same time.
> The only option I see is to introduce versioning to special case list
format.
> That's unfortunate and extra complexity, but can be useful if we decide
to
> make further changes.
>
>     I’d prefer shell globbing for paths in src entities, but that isn’t as
>     useful for function names. Most filenames will contain periods, which
>     also need to be escaped properly as regular expressions. (This also
>     limits the usefulness of treating literals separately.)
>    
>     (Just a note: the way that regular expressions are concatenated in
>     ::parse appears to have a bug if a pattern contains a pipe.)
>
> Patches welcome :) You can also open a bug against me, but I can't
guarantee
> I will get to this (or the suggestion above) in the nearest future.
>  
>
>     Ryan
>
>     0: http://clang.llvm.org/docs/SanitizerSpecialCaseList.html
>
>     diff --git a/lib/Support/SpecialCaseList.cpp b/lib/Support/
>     SpecialCaseList.cpp
>     index c312cc1..2972cb1 100644
>     --- a/lib/Support/SpecialCaseList.cpp
>     +++ b/lib/Support/SpecialCaseList.cpp
>     @@ -133,7 +133,7 @@ bool SpecialCaseList::parse(const MemoryBuffer *MB,
>     std::string &Error) {
>          // Add this regexp into the proper group by its prefix.
>          if (!Regexps[Prefix][Category].empty())
>            Regexps[Prefix][Category] += "|";
>     -    Regexps[Prefix][Category] += "^" + Regexp +
"$";
>     +    Regexps[Prefix][Category] += "^(" + Regexp +
")$)";
>        }
>        return true;
>      }
>
>     _______________________________________________
>     LLVM Developers mailing list
>     LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>     http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
> --
> Alexey Samsonov
> vonosmas at gmail.com
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Sean Silva

2015-Apr-15 15:31 UTC

head link

[LLVMdev] Format of special case list for sanitizers

On Mon, Apr 6, 2015 at 12:47 AM, Ryan Govostes <rzg at apple.com> wrote:
> The documentation for the sanitizer special case list format[0] says,
>
> > The meanining of * in regular expression for entity names is different
-
> it is treated as in shell wildcarding.
>
>
> In SpecialCaseList::parse, we see that this is just replacing * with .*:
>
>     // Replace * with .*
>     for (size_t pos = 0; (pos = Regexp.find("*", pos)) !>
std::string::npos;
>          pos += strlen(".*")) {
>       Regexp.replace(pos, strlen("*"), ".*");
>     }
>
> This seems to introduce more problems than it solves, since (i) this
> doesn’t really behave like a shell globbing wildcard as advertised, and
> (ii) if the user tries to use * as a regex quantifier, this will match
> incorrectly: A* matches the empty string and any number of As, while A.*
> matches all strings that start with at least one A.
>
> If it’s forgivable to break compatibility here, we should do regular
> expressions _or_ shell globbing, and not a hybrid format.
>
This really doesn't seem compelling enough to merit breaking compatibility,
especially since there are already users in the wild.

-- Sean Silva

>
> I’d prefer shell globbing for paths in src entities, but that isn’t as
> useful for function names. Most filenames will contain periods, which also
> need to be escaped properly as regular expressions. (This also limits the
> usefulness of treating literals separately.)
>
> (Just a note: the way that regular expressions are concatenated in ::parse
> appears to have a bug if a pattern contains a pipe.)
>
> Ryan
>
>
> 0: http://clang.llvm.org/docs/SanitizerSpecialCaseList.html
>
>
> diff --git a/lib/Support/SpecialCaseList.cpp
> b/lib/Support/SpecialCaseList.cpp
> index c312cc1..2972cb1 100644
> --- a/lib/Support/SpecialCaseList.cpp
> +++ b/lib/Support/SpecialCaseList.cpp
> @@ -133,7 +133,7 @@ bool SpecialCaseList::parse(const MemoryBuffer *MB,
> std::string &Error) {
>      // Add this regexp into the proper group by its prefix.
>      if (!Regexps[Prefix][Category].empty())
>        Regexps[Prefix][Category] += "|";
> -    Regexps[Prefix][Category] += "^" + Regexp + "$";
> +    Regexps[Prefix][Category] += "^(" + Regexp +
")$)";
>    }
>    return true;
>  }
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150415/54a1d5ff/attachment.html>

Nick Lewycky

2015-Apr-15 19:05 UTC

head link

[LLVMdev] Format of special case list for sanitizers

Alexey Samsonov wrote:> Hi Ryan,
>
> On Sun, Apr 5, 2015 at 4:47 PM, Ryan Govostes <rzg at apple.com
> <mailto:rzg at apple.com>> wrote:
>
>     The documentation for the sanitizer special case list format[0] says,
>
>      > The meanining of * in regular expression for entity names is
>     different - it is treated as in shell wildcarding.
>
>
>     In SpecialCaseList::parse, we see that this is just replacing * with
.*:
>
>          // Replace * with .*
>          for (size_t pos = 0; (pos = Regexp.find("*", pos)) !>
std::string::npos;
>               pos += strlen(".*")) {
>            Regexp.replace(pos, strlen("*"), ".*");
>          }
>
>     This seems to introduce more problems than it solves, since (i) this
>     doesn’t really behave like a shell globbing wildcard as advertised,
>     and (ii) if the user tries to use * as a regex quantifier, this will
>     match incorrectly: A* matches the empty string and any number of As,
>     while A.* matches all strings that start with at least one A.
>
>     If it’s forgivable to break compatibility here, we should do regular
>     expressions _or_ shell globbing, and not a hybrid format.
>
>
> I agree that the current format description is misleading, e.g.
> "foo*bar" will also match "fooxxx/yyy/zzzbar", which
might be unexpected
> for user expecting a shell globbing. For now we should at least change
> documentation to reflect that. In retrospective, replacing "*"
with ".*"
> doesn't look like a good idea at all, and for simplicity I'd prefer
to
> just use regular expressions. Adding special case for "file
paths" isn't
> nice:
> 1) as you mention, we have to do careful escaping
> 2) at the moment special case list format is generic and is not tied to
> "src" or "fun" entities, it's specific sanitizers
that introduce logic
> on top of it. I'd prefer to treat all special case list entries in a
> similar way.
>
> However, I'm really afraid of breaking compatibility :( I know several
> users of blacklist that already use "*" meaning ".*",
and it would be
> challenging to migrate them - e.g. you have to keep two different
> blacklist files for older and newer Clang...
FWIW, I think ASAN is going to have more users tomorrow than today, and 
the longer we wait the more painful it will be to change. The sooner we 
make the change, the easier it'll be and the less time we'll have to 
live with the wart. That said, I don't know how to arrange for the 
switch, maybe add a -fold-format flag which converts the * to .*
> The only option I see is to introduce versioning to special case list
> format. That's unfortunate and extra complexity, but can be useful if
we
> decide to make further changes.
>
>
>     I’d prefer shell globbing for paths in src entities, but that isn’t
>     as useful for function names. Most filenames will contain periods,
>     which also need to be escaped properly as regular expressions. (This
>     also limits the usefulness of treating literals separately.)
>
>     (Just a note: the way that regular expressions are concatenated in
>     ::parse appears to have a bug if a pattern contains a pipe.)
>
>
> Patches welcome :) You can also open a bug against me, but I can't
> guarantee I will get to this (or the suggestion above) in the nearest
> future.
>
>
>     Ryan
>
>
>     0: http://clang.llvm.org/docs/SanitizerSpecialCaseList.html
>
>
>     diff --git a/lib/Support/SpecialCaseList.cpp
>     b/lib/Support/SpecialCaseList.cpp
>     index c312cc1..2972cb1 100644
>     --- a/lib/Support/SpecialCaseList.cpp
>     +++ b/lib/Support/SpecialCaseList.cpp
>     @@ -133,7 +133,7 @@ bool SpecialCaseList::parse(const MemoryBuffer
>     *MB, std::string &Error) {
>           // Add this regexp into the proper group by its prefix.
>           if (!Regexps[Prefix][Category].empty())
>             Regexps[Prefix][Category] += "|";
>     -    Regexps[Prefix][Category] += "^" + Regexp +
"$";
>     +    Regexps[Prefix][Category] += "^(" + Regexp +
")$)";
>         }
>         return true;
>       }
>
>
>     _______________________________________________
>     LLVM Developers mailing list
>     LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu>
http://llvm.cs.uiuc.edu
>     http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
>
> --
> Alexey Samsonov
> vonosmas at gmail.com <mailto:vonosmas at gmail.com>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

llvm dev - Apr 2015 - [LLVMdev] Format of special case list for sanitizers

[LLVMdev] Format of special case list for sanitizers

[LLVMdev] Format of special case list for sanitizers

[LLVMdev] Format of special case list for sanitizers

[LLVMdev] Format of special case list for sanitizers

[LLVMdev] Format of special case list for sanitizers