thr3ads.net - llvm dev - [LLVMdev] Regular Expression lib support [Aug 2009]

If this information is useful, please help other people find it:
Share via:

Daniel Dunbar

2009-Aug-25 18:18 UTC

[LLVMdev] Regular Expression lib support

Woot! Thanks a bunch Edwin!

Some comments on the patch:
--
I'm not sure if it makes sense to import the man pages, if we only
expose Regex.h.
> diff --git a/include/llvm/Support/Regex.h b/include/llvm/Support/Regex.h
> new file mode 100644
> index 0000000..314bff4
> --- /dev/null
> +++ b/include/llvm/Support/Regex.h
> @@ -0,0 +1,49 @@
> +//===-- Regex.h - Regular Expression matcher implementation
-------------===//
Include C++ marker please (*- C++ -*-===//)
> +//
> +//                     The LLVM Compiler Infrastructure
> +//
> +// This file is distributed under the University of Illinois Open Source
> +// License. See LICENSE.TXT for details.
> +//
>
+//===----------------------------------------------------------------------===//
> +//
> +// This file implements a POSIX regular expression matcher.
> +//
>
+//===----------------------------------------------------------------------===//
> +#include "llvm/Support/DataTypes.h"
> +#include "llvm/Support/regex.h"
I would like to get rid of this include. What if match_sub just took a
SmallVector<std::pair<size_t, size_t>> to use for the matchs (or
some
equivalent). I think that the regex n_subs field has the number of matches, so
match_sub can do the right thing.
> +namespace llvm {
> +  class Regex {
> +  public:
> +    enum {
> +      // Compile with recognition of all special characters turned off.
> +      // All characters are thus considered ordinary,
> +      // so the RE is a literal string.
> +      NOSPEC=1,
> +      // Compile for matching that ignores upper/lower case distinctions.
> +      ICASE=2,
> +      // Compile for matching that need only report success or failure,
> +      // not what was matched.
> +      NOSUB=4,
> +      // Compile for newline-sensitive matching. With this flag
'[^' bracket
> +      // expressions and '.' never match newline. A ^ anchor
matches the
> +      // null string after any newline in the string in addition to its
normal
> +      // function, and the $ anchor matches the null string before any
> +      // newline in the string in addition to its normal function.
> +      NEWLINE=8
> +    };
Can you make these doxyments? Also, I'd prefer more LLVM style and less Unix
style enum names, IgnoreCase for example.

Does NOSPEC actually ever make sense to use? Wouldn't clients just use
string
compare?
> +
> +    Regex(const char *regex, unsigned Flags=NOSUB);
> +    ~Regex();
I think this should have an "bool isValid(std::string &Error)" or
similar
method, which would return the regex error if one was found during construction.
> +    // notbol: The match-beginning-of-line operator always fails to match,
> +    //    unless regex was compiled with NEWLINE flag.
> +    // noteol:  The match-end-of-line operator always fails to match.
> +    //    unless regex was compiled with NEWLINE flag.
> +
> +    bool matches(const char *string, bool notbol=false, bool
noteol=false);
> +    bool match_sub(const char *string, llvm_regmatch_t pmatch[],
> +                   unsigned nmatch, bool notbol=false, bool noteol=false);
These should be doxymented, and I don't understand the flags. Does this mean
that if I *don't* compile with NEWLINE, but *do* want ^/$ support, then
notbol/noteol need to be false? Do we care about use cases that care about this?
Could we just always have them as false?

As mentioned before I think we can change the matches to be returned in a small
vector or similar and avoid the dependency on llvm_regmatch_t.

What about this:


/// match - Match the regex against the given \arg String.
///
/// \param Matches - If given, on a successful match this will be filled in with
/// references to the matched group expressions (inside \arg String).
/// \param AllowBeginOfLine - ... again I think we should kill this if
possible ...
/// \param AllowEndOfLine - ... again I think we should kill this if
possible ...
/// \return - True on successful match.
bool match(const char *String, SmallVectorImpl<StringRef> *Matches,
             bool AllowBeginOfLine = true, bool AllowEndOfLine = true);
> +    bool match_sub(const char *string, llvm_regmatch_t pmatch[],
> +                   unsigned nmatch, bool notbol=false, bool noteol=false);
> +  private:
> +    llvm_regex_t preg;
> +  };
> +}
> diff --git a/include/llvm/Support/regex.h b/include/llvm/Support/regex.h
This won't work on Windows as stands (Regex.h == regex.h), but I think we
should
just keep regex.h private which solves the problem (still nice to rename it to
avoid confusion, llvm_regex.h would be consistent).

Otherwise this looks great to me and compiled file on Darwin. I too am of the
opinion that we should avoid all configure magicness and just always use this, I
don't see substantial value in trying to use the operating system one. If we
did, it would be an obvious extension to Regex.cpp, and nothing else.
--

Thanks again for working on this!!!

 - Daniel

2009/8/25 Török Edwin <edwintorok at gmail.com>:> On 2009-08-24 20:14, Chris Lattner wrote:
>> On Aug 23, 2009, at 11:59 PM, Török Edwin wrote:
>>> If LLVM is going to have an integrated regex library I suggest
using it
>>> regardless if the platform has one.
>>> The LLVM integrated regex library will provide consistent behaviour
and
>>> execution time, the system one will not.
>>
>> Hi Edwin,
>>
>> Can you propose the openbsd implementation as a patch to lib/support?
>>
>> -Chris
>
> Attached is a proposed patch.
> The Regex class is a fairly simple wrapper around the
> llvm_regcomp/llvm_regexec functions,
> maybe the interface could be improved, right now I just focused on
> getting it to work.
> Also the OpenBSD implementation has some extensions (see docs/regex.7),
> if they are useful those can be added to the Regex class also.
>
> A summary of changes from OpenBSD version:
>  - rename functions and types to avoid clashes with system one:
> regcomp->llvm_regcomp, ..., regex_t -> llvm_regex_t, regoff_t ->
> llvm_regoff_t, BAD -> REGEX_BAD ...
>  - include strlcpy.c (llvm_strlcpy), a configure check could be added to
> use system's one, but the function is so simple I just unconditionally
> always use it
>  - memcpy -> memmove (parameters may overlap, BSD memcpy allows that but
> POSIX doesn't)
>  - COPYRIGHT -> COPYRIGHT.regex, engine.c -> engine.inc, regex.h
->
> llvm/Support/regex.h
>  - char* -> const char* to avoid compiler warnings
>  - #define DUPMAX
>
> There is also a unittests/Support/RegexTest.cpp that does some very
> simple tests.
>
> Let me know how it works on your platform.
>
> Best regards,
> --Edwin
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>

Török Edwin

2009-Aug-25 20:56 UTC

head link

[LLVMdev] Regular Expression lib support

On 2009-08-25 21:18, Daniel Dunbar wrote:> Woot! Thanks a bunch Edwin!
>
> Some comments on the patch:
> --
> I'm not sure if it makes sense to import the man pages, if we only
> expose Regex.h.
>   
I'd like to keep re_format.7, it describes the format of the regex as
accepted by this implementation.
I'll remove regex.3 since its not exposed.
>   
>> diff --git a/include/llvm/Support/Regex.h
b/include/llvm/Support/Regex.h
>> new file mode 100644
>> index 0000000..314bff4
>> --- /dev/null
>> +++ b/include/llvm/Support/Regex.h
>> @@ -0,0 +1,49 @@
>> +//===-- Regex.h - Regular Expression matcher implementation
-------------===//
>>     
>
> Include C++ marker please (*- C++ -*-===//)
>
>   
>> +//
>> +//                     The LLVM Compiler Infrastructure
>> +//
>> +// This file is distributed under the University of Illinois Open
Source
>> +// License. See LICENSE.TXT for details.
>> +//
>>
+//===----------------------------------------------------------------------===//
>> +//
>> +// This file implements a POSIX regular expression matcher.
>> +//
>>
+//===----------------------------------------------------------------------===//
>> +#include "llvm/Support/DataTypes.h"
>> +#include "llvm/Support/regex.h"
>>     
>
> I would like to get rid of this include. What if match_sub just took a
> SmallVector<std::pair<size_t, size_t>> to use for the matchs
(or some
> equivalent). I think that the regex n_subs field has the number of matches,
so
> match_sub can do the right thing.
>   
Ok.
>   
>> +namespace llvm {
>> +  class Regex {
>> +  public:
>> +    enum {
>> +      // Compile with recognition of all special characters turned
off.
>> +      // All characters are thus considered ordinary,
>> +      // so the RE is a literal string.
>> +      NOSPEC=1,
>> +      // Compile for matching that ignores upper/lower case
distinctions.
>> +      ICASE=2,
>> +      // Compile for matching that need only report success or
failure,
>> +      // not what was matched.
>> +      NOSUB=4,
>> +      // Compile for newline-sensitive matching. With this flag
'[^' bracket
>> +      // expressions and '.' never match newline. A ^ anchor
matches the
>> +      // null string after any newline in the string in addition to
its normal
>> +      // function, and the $ anchor matches the null string before any
>> +      // newline in the string in addition to its normal function.
>> +      NEWLINE=8
>> +    };
>>     
>
> Can you make these doxyments? Also, I'd prefer more LLVM style and less
Unix
> style enum names, IgnoreCase for example.
>
> Does NOSPEC actually ever make sense to use? Wouldn't clients just use
string
> compare?
>   
Indeed, it is worthless for LLVM.
There's also REG_PEND (non-POSIX, but this impl. supports it) that
allows matching patterns
with embedded NULs. Would that be needed/useful for LLVM?
>   
>> +
>> +    Regex(const char *regex, unsigned Flags=NOSUB);
>> +    ~Regex();
>>     
>
> I think this should have an "bool isValid(std::string
&Error)" or similar
> method, which would return the regex error if one was found during
construction.
>   
Ok.
>   
>> +    // notbol: The match-beginning-of-line operator always fails to
match,
>> +    //    unless regex was compiled with NEWLINE flag.
>> +    // noteol:  The match-end-of-line operator always fails to match.
>> +    //    unless regex was compiled with NEWLINE flag.
>> +
>> +    bool matches(const char *string, bool notbol=false, bool
noteol=false);
>> +    bool match_sub(const char *string, llvm_regmatch_t pmatch[],
>> +                   unsigned nmatch, bool notbol=false, bool
noteol=false);
>>     
>
> These should be doxymented, and I don't understand the flags. Does this
mean
> that if I *don't* compile with NEWLINE, but *do* want ^/$ support, then
> notbol/noteol need to be false? Do we care about use cases that care about
this?
> Could we just always have them as false?
>   
They are meant for cases where you call matches() on portions of a
larger string.
When you call it at offset 0, you want notbol=false, but when you call
it at offset > 0 you want notbol=true.
Ditto for noteol, when the portion of string passed to matches() doesn't
contain the real end of the string.

If we are always passing the entire string we intend to match to
matches() then we can remove those bools.

This is the manpage:
REG_NEWLINE
"Compile for newline-sensitive matching. By default, newline is a
completely ordinary character with no special
meaning in either REs or strings. With this flag,. [^ bracket
expressions and . never match newline,
a ^ achor matches the null string after any newline in the string in
addition to its normal function,
and the $ anchor matches the null string before any newline in the
string in addition to its normal function.

REG_NOTBOL
The first character of the stringis not the beginning of a line, so the
^ anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE .
REG_NOTEOL
The NUL terminating the string does not end a line, so the $ anchor
should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE .
> As mentioned before I think we can change the matches to be returned in a
small
> vector or similar and avoid the dependency on llvm_regmatch_t.
>
>   
Yes, that sounds good.
> What about this:
>
>
> /// match - Match the regex against the given \arg String.
> ///
> /// \param Matches - If given, on a successful match this will be filled in
with
> /// references to the matched group expressions (inside \arg String).
> /// \param AllowBeginOfLine - ... again I think we should kill this if
> possible ...
> /// \param AllowEndOfLine - ... again I think we should kill this if
> possible ...
> /// \return - True on successful match.
> bool match(const char *String, SmallVectorImpl<StringRef> *Matches,
>              bool AllowBeginOfLine = true, bool AllowEndOfLine = true);
>
>   
>> +    bool match_sub(const char *string, llvm_regmatch_t pmatch[],
>> +                   unsigned nmatch, bool notbol=false, bool
noteol=false);
>>     
>
>
>   
>> +  private:
>> +    llvm_regex_t preg;
>> +  };
>> +}
>>     
>
>   
>> diff --git a/include/llvm/Support/regex.h
b/include/llvm/Support/regex.h
>>     
>
> This won't work on Windows as stands (Regex.h == regex.h), but I think
we should
> just keep regex.h private which solves the problem (still nice to rename it
to
> avoid confusion, llvm_regex.h would be consistent).
>   
Ok, I'll try to make it private.
> Otherwise this looks great to me and compiled file on Darwin. I too am of
the
> opinion that we should avoid all configure magicness and just always use
this, I
> don't see substantial value in trying to use the operating system one.
If we
> did, it would be an obvious extension to Regex.cpp, and nothing else.
>   
I don't see any value in using the system provided one either.


Best regards,
--Edwin

HyperQuantum

2009-Aug-26 07:25 UTC

head link

[LLVMdev] Regular Expression lib support

>> +//===-- Regex.h - Regular Expression matcher implementation
-------------===//
>
> Include C++ marker please (*- C++ -*-===//)
Just out of curiosity, what is this marker used for?

Sebastian Redl

2009-Aug-26 09:18 UTC

head link

[LLVMdev] Regular Expression lib support

On Wed, 26 Aug 2009 09:25:30 +0200, HyperQuantum <hyperquantum at
gmail.com>
wrote:>>> +//===-- Regex.h - Regular Expression matcher implementation
>>> -------------===//
>>
>> Include C++ marker please (*- C++ -*-===//)
> 
> Just out of curiosity, what is this marker used for?
I use it in my text editor so that it can distinguish .h files containing C
from those containing C++.

Sebastian

OvermindDL1

2009-Aug-26 09:30 UTC

head link

[LLVMdev] Regular Expression lib support

Bah, the mailing server still has the headers broken!  Sending to the
list this time, how inconvenient...

On Wed, Aug 26, 2009 at 3:18 AM, Sebastian
Redl<sebastian.redl at getdesigned.at> wrote:>
> On Wed, 26 Aug 2009 09:25:30 +0200, HyperQuantum <hyperquantum at
gmail.com>
> wrote:
>>>> +//===-- Regex.h - Regular Expression matcher implementation
>>>> -------------===//
>>>
>>> Include C++ marker please (*- C++ -*-===//)
>>
>> Just out of curiosity, what is this marker used for?
>
> I use it in my text editor so that it can distinguish .h files containing C
> from those containing C++.
I have always been curious about that.  For my own projects I have
always used .hpp for C++ only headers and .h for C/C++ or just C
headers.  I have noticed other libraries follow the same standard, I
have always been curious why some, such as LLVM, do not?

Bill Wendling

2009-Aug-27 05:59 UTC

head link

[LLVMdev] Regular Expression lib support

On Aug 26, 2009, at 12:25 AM, HyperQuantum wrote:
>>> +//===-- Regex.h - Regular Expression matcher implementation  
>>> -------------===//
>>
>> Include C++ marker please (*- C++ -*-===//)
>
> Just out of curiosity, what is this marker used for?
It's used by emacs to tell it what "code mode" to use when editing
the
file. If you didn't specify anything, it would treat the .h as a C  
file. The highlighting and automatic formatting emacs does then won't  
be correct for C++ constructs like classes and namespaces.

-bw

Daniel Dunbar

2009-Aug-27 06:06 UTC

head link

[LLVMdev] Regular Expression lib support

2009/8/25 Török Edwin <edwintorok at gmail.com>:> On 2009-08-25 21:18, Daniel Dunbar wrote:
>> Woot! Thanks a bunch Edwin!
>>
>> Some comments on the patch:
>> --
>> I'm not sure if it makes sense to import the man pages, if we only
>> expose Regex.h.
>>
>
> I'd like to keep re_format.7, it describes the format of the regex as
> accepted by this implementation.
> I'll remove regex.3 since its not exposed.
Ok.
>> Can you make these doxyments? Also, I'd prefer more LLVM style and
less Unix
>> style enum names, IgnoreCase for example.
>>
>> Does NOSPEC actually ever make sense to use? Wouldn't clients just
use string
>> compare?
>>
>
> Indeed, it is worthless for LLVM.
> There's also REG_PEND (non-POSIX, but this impl. supports it) that
> allows matching patterns
> with embedded NULs. Would that be needed/useful for LLVM?
Maybe? If its already there and we are fine using this version of the
lib always then we might as well include it.
>>> +    // notbol: The match-beginning-of-line operator always fails
to match,
>>> +    //    unless regex was compiled with NEWLINE flag.
>>> +    // noteol:  The match-end-of-line operator always fails to
match.
>>> +    //    unless regex was compiled with NEWLINE flag.
>>> +
>>> +    bool matches(const char *string, bool notbol=false, bool
noteol=false);
>>> +    bool match_sub(const char *string, llvm_regmatch_t pmatch[],
>>> +                   unsigned nmatch, bool notbol=false, bool
noteol=false);
>>>
>>
>> These should be doxymented, and I don't understand the flags. Does
this mean
>> that if I *don't* compile with NEWLINE, but *do* want ^/$ support,
then
>> notbol/noteol need to be false? Do we care about use cases that care
about this?
>> Could we just always have them as false?
>>
>
> They are meant for cases where you call matches() on portions of a
> larger string.
> When you call it at offset 0, you want notbol=false, but when you call
> it at offset > 0 you want notbol=true.
> Ditto for noteol, when the portion of string passed to matches()
doesn't
> contain the real end of the string.
>
> If we are always passing the entire string we intend to match to
> matches() then we can remove those bools.
Ah, that makes sense. For the time being, though, I'd say lets drop
them until some client actually cares.
>> This won't work on Windows as stands (Regex.h == regex.h), but I
think we should
>> just keep regex.h private which solves the problem (still nice to
rename it to
>> avoid confusion, llvm_regex.h would be consistent).
>>
>
> Ok, I'll try to make it private.
Ok!

I have one more request; since the implementation ends up being spread
into a number of files, could we prefix them with something so they
stand out inside the Support directory?

Thanks again!

 - Daniel

Possibly Parallel Threads

Search for more maybe matching threads

llvm dev - Aug 2009 - [LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

Possibly Parallel Threads