Woot! Thanks a bunch Edwin! Some comments on the patch: -- I'm not sure if it makes sense to import the man pages, if we only expose Regex.h.> diff --git a/include/llvm/Support/Regex.h b/include/llvm/Support/Regex.h > new file mode 100644 > index 0000000..314bff4 > --- /dev/null > +++ b/include/llvm/Support/Regex.h > @@ -0,0 +1,49 @@ > +//===-- Regex.h - Regular Expression matcher implementation -------------===//Include C++ marker please (*- C++ -*-===//)> +// > +// The LLVM Compiler Infrastructure > +// > +// This file is distributed under the University of Illinois Open Source > +// License. See LICENSE.TXT for details. > +// > +//===----------------------------------------------------------------------===// > +// > +// This file implements a POSIX regular expression matcher. > +// > +//===----------------------------------------------------------------------===// > +#include "llvm/Support/DataTypes.h" > +#include "llvm/Support/regex.h"I would like to get rid of this include. What if match_sub just took a SmallVector<std::pair<size_t, size_t>> to use for the matchs (or some equivalent). I think that the regex n_subs field has the number of matches, so match_sub can do the right thing.> +namespace llvm { > + class Regex { > + public: > + enum { > + // Compile with recognition of all special characters turned off. > + // All characters are thus considered ordinary, > + // so the RE is a literal string. > + NOSPEC=1, > + // Compile for matching that ignores upper/lower case distinctions. > + ICASE=2, > + // Compile for matching that need only report success or failure, > + // not what was matched. > + NOSUB=4, > + // Compile for newline-sensitive matching. With this flag '[^' bracket > + // expressions and '.' never match newline. A ^ anchor matches the > + // null string after any newline in the string in addition to its normal > + // function, and the $ anchor matches the null string before any > + // newline in the string in addition to its normal function. > + NEWLINE=8 > + };Can you make these doxyments? Also, I'd prefer more LLVM style and less Unix style enum names, IgnoreCase for example. Does NOSPEC actually ever make sense to use? Wouldn't clients just use string compare?> + > + Regex(const char *regex, unsigned Flags=NOSUB); > + ~Regex();I think this should have an "bool isValid(std::string &Error)" or similar method, which would return the regex error if one was found during construction.> + // notbol: The match-beginning-of-line operator always fails to match, > + // unless regex was compiled with NEWLINE flag. > + // noteol: The match-end-of-line operator always fails to match. > + // unless regex was compiled with NEWLINE flag. > + > + bool matches(const char *string, bool notbol=false, bool noteol=false); > + bool match_sub(const char *string, llvm_regmatch_t pmatch[], > + unsigned nmatch, bool notbol=false, bool noteol=false);These should be doxymented, and I don't understand the flags. Does this mean that if I *don't* compile with NEWLINE, but *do* want ^/$ support, then notbol/noteol need to be false? Do we care about use cases that care about this? Could we just always have them as false? As mentioned before I think we can change the matches to be returned in a small vector or similar and avoid the dependency on llvm_regmatch_t. What about this: /// match - Match the regex against the given \arg String. /// /// \param Matches - If given, on a successful match this will be filled in with /// references to the matched group expressions (inside \arg String). /// \param AllowBeginOfLine - ... again I think we should kill this if possible ... /// \param AllowEndOfLine - ... again I think we should kill this if possible ... /// \return - True on successful match. bool match(const char *String, SmallVectorImpl<StringRef> *Matches, bool AllowBeginOfLine = true, bool AllowEndOfLine = true);> + bool match_sub(const char *string, llvm_regmatch_t pmatch[], > + unsigned nmatch, bool notbol=false, bool noteol=false);> + private: > + llvm_regex_t preg; > + }; > +}> diff --git a/include/llvm/Support/regex.h b/include/llvm/Support/regex.hThis won't work on Windows as stands (Regex.h == regex.h), but I think we should just keep regex.h private which solves the problem (still nice to rename it to avoid confusion, llvm_regex.h would be consistent). Otherwise this looks great to me and compiled file on Darwin. I too am of the opinion that we should avoid all configure magicness and just always use this, I don't see substantial value in trying to use the operating system one. If we did, it would be an obvious extension to Regex.cpp, and nothing else. -- Thanks again for working on this!!! - Daniel 2009/8/25 Török Edwin <edwintorok at gmail.com>:> On 2009-08-24 20:14, Chris Lattner wrote: >> On Aug 23, 2009, at 11:59 PM, Török Edwin wrote: >>> If LLVM is going to have an integrated regex library I suggest using it >>> regardless if the platform has one. >>> The LLVM integrated regex library will provide consistent behaviour and >>> execution time, the system one will not. >> >> Hi Edwin, >> >> Can you propose the openbsd implementation as a patch to lib/support? >> >> -Chris > > Attached is a proposed patch. > The Regex class is a fairly simple wrapper around the > llvm_regcomp/llvm_regexec functions, > maybe the interface could be improved, right now I just focused on > getting it to work. > Also the OpenBSD implementation has some extensions (see docs/regex.7), > if they are useful those can be added to the Regex class also. > > A summary of changes from OpenBSD version: > - rename functions and types to avoid clashes with system one: > regcomp->llvm_regcomp, ..., regex_t -> llvm_regex_t, regoff_t -> > llvm_regoff_t, BAD -> REGEX_BAD ... > - include strlcpy.c (llvm_strlcpy), a configure check could be added to > use system's one, but the function is so simple I just unconditionally > always use it > - memcpy -> memmove (parameters may overlap, BSD memcpy allows that but > POSIX doesn't) > - COPYRIGHT -> COPYRIGHT.regex, engine.c -> engine.inc, regex.h -> > llvm/Support/regex.h > - char* -> const char* to avoid compiler warnings > - #define DUPMAX > > There is also a unittests/Support/RegexTest.cpp that does some very > simple tests. > > Let me know how it works on your platform. > > Best regards, > --Edwin > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >
On 2009-08-25 21:18, Daniel Dunbar wrote:> Woot! Thanks a bunch Edwin! > > Some comments on the patch: > -- > I'm not sure if it makes sense to import the man pages, if we only > expose Regex.h. >I'd like to keep re_format.7, it describes the format of the regex as accepted by this implementation. I'll remove regex.3 since its not exposed.> >> diff --git a/include/llvm/Support/Regex.h b/include/llvm/Support/Regex.h >> new file mode 100644 >> index 0000000..314bff4 >> --- /dev/null >> +++ b/include/llvm/Support/Regex.h >> @@ -0,0 +1,49 @@ >> +//===-- Regex.h - Regular Expression matcher implementation -------------===// >> > > Include C++ marker please (*- C++ -*-===//) > > >> +// >> +// The LLVM Compiler Infrastructure >> +// >> +// This file is distributed under the University of Illinois Open Source >> +// License. See LICENSE.TXT for details. >> +// >> +//===----------------------------------------------------------------------===// >> +// >> +// This file implements a POSIX regular expression matcher. >> +// >> +//===----------------------------------------------------------------------===// >> +#include "llvm/Support/DataTypes.h" >> +#include "llvm/Support/regex.h" >> > > I would like to get rid of this include. What if match_sub just took a > SmallVector<std::pair<size_t, size_t>> to use for the matchs (or some > equivalent). I think that the regex n_subs field has the number of matches, so > match_sub can do the right thing. >Ok.> >> +namespace llvm { >> + class Regex { >> + public: >> + enum { >> + // Compile with recognition of all special characters turned off. >> + // All characters are thus considered ordinary, >> + // so the RE is a literal string. >> + NOSPEC=1, >> + // Compile for matching that ignores upper/lower case distinctions. >> + ICASE=2, >> + // Compile for matching that need only report success or failure, >> + // not what was matched. >> + NOSUB=4, >> + // Compile for newline-sensitive matching. With this flag '[^' bracket >> + // expressions and '.' never match newline. A ^ anchor matches the >> + // null string after any newline in the string in addition to its normal >> + // function, and the $ anchor matches the null string before any >> + // newline in the string in addition to its normal function. >> + NEWLINE=8 >> + }; >> > > Can you make these doxyments? Also, I'd prefer more LLVM style and less Unix > style enum names, IgnoreCase for example. > > Does NOSPEC actually ever make sense to use? Wouldn't clients just use string > compare? >Indeed, it is worthless for LLVM. There's also REG_PEND (non-POSIX, but this impl. supports it) that allows matching patterns with embedded NULs. Would that be needed/useful for LLVM?> >> + >> + Regex(const char *regex, unsigned Flags=NOSUB); >> + ~Regex(); >> > > I think this should have an "bool isValid(std::string &Error)" or similar > method, which would return the regex error if one was found during construction. >Ok.> >> + // notbol: The match-beginning-of-line operator always fails to match, >> + // unless regex was compiled with NEWLINE flag. >> + // noteol: The match-end-of-line operator always fails to match. >> + // unless regex was compiled with NEWLINE flag. >> + >> + bool matches(const char *string, bool notbol=false, bool noteol=false); >> + bool match_sub(const char *string, llvm_regmatch_t pmatch[], >> + unsigned nmatch, bool notbol=false, bool noteol=false); >> > > These should be doxymented, and I don't understand the flags. Does this mean > that if I *don't* compile with NEWLINE, but *do* want ^/$ support, then > notbol/noteol need to be false? Do we care about use cases that care about this? > Could we just always have them as false? >They are meant for cases where you call matches() on portions of a larger string. When you call it at offset 0, you want notbol=false, but when you call it at offset > 0 you want notbol=true. Ditto for noteol, when the portion of string passed to matches() doesn't contain the real end of the string. If we are always passing the entire string we intend to match to matches() then we can remove those bools. This is the manpage: REG_NEWLINE "Compile for newline-sensitive matching. By default, newline is a completely ordinary character with no special meaning in either REs or strings. With this flag,. [^ bracket expressions and . never match newline, a ^ achor matches the null string after any newline in the string in addition to its normal function, and the $ anchor matches the null string before any newline in the string in addition to its normal function. REG_NOTBOL The first character of the stringis not the beginning of a line, so the ^ anchor should not match before it. This does not affect the behavior of newlines under REG_NEWLINE . REG_NOTEOL The NUL terminating the string does not end a line, so the $ anchor should not match before it. This does not affect the behavior of newlines under REG_NEWLINE .> As mentioned before I think we can change the matches to be returned in a small > vector or similar and avoid the dependency on llvm_regmatch_t. > >Yes, that sounds good.> What about this: > > > /// match - Match the regex against the given \arg String. > /// > /// \param Matches - If given, on a successful match this will be filled in with > /// references to the matched group expressions (inside \arg String). > /// \param AllowBeginOfLine - ... again I think we should kill this if > possible ... > /// \param AllowEndOfLine - ... again I think we should kill this if > possible ... > /// \return - True on successful match. > bool match(const char *String, SmallVectorImpl<StringRef> *Matches, > bool AllowBeginOfLine = true, bool AllowEndOfLine = true); > > >> + bool match_sub(const char *string, llvm_regmatch_t pmatch[], >> + unsigned nmatch, bool notbol=false, bool noteol=false); >> > > > >> + private: >> + llvm_regex_t preg; >> + }; >> +} >> > > >> diff --git a/include/llvm/Support/regex.h b/include/llvm/Support/regex.h >> > > This won't work on Windows as stands (Regex.h == regex.h), but I think we should > just keep regex.h private which solves the problem (still nice to rename it to > avoid confusion, llvm_regex.h would be consistent). >Ok, I'll try to make it private.> Otherwise this looks great to me and compiled file on Darwin. I too am of the > opinion that we should avoid all configure magicness and just always use this, I > don't see substantial value in trying to use the operating system one. If we > did, it would be an obvious extension to Regex.cpp, and nothing else. >I don't see any value in using the system provided one either. Best regards, --Edwin
>> +//===-- Regex.h - Regular Expression matcher implementation -------------===// > > Include C++ marker please (*- C++ -*-===//)Just out of curiosity, what is this marker used for?
On Wed, 26 Aug 2009 09:25:30 +0200, HyperQuantum <hyperquantum at gmail.com> wrote:>>> +//===-- Regex.h - Regular Expression matcher implementation >>> -------------===// >> >> Include C++ marker please (*- C++ -*-===//) > > Just out of curiosity, what is this marker used for?I use it in my text editor so that it can distinguish .h files containing C from those containing C++. Sebastian
Bah, the mailing server still has the headers broken! Sending to the list this time, how inconvenient... On Wed, Aug 26, 2009 at 3:18 AM, Sebastian Redl<sebastian.redl at getdesigned.at> wrote:> > On Wed, 26 Aug 2009 09:25:30 +0200, HyperQuantum <hyperquantum at gmail.com> > wrote: >>>> +//===-- Regex.h - Regular Expression matcher implementation >>>> -------------===// >>> >>> Include C++ marker please (*- C++ -*-===//) >> >> Just out of curiosity, what is this marker used for? > > I use it in my text editor so that it can distinguish .h files containing C > from those containing C++.I have always been curious about that. For my own projects I have always used .hpp for C++ only headers and .h for C/C++ or just C headers. I have noticed other libraries follow the same standard, I have always been curious why some, such as LLVM, do not?
On Aug 26, 2009, at 12:25 AM, HyperQuantum wrote:>>> +//===-- Regex.h - Regular Expression matcher implementation >>> -------------===// >> >> Include C++ marker please (*- C++ -*-===//) > > Just out of curiosity, what is this marker used for?It's used by emacs to tell it what "code mode" to use when editing the file. If you didn't specify anything, it would treat the .h as a C file. The highlighting and automatic formatting emacs does then won't be correct for C++ constructs like classes and namespaces. -bw
2009/8/25 Török Edwin <edwintorok at gmail.com>:> On 2009-08-25 21:18, Daniel Dunbar wrote: >> Woot! Thanks a bunch Edwin! >> >> Some comments on the patch: >> -- >> I'm not sure if it makes sense to import the man pages, if we only >> expose Regex.h. >> > > I'd like to keep re_format.7, it describes the format of the regex as > accepted by this implementation. > I'll remove regex.3 since its not exposed.Ok.>> Can you make these doxyments? Also, I'd prefer more LLVM style and less Unix >> style enum names, IgnoreCase for example. >> >> Does NOSPEC actually ever make sense to use? Wouldn't clients just use string >> compare? >> > > Indeed, it is worthless for LLVM. > There's also REG_PEND (non-POSIX, but this impl. supports it) that > allows matching patterns > with embedded NULs. Would that be needed/useful for LLVM?Maybe? If its already there and we are fine using this version of the lib always then we might as well include it.>>> + // notbol: The match-beginning-of-line operator always fails to match, >>> + // unless regex was compiled with NEWLINE flag. >>> + // noteol: The match-end-of-line operator always fails to match. >>> + // unless regex was compiled with NEWLINE flag. >>> + >>> + bool matches(const char *string, bool notbol=false, bool noteol=false); >>> + bool match_sub(const char *string, llvm_regmatch_t pmatch[], >>> + unsigned nmatch, bool notbol=false, bool noteol=false); >>> >> >> These should be doxymented, and I don't understand the flags. Does this mean >> that if I *don't* compile with NEWLINE, but *do* want ^/$ support, then >> notbol/noteol need to be false? Do we care about use cases that care about this? >> Could we just always have them as false? >> > > They are meant for cases where you call matches() on portions of a > larger string. > When you call it at offset 0, you want notbol=false, but when you call > it at offset > 0 you want notbol=true. > Ditto for noteol, when the portion of string passed to matches() doesn't > contain the real end of the string. > > If we are always passing the entire string we intend to match to > matches() then we can remove those bools.Ah, that makes sense. For the time being, though, I'd say lets drop them until some client actually cares.>> This won't work on Windows as stands (Regex.h == regex.h), but I think we should >> just keep regex.h private which solves the problem (still nice to rename it to >> avoid confusion, llvm_regex.h would be consistent). >> > > Ok, I'll try to make it private.Ok! I have one more request; since the implementation ends up being spread into a number of files, could we prefix them with something so they stand out inside the Support directory? Thanks again! - Daniel