We would like to have access to some kind of regular expression library inside LLVM. For example, we need this to extend the FileCheck test case checking tool to support regular expressions. There are three obvious options: 1. Roll our own library. Multiple unnamed individuals may even already have implementations lying around! :) 2. Use POSIX regcomp facilities. This implies importing some implementation of this interface, e.g., Windows. On Linux, BSD, etc. we would try to use the platform version if available (and non-buggy). 3. Import a more heavy weight library such as PCRE, and use it universally. My personal preference is #2, and I may work on this in the near-term, but I wanted to check for conflicting opinions first. My main reasons for preferring #2 is to use a standard interface and language, and it should be relatively easy to implement and maintain (I'm assuming there is some nice BSD implementation out their we can snarf, I haven't actually validated this assumption). The main downside of #2 vs #1 or #3 is that if we wanted fancier regexp features we would be somewhat stuck, but I don't think this is a significant concern. Ok? - Daniel
On Sun, Aug 23, 2009 at 4:56 PM, Daniel Dunbar<daniel at zuster.org> wrote:> We would like to have access to some kind of regular expression > library inside LLVM. For example, we need this to extend the FileCheck > test case checking tool to support regular expressions. > > There are three obvious options: > 1. Roll our own library. Multiple unnamed individuals may even > already have implementations lying around! :) > 2. Use POSIX regcomp facilities. This implies importing some > implementation of this interface, e.g., Windows. On Linux, BSD, etc. > we would try to use the platform version if available (and non-buggy). > > 3. Import a more heavy weight library such as PCRE, and use it universally.Personally, I'm a big fan of the Boost libraries. They've got a regex library, and a full-blown parser library (which I am using in my front-end). It's definitely heavier than POSIX, but it's portable, well-tested, and loaded with features.> - Daniel > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >
GAH! Why on bloody earth does the LLVM mailing list server not set the response address to the LLVM mailing list server so I actually respond to the server instead of the person above me, that is so different then the other 20+ mailing lists I am subscribed to and I could have sworn the LLVM mailing list was not broken like that before... My message is below: On Sun, Aug 23, 2009 at 4:29 PM, Kenneth Uildriks<kennethuil at gmail.com> wrote:> On Sun, Aug 23, 2009 at 4:56 PM, Daniel Dunbar<daniel at zuster.org> wrote: >> We would like to have access to some kind of regular expression >> library inside LLVM. For example, we need this to extend the FileCheck >> test case checking tool to support regular expressions. >> >> There are three obvious options: >> 1. Roll our own library. Multiple unnamed individuals may even >> already have implementations lying around! :) >> 2. Use POSIX regcomp facilities. This implies importing some >> implementation of this interface, e.g., Windows. On Linux, BSD, etc. >> we would try to use the platform version if available (and non-buggy). >> >> 3. Import a more heavy weight library such as PCRE, and use it universally. > > > Personally, I'm a big fan of the Boost libraries. They've got a regex > library, and a full-blown parser library (which I am using in my > front-end). It's definitely heavier than POSIX, but it's portable, > well-tested, and loaded with features.Boost.Xpressive supports both dynamic and static regex's, what that means is that you can use a regex dynamically (as a string), or you can create it statically (by building up the AST in a *very* easy-to-use way). Honestly, I prefer Boost.Spirit though, which is a PEG parser, it is pure static and it compiles faster then just about anything else yet found in existence (we have been testing it on everything, even the string->int simple parser is faster then atoi, and it just blows yax/etc... away, no comparison). PEG's have a syntax very much like regex, however they are recursable, and fully greedy with unlimited lookahead. If you want to include Boost.Spirit2.1 (and I can easily rip it out so you would just need to include it and the parts of it that it requires, you would not need to include boost, and the license not only allows doing this it encourages it), then I can help you integrate it. I have been using Spirit2.1 for a long time and have helped in its creation so I know it quite well now and would be happy to help include it in any/all of the needs you would need it here in, and yes, I can just about guarantee that it will be faster then anything else you could possibly use, and it is very easy to learn, especially if you already know regex (although you will have to note that everything is greedy, this is a 'good thing' in PEG grammars, it contributes to their very high speed, and remember that PEG's can be recursive). Spirit2.1 also compiles to *very* tight code (being smaller and faster then identical optimized hand-coded parsers that a certain company wanted us to compare it with), it may increase compile time, but you get huge runtime increases, as well as Spirit2.1 and all it depends on are header-only files, there are no other libs to include, just headers.
On Sun, Aug 23, 2009 at 5:17 PM, OvermindDL1<overminddl1 at gmail.com> wrote:> easy-to-use way). Honestly, I prefer Boost.Spirit though, which is a > PEG parser, it is pure static and it compiles faster then just aboutEr, by compiles faster I meant to say "executes faster", its compiling time is a minute or two for the usual grammar, a few seconds for something simple like string->int. Also, Spirit also has a generater part, it can take a data structure and save it as a text stream, this is also blazingly fast.
On Sun, Aug 23, 2009 at 3:29 PM, Kenneth Uildriks<kennethuil at gmail.com> wrote:> On Sun, Aug 23, 2009 at 4:56 PM, Daniel Dunbar<daniel at zuster.org> wrote: >> We would like to have access to some kind of regular expression >> library inside LLVM. For example, we need this to extend the FileCheck >> test case checking tool to support regular expressions. >> >> There are three obvious options: >> 1. Roll our own library. Multiple unnamed individuals may even >> already have implementations lying around! :) >> 2. Use POSIX regcomp facilities. This implies importing some >> implementation of this interface, e.g., Windows. On Linux, BSD, etc. >> we would try to use the platform version if available (and non-buggy). >> >> 3. Import a more heavy weight library such as PCRE, and use it universally. > > > Personally, I'm a big fan of the Boost libraries. They've got a regex > library, and a full-blown parser library (which I am using in my > front-end). It's definitely heavier than POSIX, but it's portable, > well-tested, and loaded with features.This is too heavy, and we don't need the extra features, and regexec is well tested and much more standard. Unless there is an overwhelming agreement to add another option, I'd like to keep the discussion to the obvious choices. That is, I need to be convinced *not* to use #2, before I get derailed into discussing which form of #3 to take. - Daniel
On Sun, Aug 23, 2009 at 5:56 PM, Daniel Dunbar<daniel at zuster.org> wrote:> We would like to have access to some kind of regular expression > library inside LLVM. For example, we need this to extend the FileCheck > test case checking tool to support regular expressions. > > There are three obvious options: > 1. Roll our own library. Multiple unnamed individuals may even > already have implementations lying around! :) >> 2. Use POSIX regcomp facilities. This implies importing some > implementation of this interface, e.g., Windows. On Linux, BSD, etc. > we would try to use the platform version if available (and non-buggy).Don't do it. They are ridiculous slow, and posix made some really dumb choices in regexps.> > 3. Import a more heavy weight library such as PCRE, and use it universally.PCRE is actually quite slow too. Google released (as part of V8, I believe) an almost-always linear-time regular expression library (written by Russ Cox and Rob Pike). It's quite small code wise (unlike boost), quite fast (and supports perl style regexps). I'll see if i can get a pointer to it.> > My personal preference is #2, and I may work on this in the near-term, > but I wanted to check for conflicting opinions first. > > My main reasons for preferring #2 is to use a standard interface and > language, and it should be relatively easy to implement and maintain > (I'm assuming there is some nice BSD implementation out their we can > snarf, I haven't actually validated this assumption).Warning: All the builtin platform ones are actually fairly slow, and some are quite buggy, even between different versions of the same platform (IE different glibc versions, etc) Google went through this before originally settling on something that was at least consistently buggy on all platforms (PCRE) and then moving to writing one that was much better (what was released in V8, though i think that one may be different than the exact one used internally). PCRE could easily blow the stack in some cases, sadly. FWIW: Most people are much more familiar with perl tyle regexps, than they are familiar with posix style. That said, if you really have the urge to have a BSD'd implementation of regexec, at least choose something like http://laurikari.net/tre/ which is a linear time in size of string. I've never performance timed it myself, but Russ Cox says it is "efficient" ;) The Google impl was, at least if i remember right, made up of 3 engines: 1 that was a special case matcher that handled plain old anchored strings super fast, 1 that dealt with regular expressions that did not require backtracking, and 1 that dealt with regular expressions that did. If you are curious, see http://swtch.com/~rsc/regexp/ for how bad the other libraries can get on fairly simple regexps because they don't use linear time matchers.
On Aug 23, 2009, at 9:01 PM, Daniel Berlin wrote:>> 2. Use POSIX regcomp facilities. This implies importing some >> implementation of this interface, e.g., Windows. On Linux, BSD, etc. >> we would try to use the platform version if available (and non- >> buggy). > > Don't do it. > They are ridiculous slow, and posix made some really dumb choices in > regexps.We want to use this from FileCheck, which we build at -O0 today. Also, each regex will be matched once. Most testcases use fixed strings (in fact 100% of them do today!). This really is not very performance sensitive. Regex engines like this are inherently more powerful but slower than fixed-purpose matching logic. I don't see a reason not to use a (slow!) simple regexec version. I would also prefer not to have all the crazy features. Just supporting simple matching stuff is perfectly acceptable. We don't need unicode character classes, negative assertions, etc. We don't need the full power of perl regex's.>> My personal preference is #2, and I may work on this in the near- >> term, >> but I wanted to check for conflicting opinions first. >> >> My main reasons for preferring #2 is to use a standard interface and >> language, and it should be relatively easy to implement and maintain >> (I'm assuming there is some nice BSD implementation out their we can >> snarf, I haven't actually validated this assumption). > Warning: All the builtin platform ones are actually fairly slow, and > some are quite buggy, even between different versions of the same > platform (IE different glibc versions, etc)I am more concerned about bugginess, but I doubt that affects simple regexes.> That said, if you really have the urge to have a BSD'd implementation > of regexec, at least choose something like http://laurikari.net/tre/ > which is a linear time in size of string.Nice, we need something for windows.> The Google impl was, at least if i remember right, made up of 3 > engines: 1 that was a special case matcher that handled plain old > anchored strings super fast, 1 that dealt with regular expressions > that did not require backtracking, and 1 that dealt with regular > expressions that did.FWIW, I plan to have different syntax in filecheck for fixed string matching vs regex matching. This should eliminate most of the common reasons for needing "leaning toothpicks".> If you are curious, see http://swtch.com/~rsc/regexp/ for how bad the > other libraries can get on fairly simple regexps because they don't > use linear time matchers.Yep, I'm very familiar with that paper. Thanks Danny, -Chris