On Aug 23, 2009, at 9:01 PM, Daniel Berlin wrote:>> 2. Use POSIX regcomp facilities. This implies importing some >> implementation of this interface, e.g., Windows. On Linux, BSD, etc. >> we would try to use the platform version if available (and non- >> buggy). > > Don't do it. > They are ridiculous slow, and posix made some really dumb choices in > regexps.We want to use this from FileCheck, which we build at -O0 today. Also, each regex will be matched once. Most testcases use fixed strings (in fact 100% of them do today!). This really is not very performance sensitive. Regex engines like this are inherently more powerful but slower than fixed-purpose matching logic. I don't see a reason not to use a (slow!) simple regexec version. I would also prefer not to have all the crazy features. Just supporting simple matching stuff is perfectly acceptable. We don't need unicode character classes, negative assertions, etc. We don't need the full power of perl regex's.>> My personal preference is #2, and I may work on this in the near- >> term, >> but I wanted to check for conflicting opinions first. >> >> My main reasons for preferring #2 is to use a standard interface and >> language, and it should be relatively easy to implement and maintain >> (I'm assuming there is some nice BSD implementation out their we can >> snarf, I haven't actually validated this assumption). > Warning: All the builtin platform ones are actually fairly slow, and > some are quite buggy, even between different versions of the same > platform (IE different glibc versions, etc)I am more concerned about bugginess, but I doubt that affects simple regexes.> That said, if you really have the urge to have a BSD'd implementation > of regexec, at least choose something like http://laurikari.net/tre/ > which is a linear time in size of string.Nice, we need something for windows.> The Google impl was, at least if i remember right, made up of 3 > engines: 1 that was a special case matcher that handled plain old > anchored strings super fast, 1 that dealt with regular expressions > that did not require backtracking, and 1 that dealt with regular > expressions that did.FWIW, I plan to have different syntax in filecheck for fixed string matching vs regex matching. This should eliminate most of the common reasons for needing "leaning toothpicks".> If you are curious, see http://swtch.com/~rsc/regexp/ for how bad the > other libraries can get on fairly simple regexps because they don't > use linear time matchers.Yep, I'm very familiar with that paper. Thanks Danny, -Chris
On Sun, Aug 23, 2009 at 10:20 PM, Chris Lattner<clattner at apple.com> wrote:> On Aug 23, 2009, at 9:11 PM, OvermindDL1 wrote: >>> >>> Again, forget boost regex. :) >> >> What about std::regex? > > No, we have to build with c++'98 compilers. I think you're missing the > point here. We care about code size in llvm, and the best code size you can > get is to link to the one already in your address space because it's part of > libc.There are multiple ones that have been created that fullfill std::regex that work just fine on C++98, such as boost::regex and dinkumware's and I know there is at least one other. On Sun, Aug 23, 2009 at 10:28 PM, Chris Lattner<clattner at apple.com> wrote:> On Aug 23, 2009, at 9:01 PM, Daniel Berlin wrote: >>> 2. Use POSIX regcomp facilities. This implies importing some >>> implementation of this interface, e.g., Windows. On Linux, BSD, etc. >>> we would try to use the platform version if available (and non- >>> buggy). >> >> Don't do it. >> They are ridiculous slow, and posix made some really dumb choices in >> regexps. > > We want to use this from FileCheck, which we build at -O0 today. > Also, each regex will be matched once. Most testcases use fixed > strings (in fact 100% of them do today!). This really is not very > performance sensitive. > > Regex engines like this are inherently more powerful but slower than > fixed-purpose matching logic. I don't see a reason not to use a > (slow!) simple regexec version. > > I would also prefer not to have all the crazy features. Just > supporting simple matching stuff is perfectly acceptable. We don't > need unicode character classes, negative assertions, etc. We don't > need the full power of perl regex's.Again, why not Spirit2.1, works just fine on C++98, and it is fast, and it is split up into the smallest bits so you only include what you use, and the assembly it compiles into is *very* tiny, far far less then any regex library could possibly be. On Sun, Aug 23, 2009 at 10:28 PM, Chris Lattner<clattner at apple.com> wrote:> I am more concerned about bugginess, but I doubt that affects simple > regexes.Spirit2.1 has testcases for every possible aspect of it, thoroughly tested and well proven. On Sun, Aug 23, 2009 at 10:28 PM, Chris Lattner<clattner at apple.com> wrote:>> That said, if you really have the urge to have a BSD'd implementation >> of regexec, at least choose something like http://laurikari.net/tre/ >> which is a linear time in size of string. > > Nice, we need something for windows.This is my whole push, if just one thing is used for all platforms, then you do not have to worry about something working correctly on one platform, but being slightly different on another and so forth. I really do not care what we use, just as long as it is the same thing for all platforms, not weird platform specific things that might have different hidden bugs that may not always interoperate.
On Aug 23, 2009, at 9:50 PM, OvermindDL1 wrote:> On Sun, Aug 23, 2009 at 10:20 PM, Chris Lattner<clattner at apple.com> > wrote: >> On Aug 23, 2009, at 9:11 PM, OvermindDL1 wrote: >>>> >>>> Again, forget boost regex. :) >>> >>> What about std::regex? >> >> No, we have to build with c++'98 compilers. I think you're missing >> the >> point here. We care about code size in llvm, and the best code >> size you can >> get is to link to the one already in your address space because >> it's part of >> libc. > > There are multiple ones that have been created that fullfill > std::regex that work just fine on C++98, such as boost::regex and > dinkumware's and I know there is at least one other.I think you're missing the whole "we value small code size more than speed of matching" thing. -Chris
Hello LLVM Devs, I thought I'd weigh in on some of these non-backtracking linear time RegEx algorithms. If they're anything like the PackRat parsing algorithms they take at least 4x the amount of memory in terms of storage as the string length itself by not backtracking. That should be fine for small RegExes but it wouldn't do so well for more elaborate and long expressions. If what you need is something for a short regex, a packrat algorithm will do fine. A 256-character string may bloat up to a couple of KB and still be reasonably cheap. If you're looking for a full parser, Spirit2.x might be better. --Sam
OvermindDL1 wrote:> Again, why not Spirit2.1, works just fine on C++98, and it is fast, > and it is split up into the smallest bits so you only include what you > use, and the assembly it compiles into is *very* tiny, far far less > then any regex library could possibly be. >Spirit is not an option for one simple reason: FileCheck needs to parse regexes from its instruction comments at runtime and execute them. It has no idea, at compile time, what these regexes will be. Sebastian
On 2009-08-24 07:28, Chris Lattner wrote:> On Aug 23, 2009, at 9:01 PM, Daniel Berlin wrote: > >>> 2. Use POSIX regcomp facilities. This implies importing some >>> implementation of this interface, e.g., Windows. On Linux, BSD, etc. >>> we would try to use the platform version if available (and non- >>> buggy). >>> >> Don't do it. >> They are ridiculous slow, and posix made some really dumb choices in >> regexps. >> > > We want to use this from FileCheck, which we build at -O0 today. > Also, each regex will be matched once. Most testcases use fixed > strings (in fact 100% of them do today!). This really is not very > performance sensitive. > > Regex engines like this are inherently more powerful but slower than > fixed-purpose matching logic. I don't see a reason not to use a > (slow!) simple regexec version. >I agree with Daniel Berlin, system's regcomp/regexec shouldn't be used. Slow can mean taking 30 minutes, or hanging indefinetely on some platforms with buggy regcomp/regexec. Some examples: https://wwws.clamav.net/bugzilla/show_bug.cgi?id=497 https://wwws.clamav.net/bugzilla/show_bug.cgi?id=598 https://wwws.clamav.net/bugzilla/show_bug.cgi?id=635 https://wwws.clamav.net/bugzilla/show_bug.cgi?id=658 https://wwws.clamav.net/bugzilla/show_bug.cgi?id=679 http://bugs.opensolaris.org/bugdatabase/printableBug.do?bug_id=4346175 That is why in ClamAV we are using the OpenBSD implementation of regcomp/regexec, regardless if the system has a regcomp/regexec available. The code is fairly small (~100k), BSD licensed, easy to make it portable (memmove->memcpy, pull in strlcpy impl., etc.) and doesn't explode in execution time. I wasn't aware of Google's regexp library at that time (perhaps it didn't exist yet), but if it provides linear execution time that sounds good too. If LLVM is going to have an integrated regex library I suggest using it regardless if the platform has one. The LLVM integrated regex library will provide consistent behaviour and execution time, the system one will not. Best regards, --Edwin
On Aug 23, 2009, at 11:59 PM, Török Edwin wrote:> If LLVM is going to have an integrated regex library I suggest using > it > regardless if the platform has one. > The LLVM integrated regex library will provide consistent behaviour > and > execution time, the system one will not.Hi Edwin, Can you propose the openbsd implementation as a patch to lib/support? -Chris
On Sun, Aug 23, 2009 at 11:04 PM, Samuel Crow<samuraileumas at yahoo.com> wrote:> Hello LLVM Devs, > > I thought I'd weigh in on some of these non-backtracking linear time RegEx algorithms. If they're anything like the PackRat parsing algorithms they take at least 4x the amount of memory in terms of storage as the string length itself by not backtracking. That should be fine for small RegExes but it wouldn't do so well for more elaborate and long expressions. > > If what you need is something for a short regex, a packrat algorithm will do fine. A 256-character string may bloat up to a couple of KB and still be reasonably cheap. If you're looking for a full parser, Spirit2.x might be better.They stated later on (why not earlier I have no clue, but this is why I switched away from pushing spirit and back to xpressive or something) that the regex strings they look up are in the files it reads rather then embedded. If it was embedded in the code itself, then yes, by far, they should use Spirit2.1. As long as they pick a library that works everywhere and not just on certain platforms I will be quite happy. :)
> > I wasn't aware of Google's regexp library at that time (perhaps it > didn't exist yet), but if it provides linear execution time that sounds > good too.It was internal at the time. The external version in v8 sadly does code generation. I could look into open sourcing the non-code-generating simple version we use, if you want. But given that speed is not a concern, might as well just use openbsd's. Our bug database internal to google had plenty of bugs much like clamav's. You'd think it would be hard to screw up simple string matching regexps, but apparently you'd be wrong.