thr3ads.net - llvm dev - [LLVMdev] Regular Expression lib support [Aug 2009]

If this information is useful, please help other people find it:
Share via:

Daniel Dunbar

2009-Aug-23 21:56 UTC

[LLVMdev] Regular Expression lib support

We would like to have access to some kind of regular expression
library inside LLVM. For example, we need this to extend the FileCheck
test case checking tool to support regular expressions.

There are three obvious options:
  1. Roll our own library. Multiple unnamed individuals may even
already have implementations lying around! :)

  2. Use POSIX regcomp facilities. This implies importing some
implementation of this interface, e.g., Windows. On Linux, BSD, etc.
we would try to use the platform version if available (and non-buggy).

  3. Import a more heavy weight library such as PCRE, and use it universally.

My personal preference is #2, and I may work on this in the near-term,
but I wanted to check for conflicting opinions first.

My main reasons for preferring #2 is to use a standard interface and
language, and it should be relatively easy to implement and maintain
(I'm assuming there is some nice BSD implementation out their we can
snarf, I haven't actually validated this assumption).

The main downside of #2 vs #1 or #3 is that if we wanted fancier
regexp features we would be somewhat stuck, but I don't think this is
a significant concern.

Ok?

 - Daniel

Kenneth Uildriks

2009-Aug-23 22:29 UTC

head link

[LLVMdev] Regular Expression lib support

On Sun, Aug 23, 2009 at 4:56 PM, Daniel Dunbar<daniel at zuster.org>
wrote:> We would like to have access to some kind of regular expression
> library inside LLVM. For example, we need this to extend the FileCheck
> test case checking tool to support regular expressions.
>
> There are three obvious options:
>  1. Roll our own library. Multiple unnamed individuals may even
> already have implementations lying around! :)
>  2. Use POSIX regcomp facilities. This implies importing some
> implementation of this interface, e.g., Windows. On Linux, BSD, etc.
> we would try to use the platform version if available (and non-buggy).
>
>  3. Import a more heavy weight library such as PCRE, and use it
universally.

Personally, I'm a big fan of the Boost libraries.  They've got a regex
library, and a full-blown parser library (which I am using in my
front-end).  It's definitely heavier than POSIX, but it's portable,
well-tested, and loaded with features.

>  - Daniel
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

OvermindDL1

2009-Aug-23 23:19 UTC

head link

[LLVMdev] Regular Expression lib support

GAH!  Why on bloody earth does the LLVM mailing list server not set
the response address to the LLVM mailing list server so I actually
respond to the server instead of the person above me, that is so
different then the other 20+ mailing lists I am subscribed to and I
could have sworn the LLVM mailing list was not broken like that
before...  My message is below:

On Sun, Aug 23, 2009 at 4:29 PM, Kenneth Uildriks<kennethuil at gmail.com>
wrote:> On Sun, Aug 23, 2009 at 4:56 PM, Daniel Dunbar<daniel at zuster.org>
wrote:
>> We would like to have access to some kind of regular expression
>> library inside LLVM. For example, we need this to extend the FileCheck
>> test case checking tool to support regular expressions.
>>
>> There are three obvious options:
>>  1. Roll our own library. Multiple unnamed individuals may even
>> already have implementations lying around! :)
>>  2. Use POSIX regcomp facilities. This implies importing some
>> implementation of this interface, e.g., Windows. On Linux, BSD, etc.
>> we would try to use the platform version if available (and non-buggy).
>>
>>  3. Import a more heavy weight library such as PCRE, and use it
universally.
>
>
> Personally, I'm a big fan of the Boost libraries.  They've got a
regex
> library, and a full-blown parser library (which I am using in my
> front-end).  It's definitely heavier than POSIX, but it's portable,
> well-tested, and loaded with features.
Boost.Xpressive supports both dynamic and static regex's, what that
means is that you can use a regex dynamically (as a string), or you
can create it statically (by building up the AST in a *very*
easy-to-use way).  Honestly, I prefer Boost.Spirit though, which is a
PEG parser, it is pure static and it compiles faster then just about
anything else yet found in existence (we have been testing it on
everything, even the string->int simple parser is faster then atoi,
and it just blows yax/etc... away, no comparison).  PEG's have a
syntax very much like regex, however they are recursable, and fully
greedy with unlimited lookahead.  If you want to include
Boost.Spirit2.1 (and I can easily rip it out so you would just need to
include it and the parts of it that it requires, you would not need to
include boost, and the license not only allows doing this it
encourages it), then I can help you integrate it.  I have been using
Spirit2.1 for a long time and have helped in its creation so I know it
quite well now and would be happy to help include it in any/all of the
needs you would need it here in, and yes, I can just about guarantee
that it will be faster then anything else you could possibly use, and
it is very easy to learn, especially if you already know regex
(although you will have to note that everything is greedy, this is a
'good thing' in PEG grammars, it contributes to their very high speed,
and remember that PEG's can be recursive).  Spirit2.1 also compiles to
*very* tight code (being smaller and faster then identical optimized
hand-coded parsers that a certain company wanted us to compare it
with), it may increase compile time, but you get huge runtime
increases, as well as Spirit2.1 and all it depends on are header-only
files, there are no other libs to include, just headers.

OvermindDL1

2009-Aug-23 23:22 UTC

head link

[LLVMdev] Regular Expression lib support

On Sun, Aug 23, 2009 at 5:17 PM, OvermindDL1<overminddl1 at gmail.com>
wrote:> easy-to-use way).  Honestly, I prefer Boost.Spirit though, which is a
> PEG parser, it is pure static and it compiles faster then just about
Er, by compiles faster I meant to say "executes faster", its compiling
time is a minute or two for the usual grammar, a few seconds for
something simple like string->int.

Also, Spirit also has a generater part, it can take a data structure
and save it as a text stream, this is also blazingly fast.

Daniel Dunbar

2009-Aug-24 00:32 UTC

head link

[LLVMdev] Regular Expression lib support

On Sun, Aug 23, 2009 at 3:29 PM, Kenneth Uildriks<kennethuil at gmail.com>
wrote:> On Sun, Aug 23, 2009 at 4:56 PM, Daniel Dunbar<daniel at zuster.org>
wrote:
>> We would like to have access to some kind of regular expression
>> library inside LLVM. For example, we need this to extend the FileCheck
>> test case checking tool to support regular expressions.
>>
>> There are three obvious options:
>>  1. Roll our own library. Multiple unnamed individuals may even
>> already have implementations lying around! :)
>>  2. Use POSIX regcomp facilities. This implies importing some
>> implementation of this interface, e.g., Windows. On Linux, BSD, etc.
>> we would try to use the platform version if available (and non-buggy).
>>
>>  3. Import a more heavy weight library such as PCRE, and use it
universally.
>
>
> Personally, I'm a big fan of the Boost libraries.  They've got a
regex
> library, and a full-blown parser library (which I am using in my
> front-end).  It's definitely heavier than POSIX, but it's portable,
> well-tested, and loaded with features.
This is too heavy, and we don't need the extra features, and regexec
is well tested and much more standard. Unless there is an overwhelming
agreement to add another option, I'd like to keep the discussion to
the obvious choices. That is, I need to be convinced *not* to use #2,
before I get derailed into discussing which form of #3 to take.

 - Daniel

Daniel Berlin

2009-Aug-24 04:01 UTC

head link

[LLVMdev] Regular Expression lib support

On Sun, Aug 23, 2009 at 5:56 PM, Daniel Dunbar<daniel at zuster.org>
wrote:> We would like to have access to some kind of regular expression
> library inside LLVM. For example, we need this to extend the FileCheck
> test case checking tool to support regular expressions.
>
> There are three obvious options:
>  1. Roll our own library. Multiple unnamed individuals may even
> already have implementations lying around! :)
>
>  2. Use POSIX regcomp facilities. This implies importing some
> implementation of this interface, e.g., Windows. On Linux, BSD, etc.
> we would try to use the platform version if available (and non-buggy).
Don't do it.
They are ridiculous slow, and posix made some really dumb choices in regexps.
>
>  3. Import a more heavy weight library such as PCRE, and use it
universally.PCRE is actually quite slow too.
Google released (as part of V8, I believe) an almost-always
linear-time regular expression library (written by Russ Cox and Rob
Pike).

It's quite small code wise (unlike boost), quite fast (and supports
perl style regexps).  I'll see if i can get a pointer to it.
>
> My personal preference is #2, and I may work on this in the near-term,
> but I wanted to check for conflicting opinions first.
>
> My main reasons for preferring #2 is to use a standard interface and
> language, and it should be relatively easy to implement and maintain
> (I'm assuming there is some nice BSD implementation out their we can
> snarf, I haven't actually validated this assumption).Warning: All the builtin platform ones are actually fairly slow, and
some are quite buggy, even between different versions of the same
platform (IE different glibc versions, etc) Google went through this
before originally settling on something that was at least consistently
buggy on all platforms (PCRE) and then moving to writing one that was
much better (what was released in V8, though i think that one may be
different than the exact one used internally).  PCRE could easily blow
the stack in some cases, sadly.

FWIW: Most people are much more familiar with perl  tyle regexps, than
they are familiar with posix style.

That said, if you really have the urge to have a BSD'd implementation
of regexec, at least choose something like http://laurikari.net/tre/
which is a linear time in size of string.

I've never performance timed it myself, but Russ Cox says it is
"efficient" ;)

The Google impl was, at least if i remember right, made up of 3
engines: 1 that was a special case matcher that handled plain old
anchored strings super fast, 1 that dealt with regular expressions
that did not require backtracking, and 1 that dealt with regular
expressions that did.

If you are curious, see http://swtch.com/~rsc/regexp/ for how bad the
other libraries can get on fairly simple regexps because they don't
use linear time matchers.

Chris Lattner

2009-Aug-24 04:28 UTC

head link

[LLVMdev] Regular Expression lib support

On Aug 23, 2009, at 9:01 PM, Daniel Berlin wrote:>>  2. Use POSIX regcomp facilities. This implies importing some
>> implementation of this interface, e.g., Windows. On Linux, BSD, etc.
>> we would try to use the platform version if available (and non- 
>> buggy).
>
> Don't do it.
> They are ridiculous slow, and posix made some really dumb choices in  
> regexps.
We want to use this from FileCheck, which we build at -O0 today.   
Also, each regex will be matched once.  Most testcases use fixed  
strings (in fact 100% of them do today!).  This really is not very  
performance sensitive.

Regex engines like this are inherently more powerful but slower than  
fixed-purpose matching logic.  I don't see a reason not to use a  
(slow!) simple regexec version.

I would also prefer not to have all the crazy features.  Just  
supporting simple matching stuff is perfectly acceptable.  We don't  
need unicode character classes, negative assertions, etc.  We don't  
need the full power of perl regex's.
>> My personal preference is #2, and I may work on this in the near- 
>> term,
>> but I wanted to check for conflicting opinions first.
>>
>> My main reasons for preferring #2 is to use a standard interface and
>> language, and it should be relatively easy to implement and maintain
>> (I'm assuming there is some nice BSD implementation out their we
can
>> snarf, I haven't actually validated this assumption).
> Warning: All the builtin platform ones are actually fairly slow, and
> some are quite buggy, even between different versions of the same
> platform (IE different glibc versions, etc)
I am more concerned about bugginess, but I doubt that affects simple  
regexes.
> That said, if you really have the urge to have a BSD'd implementation
> of regexec, at least choose something like http://laurikari.net/tre/
> which is a linear time in size of string.
Nice, we need something for windows.
> The Google impl was, at least if i remember right, made up of 3
> engines: 1 that was a special case matcher that handled plain old
> anchored strings super fast, 1 that dealt with regular expressions
> that did not require backtracking, and 1 that dealt with regular
> expressions that did.
FWIW, I plan to have different syntax in filecheck for fixed string  
matching vs regex matching.  This should eliminate most of the common  
reasons for needing "leaning toothpicks".
> If you are curious, see http://swtch.com/~rsc/regexp/ for how bad the
> other libraries can get on fairly simple regexps because they don't
> use linear time matchers.
Yep, I'm very familiar with that paper.  Thanks Danny,

-Chris

Reasonably Related Threads

Search for more apparently analagous threads

llvm dev - Aug 2009 - [LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

Reasonably Related Threads