thr3ads.net - llvm dev - [LLVMdev] Regular Expression lib support [Aug 2009]

If this information is useful, please help other people find it:
Share via:

Chris Lattner

2009-Aug-24 04:28 UTC

[LLVMdev] Regular Expression lib support

On Aug 23, 2009, at 9:01 PM, Daniel Berlin wrote:>>  2. Use POSIX regcomp facilities. This implies importing some
>> implementation of this interface, e.g., Windows. On Linux, BSD, etc.
>> we would try to use the platform version if available (and non- 
>> buggy).
>
> Don't do it.
> They are ridiculous slow, and posix made some really dumb choices in  
> regexps.
We want to use this from FileCheck, which we build at -O0 today.   
Also, each regex will be matched once.  Most testcases use fixed  
strings (in fact 100% of them do today!).  This really is not very  
performance sensitive.

Regex engines like this are inherently more powerful but slower than  
fixed-purpose matching logic.  I don't see a reason not to use a  
(slow!) simple regexec version.

I would also prefer not to have all the crazy features.  Just  
supporting simple matching stuff is perfectly acceptable.  We don't  
need unicode character classes, negative assertions, etc.  We don't  
need the full power of perl regex's.
>> My personal preference is #2, and I may work on this in the near- 
>> term,
>> but I wanted to check for conflicting opinions first.
>>
>> My main reasons for preferring #2 is to use a standard interface and
>> language, and it should be relatively easy to implement and maintain
>> (I'm assuming there is some nice BSD implementation out their we
can
>> snarf, I haven't actually validated this assumption).
> Warning: All the builtin platform ones are actually fairly slow, and
> some are quite buggy, even between different versions of the same
> platform (IE different glibc versions, etc)
I am more concerned about bugginess, but I doubt that affects simple  
regexes.
> That said, if you really have the urge to have a BSD'd implementation
> of regexec, at least choose something like http://laurikari.net/tre/
> which is a linear time in size of string.
Nice, we need something for windows.
> The Google impl was, at least if i remember right, made up of 3
> engines: 1 that was a special case matcher that handled plain old
> anchored strings super fast, 1 that dealt with regular expressions
> that did not require backtracking, and 1 that dealt with regular
> expressions that did.
FWIW, I plan to have different syntax in filecheck for fixed string  
matching vs regex matching.  This should eliminate most of the common  
reasons for needing "leaning toothpicks".
> If you are curious, see http://swtch.com/~rsc/regexp/ for how bad the
> other libraries can get on fairly simple regexps because they don't
> use linear time matchers.
Yep, I'm very familiar with that paper.  Thanks Danny,

-Chris

OvermindDL1

2009-Aug-24 04:50 UTC

head link

[LLVMdev] Regular Expression lib support

On Sun, Aug 23, 2009 at 10:20 PM, Chris Lattner<clattner at apple.com>
wrote:> On Aug 23, 2009, at 9:11 PM, OvermindDL1 wrote:
>>>
>>> Again, forget boost regex. :)
>>
>> What about std::regex?
>
> No, we have to build with c++'98 compilers.  I think you're missing
the
> point here.  We care about code size in llvm, and the best code size you
can
> get is to link to the one already in your address space because it's
part of
> libc.
There are multiple ones that have been created that fullfill
std::regex that work just fine on C++98, such as boost::regex and
dinkumware's and I know there is at least one other.


On Sun, Aug 23, 2009 at 10:28 PM, Chris Lattner<clattner at apple.com>
wrote:> On Aug 23, 2009, at 9:01 PM, Daniel Berlin wrote:
>>>  2. Use POSIX regcomp facilities. This implies importing some
>>> implementation of this interface, e.g., Windows. On Linux, BSD,
etc.
>>> we would try to use the platform version if available (and non-
>>> buggy).
>>
>> Don't do it.
>> They are ridiculous slow, and posix made some really dumb choices in
>> regexps.
>
> We want to use this from FileCheck, which we build at -O0 today.
> Also, each regex will be matched once.  Most testcases use fixed
> strings (in fact 100% of them do today!).  This really is not very
> performance sensitive.
>
> Regex engines like this are inherently more powerful but slower than
> fixed-purpose matching logic.  I don't see a reason not to use a
> (slow!) simple regexec version.
>
> I would also prefer not to have all the crazy features.  Just
> supporting simple matching stuff is perfectly acceptable.  We don't
> need unicode character classes, negative assertions, etc.  We don't
> need the full power of perl regex's.
Again, why not Spirit2.1, works just fine on C++98, and it is fast,
and it is split up into the smallest bits so you only include what you
use, and the assembly it compiles into is *very* tiny, far far less
then any regex library could possibly be.

On Sun, Aug 23, 2009 at 10:28 PM, Chris Lattner<clattner at apple.com>
wrote:> I am more concerned about bugginess, but I doubt that affects simple
> regexes.
Spirit2.1 has testcases for every possible aspect of it, thoroughly
tested and well proven.


On Sun, Aug 23, 2009 at 10:28 PM, Chris Lattner<clattner at apple.com>
wrote:>> That said, if you really have the urge to have a BSD'd
implementation
>> of regexec, at least choose something like http://laurikari.net/tre/
>> which is a linear time in size of string.
>
> Nice, we need something for windows.
This is my whole push, if just one thing is used for all platforms,
then you do not have to worry about something working correctly on one
platform, but being slightly different on another and so forth.  I
really do not care what we use, just as long as it is the same thing
for all platforms, not weird platform specific things that might have
different hidden bugs that may not always interoperate.

Chris Lattner

2009-Aug-24 04:58 UTC

head link

[LLVMdev] Regular Expression lib support

On Aug 23, 2009, at 9:50 PM, OvermindDL1 wrote:
> On Sun, Aug 23, 2009 at 10:20 PM, Chris Lattner<clattner at
apple.com>
> wrote:
>> On Aug 23, 2009, at 9:11 PM, OvermindDL1 wrote:
>>>>
>>>> Again, forget boost regex. :)
>>>
>>> What about std::regex?
>>
>> No, we have to build with c++'98 compilers.  I think you're
missing
>> the
>> point here.  We care about code size in llvm, and the best code  
>> size you can
>> get is to link to the one already in your address space because  
>> it's part of
>> libc.
>
> There are multiple ones that have been created that fullfill
> std::regex that work just fine on C++98, such as boost::regex and
> dinkumware's and I know there is at least one other.
I think you're missing the whole "we value small code size more than  
speed of matching" thing.

-Chris

Samuel Crow

2009-Aug-24 05:04 UTC

head link

[LLVMdev] Regular Expression lib support

Hello LLVM Devs,

I thought I'd weigh in on some of these non-backtracking linear time RegEx
algorithms.  If they're anything like the PackRat parsing algorithms they
take at least 4x the amount of memory in terms of storage as the string length
itself by not backtracking.  That should be fine for small RegExes but it
wouldn't do so well for more elaborate and long expressions.

If what you need is something for a short regex, a packrat algorithm will do
fine.  A 256-character string may bloat up to a couple of KB and still be
reasonably cheap.  If you're looking for a full parser, Spirit2.x might be
better.

--Sam

Sebastian Redl

2009-Aug-24 06:38 UTC

head link

[LLVMdev] Regular Expression lib support

OvermindDL1 wrote:> Again, why not Spirit2.1, works just fine on C++98, and it is fast,
> and it is split up into the smallest bits so you only include what you
> use, and the assembly it compiles into is *very* tiny, far far less
> then any regex library could possibly be.
>   
Spirit is not an option for one simple reason: FileCheck needs to parse
regexes from its instruction comments at runtime and execute them. It
has no idea, at compile time, what these regexes will be.

Sebastian

Török Edwin

2009-Aug-24 06:59 UTC

head link

[LLVMdev] Regular Expression lib support

On 2009-08-24 07:28, Chris Lattner wrote:> On Aug 23, 2009, at 9:01 PM, Daniel Berlin wrote:
>   
>>>  2. Use POSIX regcomp facilities. This implies importing some
>>> implementation of this interface, e.g., Windows. On Linux, BSD,
etc.
>>> we would try to use the platform version if available (and non- 
>>> buggy).
>>>       
>> Don't do it.
>> They are ridiculous slow, and posix made some really dumb choices in  
>> regexps.
>>     
>
> We want to use this from FileCheck, which we build at -O0 today.   
> Also, each regex will be matched once.  Most testcases use fixed  
> strings (in fact 100% of them do today!).  This really is not very  
> performance sensitive.
>
> Regex engines like this are inherently more powerful but slower than  
> fixed-purpose matching logic.  I don't see a reason not to use a  
> (slow!) simple regexec version.
>   
I agree with Daniel Berlin, system's regcomp/regexec shouldn't be used.

Slow can mean taking 30 minutes, or hanging indefinetely on some
platforms with buggy regcomp/regexec.

Some examples:
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=497
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=598
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=635
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=658
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=679
http://bugs.opensolaris.org/bugdatabase/printableBug.do?bug_id=4346175

That is why in ClamAV we are using the OpenBSD implementation of
regcomp/regexec, regardless if the system has a regcomp/regexec available.
The code is fairly small (~100k), BSD licensed, easy to make it portable
(memmove->memcpy, pull in strlcpy impl., etc.) and doesn't explode in
execution time.

I wasn't aware of Google's regexp library at that time (perhaps it
didn't exist yet), but if it provides linear execution time that sounds
good too.

If LLVM is going to have an integrated regex library I suggest using it
regardless if the platform has one.
The LLVM integrated regex library will provide consistent behaviour and
execution time, the system one will not.

Best regards,
--Edwin

Chris Lattner

2009-Aug-24 17:14 UTC

head link

[LLVMdev] Regular Expression lib support

On Aug 23, 2009, at 11:59 PM, Török Edwin wrote:> If LLVM is going to have an integrated regex library I suggest using  
> it
> regardless if the platform has one.
> The LLVM integrated regex library will provide consistent behaviour  
> and
> execution time, the system one will not.
Hi Edwin,

Can you propose the openbsd implementation as a patch to lib/support?

-Chris

OvermindDL1

2009-Aug-24 18:41 UTC

head link

[LLVMdev] Regular Expression lib support

On Sun, Aug 23, 2009 at 11:04 PM, Samuel Crow<samuraileumas at yahoo.com>
wrote:> Hello LLVM Devs,
>
> I thought I'd weigh in on some of these non-backtracking linear time
RegEx algorithms.  If they're anything like the PackRat parsing algorithms
they take at least 4x the amount of memory in terms of storage as the string
length itself by not backtracking.  That should be fine for small RegExes but it
wouldn't do so well for more elaborate and long expressions.
>
> If what you need is something for a short regex, a packrat algorithm will
do fine.  A 256-character string may bloat up to a couple of KB and still be
reasonably cheap.  If you're looking for a full parser, Spirit2.x might be
better.
They stated later on (why not earlier I have no clue, but this is why
I switched away from pushing spirit and back to xpressive or
something) that the regex strings they look up are in the files it
reads rather then embedded.  If it was embedded in the code itself,
then yes, by far, they should use Spirit2.1.

As long as they pick a library that works everywhere and not just on
certain platforms I will be quite happy.  :)

Daniel Berlin

2009-Aug-25 04:39 UTC

head link

[LLVMdev] Regular Expression lib support

>
> I wasn't aware of Google's regexp library at that time (perhaps it
> didn't exist yet), but if it provides linear execution time that sounds
> good too.
It was internal at the time.
The external version in v8 sadly does code generation. I could look
into open sourcing the non-code-generating simple version we use, if
you want.
But given that speed is not a concern, might as well just use openbsd's.

Our bug database internal to google had plenty of bugs much like
clamav's. You'd think it would be hard to screw up simple string
matching regexps, but apparently you'd be wrong.

Apparently Analagous Threads

Search for more possibly parallel threads

llvm dev - Aug 2009 - [LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

[LLVMdev] Regular Expression lib support

Apparently Analagous Threads