thr3ads.net - sup talk - [sup-talk] Indexing messages without ruby [Oct 2009]

Carl Worth

2009-Oct-15 17:23 UTC

[sup-talk] Indexing messages without ruby

I've gone forward with an experiment I had mentioned I might try:
trying to create a faster sup-sync by using C rather than ruby.

My approach was to use Xapian to create a sup-compatible index, and to
use libgmime for the mail parsing. That meant that I only had to write
a tiny bit of code to glue gmime together with xapian. The code is an
unholy mess of C and C++, (so there are as many as three different
string types, sometimes all in one function!), but it seems to be
working at least.

I wrote a little xapian-dump program to make a textual dump of a
database, (all data, terms, and values for each document), and I
verified that my program, notmuch, could create identical[*] terms and
values when indexing about 150 recent messages from the sup-talk
mailing list.

I've also verified that notmuch can index my own collection of nearly
600,000 email messages without any problems.

The big difference in notmuch that makes the resulting index
incompatible with sup is that it doesn't generate a serialized ruby
data structure for the document data like sup currently
expects. Instead it just shoves the mail message's filename into the
data field. So if anyone wanted to use notmuch with sup, this would
need to be resolved on one side or the other.[**]

As for performance, things look pretty good, but perhaps not as good
as I had hoped. From a test of importing about 400,000 messages from
my mail archive, notmuch starts out between 300 - 400 messages/sec.
but after about 40,000 messages slows down to about 100
messages/sec. and stabilizes there.

I haven't tested sup recently, but as I recall from indexing the
same archive on the same computer, sup started out at about 40
messages/sec. and slowed down to about 20 messages/sec. (for as long
as I watched it).

So this is preliminary, but it looks like notmuch gives a 5-10x
performance improvement over sup, (but likely closer to the 5x end of
that range unless you've got a very small index---at which point who
cares how fast/slow things are?).
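
As a rough check of that estimate, the quoted rates can simply be
divided. The sketch below uses only the figures from this message; the
variable names are illustrative, and the sup rates are recalled rather
than re-measured.

```python
# Rates in messages/sec, as quoted in the message above.
notmuch_start = (300, 400)   # notmuch, roughly the first 40,000 messages
notmuch_steady = 100         # notmuch, after it slows down
sup_start = 40               # sup, recalled from an earlier run
sup_steady = 20              # sup, recalled from an earlier run

# Speedup at the start of the run (low and high end) and once stabilized.
speedup_start = tuple(rate / sup_start for rate in notmuch_start)
speedup_steady = notmuch_steady / sup_steady

print(speedup_start)   # (7.5, 10.0)
print(speedup_steady)  # 5.0
```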

A quick glance with a profiler shows xapian dominating the notmuch
profile at 62% and gmime hardly appearing at all (near 2%). By
contrast, ruby dominates the sup-sync profile with xapian down in the
8% range. (So there's the 10x target there.) One other advantage is
that with xapian dominating the profile, notmuch stands to benefit
from future performance improvements to xapian itself.

If that ruby time is dominated by mail parsing, it's possible that
much of the advantage of notmuch could be gained by simply switching
from rmail to a non-ruby-based parser like gmime. But that's just a
guess as I haven't tried to find where the ruby time is being spent.

If anyone is interested in playing along at home, my code is available
via git with:

	git clone git://git.cworth.org/git/notmuch

Have fun,

-Carl

[*] Some minor differences exist in the heuristics for recognizing
signatures, and sup sometimes splits numbers into multiple terms (such
as "1754" indexed as two terms "17" and "54"), which I couldn't explain
nor replicate. Finally, notmuch doesn't attempt to index encrypted
messages.

[**] Beyond this incompatibility, notmuch is not even close to being a
functional replacement for sup-sync. It currently only knows how to
shove new documents into the database and doesn't know how to do any
updating. Similarly it doesn't have any code to examine mtimes to
identify new or changed messages to be updated. It also doesn't look
at maildir filename flags to determine labels, nor does it provide any
means for the user to request custom labels to be attached to certain
messages.
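
For illustration only, two of the missing pieces listed in this
footnote (reading maildir filename flags and checking mtimes) could be
sketched as below. This is not code from notmuch or sup. The flag
letters are the ones defined by the maildir format, but the label
names, the function names and the `last_sync` parameter are all
assumptions made for the example.

```python
import os

# Flag letters from the maildir format. The label names are illustrative
# and are not necessarily the labels sup would assign.
MAILDIR_FLAGS = {
    "D": "draft",
    "F": "flagged",
    "P": "passed",
    "R": "replied",
    "S": "seen",
    "T": "trashed",
}


def maildir_labels(filename):
    """Return the set of labels for the flags after ':2,' in a filename."""
    _, sep, info = os.path.basename(filename).rpartition(":2,")
    if not sep:
        return set()  # no info section, e.g. a file still in new/
    return {MAILDIR_FLAGS[flag] for flag in info if flag in MAILDIR_FLAGS}


def changed_since(paths, last_sync):
    """Yield the paths whose mtime is newer than the previous sync time."""
    for path in paths:
        if os.stat(path).st_mtime > last_sync:
            yield path


print(sorted(maildir_labels("cur/1255627420.M1P2.host:2,RS")))
```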

Carl Worth

2009-Oct-20 04:24 UTC

Excerpts from Carl Worth's message of Thu Oct 15 10:23:40 -0700 2009:
> As for performance, things look pretty good, but perhaps not as good
> as I had hoped.

I know William already said he's not all that concerned with the
performance of sup-sync since it's not a common operation, but me, I
can't stop working on the problem.

And I think that's justified, really. For one thing, the giant
sup-sync is one of the first things a new user has to do. And I think
that having to wait for an operation that's measured in hours before
being able to use the program at all can be very off-putting.

I think we could do better to give a good first impression.

> So this is preliminary, but it looks like notmuch gives a 5-10x
> performance improvement over sup, (but likely closer to the 5x end of
> that range unless you've got a very small index---at which point who
> cares how fast/slow things are?).

Those numbers were off. I now believe that my original code gained
only a 3x improvement by switching from ruby/rmail to C/GMime for mail
parsing. But I've done a little more coding since. Here are the
current results:

  For a benchmark of ~ 45000 messages, rate in messages/sec.:

  Rate    Commit ID       Significant change
  -----   ---------       ------------------
  41                      sup (with xapian, from next)
  120     5fbdbeb33       Switch from ruby to C (with GMime)
  538     9bc4253fa       Index headers only, not body
  1050    371091139       Use custom header parser, not GMime

  (Before each run the Linux disk cache was cleared with:
          sync; echo 3 > /proc/sys/vm/drop_caches
  )
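
The gain contributed by each row can be worked out by dividing its
rate by the rate of the row before it. A small sketch, using only the
numbers from the table (the variable names are illustrative):

```python
# (rate in messages/sec, significant change), copied from the table above.
rows = [
    (41, "sup (with xapian, from next)"),
    (120, "Switch from ruby to C (with GMime)"),
    (538, "Index headers only, not body"),
    (1050, "Use custom header parser, not GMime"),
]

baseline = rows[0][0]
gains = []  # (rate, gain over the previous row, gain over sup)
previous = baseline
for rate, change in rows:
    step = rate / previous
    total = rate / baseline
    gains.append((rate, step, total))
    print(f"{rate:>5}  {step:4.1f}x step  {total:5.1f}x total  {change}")
    previous = rate
```

The steps come out at about 2.9x, 4.5x and 2.0x, for roughly 25.6x
over sup in total.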

So beyond the original 3x improvement, I gained a further 4x
improvement by simply doing less work. I'm now starting off by only
indexing message-id and thread-id data. That's obviously "cheating" in
terms of comparing performance, but I think it really makes sense to
do this.

The idea is that by just computing the thread-ids and indexing those
for a collection of email, that initial sup-sync could be performed
very quickly. Then, later, (as a background thread while sup is
running), the full-text indexing could be performed.
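
One way the quick first pass might compute thread-ids from headers
alone is sketched below. It is an assumption for illustration, not the
method notmuch or sup actually uses: messages are grouped with a
union-find structure over their Message-ID and the ids they reference,
and the thread-id is simply a representative message-id. The function
name and the input format are hypothetical.

```python
def assign_thread_ids(messages):
    """Map each message-id to a thread-id.

    `messages` is an iterable of (message_id, referenced_ids) pairs, where
    referenced_ids would come from References/In-Reply-To headers.
    """
    parent = {}

    def find(x):
        # Return the representative of x's group, shortening the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Merge the groups containing a and b.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    seen = []
    for message_id, referenced_ids in messages:
        seen.append(message_id)
        find(message_id)
        for ref in referenced_ids:
            union(message_id, ref)
    return {message_id: find(message_id) for message_id in seen}


threads = assign_thread_ids([
    ("a@example", []),
    ("b@example", ["a@example"]),
    ("c@example", ["a@example", "b@example"]),
    ("d@example", []),
])
```

Here the first three messages end up sharing one thread-id and the
fourth gets its own. Only headers are needed for this pass, which is
what would let the full-text indexing be deferred.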

Finally, I gained another 2x improvement by not using GMime at all,
(which constructs a data structure for the entire message, even if I
only want a few headers), and instead just rolling a simple parser for
email headers. (Did you know you can hide nested parenthesized
comments all over the place in email headers? I didn't.)
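
To show what handling those comments involves, here is a minimal
sketch of stripping nested parenthesized comments from a header value.
It is not the parser from notmuch; the function name is hypothetical,
and it only covers comments, quoted strings and backslash escapes.

```python
def strip_comments(value):
    """Remove (possibly nested) parenthesized comments from a header value."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a "quoted string", where parens are literal
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            # Backslash escape: the next character is never a delimiter.
            if depth == 0:
                out.append(value[i:i + 2])
            i += 2
            continue
        if in_quotes:
            out.append(ch)
            if ch == '"':
                in_quotes = False
        elif depth > 0:
            # Inside a comment: track nesting, discard everything else.
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
        elif ch == "(":
            depth = 1
        elif ch == '"':
            in_quotes = True
            out.append(ch)
        else:
            out.append(ch)
        i += 1
    # Collapse the whitespace left behind by removed comments. Note that
    # this also squeezes whitespace inside quoted strings.
    return " ".join("".join(out).split())


print(strip_comments("Carl (the (nested) one) Worth <c@example.org>"))
```

A quoted string such as `"not (a comment)"` is left alone, since
parentheses inside quotes do not start a comment.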

I'm quite happy with the final result that's 25x faster than sup. I
can build a cold-cache index from my half-million message archive in
less than 10 minutes, (rather than 4 hours). And performance is fairly
IO-bound at this point, (in the 10-minute run, less than 7 minutes of
CPU are used).
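
These times are in line with the benchmark rates. As a rough check,
assuming the rates measured on the ~45000-message benchmark carry over
to an archive of 500,000 messages (the exact archive size is an
assumption here):

```python
messages = 500_000   # "half-million message archive", assumed exact
fast_rate = 1050     # messages/sec with the custom header parser
sup_rate = 41        # messages/sec with sup

fast_minutes = messages / fast_rate / 60   # about 7.9 minutes
sup_hours = messages / sup_rate / 3600     # about 3.4 hours

# Less than 7 CPU minutes in a 10-minute run: the CPU is busy for at most
# 70% of the wall-clock time, the rest is spent waiting.
max_cpu_share = 7 / 10

print(f"{fast_minutes:.1f} min, {sup_hours:.1f} h, CPU <= {max_cpu_share:.0%}")
```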

Anyway, there are some ideas to consider for sup.

If anyone wants to play with my code, it's here:

	git clone git://notmuch.org/notmuch

I won't bore the list with further developments in notmuch, if any,
unless it's on-topic, (such as someone trying to make sup work on top
of an index built by notmuch). And I'd be delighted to see that kind
of thing happen.

Happy hacking,

-Carl

Carl Worth

2009-Oct-20 15:35 UTC

Excerpts from Carl Worth's message of Mon Oct 19 21:24:21 -0700 2009:
>     git clone git://notmuch.org/notmuch

That should be:

	git clone git://notmuchmail.org/git/notmuch

Clearly typing without my brain engaged.

-Carl