I've gone forward with an experiment I had mentioned I might try: trying to create a faster sup-sync by using C rather than ruby.

My approach was to use Xapian to create a sup-compatible index, and to use libgmime for the mail parsing. That meant that I only had to write a tiny bit of code to glue gmime together with xapian. The code is an unholy mess of C and C++, (so there are as many as three different string types, sometimes all in one function!), but it seems to be working at least.

I wrote a little xapian-dump program to make a textual dump of a database, (all data, terms, and values for each document), and I verified that my program, notmuch, could create identical[*] terms and values when indexing about 150 recent messages from the sup-talk mailing list. I've also verified that notmuch can index my own collection of nearly 600,000 email messages without any problems.

The big difference in notmuch that makes the resulting index incompatible with sup is that it doesn't generate a serialized ruby data structure for the document data like sup currently expects. Instead it just shoves the mail message's filename into the data field. So if anyone wanted to use notmuch with sup, this would need to be resolved on one side or the other.[**]

As for performance, things look pretty good, but perhaps not as good as I had hoped. From a test of importing about 400,000 messages from my mail archive, notmuch starts out between 300 and 400 messages/sec. but after about 40,000 messages slows down to about 100 messages/sec. and stabilizes there. I haven't tested sup recently, but from my recollection indexing the same archive on the same computer, sup started out at about 40 messages/sec. and slowed down to about 20 messages/sec. (for as long as I watched it).

So this is preliminary, but it looks like notmuch gives a 5-10x performance improvement over sup, (but likely closer to the 5x end of that range unless you've got a very small index---at which point who cares how fast/slow things are?).

A quick glance with a profiler shows xapian dominating the notmuch profile at 62% and gmime hardly appearing at all (near 2%). In contrast, ruby dominates the sup-sync profile with xapian down in the 8% range. (So there's the 10x target there.) One other advantage is that with xapian dominating the profile, notmuch stands to benefit from future performance improvements to xapian itself.

If that ruby time is dominated by mail parsing, it's possible that much of the advantage of notmuch could be gained by simply switching from rmail to a non-ruby-based parser like gmime. But that's just a guess as I haven't tried to find where the ruby time is being spent.

If anyone is interested in playing along at home, my code is available via git with:

    git clone git://git.cworth.org/git/notmuch

Have fun,

-Carl

[*] Some minor differences exist in the heuristics for recognizing signatures, and sup sometimes splits numbers into multiple terms (such as "1754" indexed as two terms "17" and "54"), which I could neither explain nor replicate. Finally, notmuch doesn't attempt to index encrypted messages.

[**] Beyond this incompatibility, notmuch is not even close to being a functional replacement for sup-sync. It currently only knows how to shove new documents into the database and doesn't know how to do any updating. Similarly it doesn't have any code to examine mtimes to identify new or changed messages to be updated. It also doesn't look at maildir filename flags to determine labels, nor does it provide any means for the user to request custom labels to be attached to certain messages.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 190 bytes Desc: not available URL: <http://rubyforge.org/pipermail/sup-talk/attachments/20091015/25074e86/attachment.bin>

Excerpts from Carl Worth's message of Thu Oct 15 10:23:40 -0700 2009:
> As for performance, things look pretty good, but perhaps not as good
> as I had hoped.

I know William already said he's not all that concerned with the performance of sup-sync since it's not a common operation, but me, I can't stop working on the problem.

And I think that's justified, really. For one thing, the giant sup-sync is one of the first things a new user has to do. And I think that having to wait for an operation that's measured in hours before being able to use the program at all can be very off-putting. I think we could do better to give a good first impression.

> So this is preliminary, but it looks like notmuch gives a 5-10x
> performance improvement over sup, (but likely closer to the 5x end of
> that range unless you've got a very small index---at which point who
> cares how fast/slow things are?).

Those numbers were off. I now believe that my original code gained only a 3x improvement by switching from ruby/rmail to C/GMime for mail parsing. But I've done a little more coding since. Here are the current results:

For a benchmark of ~45000 messages, rate in messages/sec.:

    Rate   Commit ID   Significant change
    -----  ---------   ------------------
      41               sup (with xapian, from next)
     120   5fbdbeb33   Switch from ruby to C (with GMime)
     538   9bc4253fa   Index headers only, not body
    1050   371091139   Use custom header parser, not GMime

(Before each run the Linux disk cache was cleared with:

    sync; echo 3 > /proc/sys/vm/drop_caches

)

So beyond the original 3x improvement, I gained a further 4x improvement by simply doing less work. I'm now starting off by only indexing message-id and thread-id data. That's obviously "cheating" in terms of comparing performance, but I think it really makes sense to do this. The idea is that by just computing the thread-ids and indexing those for a collection of email, that initial sup-sync could be performed very quickly. Then, later, (as a background thread while sup is running), the full-text indexing could be performed.

Finally, I gained another 2x improvement by not using GMime at all, (which constructs a data structure for the entire message, even if I only want a few headers), and instead just rolling a simple parser for email headers. (Did you know you can hide nested parenthesized comments all over the place in email headers? I didn't.)

I'm quite happy with the final result that's 25x faster than sup. I can build a cold-cache index from my half-million message archive in less than 10 minutes, (rather than 4 hours). And performance is fairly IO-bound at this point, (in the 10-minute run, less than 7 minutes of CPU are used).

Anyway, there are some ideas to consider for sup.

If anyone wants to play with my code, it's here:

    git clone git://notmuch.org/notmuch

I won't bore the list with further developments in notmuch, if any, unless it's on-topic, (such as someone trying to make sup work on top of an index built by notmuch). And I'd be delighted to see that kind of thing happen.

Happy hacking,

-Carl
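
The remark above about nested parenthesized comments in email headers can be illustrated with a small sketch. This is not code from notmuch. The function name strip_header_comments and its interface are hypothetical, and the rules it follows are the comment, quoted-pair, and quoted-string rules as I understand them from RFC 5322: comments are delimited by parentheses and may nest, a backslash makes the next character literal, and parentheses inside a double-quoted string are ordinary text.

```c
#include <stddef.h>

/* Hypothetical helper (not taken from notmuch): copy the header value `in`
 * into `out`, dropping parenthesized comments, which may nest.
 * Returns 0 on success, or -1 if `out` is too small or the input ends
 * inside a comment or a quoted string.  On failure the contents of `out`
 * are unspecified. */
int strip_header_comments(const char *in, char *out, size_t out_size)
{
    size_t n = 0;       /* bytes written to out so far */
    int depth = 0;      /* current comment nesting depth */
    int in_quotes = 0;  /* nonzero while inside a "quoted string" */

    if (out_size == 0)
        return -1;

    for (; *in != '\0'; in++) {
        char c = *in;

        if (c == '\\' && in[1] != '\0') {
            /* Quoted-pair: the next character is literal, so it can
             * neither open nor close a comment or a quoted string. */
            if (depth == 0) {
                if (n + 2 >= out_size)
                    return -1;
                out[n++] = c;
                out[n++] = in[1];
            }
            in++;
            continue;
        }

        if (depth == 0 && c == '"') {
            in_quotes = !in_quotes;      /* the quote itself is copied below */
        } else if (!in_quotes && c == '(') {
            depth++;                     /* open a (possibly nested) comment */
            continue;
        } else if (!in_quotes && depth > 0 && c == ')') {
            depth--;                     /* close the innermost comment */
            continue;
        }

        if (depth > 0)
            continue;                    /* inside a comment: drop the byte */

        if (n + 1 >= out_size)
            return -1;
        out[n++] = c;
    }

    out[n] = '\0';
    return (depth == 0 && !in_quotes) ? 0 : -1;
}
```

Given the input `Pete(A nice \) chap) <pete(his account)@silly.test(his host)>`, this sketch produces `Pete <pete@silly.test>`. A single depth counter is enough because comments only nest and never overlap, so the parser needs no per-message data structure, which fits the "simple parser for email headers" idea described in the message.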

Excerpts from Carl Worth's message of Mon Oct 19 21:24:21 -0700 2009:
> git clone git://notmuch.org/notmuch

That should be:

    git clone git://notmuchmail.org/git/notmuch

Clearly typing without my brain engaged.

-Carl