Jarrod Roberson
2007-Jun-14 22:33 UTC
[Xapian-discuss] indexing strategy for "near real time" indexing
I am working on a proof of concept real time email indexer using xapian. This is for HUGE volumes, think ISP level. I have to come up with a strategy for indexing the messages as they come in as near real time as I can. I am considering indexing into many databases based on time and / or size, and then trying to xapian-compact them together at the end of the day, and start over. The single writer limitation is what I am trying to address. Anyone have any suggestions about what might be a good place to start?
Olly Betts
2007-Jun-19 19:01 UTC
[Xapian-discuss] indexing strategy for "near real time" indexing
On Thu, Jun 14, 2007 at 05:33:19PM -0400, Jarrod Roberson wrote:
> I am working on a proof of concept real time email indexer using
> xapian. This is for HUGE volumes, think ISP level. I have to come up
> with a strategy for indexing the messages as they come in as near real
> time as I can.
>
> I am considering indexing into many databases based on time and / or
> size, and then trying to xapian-compact them together at the end of
> the day, and start over. The single writer limitation is what I am
> trying to address.

My thoughts would be to dump a copy of each message to be indexed into a spool directory (or directory hierarchy), and have the indexer process run through the spool. Either one message per file, or perhaps better in batches.

That way a sudden surge of email doesn't overwhelm the system - it just creates a temporary backlog of unindexed mail. And the indexer can be temporarily taken off-line without having to halt mail delivery or miss indexing messages.

You need to be able to index faster than messages arrive on average, and ideally fast enough to keep up with all but the peaks of demand - if necessary, you can run multiple indexers with a spool each and add new messages to each in a round-robin way. You can combine databases with xapian-compact when it's quieter as you suggest.

Cheers,
Olly
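The spool arrangement described above could be sketched roughly as follows. This is only an illustration under assumptions, not code from the thread: the names RoundRobinSpool, drain_spool, index_message and commit are hypothetical, and the actual Xapian indexing is left to the two callbacks so the sketch stays independent of any particular bindings.

```python
import itertools
import os
import tempfile
from pathlib import Path


class RoundRobinSpool:
    """Hypothetical helper: hand incoming messages to N spool dirs in turn."""

    def __init__(self, root, n_spools):
        self.dirs = [Path(root) / f"spool{i}" for i in range(n_spools)]
        for d in self.dirs:
            d.mkdir(parents=True, exist_ok=True)
        self._next_dir = itertools.cycle(self.dirs)
        self._counter = itertools.count()

    def enqueue(self, raw_message: bytes) -> Path:
        target_dir = next(self._next_dir)
        final = target_dir / f"{next(self._counter):012d}.eml"
        # Write under a temporary name and rename, so an indexer scanning
        # the directory never picks up a half-written message.
        fd, tmp = tempfile.mkstemp(dir=target_dir, prefix=".tmp-")
        with os.fdopen(fd, "wb") as f:
            f.write(raw_message)
        os.replace(tmp, final)
        return final


def drain_spool(spool_dir, index_message, commit, batch_size=100):
    """Index one batch from a spool; return how many messages were handled.

    index_message(bytes) and commit() are assumed callbacks supplied by
    the caller (for example, wrappers around one writable database).
    """
    pending = sorted(
        p for p in Path(spool_dir).iterdir() if not p.name.startswith(".tmp-")
    )
    batch = pending[:batch_size]
    for path in batch:
        index_message(path.read_bytes())
    if batch:
        commit()
        # Delete only after the commit, so a crash leaves a backlog
        # to retry rather than unindexed mail.
        for path in batch:
            path.unlink()
    return len(batch)
```

Each indexer process would call drain_spool in a loop on its own spool directory and write to its own database, which sidesteps the single-writer limitation; the per-indexer databases can then be merged with xapian-compact during a quiet period.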
Sam Liddicott
2007-Jun-19 21:31 UTC
[Xapian-discuss] indexing strategy for "near real time" indexing
Are you indexing a mail store, with a reference back to the store to retrieve the original message, or indexing a mail spool as messages pass through? How will messages expire? What processes will have read and write access to the store/spool?

If a spool, I suggest you modify the SMTP daemon to create hard links (in a different dir) to the queued message either when it enters, or finally leaves, or is delivered successfully to 1 (or each) recipient (depending on which strategy suits best).

If a store, then you probably want to track changes to the store. Maildir and mdir are simple, but mbox may require scanning whole mailboxes to look for added or removed message IDs.

As Olly points out, it's best to use a queue. You can't really do real time unless you have enough CPU to cope with unforeseen peaks, or unless you throttle reception by tying it to the index process.

Sam
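The hard-link suggestion might look something like the sketch below. It is an assumption-laden illustration: link_into_spool is a hypothetical helper, how it would be hooked into any particular SMTP daemon is not covered, and hard links only work when the queue and the index spool are on the same filesystem.

```python
import os
from pathlib import Path


def link_into_spool(queued_path, spool_dir):
    """Hard-link a queued message into spool_dir and return the new path.

    The message body is not copied; the indexer's link keeps the data
    alive even after the mail system removes its own queue entry.
    """
    queued_path = Path(queued_path)
    spool_dir = Path(spool_dir)
    spool_dir.mkdir(parents=True, exist_ok=True)
    target = spool_dir / queued_path.name
    try:
        os.link(queued_path, target)
    except FileExistsError:
        # Already linked on an earlier attempt; treat as success.
        pass
    return target
```

The indexer can then delete its link once the message has been indexed, without affecting delivery.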