thr3ads.net - dovecot - [Dovecot] Please advise on very fast search [Nov 2011]

Alexander Chekalin

2011-Nov-09 13:57 UTC

[Dovecot] Please advise on very fast search

Hello,

I am trying to build a mail backup system. What I need is a system 
that stores mail for a whole domain and lets me restore 
messages from/to a specified email address at that domain.

The scheme is pretty simple: on our main mail server the SMTP server 
itself has a rule to send a copy of every message to 
'backup at backupserver.host', and the backupserver.host domain is placed 
nearby on a second server.

The SMTP server on the second machine does a simple 'catchall' redirect of all 
messages to a single box. There is also a Dovecot that takes care of remote 
IMAP access to that box. And, finally, I've created some scripts to sort 
all messages in INBOX into folders named after the message's date.

So I have a lot of mailboxes inside the catchall box:
INBOX
2011.11.03
2011.11.04
2011.11.05
2011.11.06
...etc...

and each folder holds the messages for that day. Simple, and it works perfectly.

The problem is that once my archive becomes big (several years), it 
is painful to find a specific message. When someone 
suddenly needs to find his/her old message, it is mostly guesses like 'I 
think the message was between June and July of 2009, or maybe a month or 
two before that', so I need to search all mailboxes (with thousands of 
messages in each). And it takes a really long time.


I tried to play with Dovecot indexes, but it didn't help much. The 
bad part is that I need to search for email addresses in all of each 
message's headers, not only in "From" or "To", since some messages are 
sent to mailing lists, so "To" = list address, not the person's personal email.

Then I tried to index messages on my own, storing info on emails in a 
MySQL database ('email' -> 'mailbox', 'message filename'), but soon I 
found out that message files can be renamed by Dovecot.
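
One possible way around the renaming problem, sketched here under assumptions that are not from the thread: in the Maildir format the part of a filename before ':2,' normally stays the same, while the flags after it change (which is what a rename usually is). Indexing on that stable part instead of the full filename would then survive flag changes. The sketch uses SQLite instead of MySQL only so that it is self-contained; the table name, the helper names and the list of skipped headers are all hypothetical.

```python
import email
import re
import sqlite3

# Loose pattern for things that look like email addresses (an assumption,
# not a full RFC 5322 parser).
ADDR_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# These headers carry message IDs that merely look like addresses.
SKIP_HEADERS = {"message-id", "references", "in-reply-to"}


def stable_key(filename):
    """Return the Maildir name without the ':2,<flags>' suffix."""
    return filename.split(":2,", 1)[0]


def header_addresses(raw):
    """Collect every address-like string found in any header (not the body)."""
    msg = email.message_from_bytes(raw)
    found = set()
    for name, value in msg.items():
        if name.lower() in SKIP_HEADERS:
            continue
        found.update(a.lower() for a in ADDR_RE.findall(str(value)))
    return found


def create_schema(db):
    db.execute(
        "CREATE TABLE IF NOT EXISTS addr_index ("
        "address TEXT NOT NULL, mailbox TEXT NOT NULL, msg_key TEXT NOT NULL, "
        "PRIMARY KEY (address, mailbox, msg_key))"
    )


def index_message(db, mailbox, filename, raw):
    """Store one row per address, keyed by the stable part of the filename."""
    key = stable_key(filename)
    db.executemany(
        "INSERT OR IGNORE INTO addr_index VALUES (?, ?, ?)",
        [(addr, mailbox, key) for addr in sorted(header_addresses(raw))],
    )


def find(db, address):
    """Return (mailbox, msg_key) pairs for messages mentioning the address."""
    cur = db.execute(
        "SELECT mailbox, msg_key FROM addr_index WHERE address = ? "
        "ORDER BY mailbox, msg_key",
        (address.lower(),),
    )
    return cur.fetchall()
```

To get back to the file later, one would look in the folder's cur/ and new/ directories for a name starting with the stored key, whatever flags it currently carries.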

Could you please advise me how to speed up message search?


Sorry for such a long question, hope you can help!

Yours,
   Alexander Chekalin

Robert Schetterer

2011-Nov-09 14:14 UTC

On 09.11.2011 14:57, Alexander Chekalin wrote:
> Hello,
> 
> I try to create some kind of mail backup system. What I need is system
> that will store mail for the whole domain, and allow me to restore
> messages from/to specified email at that domain.
> 
> The scheme is pretty simple: on our main mail server the SMTP server
> itself has a rule to send a copy of every message to
> 'backup at backupserver.host', and the backupserver.host domain is placed
> nearby on second server.
> 
> The SMTP on second server do simple 'catchall' redirect of all messages
> to the single box. There is also a Dovecot that takes care for remote
> IMAP access to that box. And, finally, I've create some scripts to sort
> all messages in INBOX to folders named after message's date.
> 
> So I have a lot of mailboxes inside the catchall box:
> INBOX
> 2011.11.03
> 2011.11.04
> 2011.11.05
> 2011.11.06
> ...etc...
> 
> and each folder holds messages for that day. Simply, and works perfectly.
> 
> The problem is that when my archive become big (several years), it
> appears to be painful to find specified message(s). When someone
> suddenly needs to find his/her old message, it is mostly guesses like 'I
> think the message was between june and july of 2009, or maybe month or
> two before that', so I need to search all mailboxes (with 1000's
> messages in each). And it takes really long time.
> 
> 
> I tried to play with Dovecot indexes, but it won't help too much. The
> bad part is that I need to search for all emails in each message
> headers, not only for "From" or "To", since some messages are sent to
> maillists so "To" = list address, not person's personal email.
> 
> Then I tried to index messages on my own, storing info on emails into
> MySQL database ('email' -> 'mailbox', 'message filename'), but soon I
> find out that message files can be renamed by Dovecot.
> 
> Could you please advice me how to speed up message search?
> 
> 
> Sorry for such a long question, hope you can help!
> 
> Yours,
>   Alexander Chekalin
> 
I guess you're searching over IMAP?
Perhaps compression will help to speed things up, along with other
speed-related tuning, or you need some other approach to indexing.
At the least, if it's maildir, how fast is "grep" and so on?
Some ideas here:
http://wiki.dovecot.org/HowTo/ReadOnlyArchive

Anyway, I think you really need another kind of archive solution.
In Germany there is a law that you need to archive some kinds of business
mail for up to 10 years for financial and other reviews, so there are a lot of
"you can buy" solutions now; these have solved the problems you
discovered (indexing etc.).
I was shown, e.g.,
http://www.bytstormail.de which looked fine to me.

Or perhaps you might have a look at
http://www.archiveopteryx.org/
too.

-- 
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria

Timo Sirainen

2011-Nov-09 15:17 UTC

On Wed, 2011-11-09 at 16:57 +0300, Alexander Chekalin wrote:
> The problem is that when my archive become big (several years), it 
> appears to be painful to find specified message(s). When someone 
> suddenly needs to find his/her old message, it is mostly guesses like 'I
> think the message was between june and july of 2009, or maybe month or 
> two before that', so I need to search all mailboxes (with 1000's 
> messages in each). And it takes really long time.
> 
> 
> I tried to play with Dovecot indexes, but it won't help too much. 
They'll help with the dates.
> The 
> bad part is that I need to search for all emails in each message 
> headers, not only for "From" or "To", since some messages are sent to
> maillists so "To" = list address, not person's personal email.
Headers only, not message body? Anyway, some of the full text search
backends would support searching from both. I'd recommend using either
Solr or with Dovecot v2.1 you can also use Lucene:
http://wiki2.dovecot.org/Plugins/FTS
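
For illustration, a minimal sketch of how a client could combine the two points above: use the date-named folders to limit which mailboxes are opened, then ask the server to match the address in several headers at once. It relies only on standard IMAP SEARCH keys and Python's imaplib. The host, login, and the extra header names (Delivered-To, X-Original-To) are assumptions that depend on the MTA in use, and whether any of this is fast still depends on the server-side indexes or FTS backend.

```python
import imaplib
from datetime import date, timedelta


def folders_between(start, end):
    """Folder names in the archive's YYYY.MM.DD scheme, both ends included."""
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y.%m.%d")
            for i in range(days + 1)]


def address_criteria(address, extra_headers=("Delivered-To", "X-Original-To")):
    """Build a nested IMAP 'OR' expression matching the address in any header listed."""
    terms = [f'FROM "{address}"', f'TO "{address}"', f'CC "{address}"']
    terms += [f'HEADER {name} "{address}"' for name in extra_headers]
    expr = terms[-1]
    # IMAP's OR takes exactly two keys, so chain them right to left.
    for term in reversed(terms[:-1]):
        expr = f"OR {term} {expr}"
    return expr


def search_archive(host, user, password, address, start, end):
    """Return {folder: [message sequence numbers]} for folders with hits."""
    hits = {}
    conn = imaplib.IMAP4_SSL(host)  # hypothetical server, default port 993
    try:
        conn.login(user, password)
        for folder in folders_between(start, end):
            typ, _ = conn.select(folder, readonly=True)
            if typ != "OK":  # no folder for that day
                continue
            typ, data = conn.search(None, address_criteria(address))
            if typ == "OK" and data and data[0]:
                hits[folder] = data[0].split()
    finally:
        conn.logout()
    return hits
```

A call such as search_archive("backupserver.host", "backup", "secret", "someone@example.com", date(2009, 5, 1), date(2009, 7, 31)) would then touch only the folders for that guess of three months rather than the whole archive.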

Alexander Chekalin

2011-Nov-09 16:16 UTC

Thanks, Robert,

I will take a look.

What I'm afraid of is how database storage should be planned (storage, 
CPU, RAM, scaling when it fills up). When dealing with files 
(I'm using maildir), it is much easier to understand and to fix just about 
everything. Adding a database means tuning it too, and I'll have more 
things to 'tune a bit'.

In fact working with Dovecot is pretty nice, but I think I can tune it to 
work faster.

I now run it on FreeBSD (on UFS2); maybe I should change OS + FS, but I 
need to test (I really hope ZFS on SAS drives will help; I have still found 
no benchmarks for such a setup). I will also try full text search, 
but I'm afraid of the index size (and I need no search on body, just on headers).

Anyway, thank you for pointing me in the right direction!

Yours,
   Alexander

Stan Hoeppner

2011-Nov-16 16:36 UTC

On 11/16/2011 12:15 AM, Alexander Chekalin wrote:
> Hello, Stan,
> 
>> This is why I recommended mbox in the first place.  If your only writes
>> to these mailbox files are appends of new messages, mbox is the best
>> format by far.  It's faster at appending than any other format, and it's
>> faster for searching than any other.
> 
> I am now seriously considering mdbox due to its nice self-regulation.
> After all, I believe mdbox should do file compression on its own, no
> cron scripts required.
mbox and mdbox each has strengths and weaknesses.  mbox will compress
with a higher ratio than mdbox.  You already have a nightly script that
moves all mail from the day into a new file.  Piping that through gzip
or bzip2 is a no brainer.  It'll add one line to your existing script,
if that.  Dovecot will decompress the file transparently when you access
it via IMAP.  And again since it's a single file searching it is much
faster.  With mbox you will have a single file for each day of emails.
This seems ideal for archive purposes, one file per day.
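
As a rough sketch of that extra step in the nightly script (written in Python here rather than as a shell pipe): compress the finished daily mbox into a temporary file, then move it into place. The function name and paths are hypothetical. How Dovecot's zlib plugin recognizes a compressed mbox (by file name or by content) depends on the version and configuration, so the destination name is left to the caller.

```python
import gzip
import os
import shutil


def compress_mbox(src, dst):
    """Gzip the mbox at src into dst, then remove the uncompressed original.

    The data goes to a temporary file first, so an interrupted run never
    leaves a half-written archive under the final name.
    """
    tmp = dst + ".tmp"
    with open(src, "rb") as fin, gzip.open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    os.replace(tmp, dst)
    if os.path.abspath(src) != os.path.abspath(dst):
        os.remove(src)
    return dst
```

This should only be run on a day's file after the script has stopped appending to it, since compressed mailboxes are meant to be read, not written.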

mdbox does fully transparent de/compression which is nice.  The downside
is that Dovecot does dbox compression on a per email basis, not a per
file basis.  So your compression ratio will be much less than with mbox,
especially with bzip2 which works best on files over 900KB in size.
Most emails are less than 8KB.  Using mdbox will yield multiple files
per day of emails instead of just one.
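
The per-email versus per-file difference is easy to demonstrate with a toy experiment. The sketch below builds synthetic messages (far more repetitive than real mail, so the gap is exaggerated) and compares compressing each one separately with compressing them as one concatenated file. It uses zlib from the Python standard library; the message template is made up for illustration.

```python
import zlib


def make_messages(count):
    """Generate small synthetic messages that share most of their headers."""
    template = (
        "From: user{i}@example.com\r\n"
        "To: backup@backupserver.host\r\n"
        "Subject: Report {i}\r\n"
        "MIME-Version: 1.0\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        "\r\n"
        "Hello, this is message number {i}. Please see the figures below.\r\n"
    )
    return [template.format(i=i).encode("ascii") for i in range(count)]


def per_message_size(messages):
    """Total size when every message is compressed on its own (dbox style)."""
    return sum(len(zlib.compress(m)) for m in messages)


def whole_file_size(messages):
    """Size when all messages are compressed as one stream (mbox + gzip style)."""
    return len(zlib.compress(b"".join(messages)))


if __name__ == "__main__":
    msgs = make_messages(1000)
    print("raw bytes:          ", sum(len(m) for m in msgs))
    print("compressed per mail:", per_message_size(msgs))
    print("compressed as one:  ", whole_file_size(msgs))
```

The single stream wins because the compressor can reuse the headers it has already seen, while each small message on its own gives it almost nothing to reuse.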

Either format is much better than maildir for archiving.
>> It's an archive.  You're not going to use maildir so you don't need
>> random IOPS performance.  Thus RAID5/6 are a much better fit for an
>> archive as you get better read performance, with more than adequate
>> write performance, and you use less disks.  And as this is an archive,
>> you don't need real time automatic/transparent compression.  Thus I
>> recommend something like:
>>
>> 1.  Debian 6 w/linux-image-2.6.39-bpo.2-amd64 or a custom rolled
>>      2.6.39 or later kernel
>> 2.  hardware RAID5 w/large (2TB) SATA disks, 512B native sectors
>>      e.g. MegaRAID SAS 9261-8i, 4 Seagate Constellation ES ST2000NM0011
>>      Specify a strip size of 256KB for the array
>>      Perma set /sys/block/sdX/queue/read_ahead_kb to 512 so you're reading
>>      ahead 1024 sectors at a time instead of the default of 256.  This
>>      will speed up your searches quite a bit.
>> 3.  XFS filesystem on the RAID device, created with mkfs.xfs defaults
>> 4.  mbox w/zlib plugin.  Compress daily files each night with a script
>> 5.  You don't need LVM with a good RAID card (or with mdraid).  This
>>      controller can expand the RAID5 up to 8 drives (up to 32 drives max
>>      using SAS expanders)
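
The read-ahead figures in item 2 are plain unit conversion, shown here as a tiny sketch. It assumes 512-byte sectors, as in the quoted recommendation, and that the stock read_ahead_kb value is 128, which is the usual Linux default but is not stated in the thread.

```python
SECTOR_BYTES = 512  # "512B native sectors" from the recommendation above


def readahead_sectors(read_ahead_kb, sector_bytes=SECTOR_BYTES):
    """Convert a read_ahead_kb setting into a number of sectors."""
    return read_ahead_kb * 1024 // sector_bytes


if __name__ == "__main__":
    print(readahead_sectors(512))  # the suggested setting
    print(readahead_sectors(128))  # assumed stock default
```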
> 
> We are considering to get HP DL180G6 server for 8 or 14 drives bays
The P410 tops out at 8 drives, so get the 8 drive model.  Start with 4 x
2TB drives in RAID5.  Add 4 more drives when you need the capacity, and
when drive prices are back down to normal (see below).

http://h18004.www1.hp.com/products/quickspecs/13248_na/13248_na.html
> (base model price is somewhat equal, but additional drives adds up cost)
Especially right now in 2011.  Flooding in Thailand, where 25% of the
world's drives are produced, has doubled the cost of all hard drives
worldwide.  Now is a horrible time to buy spinning drives.  I've read it
may be 12 months before prices start coming back down...
> with HP Smart Array P410 RAID controller (some servers are equipped with
> this controller by default) with 256 Mb battery-backed cache, but I'll
> check your suggestions!
The P410 should be fine for a dedicated archive server.
> What memory size should I plan in the server? You're talking about an AMD64
> OS image, and 64-bit software is likely to consume more memory than
> 32-bit, so it looks like you're talking about pretty huge RAM, and I don't
> believe it's necessary, or maybe I'm wrong?
The memory footprint of 64bit binaries is nothing to worry about.  The
additional amount consumed is more than offset by the performance gained
with direct access to RAM above 4GB compared to the performance of PAE.

Keep in mind that 90% of your memory will be eaten by Linux buffer
cache.  Your binaries will account for less than 5% of your
RAM consumption.  If I understand correctly how you will use this
archive server, then 8GB should be plenty.  8GB is standard on the 8
drive DL180 G6.

http://h18004.www1.hp.com/products/quickspecs/13248_na/13248_na.html
> Problem is I have no experience with XFS and not sure I can tune it in
> the best way, so I'll go with mkfs.xfs defaults, I think.
With only 4 drives and using a P410 w/cache and RAID5, doing manual XFS
tuning isn't necessary for good performance, especially for an archive
application which is data heavy, not metadata heavy.  Setting
sunit/swidth to match the RAID5 layout may increase performance slightly
due to stripe aligned writes, but not enough that I'd worry about it.
Just use the mkfs.xfs defaults.  If you get the BBWC for the P410,
enable the controller write cache, and mount XFS with 'nobarrier'.  This
will increase write performance quite a bit as fsyncs will complete
instantly.
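
For anyone who does want to try stripe alignment, the values follow directly from the array layout: RAID5 spends one drive's worth of capacity on parity, so the stripe width is the strip size times the number of data drives. The sketch below works that out and formats it as the su/sw data options that mkfs.xfs accepts. The 256KB strip size is carried over from the earlier MegaRAID suggestion and is only an assumption for the P410; the function names are made up.

```python
def raid5_xfs_geometry(drives, strip_kb):
    """Stripe geometry for a RAID5 array of `drives` disks."""
    if drives < 3:
        raise ValueError("RAID5 needs at least 3 drives")
    data_disks = drives - 1  # one drive's worth of space holds parity
    return {
        "su_kb": strip_kb,                          # stripe unit per disk
        "sw": data_disks,                           # stripe width in units
        "full_stripe_kb": strip_kb * data_disks,
        "sunit_sectors": strip_kb * 1024 // 512,    # in 512-byte sectors
        "swidth_sectors": strip_kb * 1024 // 512 * data_disks,
    }


def mkfs_data_options(drives, strip_kb):
    """Format the geometry as a '-d su=...,sw=...' option string."""
    g = raid5_xfs_geometry(drives, strip_kb)
    return f"-d su={g['su_kb']}k,sw={g['sw']}"


if __name__ == "__main__":
    print(mkfs_data_options(4, 256))
    print(mkfs_data_options(8, 256))
```

If the array is later expanded from 4 to 8 drives the width changes, which is one more reason the defaults are the lower-maintenance choice here.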
> Hope we'll see Dovecot 2.1.x stable soon, as I'd like to use fts plugins
> and 2.1 handle that much better, but I don't like the idea of use
> unstable in production.
Me neither.

Speaking of archive/search, did you take a look at Enkive yet?
http://www.enkive.org/
> Thank you for taking your time on my case,
You're welcome Alexander.

-- 
Stan

P.S.  You may wish to implement dnswl.org  ;)
