Timo (and anyone else who feels like chiming in),

I was wondering whether the amount of corruption I see on a daily basis is what you would consider "average" for our current setup and traffic. Now that the latest patches have eliminated the core dumps we had been seeing since our migration from Courier two months ago, I'd like to know what to expect as an operational norm. We had never used Dovecot before, so I have nothing to go on.

Our physical setup is 10 CentOS 5.4 x86_64 IMAP/POP servers, all with the same NFS backend where the index, control, and Maildirs for the users reside. Accessing this are direct connections from clients, plus multiple SquirrelMail webservers and Pine users, all at the same time, with layer-4 switch connection load balancing.

Each server has an average of about 400 connections, for a total of around 4000 concurrent during a normal business day. This is out of a possible user population of about 15,000.

All our Dovecot servers syslog to one machine, and on average I see about 50-75 instances of file corruption per day. I'm not counting each line, since some instances of corruption generate a log message for each uid that's wrong. This is just me counting "user A was corrupted once at 10:00, user B was corrupted at 10:25", for example. Examples of the corruption are as follows:

###########
Corrupted transaction log file ..../dovecot/.INBOX/dovecot.index.log seq 28: Invalid transaction log size (32692 vs 32800): ...../dovecot/.INBOX/dovecot.index.log (sync_offset=32692)
Corrupted index cache file ...../dovecot/.Sent Messages/dovecot.index.cache: Corrupted physical size for uid=624: 0 != 53490263
Corrupted transaction log file ..../dovecot/.INBOX/dovecot.index.log seq 66: Unexpected garbage at EOF (sync_offset=21608)
Corrupted transaction log file ...../dovecot/.Trash.RFA/dovecot.index.log seq 2: indexid changed 1264098644 -> 1264098664 (sync_offset=0)
Corrupted index cache file ...../dovecot/.INBOX/dovecot.index.cache: invalid record size
Corrupted index cache file ...../dovecot/.INBOX/dovecot.index.cache: field index too large (33 >= 19)
Corrupted transaction log file ..../dovecot/.INBOX/dovecot.index.log seq 40: record size too small (type=0x0, offset=5788, size=0) (sync_offset=5812)
##########

These are most of the unique messages I could find, although the majority are the same as the first two above.

So, my question: is this normal for a setup such as ours? I've been arguing with my boss over this since the switch. My opinion is that in a setup like ours, where a user can be logged in with Thunderbird, SquirrelMail, and their BlackBerry all at the same time, there will always be the occasional index/log corruption. Unfortunately, he is of the opinion that there should rarely be any, and that there is a design flaw in how Dovecot works with multiple services with an NFS backend.

What has been your experience so far?

Thanks,
-Dave

--
================================
David Halik
System Administrator
OIT-CSS Rutgers University
dhalik at jla.rutgers.edu
================================
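For reference, a multi-server NFS setup like the one described above normally relies on the NFS-related dovecot.conf options below (1.2-era). This is only a sketch of the usual settings, not David's actual configuration, which isn't shown in the thread:

  # dovecot.conf -- NFS-related options (sketch, values assumed)
  mmap_disable = yes       # don't mmap index files that live on NFS
  mail_nfs_storage = yes   # flush NFS attribute/data caches around maildir access
  mail_nfs_index = yes     # same for the index/control files; these flushes are
                           # the OS/kernel-dependent behavior discussed in the replies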
On Fri, 2010-01-22 at 11:24 -0500, David Halik wrote:
> Unfortunately, he is of the opinion that there should rarely be any,
> and that there is a design flaw in how Dovecot works with multiple
> services with an NFS backend.

Well, he is pretty much correct. I thought I could add enough NFS cache flushes to the code to make it work well, but that's highly dependent on what OS, or even kernel version, the NFS clients are running. Looking at the problems people have with NFS, it's pretty clear that this solution just isn't going to work properly. Then again, Dovecot is the only (free) IMAP server that even attempts to support this kind of setup. Of course, Courier does too, but disabling index files in Dovecot should give the same stability.

I see only two proper solutions:

1) Change your architecture so that all mail accesses for a specific user go through a single server. Install Dovecot proxy so that all IMAP/POP3 connections go through it to the correct server. Later, once v2.0 is stable, install LMTP and make all mail deliveries go through it too (possibly also an LMTP proxy if your MTA can't figure out the correct destination server). In the meantime, use deliver with a configuration that doesn't update index files. This guarantees that only a single server ever accesses a user's mails at any given time, and it's the only guaranteed way to make this work in the near future. With this setup you should see zero corruption.

2) The long-term solution is for Dovecot to stop using the NFS server for inter-process communication and instead connect to the other Dovecot servers directly over the network. Again, in this setup only a single server would be reading/writing a user's index files.
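A minimal sketch of what the proxy tier in option 1 could look like with 1.2's passdb-based proxying. The 'proxy' and 'host' passdb extra fields are standard; the SQL driver, file paths, table, and column names here are assumptions made up for illustration:

  # dovecot.conf on the proxy hosts (sketch)
  protocols = imap pop3
  auth default {
    mechanisms = plain
    passdb sql {
      args = /etc/dovecot/dovecot-sql.conf
    }
  }

  # /etc/dovecot/dovecot-sql.conf (sketch)
  driver = mysql
  connect = host=dbhost dbname=mail user=dovecot password=secret
  # Returning a non-empty 'proxy' plus a 'host' makes this server forward the
  # connection to that backend instead of opening the mailbox itself, so each
  # user is always served by the same backend server.
  password_query = SELECT password, backend_host AS host, 'Y' AS proxy FROM users WHERE userid = '%u'

With every user pinned to one backend this way, the index and transaction log files are only ever written from a single NFS client, which is what removes the corruption.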
David,

> -----Original Message-----
> From: dovecot-bounces+brandond=uoregon.edu at dovecot.org [mailto:dovecot-
>
> Our physical setup is 10 CentOS 5.4 x86_64 IMAP/POP servers, all with
> the same NFS backend where the index, control, and Maildirs for the
> users reside. Accessing this are direct connections from clients, plus
> multiple SquirrelMail webservers and Pine users, all at the same time,
> with layer-4 switch connection load balancing.
>
> Each server has an average of about 400 connections, for a total of
> around 4000 concurrent during a normal business day. This is out of a
> possible user population of about 15,000.
>
> All our Dovecot servers syslog to one machine, and on average I see
> about 50-75 instances of file corruption per day. I'm not counting each
> line, since some instances of corruption generate a log message for each
> uid that's wrong. This is just me counting "user A was corrupted once at
> 10:00, user B was corrupted at 10:25", for example.

We have a very similar setup: 8 POP/IMAP servers running RHEL 5.4, Dovecot 1.2.9 (+ patches), an F5 BigIP load balancer cluster (active/standby) in an L4 profile distributing connections round-robin, maildirs on two NetApp filers (clustered 3070s with 54k RPM SATA disks), and 10k peak concurrent connections for 45k total accounts.

We used to run with the noac mount option, but performance was abysmal, and we were approaching 80% CPU utilization on the filers at peak load. After removing noac, our CPU is down around 30%, and our NFS ops/sec rate is maybe 1/10th of what it used to be.

The downside is that we've started seeing significantly more crashing and mailbox corruption. Timo's latest patch seems to have fixed the crashing, but the corruption just seems to be the cost of distributing users at random across our backend servers. We've thought about enabling IP-based session affinity on the load balancer, but this would concentrate the load of our webmail clients, and it wouldn't really solve the problem for users who leave clients open on multiple systems. I've done a small bit of looking at nginx's IMAP proxy support, but it's not really set up to do what we want, and it would require moving the IMAP virtual server off our load balancers and onto something significantly less supportable. Having the Dovecot processes 'talk amongst themselves' to synchronize things, or go into proxy mode automatically, would be fantastic.

Anyway, that's where we're at with the issue. As a data point for your discussion with your boss:

* With 'noac', we would see maybe one or two 'Corrupted' errors a day. Most of these were related to users going over quota.
* After removing 'noac', we saw 5-10 'Corrupted' errors and 20-30 crashes a day. The crashes were highly visible to the users, as their mailbox would appear to be empty until the rebuild completed.
* Since applying the latest patch, we've seen no crashes and 60-70 'Corrupted' errors a day. We have not had any new user complaints.

Hope that helps,
-Brad
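The noac trade-off described above comes down to the attribute-cache options on the NFS client mounts. A sketch for comparison (the export path, mount point, and the 3-second actimeo value are illustrative, not taken from this thread):

  # noac: every stat() goes to the filer, giving coherent views across
  # servers but a very heavy GETATTR load (the ~80% filer CPU case)
  filer:/vol/mail  /var/spool/mail  nfs  rw,hard,intr,tcp,noac       0 0

  # attribute caching on (optionally bounded with actimeo): far fewer NFS
  # ops, but another server's index/log writes can be seen late, which is
  # what shows up as the extra "Corrupted ..." messages
  filer:/vol/mail  /var/spool/mail  nfs  rw,hard,intr,tcp,actimeo=3  0 0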
On 01/22/2010 12:16 PM, Timo Sirainen wrote:
> Looking at the problems people have with NFS, it's pretty clear that
> this solution just isn't going to work properly.

Actually, considering the number of people and servers we're throwing at it, I think it's dealing with it pretty well. I'm sure there are always more tweaks and enhancements that can be done, but look at how much better 1.2 is over the 1.0 releases. It's definitely not "broken," just maybe not quite as production-ready as it could be. Honestly, at this point my users are very happy with the speed increase, and as long as their imap process isn't dying they don't seem to notice the behind-the-scenes corruption, thanks to the self-healing code.

> But then again, Dovecot is the only (free) IMAP server that even
> attempts to support this kind of setup. Of course, Courier does too,
> but disabling index files in Dovecot should give the same stability.

By the way, I didn't want to give the impression that we are unhappy with the product. I think what you've accomplished with Dovecot is great even by non-free enterprise standards, and the level of support you've given us has been excellent; I appreciate it greatly. It was a clear choice for us over Courier once NFS support became a reality. Load on the exact same hardware dropped from an average of 5 to 0.5, which is quite amazing, not to mention the speed benefit of the indexes. Our users with extremely large Maildirs were very satisfied.

> I see only two proper solutions:
>
> 1) Change your architecture so that all mail accesses for a specific
> user go through a single server. Install Dovecot proxy so that all
> IMAP/POP3 connections go through it to the correct server.

We've discussed this internally and are still considering layer-7 username-based balancing as a possibility, but I haven't worked much on the specifics yet. We've only been running Dovecot for two months, so we wanted to give it some burn-in time and see how things progressed. Now that the core dumps are fixed, I think we might be able to live with the corruption for a while. The only user-visible issue I was aware of was users' mailboxes disappearing when the processes died, but since that's not happening any more I'll have to see if anyone notices the corruption.

Thanks for all the feedback. I'm going over some of the ideas you suggested and we'll be thinking about long-term solutions.

--
================================
David Halik
System Administrator
OIT-CSS Rutgers University
dhalik at jla.rutgers.edu
================================