list at airstreamcomm.net
2011-Jan-20 14:32 UTC
[Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load
Lately our four-node Dovecot 1.2.13 cluster has been logging a massive number of these dotlock errors:

Created dotlock file's timestamp is different than current time (1295480202 vs 1295479784): /mail/user/Maildir/dovecot-uidlist

These dotlock errors coincide with very high load averages, and eventually we have to turn off all but one server to stop them from occurring. We first assumed the trend was related to the NFS storage, but we could not find a networking issue or an NFS-related problem. The mail storage is on NFS, hosted on a CentOS 5.5 host and mounted with the following options: udp,nodev,noexec,nosuid.

Next we suspected NTP, since the timestamps vary so widely, so we rebuilt our NTP servers and found closer stratum 1 source clocks to synchronize to, hoping this would alleviate the problem, but the dotlock errors returned after about 12 hours. We have fcntl locking set in our configuration file, but our understanding from looking at the source code is that this file is locked with a dotlock.

Any help troubleshooting is appreciated.

Thanks,
Michael

# 1.2.13: /etc/dovecot.conf
# OS: Linux 2.6.18-194.8.1.el5 x86_64 CentOS release 5.5 (Final)
protocols: imap pop3
listen(default): *:143
listen(imap): *:143
listen(pop3): *:110
shutdown_clients: no
login_dir: /var/run/dovecot/login
login_executable(default): /usr/libexec/dovecot/imap-login
login_executable(imap): /usr/libexec/dovecot/imap-login
login_executable(pop3): /usr/libexec/dovecot/pop3-login
login_process_per_connection: no
login_process_size: 128
login_processes_count: 4
login_max_processes_count: 256
login_max_connections: 386
first_valid_uid: 300
mail_location: maildir:~/Maildir
mmap_disable: yes
dotlock_use_excl: no
mail_nfs_storage: yes
mail_nfs_index: yes
mail_executable(default): /usr/libexec/dovecot/imap
mail_executable(imap): /usr/libexec/dovecot/imap
mail_executable(pop3): /usr/libexec/dovecot/pop3
mail_plugin_dir(default): /usr/lib64/dovecot/imap
mail_plugin_dir(imap): /usr/lib64/dovecot/imap
mail_plugin_dir(pop3): /usr/lib64/dovecot/pop3
auth default:
  username_format: %Ln
  worker_max_count: 50
  passdb:
    driver: pam
  userdb:
    driver: passwd
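
[Editor's note] The two numbers in the warning are Unix timestamps, and their absolute difference is 418 seconds, roughly seven minutes. The sketch below is not Dovecot's code; it is a rough way to reproduce the same kind of comparison by hand, assuming that a freshly created file on the NFS mount carries a timestamp assigned by the NFS server. The function name `timestamp_skew` and the `skew-probe-` file prefix are made up for illustration.

```python
import os
import tempfile
import time


def timestamp_skew(directory):
    """Create a scratch file in `directory` and return
    (file mtime - local clock) in seconds.

    On an NFS mount the mtime of a newly created file is normally set
    by the server, so a large value suggests the client and server
    clocks disagree. On a local filesystem it should be close to zero.
    """
    fd, path = tempfile.mkstemp(prefix="skew-probe-", dir=directory)
    try:
        os.close(fd)
        local_now = time.time()
        file_mtime = os.stat(path).st_mtime
    finally:
        # Always remove the scratch file, even if stat() fails.
        os.unlink(path)
    return file_mtime - local_now


# The two values quoted in the log line above.
reported_skew = abs(1295480202 - 1295479784)  # 418 seconds
```

Running `timestamp_skew("/mail/user/Maildir")` on each cluster node (the path is taken from the log line; any directory on the NFS mount would do) would give a per-node reading to compare against the 418 seconds reported in the log.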
Stan Hoeppner
2011-Jan-20 16:57 UTC
[Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load
list at airstreamcomm.net put forth on 1/20/2011 8:32 AM:

> Secondly we thought the issues were due to NTP as the time stamps vary so
> widely, so we rebuilt our NTP servers and found closer stratum 1 source
> clocks to synchronize to hoping it would alleviate the problem but the
> dotlock errors returned after about 12 hours. We have fcntl locking set in
> our configuration file, but it is our understanding from look at the source
> code that this file is locked with dotlock.
>
> Any help troubleshooting is appreciated.

From your description it sounds as if you're syncing ntpd on each of the 4 servers against an external time source, first stratum 2/3 sources, then stratum 1 sources, in an attempt to cure this problem.

In a clustered server environment, _always_ run a local ntpd server on a physical box or router (preferably two) that queries a set of external sources and services your internal machines' queries. With RTTs all on your LAN, and using the same internal time sources for every query, this clock drift issue should be eliminated. Obviously, when you first set this up, stop ntpd and run ntpdate to get an initial time sync for each cluster host.

If the problem persists after setting this up, and we're dealing with bare-metal cluster member servers, then I'd guess you've got a failed/defective clock chip on one host. If this is Linux, you can work around that by changing the local time source. There are something like 5 options. Google for "Linux time" or similar. Or, simply replace the hardware--RTC chip, mobo, etc.

If any of these cluster members are virtual machines, regardless of hypervisor, I'd recommend disabling ntpd and cron'ing ntpdate to run once every 5 minutes, or once a minute, whatever it takes to keep the times synced, against your local ntpd server mentioned above. I got to the point with VMware ESX that I could make any Linux distro VM with a 2.4 or 2.6 kernel stay within one minute a month before needing a manual ntpdate against our local time source. The time required to get to that point is a total waste. Cron'ing ntpdate as I mentioned is the quick, reliable way to solve this issue, if you're using VMs.

--
Stan
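
[Editor's note] To check whether every cluster member really agrees with the local time source described above, one option is to measure each host's offset directly. The following is a minimal SNTP client sketch, not a replacement for ntpd or ntpdate; it uses the standard NTP offset formula ((t2 - t1) + (t3 - t4)) / 2. The function names are made up for illustration, and any hostname passed to `query_offset` (for example `ntp1.example.lan`) is an assumption that does not appear in the thread.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_DELTA = 2208988800


def ntp_to_unix(seconds, fraction):
    """Convert a 64-bit NTP timestamp (two 32-bit words) to Unix time."""
    return seconds - NTP_EPOCH_DELTA + fraction / 2**32


def offset_from_reply(reply, t1, t4):
    """Estimate (server clock - local clock) in seconds.

    t1 is the local time the request was sent and t4 the local time the
    reply arrived. t2 (server receive) and t3 (server transmit) are read
    from words 8-9 and 10-11 of the 48-byte NTP header.
    """
    words = struct.unpack("!12I", reply[:48])
    t2 = ntp_to_unix(words[8], words[9])
    t3 = ntp_to_unix(words[10], words[11])
    return ((t2 - t1) + (t3 - t4)) / 2


def query_offset(host, timeout=2.0):
    """Send one SNTP client request to `host` and return the estimated offset."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version=3, mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(request, (host, 123))
        reply, _ = sock.recvfrom(512)
        t4 = time.time()
    return offset_from_reply(reply, t1, t4)
```

Calling `query_offset()` with the internal NTP server's name on each node, and on the NFS server, should return values close to zero on a healthy cluster; an offset of several minutes on one host would point at that host's clock.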
Timo Sirainen
2011-Jan-20 21:18 UTC
[Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load
On Thu, 2011-01-20 at 08:32 -0600, list at airstreamcomm.net wrote:

> Created dotlock file's timestamp is different than current time
> (1295480202 vs 1295479784): /mail/user/Maildir/dovecot-uidlist

Hmm. This may be a bug that happens when dotlocking has to wait a long time for the dotlock. See if http://hg.dovecot.org/dovecot-1.2/raw-rev/9a50a9dc905f fixes this.
list at airstreamcomm.net
2011-Jan-24 14:41 UTC
[Dovecot] Dotlock dovecot-uidlist errors / NFS / High Load
Timo,

Thanks for the quick reply! We are building an rpm with the patch and will test this week and report back with our findings. We are grateful for your help and the chance to communicate directly with the author of dovecot!

Michael

On Thu, 20 Jan 2011 23:18:16 +0200, Timo Sirainen <tss at iki.fi> wrote:

> On Thu, 2011-01-20 at 08:32 -0600, list at airstreamcomm.net wrote:
>
>> Created dotlock file's timestamp is different than current time
>> (1295480202 vs 1295479784): /mail/user/Maildir/dovecot-uidlist
>
> Hmm. This may be a bug that happens when dotlocking has to wait for a
> long time for dotlock. See if
> http://hg.dovecot.org/dovecot-1.2/raw-rev/9a50a9dc905f fixes this.