Søren Schrøder
2008-Apr-04 10:37 UTC
[Dovecot] Dovecot and stale nfs-locks hanging processes
Greetings, Dovecot mailing list.

I have implemented a relatively big Dovecot setup (250k users), and overall I am very pleased with Dovecot's functionality and performance.

Setup description:

* Dovecot 1.0.x
* FreeBSD 6.3
* Postfix (using Dovecot deliver as LDA)
* OpenLDAP backend
* Storage is NFS (Clariion EMC NFSd for Maildir, and FreeBSD NFSd for indexes)
* Locking is fcntl using rpc.lockd
* Users access mail using POP3 and IMAP (IMAP mainly via SquirrelMail, but also directly)
* 3 frontends for POP/SMTP and 2 frontends for IMAP (webmail), with round-robin DNS

My problem: I am having issues where POP3, IMAP and deliver processes get stuck, apparently waiting for a device. fstat shows:

    bash# fstat -p 93522
    USER     CMD    PID   FD   MOUNT       INUM     MODE        SZ|DV   R/W
    302870   pop3   93522 root /           2        drwxr-xr-x  512     r
    302870   pop3   93522 wd   /home/mnt5  51592    drwxr-xr-x  80      r
    302870   pop3   93522 text /usr        121619   -r-xr-xr-x  436616  r
    302870   pop3   93522 0*   internet stream tcp
    302870   pop3   93522 1*   internet stream tcp
    302870   pop3   93522 2*   pipe c778aa48 <-> c778a990       0       rw
    302870   pop3   93522 3    /dev        24       crw-rw-rw-  random  r
    302870   pop3   93522 5*   pipe ce440b28 <-> ce440be0       0       rw
    302870   pop3   93522 6*   pipe ce440be0 <-> ce440b28       0       rw
    302870   pop3   93522 7    /home/mnt5  9010290  -rw-------  1493    rw
    302870   pop3   93522 8    -           -        bad         -
    302870   pop3   93522 9    -           -        bad         -
    302870   pop3   93522 10   -           -        bad

The inode in question on /home/mnt5 is a dot-nfs file, indicating a stale lock:

    bash# ls -li | grep 9010290
    9010290 -rw-------  1 302870  42  1493 Apr  3 18:05 .nfs.0668c236.6d524.4

ktrace on the PID shows absolutely no activity. The pop3 process is un-killable, and pop3 processes from the user keep stacking up, as do deliver processes for that user. Not healthy. I was under the impression that POP3 would exit when a lock is set, preventing more than one pop3 process per user, but that doesn't seem to be the case.
Stopping Dovecot entirely leaves these stale pop3/imap/deliver processes hanging, even with shutdown_clients = yes. The Windows-problem solution (reboot) seems to be the only way to get rid of the locked processes.

So: has anyone else observed this behavior and found the magic cure?

I wonder if there is a way to implement a "max wall-clock time" per Dovecot process type (i.e. terminate the process after, for example, 120 sec. for delivery, 600 sec. for pop3, etc.) as a crude "garbage collector".

Any hints/suggestions are welcome.

-- 
Søren Schrøder
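The "max wall-clock time" idea could be approximated outside Dovecot with a small external watchdog. The following is only a minimal Python sketch, not a Dovecot feature: the `LIMITS` table, the function names, and the imap limit are assumptions made for illustration (the 120 s and 600 s values are the examples from the post). It only reports overdue processes, because a process stuck in an uninterruptible kernel wait will not respond to signals anyway.

```python
import re

# Hypothetical per-command wall-clock limits in seconds. The deliver and pop3
# values are the examples from the post; the imap value is an assumption.
LIMITS = {"deliver": 120, "pop3": 600, "imap": 3600}

# ps "etime" format: [[dd-]hh:]mm:ss
_ETIME = re.compile(r"^(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+)$")


def etime_to_seconds(etime):
    """Convert a ps elapsed-time string ([[dd-]hh:]mm:ss) to seconds."""
    match = _ETIME.match(etime.strip())
    if match is None:
        raise ValueError("unrecognised etime: %r" % etime)
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_overdue(ps_output, limits=LIMITS):
    """Return (pid, command, age_seconds) for processes older than their limit.

    ps_output is expected to hold one "PID ETIME COMMAND" line per process,
    for example the output of: ps -ax -o pid= -o etime= -o comm=
    """
    overdue = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        pid, etime, comm = parts
        name = comm.strip().rsplit("/", 1)[-1]  # drop any leading path
        limit = limits.get(name)
        if limit is None:
            continue  # not a command we are watching
        age = etime_to_seconds(etime)
        if age > limit:
            overdue.append((int(pid), name, age))
    return overdue
```

A cron job could feed the ps output into `find_overdue` and log or alert on the result; whether to send a signal to the listed PIDs is left out on purpose.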
Charles Marcus
2008-Apr-04 11:35 UTC
[Dovecot] Dovecot and stale nfs-locks hanging processes
On 4/4/2008, Søren Schrøder (sch at cybercity.dk) wrote:
> * dovecot 1.0.x

Is the exact version a secret? ;) Also, the output of dovecot -n is usually helpful...

-- 
Best regards, Charles
Timo Sirainen
2008-Apr-04 14:18 UTC
[Dovecot] Dovecot and stale nfs-locks hanging processes
On Fri, 2008-04-04 at 12:37 +0200, Søren Schrøder wrote:
> The pop3 process is un-killable, and I end up stacking up pop3 processes
> from the user, as well as deliver to the user. Not healthy.. I was under
> the impression that POP3 would exit when a lock is set, preventing more
> than one pop3 processes pr. user, but it doesn't seem to be the case.

If you can't kill a process even with kill -9, the problem is with the kernel and Dovecot can't do much about it.

How about trying without fcntl locks:

    lock_method = dotlock

Also have you read http://wiki.dovecot.org/NFS?
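For readers unfamiliar with the difference: a dotlock is a lock file created next to the protected file, so it needs no lock daemon such as rpc.lockd. The following is a generic Python sketch of the technique, not Dovecot's implementation; the function names, the ".lock" suffix, and the timeout values are assumptions for illustration. It relies on exclusive file creation being atomic, which older NFS versions did not guarantee.

```python
import os
import time


def acquire_dotlock(path, timeout=10.0, stale_after=120.0, poll=0.1):
    """Try to create path + ".lock"; return the lock path, or None on timeout.

    A lock file whose mtime is older than stale_after seconds is treated as
    abandoned and removed. (Two waiters may race on that removal; a real
    implementation needs more care here.)
    """
    lock_path = path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                age = time.time() - os.stat(lock_path).st_mtime
            except FileNotFoundError:
                continue  # holder released the lock just now; retry at once
            if age > stale_after:
                try:
                    os.unlink(lock_path)  # break the stale lock
                except FileNotFoundError:
                    pass
                continue
            if time.monotonic() >= deadline:
                return None
            time.sleep(poll)
        else:
            # Record the owner's PID to help manual debugging.
            with os.fdopen(fd, "w") as handle:
                handle.write(str(os.getpid()))
            return lock_path


def release_dotlock(lock_path):
    """Release a lock obtained from acquire_dotlock."""
    os.unlink(lock_path)
```

The polling loop and the extra file operations per lock are the usual reason dotlocks are slower than fcntl locks, but a waiter here always times out in user space instead of blocking inside the kernel.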
Søren Schrøder
2008-Apr-04 21:28 UTC
[Dovecot] Dovecot and stale nfs-locks hanging processes
Timo Sirainen wrote:
> If you can't kill a process even with kill -9, the problem is with the
> kernel and Dovecot can't do much about it.

Exactly - the process seems to be waiting for a device. I suspect rpc.lockd to be the sinner. With the NFS being an EMC system, my means of debugging on the server side are limited. That's why I called for input from the list.

> How about trying without fcntl locks:
>
> lock_method = dotlock

I tried dotlocking prior to fcntl, but I really could use the performance gain fcntl gives in comparison to dotlocking.

> Also have you read http://wiki.dovecot.org/NFS?

I have indeed, and I know that my setup is "Dovecot is run in multiple computers, users are redirected more or less randomly to different computers.", the one to be avoided, so I asked for trouble :) But I like the horizontal scaling of such a setup. I will revert to dotlocking then.

-- 
Søren Schrøder
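When server-side debugging is limited, one client-side check is to probe whether an fcntl lock request on the NFS mount returns at all. The following is a hedged Python sketch under stated assumptions: the function name, the return labels, and the 5-second timeout are made up for illustration. The lock attempt runs in a child process so that a hang in the kernel blocks only the child, and the parent deliberately does not wait for a child that fails to return.

```python
import subprocess
import sys

# Child program: request a non-blocking exclusive fcntl lock, then release it.
# Exit code 0 means the lock was granted, 3 means another process holds it.
_CHILD = r"""
import fcntl, os, sys
fd = os.open(sys.argv[1], os.O_RDWR | os.O_CREAT, 0o600)
try:
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit(3)
fcntl.lockf(fd, fcntl.LOCK_UN)
sys.exit(0)
"""


def probe_fcntl_lock(path, timeout=5.0):
    """Return "ok", "busy", "hung" or "error" for an fcntl lock probe on path.

    Note: the probe creates path if it does not exist.
    """
    proc = subprocess.Popen([sys.executable, "-c", _CHILD, path])
    try:
        code = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in the kernel may ignore the signal,
        # so we must not wait for it here.
        proc.kill()
        return "hung"
    return {0: "ok", 3: "busy"}.get(code, "error")
```

Running the probe against a scratch file on each NFS mount from each frontend would show whether lock requests stall on a particular client or server. A "hung" result may leave an unkillable child behind, just like the stuck pop3 processes described above.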