Joe Allesi -X (joallesi - Coyote Creek Consulting at Cisco)
2007-Oct-23 17:16 UTC
[Dovecot] dovecot-auth: Too many open files
All, [version: dovecot-0.99.11-4.EL4.src.rpm] We recently experienced an issue that prevented all new IMAP logins from occurring. Although it appears that it was due to running out of available system file descriptors, I'm still not sure what the true root cause was as I can't replicate the same error in our test environment. The system file descriptor max was set at (per `cat /proc/sys/fs/file-nr`) 380081 , and we're averaging around 7,000 in use. It appears that the main issue was dovecot-auth, so does this appear to fall in line with the known PAM bug in this version? Auth config: auth_userdb = passwd auth_passdb = pam Checked maillogs (cleansed): Oct 16 11:48:05 host dovecot-auth: PAM: pipe() failed: Too many open files Oct 16 04:48:06 host imap-login: Disconnected [::ffff:<internal_IP>] Oct 16 11:48:11 host dovecot-auth: PAM: pipe() failed: Too many open files Oct 16 04:48:11 host imap-login: Aborted login [::ffff:<internal_IP>] Oct 16 04:48:11 host dovecot-auth: PAM unable to dlopen(/lib/security/pam_nologin.so) Oct 16 04:48:11 host dovecot-auth: PAM [dlerror: /lib/security/pam_nologin.so: cannot open shared object file: Too many open files] Oct 16 04:48:11 host dovecot-auth: PAM adding faulty module: /lib/security/pam_nologin.so Oct 16 04:48:11 host dovecot-auth: PAM unable to dlopen(/lib/security/pam_stack.so) Oct 16 04:48:11 host dovecot-auth: PAM [dlerror: /lib/security/pam_stack.so: cannot open shared object file: Too many open files] So, during the event I performed an strace against the dovecot-auth process and noticed the following error as well: accept(3, 0xbfe05a50, [2]) = -1 EMFILE (Too many open files) gettimeofday({1192538310, 41972}, NULL) = 0 poll([{fd=3, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL, revents=POLLIN}, {fd=0, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=18, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=10, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=7, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=23, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=8, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=13, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=14, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=19, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=16, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=17, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=6, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=15, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=20, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=21, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=9, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=22, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=24, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=25, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=26, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=27, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=28, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=34, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=33, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=29, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=37, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=30, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=31, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=41, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, ...], 1020, 1585) = 1 gettimeofday({1192538310, 99354}, {480, 0}) = 0 Then I checked for the current network connections: netstat | grep imap | wc -l 42 Then used lsof to check for open files: while true ; do lsof | awk '{ print $3 }' | wc -l ; sleep 4 ; done 21728 21668 (restarted dovecot) 1838 1953 1864 2066 2018 2020 2036 2189 2661 2558 2490 (cont @2,000) Thanks! Joe Allesi
IIRC setting "login_process_size: 64" fixes this. p at rick * Joe Allesi -X (joallesi - Coyote Creek Consulting at Cisco) <joallesi at cisco.com>:> All, > > [version: dovecot-0.99.11-4.EL4.src.rpm] > > We recently experienced an issue that prevented all new IMAP logins from > occurring. Although it appears that it was due to running out of > available system file descriptors, I'm still not sure what the true root > cause was as I can't replicate the same error in our test environment. > The system file descriptor max was set at (per `cat > /proc/sys/fs/file-nr`) 380081 , and we're averaging around 7,000 in use. > It appears that the main issue was dovecot-auth, so does this appear to > fall in line with the known PAM bug in this version? > > Auth config: > auth_userdb = passwd > auth_passdb = pam > > Checked maillogs (cleansed): > > Oct 16 11:48:05 host dovecot-auth: PAM: pipe() failed: Too many open > files > Oct 16 04:48:06 host imap-login: Disconnected [::ffff:<internal_IP>] > Oct 16 11:48:11 host dovecot-auth: PAM: pipe() failed: Too many open > files > Oct 16 04:48:11 host imap-login: Aborted login [::ffff:<internal_IP>] > Oct 16 04:48:11 host dovecot-auth: PAM unable to > dlopen(/lib/security/pam_nologin.so) > Oct 16 04:48:11 host dovecot-auth: PAM [dlerror: > /lib/security/pam_nologin.so: cannot open shared object file: Too many > open files] > Oct 16 04:48:11 host dovecot-auth: PAM adding faulty module: > /lib/security/pam_nologin.so > Oct 16 04:48:11 host dovecot-auth: PAM unable to > dlopen(/lib/security/pam_stack.so) > Oct 16 04:48:11 host dovecot-auth: PAM [dlerror: > /lib/security/pam_stack.so: cannot open shared object file: Too many > open files] > > So, during the event I performed an strace against the dovecot-auth > process and noticed the following error as well: > > accept(3, 0xbfe05a50, [2]) = -1 EMFILE (Too many open > files) > gettimeofday({1192538310, 41972}, NULL) = 0 > poll([{fd=3, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL, > revents=POLLIN}, {fd=0, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, > {fd=18, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=10, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=7, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=23, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=8, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=12, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=13, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=14, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=19, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=16, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=17, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=6, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=15, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=20, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=21, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=9, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=22, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=24, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=25, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=26, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=27, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=28, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=34, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=33, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=29, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=37, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=30, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=31, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=41, > events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, ...], 1020, 1585) = 1 > gettimeofday({1192538310, 99354}, {480, 0}) = 0 > > Then I checked for the current network connections: > > netstat | grep imap | wc -l > 42 > > Then used lsof to check for open files: > > while true ; do lsof | awk '{ print $3 }' | wc -l ; sleep 4 ; done > 21728 > 21668 > (restarted dovecot) > 1838 > 1953 > 1864 > 2066 > 2018 > 2020 > 2036 > 2189 > 2661 > 2558 > 2490 > (cont @2,000) > > Thanks! > > Joe Allesi-- state of mind Agentur f?r Kommunikation, Design und Softwareentwicklung Patrick Koetter Tel: 089 45227227 Echinger Strasse 3 Fax: 089 45227226 85386 Eching Web: http://www.state-of-mind.de Amtsgericht M?nchen Partnerschaftsregister PR 563
Joe Allesi -X (joallesi - Coyote Creek Consulting at Cisco)
2007-Oct-24 20:46 UTC
[Dovecot] dovecot-auth: Too many open files
I just noticed more than 700 of these in the messages log file from the day just prior to the outage, but no username is listed and there aren't any correlating entries in maillog: Oct 15 15:03:08 host dovecot(pam_unix)[6724]: authentication failure; logname= uid=0 euid=0 tty= ruser= rhostOct 15 15:03:41 host dovecot(pam_unix)[6952]: check pass; user unknown Could these have caused the dovecot-auth processes to spawn out of control, claiming too many file descriptors that were never reclaimed? I also noticed that the later versions of dovecot now include a configuration setting that prevents this from occuring: auth_worker_max_count = 30 Thanks! -----Original Message----- From: dovecot-bounces+joallesi=cisco.com at dovecot.org [mailto:dovecot-bounces+joallesi=cisco.com at dovecot.org] On Behalf Of Joe Allesi -X (joallesi - Coyote Creek Consulting at Cisco) Sent: Tuesday, October 23, 2007 1:17 PM To: List Mailing Dovecot Subject: [Dovecot] dovecot-auth: Too many open files All, [version: dovecot-0.99.11-4.EL4.src.rpm] We recently experienced an issue that prevented all new IMAP logins from occurring. Although it appears that it was due to running out of available system file descriptors, I'm still not sure what the true root cause was as I can't replicate the same error in our test environment. The system file descriptor max was set at (per `cat /proc/sys/fs/file-nr`) 380081 , and we're averaging around 7,000 in use. It appears that the main issue was dovecot-auth, so does this appear to fall in line with the known PAM bug in this version? Auth config: auth_userdb = passwd auth_passdb = pam Checked maillogs (cleansed): Oct 16 11:48:05 host dovecot-auth: PAM: pipe() failed: Too many open files Oct 16 04:48:06 host imap-login: Disconnected [::ffff:<internal_IP>] Oct 16 11:48:11 host dovecot-auth: PAM: pipe() failed: Too many open files Oct 16 04:48:11 host imap-login: Aborted login [::ffff:<internal_IP>] Oct 16 04:48:11 host dovecot-auth: PAM unable to dlopen(/lib/security/pam_nologin.so) Oct 16 04:48:11 host dovecot-auth: PAM [dlerror: /lib/security/pam_nologin.so: cannot open shared object file: Too many open files] Oct 16 04:48:11 host dovecot-auth: PAM adding faulty module: /lib/security/pam_nologin.so Oct 16 04:48:11 host dovecot-auth: PAM unable to dlopen(/lib/security/pam_stack.so) Oct 16 04:48:11 host dovecot-auth: PAM [dlerror: /lib/security/pam_stack.so: cannot open shared object file: Too many open files] So, during the event I performed an strace against the dovecot-auth process and noticed the following error as well: accept(3, 0xbfe05a50, [2]) = -1 EMFILE (Too many open files) gettimeofday({1192538310, 41972}, NULL) = 0 poll([{fd=3, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL, revents=POLLIN}, {fd=0, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=18, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=10, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=7, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=23, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=8, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=11, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=12, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=13, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=14, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=19, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=16, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=17, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=6, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=15, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=20, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=21, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=9, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=22, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=24, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=25, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=26, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=27, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=28, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=34, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=33, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=29, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=37, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=30, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=31, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=41, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, ...], 1020, 1585) = 1 gettimeofday({1192538310, 99354}, {480, 0}) = 0 Then I checked for the current network connections: netstat | grep imap | wc -l 42 Then used lsof to check for open files: while true ; do lsof | awk '{ print $3 }' | wc -l ; sleep 4 ; done 21728 21668 (restarted dovecot) 1838 1953 1864 2066 2018 2020 2036 2189 2661 2558 2490 (cont @2,000) Thanks! Joe Allesi
Hello, On Tue, Oct 23, 2007 at 01:16:35PM -0400, Joe Allesi -X (joallesi - Coyote Creek Consulting at Cisco) wrote:> We recently experienced an issue that prevented all new IMAP logins from > occurring. Although it appears that it was due to running out of > available system file descriptors, I'm still not sure what the true root > [...] > > Auth config: > auth_userdb = passwd > auth_passdb = pam > [...]Do you use LDAP or just local passwd? -- Tomas Janousek, SW Engineer, Red Hat, Inc.
On Tue, 2007-10-23 at 13:16 -0400, Joe Allesi -X (joallesi - Coyote Creek Consulting at Cisco) wrote:> [version: dovecot-0.99.11-4.EL4.src.rpm] > > We recently experienced an issue that prevented all new IMAP logins from > occurring. Although it appears that it was due to running out of > available system file descriptors, I'm still not sure what the true root > cause was as I can't replicate the same error in our test environment. > The system file descriptor max was set at (per `cat > /proc/sys/fs/file-nr`) 380081 , and we're averaging around 7,000 in use. > It appears that the main issue was dovecot-auth, so does this appear to > fall in line with the known PAM bug in this version?Check what the open files are: ls -l /proc/`pidof dovecot-auth`/fd/ Although I don't recommend using v0.99 even if you can fix this problem. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: <http://dovecot.org/pipermail/dovecot/attachments/20071027/cc967d38/attachment-0002.bin>