Hi,

Since we upgraded to 2.0.9 (from 1.10 stock CentOS release), we are getting some errors with pop3. When the machines get busy, now and then it starts with the following:

> Mar 7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) failed: Resource temporarily unavailable

And it generates hundreds of those before the machine dies, with the web server getting stuck as well on imap sessions, even though there are no imap error messages. I tried to look up in the source where this is generated, but couldn't find it. Is it some resource limit on the machine?

Please see below the dovecot -n output.

Regards,
Thierry

# 2.0.9: /etc/dovecot/dovecot.conf
# OS: Linux 2.6.18-194.32.1.el5 x86_64 CentOS release 5.5 (Final) nfs
auth_cache_negative_ttl = 0
auth_cache_size = 2 M
auth_cache_ttl = 10 mins
auth_mechanisms = plain login
dict {
  expire = mysql:/etc/dovecot/dovecot-dict-expire.conf.ext
  quota = mysql:/etc/dovecot/dovecot-dict-sql.conf.ext
}
disable_plaintext_auth = no
first_valid_uid = 25
lda_mailbox_autocreate = yes
lda_mailbox_autosubscribe = yes
mail_access_groups = vmail
mail_fsync = always
mail_gid = 25
mail_location = maildir:/var/virtual/%d/%1n/%2n/%n:INDEX=/var/indexes/%d/%1n/%2n/%n
mail_nfs_index = yes
mail_nfs_storage = yes
mail_plugins = quota expire
mail_uid = 25
mbox_write_locks = fcntl
mmap_disable = yes
passdb {
  args = /etc/dovecot/dovecot-sql.conf.ext
  driver = sql
}
plugin {
  expire = Trash
  expire2 = Trash/*
  expire3 = Spam
  expire4 = Trash/*
  expire_dict = proxy::expire
  quota = dict:User quota::proxy::quota
  quota_rule = *:storage=2G
  quota_rule2 = Trash:storage=+100M
  sieve_global_path = /var/indexes/dovecot-default.sieve
}
postmaster_address = postmaster at XXXXXX.YY
protocols = imap pop3
service auth {
  unix_listener auth-userdb {
    group = vmail
    mode = 0600
    user = vmail
  }
}
service dict {
  unix_listener dict {
    group = vmail
    mode = 0600
    user = vmail
  }
}
service imap-login {
  inet_listener imap {
    address = 127.0.0.1
    port = 143
  }
  process_min_avail = 2
  service_count = 0
  vsz_limit = 256 M
}
service imap {
  process_limit = 768
}
service pop3-login {
  inet_listener pop3 {
    address = 0.0.0.0
    port = 110
  }
  process_limit = 200
  process_min_avail = 2
  service_count = 0
}
service pop3 {
  process_limit = 256
}
ssl = no
ssl_cert = </etc/pki/dovecot/certs/dovecot.pem
ssl_key = </etc/pki/dovecot/private/dovecot.pem
userdb {
  driver = prefetch
}
userdb {
  args = /etc/dovecot/dovecot-sql.conf.ext
  driver = sql
}
userdb {
  args = uid=vmail gid=vmail
  driver = static
}
protocol lmtp {
  mail_plugins = quota expire sieve
}
protocol lda {
  mail_plugins = quota expire sieve
}
protocol imap {
  mail_plugins = quota expire imap_quota
}
protocol pop3 {
  mail_max_userip_connections = 2
  pop3_fast_size_lookups = yes
}
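For background on the error itself: "Resource temporarily unavailable" is strerror(EAGAIN), and one way it shows up on a UNIX-domain socket is a non-blocking connect() to a listener whose accept backlog is full, which is plausibly what a login process sees when the pop3 service stops accepting fast enough. Here is a minimal sketch of that failure mode; the socket path and attempt counts are invented for illustration, and the errno behavior assumes Linux AF_UNIX semantics:

```python
import errno
import os
import socket
import tempfile

# A throwaway socket path standing in for /var/run/dovecot's pop3 socket.
path = os.path.join(tempfile.mkdtemp(), "pop3.sock")

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)
srv.listen(1)  # tiny backlog, standing in for a saturated service

# Fill the backlog without ever calling accept(), then keep connecting.
results = []
conns = []
for _ in range(16):
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.setblocking(False)
    results.append(c.connect_ex(path))  # 0 on success, errno on failure
    conns.append(c)

# Report the distinct errno values from the failed attempts.
failed = sorted({r for r in results if r})
print("non-zero connect_ex results:", [errno.errorcode.get(r, r) for r in failed])

for c in conns:
    c.close()
srv.close()
os.unlink(path)
```

On Linux the failed attempts report EAGAIN, the errno rendered as "Resource temporarily unavailable"; other platforms may return a different errno for a full backlog.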
On 7.3.2011, at 11.51, Thierry de Montaudry wrote:

> Since we upgraded to 2.0.9 (from 1.10 stock CentOS release), we are getting some errors with pop3. When the machines get busy, now and then it starts with the following:
>> Mar 7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) failed: Resource temporarily unavailable
> And it generates hundreds of those before the machine dies, with the web server getting stuck as well on imap sessions, even though there are no imap error messages.

Do you see any warning messages in logs containing "client connections are being dropped"?
On 07 Mar 2011, at 12:01, Timo Sirainen wrote:

> On 7.3.2011, at 11.51, Thierry de Montaudry wrote:
>> Since we upgraded to 2.0.9 (from 1.10 stock CentOS release), we are getting some errors with pop3. When the machines get busy, now and then it starts with the following:
>>> Mar 7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) failed: Resource temporarily unavailable
>> And it generates hundreds of those before the machine dies, with the web server getting stuck as well on imap sessions, even though there are no imap error messages.
>
> Do you see any warning messages in logs containing "client connections are being dropped"?

I did not see it on any machines. But for this specific one, I got the following after those errors, before restarting dovecot:

Mar 7 11:20:09 web4 dovecot: pop3(x at y): Error: net_connect_unix(/var/run/dovecot/dict) failed: Connection refused
Mar 7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection reset by peer
Mar 7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection reset by peer
Mar 7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection reset by peer
Mar 7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection reset by peer
On Mon, 2011-03-07 at 13:40 +0200, Thierry de Montaudry wrote:

>>>> Mar 7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) failed: Resource temporarily unavailable
> ..
>> Do you see any warning messages in logs containing "client connections are being dropped"?
>
> I did not see it on any machines.

Hmh. Could you upgrade to 2.0.11? It splits the two causes of the "Resource temporarily unavailable" error into two separate error messages, which would help figure out the problem.

> But for this specific one, I got the following after those errors, before restarting dovecot:
>
> Mar 7 11:20:09 web4 dovecot: pop3(x at y): Error: net_connect_unix(/var/run/dovecot/dict) failed: Connection refused
> Mar 7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection reset by peer
> Mar 7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection reset by peer
> Mar 7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection reset by peer
> Mar 7 11:20:11 web4 dovecot: pop3-login: Error: read(pop3) failed: Connection reset by peer

This looks like processes started dying.
On 08 Mar 2011, at 13:24, Chris Wilson wrote:

> Hi Thierry,
>
> On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
>> On 07 Mar 2011, at 19:15, Timo Sirainen wrote:
>>> On Mon, 2011-03-07 at 19:03 +0200, Thierry de Montaudry wrote:
>>>>>> Mar 7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) failed: Resource temporarily unavailable
>>>>> ..
>>>> As it is happening at least once a day, is there anything I can do to trace it? And whatever I do, will it slow down those machines?
>>>
>>> Set verbose_proctitle=yes (won't slow down) and get a list of all Dovecot processes when it happens. And check how much user and system CPU it's using and what the load is.
>>
>> Got the same problem this morning; here is the CPU usage and ps aux for dovecot, plus the different errors I could pick up in the log, most of them repeated a couple of times.
>>
>> I suspect it's a problem with system resources, but can't find any message to tell me which. Mail is stored on 17 NFS servers (CentOS), plus 3 servers for indexes only.
>>
>> CPU load is very high, but mainly from httpd running our webmail interface, which uses the local imap server.
> [...]
>> top - 11:10:14 up 14 days, 12:04, 2 users, load average: 55.04, 29.13, 14.55
>> Tasks: 474 total, 60 running, 414 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 99.6%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
>> Mem: 16439812k total, 16353268k used, 86544k free, 33268k buffers
>> Swap: 4192956k total, 140k used, 4192816k free, 8228744k cached
>
> You're lucky this server is still alive and that you could even run top and ps on it.
>
> There's nothing to debug in dovecot here. Your server is overloaded by about 55 times. Buy 55 times as many servers or do something about your webmail interface (maybe a separate webmail cluster).
>
> Cheers, Chris.

As you can see from the numbers (55.04, 29.13, 14.55), the load was still climbing when I took this snapshot, and this was not a normal situation. Usually this machine's load is only between 1 and 4, which is quite ok for a quad core. It only happens when dovecot starts generating errors and pop3, imap and http get stuck. It went up to 200, and I was still able to stop the web and mail daemons, then restart them, and everything was back to normal.
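As a rule of thumb for the figures being traded here: the load average roughly counts runnable tasks, so it is best read per core. A small illustrative snippet (nothing in it comes from the thread itself; it just restates the quad-core arithmetic being discussed):

```python
import os

# Load average is (roughly) the number of runnable tasks averaged over
# 1, 5 and 15 minutes; dividing by the core count gives a per-core figure.
load1, load5, load15 = os.getloadavg()
cores = os.cpu_count() or 1
print(f"1-min load {load1:.2f} on {cores} cores = {load1 / cores:.2f} per core")

# The snapshot above: a 1-minute load of 55.04 on a quad core is ~13.8
# runnable tasks per core, whereas the usual load of 1-4 stays near or
# below 1.0 per core.
print(f"snapshot from the thread: {55.04 / 4:.2f} per core")
```

A per-core figure persistently above 1.0 means tasks are queueing for CPU, which matches login processes timing out against the service sockets.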
On 08 Mar 2011, at 19:12, Charles Marcus wrote:

> On 2011-03-08 12:00 PM, Thierry de Montaudry wrote:
>> but moving from dovecot 1.10.13 to 2.0.9
>
> First time I thought it was a typo and ignored it...
>
> There has never been a version 1.10.xxx
>
> Maybe you mean 1.0.13?

Sorry, my mistake: 1.1.13, the version integrated in CentOS 5.