Simone Lazzaris
2018-Sep-07 08:25 UTC
Auth process sometimes stops responding after upgrade
On Friday, 7 September 2018 at 10:06:00 CEST, Sami Ketola wrote:

> > On 7 Sep 2018, at 11.00, Simone Lazzaris <s.lazzaris at interactive.eu> wrote:
> >
> > The only suspect thing is this:
> >
> > Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.142 vhost count changed from 100 to 0
> > Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.143 vhost count changed from 100 to 0
> > Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.219 vhost count changed from 100 to 0
> >
> > Nothing in the other system logs (e.g. kernel, daemon, syslog, messages ...).
>
> Any idea what is changing the vhost count on the backends? Do you have some script running that might change the vhost count on some triggering event?
>
> Sami

Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is that the backends are NOT unresponsive, they are working as usual.

*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazzaris at qcom.it | www.qcom.it
> On 7 Sep 2018, at 11.25, Simone Lazzaris <simone.lazzaris at qcom.it> wrote:
>
> Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is that the backends are NOT unresponsive, they are working as usual.

If it's this one, https://github.com/brandond/poolmon/blob/master/poolmon, you are running an old version of it, as the latest version is more compatible with recent Dovecot releases. The current version in git correctly uses HOST-DOWN and HOST-FLUSH instead of modifying the vhost count.

But even then I still consider poolmon a bit too aggressive in marking hosts down: it sends HOST-DOWN and HOST-FLUSH already after the first scan failure. Maybe there should be some grace period, waiting for a few failed polls within a certain time window before acting, in case of a temporary overload or networking failure.

Sami
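For illustration only, a minimal Python sketch of the grace-period idea described above might look like the following. This is not poolmon itself; the backend addresses, port, thresholds and the mark_host_down()/mark_host_up() helpers are hypothetical placeholders for whatever action the real monitor performs (e.g. sending HOST-DOWN/HOST-FLUSH to the director).

#!/usr/bin/env python3
# Minimal sketch of a grace-period health check (NOT the real poolmon).
# A backend is only reported down after FAIL_THRESHOLD consecutive failed
# polls; a single failed poll (temporary overload, network blip) is ignored.
import socket
import time

BACKENDS = ["192.168.1.142", "192.168.1.143", "192.168.1.219"]  # example IPs from the log above
PORT = 110            # service port to probe (POP3 here); assumption
POLL_INTERVAL = 10    # seconds between scans; assumption
FAIL_THRESHOLD = 3    # consecutive failures before acting; assumption

def probe(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def mark_host_down(host):
    # Placeholder: the real monitor would tell the director to stop using
    # this backend (e.g. HOST-DOWN / HOST-FLUSH on the director admin socket).
    print(f"would mark {host} down")

def mark_host_up(host):
    # Placeholder for the corresponding "bring it back" action.
    print(f"would mark {host} up")

failures = {host: 0 for host in BACKENDS}

while True:
    for host in BACKENDS:
        if probe(host, PORT):
            if failures[host] >= FAIL_THRESHOLD:
                mark_host_up(host)
            failures[host] = 0
        else:
            failures[host] += 1
            if failures[host] == FAIL_THRESHOLD:
                mark_host_down(host)
    time.sleep(POLL_INTERVAL)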
Simone Lazzaris
2018-Sep-07 09:59 UTC
Auth process sometimes stops responding after upgrade
On Friday, 7 September 2018 at 11:20:49 CEST, Sami Ketola wrote:

> > On 7 Sep 2018, at 11.25, Simone Lazzaris <simone.lazzaris at qcom.it> wrote:
> >
> > Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is that the backends are NOT unresponsive, they are working as usual.
>
> If it's this one https://github.com/brandond/poolmon/blob/master/poolmon you are running an old version of it, as the latest version is more compatible with recent Dovecot releases.
>
> The current version in git correctly uses HOST-DOWN and HOST-FLUSH instead of modifying the vhost count.

Interesting, I'll surely upgrade the script in the coming days. Thanks for the hint.

But the script is surely not the ultimate cause of the failures: the backends (and the script itself) have been untouched, and working, for many moons. The only thing that has changed is dovecot on the frontends. And even if some backends (3 out of 8, in this very case) were marked as failed, authentication on the frontends should still work, shouldn't it?

I've tried to strace the auth process during the last failure, and this is what I got:

Process 2539 attached - interrupt to quit
gettimeofday({1536308480, 998803}, NULL) = 0
epoll_wait(15,

After about 60 seconds I aborted the strace and restarted dovecot to avoid upsetting customers. Searching for file descriptor #15 in /proc/nnnn/fd, I found "anon_inode:[eventpoll]".

*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazzaris at qcom.it | www.qcom.it
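As an aside, the /proc lookup mentioned above can be scripted. A minimal sketch, assuming a Linux host; the PID and descriptor number shown in the usage comment are just the example values from the trace above:

#!/usr/bin/env python3
# Resolve what a process file descriptor points to via /proc (Linux only).
import os
import sys

def describe_fd(pid, fd):
    """Return the target of /proc/<pid>/fd/<fd>, e.g. a path, a socket, or an anon_inode."""
    return os.readlink(f"/proc/{pid}/fd/{fd}")

if __name__ == "__main__":
    pid, fd = int(sys.argv[1]), int(sys.argv[2])   # e.g. "2539 15" from the trace above
    print(describe_fd(pid, fd))                    # prints e.g. "anon_inode:[eventpoll]"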
Simone Lazzaris
2018-Sep-07 13:50 UTC
Auth process sometimes stops responding after upgrade
Some more information: the issue has just occurred again, on an instance without the "service_count = 0" configuration directive on pop3-login. I've observed that while the issue is occurring, the director process goes to 100% CPU. I've straced the process. It is seemingly looping:

...
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
[the same EPOLL_CTL_ADD / EPOLL_CTL_DEL pair on fd 78 repeats continuously]
...

FD 13 is "anon_inode:[eventpoll]".

Killing (with -9) the director process, without stopping or restarting the service, restores the correct behaviour.

*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazzaris at qcom.it | www.qcom.it
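For readers unfamiliar with this strace signature, the minimal sketch below (illustrative only, not Dovecot's actual director code) reproduces the same pattern: an event loop that keeps registering and immediately unregistering the same descriptor without ever blocking in epoll_wait, which shows up as 100% CPU and an endless EPOLL_CTL_ADD/EPOLL_CTL_DEL stream.

#!/usr/bin/env python3
# Illustration only: an event loop degenerating into an epoll_ctl busy loop.
# Run under "strace -e epoll_ctl,epoll_wait" to see the same ADD/DEL pattern
# as in the trace above, with the process pinned at 100% CPU.
import os
import select

ep = select.epoll()                      # epoll instance, comparable to fd "13" in the trace
r, w = os.pipe()                         # r plays the role of fd "78"

for _ in range(100000):                  # the real bug would loop forever
    ep.register(r, select.EPOLLIN | select.EPOLLPRI | select.EPOLLERR | select.EPOLLHUP)
    # A healthy loop would now block in epoll_wait() until the fd becomes ready.
    # A buggy one decides it has nothing to wait for, drops the fd again
    # and immediately re-adds it on the next iteration:
    ep.unregister(r)

ep.close()
os.close(r)
os.close(w)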