On the whole we are pleased with our trials of dovecot to replace UW-IMAP.

But (ah!) we have hit one particular problem, in which we think dovecot could probably benefit from a resilience improvement.

We're running dovecot on Fedora Core 5 (FC5), with passwd map details supplied by NIS. We have found that "nscd" sometimes thinks that a username is invalid, even though it is valid. So when "deliver" attempts a delivery to the INBOX of that username, it receives "user unknown" from the name service, and then does a 5xx permanent failure of valid email. From the user perspective "The System" has incorrectly rejected perfectly valid incoming email. It is rare, but it does occasionally happen on large, busy systems.

Clearly it is fundamentally an "nscd" bug. But that bug is nevertheless out there, in the wild, on such systems, potentially affecting dovecot's delivery of valid user email.

We have had a source code hack since October (in "deliver.c", simply replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has worked nicely (ported forward from rc8 towards rc22). Mail re-queues and a later delivery attempt then succeeds.

So it would be both helpful, and good for resilience against such real OS/nscd bugs (and similar), if there were a configuration option in dovecot to allow a site admin to tell deliver to use a temporary, 4xx, failure instead (if the circumstances were appropriate for the site).

Could this be considered please, Timo?

-- 
:  David Lee                                I.T. Service          :
:  Senior Systems Programmer                Computer Centre       :
:  UNIX Team Leader                         Durham University     :
:                                           South Road            :
:  http://www.dur.ac.uk/t.d.lee/            Durham DH1 3LE        :
:  Phone: +44 191 334 2752                  U.K.                  :
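The requested behaviour can be pictured with a minimal C sketch. It is not the actual deliver.c change: the struct, its field, and the function name are hypothetical, and the assumption that the unpatched code ends up returning EX_NOUSER is a guess at what "return ret" carried. It only shows how a site-level switch could pick between a permanent and a temporary exit status from <sysexits.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sysexits.h>

/* Hypothetical site policy; this is NOT an existing dovecot setting. */
struct deliver_policy {
    /* When true, report "user unknown" as a temporary failure. */
    bool unknown_user_is_tempfail;
};

/*
 * Choose the exit status handed back to the MTA after the user lookup.
 * EX_NOUSER is normally treated by the MTA as a permanent (5xx) failure,
 * EX_TEMPFAIL as a temporary (4xx) one, so the mail is re-queued.
 */
int lookup_exit_status(bool user_found, const struct deliver_policy *policy)
{
    assert(policy != NULL);

    if (user_found)
        return EX_OK;
    return policy->unknown_user_is_tempfail ? EX_TEMPFAIL : EX_NOUSER;
}
```

With the flag off the behaviour is unchanged. With it on, every "user unknown" result is deferred, including lookups for accounts that really do not exist, and those messages then sit in the MTA queue until it gives up.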
On Fri, 2007-02-09 at 10:38 +0000, David Lee wrote:
> On the whole we are pleased with our trials of dovecot to replace UW-IMAP.
>
> But (ah!) we have hit one particular problem, in which we think dovecot
> could probably benefit from a resilience improvement.
>
> We're running dovecot on Fedora Core 5 (FC5), with passwd map details
> supplied by NIS. We have found that "nscd" sometimes thinks that a
> username is invalid, even though it is valid. So when "deliver" attempts
> a delivery to the INBOX of that username, it receives "user unknown" from
> the name service, and then does a 5xx permanent failure of valid email.
> From the user perspective "The System" has incorrectly rejected perfectly
> valid incoming email. It is rare, but it does occasionally happen on
> large, busy systems.
>
> Clearly it is fundamentally an "nscd" bug. But that bug is nevertheless
> out there, in the wild, on such systems, potentially affecting dovecot's
> delivery of valid user email.
>
> We have had a source code hack since October (in "deliver.c", simply
> replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has
> worked nicely (ported forward from rc8 towards rc22). Mail re-queues and
> a later delivery attempt then succeeds.
>
> So it would be both helpful, and good for resilience against such real
> OS/nscd bugs (and similar), if there were a configuration option in
> dovecot to allow a site admin to tell deliver to use a temporary, 4xx,
> failure instead (if the circumstances were appropriate for the site).

Having been hit by numerous problems with nscd as well, with many applications, I'll just throw this in:

- nscd is to be avoided whenever possible
- (if) nscd is broken, complain to the vendor, or better
- fix bugs at the right place

A few excerpts from a discussion about nscd on the postfix ML some time ago, about exactly the same problem (postfix not finding recipients due to nscd delivering bad information):

"nscd is a crappy piece of software that is unstable and frequently corrupts information."

"Most of the work is identifying the right problem. Much effort goes to waste solving the wrong one."

Just my 2¢ ...

-- 
Udo Rader
bestsolution.at EDV Systemhaus GmbH
http://www.bestsolution.at
David Lee wrote:
> We're running dovecot on Fedora Core 5 (FC5), with passwd map details
> supplied by NIS. We have found that "nscd" sometimes thinks that a
> username is invalid, even though it is valid. So when "deliver" attempts
> a delivery to the INBOX of that username, it receives "user unknown" from
> the name service, and then does a 5xx permanent failure of valid email.
> From the user perspective "The System" has incorrectly rejected perfectly
> valid incoming email. It is rare, but it does occasionally happen on
> large, busy systems.

We don't use "deliver" (we just use Exim), but we build a static passwd-file userdb from NIS overnight and use PAM for authentication (via pam_ldap to Active Directory, but it works with pam_unix too).

We did this for a performance boost, as Dovecot then caches the userdb rather than having to wait for a NIS lookup each time, but I'd expect it to iron out problems with deliver/nscd as well.

While passwords could change at any time, userdb information generally doesn't change that often, and it only takes a few seconds to rebuild manually if a new user has to be added quickly.

Chris

-- 
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
Christopher Wakelin,                           c.d.wakelin at reading.ac.uk
IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439
Whiteknights, Reading, RG6 2AF, UK             Fax: +44 (0)118 975 3094
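A rebuild of this kind is only as good as the map it reads, so a sanity check before installing the new file seems prudent. The following C sketch is one possible guard, not anything taken from Chris's setup: both function names and the shrink threshold are made up for illustration. It counts well-formed passwd entries (seven fields, hence six colons) and refuses a new map that is empty or has shrunk by more than a chosen percentage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count lines that look like complete passwd entries (exactly 6 colons). */
size_t count_passwd_entries(const char *map)
{
    size_t entries = 0, colons = 0, len = 0;

    assert(map != NULL);
    for (const char *p = map; ; p++) {
        if (*p == '\n' || *p == '\0') {
            /* End of a line: count it only if it is non-empty and complete. */
            if (len > 0 && colons == 6)
                entries++;
            colons = 0;
            len = 0;
            if (*p == '\0')
                break;
        } else {
            if (*p == ':')
                colons++;
            len++;
        }
    }
    return entries;
}

/*
 * Hypothetical acceptance rule: reject an empty map, accept growth, and
 * accept shrinkage only up to max_shrink_percent of the old entry count.
 */
bool map_looks_complete(size_t old_entries, size_t new_entries,
                        unsigned max_shrink_percent)
{
    if (new_entries == 0)
        return false;
    if (new_entries >= old_entries)
        return true;
    return (old_entries - new_entries) * 100
           <= old_entries * (size_t)max_shrink_percent;
}
```

A nightly job could count the entries in the current file and in the freshly fetched map, and keep the old file whenever map_looks_complete() returns false. The right threshold depends on how quickly accounts come and go at a given site.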
Quoting David Lee:
> We have had a source code hack since October (in "deliver.c", simply
> replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has

I cannot speak for Timo, but if this means mail for non-existing users will stay in the MTA's queue until it times out, it's definitely wrong.

Apart from that, nscd is one of the first things I remove from freshly installed systems, because it's plain crap and has caused only problems in the past. Even so, I can hardly believe they do not get even such things right.
David Lee wrote:
> On the whole we are pleased with our trials of dovecot to replace UW-IMAP.
>
> But (ah!) we have hit one particular problem, in which we think dovecot
> could probably benefit from a resilience improvement.
>
> We're running dovecot on Fedora Core 5 (FC5), with passwd map details
> supplied by NIS. We have found that "nscd" sometimes thinks that a
> username is invalid, even though it is valid. So when "deliver" attempts
> a delivery to the INBOX of that username, it receives "user unknown" from
> the name service, and then does a 5xx permanent failure of valid email.
> From the user perspective "The System" has incorrectly rejected perfectly
> valid incoming email. It is rare, but it does occasionally happen on
> large, busy systems.
>
> Clearly it is fundamentally an "nscd" bug. But that bug is nevertheless
> out there, in the wild, on such systems, potentially affecting dovecot's
> delivery of valid user email.
>
> We have had a source code hack since October (in "deliver.c", simply
> replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has
> worked nicely (ported forward from rc8 towards rc22). Mail re-queues and
> a later delivery attempt then succeeds.
>
> So it would be both helpful, and good for resilience against such real
> OS/nscd bugs (and similar), if there were a configuration option in
> dovecot to allow a site admin to tell deliver to use a temporary, 4xx,
> failure instead (if the circumstances were appropriate for the site).
>
> Could this be considered please, Timo?

I wrote the nscd that's used on Solaris back in 1995. If the Fedora release's nscd is just bungling the lookup, no work-around is possible and you need to disable at least the passwd cache in the nscd, if that's possible. On the other hand, are you sure this isn't an intermittent NIS server issue?

The problem of what a program should do if the name service isn't actually responding, on the other hand, is tricky, whether that program is the nscd or postfix or dovecot. The right answer depends on the consequences of failure and what info you can get back from the name service.

Obviously, if getpwnam_r() could be convinced to return EAGAIN when one of the name services was not responding, this would be a GOOD THING, since this would map directly to a TEMPFAIL. However, there are other system services that fail miserably when the user's account info isn't available, so for those, hanging until the NIS server recovers is a better choice.

[The hard thing about distributed systems is always failure semantics.]

Absent tunable nscd failure semantics, I suggest that the following may be useful alternatives for intermittent NIS server problems:

1) construct a redundant NIS architecture with additional slave NIS servers that fail over... this is what we use internally at Sun w/ varying degrees of success.

2) ypcat the passwd map periodically and map it into a local passwd file. Some script smarts are required to avoid hideous problems if you get a truncated passwd map... this is quite robust if done correctly.

I'm one of the odd folks who has their mail delivered to their desktop; I keep a copy of my passwd entry in the local machine so I don't lose mail if the NIS server craps out again.

- Bart
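The getpwnam_r() point can be made concrete with a short C sketch. It assumes the POSIX contract, in which a return of 0 with a NULL result means "no such user" and a nonzero return is an error number. The function names are hypothetical and this is not how deliver is written. Some libcs are documented to report "not found" through error codes such as ENOENT as well, and others may report a dead name service as a plain "not found", so the classification is only as trustworthy as the libc underneath it.

```c
#include <assert.h>
#include <errno.h>
#include <pwd.h>
#include <stddef.h>
#include <sys/types.h>
#include <sysexits.h>

/*
 * Map a getpwnam_r() outcome to a sysexits status.
 *   err == 0, result set   -> the user exists
 *   err == 0, result NULL  -> the name service answered "no such user"
 *   err != 0 (e.g. EAGAIN) -> the lookup itself failed, so retry later
 */
int classify_lookup(int err, const struct passwd *result)
{
    if (err == 0)
        return result != NULL ? EX_OK : EX_NOUSER;
    return EX_TEMPFAIL;
}

/* Look up a user and return EX_OK, EX_NOUSER or EX_TEMPFAIL. */
int lookup_user_status(const char *name)
{
    struct passwd pw;
    struct passwd *result = NULL;
    char buf[16384]; /* scratch space for the strings in pw */
    int err;

    assert(name != NULL);
    err = getpwnam_r(name, &pw, buf, sizeof(buf), &result);
    return classify_lookup(err, result);
}
```

Treating every nonzero return as temporary is deliberately conservative: on a libc that signals "not found" with an error code, unknown users would be deferred rather than bounced.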
David Lee wrote:
> On the whole we are pleased with our trials of dovecot to replace UW-IMAP.
>
> But (ah!) we have hit one particular problem, in which we think dovecot
> could probably benefit from a resilience improvement.

Careful there!

> We're running dovecot on Fedora Core 5 (FC5), with passwd map details
> supplied by NIS. We have found that "nscd" sometimes thinks that a
> username is invalid, even though it is valid. So when "deliver" attempts
> a delivery to the INBOX of that username, it receives "user unknown" from
> the name service, and then does a 5xx permanent failure of valid email.
> From the user perspective "The System" has incorrectly rejected perfectly
> valid incoming email. It is rare, but it does occasionally happen on
> large, busy systems.

There are several problems with this approach, mostly the plain blindness of many libc maintainers to the issue, regardless of whether the system has nsswitch or not.

I filed NIS lookup bugs against GNU libc (not implementing TRYAGAIN=forever in nsswitch) and FreeBSD (timeout after a few minutes) literally years ago, without any tangible outcome. The GNU libc maintainer rejected the bug report as a whole, and it has fallen on deaf ears with FreeBSD.

The other important concern for portable software such as dovecot is portability. On some systems, a temporary failure of getpwnam() is indistinguishable from a permanent failure, so the only solution to this problem is Postfix's: implement a NIS lookup client to access the password database, to circumvent the many libc bugs lurking there.

> Clearly it is fundamentally an "nscd" bug. But that bug is nevertheless
> out there, in the wild, on such systems, potentially affecting dovecot's
> delivery of valid user email.

You don't need nscd for unstable NIS, as laid out above :-(

> We have had a source code hack since October (in "deliver.c", simply
> replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has
> worked nicely (ported forward from rc8 towards rc22). Mail re-queues and
> a later delivery attempt then succeeds.

And lingers around in the queue for a week if an account has been terminated? Doesn't look like a 'solution' to me.

Best regards
Matthias Andree