Adrian Gonzalez
2018-Feb-13 01:48 UTC
dovecot: master: Panic: kevent() failed: Invalid argument
Hello I've been seeing these panics with dovecot 2.2.31 (and older versions) under FreeBSD about once every 2-3 months or so.? I know it's been discussed here in the list, but I didn't see a definitive fix.? If it's been fixed, please ignore this post (sorry!). I finally managed to observe the bug on a server that has the patch that prints out the kevent() timespec values and got: Nov 29 16:46:27 mail dovecot: master: Panic: kevent(events=74, ts=59.1000000000) failed: Invalid argument Looking through the code, the panic is triggered in io_loop_handler_run_internal().? The timeout values for the kevent() call are obtained by calling io_loop_get_wait_time(), which in turn calls timeout_get_wait_time() (both in ioloop.c). The timeout is computed by subtracting the value returned by gettimeofday() from timeout->next_run, and it looks like in very rare cases the result in tv_r->tv_usec is 1000000uS (1 second). So, it seems that if gettimeofday() returns exactly 0 microseconds, and timeout->next_run has exactly 1000000 microseconds, the kevent() call will fail (1 in 1e12 chance?). I'm not entirely sure where timeout->next_run is computed, but looking at timeout_update_next(), it adjusts the seconds only if the microseconds are greater than 1000000:? (ioloop.c, line 235) if (timeout->next_run.tv_usec > 1000000) { ??????????????? timeout->next_run.tv_sec++; ??????????????? timeout->next_run.tv_usec -= 1000000; } It seems like the comparison here should be greater or equal, but I'm not familiar enough with the code to say with absolute certainty. I added more debug error messages to the code in timeout_update_next(), and although I do see next_run->tv_usec at 1000000 several times a day, I haven't seen a panic since early December on any of our servers, so I decided to post this to see if this analysis makes sense.
On 13.02.2018 03:48, Adrian Gonzalez wrote:> > Hello > > I've been seeing these panics with dovecot 2.2.31 (and older versions) > under FreeBSD about once every 2-3 months or so.? I know it's been > discussed here in the list, but I didn't see a definitive fix.? If > it's been fixed, please ignore this post (sorry!). > > I finally managed to observe the bug on a server that has the patch > that prints out the kevent() timespec values and got: > > Nov 29 16:46:27 mail dovecot: master: Panic: kevent(events=74, > ts=59.1000000000) failed: Invalid argument > > Looking through the code, the panic is triggered in > io_loop_handler_run_internal().? The timeout values for the kevent() > call are obtained by calling io_loop_get_wait_time(), which in turn > calls timeout_get_wait_time() (both in ioloop.c). > > The timeout is computed by subtracting the value returned by > gettimeofday() from timeout->next_run, and it looks like in very rare > cases the result in tv_r->tv_usec is 1000000uS (1 second). > > So, it seems that if gettimeofday() returns exactly 0 microseconds, > and timeout->next_run has exactly 1000000 microseconds, the kevent() > call will fail (1 in 1e12 chance?). > > I'm not entirely sure where timeout->next_run is computed, but looking > at timeout_update_next(), it adjusts the seconds only if the > microseconds are greater than 1000000:? (ioloop.c, line 235) > > if (timeout->next_run.tv_usec > 1000000) { > ??????????????? timeout->next_run.tv_sec++; > ??????????????? timeout->next_run.tv_usec -= 1000000; > } > > It seems like the comparison here should be greater or equal, but I'm > not familiar enough with the code to say with absolute certainty. > > I added more debug error messages to the code in > timeout_update_next(), and although I do see next_run->tv_usec at > 1000000 several times a day, I haven't seen a panic since early > December on any of our servers, so I decided to post this to see if > this analysis makes sense. > > >Hi! I noticed that further down we actually use >= instead of >, so I'm guessing it was just missed here. Thanks. Aki