I occasionally get a fatal error unable to delete epoll event: Bad file descriptor I think the attached patch will fix it. The patch does two things: 1) changes the error we look for from ENOENT to EBADF. It is hard to reproduce this error, but if I change the epoll_ctl line to epoll_ctl(epfd, EPOLL_CTL_DEL, 666, ed->GetEpollEvent()); it does return -1 with errno = EBADF on my Linux 2.6.9 system. 2) it tests the variable errno, not e. This obviously was a typo, so even if some systems do return ENOENT for this condition, they still would have crashed. Now if only I knew why the socket was closing in the first place... Chris -------------- next part -------------- A non-text attachment was scrubbed... Name: empatch Type: text/x-patch Size: 725 bytes Desc: not available Url : http://rubyforge.org/pipermail/eventmachine-talk/attachments/20080104/391ce5eb/attachment.bin
On Jan 4, 2008 1:55 PM, Chris "$B%/(B" Heath <chris at heathens.co.nz> wrote:> I occasionally get a fatal error > > unable to delete epoll event: Bad file descriptor > > I think the attached patch will fix it. > > The patch does two things: > 1) changes the error we look for from ENOENT to EBADF. It is hard to > reproduce this error, but if I change the epoll_ctl line to > epoll_ctl(epfd, EPOLL_CTL_DEL, 666, ed->GetEpollEvent()); > it does return -1 with errno = EBADF on my Linux 2.6.9 system. > > 2) it tests the variable errno, not e. This obviously was a typo, so > even if some systems do return ENOENT for this condition, they still > would have crashed. > > Now if only I knew why the socket was closing in the first place... >Most definitely a bug. You get a cigar! I didn''t apply your patch as given, however, because by my reading of the manpage, ENOENT is the right error to be checking for here. Please sync to the head revision and see if this behaves any better for you. NOTE CAREFULLY: The head revision is now under trunk (where it should have been in the first place), not under version_0. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/eventmachine-talk/attachments/20080104/09aae19c/attachment-0001.html
On Fri, 2008-01-04 at 20:58 -0500, Francis Cianfrocca wrote:> I didn''t apply your patch as given, however, because by my reading of > the manpage, ENOENT is the right error to be checking for here.Hmm... I guess it depends on which version of the manpages you have. Since version 2.37, the manpage for epoll_ctl(2) says: ERRORS EBADF epfd or fd is not a valid file descriptor. ... ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not in epfd. It doesn''t say explicitly, but I assume that ENOENT is only used for VALID file descriptors that are not in epfd. Chris
On Jan 4, 2008 11:55 PM, Chris "$B%/(B" Heath <chris at heathens.co.nz> wrote:> On Fri, 2008-01-04 at 20:58 -0500, Francis Cianfrocca wrote: > > > I didn''t apply your patch as given, however, because by my reading of > > the manpage, ENOENT is the right error to be checking for here. > > Hmm... I guess it depends on which version of the manpages you have. > Since version 2.37, the manpage for epoll_ctl(2) says: > > ERRORS > EBADF epfd or fd is not a valid file descriptor. > ... > ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not in epfd. > > It doesn''t say explicitly, but I assume that ENOENT is only used for > VALID file descriptors that are not in epfd. >I assume the same thing, and if you look closely at the code, you''ll see that the call to epoll_ctl to remove the descriptor will *always* happen before the descriptor is actually closed. The implementation goes to great lengths not to call close on descriptors that have been scheduled for closing until all the un-initializations are done. So if a descriptor is bad at the time of the epoll_ctl call, it would be a bug in EM, and I''d want that to get signalled. Does the code as it stands now work any better in your app? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/eventmachine-talk/attachments/20080105/ecfd1b5a/attachment.html
On Sat, 2008-01-05 at 00:11 -0500, Francis Cianfrocca wrote:> So if a descriptor is bad at the time of the epoll_ctl call, it would > be a bug in EM, and I''d want that to get signalled.Fair enough.> Does the code as it stands now work any better in your app?I don''t know yet. This happens only very occasionally. I think I have some idea of the code path that is triggering this, but no time to try it out right now. My theory is: 1. open a UDP descriptor 2. call send_datagram. 3. sendto fails (for some as yet unknown reason) and the error path calls Close (not ScheduleClose) 4. Close sets the fd to INVALID_SOCKET 5. in the next call to RunOnce, ShouldDelete returns true because fd =INVALID_SOCKET 6. epoll_ctl gets called with fd == INVALID_SOCKET. Chris
On Jan 5, 2008 2:15 AM, Chris "$B%/(B" Heath <chris at heathens.co.nz> wrote:> On Sat, 2008-01-05 at 00:11 -0500, Francis Cianfrocca wrote: > > > So if a descriptor is bad at the time of the epoll_ctl call, it would > > be a bug in EM, and I''d want that to get signalled. > > Fair enough. > > > Does the code as it stands now work any better in your app? > > I don''t know yet. This happens only very occasionally. > > I think I have some idea of the code path that is triggering this, but > no time to try it out right now. My theory is: > > 1. open a UDP descriptor > 2. call send_datagram. > 3. sendto fails (for some as yet unknown reason) and the error path > calls Close (not ScheduleClose) > 4. Close sets the fd to INVALID_SOCKET > 5. in the next call to RunOnce, ShouldDelete returns true because fd => INVALID_SOCKET > 6. epoll_ctl gets called with fd == INVALID_SOCKET. > > Chris > > _______________________________________________ > Eventmachine-talk mailing list > Eventmachine-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/eventmachine-talk >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/eventmachine-talk/attachments/20080105/d552e045/attachment.html
On Jan 5, 2008 2:15 AM, Chris "$B%/(B" Heath <chris at heathens.co.nz> wrote:> On Sat, 2008-01-05 at 00:11 -0500, Francis Cianfrocca wrote: > > > So if a descriptor is bad at the time of the epoll_ctl call, it would > > be a bug in EM, and I''d want that to get signalled. > > Fair enough. > > > Does the code as it stands now work any better in your app? > > I don''t know yet. This happens only very occasionally. > > I think I have some idea of the code path that is triggering this, but > no time to try it out right now. My theory is: > > 1. open a UDP descriptor > 2. call send_datagram. > 3. sendto fails (for some as yet unknown reason) and the error path > calls Close (not ScheduleClose) > 4. Close sets the fd to INVALID_SOCKET > 5. in the next call to RunOnce, ShouldDelete returns true because fd => INVALID_SOCKET > 6. epoll_ctl gets called with fd == INVALID_SOCKET. > > >I think you''re right. I changed the code again. Sync to the HEAD revision and see if you agree. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/eventmachine-talk/attachments/20080105/a1fe7b6f/attachment.html
On Sat, 2008-01-05 at 06:42 -0500, Francis Cianfrocca wrote:> I think you''re right. I changed the code again. Sync to the HEAD > revision and see if you agree.Ah, I see what you''ve done. Yes, this looks better, assuming my theorized codepath was what actually caused this. I''ll let you know if my testing proves otherwise. I still have a minor quibble about the comment: // ENOENT is not an error because the socket may be already closed when we get here. How can a socket be closed yet still valid (since ENOENT only applies to valid sockets)? Honestly, I can''t think of any scenario that would legitimately trigger ENOENT. Every EPOLL_CTL_ADD is balanced with a EPOLL_CTL_DEL, so to my thinking, ENOENT would indicate a bug. Due to the e/errno typo, no previous version has been checking this error anyway, so I''d say it is well-tested without the ENOENT check. Chris
On Jan 5, 2008 2:10 PM, Chris "$B%/(B" Heath <chris at heathens.co.nz> wrote:> On Sat, 2008-01-05 at 06:42 -0500, Francis Cianfrocca wrote: > > > I still have a minor quibble about the comment: > > // ENOENT is not an error because the socket may be already closed when > we get here. > > How can a socket be closed yet still valid (since ENOENT only applies to > valid sockets)? >Read a little farther up. It could happen if the socket is closed (somehow) but the variable was not yet converted to -1. Could also maybe happen if you close a socket and then open another one, which would probably pick up the descriptor number of the closed one. I lose a lot of sleep over little edge conditions like that. :-) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/eventmachine-talk/attachments/20080105/4b02e634/attachment-0001.html
On Sat, 2008-01-05 at 15:20 -0500, Francis Cianfrocca wrote:> Read a little farther up. It could happen if the socket is closed > (somehow) but the variable was not yet converted to -1.In this case, epoll_ctl will return EBADF because a closed socket is not a valid file descriptor.> Could also maybe happen if you close a socket and then open another > one, which would probably pick up the descriptor number of the closed > one.Yes, now this case should indeed return ENOENT, though I admit I haven''t tested it.> I lose a lot of sleep over little edge conditions like that. :-)OK, I leave up to you the decision of whether to handle externally closed sockets. I just wanted to point out that you are only handling one of the two possible error codes that this condition can produce. Chris
On Jan 5, 2008 9:02 PM, Chris "$B%/(B" Heath <chris at heathens.co.nz> wrote:> On Sat, 2008-01-05 at 15:20 -0500, Francis Cianfrocca wrote: > > > Read a little farther up. It could happen if the socket is closed > > (somehow) but the variable was not yet converted to -1. > > In this case, epoll_ctl will return EBADF because a closed socket is not > a valid file descriptor. >Are you sure about that? My reading of the doc suggests you''d get ENOENT. If you''re right then the code needs to be changed. It''s easy enough to just add the check for EBADF but it just makes me feel dirty to do something like that without really understanding it 100%. Hope you can understand that! :-) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/eventmachine-talk/attachments/20080106/455a1ab4/attachment.html
On Sun, 2008-01-06 at 00:02 -0500, Francis Cianfrocca wrote:> On Jan 5, 2008 9:02 PM, Chris "?" Heath <chris at heathens.co.nz> wrote: > In this case, epoll_ctl will return EBADF because a closed > socket is not > a valid file descriptor. > > > Are you sure about that? My reading of the doc suggests you''d get > ENOENT. If you''re right then the code needs to be changed.I wasn''t sure, so I wrote a test program. Now I am sure. Test program is attached. Output is below. epdf = 3 Attempt to delete -1 yielded -1/9/Bad file descriptor Attempt to delete real fd (4) yielded -1/2/No such file or directory Attempt to delete closed fd (4) yielded -1/9/Bad file descriptor Attempt to add real fd (4) yielded 0/0/Success Attempt to delete closed fd (4) yielded -1/9/Bad file descriptor Attempt to add real fd (4) yielded 0/0/Success Attempt to delete closed then opened fd (4) yielded -1/2/No such file or directory As you can see, every time I try to delete a file descriptor that is guaranteed to always be invalid (-1) OR one that is currently invalid (closed), I get errno == 9 (EBADF). Chris -------------- next part -------------- A non-text attachment was scrubbed... Name: test_epoll_errors.c Type: text/x-csrc Size: 1462 bytes Desc: not available Url : http://rubyforge.org/pipermail/eventmachine-talk/attachments/20080106/b3010d09/attachment.bin
On Jan 6, 2008 5:06 AM, Chris "$B%/(B" Heath <chris at heathens.co.nz> wrote:> On Sun, 2008-01-06 at 00:02 -0500, Francis Cianfrocca wrote: > > On Jan 5, 2008 9:02 PM, Chris "$B%/(B" Heath <chris at heathens.co.nz> wrote: > > In this case, epoll_ctl will return EBADF because a closed > > socket is not > > a valid file descriptor. > > > > > > Are you sure about that? My reading of the doc suggests you''d get > > ENOENT. If you''re right then the code needs to be changed. > > I wasn''t sure, so I wrote a test program. Now I am sure. > > Test program is attached. Output is below. > > epdf = 3 > Attempt to delete -1 yielded -1/9/Bad file descriptor > Attempt to delete real fd (4) yielded -1/2/No such file or directory > Attempt to delete closed fd (4) yielded -1/9/Bad file descriptor > Attempt to add real fd (4) yielded 0/0/Success > Attempt to delete closed fd (4) yielded -1/9/Bad file descriptor > Attempt to add real fd (4) yielded 0/0/Success > Attempt to delete closed then opened fd (4) yielded -1/2/No such file or > directory > > As you can see, every time I try to delete a file descriptor that is > guaranteed to always be invalid (-1) OR one that is currently invalid > (closed), I get errno == 9 (EBADF). > >I''ll buy it. Thanks for doing all that work. HEAD revision in trunk has been changed to reflect this. Let me know if I did it right. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/eventmachine-talk/attachments/20080106/0c13d870/attachment.html
On Sun, 2008-01-06 at 12:13 -0500, Francis Cianfrocca wrote:> I''ll buy it. Thanks for doing all that work. HEAD revision in trunk > has been changed to reflect this. Let me know if I did it right.Looks good. :-)