I'm still having problems with errors on UDP sockets, but now (using EM
trunk) I am getting segmentation faults.

I managed to get it down to a small test case:

  require 'rubygems'
  require 'eventmachine'

  EventMachine.epoll
  EventMachine.run do
    conn = EventMachine.open_datagram_socket("0.0.0.0", 0)
    conn.send_datagram("a"*300000, "192.168.0.1", 1234)
    EventMachine.popen("sleep 1000")
  end

Obviously sending a 300kB UDP packet is not something I am trying to do
in real life, but the real-life UDP errors only occur about once a week,
so this is my way of making the error reproducible.

I have tried it on 2 Linux systems (CentOS 4.6 with kernel
2.6.9-67.0.4.plus.c4 and ruby 1.9 2008-01-11, and CentOS 5.1 with kernel
2.6.18-53.el5 and ruby 1.8.5), and I get segfaults reliably on both. I
am curious, do others see the same thing with the above test program?

In a debugger, I found the crash was happening on line 439 of em.cpp
(the line ed->Write() in function _RunEpollOnce). The descriptor ed is
being used after being deleted.

After puzzling over this for a long time, the only conclusion I can come
to is that this is a bug in the kernel. It looks like epoll_wait is
returning an event for an already-deleted file descriptor! I'd like to
hear what others think here, before escalating to the kernel mailing
list, though.

Chris

I was bored and thought I'd try it. Yes, I get:

  ruby epolltest.rb
  terminate called after throwing an instance of 'std::runtime_error'
    what(): unable to delete epoll event: Bad file descriptor
  Aborted (core dumped)

On 2/20/08, Chris "ク" Heath <chris at heathens.co.nz> wrote:
> I am curious, do others see the same thing with the above test program?
>
> _______________________________________________
> Eventmachine-talk mailing list
> Eventmachine-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/eventmachine-talk

I've been told that's a known bug that is fixed in trunk. I now run a
gem built from trunk and it seems to work fine.

2008/2/20 William Crawford <wccrawford at gmail.com>:
> I was bored and thought I'd try it. Yes, I get:
>
> ruby epolltest.rb
> terminate called after throwing an instance of 'std::runtime_error'
> what(): unable to delete epoll event: Bad file descriptor
> Aborted (core dumped)

-- 
Cheers,

Kevin Williams
http://bantamtech.com/
http://almostserio.us/
http://kevwil.com/

Thanks for testing.

To clarify: Yes, the "unable to delete epoll event" is what EM version
0.10.0 gives. This was fixed in EM trunk, but now I get a segfault.

William, what Linux kernel are you using? Maybe this was already fixed
in the latest kernel, but my CentOS/Redhat kernels are too old?

Chris

On Wed, 2008-02-20 at 08:54 -0700, Kevin Williams wrote:
> I've been told that's a known bug that is fixed in trunk. I now run a
> gem built from trunk and it seems to work fine.

I'm running Kubuntu 7.10 and the current kernel is apparently
2.6.22-14-generic.

I think there was an update in the last few days that I didn't get yet,
though. I don't know if that will make a difference. (Still
2.6.22-14-generic, so it's probably just a minor change.)

On 2/20/08, Chris "ク" Heath <chris at heathens.co.nz> wrote:
> William, what Linux kernel are you using? Maybe this was already fixed
> in the latest kernel, but my CentOS/Redhat kernels are too old?

On Wed, 2008-02-20 at 11:07 -0500, Chris "ク" Heath wrote:
> Thanks for testing.
>
> To clarify: Yes, the "unable to delete epoll event" is what EM version
> 0.10.0 gives. This was fixed in EM trunk, but now I get a segfault.
>
> William, what Linux kernel are you using? Maybe this was already fixed
> in the latest kernel, but my CentOS/Redhat kernels are too old?

Oops, I meant to direct that question to Kevin, since he's got a kernel
that apparently works. :-)

Chris

The systems I have that work with trunk are a Fedora 8 xen slice running
a 2.6.16.29-xen x86_64 kernel, and OS X 10.5.2. I guess the OS X kernel
using kqueue makes that one kinda pointless, but it works. I kinda
remember seeing this same error on the Mac, but I'm not sure.

On Wed, Feb 20, 2008 at 9:33 AM, Chris "ク" Heath <chris at heathens.co.nz> wrote:
> Oops, I meant to direct that question to Kevin, since he's got a kernel
> that apparently works. :-)

-- 
Cheers,

Kevin Williams

Just installed from trunk on Linux:

$ uname -a
Linux srv213 2.6.18-4-686-bigmem #1 SMP Thu May 10 00:23:00 UTC 2007 i686 GNU/Linux
~$ irb -reventmachine --prompt xmp
EM.set_descriptor_table_size 10_000
    ==>10000
EM.epoll
    ==>nil
module A; def receive_data(a);send_data ">> #{a}";end; end; server Thread.new{ EM::run{ EM::start_server('0.0.0.0',10101,A) } }
    ==>#<Thread:0xb78beef0 run>
1000.times{ EM::connect('127.0.0.1',10101,A); }
    ==>1000
1000.times{ EM::connect('127.0.0.1',10101,A); }
    ==>1000
1000.times{ EM::connect('127.0.0.1',10101,A); }
    ==>1000
1000.times{ EM::connect('127.0.0.1',10101,A); }
terminate called after throwing an instance of 'std::runtime_error'
  what(): not initialized
Aborted

I am realizing now, this is caused by the interaction of the fork() in
EventMachine.popen with the close() in conn.send_datagram. What is
happening is:

 1. EventMachine.open_datagram_socket opens a UDP socket.
 2. conn.send_datagram queues a packet, but does not send it yet.
 3. EventMachine.popen calls fork() so now there are two copies of
    the UDP socket (one in each process).
 4. The parent process calls epoll_wait.
 5. epoll_wait returns immediately saying that the UDP socket is
    writable.
 6. DatagramDescriptor::Write tries to send the packet, gets an error
    (EMSGSIZE), and closes the UDP socket.
 7. _RunEpollOnce deletes the C++ object that wraps the socket
    descriptor.
 8. The parent process calls epoll_wait again.
 9. >>>>>>>>>>>>>> HERE'S THE BUG <<<<<<<<<<<<<<<<<<<<<
    epoll_wait returns immediately saying that the UDP socket is
    writable. Uh-oh! The UDP socket is closed in the parent process;
    it is only writable in the child process!
    >>>>>>>>>>>>>>>>>> END BUG <<<<<<<<<<<<<<<<<<<<<<<<
10. The parent process tries to call ed->Write, but it segfaults
    because ed is deleted.

So my question is: would you consider this to be a kernel bug, or is it
legitimate for the parent process to get epoll events for descriptors
that are still open in a child process?

Even if this is a kernel bug, I think we need to modify EventMachine to
work around the bug. Some possible ways to do this:

(a) *Always* call epoll_ctl(EPOLL_CTL_DEL) before closing a descriptor.
or
(b) *Never* call Close directly -- call ScheduleClose(false) instead.
or
(c) call fcntl(FD_CLOEXEC) on all sockets except for ones that we want
    to be passed through to the child.

I think I prefer (a) or (b), because (c) only solves the problem for
exec(), not fork().

Chris

I'd probably try to avoid forking, personally.

On 20 Feb 2008, at 19:30, Chris "ク" Heath wrote:
> I am realizing now, this is caused by the interaction of the fork() in
> EventMachine.popen with the close() in conn.send_datagram.

On Wed, Feb 20, 2008 at 2:30 PM, Chris "ク" Heath <chris at heathens.co.nz> wrote:
> So my question is: would you consider this to be a kernel bug, or is it
> legitimate for the parent process to get epoll events for descriptors
> that are still open in a child process?
>
> Even if this is a kernel bug, I think we need to modify EventMachine to
> work around the bug. Some possible ways to do this:
>
> (a) *Always* call epoll_ctl(EPOLL_CTL_DEL) before closing a descriptor.
> or
> (b) *Never* call Close directly -- call ScheduleClose(false) instead.
> or
> (c) call fcntl(FD_CLOEXEC) on all sockets except for ones that we want
> to be passed through to the child.
>
> I think I prefer (a) or (b), because (c) only solves the problem for
> exec(), not fork().

It doesn't seem to me that this bug is inherently related to fork (even
though it takes a fork to make it show up). If EM is calling epoll_wait
on a descriptor whose C++ wrapper has been deleted, that's a clear bug.
The problem with solution (a) is that, according to the documentation of
epoll_wait, closing a descriptor automatically removes it from the epoll
object.

So I'm still undecided how best to fix this.

On Thu, 2008-02-21 at 13:16 -0500, Francis Cianfrocca wrote:
> It doesn't seem to me that this bug is inherently related to fork
> (even though it takes a fork to make it show up).

I just tried using dup(ed->GetSocket()) instead of fork() and the same
thing happened. So yes, this bug is more about how epoll handles
duplicated descriptors.

> If EM is calling epoll_wait on a descriptor whose C++ wrapper has been
> deleted, that's a clear bug. The problem with solution (a) is that,
> according to the documentation of epoll_wait, closing a descriptor
> automatically removes it from the epoll object.

Yes, I see that in the documentation, too.

To be honest, I am starting to like the design of kqueue much better
than epoll in that kqueue does not survive a fork. As soon as you add
the requirement that epoll descriptors can survive a fork, it no longer
makes sense to remove a closed descriptor from the epoll object, because
you might close the descriptor in the parent process but then call
epoll_wait in the child process. So I have a feeling that this "bug" in
epoll is actually considered a feature by its developers, and the
documentation needs to be clarified.

> So I'm still undecided how best to fix this.

For now, I have gone with option (b) (changing Close() to
ScheduleClose(false)) in my local copy. It seems to work, but I'll let
you decide if it's the best solution.

Chris

On Thu, 2008-02-21 at 13:16 -0500, Francis Cianfrocca wrote:
> It doesn't seem to me that this bug is inherently related to fork
> (even though it takes a fork to make it show up). If EM is calling
> epoll_wait on a descriptor whose C++ wrapper has been deleted, that's
> a clear bug. The problem with solution (a) is that, according to the
> documentation of epoll_wait, closing a descriptor automatically
> removes it from the epoll object.
>
> So I'm still undecided how best to fix this.

Francis,

A new version of the epoll(7) man page got released today. I think it
clarifies a lot of things, and hopefully will help you decide the best
way to fix this.

http://www.kernel.org/pub/linux/docs/man-pages/man-pages-2.79.tar.bz2

Chris