thr3ads.net - Eventmachine talk - [Eventmachine-talk] Segfaults using epoll [Feb 2008]

Chris "ク" Heath

2008-Feb-20 07:13 UTC

[Eventmachine-talk] Segfaults using epoll

I'm still having problems with errors on UDP sockets, but now (using EM
trunk) I am getting segmentation faults.

I managed to get it down to a small test case:

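require 'rubygems'
require 'eventmachine'

EventMachine.epoll  # select the epoll backend before the reactor starts
EventMachine.run do
  # Bind a UDP socket to an ephemeral port on all interfaces.
  conn = EventMachine.open_datagram_socket("0.0.0.0", 0)
  # Queue a deliberately oversized (300 kB) datagram so that the send fails.
  conn.send_datagram("a"*300000, "192.168.0.1", 1234)
  # Spawn a long-running child process while the socket is still open.
  EventMachine.popen("sleep 1000")
end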

Obviously sending a 300kB UDP packet is not something I am trying to do
in real life, but the real-life UDP errors only occur about once a week,
so this is my way of making the error reproducible.
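
An editorial aside on why this reproduces so reliably: a UDP datagram over IPv4 can carry at most 65,507 bytes of payload, so a 300000-byte send is rejected by the kernel every time. The sketch below is plain POSIX socket code in C++ (the language of em.cpp), not EventMachine code. The helper name try_send_udp is made up for illustration, and it targets 127.0.0.1 instead of 192.168.0.1 on the assumption that the size check does not depend on the destination.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical helper: send one UDP datagram of `size` bytes to
// 127.0.0.1:`port`. Returns 0 on success, otherwise the errno from sendto().
int try_send_udp(std::size_t size, unsigned short port = 1234) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    assert(fd >= 0);

    sockaddr_in dest{};
    dest.sin_family = AF_INET;
    dest.sin_port = htons(port);
    inet_pton(AF_INET, "127.0.0.1", &dest.sin_addr);

    std::string payload(size, 'a');
    ssize_t n = sendto(fd, payload.data(), payload.size(), 0,
                       reinterpret_cast<const sockaddr*>(&dest), sizeof dest);
    int err = (n < 0) ? errno : 0;  // capture errno before close() runs
    close(fd);
    return err;
}
```

On Linux, try_send_udp(300000) should return EMSGSIZE, the same error that is identified later in this thread.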

I have tried it on 2 Linux systems (CentOS 4.6 with kernel
2.6.9-67.0.4.plus.c4 and ruby 1.9 2008-01-11, and CentOS 5.1 with kernel
2.6.18-53.el5 and ruby 1.8.5), and I get segfaults reliably on both.  I
am curious, do others see the same thing with the above test program?

In a debugger, I found the crash was happening on line 439 of em.cpp
(the line ed->Write() in function _RunEpollOnce). The descriptor ed is
being used after being deleted.

After puzzling over this for a long time, the only conclusion I can come
to is that this is a bug in the kernel.  It looks like epoll_wait is
returning an event for an already-deleted file descriptor!  I'd like to
hear what others think here, before escalating to the kernel mailing
list, though.

Chris

William Crawford

2008-Feb-20 07:24 UTC

I was bored and thought I'd try it.  Yes, I get:

ruby epolltest.rb
terminate called after throwing an instance of 'std::runtime_error'
  what(): unable to delete epoll event: Bad file descriptor
Aborted (core dumped)

Kevin Williams

2008-Feb-20 07:54 UTC

I've been told that's a known bug that is fixed in trunk. I now run a
gem built from trunk and it seems to work fine.

--
Cheers,

Kevin Williams
http://bantamtech.com/
http://almostserio.us/
http://kevwil.com/

Chris "ク" Heath

2008-Feb-20 08:07 UTC

Thanks for testing.

To clarify: Yes, the "unable to delete epoll event" is what EM version
0.10.0 gives.  This was fixed in EM trunk, but now I get a segfault.

William, what Linux kernel are you using?  Maybe this was already fixed
in the latest kernel, but my CentOS/Redhat kernels are too old?

Chris



William Crawford

2008-Feb-20 08:31 UTC

I'm running Kubuntu 7.10 and the current kernel is apparently
2.6.22-14-generic.  I think there was an update in the last few days that I
didn't get yet, though.  I don't know if that will make a difference.
(Still 2.6.22-14-generic, so it's probably just a minor change.)


Chris "ク" Heath

2008-Feb-20 08:33 UTC

On Wed, 2008-02-20 at 11:07 -0500, Chris "ク" Heath wrote:
> Thanks for testing.
>
> To clarify: Yes, the "unable to delete epoll event" is what EM version
> 0.10.0 gives.  This was fixed in EM trunk, but now I get a segfault.
>
> William, what Linux kernel are you using?  Maybe this was already fixed
> in the latest kernel, but my CentOS/Redhat kernels are too old?

Oops, I meant to direct that question to Kevin, since he's got a kernel
that apparently works.  :-)

Chris

Kevin Williams

2008-Feb-20 09:03 UTC

The systems I have that work with trunk are a Fedora 8 xen slice
running 2.6.16.29-xen x86_64 kernel, and OS X 10.5.2. I guess the os x
kernel using kqueue makes that one kinda pointless, but it works. I
kinda remember seeing this same error on the Mac, but I'm not sure.

--
Cheers,

Kevin Williams
http://bantamtech.com/
http://almostserio.us/
http://kevwil.com/

Oleg Andreev

2008-Feb-20 10:18 UTC

Just installed from trunk on linux:

$ uname -a

Linux srv213 2.6.18-4-686-bigmem #1 SMP Thu May 10 00:23:00 UTC 2007 i686 GNU/Linux

~$ irb -reventmachine --prompt xmp

EM.set_descriptor_table_size 10_000
     ==>10000
EM.epoll
     ==>nil
module A; def receive_data(a);send_data ">> #{a}";end; end;
server = Thread.new{ EM::run{ EM::start_server('0.0.0.0',10101,A)  } }
     ==>#<Thread:0xb78beef0 run>
1000.times{ EM::connect('127.0.0.1',10101,A);  }
     ==>1000
1000.times{ EM::connect('127.0.0.1',10101,A);  }
     ==>1000
1000.times{ EM::connect('127.0.0.1',10101,A);  }
     ==>1000
1000.times{ EM::connect('127.0.0.1',10101,A);  }
terminate called after throwing an instance of 'std::runtime_error'
   what():  not initialized
Aborted

Chris "ク" Heath

2008-Feb-20 11:30 UTC

I am realizing now, this is caused by the interaction of the fork() in
EventMachine.popen with the close() in conn.send_datagram.  What is
happening is:

1. EventMachine.open_datagram_socket opens a UDP socket.
2. conn.send_datagram queues a packet, but does not send it yet.
3. EventMachine.popen calls fork() so now there are two copies of
   the UDP socket (one in each process).
4. The parent process calls epoll_wait.
5. epoll_wait returns immediately saying that the UDP socket is
   writable.
6. DatagramDescriptor::Write tries to send the packet, gets an error
   (EMSGSIZE), and closes the UDP socket.
7. _RunEpollOnce deletes the C++ object that wraps the socket
   descriptor.
8. The parent process calls epoll_wait again.
9. >>>>>>>>>>>>>> HERE'S THE BUG <<<<<<<<<<<<<<<<<<<<<
   epoll_wait returns immediately saying that the UDP socket is
   writable.  Uh-oh!  The UDP socket is closed in the parent process;
   it is only writable in the child process!
   >>>>>>>>>>>>>>>>>> END BUG <<<<<<<<<<<<<<<<<<<<<<<<
10. The parent process tries to call ed->Write, but it segfaults
   because ed is deleted.

So my question is: would you consider this to be a kernel bug, or is it
legitimate for the parent process to get epoll events for descriptors
that are still open in a child process?
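
The sequence above can be reduced to a few system calls. The following is a minimal sketch, assuming a Linux system that provides epoll_create1(); it is not EventMachine code, and the function name is hypothetical. It registers a UDP socket for EPOLLOUT, forks a child that simply holds its inherited copy, closes the socket in the parent, and then asks epoll_wait() what it sees.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>

// Returns how many events epoll_wait() reports for a UDP socket that the
// parent has already closed while a forked child still holds its copy.
int events_after_parent_close() {
    int ep = epoll_create1(0);
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    assert(ep >= 0 && sock >= 0);

    epoll_event ev{};
    ev.events = EPOLLOUT;      // an idle UDP socket is normally writable
    ev.data.fd = sock;
    int rc = epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);
    assert(rc == 0);

    int gate[2];               // pipe used only to keep the child alive
    rc = pipe(gate);
    assert(rc == 0);

    pid_t child = fork();
    assert(child >= 0);
    if (child == 0) {
        // Child: inherits a copy of `sock`; block until the parent is done.
        close(gate[1]);
        char c;
        ssize_t ignored = read(gate[0], &c, 1);
        (void)ignored;
        _exit(0);
    }
    close(gate[0]);

    close(sock);               // the parent closes its descriptor (step 6)

    epoll_event out{};
    int n = epoll_wait(ep, &out, 1, 100);  // step 9: is it still reported?

    close(gate[1]);            // EOF on the pipe lets the child exit
    waitpid(child, nullptr, 0);
    close(ep);
    return n;
}
```

If the behaviour described in step 9 holds, the function returns 1: the parent still receives an event for a descriptor it has already closed.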

Even if this is a kernel bug, I think we need to modify EventMachine to
work around the bug. Some possible ways to do this:

(a) *Always* call epoll_ctl(EPOLL_CTL_DEL) before closing a descriptor.
or
(b) *Never* call Close directly -- call ScheduleClose(false) instead.
or
(c) call fcntl(FD_CLOEXEC) on all sockets except for ones that we want
to be passed through to the child.

I think I prefer (a) or (b), because (c) only solves the problem for
exec(), not fork().
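
For option (c), note that fcntl(FD_CLOEXEC) above is shorthand: the flag is set with the F_SETFD command. A minimal sketch in C++ follows, with helper names made up for illustration. The flag only takes effect at exec(), so between fork() and exec() the child still holds the descriptor, which is the limitation mentioned above.

```cpp
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>

// Mark `fd` close-on-exec, preserving any other descriptor flags.
bool set_close_on_exec(int fd) {
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0) return false;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == 0;
}

// Report whether `fd` currently has FD_CLOEXEC set.
bool is_close_on_exec(int fd) {
    int flags = fcntl(fd, F_GETFD);
    return flags >= 0 && (flags & FD_CLOEXEC) != 0;
}

// Small demonstration: a descriptor from plain socket() starts without the
// flag, and a dup() of a flagged descriptor does not carry the flag over.
bool demo_close_on_exec() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    assert(sock >= 0);
    bool before = is_close_on_exec(sock);     // expected false
    bool set_ok = set_close_on_exec(sock);
    bool after = is_close_on_exec(sock);      // expected true
    int copy = dup(sock);
    bool copy_flag = is_close_on_exec(copy);  // expected false
    close(copy);
    close(sock);
    return !before && set_ok && after && !copy_flag;
}
```

Because the flag belongs to the descriptor rather than to the underlying socket, every descriptor that should not leak into a child program has to be marked individually.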

Chris

James Tucker

2008-Feb-21 10:01 UTC

I'd probably try to avoid forking, personally.


Francis Cianfrocca

2008-Feb-21 10:16 UTC

On Wed, Feb 20, 2008 at 2:30 PM, Chris "ク" Heath <chris at heathens.co.nz> wrote:
> Even if this is a kernel bug, I think we need to modify EventMachine to
> work around the bug. Some possible ways to do this:
>
> (a) *Always* call epoll_ctl(EPOLL_CTL_DEL) before closing a descriptor.
> or
> (b) *Never* call Close directly -- call ScheduleClose(false) instead.
> or
> (c) call fcntl(FD_CLOEXEC) on all sockets except for ones that we want
> to be passed through to the child.
>
> I think I prefer (a) or (b), because (c) only solves the problem for
> exec(), not fork().

It doesn't seem to me that this bug is inherently related to fork (even
though it takes a fork to make it show up). If EM is calling epoll_wait on a
descriptor whose C++ wrapper has been deleted, that's a clear bug. The
problem with solution (a) is that, according to the documentation of
epoll_wait, closing a descriptor automatically removes it from the epoll
object.

So I'm still undecided how best to fix this.
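
For reference, option (a) might look like the sketch below. This is an illustration under stated assumptions, not a patch against em.cpp: the helper names are hypothetical, and a dup() of the socket stands in for the copy a forked child would hold. Current versions of the epoll(7) man page state that a descriptor is removed from the epoll set only once every descriptor referring to the same open file description has been closed, which is why the explicit EPOLL_CTL_DEL is not redundant here.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>

// Option (a): deregister from epoll explicitly, then close.
// A dummy event is passed because kernels before 2.6.9 reject a null pointer.
int close_registered(int ep, int fd) {
    epoll_event unused{};
    int rc = epoll_ctl(ep, EPOLL_CTL_DEL, fd, &unused);
    close(fd);
    return rc;
}

// Returns the number of events reported after close_registered(), while a
// duplicate of the socket is deliberately kept open.
int events_after_explicit_delete() {
    int ep = epoll_create1(0);
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    assert(ep >= 0 && sock >= 0);
    int copy = dup(sock);  // stands in for the child's inherited copy
    assert(copy >= 0);

    epoll_event ev{};
    ev.events = EPOLLOUT;
    ev.data.fd = sock;
    int rc = epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);
    assert(rc == 0);

    rc = close_registered(ep, sock);
    assert(rc == 0);

    epoll_event out{};
    int n = epoll_wait(ep, &out, 1, 100);  // times out if nothing is registered

    close(copy);
    close(ep);
    return n;
}
```

With the explicit delete in place, events_after_explicit_delete() should return 0 even though the duplicate is still open.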

Chris "ク" Heath

2008-Feb-21 12:59 UTC

On Thu, 2008-02-21 at 13:16 -0500, Francis Cianfrocca wrote:
> It doesn't seem to me that this bug is inherently related to fork
> (even though it takes a fork to make it show up).

I just tried using dup(ed->GetSocket()) instead of fork() and the same
thing happened.  So yes, this bug is more about how epoll handles
duplicated descriptors.

> If EM is calling epoll_wait on a descriptor whose C++ wrapper has been
> deleted, that's a clear bug. The problem with solution (a) is that,
> according to the documentation of epoll_wait, closing a descriptor
> automatically removes it from the epoll object.

Yes, I see that in the documentation, too.
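
The dup() experiment is easy to repeat outside EventMachine. The following is a minimal sketch, assuming Linux and using made-up names (DupResult, stale_event_with_dup). It records what epoll_wait() reports after the registered descriptor is closed while a duplicate remains open, what happens when the closed descriptor number is then passed to EPOLL_CTL_DEL, and what epoll_wait() reports once the duplicate is closed too.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>

struct DupResult {
    int events_while_dup_open;    // epoll_wait() result after close(sock)
    int del_errno;                // errno from EPOLL_CTL_DEL on the closed fd
    int events_after_dup_closed;  // epoll_wait() result after close(copy)
};

DupResult stale_event_with_dup() {
    int ep = epoll_create1(0);
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    assert(ep >= 0 && sock >= 0);

    epoll_event ev{};
    ev.events = EPOLLOUT;
    ev.data.fd = sock;
    int rc = epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);
    assert(rc == 0);

    int copy = dup(sock);  // second descriptor, same open file description
    assert(copy >= 0);
    close(sock);           // close only the descriptor that was registered

    DupResult r{};
    epoll_event out{};
    r.events_while_dup_open = epoll_wait(ep, &out, 1, 100);

    // The closed descriptor number can no longer be used to deregister.
    rc = epoll_ctl(ep, EPOLL_CTL_DEL, sock, &ev);
    r.del_errno = (rc < 0) ? errno : 0;

    close(copy);           // last reference gone: the kernel drops the entry
    r.events_after_dup_closed = epoll_wait(ep, &out, 1, 100);

    close(ep);
    return r;
}
```

The expected result is one stale event while the duplicate is open, EBADF from the late EPOLL_CTL_DEL, and no events after the duplicate is closed. That matches the reading that epoll tracks the underlying open file rather than the descriptor number.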

To be honest, I am starting to like the design of kqueue much better
than epoll in that kqueue does not survive a fork.  As soon as you add the
requirement that epoll descriptors can survive a fork, it no longer
makes sense to remove a closed descriptor from the epoll object, because
you might close the descriptor in the parent process but then call
epoll_wait in the child process.

So I have a feeling that this "bug" in epoll is actually considered a
feature by its developers, and the documentation needs to be clarified.

> So I'm still undecided how best to fix this.

For now, I have gone with option (b) (changing Close() to
ScheduleClose(false)) in my local copy. It seems to work, but I'll let
you decide if it's the best solution.
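
One plausible reading of why option (b) helps is sketched below. This is an assumption about the mechanism, not code taken from em.cpp, and every name in it (TinyLoop, schedule_close, run_once) is hypothetical. The idea is that a deferred close only sets a flag, so the loop can deregister the descriptor from epoll while its number is still valid, and only then close it and drop the wrapper.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <map>

// Hypothetical miniature reactor; none of these names are EventMachine's.
class TinyLoop {
public:
    TinyLoop() : ep_(epoll_create1(0)) { assert(ep_ >= 0); }
    ~TinyLoop() {
        for (auto& entry : close_scheduled_) close(entry.first);
        close(ep_);
    }

    // Register `fd` for writability and start tracking it.
    void add(int fd) {
        epoll_event ev{};
        ev.events = EPOLLOUT;
        ev.data.fd = fd;
        int rc = epoll_ctl(ep_, EPOLL_CTL_ADD, fd, &ev);
        assert(rc == 0);
        close_scheduled_[fd] = false;
    }

    // Deferred close: only set a flag; the descriptor stays open for now.
    void schedule_close(int fd) { close_scheduled_.at(fd) = true; }

    // One loop turn. Flagged descriptors are deregistered while their fd
    // number is still valid, then closed and forgotten. Returns the number
    // of events that arrive for descriptors the loop no longer tracks.
    int run_once(int timeout_ms) {
        for (auto it = close_scheduled_.begin(); it != close_scheduled_.end();) {
            if (it->second) {
                epoll_event unused{};
                epoll_ctl(ep_, EPOLL_CTL_DEL, it->first, &unused);
                close(it->first);
                it = close_scheduled_.erase(it);
            } else {
                ++it;
            }
        }
        epoll_event out[8];
        int n = epoll_wait(ep_, out, 8, timeout_ms);
        int stale = 0;
        for (int i = 0; i < n; ++i) {
            if (close_scheduled_.count(out[i].data.fd) == 0) ++stale;
        }
        return stale;
    }

private:
    int ep_;
    std::map<int, bool> close_scheduled_;  // fd -> close requested?
};

// Schedule a close while a duplicate of the socket is still open elsewhere.
int stale_events_with_deferred_close() {
    TinyLoop loop;
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    assert(sock >= 0);
    int copy = dup(sock);  // stands in for the child's inherited copy
    loop.add(sock);
    loop.schedule_close(sock);
    int stale = loop.run_once(100);
    close(copy);
    return stale;
}
```

In this model stale_events_with_deferred_close() should return 0, because the entry is removed from the epoll set before the descriptor is closed, regardless of the outstanding duplicate.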

Chris

Chris "ク" Heath

2008-Mar-10 19:55 UTC

On Thu, 2008-02-21 at 13:16 -0500, Francis Cianfrocca wrote:
> It doesn't seem to me that this bug is inherently related to fork
> (even though it takes a fork to make it show up). If EM is calling
> epoll_wait on a descriptor whose C++ wrapper has been deleted, that's
> a clear bug. The problem with solution (a) is that, according to the
> documentation of epoll_wait, closing a descriptor automatically
> removes it from the epoll object.
>
> So I'm still undecided how best to fix this.

Francis,

A new version of the epoll(7) man page got released today.  I think it
clarifies a lot of things, and hopefully will help you decide the best
way to fix this.

http://www.kernel.org/pub/linux/docs/man-pages/man-pages-2.79.tar.bz2

Chris
