I don't know if this is a problem with EM per se, but I was hoping for some insights. I have an EM server running on localhost, and a lot of clients running inside another EM process. Sometimes when the clients connect to the server, the following is what occurs (ignore the [TCP CHECKSUM INCORRECT]s; I think those warnings themselves are wrong):

No.  Time       Source     Destination  Protocol  Info
51   9.067843   127.0.0.1  127.0.0.1    TCP  51815 > 7779 [SYN] Seq=0 Win=65535 [TCP CHECKSUM INCORRECT] Len=0 MSS=16344 WS=1 TSV=895386257 TSER=0
52   9.067964   127.0.0.1  127.0.0.1    TCP  7779 > 51815 [SYN, ACK] Seq=0 Ack=1 Win=65535 [TCP CHECKSUM INCORRECT] Len=0 MSS=16344 WS=1 TSV=895386257 TSER=895386257
53   9.067990   127.0.0.1  127.0.0.1    TCP  51815 > 7779 [ACK] Seq=1 Ack=1 Win=81660 [TCP CHECKSUM INCORRECT] Len=0 TSV=895386257 TSER=895386257
85   14.090994  127.0.0.1  127.0.0.1    TCP  51815 > 7779 [FIN, ACK] Seq=1 Ack=1 Win=81660 [TCP CHECKSUM INCORRECT] Len=0 TSV=895386307 TSER=895386257
86   14.091087  127.0.0.1  127.0.0.1    TCP  7779 > 51815 [ACK] Seq=1 Ack=2 Win=81660 [TCP CHECKSUM INCORRECT] Len=0 TSV=895386307 TSER=895386307
87   14.091566  127.0.0.1  127.0.0.1    TCP  7779 > 51815 [FIN, ACK] Seq=1 Ack=2 Win=81660 [TCP CHECKSUM INCORRECT] Len=0 TSV=895386307 TSER=895386307
88   14.091616  127.0.0.1  127.0.0.1    TCP  51815 > 7779 [ACK] Seq=2 Ack=2 Win=81660 [TCP CHECKSUM INCORRECT] Len=0 TSV=895386307 TSER=895386307

So the client completes the handshake, then arbitrarily sends a FIN about five seconds later, and never sends the data I instructed it to send. The client and the server then both call unbind, as expected, and are left wondering what just happened (note that I never tell it to close). This is in a client with hundreds of connections.

Is this a known problem with load testing on localhost? I've seen it before on win32 (this instance happens to be on OS X 10.5). Am I just generally running out of buffer space, so the kernel drops my packets? Thoughts?

Thanks!
-Roger
It appears that if a connection doesn't 'connect' within X seconds (specified as 4), then EM considers it a 'dead' connection and immediately closes the new connection. It's an application-specific setting (think of a ping timing out after a while--how long is that 'while'?). Setting it higher fixes it. So if your server is more than 4 seconds away... ugh.

Thanks!
-Roger
On Dec 29, 2007 8:49 PM, Roger Pack <rogerpack2005 at gmail.com> wrote:
> It appears that if a connection doesn't 'connect' within X seconds
> (specified as 4), then EM considers it a 'dead' connection and
> immediately closes the new connection. It's an application specific
> setting (think ping timing out after awhile--how long is that
> 'awhile'?)
> Setting it higher fixes it.
> So if your server is more than 4 seconds away... ugh.

It would be pretty easy to make the timeout interval configurable. But are you sure your use case is typical? Or were you just experimenting? If a TCP server takes multiple seconds to handshake, you've got a more basic problem in your application. (Unless you're satellite-linking to Antarctica or something like that.)
The conditions where it occurs, I think, are when you're making a lot of connections to localhost, so you have a lot of full queues and things like that--a lot of traffic in flight--and it takes those first packets a while to get through the queues back and forth between the two hosts. For example, if you have a BitTorrent client that is serving and downloading a lot of files, it might take 4s or more for an incoming ACK packet to get through the queue among all the incoming data.

That said, if I were designing it I might suggest waiting 120s by default. That would be more 'conservative', allowing for TCP's faults, and would reduce the chance of this error cropping up again during somebody else's load testing. It would then behave more like set_connection_timeout, which doesn't have a predefined value, so the user understands why their connections close later--because they had to set it manually in the code.

This also points out, again, the usefulness of a function like get_unbind_reason or get_unbind_status. My $.02 for the day.

Thanks all!
-Roger
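As a sketch of the get_unbind_reason idea above (using only EM's existing close_connection and unbind; the module and method names here are invented for illustration, not part of EM's API), an application can record its own reason before closing so that unbind can tell a deliberate close apart from a surprise one:

require 'eventmachine'

# Hypothetical application-level stand-in for a get_unbind_reason API.
module TracksUnbindReason
  attr_reader :unbind_reason

  # Record why we are closing before asking EM to close.
  def close_with_reason(reason)
    @unbind_reason = reason
    close_connection
  end

  def unbind
    # If no reason was recorded, the close came from the peer, a timeout,
    # or something inside the reactor.
    @unbind_reason ||= :closed_by_peer_or_reactor
    puts "connection unbound: #{@unbind_reason}"
  end
end

A handler would mix this in (class MyHandler < EM::Connection; include TracksUnbindReason; end) and call close_with_reason(:idle) instead of close_connection directly.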
On Dec 31, 2007 9:32 AM, Roger Pack <rogerpack2005 at gmail.com> wrote:
> The conditions where it occurs I think are when you're connecting a
> lot to localhost so you have a lot of full queues and things like
> that--a lot of network bandwidth and so it takes those first packets
> awhile to get through the queue back and forth between the two hosts.
> So for example if you have a bittorrent client that is serving and
> downloading a lot of files, it might take 4s or more for an incoming
> ack packet to get through the queue among all the incoming data.

Yeah, I can see that, actually, since user-written code could be causing the bottleneck. EM interleaves reads and accepts in order to keep from starving either, but it has no control over how long it takes user code to respond to read events.

I'd be in favor of adding a flag to set the default connect-pending timeout. I don't like setting it to 120 seconds by default. (I assume you got that from the default FIN-WAIT time on some kernels?)
> I'd be in favor of adding a flag to set the default connect-pending timeout.
> I don't like setting it to 120 seconds by default. (I assume you got that
> from the default FIN-WAIT time on some kernels?)

That would work--so by default they don't time out, and there's a flag to turn it on? That would be cool.
-Roger
My latest problem seems to be that at times EM will stop reading from sockets, even though those sockets still have queued data in them. I'm actually not sure if this is EM's fault or my own, or if maybe EM is abandoning reading from sockets too early, or something?

       send-q  rec-q
tcp4   0       18460   127.0.0.1.7779    127.0.0.1.54550   FIN_WAIT_1
tcp4   81660   0       127.0.0.1.54550   127.0.0.1.7779    ESTABLISHED

This is a connection between two processes, both running on localhost, both running EM single-threaded, and both not firing any EM events. I have yet to check whether 'unbind' has already been called on one or both sockets, but you can see that the bottom socket is trying to send lots of packets to the top socket. This uses up available kernel buffer space.

Will try to figure it out. Wish me luck.
-Roger
No, there's really no way to avoid timing out connect-attempts that fail. Otherwise, over time you'd end up not freeing the descriptors. I was suggesting to keep the existing timeout and give a method to set its value.

I'm in the middle of some huge and difficult performance-related changes. When that's all checked in, I'll add the new timeout method.

On Jan 1, 2008 11:51 AM, Roger Pack <rogerpack2005 at gmail.com> wrote:
> > I'd be in favor of adding a flag to set the default connect-pending timeout.
> > I don't like setting it to 120 seconds by default. (I assume you got that
> > from the default FIN-WAIT time on some kernels?)
>
> That would work--so by default they don't timeout and there's a flag
> to turn it on? That would be cool.
> -Roger
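A sketch of how the per-connection setter described here might be used, assuming an accessor named pending_connect_timeout= on EM::Connection (that name comes from later EventMachine releases and is an assumption relative to this thread):

require 'eventmachine'

class SlowHandshakeClient < EM::Connection
  def post_init
    # Allow up to 20 seconds for the TCP handshake instead of the small
    # built-in default discussed above. Accessor name assumed.
    self.pending_connect_timeout = 20
  end

  def connection_completed
    send_data "hello\n"
  end

  def unbind
    puts "unbound (handshake may have timed out or the peer closed)"
  end
end

EM.run { EM.connect('127.0.0.1', 7779, SlowHandshakeClient) }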
If the top socket is in FIN_WAIT_1, it has already been closed, either by your program, or by an exception, or something else. What platform are you running this on? And is this a repeatable problem that you could send a test case for?

On Jan 1, 2008 11:57 AM, Roger Pack <rogerpack2005 at gmail.com> wrote:
> My latest problem seems to be that at times EM will stop reading from
> sockets, but those sockets will still have queued data in them.
> [...]
> This is a connection between two processes, both running on localhost,
> both running EM single threaded, and both not firing any EM events.
> [...]
> Will try to figure it out. Wish me luck.
> -Roger
Question on a bug: after a certain load, EM 'naturally' runs out of descriptors. After that point, however, it seems that select (normal select, at least) returns 22 (EINVAL, invalid argument)--always. Perhaps the descriptor list gets corrupted? Since select keeps returning this, EM never checks its 'still good' sockets for read/write status; it just sleeps for a second (I think it assumes the error will clear itself up). A good thing might be to clear up the current bug, and also to have it check for error statuses and act accordingly.

Should this be a real bug, it would also explain why, on Windows, servers quit accepting after a certain load, which I run into often, as Windows only allows 256 file descriptors.

Anyway, something to think about. Now some thoughts on EM speedup:

One way to speed up EM might be to have the user specify which functions they actually override, and only call those. For example, if a user never needs 'connection_completed', then never call it. Of course, I can't imagine many 'real' protocols that don't use receive_data, so my idea is pretty moot except for maybe unbind and connection_completed (post_init?). Oh wait, EM does this already :) Minimal gain, then. It seems EM is about as fast as it can go. Dunno. :)

Another thought would be to let users choose whether rb_select is used or select itself (in single-threaded mode, you can get away with using straight select, it seems). Wonder if that would be worthwhile or not.

Some more oddities for speed gain would be an option to 'only read data in large chunks' to save on message-calling overhead. One might also make heartbeats optional (or keep a list of sockets that have requested them). Also, I think it's in there already, but a 'reusable' select array might help too. Having select time out at 'the next timer firing' might help as well.

Now some random thoughts, for fun.

It would be nice to 'save' errno away somewhere, so that we can tell why certain calls fail.

As mentioned, it seems that socket 'double unbind' currently (only noticed this after the latest SVN, though, so it's probably not too hard to find).

I have received this error before--not sure if the assertion is right or not... it could well be :)

Assertion failed: nbytes > 0, file ed.cpp, line 595  # win32
Assertion failed: (nbytes > 0), function _WriteOutboundData, file ed.cpp, line 596.  (on mac os x, after hitting ctrl-c to interrupt current transfers)  # os x

Anyway... will look into the bug. Viva EM.
-Roger

On Jan 1, 2008 10:48 AM, Francis Cianfrocca <garbagecat10 at gmail.com> wrote:
> If the top socket is in FIN_WAIT_1, it has already been closed, either by
> your program, or by an exception, or something else. What platform are you
> running this on? And is this a repeatable problem that you could send a
> test-case for?
On Jan 2, 2008 8:32 PM, Roger Pack <rogerpack2005 at gmail.com> wrote:
> Question on a bug:
> After a certain load EM 'naturally' runs out of descriptors.
> After this point, however, it seems that select (normal select, at
> least) returns 22 (invalid argument)--always.
> [...]
> Should this be a real bug, it would also explain why, on windows,
> after a certain load, servers quit accepting, which I run into often,
> as windows only allows 256 file descriptors.

If select is returning EINVAL, that's an obvious bug. Do you have a repeatable test case? Does this happen on Windows? OSX is another system where the default number of file descriptors is only 256. I wonder if the problem is the first parameter to the select call (maxsockets), although that parameter is (supposed to be) ignored on Windows.

> One way to speed up EM might be to have the user specify which
> functions they actually override, and only call those.
> [...]

I've generally found that eliminating rb_funcalls is a very helpful way to improve performance, *except* when there is a lot of network I/O going on, which tends to dominate the profile. Not sure there's much bang for the buck here.

> Another thought would be to allow users to choose if rb_select is used
> or select itself (in single threaded mode, you can get away with using
> straight select, it seems). Wonder if that would be worthwhile or
> not.

In Ruby 1.8.x, if you call rb_thread_select in a program with no additional threads besides the main one, the performance impact is almost unmeasurable. If you have even a single additional Ruby thread, even if it's only sleeping, the performance impact is huge. For Ruby 1.9, EM uses the newer thread_nonblocking_region. I haven't profiled performance under Ruby 1.9, which is still very buggy anyway.

> Some more oddities for speed gain would be an option to 'only read
> data in large chunks' to save on message calling overhead.
> [...]

There's an endless amount of lore on how to make select faster with large sets, going back to the early Web days, when large sets first started happening. None of it is really as good as using things like epoll and kqueue.

> Now some random thoughts, for fun.
>
> It would be nice to 'save' errno away somewhere, so that we can tell
> why certain calls fail.
>
> As mentioned, it seems that socket 'double unbind' currently (only
> noticed this after the latest SVN, though, so it's probably not too
> hard to find).

Since there's no verb in your dependent clause, I don't know what you're saying here :-), but I do want to know.
The recursive unbind problem is something I really want to solve.

> I have received this error before--not sure if the assertion is right
> or not... it could well be :)
>
> Assertion failed: nbytes > 0, file ed.cpp, line 595  # win32
> Assertion failed: (nbytes > 0), function _WriteOutboundData, file
> ed.cpp, line 596.  (on mac os x, after hitting ctrl-c to interrupt
> current transfers)  # os x

This sounds benign. Is it repeatable on the Mac?
> In Ruby 1.8.x, if you call rb_thread_select in a program with no additional
> threads beside the main one, the performance impact is almost unmeasurable.
> If you have even a single additional Ruby thread, even if it's only
> sleeping, the performance impact is huge. For Ruby 1.9, EM uses the newer
> thread_nonblocking_region. I haven't profiled performance under Ruby 1.9,
> which is still very buggy anyway.

The other day I remember thinking 'what is taking up 100% CPU?', and it turned out to be a Ruby process with 16 threads or so, each using network I/O. I was reminded of how poorly Ruby performs in a multi-threaded environment :) Its native threads scare me.

> If select is returning EINVAL, that's an obvious bug. Do you have a
> repeatable test case? Does this happen on Windows?
> OSX is another system where the default number of file descriptors is only
> 256. I wonder if the problem is the first parameter to the select call
> (maxsockets), although that parameter is (supposed to be) ignored on Windows.

On Windows, after a few seconds of high load I begin to get (from a custom printout line):

select failed 10038 An operation was attempted on something that is not a socket

(Same on OS X, just EINVAL instead; and there I can move the file-descriptor limit up, so it's not as common.)

However, when an existing connection closes, select (I think) seems to work, because I can accept (one more) connection; then it keeps erring. My theory is that when you run out of file descriptors, bad things happen to existing sockets. Maybe acceptors 'err' when they are passed to select. I'm not sure, though, as it seems that some 'writable' sockets err and some 'readable' sockets err. Maybe pending connections (typically readable) also err on select if there aren't file descriptors available. I'm not sure.

I haven't totally figured it out yet, but it appears that looping through beforehand and excluding the sockets that err from the select descriptors avoids the error, and EM seems to handle things according to spec. I first thought that if a socket erred on select it was totally toast--however, it might be that those sockets 'become readable' later, so just excluding them is a good option. I'm honestly not sure why this stuff occurs. I am also not sure how sockets respond with epoll.

> > It would be nice to 'save' errno away somewhere, so that we can tell
> > why certain calls fail.

This is a request for a function to access 'errno', so we can tell why connect_server fails (is it out of descriptors?), or for strerror(errno).

> > As mentioned, it seems that socket 'double unbind' currently (only
> > noticed this after the latest SVN, though, so it's probably not too
> > hard to find).
>
> Since there's no verb in your dependent clause, I don't know what you're
> saying here :-), but I do want to know. The recursive unbind problem is
> something I really want to solve.

I just commented out the 'throw' clause for double unbinds (hack job), and it works for now for me. Monkey patches, here we come. I note that it is only a 'recent' problem in the code, I believe.

> > Assertion failed: nbytes > 0, file ed.cpp, line 595  # win32
> > Assertion failed: (nbytes > 0), function _WriteOutboundData, file
> > ed.cpp, line 596.  (on mac os x, after hitting ctrl-c to interrupt
> > current transfers)  # os x
>
> This sounds benign. Is it repeatable on the mac?

I've seen it once on Mac OS X, under heavy load. It is repeatable under heavy load, at least on win32.
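The exclude-the-erring-socket workaround described above can be prototyped from Ruby with IO.select; this is a sketch of the idea rather than EM's C++ code, and it assumes that the bad descriptors show up as IOError/EBADF/EINVAL when probed individually:

# Probe each candidate IO with a zero-timeout select; anything that makes
# select itself raise is excluded from the next real select call.
def selectable_ios(ios)
  ios.select do |io|
    begin
      IO.select([io], nil, nil, 0)   # zero timeout: ask without waiting
      true
    rescue IOError, Errno::EBADF, Errno::EINVAL
      false                          # leave it out rather than poison select
    end
  end
end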
I just commented it out and things seem to be fine. I think I can reproduce it consistently.

Well, that's about it! Thanks all!
-Roger
Question: on win32, do you need to propagate the fderr group (select's fourth param?) in order to ascertain whether a socket does NOT connect well?

Also, http://itamarst.org/writings/win32sockets.html mentions that FD_SETSIZE should be resized--anybody know if that is still the case? I guess it was in 2001, but hey :)

Now, some more random bugs that have happened:

It appears that sometimes, on win32, my program will reach a state where select always returns 'immediately' with a value of 1, and does read on a socket (not the loopbreak reader), but doesn't close that socket and doesn't do any callbacks into code. I would look into it, but I think I'm gonna bail on win32 and go back to OS X. Any ideas?

./eventmachine/svn/version_0/lib/eventmachine.rb:226: [BUG] Segmentation fault
ruby 1.8.6 (2007-10-21) [i386-mingw32]

Happens infrequently, but does happen.

Anyway, if select fails and you then check each socket, one by one, and only select on the valid sockets, it seems to work still.

Wish me luck.
-Roger
On Jan 3, 2008 8:57 PM, Roger Pack <rogerpack2005 at gmail.com> wrote:
> Question: on win32 do you need to propagate the fderr group (select's
> 4th param?) in order to ascertain whether a socket does NOT connect
> well?

We don't do anything (for now, at any rate) with fderr. On Windows, failures-to-connect are caught by the heartbeat mechanism. This is obviously not ideal, but I don't know how Windows signals connect errors. Being Windows, there's no definitive documentation. And being Windows, the behavior is likely to be different for every version of the OS that's out there. That's why I punted on this one.

> Also, from http://itamarst.org/writings/win32sockets.html it mentions
> that FD_SETSIZE should be resized--anybody know if that is still the
> case? I guess it was in 2001, but hey :)

FD_SETSIZE is always sized to 1024. That's done in extconf.rb (it generates a compiler flag). Maybe you could try setting it down to 256 for Windows and Mac and see if that changes anything?

> [...]
> Anyway if select fails and you then check each socket, one by one, and
> only select on the valid sockets, it seems to work still.

What do you mean by "valid" sockets? Are you calling an ioctl or something to see if they're valid?
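For anyone wanting to experiment with the value Francis mentions, the generic mkmf idiom for emitting such a compiler flag looks like this; the exact line in EM's extconf.rb may differ, and 'rubyeventmachine' is assumed as the extension name:

# extconf.rb fragment (sketch): pick a smaller FD_SETSIZE on Windows/OS X,
# as suggested above, and pass it to the compiler as a -D flag.
require 'mkmf'

setsize = (RUBY_PLATFORM =~ /mswin|mingw|darwin/) ? 256 : 1024
$defs.push("-DFD_SETSIZE=#{setsize}")

create_makefile('rubyeventmachine')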
> We don't do anything (for now at any rate) with fderr. On Windows,
> failures-to-connect are caught by the heartbeat mechanism. This is obviously
> not ideal but I don't know how Windows signals connect errors. Being
> Windows, there's no definitive documentation. And being Windows, the
> behavior is likely to be different for every version of the OS that's out
> there. That's why I punted on this one.

Searching seemed to yield http://msdn2.microsoft.com/en-us/library/ms740141.aspx, which describes one way. Hard to find, though. Unfortunately this is somewhat of a limitation on Windows, with its few file descriptors per process :)

> FD_SETSIZE is always sized to 1024. That's done in extconf.rb (it generates
> a compiler flag). Maybe you could try setting it down to 256 for Windows and
> Mac and see if that changes anything?

Sweet. Thanks for doing that. It seems that, at least on Mac OS, http://www.delorie.com/gnu/docs/glibc/libc_248.html implies that, of all file descriptors created (say you run ulimit -n 2000, so you can create descriptors numbered up to 2000), if any of those descriptor numbers are > FD_SETSIZE, then you can't pass them to select. I think this may be why select returns EINVAL after a while.

There may also be a small conflict in how many sockets EM allows--if it has a total of > 1024 created at any one time, then the fd_sets can't fit them all for a single select, and so that might also be a reason it returns EINVAL.

In other random news... there are sometimes random pauses, but I think they may be caused by attempting to do name resolution when you don't have any buffer space available (but who knows). At least they're only pauses, and pretty rare. Probably my code, but hey, thought I'd throw it out there in case anybody ran into it.

Any thoughts? Thanks!
-Roger
On Jan 3, 2008 10:03 PM, Roger Pack <rogerpack2005 at gmail.com> wrote:
> It seems that at least on mac os,
> http://www.delorie.com/gnu/docs/glibc/libc_248.html seems to imply
> that, of all file descriptors created (say you run ulimit -n 2000 --
> you could create descriptors up to 2000), if any of those descriptor
> numbers are > FD_SETSIZE, then you can't pass them to select. I think
> this may be why select is returning EINVAL after awhile.
> There may also be a small conflict in how many sockets EM allows--if
> it has a total of > 1024 created at any one time, then the fd_set's
> can't fit them all for a single select, and so that might also be a
> reason it returns EINVAL.

It's definitely true that FD_SETSIZE controls how many descriptors you can pass to select. It's also true that EM is a library linked into a Ruby process, so Ruby's FD_SETSIZE is what controls the outcome. And in Ruby, that's never larger than 1024 descriptors. The only way to solve this is by using epoll on Linux 2.6 and kqueue on OS X or BSD. Kqueue support was added to EM after the last release, so try syncing to the head revision. Read the document titled EPOLL and mentally substitute "kqueue" for "epoll" wherever it appears.

> In other random news... there are sometimes random pauses, but I think
> that they may be caused by attempting to do name resolution when you
> don't have any buffer space available (but who knows). At least
> they're only pauses, and pretty rare. Probably my code, but hey,
> thought I'd throw it out there in case anybody ran into it.

Yes. Definitely try using IP addresses instead of hostnames to see if the problem goes away. DNS resolution in the standard Ruby library actually spins threads and is horrendously slow. I wrote an evented DNS resolver/cache a few months back, which is far faster than the standard one. I haven't added it to the distro yet, but I should.
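A minimal way to act on the "use IP addresses instead of hostnames" advice, assuming the target hosts are known before the reactor starts; resolution here uses Ruby's standard Socket library, outside the event loop:

require 'socket'
require 'eventmachine'

targets = %w[example.com example.org]   # placeholder hostnames

# Resolve everything to numeric addresses before EM.run, so the slow,
# thread-spinning resolver never blocks the reactor.
addresses = targets.map { |host| Socket.getaddrinfo(host, nil).first[3] }

EM.run do
  addresses.each { |ip| EM.connect(ip, 80) }
end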
> It's definitely true that FD_SETSIZE controls how many descriptors you can
> pass to select. It's also true that EM is a library linked into a ruby
> process, so Ruby's FD_SETSIZE is what controls the outcome. And in Ruby,
> that's never larger than 1024 descriptors.

Sweetness. It turns out that if you arbitrarily close sockets whose descriptor number is > FD_SETSIZE, then EM works on OS X! Yea! Select no longer returns EINVAL. I can submit the patch if you'd like.

The reason this is necessary is that OS X by default allows 256 descriptors. When you hit that limit, you naturally want to raise the limit to something larger. The gotcha is that you can raise it to 10000--but if you go past 1024, you can create 'valid' descriptors that fail when passed to select. The fix is the patch (just check descriptors on creation to see if they'll fit within an fd_set, if you're using select), or to use kqueue. Dunno if this helps the couple of problems on win32. Also dunno if such a thing would be good for epoll/kqueue as well.

> Yes. Definitely try using IP addresses instead of hostnames to see if the
> problem goes away. DNS resolution in the standard Ruby library actually
> spins threads and is horrendously slow. I wrote an evented DNS
> resolver/cache a few months back, which is far faster than the standard one.
> I haven't added it to the distro yet but I should.

Grin. Yep, you got me--I thought it was something else, but it was DNS again. Silly me.

Another optimization thought: after select, we run through every socket, then check the loop breaker. An optimization would be to check whether s == 1 and the loop breaker is in the set; then you don't have to run through the loop to check each socket. Also, theoretically you only need to run through the fd_sets until you've found 's' worth of readable/writable sockets, so you can break the loop early. Just some thoughts.

Thanks all.
--
-Roger Pack
"For God hath not given us the spirit of fear; but of power, and of love, and of a sound mind" -- 2 Timothy 1:7
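The "check them on creation" fix can be illustrated at the Ruby level; FD_SETSIZE below is assumed to be 1024 (the value extconf.rb compiles in), and the guard is a sketch of the patch's idea rather than the patch itself:

require 'socket'

FD_SETSIZE = 1024   # assumed; matches the compiled-in value discussed above

# A select()-based reactor cannot place a descriptor numbered >= FD_SETSIZE
# into an fd_set, so refuse it at creation time instead of letting a later
# select call fail with EINVAL.
def accept_if_selectable(server)
  sock = server.accept
  if sock.fileno >= FD_SETSIZE
    sock.close        # dropping one connection beats poisoning every select
    nil
  else
    sock
  end
end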
On Jan 4, 2008 12:55 PM, Roger Pack <rogerpack2005 at gmail.com> wrote:
> Sweetness. Turns out if you arbitrarily close sockets (descriptor
> number) > FD_SETSIZE then EM works on os x! Yea! Select no longer
> returns EINVAL. I can submit the patch if you'd like.
> [...]

I want the patch :-). Do try the kqueue implementation. I'd like to know if it works for anyone besides me.

> Another optimization thought: after select we run through every
> socket, then check the loop breaker. An optimization would be to
> check if s==1 and the loop breaker is in the set, then you don't have
> to run through the loop to check each socket. Also theoretically you
> only need to run through the fd_sets until you've found 's' worth of
> readable/writable sockets, so you can break the loop early. Just some
> thoughts.

Think about it for a moment. If s==1, then the process by definition isn't heavily loaded, so it doesn't need optimizing. This might make it go faster in a benchmark, but there's no benefit in the real world. :-)
> I want the patch :-).

I haven't tested it on win32--I assume you'd like me to do that and get it 'perfect' first, or do you just want it now? I assume the polished one?

I may try to create a test case that shows how this is broken: one that fails on the old implementation and doesn't with the new. I think it's basically 'use up all your file descriptors', then close one and open one--it should connect. I'll see if I can create one, too.

Another question--are you more concerned with raw speed or with guaranteed functionality for patches? Like extra assertions--I assume leave them in?

> Do try the kqueue implementation. I'd like to know if it works for anyone
> besides me.

I should do that. I'm interested in what happens when you hit the upper boundary--if you try to allocate too many ports or whatnot. Does it still need boundary checking? That type of thing. Theoretically the problem should just go away.

Thanks for all the work and time.
-Roger
On Jan 4, 2008 10:17 PM, Roger Pack <rogerpack2005 at gmail.com> wrote:
> I may try to create a test case that shows how this is broken that
> fails on the old implementation and doesn't with the new. I think
> it's basically 'use up all your file descriptors' then close one and
> open one--it should connect. I'll see if I can create one, too.

All the testing you can do is most welcome. And if you can make a test case that can go into the distro, that's really superb. I was just thinking we need more unit tests that are stress tests rather than just correctness tests.

> Another question--are you more concerned with raw speed or with
> guaranteed functionality for patches? Like extra assertions--I assume
> leave them in?

Leave assertions and parameter-checks in. Raw speed is of no value if the code isn't 100% reliable.
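A rough shape for the descriptor-exhaustion stress test being discussed, assuming a server listening on 127.0.0.1:7779 as in the earlier traces; the exception EM raises when it cannot create another descriptor is an assumption here, and the half-second delay is just to give the reactor time to actually release the closed descriptor:

require 'eventmachine'

HOST, PORT = '127.0.0.1', 7779

class Probe < EM::Connection
  def connection_completed
    puts 'fresh connect succeeded after freeing a descriptor'
    EM.stop
  end
end

EM.run do
  conns = []
  begin
    # Open connections until descriptor creation fails
    # (exact exception class is an assumption).
    loop { conns << EM.connect(HOST, PORT) }
  rescue StandardError
    conns.pop.close_connection
    EM.add_timer(0.5) { EM.connect(HOST, PORT, Probe) }
  end
end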
> > Another optimization thought: after select we run through every
> > socket, then check the loop breaker. An optimization would be to
> > check if s==1 and the loop breaker is in the set, then you don't have
> > to run through the loop to check each socket. [...]
>
> Think about it for a moment. If s==1, then the process by definition isn't
> heavily loaded so it doesn't need optimizing. This might make it go faster
> in a benchmark but there's no benefit in the real world. :-)

Another thought is that it allows the code that broke the loop to be executed more quickly. Also, for those that use next_tick constantly, their loop would be executed more quickly. And it's easy--that's why I suggested it :)

But, as you suggested, the only real way to improve speed is to profile and kill the inefficiencies in the bottleneck--not in the typical use cases (though this one might possibly do something).

Thanks.
-Roger
It looks like we might be able to work in _DARWIN_UNLIMITED_SELECT (a compile-time define) to lift select's descriptor limit on Mac OS X and help us here too, though kqueue works well.
As a note, I also noticed that, inexplicably, get_sockname will sometimes fail on Windows. I don't know how this is really possible, but it does. Go figure :)
-Roger

> On windows (at least mine--mingw), it appears that setting FD_SETSIZE
> to more than 64 makes it so that some sockets are ignored (see the
> attached test). Not sure why. Wonder if using winsock2 would work
> better. The current extconf.rb leaves it at 64, so we should be good
> (should be fully functional as it is for win32).
>
> That being said, the patch still doesn't fix 'all' the problems on
> win32. It fixes several, but not all.
>
> Sometimes select with the current code base fails on Windows because,
> of all the sockets, a few of them are 'bad'. Running through each socket
> one at a time after a failed select, and checking whether each 'is the
> failing socket' by selecting with a timeval of {0,0}, catches 'the
> problem socket' 50% of the time. Sometimes, however, even that is not
> enough. Each socket by itself passes, but then select fails again.
> So I made the assumption that checking each socket with a timeval of
> {0,1} would find those. It might not.
> The problem of 'weirdness even given select {0,0}' seems to happen
> after I open some sockets, then some file descriptors; then selects
> fail. Perhaps they're getting ploughed.
>
> It also appears, separately, that sometimes select returns immediately
> with the value 3, though no sockets are in select except the loop
> breaker. That was odd.
> So overall I'd say there are still some problems on win32 that cause
> select to return immediately, and in error. I probably won't take a
> look at them unless I get really bored, as 50% of the problems seem to
> be fixed.
> That being said, the patch still doesn't fix 'all' the problems on
> win32. It fixes several, but not all.

I did have a question when I was making it, though: what should we do in the case where you 'accept' an incoming socket but it's too high-numbered (Linux), or you already have too many sockets (Windows), to actually use it? I assumed we should just close it, but wasn't sure. Thoughts?
-Roger
On Jan 11, 2008 11:32 AM, Roger Pack <rogerpack2005 at gmail.com> wrote:
> I did have a question when I was making it, though, of what to do in
> the case that you 'accept' an incoming socket but it's too high
> numbered (linux) or you already have too many sockets (windows) in
> order to use it? I assumed to just close it, but wasn't sure.

It's possible to accept a socket that's too high-numbered to work in a select set, if the process is permitted to create a larger number of descriptors than FD_SETSIZE. You're really not likely to see this on Windows, however. (On some versions of Windows, file descriptors are pointers rather than index numbers anyway, so there's no meaningful numeric comparison.) On Unix you can make it happen if you go out of your way. So you're making a good point, and it's worth validating that a descriptor is selectable.

I think that our future direction really should be away from select and toward the native high-performance polling mechanism on each platform. We have epoll and kqueue now. We should add /dev/poll for Solaris and IOCP for Windows. Then the only platforms left on select will be back versions of Linux and less common Unixes.
On Jan 11, 2008 12:58 PM, Francis Cianfrocca <garbagecat10 at gmail.com> wrote:
> I think that our future direction really should be away from select and
> toward the native high-performance polling mechanism on all platforms. We
> have epoll and kqueue now. We should add /dev/poll for Solaris and IOCP for
> Windows. Then the only platforms left on select will be back versions of
> Linux and less-common Unixes.

Absolutely. And that brings up a question.

EM.kqueue
EM.epoll

First, IMHO, if those are called, they should return a useful true/false indicating whether that mechanism is enabled or not.

Second, I think that calling those should be unnecessary; EM should use the correct mechanism for the platform, and should instead permit someone to call EM.select in order to force select's use.

Thoughts?

Kirk Haines
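For context, this is the opt-in as it stands at this point in the thread: the application asks for the mechanism before entering the reactor, and where the call is unsupported it appears to be silently ignored, which is exactly the behavior being questioned above. The true/false return and automatic selection are proposals, not current behavior:

require 'eventmachine'

# Ask for the native polling mechanism before the reactor starts.
EM.epoll    # honored on Linux 2.6 builds with epoll support
EM.kqueue   # honored on OS X / BSD with the kqueue support in the head revision

EM.run do
  EM.start_server '0.0.0.0', 7779
end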
On Jan 11, 2008 12:48 PM, Kirk Haines <wyhaines at gmail.com> wrote:
> First, IMHO, if those are called, they should return a useful
> true/false indicating whether that method is enabled or not.
>
> Second, I think that calling those should be unnecessary; that EM
> should use the correct version for the platform, and should instead
> permit someone to call EM.select in order to force select's use.

+1 :-)

--Michael
What can I do to help my patch be accepted? It does have a test as well.
-R

On Fri, Jan 4, 2008 at 9:35 PM, Francis Cianfrocca <garbagecat10 at gmail.com> wrote:
> All the testing you can do is most welcome. And if you can make a test case
> that can go into the distro that's really superb. I was just thinking we
> need more unit tests that are stress tests rather than just correctness
> tests.
>
> Leave assertions and parameter-checks in. Raw speed is of no value if the
> code isn't 100% reliable.