Front Line
2008-Aug-27  11:40 UTC
[Mongrel] Mongrel hangs with 100% CPU / EBADF (Bad file descriptor)
We have a server running 10 mongrel_cluster instances with Apache
in front of them, and every now and then one or more of them hang.
No activity is seen in the database (we're using ActiveRecord
sessions).
MySQL with InnoDB tables. SHOW INNODB STATUS shows no locks. SHOW
PROCESSLIST shows nothing.
The server runs Debian Linux 4.0.
Ruby is: ruby 1.8.6 (2008-03-03 patchlevel 114) [i486-linux]
Rails is: Rails 1.1.2 (yes, quite old)
We're using the native MySQL connector (gem install mysql)
"strace -p PID" gives the following in a loop for the hung mongrel
process:
gettimeofday({1219834026, 235289}, NULL) = 0
select(4, [3], [0], [], {0, 905241})    = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 235477}, NULL) = 0
select(4, [3], [0], [], {0, 905053})    = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 235654}, NULL) = 0
select(4, [3], [0], [], {0, 904875})    = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 235829}, NULL) = 0
select(4, [3], [0], [], {0, 904700})    = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 236017}, NULL) = 0
select(4, [3], [0], [], {0, 904513})    = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 236192}, NULL) = 0
select(4, [3], [0], [], {0, 904338})    = -1 EBADF (Bad file descriptor)
gettimeofday({1219834026, 236367}, NULL) = 0
...
I used lsof and found that the process used 67 file descriptors (lsof -p
PID |wc -l)
Is there any other way I can debug this, so that I could, for example,
determine which file descriptor is "bad"?
Any other info or suggestions? Has anybody else seen this?
The site is fairly busy, but not overly so; load averages are usually around
0.3.
-- 
Posted via http://www.ruby-forum.com/.
Roger Pack
2008-Sep-27  15:20 UTC
[Mongrel] Mongrel hangs with 100% CPU / EBADF (Bad file descriptor)
> gettimeofday({1219834026, 235289}, NULL) = 0
> select(4, [3], [0], [], {0, 905241}) = -1 EBADF (Bad file descriptor)
> gettimeofday({1219834026, 235477}, NULL) = 0
> select(4, [3], [0], [], {0, 905053}) = -1 EBADF (Bad file descriptor)
> gettimeofday({1219834026, 235654}, NULL) = 0
> select(4, [3], [0], [], {0, 904875}) = -1 EBADF (Bad file descriptor)
> gettimeofday({1219834026, 235829}, NULL) = 0
> select(4, [3], [0], [], {0, 904700}) = -1 EBADF (Bad file descriptor)
> gettimeofday({1219834026, 236017}, NULL) = 0
> select(4, [3], [0], [], {0, 904513}) = -1 EBADF (Bad file descriptor)
> gettimeofday({1219834026, 236192}, NULL) = 0
> select(4, [3], [0], [], {0, 904338}) = -1 EBADF (Bad file descriptor)
> gettimeofday({1219834026, 236367}, NULL) = 0
> ...
>
> I used lsof and found that the process used 67 file descriptors (lsof -p
> PID |wc -l)
>

You could try evented mongrel.

I think the real problem is that, internally, Ruby's select mechanism
isn't designed to handle -1 return values from select. If that is the
case, I'd call it a Ruby bug. In Python, when this happens, select raises
an exception and relies on the caller to loop through each socket and
discover the offending one. I can only hope that 1.9 handles this
situation better.

I believe Ruby's select also doesn't handle more than 1024 socket
descriptors (it ignores those above 1024), so I'd call it less than
perfect. [1]

-=R

[1] http://rubyforge.org/tracker/index.php?func=detail&aid=20088&group_id=426&atid=1698
-- 
Posted via http://www.ruby-forum.com/.
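
To illustrate the Python behaviour described in this reply, here is a minimal sketch of "catch the error, then probe each descriptor". It is not Mongrel's or Ruby's code. The helper names find_bad_fds and select_or_report are made up for this example, and using os.fstat as the probe is one possible choice among several. In the strace output quoted above, select is watching fd 3 for reading and fd 0 for writing, so those would be the two candidates to probe.

```python
import errno
import os
import select


def find_bad_fds(fds):
    """Return the descriptors in fds that the kernel reports as not open."""
    bad = []
    for fd in fds:
        try:
            os.fstat(fd)  # cheap probe: fails with EBADF if fd is closed
        except OSError as exc:
            if exc.errno != errno.EBADF:
                raise  # some other failure; do not hide it
            bad.append(fd)
    return bad


def select_or_report(read_fds, write_fds, timeout):
    """Call select once; on EBADF, name the bad descriptors instead of looping."""
    try:
        return select.select(read_fds, write_fds, [], timeout)
    except OSError as exc:
        if exc.errno != errno.EBADF:
            raise
        bad = find_bad_fds(list(read_fds) + list(write_fds))
        raise RuntimeError("select() got EBADF; closed descriptors: %r" % bad)
```

The point of the sketch is that the caller stops and reports the offending descriptor rather than retrying the same select call, which is what the strace loop suggests the hung process is doing.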