I was having problems with the same server locking up to the point where I can't even get in via SSH. I've already used HTB/TC to reserve bandwidth for my SSH port, but the problem now isn't an attack on the bandwidth. So I'm trying to figure out if there's a way to ensure that SSH is given CPU and I/O priority.

However, my reading so far seems to imply that this probably won't help if the issue is I/O related, and/or that it would require escalating SSH to such levels (above paging/filesystem processes) that it becomes a really bad idea.

Since I'm not the only person who faces problems trying to remotely access a locked-up server, surely somebody must have come up with a solution that didn't involve somebody/something hitting the power button?
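(For concreteness, the kind of prioritization being asked about would look roughly like the following. This is a sketch only: it assumes a stock CentOS sshd that writes its PID to /var/run/sshd.pid and a disk using the CFQ elevator, and, as the replies below note, it tends not to help once the box is truly I/O-bound.)

    # Raise the CPU priority of the master sshd process.
    SSHD_PID=$(cat /var/run/sshd.pid)
    renice -10 -p "$SSHD_PID"    # lower nice value = higher CPU priority

    # Put sshd in the realtime I/O scheduling class (class 1,
    # level 0 = highest); only honored by the CFQ disk scheduler.
    ionice -c1 -n0 -p "$SSHD_PID"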
On 29.06.2011 at 21:50, Emmanuel Noobadmin wrote:

> Since I'm not the only person who faces problems trying to remotely
> access a locked-up server, surely somebody must have come up with a
> solution that didn't involve somebody/something hitting the power
> button?

Yes, it's called "out-of-band management". Have dial-in access to IPMI/iLO interfaces, or just an APC remote-controlled power switch to power off the server.

Rainer
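(For readers who haven't set this up, IPMI-style out-of-band access is typically driven with ipmitool. A minimal sketch; the BMC hostname and credentials below are placeholders.)

    # Query chassis power state over the network (IPMI-over-LAN).
    ipmitool -I lanplus -H bmc.example.com -U admin -P secret chassis power status

    # Hard power-cycle a hung machine.
    ipmitool -I lanplus -H bmc.example.com -U admin -P secret chassis power cycle

    # Attach to the Serial-over-LAN console, if the BMC supports it.
    ipmitool -I lanplus -H bmc.example.com -U admin -P secret sol activate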
On Wed, Jun 29, 2011 at 4:50 PM, Emmanuel Noobadmin <centos.admin at gmail.com> wrote:

> I was having problems with the same server locking up to the point
> where I can't even get in via SSH. I've already used HTB/TC to reserve
> bandwidth for my SSH port, but the problem now isn't an attack on the
> bandwidth. So I'm trying to figure out if there's a way to ensure that
> SSH is given CPU and I/O priority.
>
> However, my reading so far seems to imply that this probably won't
> help if the issue is I/O related, and/or that it would require
> escalating SSH to such levels (above paging/filesystem processes) that
> it becomes a really bad idea.
>
> Since I'm not the only person who faces problems trying to remotely
> access a locked-up server, surely somebody must have come up with a
> solution that didn't involve somebody/something hitting the power
> button?

I would approach this issue from another perspective: who's locking up the server (as in eating all resources), and how to stop/constrain it. You can try to renice the sshd process and see what happens. I'm not entirely sure what 'locked up' means in this context.

-- 
Giovanni Tirloni
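(A quick, sketched way to answer the "who's eating all resources" question with stock tools. iotop is assumed to be installed and requires kernel task I/O accounting.)

    # Top 10 processes by CPU, snapshot mode (no interactive screen).
    ps -eo pid,ppid,pcpu,pmem,stat,comm --sort=-pcpu | head -n 11

    # Top 10 by resident memory.
    ps -eo pid,ppid,pcpu,pmem,rss,comm --sort=-rss | head -n 11

    # Per-process I/O: batch mode, only processes doing I/O, 3 samples.
    iotop -b -o -n 3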
On Thu, Jun 30, 2011 at 03:50:30AM +0800, Emmanuel Noobadmin wrote:

> I was having problems with the same server locking up to the point
> where I can't even get in via SSH. I've already used HTB/TC to reserve
> bandwidth for my SSH port, but the problem now isn't an attack on the
> bandwidth. So I'm trying to figure out if there's a way to ensure that
> SSH is given CPU and I/O priority.

As you've probably figured out, the short answer is no. There are sometimes workarounds, of course.

> Since I'm not the only person who faces problems trying to remotely
> access a locked-up server, surely somebody must have come up with a
> solution that didn't involve somebody/something hitting the power
> button?

In addition to the suggestions already made, one possibility is to attach a serial console or IP KVM. Logging in may still be awful, but at least you won't have to go through sshd. I've been able to log in through a serial getty when sshd was not responding or taking too long (this works maybe 50-75% of the time; the rest of the time it's too late, and even getty is unresponsive). You have the added advantage of being able to log in directly as root if you have PermitRootLogin no in your sshd_config.

If your I/O problem is due to running out of memory and thrashing swap, you can try to be more aggressive with the OOM killer settings.

As someone else mentioned, it might help if you elaborated on "locked up". What are the common scenarios you see?

--keith

-- 
kkeller at wombat.san-francisco.ca.us
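(A hedged sketch of the OOM-killer tuning mentioned above. Which /proc knob exists depends on the kernel: oom_score_adj on 2.6.36 and later, oom_adj on older kernels such as CentOS 5's.)

    # Shield sshd itself from the OOM killer.
    SSHD_PID=$(cat /var/run/sshd.pid)
    echo -1000 > /proc/"$SSHD_PID"/oom_score_adj   # kernels >= 2.6.36
    # echo -17 > /proc/"$SSHD_PID"/oom_adj         # older kernels

    # Or be aggressive the other way: panic on OOM instead of
    # thrashing swap for hours, then auto-reboot 30 seconds later.
    sysctl -w vm.panic_on_oom=1
    sysctl -w kernel.panic=30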
On 6/30/11, Giovanni Tirloni <gtirloni at sysdroid.com> wrote:

> I would approach this issue from another perspective: who's locking up
> the server (as in eating all resources), and how to stop/constrain it.
> You can try to renice the sshd process and see what happens. I'm not
> entirely sure what 'locked up' means in this context.

The server becomes unresponsive to the outside world. It isn't dead: on two occasions, when it happened at off-peak times like a Sunday or 1 am, I could afford to wait it out and saw that it eventually recovers from whatever it was.

It's almost certainly related to disk I/O, with the VM guests fighting over the disks where their virtual disk files live. The hard part, however, is figuring out the exact factors. I know CPU isn't the issue, having set up scripts to log top output whenever the load goes above 5.
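(The load-triggered logging script isn't shown in the thread; the following is a hypothetical reconstruction of the idea, meant to run from cron every minute. The threshold matches the "5" mentioned above; the log directory and script path are assumed.)

    #!/bin/bash
    # Snapshot top and disk stats whenever the 1-minute load exceeds 5.
    THRESHOLD=5
    LOGDIR=/var/log/load-snapshots

    # First field of /proc/loadavg is the 1-minute load average.
    HIGH=$(awk -v t="$THRESHOLD" '{print ($1 > t)}' /proc/loadavg)

    if [ "$HIGH" -eq 1 ]; then
        STAMP=$(date +%Y%m%d-%H%M%S)
        mkdir -p "$LOGDIR"
        top -b -n 1   > "$LOGDIR/top-$STAMP.log"      # batch-mode top snapshot
        iostat -x 1 3 > "$LOGDIR/iostat-$STAMP.log"   # extended disk stats, 3 samples
    fi

Invoked from /etc/crontab with something like "* * * * * root /usr/local/sbin/load-snapshot.sh".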
On Wed, 29 Jun 2011, Keith Keller wrote:

> In addition to the suggestions already made, one possibility is to
> attach a serial console or IP KVM. Logging in may still be awful, but
> at least you won't have to go through sshd. I've been able to log in
> through a serial getty when sshd was not responding or taking too long
> (this works maybe 50-75% of the time; the rest of the time it's too
> late, and even getty is unresponsive). You have the added advantage of
> being able to log in directly as root if you have PermitRootLogin no
> in your sshd_config.

Even with OOB console access, there's still the problem of /bin/login timing out on highly loaded servers. The login.c source in the util-linux package hardwires the login timeout to 60 seconds. If your server can't process the login request in under a minute (not unusual if the load average is high and/or the machine is using swap), you can't log in via *any* console.

So if killing the machine doesn't appeal to you, you still need OOB console access plus

 * a patched version of /bin/login with a longer timeout, or
 * a process-watcher that aggressively kills known troublemakers, or
 * a remotely accessible console that never logs out.

For a while I actually relied on the last choice: a remotely accessible root shell that never logged out. When things got sluggish, I was able to /bin/kill to my heart's content. It wasn't a pretty solution, but it kept me running until I was able to solve the problem properly.

-- 
Paul Heinlein <> heinlein at madboa.com <> http://www.madboa.com/
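(A minimal sketch of the "process-watcher that aggressively kills known troublemakers" option. The process name "hogproc" and the 90% CPU threshold are placeholders, not anything from the thread; a real watcher would match whatever misbehaves on your box.)

    #!/bin/bash
    # Kill any instance of a known troublemaker that exceeds 90% CPU.
    # Run from cron every minute.
    ps -eo pid,pcpu,comm --no-headers | while read -r pid cpu comm; do
        if [ "$comm" = "hogproc" ] && [ "${cpu%.*}" -ge 90 ]; then
            logger "watcher: killing $comm (pid $pid, cpu ${cpu}%)"
            kill -TERM "$pid"
        fi
    done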
On Wednesday, June 29, 2011 04:43:09 PM Rainer Duffner wrote:

> Virtualization is an option, but the trouble is: if the server is
> I/O-constrained anyway, virtualization won't help. Everything will
> just be even slower.

That depends. More expensive servers that would be suitable for use as virtualization hosts also tend to have better I/O subsystems and faster disks, relative to a 'cheap' system with much poorer base I/O bandwidth.
On Wednesday, June 29, 2011 05:20:26 PM Rainer Duffner wrote:

> On 29.06.2011 at 23:17, Lamar Owen wrote:
> > More expensive servers that would be suitable for use as
> > virtualization hosts also tend to have better I/O subsystems and
> > faster disks, relative to a 'cheap' system with much poorer base
> > I/O bandwidth.
>
> The OP clearly stated that he's probably not running a datacenter full
> of DL580g7 servers...

Yeah, I saw that. I was just addressing the I/O slowdown point: if you double the money, you might very well get more than double the performance, and get two VMs running faster than on the cheaper hardware. But it seems he's already doing some virtualization; there just isn't enough detail to sort that out.

Although it would really be interesting to me to see scheduler settings that would indeed allow something of a 'privileged' ssh, or an OOB console that stays responsive even under a punishing load with lots of swapping, which is what the OP originally asked about.
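(One hedged guess at what such "privileged ssh" scheduler settings could look like, using the cgroup v1 cpu and blkio controllers as cgconfig mounts them under /cgroup on CentOS 6. The weights are illustrative, and blkio.weight is only honored by the CFQ elevator.)

    # Create a cgroup for ssh in each controller.
    mkdir -p /cgroup/cpu/ssh /cgroup/blkio/ssh

    # ~10x the default CPU share (the default is 1024).
    echo 10240 > /cgroup/cpu/ssh/cpu.shares

    # Maximum blkio weight (valid range 100-1000).
    echo 1000 > /cgroup/blkio/ssh/blkio.weight

    # Move the master sshd in; children it forks inherit the group.
    SSHD_PID=$(cat /var/run/sshd.pid)
    echo "$SSHD_PID" > /cgroup/cpu/ssh/tasks
    echo "$SSHD_PID" > /cgroup/blkio/ssh/tasks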
On 06/29/11 14:50, Emmanuel Noobadmin wrote:

> I was having problems with the same server locking up to the point
> where I can't even get in via SSH.

Investigate instead of band-aiding...

1) Syslog to a remote host. Remote syslogging rarely stops when the system is disk/iowait-bound.

2) Log diving. Is there anything in the logs around the time of the incidents? Large emails (100MB+ in the body, not as an attachment) can freak out some versions of SpamAssassin; server load would reach 300+, which timed out SSH connections. The syslogs took time to wade through, but they pinpointed the recurring issue.
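(A minimal remote-syslog sketch, assuming rsyslog as shipped with CentOS 6; 192.0.2.10 is a placeholder log host. On older sysklogd systems the same forwarding line goes in /etc/syslog.conf instead.)

    # On the locked-up server: forward everything via UDP ('@' = UDP,
    # '@@' would be TCP) and restart the daemon.
    echo '*.* @192.0.2.10' >> /etc/rsyslog.conf
    service rsyslog restart

    # On the log host, enable UDP reception in /etc/rsyslog.conf:
    #   $ModLoad imudp
    #   $UDPServerRun 514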