I'll try to be as specific as possible without overkill.

I have a Compaq 1850R with dual P3 450s and a gig of RAM running FreeBSD 4.8-RELEASE-p16. The 3 internal 36G SCSI 10k disks are set up as RAID 5 on the Smart 2 SL RAID controller. SMP is enabled. FreeBSD uses the ida driver to interact with the RAID controller.

This machine is the most I've tasked these 1850s to do so far, and it started Freezing shortly after I was forced to put it in production. There is a lot of disk I/O since this is a mail server (POP & SMTP), and the disk is being accessed over NFS as well. Time between freezes ranges from 15 hours to 72 hours.

I've set up a lot of debugging to try to find out what is going on with the machine, and a little more light was shed this morning.

Let me define freeze:
- no network response at all
- display was still going to the monitor
- Alt-Fn keys would switch displays, but...
- type in a username and hit enter and it just acknowledges the enter with line feeds.
- mrtg was also registering a huge release of memory right before the crash. The average is 10-100 MB of free memory, and it would register all of the memory being freed up.

When it last froze on Saturday, I set up 2 displays running commands, since the machine appeared to keep updating the monitor after it had died to everything else. One was running top -ores (since mrtg was pointing toward memory) and the other was running systat -vm 5.

When it froze today, there were about 15 processes in State: inode running at priority -14. They weren't all sendmail either. snmpd, radiator (just for accounting), and sendmail were running at -14. The top process and other processes were still running, but all services had died again and I couldn't pull it out of the coma without hitting the power button...again.

None of the logs record anything out of the ordinary. The machine goes from normal operation to freeze too fast to record the problem. Also, if it is a disk access problem, then that would explain why my logs don't have anything.
If I had to put this as questions...:
- What do I do to keep the freezes from happening?
- What can I do to record more information to find out what specifically is causing the freeze? (Or is this enough information and I just don't know the answer?)
- Has anyone else put the Smart 2 SL or the 1850Rs through some heavy lifting on 4.8?

I'm going to do some research on the Smart 2 SL and see if there are any updates that the SmartStart CD might have put an SEP field around, and I'm going to try to find prices for drop-in replacement controllers. The disks are brand new from newegg, so I don't suspect them yet.

Thanks in advance for any help, suggestions, pointers, or assistance,

Gerald

P.S. First post to the FreeBSD lists. Go easy on me.
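One way to approach the "record more information" question is to ship process-state snapshots to another machine, so the record does not depend on the local disk. The following is only a minimal sketch under several assumptions: Python is available on the server, a remote host (the hypothetical `loghost.example.net`) accepts syslog messages over UDP port 514, and `ps -axo pid,pri,state,wchan,command` prints the columns of interest. The tag `freezewatch` and all function names are made up for illustration.

```python
import socket
import subprocess


def format_syslog(message, facility=1, severity=5, tag="freezewatch"):
    """Build a BSD-syslog style datagram: <PRI>tag: message.

    PRI is facility * 8 + severity (1 = user, 5 = notice by default).
    """
    pri = facility * 8 + severity
    return ("<%d>%s: %s" % (pri, tag, message)).encode("ascii", "replace")


def snapshot():
    """Return the current ps output as a list of lines (empty on failure)."""
    result = subprocess.run(
        ["ps", "-axo", "pid,pri,state,wchan,command"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.splitlines()


def send_lines(lines, host, port=514):
    """Send each line as its own UDP datagram; no local disk I/O involved."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for line in lines:
            sock.sendto(format_syslog(line), (host, port))
    finally:
        sock.close()


# Hypothetical usage, run in a loop from a console session:
#   import time
#   while True:
#       send_lines(snapshot(), "loghost.example.net")
#       time.sleep(30)
```

Since the network also stops responding during a freeze, at best this would preserve the last snapshots taken before the machine went away, which may or may not be close enough to the event to be useful.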
> FreeBSD uses the ida driver to interact with the RAID controller.

I've used this driver a lot on various bits of Compaq hardware and it has always worked excellently for me. I have had a pair of SMART 2-SLs, which I then upgraded to a single 3200 and subsequently a 4200. All work fine. I've been using various releases of FreeBSD-4 on them, the latest being 4.9 obviously, and have never seen a problem with the ida driver.

> - What do I do to keep the freezes from happening?

I have, however, seen the "Freezing" thing on Compaq hardware, related to a different controller. It only happened with SMP enabled. I would suggest you try running them without SMP for a while and see if it still freezes.

In my case it was the sym0 on-board SCSI controller, and eventually some extra code was added to sym0 to catch lost interrupts and unfreeze the machine.

-pcf.
On Tue, 27 Apr 2004, Gerald wrote:

> Saturday when it froze last I setup 2 displays running commands since it
> appeared to keep running to the monitor after it would die to all else.
> One was running top -ores (since mrtg was pointing around memory) and the
> other was running systat -vm 5. When it froze today, there were about 15
> processes in State: inode running at priority -14. They weren't all
> sendmail either. snmpd, radiator (just for accounting), and sendmail were
> running at -14. The top process and other processes were still running but
> all services had died again and I couldn't pull it out of the coma
> without hitting the power button...again.

This machine died again Wed night at 9 PM. I had already installed a kernel sans SMP but had not booted into it yet, so I'll see how long that goes. If we make it past tonight, I'll know for certain after this weekend whether that resolved the issue.

The biggest clue thus far in my problem on this machine has been when I left a top running on the console. When it freezes, the processes are still alive but sleeping in "STATE: inode". This was not one process in that state but 80% of the processes that were still on the screen. Most of those processes had a priority of -14.

I don't know UFS or the kernel source code well enough to know what problem a ton of processes waiting for "inode" would indicate, but I know what an inode is. I have plenty of inodes free, and a lack of response from the disk or RAID array would leave the kernel waiting for some type of inode information.

Can someone shed some light on what this alone might indicate? Could it be the RAID controller OR a hard drive flaking? Would something (the kernel) missing an interrupt from something else (a disk or RAID card) cause this? How can I debug this and find out why it was waiting for inodes? Can someone even clarify what a process in STATE: inode means literally? (reading/writing/both/taking inventory/requesting inode information)

Gerald
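The "80% of the processes" observation above was read off a top screen by eye. A rough sketch of how the same tally could be computed from captured `ps` output follows. It assumes the text came from something like `ps -axo pid,pri,state,wchan,command`, with a header line first and the wait channel in the fourth column; the function names and the sample rows in the usage note are hypothetical, not data from this machine.

```python
from collections import Counter


def tally_wchan(ps_text):
    """Count processes per wait channel (4th column), skipping the header."""
    counts = Counter()
    for line in ps_text.splitlines()[1:]:
        # Split into at most 5 fields so the command may contain spaces.
        fields = line.split(None, 4)
        if len(fields) < 4:
            continue  # blank or malformed line
        counts[fields[3]] += 1
    return counts


def stuck_fraction(ps_text, wchan="inode"):
    """Fraction of listed processes sleeping on the given wait channel."""
    counts = tally_wchan(ps_text)
    total = sum(counts.values())
    return counts[wchan] / total if total else 0.0
```

For example, feeding it a made-up capture with two rows whose wait channel is `inode` and one whose wait channel is `select` would give a fraction of two thirds. Logging this number over time could show whether the pile-up builds gradually or appears all at once.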
On Tue, 27 Apr 2004, Gerald wrote:

> This machine is the most I've tasked these 1850s to do so far and it has
> started Freezing shortly after I was forced to put it in production.

For googlers and others who read this thread: this has been identified as an SMP issue. I installed a kernel without the SMP options and it has ceased freezing. It's currently running 4.8-RELEASE.

10:32AM up 5 days, 13:18, 3 users, load averages: 0.34, 0.30, 0.32

I will try to chase this down further, but for now I have to get back to work.

Thanks to Pete French for suggesting I shut off SMP and for the other insight he's given me into the Compaqs. Thanks in advance to anyone who replies to my questions about what the inode state is, so that I might be able to help the SMP people track down a bug.

Gerald