Strange indeed.

On a 1750 with bge's:
475 mbufs in use
501/25600 mbuf clusters in use (current/max)
0/3/6656 sfbufs in use (current/peak/max)
1120 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
100 calls to protocol drain routines

On a 2850 (hardware identical to an 1850):
$ netstat -m
4294966848 mbufs in use
565/25600 mbuf clusters in use (current/max)
0/67/6656 sfbufs in use (current/peak/max)
1018 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
16449 requests for I/O initiated by sendfile
589 calls to protocol drain routines

Both experience the "auto reboot" feature. The mbufs on the 2850 look like a counter (signed/unsigned) bug, maybe even just in the printing. Other than that I'm having a hard time interpreting these results.

Regards
Rutger Bevaart

On Nov 20, 2005, at 5:07 PM, Gino Ruopolo wrote:
> Hello Rutger,
>
> I read your post but I'm unable to reply on the list because of some
> firewall settings.
>
> I'm having the same problems with various Dell 1850s and FreeBSD 5.4.
>
> Last week I noticed the following:
>
> # netstat -m
> 4294899289 mbufs in use !?!?!??!!?
> 4294940375/25600 mbuf clusters in use (current/max) !?!?!?!??!
> 0/9/6656 sfbufs in use (current/peak/max)
> 4123460 KBytes allocated to network
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 34 requests for I/O initiated by sendfile
> 2533 calls to protocol drain routines
>
> Here is the output of the same command on a different server with the
> fxp0 ethernet driver, also FreeBSD 5.4 and doing the same work:
>
> # netstat -m
> 194 mbufs in use
> 171/25600 mbuf clusters in use (current/max)
> 0/4/6656 sfbufs in use (current/peak/max)
> 390 KBytes allocated to network
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 0 calls to protocol drain routines
>
> So I've tried putting an old 10/100 PCI ethernet card using the fxp
> driver in a Dell 1850 suffering from the "self-reboot" problem. I'm
> getting 5 days of uptime without a single reboot ...
>
> What about a problem with the em driver?
>
> Regards,
> gino
On Nov 20, 2005, at 1:24 PM, Rutger Bevaart wrote:
> Both experience the "auto reboot" feature. The mbufs on the 2850
> look like a counter (signed/unsigned) bug, maybe even just in the
> printing. Other than that I'm having a hard time interpreting these
> results.

FreeBSD 4.x, 5.x, and 6.x have been stable for me on all Dell hardware.

4.x (currently 4.11) has been running on 1550's, 1650's, 2650 and 1750's for more than 3 years
5.4 on 2450 for ~6 months
6.0 on 1750, 1850, and 2650 since 6.0-RC2, currently running 6.0-REL

Never a flake-out that wasn't due to a hardware failure, and that only on two of the 1550s over 4 years' time. I did have the 5.4 box running 5.4-REL-p7 lock up once, but was unable to determine the cause.
On Sun, Nov 20, 2005 at 07:24:25PM +0100, Rutger Bevaart wrote:
> Strange indeed.
>
> On a 1750 with bge's:
> 475 mbufs in use
> 501/25600 mbuf clusters in use (current/max)
> 0/3/6656 sfbufs in use (current/peak/max)
> 1120 KBytes allocated to network
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 100 calls to protocol drain routines
>
> On a 2850 (hardware identical to an 1850):
> $ netstat -m
> 4294966848 mbufs in use
> 565/25600 mbuf clusters in use (current/max)
> 0/67/6656 sfbufs in use (current/peak/max)
> 1018 KBytes allocated to network
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 16449 requests for I/O initiated by sendfile
> 589 calls to protocol drain routines
>
> Both experience the "auto reboot" feature. The mbufs on the 2850 look
> like a counter (signed/unsigned) bug, maybe even just in the
> printing. Other than that I'm having a hard time interpreting these
> results.

This is documented in the 5.4 errata; it's a leak in the stats counting on SMP machines. It was fixed after 5.4.

Kris
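
As a rough illustration of the signed/unsigned reading suggested in the thread: if a statistics counter drifts below zero and is then printed as an unsigned number, the result sits just under a power of two. The sketch below is not FreeBSD code. The helper name as_signed is hypothetical, and treating the quoted figures as 32-bit counters is an assumption based only on how close they are to 2^32.

```python
def as_signed(value: int, bits: int) -> int:
    """Reinterpret an unsigned counter value as a two's-complement signed integer."""
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the given width")
    # Values with the top bit set correspond to negative numbers.
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value


# Figures quoted in this thread, assumed (not confirmed) to be 32-bit counters.
reported = {
    "2850 mbufs in use": 4294966848,
    "1850 mbufs in use": 4294899289,
    "1850 mbuf clusters in use": 4294940375,
}

for label, value in reported.items():
    print(f"{label}: {value} reads as {as_signed(value, 32)} when signed")
```

Read this way, each "huge" figure is a modest negative number, which fits a counter that was decremented more often than it was incremented rather than billions of mbufs actually being allocated. The same arithmetic applies with bits=64 on a 64-bit kernel.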
Hi Kris, Rutger, and others that have commented on this thread.

I'm happy to hear that I'm not the only one experiencing problems like this. I posted a similar question a month or so ago about a PowerEdge 2850 using SMP (dual Xeons) and never received any responses that helped solve the problem, or even any indication that others had the same problem.

As you know, troubleshooting this is quite difficult, since it can take weeks to go down, and then the "auto-reboot" doesn't leave any clues as to why in the log file - it has just suddenly started again, as if someone had pulled the plug on it. I've been pulling my hair out.

My machine crashed twice in the last month or so, within two weeks of each other. Both times, it was just as a cron task was about to run the mysqlhotcopy script to back up some SQL databases that are hosted on that machine, so I thought it might have something to do with that (I had it running as a root cron task, so I figured that maybe some bug in it caused things to go weird - it was running as root, after all). I changed it to run under a less privileged user and the machine hasn't died for about 2 1/2 weeks. But that's hardly conclusive proof of having solved the problem - it's probably planning on surviving just long enough to fail at the point I need it the most.

It sounds as though memory buffer allocations are going wacky or something, in which case anything could take it down given the wrong combination of events.
In any case, we're running the amd64 version of FreeBSD 5.4-RELEASE-p6:

FreeBSD 5.4-RELEASE-p6 #3: Fri Aug 5 18:18:10 MDT 2005

A netstat -m (which I'd never tried before) yields:

18446744073709551402 mbufs in use
49/25600 mbuf clusters in use (current/max)
0/0/0 sfbufs in use (current/peak/max)
44 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
884 calls to protocol drain routines

Obviously, the number of mbufs currently in use on that machine is way out to lunch. And interestingly, my max mbuf clusters value of 25600 is identical to the one in the other netstat -m reports from people having this problem.

Another machine (an older single-CPU Dell), on which I'm running the i386 version of FreeBSD 5.4-RELEASE-p5:

FreeBSD 5.4-RELEASE-p5 #1: Thu Jul 21 22:30:46 MDT 2005

has a more sane netstat -m:

130 mbufs in use
128/8896 mbuf clusters in use (current/max)
0/177/2480 sfbufs in use (current/peak/max)
288 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
208493 requests for I/O initiated by sendfile
26697 calls to protocol drain routines

But here's about where any troubleshooting on my own reaches its limit. I noticed that Kris mentioned it was a known problem in the stats counting for SMP machines and had been fixed, but I haven't been able to find a reference to that, or any indication of how to get the fix. Was the bug purely an accounting error in what netstat reports, or is it something that could have taken down the machine as has been happening?

If switching to single-CPU mode works, it's good to hear that I have an option if things continue to act up. But I'd really rather not have to "dumb down" the machine to one CPU when there is the potential of two. Most of the time it's not under a huge load, but periodically there are massive spikes, and that's where having two CPUs really helps.
If anyone can shed further light on a fix for this problem, it would be greatly appreciated!

Dan

--
Syzygy Research & Technology
Box 83, Legal, AB T0G 1L0 Canada
Phone: 780-961-2213
I just thought of one other bit of info that may be relevant to the auto-rebooting problem I've experienced with our PowerEdge 2850.

Since the problem may be related to memory allocation, I thought I should mention that we have more memory in that machine than is typical for some users. We have 5 GB installed. From "top":

Mem: 175M Active, 4121M Inact, 244M Wired, 244M Cache, 214M Buf, 23M Free
Swap: 10G Total, 12K Used, 10G Free

If this turns out to be an amd64 vs. i386 issue and we were to revert to the i386 branch, would we still be able to access this memory, or would i386 be limited to 4 GB (or maybe 2 GB) due to 32-bit addressing? We don't need anywhere near this much memory for user space programs, but the kernel does make good use of it to cache commonly accessed regions of the file system in memory.

Dan
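
The arithmetic behind the 32-bit addressing question can be sketched as follows. This only computes how much memory a given number of physical address bits can reach; it says nothing about how a particular FreeBSD i386 kernel is configured. The 36-bit case corresponds to the x86 PAE extension, and the function name addressable_gib is hypothetical.

```python
GIB = 1 << 30  # bytes in one GiB


def addressable_gib(address_bits: int) -> int:
    """Return the size, in GiB, of the space reachable with this many address bits."""
    return (1 << address_bits) // GIB


installed_gib = 5  # the amount of RAM mentioned in the message above

for label, bits in (
    ("32-bit physical addresses", 32),
    ("36-bit physical addresses (PAE)", 36),
):
    limit = addressable_gib(bits)
    verdict = "fits" if installed_gib <= limit else "does not fit"
    print(f"{label}: limit {limit} GiB, {installed_gib} GiB installed {verdict}")
```

With plain 32-bit physical addressing the ceiling is 4 GiB, so 5 GiB would not be fully usable. In practice the usable amount is usually somewhat lower still, because part of that range is reserved for device mappings.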
Thanks, everyone, for the replies over the past few days about the "unsolicited" rebooting problem. At first, I thought there was a memory allocation bug, judging by the output of "netstat -m", but apparently it's just a cosmetic statistics reporting bug and nothing related to the instability itself.

Unfortunately, it means that I still haven't been able to find a solution to the problem (and apparently, I'm not the only one to experience it). Considering that the only machine we have that experiences the problem is a production machine, and that it happens infrequently at that, it's difficult to test and resolve.

It's been suggested that FreeBSD 6.0 may fix the problem, but considering the inevitable bugs that creep into new releases, I'm reluctant to go there until things settle down in 6.0 (plus, I haven't seen any documentation that implies that moving to 6.0 will fix the problem in any case). If it weren't a production machine that needs to be reliable, stable, and available, I'd have a better chance of being able to test it under 6.0.

Some speculation has been made about it being triggered by possibly buggy ethernet drivers, etc. In my case, though possible, I doubt it, since my machine has rebooted itself right when mysqlhotcopy was about to run (and it runs locally without causing any network activity that I'm aware of). My first thought was that it might be caused by faulty memory or something, but Dell's hardware diagnostics reported everything to be perfectly okay.

What I find strange is that it's not that the kernel locks up or anything - the machine just suddenly restarts (caches aren't flushed to disk or anything - it's just like someone literally pulls the power plug midstream, and then plugs it back in). The only indication that something weird happened is that in the server logs everything seems to be crunching away happily, and then suddenly I see the boot messages from when it restarted all by itself.
In any case, if anyone else with a dual-processor machine (I have a PowerEdge 2850 myself) has experienced the rebooting problem discussed a few days ago and resolved it, I'd very much like to hear from you.

Dan
Rutger Bevaart wrote:
> Same here on several 1750's, 1850's and 2850's. Tomorrow I'll disable USB
> in the BIOS on one of the 1750's and see if it makes a difference. It's
> the only one of the set that I could get downtime for because it rebooted
> yesterday ;-)

I disabled USB in the BIOS on my 2850 much earlier on, when I was getting interrupt storms, since I didn't need USB anyway. It solved the problem of the interrupt storms, but it didn't seem to have any impact on the mysterious unsolicited rebooting problem.

Claus Guttesen wrote:
> It's not any comfort to you but I have two Dell PE 1750's running very
> reliably using FreeBSD 5.4 stable as of Wed. the 28th of Sep. 2005.
> It has two Xeons at 3 GHz, 2 GB RAM, a LSILogic 1030 Ultra4 Adapter.
> HTT is *off*.
>
> HTT does not yield any higher performance for most purposes. I can
> send you my kernel if you want.

It actually may be a comfort, since perhaps HTT is related to the culprit. Since the last crash, about a month ago, I disabled HTT, both in the kernel as well as in the BIOS. So as far as I know, it's been completely disabled (and the boot messages and top only show 2 CPUs). And I haven't had the system go down for nearly a month now.

Of course, I also did some other things at the same time, so it's unclear which specifically may have helped. I had noticed that in the past it had rebooted itself twice right while running mysqlhotcopy as root during a period when the server may have been rather heavily loaded. So in addition to turning off hyperthreading, I also moved the mysqlhotcopy run to a time likely to be under lighter load, and modified things so it isn't running as root any longer. Not that I think mysqlhotcopy itself was the culprit, but it does cause a fairly large burst of disk activity when it is running, and it does seem to be related to triggering the event, at least in my situation. In any case, since I've done those three things, I haven't had a crash yet.
Of course, the lack of a result doesn't prove anything, but the more time that passes, the better I feel. That is, until one day I wake up to find that it died again. In any case, if that happens, I'll know more things that the problem isn't related to.

Vivek Khera wrote:
> I'd recommend running the Dell diags. They're pretty good at picking
> out hardware trouble, which it sounds like the OP is having.

In my case anyway, I have run the Dell diagnostics, and they showed everything to be just fine.

Kevin Oberman wrote:
> As far as I can tell, hyperthreading is not much of a win for anyone.
> See the article at:
> http://news.zdnet.co.uk/0,39020330,39237341,00.htm
>
> It reports that HTT slows performance even on threaded and,
> theoretically, HTT-ideal apps. (And this was with Windows.)

So I've heard. I was hoping that hyperthreading might be able to help a dedicated MySQL server handle a bit higher load, but I never had the chance to benchmark it with and without hyperthreading before I had to put the machine into production. So it's disabled now - it can't hurt the stability of the system and can only potentially help it. Time will tell.

Thanks for your replies, everyone!

Dan