thr3ads.net - freebsd stable - FreeBSD unstable on Dell 1750 using SMP? [Nov 2005]

Rutger Bevaart

2005-Nov-20 10:24 UTC

FreeBSD unstable on Dell 1750 using SMP?

Strange indeed.

On a 1750 with bge's:
475 mbufs in use
501/25600 mbuf clusters in use (current/max)
0/3/6656 sfbufs in use (current/peak/max)
1120 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
100 calls to protocol drain routines

On a 2850 (hardware identical to an 1850):
$ netstat -m
4294966848 mbufs in use
565/25600 mbuf clusters in use (current/max)
0/67/6656 sfbufs in use (current/peak/max)
1018 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
16449 requests for I/O initiated by sendfile
589 calls to protocol drain routines

Both experience the "auto reboot" feature. The mbufs on the 2850 look
like a counter (signed/unsigned) bug, maybe even just in the  
printing. Other than that I'm having a hard time interpreting these  
results.
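
The signed/unsigned reading can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the counter is 32 bits wide on this machine and stored in two's complement, and the helper name to_signed is made up for the example.

```python
def to_signed(value: int, bits: int) -> int:
    """Reinterpret an unsigned counter value as a two's-complement signed integer."""
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the given width")
    sign_bit = 1 << (bits - 1)
    # If the top bit is set, the unsigned value stands for value - 2**bits.
    return value - (1 << bits) if value & sign_bit else value


# "mbufs in use" figures copied from the two netstat -m outputs above.
print(to_signed(4294966848, 32))  # 2850 -> -448
print(to_signed(475, 32))         # 1750 -> 475
```

If the assumption holds, 4294966848 is -448 printed as an unsigned value, which would fit a counter that was decremented more often than it was incremented.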

Regards
Rutger Bevaart

On Nov 20, 2005, at 5:07 PM, Gino Ruopolo wrote:
>
> Hello Rutger,
>
> I read your post but I'm unable to reply on the list because of some
> firewall settings.
>
> I'm having the same problems with various Dell 1850s and FreeBSD 5.4
>
> Last week I noticed the following:
>
> #netstat -m
> 4294899289 mbufs in use    !?!?!??!!?
> 4294940375/25600 mbuf clusters in use (current/max)     !?!?!?!??!
> 0/9/6656 sfbufs in use (current/peak/max)
> 4123460 KBytes allocated to network
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 34 requests for I/O initiated by sendfile
> 2533 calls to protocol drain routines
>
> Here is the output of the same command on a different server with  
> fxp0 ethernet driver, also FBSD 5.4 and doing the same work:
>
> #netstat -m
> 194 mbufs in use
> 171/25600 mbuf clusters in use (current/max)
> 0/4/6656 sfbufs in use (current/peak/max)
> 390 KBytes allocated to network
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 0 calls to protocol drain routines
>
> So I've tried putting an old pci ethernet 10/100 using fxp driver
> on a Dell1850 suffering the "self-reboot" problem.  I'm getting 5
> days of uptime without a single reboot ...
>
> What about a problem with the em driver?
>
> Regards,
> gino

Vivek Khera

2005-Nov-21 07:17 UTC

FreeBSD unstable on Dell 1750 using SMP?

On Nov 20, 2005, at 1:24 PM, Rutger Bevaart wrote:
> Both experience the "auto reboot" feature. The mbufs on the 2850
> look like a counter (signed/unsigned) bug, maybe even just in the  
> printing. Other than that I'm having a hard time interpreting these  
> results.
FreeBSD 4.x, 5.x, and 6.x have been stable for me on all Dell hardware.

4.x (currently 4.11) has been running on 1550's, 1650's, 2650 and  
1750's for > 3 years
5.4 on 2450  for ~6 months
6.0 on 1750, 1850, and 2650 since 6.0-RC2, currently running 6.0-REL.

Never a flake-out that wasn't due to a hardware failure, and that only on two
of the 1550s over 4 years' time.  I did have the 5.4 box, running 5.4-REL-p7,
lock up once, but was unable to determine the cause.

Kris Kennaway

2005-Nov-23 13:39 UTC

FreeBSD unstable on Dell 1750 using SMP?

On Sun, Nov 20, 2005 at 07:24:25PM +0100, Rutger Bevaart wrote:
> Strange indeed.
> 
> On a 1750 with bge's:
> 475 mbufs in use
> 501/25600 mbuf clusters in use (current/max)
> 0/3/6656 sfbufs in use (current/peak/max)
> 1120 KBytes allocated to network
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 100 calls to protocol drain routines
> 
> On a 2850 (hardware identical to an 1850):
> $ netstat -m
> 4294966848 mbufs in use
> 565/25600 mbuf clusters in use (current/max)
> 0/67/6656 sfbufs in use (current/peak/max)
> 1018 KBytes allocated to network
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 16449 requests for I/O initiated by sendfile
> 589 calls to protocol drain routines
> 
> Both experience the "auto reboot" feature. The mbufs on the 2850 look
> like a counter (signed/unsigned) bug, maybe even just in the
> printing. Other than that I'm having a hard time interpreting these  
> results.
This is documented in the 5.4 errata; it's a leak in the stats
counting on SMP machines.  It was fixed after 5.4.

Kris
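
One way a statistics counter can drift on an SMP machine is a lost update between two CPUs. The thread does not say that this is the actual mechanism behind the 5.4 bug, so the sketch below is an assumption for illustration only. It replays a fixed interleaving of non-atomic read-modify-write steps from two hypothetical CPUs against one shared counter.

```python
def run(schedule):
    """Replay (cpu, op) steps against one shared counter.

    Each CPU updates the counter non-atomically: 'load' copies the shared
    value into a CPU-local register, 'inc'/'dec' change that register, and
    'store' writes the register back to the shared counter.
    """
    shared = 0
    reg = {}
    for cpu, op in schedule:
        if op == "load":
            reg[cpu] = shared
        elif op == "inc":
            reg[cpu] += 1
        elif op == "dec":
            reg[cpu] -= 1
        elif op == "store":
            shared = reg[cpu]
    return shared


# CPU 0 allocates an mbuf (+1) while CPU 1 frees one (-1).
serial = [(0, "load"), (0, "inc"), (0, "store"),
          (1, "load"), (1, "dec"), (1, "store")]
racy = [(0, "load"), (1, "load"), (0, "inc"),
        (0, "store"), (1, "dec"), (1, "store")]

print(run(serial))        # 0: both updates are counted
print(run(racy))          # -1: CPU 0's increment is overwritten
print(run(racy) % 2**32)  # 4294967295: the same value shown as unsigned 32-bit
```

In this model each lost increment pushes the counter one step below the true count, and a negative total printed as unsigned looks like the huge values quoted above.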

Dan Charrois

2005-Nov-24 23:40 UTC

FreeBSD unstable on Dell 1750 using SMP?

Hi Kris, Rutger, and others that have commented on this thread.

I'm happy to hear that I'm not the only one experiencing problems  
like this.  I posted a similar question a month or so ago about a  
PowerEdge 2850 using SMP (dual Xeons) and never received any  
responses that helped solve the problem, or even any indication that  
others had the same problem.  As you know, troubleshooting this is  
quite difficult, since it can take weeks to go down, and then the  
"auto-reboot" doesn't result in any clues as to why in the log file -
it's just suddenly started again as if someone had pulled the plug on
it.  I've been pulling my hair out.

My machine crashed twice in the last month or so, within two weeks of  
each other.  Both times, it was just as a cron task was about to  
schedule the mysqlhotcopy script to back up some SQL databases that  
are being hosted on that machine, so I thought it may have something  
to do with that (I had it running as a root crontask so figured that  
maybe some bug in that caused things to go weird - it was running as  
root, after all).  I changed it to run under a less privileged user  
and the machine hasn't died for about 2 1/2 weeks.  But that's hardly  
a conclusive case of having solved the situation - it's probably  
planning on surviving just long enough to last until the point I need  
it the most to work.  It sounds as though memory buffer allocations
are going wacky or something, in which case anything could take it down
given the wrong combination of events.

In any case, we're running the amd64 version of FreeBSD 5.4-RELEASE-p6
(FreeBSD 5.4-RELEASE-p6 #3: Fri Aug  5 18:18:10 MDT 2005).

A netstat -m (which I'd never tried before) yields:

18446744073709551402 mbufs in use
49/25600 mbuf clusters in use (current/max)
0/0/0 sfbufs in use (current/peak/max)
44 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
884 calls to protocol drain routines

Obviously, the mbufs in use currently on that machine is way out to  
lunch.  And interestingly, it looks as though my max mbuf clusters in  
use of 25600 is identical to the other netstat -m reports from people  
having this problem.
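
One way to read the mbuf figure: it sits just below 2^64, which is what a small negative count looks like when a 64-bit counter is printed as unsigned. The sketch below assumes two's-complement counters that are either 32 or 64 bits wide; decode_counter is a hypothetical helper, not part of netstat.

```python
from typing import Tuple


def decode_counter(value: int) -> Tuple[int, int]:
    """Return (assumed width in bits, signed interpretation) for a printed counter."""
    bits = 32 if value < 2**32 else 64
    # Values in the upper half of the range have the sign bit set.
    signed = value - 2**bits if value >= 2**(bits - 1) else value
    return bits, signed


print(decode_counter(18446744073709551402))  # (64, -214)
print(decode_counter(49))                    # (32, 49)
```

Read this way, the amd64 box reports -214 mbufs, the same kind of small negative drift as the 32-bit figures earlier in the thread, only printed at a wider width.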

Another machine (an older single-CPU Dell), on which I'm running the
i386 version of FreeBSD 5.4-RELEASE-p5 (FreeBSD 5.4-RELEASE-p5 #1: Thu
Jul 21 22:30:46 MDT 2005), has a more sane netstat -m:

130 mbufs in use
128/8896 mbuf clusters in use (current/max)
0/177/2480 sfbufs in use (current/peak/max)
288 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
208493 requests for I/O initiated by sendfile
26697 calls to protocol drain routines

But here's about where any troubleshooting on my own reaches its
limit.  I noticed that Kris mentioned it was a known problem in the
stats counting for SMP machines and had been fixed, but I haven't been
able to find a reference to that, or any indication of how to apply
the fix.  Is this supposed to have been an accounting bug in the
report for netstat, or is it something which would have taken down
the machine as has been happening?

If switching to single CPU mode works, it's good to hear that I have  
an option if things continue to act up.  But I'd really rather not  
have to "dumb down" the machine to one CPU when there is the  
potential of two.  Most of the time it's not under a huge load, but  
periodically there are massive spikes, and that's where having two
CPUs really helps.

If anyone can shed further light on a fix for this problem, it would  
be greatly appreciated!

Dan
--
Syzygy Research & Technology
Box 83, Legal, AB  T0G 1L0 Canada
Phone: 780-961-2213

Dan Charrois

2005-Nov-24 23:52 UTC

FreeBSD unstable on Dell 1750 using SMP?

I just thought of one other bit of info that may be relevant to the  
auto-rebooting problem I've experienced with our PowerEdge 2850.   
Since the problem may be related to memory allocation, I thought I  
should mention that we have more memory in that machine than is
typical for some users.  We have 5 Gigs installed.  From "top":

Mem: 175M Active, 4121M Inact, 244M Wired, 244M Cache, 214M Buf, 23M  
Free
Swap: 10G Total, 12K Used, 10G Free

If this turns out to be an amd64 vs. i386 issue and we were to revert
to the i386 branch, would we still be able to access this memory, or
would i386 be limited to 4 GB (or maybe 2 GB) due to 32-bit
addressing?  We don't need anywhere near this much memory for user
space programs, but the kernel does make good use of it to cache  
commonly accessed regions of the file system in memory.
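
The arithmetic behind the 32-bit question can be sketched as follows. The 36-bit figure is the physical address width that x86 PAE provides. Whether a PAE kernel would be a workable option on this particular machine is not something the thread establishes, so treat that line as an assumption. The helper name is made up for the example.

```python
GIB = 2**30  # bytes per GiB


def addressable_gib(address_bits: int) -> int:
    """Upper bound on byte-addressable memory, in GiB, for a given address width."""
    return 2**address_bits // GIB


installed_gib = 5  # the amount of RAM reported for the 2850

print(addressable_gib(32))                   # 4
print(addressable_gib(36))                   # 64 (assumes PAE-style 36-bit physical addresses)
print(installed_gib <= addressable_gib(32))  # False: 5 GiB does not fit in 32 bits
```

Plain 32-bit physical addressing tops out at 4 GiB, so at least part of the 5 GiB would be out of reach without some extension such as PAE.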

Dan
--
Syzygy Research & Technology
Box 83, Legal, AB  T0G 1L0 Canada
Phone: 780-961-2213

Dan Charrois

2005-Nov-29 10:59 UTC

FreeBSD unstable on Dell 1750 using SMP?

Thanks everyone for replies made over the past few days about the  
"unsolicited" rebooting problem.  At first, I thought there was a  
memory allocation bug as judged by the output of "netstat -m", but  
apparently it's just a cosmetic statistics reporting bug and nothing  
related to the instability itself.

Unfortunately, it means that I still haven't been able to find a  
solution to the problem (and apparently, I'm not the only one to  
experience it).  Considering that we only have the one machine, which  
happens to be a production machine, that experiences the problem  
(infrequently at that), it's difficult to test and resolve.  It's  
been suggested that FreeBSD 6.0 may fix the problem, but considering  
some of the inevitable bugs that creep into new releases, I'm  
reluctant to go there until things settle down in 6.0 (plus, I  
haven't seen any documentation that implies that a fix for the  
problem will result from using 6.0 in any case).  If it weren't a  
production machine that needs to be reliable, stable, and available,  
I'd have a better chance at being able to test it under 6.0.

Some speculation has been made about it being triggered by possibly  
buggy ethernet drivers, etc.  In my case, though possible, I doubt it  
- since my machine has rebooted itself right when mysqlhotcopy was  
about to run on the machine (and it runs locally without causing any  
network activity that I'm aware of).  The first thought I had was  
that it may be caused by faulty memory or something, but Dell's  
hardware diagnostics all tested everything to be perfectly okay.

What I find strange is that it's not that the kernel locks up or
anything - the machine just suddenly restarts (caches aren't flushed
to disk or anything - it's just like someone literally pulls the
power plug midstream, and then plugs it back in).  The only indication
that something weird goes on is that in the server logs everything
seems to be crunching away happily and then suddenly I see the boot
messages when it restarts all by itself.

In any case, if anyone else with a dual processor machine (I have a  
PowerEdge 2850 myself) has experienced the rebooting problem  
discussed a few days ago and resolved it, I'd very much like to hear  
from you.

Dan
--
Syzygy Research & Technology
Box 83, Legal, AB  T0G 1L0 Canada
Phone: 780-961-2213

Dan Charrois

2005-Nov-29 22:24 UTC

FreeBSD unstable on Dell 1750 using SMP?

Rutger Bevaart wrote:
> Same here on several 1750's, 1850's and 2850's. Tomorrow I'll
> disable USB in the BIOS on one of the 1750's and see if it makes a
> difference. It's the only one of the set that I could get downtime
> for because it rebooted yesterday ;-)
I've disabled USB in the BIOS on my 2850 much earlier on when I was  
getting interrupt storms, since I didn't need USB anyway.  It solved  
the problem of the interrupt storms, but it didn't seem to have any  
impact on the mysterious unsolicited rebooting problem.

Claus Guttesen wrote:
> It's not any comfort to you but I have two Dell PE 1750's running very
> reliably using FreeBSD 5.4 stable as of Wed. the 28th of Sep. 2005.
> It has two Xeon at 3 GHz, 2 GB RAM, a LSILogic 1030 Ultra4 Adapter.
> HTT is *off*.
>
> HTT does not yield any higher performance for most purposes. I can
> send you my kernel if you want.
It actually may be a comfort, since perhaps HTT is related to the
culprit.  Since the last crash, about a month ago, I disabled HTT,
both in the kernel and in the BIOS.  So as far as I know, it's
been completely disabled (and the boot messages and top only show 2
CPUs).  And I haven't had the system go down for nearly a month now.

Of course, I also did some other things at the same time, so it's  
unclear as to which specifically may have helped.  I had noticed that  
in the past it had rebooted itself twice right while running  
mysqlhotcopy as root during a period where the server may have been  
rather heavily loaded.  So in addition to turning off hyperthreading,  
I also changed the time when mysqlhotcopy was running to a period  
likely under a lighter load, and modified things so it isn't running  
as root any longer.

Not that I think mysqlhotcopy was the culprit itself, but it does  
cause a fairly large burst of disk activity when it is running, and  
it does seem to be related to triggering the event, at least in my  
situation.

In any case, since I've done those three things, I haven't had a  
crash yet.  Of course, the lack of a result doesn't prove anything,  
but the more time that passes, the better I feel.  That is, until one
day I wake up to find that it died again.  In any case, if that
happens, I'll know more things that the problem isn't related to.

Vivek Khera wrote:
> I'd recommend running the Dell diags.  They're pretty good at picking
> out hardware trouble, which it sounds like the OP is having.
In my case anyway, I have run the Dell diagnostics, and they showed  
everything to be just fine..

Kevin Oberman wrote:
> As far as I can tell, hyperthreading is not much of a win for
> anyone. See the article at:
> http://news.zdnet.co.uk/0,39020330,39237341,00.htm
>
> It reports that HTT slows performance even on threaded and,
> theoretically, HTT-ideal apps. (And this was with Windows.)
So I've heard.  I was hoping that hyperthreading might be able to  
help a dedicated MySQL server handle a bit higher load, but I never  
had the chance to benchmark it with and without hyperthreading before I
had to put the machine into production.  So it's disabled now - it  
can't hurt the stability of the system and can only potentially help  
it.  Time will tell.

Thanks for your replies, everyone!

Dan
--
Syzygy Research & Technology
Box 83, Legal, AB  T0G 1L0 Canada
Phone: 780-961-2213
