Hi, all.

I picked up a couple of Dell R810 monsters a couple of months ago. 96G
of RAM, 24 core. With the aid of this list, got 8.1-RELEASE on there,
and they are trucking along merrily as VirtualBox hosts.

I'm seeing memory allocation errors when sending data over the network.
It is random at best, however I can reproduce it pretty reliably.

Sending 100M to a remote machine. Note the 2nd scp attempt worked.
Most small files can make it through unmolested.

obb# dd if=/dev/random of=100M-test bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes transferred in 2.881689 secs (36387551 bytes/sec)

obb# rsync -av 100M-test skin:/tmp/
sending incremental file list
100M-test
Write failed: Cannot allocate memory
rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
rsync: connection unexpectedly closed (28 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(601) [sender=3.0.7]

obb# scp 100M-test skin:/tmp/
100M-test     52%   52MB  52.1MB/s   00:00 ETAWrite failed: Cannot allocate memory
lost connection

obb# scp 100M-test skin:/tmp/
100M-test    100%  100MB  50.0MB/s   00:02

obb# scp 100M-test skin:/tmp/
100M-test      0%    0     0.0KB/s   --:-- ETAWrite failed: Cannot allocate memory
lost connection

Fetching a file, however, works.

obb# scp skin:/usr/local/tmp/100M-test .
100M-test    100%  100MB  20.0MB/s   00:05
obb# scp skin:/usr/local/tmp/100M-test .
100M-test    100%  100MB  20.0MB/s   00:05
obb# scp skin:/usr/local/tmp/100M-test .
100M-test    100%  100MB  20.0MB/s   00:05
obb# scp skin:/usr/local/tmp/100M-test .
100M-test    100%  100MB  20.0MB/s   00:05
...

I've ruled out bad hardware (mainly due to the behavior being
*identical* on the sister machine, in a completely different data
center.) It's a broadcom (bce) NIC.

mbufs look fine to me.

obb# netstat -m
511/6659/7170 mbufs in use (current/cache/total)
510/3678/4188/25600 mbuf clusters in use (current/cache/total/max)
510/3202 mbuf+clusters out of packet secondary zone in use (current/cache)
0/984/984/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
1147K/12956K/14104K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Plenty of available mem (not surprising):

obb# vmstat -hc 5 -w 5
 procs      memory      page                      disks     faults         cpu
 r b w     avm    fre   flt  re  pi  po    fr  sr mf0 mf1    in    sy    cs us sy id
 0 0 0    722M    92G   115   0   1   0  1067   0   0   0   429 32637  6520  0  1 99
 0 0 0    722M    92G     1   0   0   0     0   0   0   0     9 31830  3279  0  0 100
 0 0 0    722M    92G     0   0   0   0     3   0   0   0     8 33171  3223  0  0 100
 0 0 0    761M    92G  2593   0   0   0  1712   0   5   4   121 35384  3907  0  0 99
 1 0 0    761M    92G     0   0   0   0     0   0   0   0    10 30237  3156  0  0 100

Last bit of info, and here's where it gets really weird. Remember how I
said this was a VirtualBox host? Guest machines running on it (mostly
centos) don't exhibit the problem, which is also why it took me so long
to notice it in the host. They can merrily copy data around at will,
even though they are going out through the same host interface.

I'm not sure what to check for or toggle at this point. There are all
sorts of tunables I've been mucking around with to no avail, and so
I've reverted them to defaults. Mostly concentrating on these:

  hw.intr_storm_threshold
  net.inet.tcp.rfc1323
  kern.ipc.nmbclusters
  kern.ipc.nmbjumbop
  net.inet.tcp.sendspace
  net.inet.tcp.recvspace
  kern.ipc.somaxconn
  kern.ipc.maxsockbuf

It was suggested to me to try limiting the RAM in loader.conf to under
32G and see what happens. When doing this, it does appear to be "okay".
Not sure if that's coincidence, or directly related -- something with
the large amount of RAM that is confusing a data structure somewhere?
Or potentially a problem with the bce driver, specifically?
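For anyone wanting to repeat that test: the usual way to cap RAM from
loader.conf is the hw.physmem tunable, roughly like this (the value
here is only illustrative):

  # /boot/loader.conf -- limit the physical memory the kernel will use
  hw.physmem="32G"

It only takes effect at the next boot, and the active value can be
checked afterwards with "sysctl hw.physmem".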
I've kind of reached a limit here in what to dig for / try next. What
else can I do to try and determine the root problem that would be
helpful? Anyone ever have to deal with or seen something like this
recently? (Or hell, not recently?)

Ideas appreciated!

--
Mahlon E. Smith
http://www.martini.nu/contact.html
On Tue, Sep 7, 2010 at 4:08 PM, Mahlon E. Smith <mahlon@martini.nu> wrote:
> I've kind of reached a limit here in what to dig for / try next. What
> else can I do to try and determine the root problem that would be
> helpful? Anyone ever have to deal with or seen something like this
> recently? (Or hell, not recently?)
>
> Ideas appreciated!
> <http://www.martini.nu/contact.html>

Wild guess here, not sure of all the dynamics of changing this, but you
could try increasing:

  kern.maxdsiz

It's a kern tunable, so change it in /boot/loader.conf.

--
Adam Vande More
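For example, something along these lines in /boot/loader.conf (the
value is purely illustrative -- check what you're running now with
"sysctl kern.maxdsiz" and pick something larger):

  kern.maxdsiz="64G"    # or a plain byte count

It's only read at boot, so it needs a reboot to take effect.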
On Tue, Sep 07, 2010 at 02:08:13PM -0700, Mahlon E. Smith wrote:
> I picked up a couple of Dell R810 monsters a couple of months ago. 96G
> of RAM, 24 core. With the aid of this list, got 8.1-RELEASE on there,
> and they are trucking along merrily as VirtualBox hosts.
>
> I'm seeing memory allocation errors when sending data over the network.
> It is random at best, however I can reproduce it pretty reliably.
>
> Sending 100M to a remote machine. Note the 2nd scp attempt worked.
> Most small files can make it through unmolested.
>
> obb# dd if=/dev/random of=100M-test bs=1M count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes transferred in 2.881689 secs (36387551 bytes/sec)
> obb# rsync -av 100M-test skin:/tmp/
> sending incremental file list
> 100M-test
> Write failed: Cannot allocate memory
> rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
> rsync: connection unexpectedly closed (28 bytes received so far) [sender]
> rsync error: unexplained error (code 255) at io.c(601) [sender=3.0.7]
> obb# scp 100M-test skin:/tmp/
> 100M-test     52%   52MB  52.1MB/s   00:00 ETAWrite failed: Cannot allocate memory
> lost connection
> obb# scp 100M-test skin:/tmp/
> 100M-test    100%  100MB  50.0MB/s   00:02
> obb# scp 100M-test skin:/tmp/
> 100M-test      0%    0     0.0KB/s   --:-- ETAWrite failed: Cannot allocate memory
> lost connection
>
> Fetching a file, however, works.
>
> obb# scp skin:/usr/local/tmp/100M-test .
> 100M-test    100%  100MB  20.0MB/s   00:05
> obb# scp skin:/usr/local/tmp/100M-test .
> 100M-test    100%  100MB  20.0MB/s   00:05
> obb# scp skin:/usr/local/tmp/100M-test .
> 100M-test    100%  100MB  20.0MB/s   00:05
> obb# scp skin:/usr/local/tmp/100M-test .
> 100M-test    100%  100MB  20.0MB/s   00:05
> ...
>
> I've ruled out bad hardware (mainly due to the behavior being
> *identical* on the sister machine, in a completely different data
> center.) It's a broadcom (bce) NIC.

This could be a bce(4) bug, meaning the "failed to allocate memory"
message could be indicating DMA failure or something else from the
card, and not necessarily related to mbufs.

There are also changes/fixes to bce(4) that are in RELENG_8 (8.1-STABLE)
that aren't in 8.1-RELEASE, but I don't know if those are responsible
for your problem.

Please provide output from the following:

* uname -a (if desired, XXX out hostname)
* vmstat -i
* ifconfig -a (if desired, XXX out IPs and MACs)
* netstat -inbd (if desired, XXX out MACs)
* pciconf -lvc (only the bceX entry please)

Also check dmesg to see if there are any error messages that correlate
with when the problem occurs.

I'm also CC'ing Yong-Hyeon PYUN who might have some ideas.

--
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.             PGP: 4BD6C0CB |
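If it's easier to grab all of that in one pass, something like this
should do it (the output file name is just an example; trim the pciconf
output down to the bceX entry before posting):

  obb# (uname -a; echo; vmstat -i; echo; ifconfig -a; echo; netstat -inbd; echo; pciconf -lvc) > /tmp/bce-info.txt 2>&1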