Hi, all.

I picked up a couple of Dell R810 monsters a couple of months ago. 96G
of RAM, 24 core. With the aid of this list, got 8.1-RELEASE on there,
and they are trucking along merrily as VirtualBox hosts.

I'm seeing memory allocation errors when sending data over the network.
It is random at best, however I can reproduce it pretty reliably.

Sending 100M to a remote machine. Note the 2nd scp attempt worked.
Most small files can make it through unmolested.

obb# dd if=/dev/random of=100M-test bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes transferred in 2.881689 secs (36387551 bytes/sec)

obb# rsync -av 100M-test skin:/tmp/
sending incremental file list
100M-test
Write failed: Cannot allocate memory
rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
rsync: connection unexpectedly closed (28 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(601) [sender=3.0.7]

obb# scp 100M-test skin:/tmp/
100M-test     52%   52MB  52.1MB/s   00:00 ETAWrite failed: Cannot allocate memory
lost connection

obb# scp 100M-test skin:/tmp/
100M-test    100%  100MB  50.0MB/s   00:02

obb# scp 100M-test skin:/tmp/
100M-test      0%    0     0.0KB/s   --:-- ETAWrite failed: Cannot allocate memory
lost connection

Fetching a file, however, works.

obb# scp skin:/usr/local/tmp/100M-test .
100M-test    100%  100MB  20.0MB/s   00:05
obb# scp skin:/usr/local/tmp/100M-test .
100M-test    100%  100MB  20.0MB/s   00:05
obb# scp skin:/usr/local/tmp/100M-test .
100M-test    100%  100MB  20.0MB/s   00:05
obb# scp skin:/usr/local/tmp/100M-test .
100M-test    100%  100MB  20.0MB/s   00:05
...

I've ruled out bad hardware (mainly due to the behavior being
*identical* on the sister machine, in a completely different data
center.) It's a broadcom (bce) NIC.

mbufs look fine to me.

obb# netstat -m
511/6659/7170 mbufs in use (current/cache/total)
510/3678/4188/25600 mbuf clusters in use (current/cache/total/max)
510/3202 mbuf+clusters out of packet secondary zone in use (current/cache)
0/984/984/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
1147K/12956K/14104K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Plenty of available mem (not surprising):

obb# vmstat -hc 5 -w 5
 procs      memory      page                      disks     faults         cpu
 r b w     avm    fre   flt  re  pi  po    fr  sr mf0 mf1    in    sy    cs us sy id
 0 0 0    722M    92G   115   0   1   0  1067   0   0   0   429 32637  6520  0  1 99
 0 0 0    722M    92G     1   0   0   0     0   0   0   0     9 31830  3279  0  0 100
 0 0 0    722M    92G     0   0   0   0     3   0   0   0     8 33171  3223  0  0 100
 0 0 0    761M    92G  2593   0   0   0  1712   0   5   4   121 35384  3907  0  0 99
 1 0 0    761M    92G     0   0   0   0     0   0   0   0    10 30237  3156  0  0 100

Last bit of info, and here's where it gets really weird. Remember how I
said this was a VirtualBox host? Guest machines running on it (mostly
centos) don't exhibit the problem, which is also why it took me so long
to notice it in the host. They can merrily copy data around at will,
even though they are going out through the same host interface.

I'm not sure what to check for or toggle at this point. There are all
sorts of tunables I've been mucking around with to no avail, and so
I've reverted them to defaults. Mostly concentrating on these:

  hw.intr_storm_threshold
  net.inet.tcp.rfc1323
  kern.ipc.nmbclusters
  kern.ipc.nmbjumbop
  net.inet.tcp.sendspace
  net.inet.tcp.recvspace
  kern.ipc.somaxconn
  kern.ipc.maxsockbuf

It was suggested to me to try limiting the RAM in loader.conf to under
32G and see what happens. When doing this, it does appear to be "okay".
Not sure if that's coincidence, or directly related -- something with
the large amount of RAM that is confusing a data structure somewhere?
Or potentially a problem with the bce driver, specifically?
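For anyone wanting to repeat that test: the usual way to cap RAM from
loader.conf is the hw.physmem tunable, roughly like this (the value
here is only illustrative):

  # /boot/loader.conf -- limit the physical memory the kernel will use
  hw.physmem="32G"

It only takes effect at the next boot, and the active value can be
checked afterwards with "sysctl hw.physmem".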
I've kind of reached a limit here in what to dig for / try next. What
else can I do to try and determine the root problem that would be
helpful? Anyone ever have to deal with or seen something like this
recently? (Or hell, not recently?)

Ideas appreciated!

--
Mahlon E. Smith
http://www.martini.nu/contact.html
On Tue, Sep 7, 2010 at 4:08 PM, Mahlon E. Smith <mahlon@martini.nu> wrote:
> I've kind of reached a limit here in what to dig for / try next. What
> else can I do to try and determine the root problem that would be
> helpful? Anyone ever have to deal with or seen something like this
> recently? (Or hell, not recently?)
>
> Ideas appreciated!
> <http://www.martini.nu/contact.html>

Wild guess here, not sure of all the dynamics of changing this, but you
could try increasing:

  kern.maxdsiz

It's a kern tunable, so change it in /boot/loader.conf.

--
Adam Vande More
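For example, something along these lines in /boot/loader.conf (the
value is purely illustrative -- check what you're running now with
"sysctl kern.maxdsiz" and pick something larger):

  kern.maxdsiz="64G"    # or a plain byte count

It's only read at boot, so it needs a reboot to take effect.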
On Tue, Sep 07, 2010 at 02:08:13PM -0700, Mahlon E. Smith wrote:
> I picked up a couple of Dell R810 monsters a couple of months ago. 96G
> of RAM, 24 core. With the aid of this list, got 8.1-RELEASE on there,
> and they are trucking along merrily as VirtualBox hosts.
>
> I'm seeing memory allocation errors when sending data over the network.
> It is random at best, however I can reproduce it pretty reliably.
>
> Sending 100M to a remote machine. Note the 2nd scp attempt worked.
> Most small files can make it through unmolested.
>
> obb# dd if=/dev/random of=100M-test bs=1M count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes transferred in 2.881689 secs (36387551 bytes/sec)
> obb# rsync -av 100M-test skin:/tmp/
> sending incremental file list
> 100M-test
> Write failed: Cannot allocate memory
> rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
> rsync: connection unexpectedly closed (28 bytes received so far) [sender]
> rsync error: unexplained error (code 255) at io.c(601) [sender=3.0.7]
> obb# scp 100M-test skin:/tmp/
> 100M-test     52%   52MB  52.1MB/s   00:00 ETAWrite failed: Cannot allocate memory
> lost connection
> obb# scp 100M-test skin:/tmp/
> 100M-test    100%  100MB  50.0MB/s   00:02
> obb# scp 100M-test skin:/tmp/
> 100M-test      0%    0     0.0KB/s   --:-- ETAWrite failed: Cannot allocate memory
> lost connection
>
> Fetching a file, however, works.
>
> obb# scp skin:/usr/local/tmp/100M-test .
> 100M-test    100%  100MB  20.0MB/s   00:05
> obb# scp skin:/usr/local/tmp/100M-test .
> 100M-test    100%  100MB  20.0MB/s   00:05
> obb# scp skin:/usr/local/tmp/100M-test .
> 100M-test    100%  100MB  20.0MB/s   00:05
> obb# scp skin:/usr/local/tmp/100M-test .
> 100M-test    100%  100MB  20.0MB/s   00:05
> ...
>
> I've ruled out bad hardware (mainly due to the behavior being
> *identical* on the sister machine, in a completely different data
> center.) It's a broadcom (bce) NIC.

This could be a bce(4) bug, meaning the "failed to allocate memory"
message could be indicating DMA failure or something else from the
card, and not necessarily related to mbufs.

There are also changes/fixes to bce(4) that are in RELENG_8 (8.1-STABLE)
that aren't in 8.1-RELEASE, but I don't know if those are responsible
for your problem.

Please provide output from the following:

* uname -a (if desired, XXX out hostname)
* vmstat -i
* ifconfig -a (if desired, XXX out IPs and MACs)
* netstat -inbd (if desired, XXX out MACs)
* pciconf -lvc (only the bceX entry please)

Also check dmesg to see if there are any error messages that correlate
with when the problem occurs.

I'm also CC'ing Yong-Hyeon PYUN who might have some ideas.

--
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.             PGP: 4BD6C0CB |
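If it's easier to grab all of that in one pass, something like this
should do it (the output file name is just an example; trim the pciconf
output down to the bceX entry before posting):

  obb# (uname -a; echo; vmstat -i; echo; ifconfig -a; echo; netstat -inbd; echo; pciconf -lvc) > /tmp/bce-info.txt 2>&1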