Hi, we are having a problem with NFS using the RDMA protocol over our FDR10
Infiniband network. I previously wrote to the NFS mailing list about this,
so you may find our discussion there. I have taken some load off the
server by converting the backups, which were using NFS, to use SSH instead,
but we are still having critical problems with NFS clients losing their
connection to the server, causing the clients to hang and require a reboot.
I wanted to check in here before filing a bug with CentOS.
Our setup is a cluster with one head node (the NFS server) and nine compute
nodes (NFS clients). All the machines are running CentOS 6.9 with kernel
2.6.32-696.30.1.el6.x86_64 and use the "inbox"/CentOS RDMA
implementation and drivers (not Mellanox OFED). (We also have other NFS
clients, but they use 1GbE for their NFS connection and, while they will
still hang with messages like "NFS server not responding, retrying" or
"timed out", they eventually recover and don't need a
reboot.)
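In case it helps confirm which stack is in use, this is roughly how I check
the loaded RDMA/NFS modules on the machines (the module names are my
assumption of what the inbox drivers load):

lsmod | egrep 'svcrdma|xprtrdma|mlx4_ib|rdma_cm'
modinfo mlx4_core | grep -i version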
On the server (which is named pac) I will see messages like this:
Jul 30 18:19:38 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jul 30 18:19:38 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:03:05 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:09:06 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:16:09 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 15:23:31 pac kernel: svcrdma: Error -107 posting RDMA_READ
Jul 31 15:53:55 pac kernel: svcrdma: failed to send write chunks, rc=-5
Jul 31 16:09:19 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Jul 31 16:09:19 pac kernel: svcrdma: failed to send reply chunks, rc=-5
Previously I had also seen messages like "Jul 11 21:09:56 pac kernel:
nfsd: peername failed (err 107)!", however I have not seen that during this
latest hangup.
And on the clients (named n001-n009) I will see messages like this:
Jul 30 18:17:26 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff8810674024c0 (stale): WR flushed
Jul 30 18:17:26 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff88106638a640 (stale): WR flushed
Jul 30 18:19:26 n001 kernel: nfs: server 10.10.11.100 not responding,
still trying
Jul 30 18:19:36 n001 kernel: nfs: server 10.10.10.100 not responding,
timed out
Jul 30 18:19:38 n001 kernel: rpcrdma: connection to 10.10.11.100:20049
on mlx4_0, memreg 5 slots 32 ird 16
Jul 30 18:19:38 n001 kernel: nfs: server 10.10.11.100 OK
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff8810671f02c0 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff8810677bda40 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff8810677bd940 (stale): WR flushed
Jul 31 14:42:08 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff8810671f0240 (stale): WR flushed
Jul 31 14:43:35 n001 kernel: rpcrdma: connection to 10.10.11.100:20049
on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff881065133140 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff8810666e3f00 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff881063ea0dc0 (stale): WR flushed
Jul 31 15:01:53 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff8810677bdb40 (stale): WR flushed
Jul 31 15:03:05 n001 kernel: rpcrdma: connection to 10.10.11.100:20049
on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff881060e59d40 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff8810677efac0 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff88106638a640 (stale): WR flushed
Jul 31 15:07:07 n001 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff8810671f03c0 (stale): WR flushed
Jul 31 15:09:06 n001 kernel: rpcrdma: connection to 10.10.11.100:20049
on mlx4_0, memreg 5 slots 32 ird 16
Jul 31 15:16:09 n001 kernel: rpcrdma: connection to 10.10.11.100:20049
closed (-103)
Jul 31 15:53:32 n001 kernel: nfs: server 10.10.10.100 not responding,
timed out
Jul 31 16:08:56 n001 kernel: nfs: server 10.10.10.100 not responding,
timed out
Jul 30 18:17:26 n002 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff881064461500 (stale): WR flushed
Jul 30 18:17:26 n002 kernel: RPC: rpcrdma_sendcq_process_wc: frmr
ffff8810604b2600 (stale): WR flushed
Jul 30 18:19:26 n002 kernel: nfs: server 10.10.11.100 not responding,
still trying
Jul 30 18:19:38 n002 kernel: rpcrdma: connection to 10.10.11.100:20049
on mlx4_0, memreg 5 slots 32 ird 16
Jul 30 18:19:38 n002 kernel: nfs: server 10.10.11.100 OK
Jul 31 14:43:35 n002 kernel: rpcrdma: connection to 10.10.11.100:20049
closed (-103)
Jul 31 16:08:56 n002 kernel: nfs: server 10.10.10.100 not responding,
timed out
Similar messages show up on the other clients, n003-n009. After these
messages appear, the clients' load climbs continuously (viewable through
Ganglia), I would guess because processes are waiting for the NFS mounts to
reappear. The clients are no longer reachable through SSH, and root cannot
log in on the console via the IPMI web applet either (it just hangs after
entering the password and may eventually get to a prompt, but the system
load is so high), so they need to be rebooted through the IPMI interface.
Here is /etc/fstab on the server,
UUID=f15df051-ffb8-408c-8ad2-1987b6f082a2 / ext3 defaults 0 1
UUID=c854ee27-32cf-445d-8308-4e6f1a87d364 /boot ext3 defaults 0 2
UUID=b92a100f-2521-408b-9b15-93671c6ae056 swap swap defaults 0 0
UUID=a8a7b737-25ed-43a7-ae4b-391c71aa8c08 /data xfs defaults 0 2
UUID=d5692ec2-d5dc-4bb8-98d4-a4fb2ff54748 /projects xfs defaults 0 2
/dev/drbd0 /newwing xfs noauto 0 0
UUID=a305f309-d997-43ec-8e4f-78e26b07652f /working xfs defaults 0 2
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
I read that adding "inode64,nobarrier" to the XFS mount options may
help? That is something I can try once the server can be rebooted.
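For example, the /working entry in the server's fstab would then look
something like this (assuming I have the option syntax right):

UUID=a305f309-d997-43ec-8e4f-78e26b07652f /working xfs inode64,nobarrier 0 2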
Here are the current mounts on the server,
/dev/sda3 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda1 on /boot type ext3 (rw)
/dev/sdc1 on /data type xfs (rw)
/dev/sdb1 on /projects type xfs (rw)
/dev/sde1 on /working type xfs (rw,nobarrier)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/drbd0 on /newwing type xfs (rw)
Here is /etc/exports on the server,
/data 10.10.10.0/24(rw,no_root_squash,async)
/data 10.10.11.0/24(rw,no_root_squash,async)
/data 150.x.x.192/27(rw,no_root_squash,async)
/data 150.x.x.64/26(rw,no_root_squash,async)
/home 10.10.10.0/24(rw,no_root_squash,async)
/home 10.10.11.0/24(rw,no_root_squash,async)
/opt 10.10.10.0/24(rw,no_root_squash,async)
/opt 10.10.11.0/24(rw,no_root_squash,async)
/projects 10.10.10.0/24(rw,no_root_squash,async)
/projects 10.10.11.0/24(rw,no_root_squash,async)
/projects 150.x.x.192/27(rw,no_root_squash,async)
/projects 150.x.x.64/26(rw,no_root_squash,async)
/tools 10.10.10.0/24(rw,no_root_squash,async)
/tools 10.10.11.0/24(rw,no_root_squash,async)
/usr/share/gridengine 10.10.10.10/24(rw,no_root_squash,async)
/usr/share/gridengine 10.10.11.10/24(rw,no_root_squash,async)
/usr/local 10.10.10.10/24(rw,no_root_squash,async)
/usr/local 10.10.11.10/24(rw,no_root_squash,async)
/working 10.10.10.0/24(rw,no_root_squash,async)
/working 10.10.11.0/24(rw,no_root_squash,async)
/working 150.x.x.192/27(rw,no_root_squash,async)
/working 150.x.x.64/26(rw,no_root_squash,async)
/newwing 10.10.10.0/24(rw,no_root_squash,async)
/newwing 10.10.11.0/24(rw,no_root_squash,async)
/newwing 150.x.x.192/27(rw,no_root_squash,async)
/newwing 150.x.x.64/26(rw,no_root_squash,async)
The 10.10.10.0/24 network is 1GbE and 10.10.11.0/24 is the Infiniband;
the other networks are also 1GbE. Our cluster nodes normally mount all of
these exports over the Infiniband with RDMA. The computation jobs normally
use /working, which sees the most reading and writing, but /newwing,
/projects, and /data are also used.
Here is the /etc/fstab from the nodes,
#NFS/RDMA
#10.10.11.100:/opt /opt nfs rdma,port=20049 0 0
#10.10.11.100:/data /data nfs rdma,port=20049 0 0
#10.10.11.100:/tools /tools nfs rdma,port=20049 0 0
#10.10.11.100:/home /home nfs rdma,port=20049 0 0
#10.10.11.100:/usr/local /usr/local nfs rdma,port=20049 0 0
#10.10.11.100:/usr/share/gridengine /usr/share/gridengine nfs rdma,port=20049 0 0
#10.10.11.100:/projects /projects nfs rdma,port=20049 0 0
#10.10.11.100:/working /working nfs rdma,port=20049 0 0
#10.10.11.100:/newwing /newwing nfs rdma,port=20049 0 0
#NFS/IPoIB
10.10.11.100:/opt /opt nfs tcp 0 0
10.10.11.100:/data /data nfs tcp 0 0
10.10.11.100:/tools /tools nfs tcp 0 0
10.10.11.100:/home /home nfs tcp 0 0
10.10.11.100:/usr/local /usr/local nfs tcp 0 0
10.10.11.100:/usr/share/gridengine /usr/share/gridengine nfs tcp 0 0
10.10.11.100:/projects /projects nfs tcp 0 0
10.10.11.100:/working /working nfs tcp 0 0
10.10.11.100:/newwing /newwing nfs tcp 0 0
#NFS/TCP
#10.10.10.100:/opt /opt nfs defaults 0 0
#10.10.10.100:/data /data nfs defaults 0 0
#10.10.10.100:/tools /tools nfs defaults 0 0
#10.10.10.100:/home /home nfs defaults 0 0
#10.10.10.100:/usr/local /usr/local nfs defaults 0 0
#10.10.10.100:/usr/share/gridengine /usr/share/gridengine nfs defaults 0 0
#10.10.10.100:/projects /projects nfs defaults 0 0
#10.10.10.100:/working /working nfs defaults 0 0
#10.10.10.100:/newwing /newwing nfs defaults 0 0
With this fstab I can switch between different interfaces/protocols for the
NFS mounts. Currently we are trying IPoIB; we haven't started a cluster job
yet, so I am not sure how it will perform. With NFS/TCP over 1GbE the
server and nodes would still hang from time to time, but at least they did
not crash, although it was of course slow, being limited by 1GbE.
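To confirm which transport a node actually ends up using after switching
and remounting, I run something like this on a client and look for
proto=rdma versus proto=tcp in the mount options (commands only, output
omitted):

nfsstat -m
grep ' nfs ' /proc/mounts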
We did not have this problem until recently. I upgraded our cluster to
add the two additional nodes (n008 and n009), and we also added more
storage to the server (/newwing and /working). The new nodes are on the AMD
EPYC platform whereas the server and nodes n001-n007 are on the Intel Xeon
platform; I am not sure whether that could cause such a crash. The new
nodes were cloned from n001, and only the kernel command line and network
parameters were changed.
Jobs are submitted to the cluster via Sun Grid Engine, and in total about
61 jobs may start at once and open connections to the NFS server... it
sounds like a system overload, although the load on the server remains low,
under 10%; even as it hangs, the load may increase to 80%. The server is a
few years old but still has 2x 6-core Intel Xeon E5-2620 v2 @ 2.10GHz and
128GB of RAM.
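In case the number of server threads is a factor with that many
simultaneous connections, one thing I can check or adjust is the nfsd
thread count; on CentOS 6 that is RPCNFSDCOUNT in /etc/sysconfig/nfs,
followed by a "service nfs restart" (the value below is only an example I
might try, not what we currently run):

# /etc/sysconfig/nfs
RPCNFSDCOUNT=64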
I would appreciate your assistance in troubleshooting this critical problem
and, if needed, in gathering the required information to submit a bug to
the tracker!
Thanks,
--
Chandler
Arizona Genomics Institute
www.genome.arizona.edu