LaGarde, Owen M ERDC-RDE-ITL-MS Contractor
2015-May-09 00:10 UTC
[Gluster-users] memory allocation failure messages as false positives?
Are there any typical reasons for glusterfsd falsely reporting a memory allocation failure when attempting to create a new IB QP? I'm getting a high rate of similar cases but can't push any hardware or non-gluster software error into the open. Recovering the volume after a crash is not a problem; what self-heal doesn't automagically handle, rebalancing takes care of just fine.

Below is the glusterfs-glusterd log snippet for a typical crash. This happens with no particular pattern on any gluster server except the first in the series (which is also the one the clients specify in their mounts and thus go to for the vol info file). The crash may occur during a 'hello world' of 1p per node across the cluster but not during the final and most aggressive rank of an OpenMPI All-to-All benchmark, or vice versa; there's no particular correlation with MPI traffic load, IB/RDMA traffic pattern, client population and/or activity, etc.

In all failure cases, all IPoIB, Ethernet, RDMA, and IBCV tests completed without issue and returned the appropriate bandwidth/latency/pathing. All servers are running auditd and gmond, which show no indication of memory issues or any other failure. All servers have run Pandora repeatedly without triggering any hardware failures. There are no complaints in the global OpenSM instances for either IB fabric at the management points, or in the PTP SMD GUID-locked instances running on the gluster servers and talking to the backing storage controllers.

Any ideas? (I've appended a small standalone verbs probe at the end of this message for reference.)

---------
[2015-05-08 23:19:26.660870] C [rdma.c:2951:gf_rdma_create_qp] 0-rdma.management: rdma.management: could not create QP (Cannot allocate memory)
[2015-05-08 23:19:26.660966] W [rdma.c:818:gf_rdma_cm_handle_connect_request] 0-rdma.management: could not create QP (peer:10.149.0.63:1013 me:10.149.1.142:24008)
pending frames:
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2015-05-08 23:19:26
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.2
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x39e3a20136]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x39e3a3abbf]
/lib64/libc.so.6[0x39e1a326a0]
/usr/lib64/glusterfs/3.6.2/xlator/mgmt/glusterd.so(glusterd_rpcsvc_notify+0x69)[0x7fefd149ec59]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x105)[0x39e32081d5]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x1a0)[0x39e3209cd0]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x39e320b6d8]
/usr/lib64/glusterfs/3.6.2/rpc-transport/rdma.so(+0x5941)[0x7fefd0251941]
/usr/lib64/glusterfs/3.6.2/rpc-transport/rdma.so(+0xb6d9)[0x7fefd02576d9]
/lib64/libpthread.so.0[0x39e26079d1]
/lib64/libc.so.6(clone+0x6d)[0x39e1ae88fd]
---------

##
## Soft bits are:
##

RHEL 6.6
kernel 2.6.32-528.el6.bz1159925.x86_64 (this is the 6.7 pre-release kernel with the latest ib_sm updates for the occasional mgroup bcast/join issues, see RH BZ)
glibc-2.12-1.149.el6_6.7.x86_64
compat-opensm-libs-3.3.5-3.el6.x86_64
opensm-3.3.17-1.el6.x86_64
opensm-libs-3.3.17-1.el6.x86_64
opensm-multifabric-0.1-sgi710r3.rhel6.x86_64 (these are vendor stubs that run IB subnet_id- and GUID-specific opensm master/standby instances integrated with cluster management)
glusterfs-3.6.2-1.el6.x86_64
glusterfs-debuginfo-3.6.2-1.el6.x86_64
glusterfs-devel-3.6.2-1.el6.x86_64
glusterfs-libs-3.6.2-1.el6.x86_64
glusterfs-extra-xlators-3.6.2-1.el6.x86_64
glusterfs-api-devel-3.6.2-1.el6.x86_64
glusterfs-fuse-3.6.2-1.el6.x86_64
glusterfs-server-3.6.2-1.el6.x86_64
glusterfs-cli-3.6.2-1.el6.x86_64
glusterfs-api-3.6.2-1.el6.x86_64
glusterfs-rdma-3.6.2-1.el6.x86_64

Volume in question:

[root@phoenix-smc ~]# ssh service4 gluster vol info home
Warning: No xauth data; using fake authentication data for X11 forwarding.

Volume Name: home
Type: Distribute
Volume ID: f03fcaf0-3889-45ac-a06a-a4d60d5a673d
Status: Started
Number of Bricks: 28
Transport-type: rdma
Bricks:
Brick1: service4-ib1:/mnt/l1_s4_ost0000_0000/brick
Brick2: service4-ib1:/mnt/l1_s4_ost0001_0001/brick
Brick3: service4-ib1:/mnt/l1_s4_ost0002_0002/brick
Brick4: service5-ib1:/mnt/l1_s5_ost0003_0003/brick
Brick5: service5-ib1:/mnt/l1_s5_ost0004_0004/brick
Brick6: service5-ib1:/mnt/l1_s5_ost0005_0005/brick
Brick7: service5-ib1:/mnt/l1_s5_ost0006_0006/brick
Brick8: service6-ib1:/mnt/l1_s6_ost0007_0007/brick
Brick9: service6-ib1:/mnt/l1_s6_ost0008_0008/brick
Brick10: service6-ib1:/mnt/l1_s6_ost0009_0009/brick
Brick11: service7-ib1:/mnt/l1_s7_ost000a_0010/brick
Brick12: service7-ib1:/mnt/l1_s7_ost000b_0011/brick
Brick13: service7-ib1:/mnt/l1_s7_ost000c_0012/brick
Brick14: service7-ib1:/mnt/l1_s7_ost000d_0013/brick
Brick15: service10-ib1:/mnt/l1_s10_ost000e_0014/brick
Brick16: service10-ib1:/mnt/l1_s10_ost000f_0015/brick
Brick17: service10-ib1:/mnt/l1_s10_ost0010_0016/brick
Brick18: service11-ib1:/mnt/l1_s11_ost0011_0017/brick
Brick19: service11-ib1:/mnt/l1_s11_ost0012_0018/brick
Brick20: service11-ib1:/mnt/l1_s11_ost0013_0019/brick
Brick21: service11-ib1:/mnt/l1_s11_ost0014_0020/brick
Brick22: service12-ib1:/mnt/l1_s12_ost0015_0021/brick
Brick23: service12-ib1:/mnt/l1_s12_ost0016_0022/brick
Brick24: service12-ib1:/mnt/l1_s12_ost0017_0023/brick
Brick25: service13-ib1:/mnt/l1_s13_ost0018_0024/brick
Brick26: service13-ib1:/mnt/l1_s13_ost0019_0025/brick
Brick27: service13-ib1:/mnt/l1_s13_ost001a_0026/brick
Brick28: service13-ib1:/mnt/l1_s13_ost001b_0027/brick
Options Reconfigured:
performance.stat-prefetch: off

[root@phoenix-smc ~]# ssh service4 gluster vol status home
Warning: No xauth data; using fake authentication data for X11 forwarding.
Status of volume: home
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick service4-ib1:/mnt/l1_s4_ost0000_0000/brick        49156   Y       8028
Brick service4-ib1:/mnt/l1_s4_ost0001_0001/brick        49157   Y       8040
Brick service4-ib1:/mnt/l1_s4_ost0002_0002/brick        49158   Y       8052
Brick service5-ib1:/mnt/l1_s5_ost0003_0003/brick        49163   Y       6526
Brick service5-ib1:/mnt/l1_s5_ost0004_0004/brick        49164   Y       6533
Brick service5-ib1:/mnt/l1_s5_ost0005_0005/brick        49165   Y       6540
Brick service5-ib1:/mnt/l1_s5_ost0006_0006/brick        49166   Y       6547
Brick service6-ib1:/mnt/l1_s6_ost0007_0007/brick        49155   Y       8027
Brick service6-ib1:/mnt/l1_s6_ost0008_0008/brick        49156   Y       8039
Brick service6-ib1:/mnt/l1_s6_ost0009_0009/brick        49157   Y       8051
Brick service7-ib1:/mnt/l1_s7_ost000a_0010/brick        49160   Y       9067
Brick service7-ib1:/mnt/l1_s7_ost000b_0011/brick        49161   Y       9074
Brick service7-ib1:/mnt/l1_s7_ost000c_0012/brick        49162   Y       9081
Brick service7-ib1:/mnt/l1_s7_ost000d_0013/brick        49163   Y       9088
Brick service10-ib1:/mnt/l1_s10_ost000e_0014/brick      49155   Y       8108
Brick service10-ib1:/mnt/l1_s10_ost000f_0015/brick      49156   Y       8120
Brick service10-ib1:/mnt/l1_s10_ost0010_0016/brick      49157   Y       8132
Brick service11-ib1:/mnt/l1_s11_ost0011_0017/brick      49160   Y       8070
Brick service11-ib1:/mnt/l1_s11_ost0012_0018/brick      49161   Y       8082
Brick service11-ib1:/mnt/l1_s11_ost0013_0019/brick      49162   Y       8094
Brick service11-ib1:/mnt/l1_s11_ost0014_0020/brick      49163   Y       8106
Brick service12-ib1:/mnt/l1_s12_ost0015_0021/brick      49155   Y       8072
Brick service12-ib1:/mnt/l1_s12_ost0016_0022/brick      49156   Y       8084
Brick service12-ib1:/mnt/l1_s12_ost0017_0023/brick      49157   Y       8096
Brick service13-ib1:/mnt/l1_s13_ost0018_0024/brick      49156   Y       8156
Brick service13-ib1:/mnt/l1_s13_ost0019_0025/brick      49157   Y       8168
Brick service13-ib1:/mnt/l1_s13_ost001a_0026/brick      49158   Y       8180
Brick service13-ib1:/mnt/l1_s13_ost001b_0027/brick      49159   Y       8192
NFS Server on localhost                                 2049    Y       8065
NFS Server on service6-ib1                              2049    Y       8064
NFS Server on service13-ib1                             2049    Y       8205
NFS Server on service11-ib1                             2049    Y       11833
NFS Server on service12-ib1                             2049    Y       8109
NFS Server on service10-ib1                             2049    Y       8145
NFS Server on service5-ib1                              2049    Y       6554
NFS Server on service7-ib1                              2049    Y       15140

Task Status of Volume home
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 88f1e627-c7cc-40fc-b4a8-7672a6151712
Status               : completed

[root@phoenix-smc ~]#
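For what it's worth, the "(Cannot allocate memory)" that gf_rdma_create_qp logs appears to be just the errno string (ENOMEM) bubbling up from the verbs layer, and one common cause of ENOMEM on QP creation is locked-memory (RLIMIT_MEMLOCK) or HCA QP/CQ resource limits rather than general RAM pressure, which would be invisible to gmond/auditd. Below is the kind of minimal standalone probe I can run on each server to see whether bare ibv_create_qp() ever hits the same ENOMEM outside of Gluster; the device choice (first HCA) and the queue depths are guesses, not what glusterfs-rdma actually requests.

---------
/*
 * qp_probe.c -- minimal sketch, NOT the Gluster code path.
 * Prints RLIMIT_MEMLOCK and tries to create one RC QP on the first
 * HCA, to separate a bare-verbs ENOMEM from a glusterd-specific one.
 *
 * Build (assuming libibverbs-devel is installed):
 *   gcc -o qp_probe qp_probe.c -libverbs
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/resource.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct rlimit rl;

    /* Registered RDMA memory counts against RLIMIT_MEMLOCK. */
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
        printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    /* First HCA only; adjust if ib0/ib1 map to different devices. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed: %s\n", strerror(errno));
        return 1;
    }

    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);
    if (!pd || !cq) {
        fprintf(stderr, "PD/CQ allocation failed: %s\n", strerror(errno));
        return 1;
    }

    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.send_cq          = cq;
    attr.recv_cq          = cq;
    attr.qp_type          = IBV_QPT_RC;
    attr.cap.max_send_wr  = 512;   /* guessed queue depths */
    attr.cap.max_recv_wr  = 512;
    attr.cap.max_send_sge = 2;
    attr.cap.max_recv_sge = 1;

    /* This is where "Cannot allocate memory" (ENOMEM) would surface. */
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp)
        fprintf(stderr, "ibv_create_qp failed: %s\n", strerror(errno));
    else
        printf("QP 0x%x created OK on %s\n",
               qp->qp_num, ibv_get_device_name(devs[0]));

    if (qp) ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
---------

If a probe like this also fails intermittently, that would point at memlock settings or HCA/driver resource exhaustion on the servers; if it never fails, the ENOMEM is presumably coming from inside the rdma transport's own connection handling.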