Mike Lykov
2018-Nov-07 11:01 UTC
[Gluster-users] glusterd SIGSEGV crash when create volume with transport=rdma
Hi All! I'm trying to use the oVirt virtualisation platform with GlusterFS storage and Intel Omni-Path "Infiniband" interfaces. All packages are version 3.12 from the ovirt-4.2 repository, but I also tried gluster 4.1 from the CentOS centos-release-gluster41 repository. Hosts are CentOS 7.5. glusterd crashes with SIGSEGV. Is there some special configuration needed for rdma transport?

Created trusted pool:

[root at ovirtnode1 log]# gluster pool list
UUID                                    Hostname        State
5a9a0a5f-12f4-48b1-bfbe-24c172adc65c    ovirtstor5      Connected
41350da9-c944-41c5-afdc-46ff51ab93f6    ovirtstor6      Connected
0f50175e-7e47-4839-99c7-c7ced21f090c    localhost       Connected

(This is from 172.16.100.1; peer ovirtstor5 is 172.16.100.5, ovirtstor6 is 172.16.100.6.)

Creating the volume (success):

gluster volume create data_rdma replica 3 transport rdma ovirtstor1:/gluster_bricks/data_rdma/data_rdma ovirtstor5:/gluster_bricks/data_rdma/data_rdma ovirtstor6:/gluster_bricks/data_rdma/data_rdma
volume create: data_rdma: success: please start the volume to access data

glusterd.log (UTC time; local time zone is UTC+4):

[2018-11-07 09:52:43.106185] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.12.15/xlator/mgmt/glusterd.so(+0xdf50a) [0x7f3423e4350a] -->/usr/lib64/glusterfs/3.12.15/xlator/mgmt/glusterd.so(+0xdefcd) [0x7f3423e42fcd] -->/lib64/libglus
[2018-11-07 09:52:57.825351] I [MSGID: 106488] [glusterd-handler.c:1548:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
[2018-11-07 09:53:19.119450] I [glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a fresh brick process for brick /gluster_bricks/data_rdma/data_rdma
[2018-11-07 09:53:19.186374] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /gluster_bricks/data_rdma/data_rdma.rdma on port 49155

Status (all online):

[root at ovirtnode1 /]# gluster volume status data_rdma
Status of volume: data_rdma
Gluster process                                       TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ovirtstor1:/gluster_bricks/data_rdma/data_rdma  0         49155      Y       156176
Brick ovirtstor5:/gluster_bricks/data_rdma/data_rdma  0         49155      Y       47958
Brick ovirtstor6:/gluster_bricks/data_rdma/data_rdma  0         49155      Y       18911
Self-heal Daemon on localhost                         N/A       N/A        Y       156206
Self-heal Daemon on ovirtstor5.miac                   N/A       N/A        Y       47994
Self-heal Daemon on ovirtstor6.miac                   N/A       N/A        Y       18947

After 3 minutes:

[2018-11-07 09:56:08.957536] C [MSGID: 103021] [rdma.c:3263:gf_rdma_create_qp] 0-rdma.management: rdma.management: could not create QP [garbled Russian-locale errno text]
[2018-11-07 09:56:08.957986] W [MSGID: 103021] [rdma.c:1049:gf_rdma_cm_handle_connect_request] 0-rdma.management: could not create QP (peer:172.16.100.5:49151 me:172.16.100.1:24008)
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-11-07 09:56:08
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.15
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xa0)[0x7f342f2f54e0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f342f2ff414]
/lib64/libc.so.6(+0x362f0)[0x7f342d9552f0]
/usr/lib64/glusterfs/3.12.15/xlator/mgmt/glusterd.so(+0x160c4)[0x7f3423d7a0c4]
/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x10f)[0x7f342f0b584f]
/lib64/libgfrpc.so.0(rpcsvc_notify+0xc0)[0x7f342f0b7f20]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f342f0b9ea3]
/usr/lib64/glusterfs/3.12.15/rpc-transport/rdma.so(+0x4fef)[0x7f341fba8fef]
/usr/lib64/glusterfs/3.12.15/rpc-transport/rdma.so(+0x7c20)[0x7f341fbabc20]
/lib64/libpthread.so.0(+0x7e25)[0x7f342e154e25]
/lib64/libc.so.6(clone+0x6d)[0x7f342da1dbad]

glusterd is listening only on tcp/24007 on all nodes, but why? Is that why the connection to 172.16.100.1:24008 fails?
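Port 24008 is where gluster's rdma management transport would accept peer connections, and whether glusterd opens it depends on the transport-type option in glusterd's own volfile. A minimal check sketch, with a hypothetical sample volfile written to /tmp so the command is runnable as-is (on a real node, point the grep at /etc/glusterfs/glusterd.vol instead):

```shell
# Hypothetical sample of /etc/glusterfs/glusterd.vol, created here only
# so the grep below has something to run against:
cat > /tmp/glusterd.vol.sample <<'EOF'
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
end-volume
EOF
# If "rdma" is missing from transport-type, glusterd never listens on
# 24008, which would match seeing only tcp/24007 open:
grep 'transport-type' /tmp/glusterd.vol.sample
```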
On the peer node (syslog 'messages'):

Nov  7 13:53:24 ovirtnode5 glustershd[47994]: [2018-11-07 09:53:24.570701] C [MSGID: 103021] [rdma.c:3263:gf_rdma_create_qp] 0-data_rdma-client-0: data_rdma-client-0: could not create QP [garbled Russian-locale errno text]
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: [2018-11-07 09:56:09.988118] C [MSGID: 103021] [rdma.c:3263:gf_rdma_create_qp] 0-rdma.management: rdma.management: could not create QP [garbled Russian-locale errno text]
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: pending frames:
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: patchset: git://git.gluster.org/glusterfs.git
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: signal received: 11
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: time of crash:
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: 2018-11-07 09:56:09
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: configuration details:
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: argp 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: backtrace 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: dlfcn 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: libpthread 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: llistxattr 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: setfsid 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: spinlock 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: epoll.h 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: xattr.h 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: st_atim.tv_nsec 1
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: package-string: glusterfs 3.12.15
Nov  7 13:56:09 ovirtnode5 glusterd[15657]: ---------
Nov  7 13:56:10 ovirtnode5 abrt-hook-ccpp: Process 15657 (glusterfsd) of user 0 killed by SIGSEGV - dumping core

ABRT shows this:

[root at ovirtnode1 glusterfs]# abrt-cli list
id 7b7b53a92fe3f26271fd9f9012d1d0d011d94773
reason:         glusterfsd killed by SIGSEGV
time:           Wed 07 Nov
2018 13:56:09
cmdline:        /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
package:        glusterfs-fuse-3.12.15-1.el7
uid:            0 (root)
count:          1
Directory:      /var/tmp/abrt/ccpp-2018-11-07-13:56:09-3145
Reported:       https://retrace.fedoraproject.org/faf/reports/bthash/badd77dc4fa0d04f686a4b3366e262d1140fdb55

Code (I don't know what version/release it is; found on GitHub):
https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/rdma/src/rdma.c

    ret = rdma_create_qp(peer->cm_id, device->pd, &init_attr);
    if (ret != 0) {
        gf_msg(peer->trans->name, GF_LOG_CRITICAL, errno,
               RDMA_MSG_CREAT_QP_FAILED, "%s: could not create QP",
               this->name);
        ret = -1;
    ...
    .srq = device->srq,

RDMA on its own seems to be working:

[root at ovirtnode5 log]# ib_write_bw -D 30 --cpu_util ovirtstor1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : hfi1_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x04 QPN 0x00ae PSN 0x15933a RKey 0x60181900 VAddr 0x007fde76ee6000
 remote address: LID 0x03 QPN 0x0056 PSN 0x7758e RKey 0x40101100 VAddr 0x007fde37488000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]   CPU_Util[%]
Conflicting CPU frequency values detected: 3692.431000 != 3109.112000. CPU Frequency is not max.
 65536      1646300        0.00               6430.36              0.102886        1.40
---------------------------------------------------------------------------------------

Info about hardware & driver:

[root at ovirtnode1 glusterfs]# hfi1_control -i
Driver Version: 10.8-0
Driver SrcVersion: AFDD1BF17512A67B217EB47
Opa Version: 10.8.0.0.204
0: BoardId: Intel Corporation Omni-Path HFI Silicon 100 Series [integrated]
0: Version: ChipABI 3.0, ChipRev 7.17, SW Compat 3
0: ChipSerial: 0x011aeeea
0,1: Status: 5: LinkUp 4: ACTIVE
0,1: LID=0x3 GUID=0011:7509:011a:eeea

[root at ovirtnode1 glusterfs]# opainfo
hfi1_0:1                       PortGID:0xfe80000000000000:00117509011aeeea
   PortState:     Active
   LinkSpeed      Act: 25Gb        En: 25Gb
   LinkWidth      Act: 4           En: 4
   LinkWidthDnGrd ActTx: 4  Rx: 4  En: 3,4
   LCRC           Act: 14-bit      En: 14-bit,16-bit,48-bit   Mgmt: True
   LID: 0x00000003-0x00000003      SM LID: 0x00000003  SL: 0
   Xmit Data:   6752 MB  Pkts:  9628972
   Recv Data: 217461 MB  Pkts: 60540469
   Link Quality: 5 (Excellent)
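The "listening only on tcp/24007" observation above is easy to confirm per node. A sketch, using a hypothetical stand-in for real listener output so it runs anywhere (on the affected node, the output of `ss -ltn` itself is the thing to inspect):

```shell
# Stand-in for 'ss -ltn' output on the affected node, where only the
# TCP management port (24007) was open:
cat > /tmp/listeners.sample <<'EOF'
LISTEN 0 128 *:24007 *:*
EOF
# On a real node, inspect 'ss -ltn' directly instead of this sample.
# 24008 being absent matches the peer's failed connect to
# 172.16.100.1:24008 seen in glusterd.log:
if grep -q ':24008' /tmp/listeners.sample; then
    echo "24008 listening"
else
    echo "24008 not listening"
fi
```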
Mike Lykov
2018-Nov-07 11:30 UTC
[Gluster-users] glusterd SIGSEGV crash when create volume with transport=rdma
07.11.2018 15:01, Mike Lykov wrote:
> RDMA on its own seems to be working:
> [root at ovirtnode5 log]# ib_write_bw -D 30 --cpu_util ovirtstor1
> ---------------------------------------------------------------------------------------
>                     RDMA_Write BW Test
>  Dual-port       : OFF          Device         : hfi1_0
>  Number of qps   : 1            Transport type : IB
>  Connection type : RC           Using SRQ      : OFF
>  TX depth        : 128
>  CQ Moderation   : 100
>  Mtu             : 4096[B]
>  Link type       : IB
>  Max inline data : 0[B]
>  rdma_cm QPs     : OFF
>  Data ex. method : Ethernet
> ---------------------------------------------------------------------------------------
>  local address: LID 0x04 QPN 0x00ae PSN 0x15933a RKey 0x60181900 VAddr 0x007fde76ee6000
>  remote address: LID 0x03 QPN 0x0056 PSN 0x7758e RKey 0x40101100 VAddr 0x007fde37488000
> ---------------------------------------------------------------------------------------
>  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]   CPU_Util[%]
> Conflicting CPU frequency values detected: 3692.431000 != 3109.112000. CPU Frequency is not max.
>  65536      1646300        0.00               6430.36              0.102886        1.40
> ---------------------------------------------------------------------------------------

This is the right test, using "rdma_cm QPs" (server-side output):

[root at ovirtnode1 glusterfs]# ib_write_bw -R -D 30 --cpu_util

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : hfi1_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 Waiting for client rdma_cm QP to connect
 Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
 local address: LID 0x03 QPN 0x005a PSN 0x69e89
 remote address: LID 0x04 QPN 0x00b2 PSN 0xe70887
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]   CPU_Util[%]
 65536      2982900        0.00               11651.03             0.186417        0.00
---------------------------------------------------------------------------------------
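For reference, the client half of the rdma_cm run above uses the same flags plus the server's IP (172.16.100.1, taken from the pool listing in the first message; adjust for your network). Since ib_write_bw needs real RDMA hardware, this sketch only assembles and echoes the command rather than executing it:

```shell
# Client side of the rdma_cm-based test; -R tells perftest to set up
# its QPs through rdma_cm, the same path gluster's rdma transport uses.
cmd="ib_write_bw -R -D 30 --cpu_util 172.16.100.1"
echo "$cmd"    # run this on the client node against the waiting server
```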