Anatoliy Dmytriyev
2018-May-04 11:43 UTC
[Gluster-users] Crashing applications, RDMA_ERROR in logs
Hello gluster users and professionals,

We are running a gluster 3.10.10 distributed volume (9 nodes) using the RDMA
transport. From time to time applications crash with I/O errors (they cannot
access a file), and in the client logs we see messages like:

[2018-05-04 10:00:43.467490] W [MSGID: 114031] [client-rpc-fops.c:2640:client3_3_readdirp_cbk] 0-gv0-client-2: remote operation failed [Transport endpoint is not connected]
[2018-05-04 10:00:43.467585] W [MSGID: 103046] [rdma.c:3603:gf_rdma_decode_header] 0-rpc-transport/rdma: received a msg of type RDMA_ERROR
[2018-05-04 10:00:43.467601] W [MSGID: 103046] [rdma.c:4055:gf_rdma_process_recv] 0-rpc-transport/rdma: peer (192.168.2.104:49152), couldn't encode or decode the msg properly or write chunks were not provided for replies that were bigger than RDMA_INLINE_THRESHOLD (2048)

At the same time, the brick logs on the gluster nodes show:

[2018-05-04 10:00:43.468470] W [MSGID: 103027] [rdma.c:2498:__gf_rdma_send_reply_type_nomsg] 0-rpc-transport/rdma: encoding write chunks failed

The gluster volume is mounted with the options
"backupvolfile-server=cn03-ib,transport=rdma,log-level=WARNING".

The same applications run perfectly on a non-gluster filesystem. Could you
please help us debug and fix this?

# gluster volume status gv0
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick cn01-ib:/gfs/gv0/brick1/brick         0         49152      Y       3984
Brick cn02-ib:/gfs/gv0/brick1/brick         0         49152      Y       3352
Brick cn03-ib:/gfs/gv0/brick1/brick         0         49152      Y       3333
Brick cn04-ib:/gfs/gv0/brick1/brick         0         49152      Y       3079
Brick cn05-ib:/gfs/gv0/brick1/brick         0         49152      Y       3093
Brick cn06-ib:/gfs/gv0/brick1/brick         0         49152      Y       3148
Brick cn07-ib:/gfs/gv0/brick1/brick         0         49152      Y       2995
Brick cn08-ib:/gfs/gv0/brick1/brick         0         49152      Y       3107
Brick cn09-ib:/gfs/gv0/brick1/brick         0         49152      Y       3014

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume info gv0
Volume Name: gv0
Type: Distribute
Volume ID: 5ee4b6a4-b8d2-4795-919f-c992b95d6221
Status: Started
Snapshot Count: 0
Number of Bricks: 9
Transport-type: rdma
Bricks:
Brick1: cn01-ib:/gfs/gv0/brick1/brick
Brick2: cn02-ib:/gfs/gv0/brick1/brick
Brick3: cn03-ib:/gfs/gv0/brick1/brick
Brick4: cn04-ib:/gfs/gv0/brick1/brick
Brick5: cn05-ib:/gfs/gv0/brick1/brick
Brick6: cn06-ib:/gfs/gv0/brick1/brick
Brick7: cn07-ib:/gfs/gv0/brick1/brick
Brick8: cn08-ib:/gfs/gv0/brick1/brick
Brick9: cn09-ib:/gfs/gv0/brick1/brick
Options Reconfigured:
performance.cache-size: 1GB
server.event-threads: 8
client.event-threads: 8
cluster.nufa: on
performance.readdir-ahead: on
performance.parallel-readdir: on
nfs.disable: on

--
Best regards,
Anatoliy
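[Editor's note: a minimal troubleshooting sketch for isolating the RDMA transport
as the cause. The mount point /mnt/gv0 and the primary volfile server cn01-ib are
assumptions for illustration; substitute your own values. Switching the volume's
transport requires a stop/start, i.e. downtime.]

# Remount the client with a more verbose log level so the next
# RDMA_ERROR / "encoding write chunks failed" event is captured with
# full context (mount point and primary server are assumed here).
umount /mnt/gv0
mount -t glusterfs \
      -o backupvolfile-server=cn03-ib,transport=rdma,log-level=DEBUG \
      cn01-ib:/gv0 /mnt/gv0

# To test whether the crashes are tied to the RDMA transport, let the
# volume listen on TCP as well, then remount the client over TCP.
gluster volume stop gv0
gluster volume set gv0 config.transport tcp,rdma
gluster volume start gv0
mount -t glusterfs \
      -o backupvolfile-server=cn03-ib,transport=tcp \
      cn01-ib:/gv0 /mnt/gv0

If the same workload runs cleanly over TCP but still fails over RDMA, that points
at the RDMA write-chunk path rather than the volume layout or the applications.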