Anatoliy Dmytriyev
2018-May-04 11:43 UTC
[Gluster-users] Crashing applications, RDMA_ERROR in logs
Hello gluster users and professionals,

We are running a gluster 3.10.10 distributed volume (9 nodes) using the RDMA
transport. From time to time applications crash with I/O errors (they cannot
access a file), and in the client logs we see messages like:

[2018-05-04 10:00:43.467490] W [MSGID: 114031] [client-rpc-fops.c:2640:client3_3_readdirp_cbk] 0-gv0-client-2: remote operation failed [Transport endpoint is not connected]
[2018-05-04 10:00:43.467585] W [MSGID: 103046] [rdma.c:3603:gf_rdma_decode_header] 0-rpc-transport/rdma: received a msg of type RDMA_ERROR
[2018-05-04 10:00:43.467601] W [MSGID: 103046] [rdma.c:4055:gf_rdma_process_recv] 0-rpc-transport/rdma: peer (192.168.2.104:49152), couldn't encode or decode the msg properly or write chunks were not provided for replies that were bigger than RDMA_INLINE_THRESHOLD (2048)

At the same time, the brick logs on the gluster nodes show:

[2018-05-04 10:00:43.468470] W [MSGID: 103027] [rdma.c:2498:__gf_rdma_send_reply_type_nomsg] 0-rpc-transport/rdma: encoding write chunks failed

The gluster volume is mounted with the options
"backupvolfile-server=cn03-ib,transport=rdma,log-level=WARNING".

The same applications run perfectly on a non-gluster filesystem. Could you
please help us debug and fix this?

# gluster volume status gv0
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick cn01-ib:/gfs/gv0/brick1/brick         0         49152      Y       3984
Brick cn02-ib:/gfs/gv0/brick1/brick         0         49152      Y       3352
Brick cn03-ib:/gfs/gv0/brick1/brick         0         49152      Y       3333
Brick cn04-ib:/gfs/gv0/brick1/brick         0         49152      Y       3079
Brick cn05-ib:/gfs/gv0/brick1/brick         0         49152      Y       3093
Brick cn06-ib:/gfs/gv0/brick1/brick         0         49152      Y       3148
Brick cn07-ib:/gfs/gv0/brick1/brick         0         49152      Y       2995
Brick cn08-ib:/gfs/gv0/brick1/brick         0         49152      Y       3107
Brick cn09-ib:/gfs/gv0/brick1/brick         0         49152      Y       3014

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks

# gluster volume info gv0
Volume Name: gv0
Type: Distribute
Volume ID: 5ee4b6a4-b8d2-4795-919f-c992b95d6221
Status: Started
Snapshot Count: 0
Number of Bricks: 9
Transport-type: rdma
Bricks:
Brick1: cn01-ib:/gfs/gv0/brick1/brick
Brick2: cn02-ib:/gfs/gv0/brick1/brick
Brick3: cn03-ib:/gfs/gv0/brick1/brick
Brick4: cn04-ib:/gfs/gv0/brick1/brick
Brick5: cn05-ib:/gfs/gv0/brick1/brick
Brick6: cn06-ib:/gfs/gv0/brick1/brick
Brick7: cn07-ib:/gfs/gv0/brick1/brick
Brick8: cn08-ib:/gfs/gv0/brick1/brick
Brick9: cn09-ib:/gfs/gv0/brick1/brick
Options Reconfigured:
performance.cache-size: 1GB
server.event-threads: 8
client.event-threads: 8
cluster.nufa: on
performance.readdir-ahead: on
performance.parallel-readdir: on
nfs.disable: on

--
Best regards,
Anatoliy
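[Editor's note: a minimal troubleshooting sketch for isolating the RDMA transport
as the cause. The mount point /mnt/gv0 and the primary volfile server cn01-ib are
assumptions for illustration; substitute your own values. Switching the volume's
transport requires a stop/start, i.e. downtime.]

# Remount the client with a more verbose log level so the next
# RDMA_ERROR / "encoding write chunks failed" event is captured with
# full context (mount point and primary server are assumed here).
umount /mnt/gv0
mount -t glusterfs \
      -o backupvolfile-server=cn03-ib,transport=rdma,log-level=DEBUG \
      cn01-ib:/gv0 /mnt/gv0

# To test whether the crashes are tied to the RDMA transport, let the
# volume listen on TCP as well, then remount the client over TCP.
gluster volume stop gv0
gluster volume set gv0 config.transport tcp,rdma
gluster volume start gv0
mount -t glusterfs \
      -o backupvolfile-server=cn03-ib,transport=tcp \
      cn01-ib:/gv0 /mnt/gv0

If the same workload runs cleanly over TCP but still fails over RDMA, that points
at the RDMA write-chunk path rather than the volume layout or the applications.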