harry mangalam
2014-Jan-04 01:51 UTC
[Gluster-users] fractured/split glusterfs - 2 up, 2 down for an hour
This is a distributed-only glusterfs on 4 servers with 2 bricks each on an IPoIB network.

Thanks to a misconfigured autoupdate script, when 3.4.2 was released today, my gluster servers tried to update themselves. 2 succeeded, but then failed to restart, the other 2 failed to update and kept running.

Not realizing the sequence of events, I restarted the 2 that failed to restart, which gave my fs 2 servers running 3.4.1 and 2 running 3.4.2.

When I realized this after about 30m, I shut everything down and updated the 2 remaining to 3.4.2 and then restarted but now I'm getting lots of reports of file errors of the type 'endpoints not connected' and the like:

[2014-01-04 01:31:18.593547] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (00000000-0000-0000-0000-000000000000)
[2014-01-04 01:31:18.594928] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (00000000-0000-0000-0000-000000000000)
[2014-01-04 01:31:18.595818] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/.#test_cuffdiff.sh (14c3b612-e952-4aec-ae18-7f3dbb422dcc)
[2014-01-04 01:31:18.597381] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh (00000000-0000-0000-0000-000000000000)
[2014-01-04 01:31:18.598212] W [client-rpc-fops.c:814:client3_3_statfs_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected
[2014-01-04 01:31:18.598236] W [dht-diskusage.c:45:dht_du_info_cbk] 0-gl-dht: failed to get disk info from gl-client-2
[2014-01-04 01:31:19.912210] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-04 01:31:22.912717] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)
[2014-01-04 01:31:25.913208] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)

The servers at the same time provided the following error 'E' messages:

Fri Jan 03 17:46:42 [0.20 0.12 0.13] root@biostor1:~
1008 $ grep ' E ' /var/log/glusterfs/bricks/raid1.log | grep '2014-01-03'
[2014-01-03 06:11:36.251786] E [server-helpers.c:751:server_alloc_frame] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103) [0x3161e090d3] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x245) [0x3161e08f85] (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/server.so(server3_3_lookup+0xa0) [0x7fa60e577170]))) 0-server: invalid argument: conn
[2014-01-03 06:11:36.251813] E [rpcsvc.c:450:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2014-01-03 17:48:44.236127] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.1/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2014-01-03 19:15:26.643378] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory

The missing/misbehaving files /are/ accessible on the individual bricks but not thru gluster.

This is a distributed-only setup, not replicated, so it seems like the gluster volume heal <volume> is appropriate. Do the gluster wizards agree?
---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
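The client warnings in the post above all name the same translator (0-gl-client-2), which suggests a single brick connection is down rather than the whole volume. A minimal sketch for tallying such warnings per client translator, assuming log lines shaped like the ones quoted in the post; the function name count_disconnected and the example log path are hypothetical:

```shell
#!/bin/sh
# Tally "Transport endpoint is not connected" warnings per client translator.
# Reads a glusterfs client log on stdin and prints "<count> <translator>" lines.
# Assumes the translator name is a whitespace-separated field such as
# "0-gl-client-2:", as in the log lines quoted in the post.
count_disconnected() {
    awk '
        /Transport endpoint is not connected/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^0-.*-client-[0-9]+:$/) {
                    name = $i
                    sub(/:$/, "", name)   # drop the trailing colon
                    n[name]++
                }
            }
        }
        END { for (k in n) print n[k], k }
    ' | sort -rn
}

# Example (the log path is an assumption; use the client's actual mount log):
# count_disconnected < /var/log/glusterfs/gl.log
```

If one translator dominates the output, the brick behind it is the first place to look.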
Vijay Bellur
2014-Jan-04 17:15 UTC
[Gluster-users] fractured/split glusterfs - 2 up, 2 down for an hour
On 01/04/2014 07:21 AM, harry mangalam wrote:
> [...]
> [2014-01-03 17:48:44.236127] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.1/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
>
> [2014-01-03 19:15:26.643378] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory

rdma.so seems to be missing here. Is the glusterfs-rdma-3.4.2-1 rpm installed on the servers?

-Vijay
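The question about rdma.so can be checked directly on each server. A minimal sketch, assuming an RPM-based install and the /usr/lib64/glusterfs prefix that appears in the error messages; has_rdma_transport is a hypothetical helper name, not part of gluster:

```shell
#!/bin/sh
# Succeeds if the rdma transport library exists for the given glusterfs version.
# $1: glusterfs version, e.g. 3.4.2
# $2: library prefix (optional; the default is taken from the quoted errors)
has_rdma_transport() {
    version=$1
    prefix=${2:-/usr/lib64/glusterfs}
    [ -f "$prefix/$version/rpc-transport/rdma.so" ]
}

# On each server:
# rpm -q glusterfs-rdma     # is the package installed, and at which version?
# has_rdma_transport 3.4.2 && echo "rdma.so present" || echo "rdma.so missing"
```

Running both checks on all four servers would also show whether any of them is still on a different package version than the rest.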