Hans Höök
2014-Jan-21 14:20 UTC
[Gluster-users] Problem when rebooting geo-replication slave
Hi list,

I have a problem when a geo-replication slave has to be rebooted. After the reboot the slave is out of sync and the gluster daemon fails to even start. I have a workaround procedure that seems to work, but I suspect I am doing something wrong or missing something.

I am currently using gluster 3.4.0 with the following setup:

  Two replicating masters: fe and ni
  One geo-replicating slave with periodic snapshots in zfs: nitinol

From master fe I have successfully set up geo-replication with:

  gluster volume geo-replication gvarchive nitinol:/zfspool/gluster/gvarchive start

All is fine... not really...

When slave nitinol is rebooted it becomes broken.

  service glusterfs-server start

This fails; the daemon does not start, and the log shows the following entries:

[2014-01-21 13:14:53.352007] I [glusterfsd.c:1910:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.4.2 (/usr/sbin/glusterd -p /var/run/glusterd.pid)
[2014-01-21 13:14:53.354316] I [glusterd.c:961:init] 0-management: Using /var/lib/glusterd as working directory
[2014-01-21 13:14:53.356431] I [socket.c:3480:socket_init] 0-socket.management: SSL support is NOT enabled
[2014-01-21 13:14:53.356490] I [socket.c:3495:socket_init] 0-socket.management: using system polling thread
[2014-01-21 13:14:53.357999] W [rdma.c:4197:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed (No such device)
[2014-01-21 13:14:53.358055] E [rdma.c:4485:init] 0-rdma.management: Failed to initialize IB Device
[2014-01-21 13:14:53.358136] E [rpc-transport.c:320:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2014-01-21 13:14:53.358185] W [rpcsvc.c:1389:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2014-01-21 13:14:55.083839] I [glusterd-store.c:1339:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 2
[2014-01-21 13:14:55.092907] E [glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: brick-0
[2014-01-21 13:14:55.093002] E [glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: brick-1
.....
[2014-01-21 13:14:55.741895] E [glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: brick-0
[2014-01-21 13:14:55.741989] E [glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: brick-1
[2014-01-21 13:14:55.792063] I [glusterd-handler.c:2818:glusterd_friend_add] 0-management: connect returned 0
[2014-01-21 13:14:55.792258] I [rpc-clnt.c:962:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-01-21 13:14:55.792416] I [socket.c:3480:socket_init] 0-management: SSL support is NOT enabled
[2014-01-21 13:14:55.792443] I [socket.c:3495:socket_init] 0-management: using system polling thread
[2014-01-21 13:14:55.796485] E [glusterd-store.c:2487:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2014-01-21 13:14:55.796546] E [xlator.c:390:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2014-01-21 13:14:55.796574] E [graph.c:292:glusterfs_graph_init] 0-management: initializing translator failed
[2014-01-21 13:14:55.796596] E [graph.c:479:glusterfs_graph_activate] 0-graph: init failed
[2014-01-21 13:14:55.797136] W [glusterfsd.c:1002:cleanup_and_exit] (-->/usr/sbin/glusterd(main+0x3cd) [0x7f737c1fb85d] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xc0) [0x7f737c1fe650] (-->/usr/sbin/glusterd(glusterfs_process_volfp+0x103) [0x7f737c1fe553]))) 0-: received signum (0), shutting down

I have successfully corrected the situation with the following procedure:

  # on slave:
  rm -rf /var/lib/glusterd/vols

  # on master:
  gluster volume geo-replication gvarchive nitinol:/zfspool/gluster/gvarchive stop
  gluster peer detach nitinol

  # on slave:
  service glusterfs-server start

  # on master:
  gluster peer probe nitinol
  gluster volume geo-replication gvarchive nitinol:/zfspool/gluster/gvarchive start

This does not seem correct. Why do the volumes get out of sync?

Regards,
Hans Höök
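
For anyone triaging a similar startup failure, the error-level entries in a log like the one quoted in this post can be isolated with a small filter. This is only a sketch: it assumes the line layout seen in the post (date and time in the first two whitespace-separated fields, the severity letter in the third), and the function name gluster_errors is hypothetical, not part of gluster.

```shell
# Hypothetical helper: print only error-level ("E") entries from a glusterd log.
# Assumes the layout seen in the post: "[date time] LEVEL [source] message".
gluster_errors() {
    # With no argument awk reads stdin; otherwise it reads the named file(s).
    awk '$3 == "E"' "$@"
}
```

It can be given a log file as an argument or fed from a pipe. On the log in the post it would drop the I and W lines and leave the rdma, "Unknown key", "resolve brick failed" and init-failure entries.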