Hi there,

I compiled and installed the latest version of Gluster on a couple of SLES 11 SP1 boxes, and everything up to this point seemed OK. I started the daemon on both boxes, and both are listening on 24007.

When I issue a "gluster peer probe" command on one of the boxes, the daemon instantly dies. After I restart it, it shows:

# gluster peer status
Number of Peers: 1

Hostname: mckalcpap02
Uuid: 00000000-0000-0000-0000-000000000000
State: Establishing Connection (Connected)

I then attempted to run the probe on the other box, and the daemon crashed again. Now, whenever I start the daemon on either box, the daemon on the other box crashes.

The log output immediately prior to the crash is as follows:

[2011-06-07 08:05:10.700710] I [glusterd-handler.c:623:glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req mckalcpap02 24007
[2011-06-07 08:05:10.701058] I [glusterd-handler.c:391:glusterd_friend_find] 0-glusterd: Unable to find hostname: mckalcpap02
[2011-06-07 08:05:10.701086] I [glusterd-handler.c:3422:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: mckalcpap02 (24007)
[2011-06-07 08:05:10.702832] I [glusterd-handler.c:3404:glusterd_friend_add] 0-glusterd: connect returned 0
[2011-06-07 08:05:10.703110] I [glusterd-handshake.c:317:glusterd_set_clnt_mgmt_program] 0-: Using Program glusterd clnt mgmt, Num (1238433), Version (1)

If I use the IP address, the same thing happens:

[2011-06-07 08:07:12.873075] I [glusterd-handler.c:623:glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req 10.9.54.2 24007
[2011-06-07 08:07:12.873410] I [glusterd-handler.c:391:glusterd_friend_find] 0-glusterd: Unable to find hostname: 10.9.54.2
[2011-06-07 08:07:12.873438] I [glusterd-handler.c:3422:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: 10.9.54.2 (24007)
[2011-06-07 08:07:12.875046] I [glusterd-handler.c:3404:glusterd_friend_add] 0-glusterd: connect returned 0
[2011-06-07 08:07:12.875280] I [glusterd-handshake.c:317:glusterd_set_clnt_mgmt_program] 0-: Using Program glusterd clnt mgmt, Num (1238433), Version (1)

There is no firewall issue:

# telnet mckalcpap02 24007
Trying 10.9.54.2...
Connected to mckalcpap02.
Escape character is '^]'.

Following a restart (which crashes the other node), the log output is as follows:

[2011-06-07 08:10:09.616486] I [glusterd.c:564:init] 0-management: Using /etc/glusterd as working directory
[2011-06-07 08:10:09.617619] C [rdma.c:3933:rdma_init] 0-rpc-transport/rdma: Failed to get IB devices
[2011-06-07 08:10:09.617676] E [rdma.c:4812:init] 0-rdma.management: Failed to initialize IB Device
[2011-06-07 08:10:09.617700] E [rpc-transport.c:741:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2011-06-07 08:10:09.617724] W [rpcsvc.c:1288:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2011-06-07 08:10:09.617830] I [glusterd.c:88:glusterd_uuid_init] 0-glusterd: retrieved UUID: 1e344f5d-6904-4d14-9be2-8f0f44b97dd7
[2011-06-07 08:10:11.258098] I [glusterd-handler.c:3404:glusterd_friend_add] 0-glusterd: connect returned 0
Given volfile:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option working-directory /etc/glusterd
  4:     option transport-type socket,rdma
  5:     option transport.socket.keepalive-time 10
  6:     option transport.socket.keepalive-interval 2
  7: end-volume
  8:
+------------------------------------------------------------------------------+
[2011-06-07 08:10:11.258431] I [glusterd-handshake.c:317:glusterd_set_clnt_mgmt_program] 0-: Using Program glusterd clnt mgmt, Num (1238433), Version (1)
[2011-06-07 08:10:11.280533] W [socket.c:1494:__socket_proto_state_machine] 0-socket.management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.9.54.2:1023)
[2011-06-07 08:10:11.280595] W [socket.c:1494:__socket_proto_state_machine] 0-management: reading from socket failed. Error (Transport endpoint is not connected), peer (10.9.54.2:24007)
[2011-06-07 08:10:17.256235] E [socket.c:1685:socket_connect_finish] 0-management: connection to 10.9.54.2:24007 failed (Connection refused)

There are no logs on the node which crashes.

I've tried various possible solutions found by searching the net but am not getting anywhere. Can anyone advise how to proceed?

Thanks,
Phil.

-- 
Phil Bayfield
Development Manager
Alchemy Social, part of Techlightenment, an Experian company
Office 202 | 89 Worship Street | London | EC2A 2BF
t: +44 (0) 207 392 2618
m: +44 (0) 7825 561 091
e: phil at techlightenment.com
skype: phil.tl
www.techlightenment.com
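
P.S. In case it helps anyone repeat the connectivity check without telnet, here is a minimal Python sketch of the same test. The function name port_is_open and the 3-second default timeout are my own illustrative choices, not anything from Gluster. It only confirms that a TCP handshake to the management port completes, and says nothing about why glusterd crashes afterwards.

```python
import socket


def port_is_open(host, port=24007, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    port_is_open is a hypothetical helper name; 24007 is the glusterd
    management port quoted in the logs above.
    """
    try:
        # create_connection resolves the name and completes the TCP
        # handshake, which is all the telnet test verifies.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Running port_is_open("mckalcpap02") from each box in turn should distinguish "the peer's daemon is down" (False, matching the "Connection refused" log line) from "the peer is up but the probe still kills it" (True).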