A Ghoshal
2015-Feb-04 21:36 UTC
[Gluster-users] In a replica 2 server, file-updates on one server missing on the other server #Personal#
Sorry for spamming you guys, but this is kind of important for me to debug, so if you saw anything like this before, do let me know. Here's an update: It seems the mount client is attempting connection with an invalid port number. 49175 is NOT the port number of glusterfsd on serv1 (192.168.24.8). I got me an strace: [pid 31026] open("/proc/sys/net/ipv4/ip_local_reserved_ports", O_RDONLY) = -1 ENOENT (No such file or directory) [pid 31026] write(4, "[2015-02-04 20:39:02.793154] W ["..., 215) = 215 [pid 31026] write(4, "[2015-02-04 20:39:02.793289] W ["..., 194) = 194 [pid 31026] bind(10, {sa_family=AF_INET, sin_port=htons(1023), sin_addr=inet_addr("192.168.24.80")}, 16) = 0 [pid 31026] fcntl(10, F_GETFL) = 0x2 (flags O_RDWR) [pid 31026] fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0 [pid 31026] connect(10, {sa_family=AF_INET, sin_port=htons(49175), sin_addr=inet_addr("192.168.24.81")}, 16) = -1 EINPROGRESS (Operation now in progress) [pid 31026] fcntl(10, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) [pid 31026] fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0 [pid 31026] epoll_ctl(3, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLPRI|EPOLLOUT, {u32=10, u64=8589934602}}) = 0 [pid 31026] nanosleep({1, 0}, <unfinished ...> [pid 31021] <... epoll_wait resumed> {{EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=10, u64=8589934602}}}, 257, 4294967295) = 1 [pid 31021] getsockopt(10, SOL_SOCKET, SO_ERROR, [29422518842425455], [4]) = 0 [pid 31021] shutdown(10, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected) Which basically told me that connection is attempted via a non-blocking socket at port 49175. The errno from the failure is -ECONNREFUSED, which is what is expected. 807 in socket.c (gdb) bt #0 __socket_connect_finish (this=0x6887a0) at socket.c:807 #1 socket_connect_finish (this=0x6887a0) at socket.c:2147 #2 0x00007fc863de4c04 in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x6887a0, poll_in=1, poll_out=4, poll_err=<value optimized out>) at socket.c:2223 #3 0x00007fc867f7919f in event_dispatch_epoll_handler (event_pool=0x62db70) at event-epoll.c:384 #4 event_dispatch_epoll (event_pool=0x62db70) at event-epoll.c:445 #5 0x0000000000406b06 in main (argc=4, argv=0x7fff25302c38) at glusterfsd.c:1934 (gdb) print *optval Cannot access memory at address 0x6f (gdb) print optval $1 = 111 Note that this agrees with the following debug log: [2015-02-03 12:11:33.833647] D [socket.c:1962:__socket_proto_state_machine] 0-replicated_vol-1: reading from socket failed. Error (No data available), peer (192.168.24.81:49175) There is, of course no service running on port 49175. In fact, listen port for corresponding glusterd on serv1 is 49206. Where does the mount client pick this port number from? I know that if I kill and restart the mount client on serv0 from command line, then the problem will disappear. So, it's not something that is up with the processes on serv1... Thanks, Anirban From: A Ghoshal/MUM/TCS To: A Ghoshal <a.ghoshal at tcs.com> Cc: gluster-users at gluster.org, gluster-users-bounces at gluster.org, Pranith Kumar Karampuri <pkarampu at redhat.com> Date: 02/05/2015 02:03 AM Subject: Re: [Gluster-users] In a replica 2 server, file-updates on one server missing on the other server #Personal# Sent by: A Ghoshal Ok, more updates here: I turned on trace and it seems bind to secure port on the mount client with the remote brick is successful - afterwards the connect() fails to complete. I saw these logs: [2015-02-03 12:11:33.832615] T [rpc-clnt.c:422:rpc_clnt_reconnect] 0-replicated_vol-1: attempting reconnect [2015-02-03 12:11:33.832666] D [name.c:155:client_fill_address_family] 0-replicated_vol-1: address-family not specified, guessing it to be inet from (remote-host: serv1) [2015-02-03 12:11:33.832683] T [name.c:225:af_inet_client_get_remote_sockaddr] 0-replicated_vol-1: option remote-port missing in volume replicated_vol-1. Defaulting to 24007 [2015-02-03 12:11:33.833083] D [common-utils.c:237:gf_resolve_ip6] 0-resolver: returning ip-192.168.24.81 (port-24007) for hostname: serv1 and port: 24007 [2015-02-03 12:11:33.833113] T [socket.c:731:__socket_nodelay] 0-replicated_vol-1: NODELAY enabled for socket 10 [2015-02-03 12:11:33.833128] T [socket.c:790:__socket_keepalive] 0-replicated_vol-1: Keep-alive enabled for socket 10, interval 2, idle: 20 [2015-02-03 12:11:33.833188] W [common-utils.c:2247:gf_get_reserved_ports] 0-glusterfs: could not open the file /proc/sys/net/ipv4/ip_local_reserved_ports for getting reserved ports info (No such file or directory) [2015-02-03 12:11:33.833204] W [common-utils.c:2280:gf_process_reserved_ports] 0-glusterfs: Not able to get reserved ports, hence there is a possibility that glusterfs may consume reserved port [2015-02-03 12:11:33.833560] D [socket.c:605:__socket_shutdown] 0-replicated_vol-1: shutdown() returned -1. Transport endpoint is not connected [2015-02-03 12:11:33.833594] T [rpc-clnt.c:519:rpc_clnt_connection_cleanup] 0-replicated_vol-1: cleaning up state in transport object 0x6887a0 [2015-02-03 12:11:33.833622] D [socket.c:486:__socket_rwv] 0-replicated_vol-1: EOF on socket [2015-02-03 12:11:33.833635] W [socket.c:514:__socket_rwv] 0-replicated_vol-1: readv failed (No data available) [2015-02-03 12:11:33.833647] D [socket.c:1962:__socket_proto_state_machine] 0-replicated_vol-1: reading from socket failed. Error (No data available), peer (192.168.24.81:49175) [2015-02-03 12:11:33.833659] D [socket.c:2236:socket_event_handler] 0-transport: disconnecting now [2015-02-03 12:11:33.833684] T [rpc-clnt.c:519:rpc_clnt_connection_cleanup] 0-replicated_vol-1: cleaning up state in transport object 0x6887a0 After that, I attached the mount process to my gdb session and attached a breakpoint at __socket_shutdown(). It gave me the following traceback Breakpoint 1, __socket_shutdown (this=0x68a7f0) at socket.c:594 594 socket.c: No such file or directory. in socket.c (gdb) bt #0 __socket_shutdown (this=0x68a7f0) at socket.c:594 #1 0x00007f3391906553 in __socket_disconnect (this=0x68a7f0) at socket.c:623 #2 0x00007f339190856a in socket_connect_finish (this=0x68a7f0) at socket.c:2160 #3 0x00007f339190dc04 in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x68a7f0, poll_in=1, poll_out=4, poll_err=<value optimized out>) at socket.c:2223 #4 0x00007f3395aa219f in event_dispatch_epoll_handler (event_pool=0x62d8b0) at event-epoll.c:384 #5 event_dispatch_epoll (event_pool=0x62d8b0) at event-epoll.c:445 #6 0x0000000000406b06 in main (argc=4, argv=0x7fff71108c38) at glusterfsd.c:1934 (gdb) In case somebody could give me a clue or two on my next move (I'm looking at the code in the meantime), I'd be very grateful. Thanks, Anirban From: A Ghoshal <a.ghoshal at tcs.com> To: Pranith Kumar Karampuri <pkarampu at redhat.com> Cc: gluster-users at gluster.org Date: 02/03/2015 05:24 AM Subject: Re: [Gluster-users] In a replica 2 server, file-updates on one server missing on the other server #Personal# Sent by: gluster-users-bounces at gluster.org Hi Pranith, I finally understood what you meant the secure ports, because the issue occurred in one of our setups once more. It seems one of the clients on serv1 could not open a connection to the glusterfsd running on serv0. I'd actually started a mail trail about it (believing it might be something else, initially) here: http://www.gluster.org/pipermail/gluster-users/2015-February/020465.html I think I can write me a rudimentary kind of patch altering af_inet_bind_to_port_lt_ceiling() to get it to call bind with port 0, rather than specify a port explicitly when the client.bind-insecure is specified ... Then I'd need to create a way to set server.allow-insecure using the cli (or if you already sent around the patch to do that like you said in the earlier mail, do let me know). I'll keep you posted about it round here or @ [gluster-devel] if I can get it to work. Thanks a lot, Anirban From: A Ghoshal/MUM/TCS To: Pranith Kumar Karampuri <pkarampu at redhat.com> Cc: gluster-users at gluster.org, Niels de Vos <ndevos at redhat.com> Date: 01/23/2015 02:45 PM Subject: Re: [Gluster-users] In a replica 2 server, file-updates on one server missing on the other server #Personal# Sent by: A Ghoshal Oh, I didn't I only read a fragment of the IRC log and assumed --xlator-option would be enough. Apparently it's a lot more work.... I do have a query, though. These connections, from one of our setups, are these on secure ports? Or, maybe I didn't get it the first time..... root at serv0:/root> ps -ef | grep replicated_vol root 8851 25307 0 10:03 pts/2 00:00:00 grep replicated_vol root 29751 1 4 Jan21 ? 01:47:20 /usr/sbin/glusterfsd -s serv0 --volfile-id replicated_vol.serv0.mnt-bricks-replicated_vol-brick -p /var/lib/glusterd/vols/_replicated_vol/run/serv0-mnt-bricks-replicated_vol-brick.pid -S /var/run/dff9fa3c93e82f20103f2a3d91adc4a8.socket --brick-name /mnt/bricks/replicated_vol/brick -l /var/log/glusterfs/bricks/mnt-bricks-replicated_vol-brick.log --xlator-option *-posix.glusterd-uuid=1a1d1ebc-4b92-428f-b66b-9c5efa49574d --brick-port 49185 --xlator-option replicated_vol-server.listen-port=49185 root 30399 1 0 Jan21 ? 00:19:06 /usr/sbin/glusterfs --volfile-id=replicated_vol --volfile-server=serv0 /mnt/replicated_vol root at serv0:/root> netstat -p | grep 30399 tcp 0 0 serv0:969 serv0:49185 ESTABLISHED 30399/glusterfs tcp 0 0 serv0:999 serv1:49159 ESTABLISHED 30399/glusterfs tcp 0 0 serv0:1023 serv0:24007 ESTABLISHED 30399/glusterfs root at serv0:/root> Thanks again, Anirban From: Pranith Kumar Karampuri <pkarampu at redhat.com> To: A Ghoshal <a.ghoshal at tcs.com> Cc: gluster-users at gluster.org, Niels de Vos <ndevos at redhat.com> Date: 01/23/2015 01:58 PM Subject: Re: [Gluster-users] In a replica 2 server, file-updates on one server missing on the other server #Personal# On 01/23/2015 01:54 PM, A Ghoshal wrote: Thanks a lot, Pranith. We'll set this option on our test servers and keep the setup under observation. How did you get the bind-insecure option working? I guess I will send a patch to make it 'volume set option' Pranith Thanks, Anirban From: Pranith Kumar Karampuri <pkarampu at redhat.com> To: A Ghoshal <a.ghoshal at tcs.com> Cc: gluster-users at gluster.org, Niels de Vos <ndevos at redhat.com> Date: 01/23/2015 01:28 PM Subject: Re: [Gluster-users] In a replica 2 server, file-updates on one server missing on the other server #Personal# On 01/22/2015 02:07 PM, A Ghoshal wrote: Hi Pranith, Yes, the very same (chalcogen_eg_oxygen at yahoo.com). Justin Clift sent me a mail a while back telling me that it is better if we all use our business email addresses so I made me a new profile. Glusterfs complains about /proc/sys/net/ipv4/ip_local_reserved_ports because we use a really old Linux kernel (2.6.34) wherein this feature is not present. We plan to upgrade our Linux so often but each time we are dissuaded from it by some compatibility issue or the other. So, we get this log every time - on both good volumes and bad ones. What bothered me was this (on serv1): Basically to make the connections to servers i.e. bricks clients need to choose secure ports i.e. port less than 1024. Since this file is not present, it is not binding to any port as per the code I just checked. There is an option called client-bind-insecure which bypasses this check. I feel that is one (probably only way) to get around this. You have to "volume set server.allow-insecure on" option and bind-insecure option. CC ndevos who seemed to have helped someone set bind-insecure option correctly here (http://irclog.perlgeek.de/gluster/2014-04-09/text) Pranith [2015-01-20 09:37:49.151744] T [rpc-clnt.c:1182:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 456, payload: 360, rpc hdr: 96 [2015-01-20 09:37:49.151780] T [rpc-clnt.c:1499:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x39620x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (replicated_vol-client-0) [2015-01-20 09:37:49.151810] T [rpc-clnt.c:1302:rpc_clnt_record] 0-replicated_vol-client-1: Auth Info: pid: 7599, uid: 0, gid: 0, owner: 0000000000000000 [2015-01-20 09:37:49.151824] T [rpc-clnt.c:1182:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 456, payload: 360, rpc hdr: 96 [2015-01-20 09:37:49.151889] T [rpc-clnt.c:1499:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x39563x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (replicated_vol-client-1) [2015-01-20 09:37:49.152239] T [rpc-clnt.c:669:rpc_clnt_reply_init] 0-replicated_vol-client-1: received rpc message (RPC XID: 0x39563x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) from rpc-transport (replicated_vol-client-1) [2015-01-20 09:37:49.152484] T [rpc-clnt.c:669:rpc_clnt_reply_init] 0-replicated_vol-client-0: received rpc message (RPC XID: 0x39620x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) from rpc-transport (replicated_vol-client-0) When I write on the good server (serv1), we see that an RPC request is sent to both client-0 and client-1. While, when I write on the bad server (serv0), the RPC request is sent only to client-0, which is why it is no wonder that the writes are not synced over to serv1. Somehow I could not make the daemon on serv0 understand that there are two up-children and not just one. One additional detail - since we are using a kernel that's too old, we do not have the (Anand Avati's?) FUse readdirplus patches, either. I've noticed that the fixes in the readdirplus version of glusterfs aren't always guaranteed to be present on the non-readdirplus version of the patches. I'd filed a bug around one such anomaly back, but never got around to writing a patch for it (sorry!) Here it is: https://bugzilla.redhat.com/show_bug.cgi?id=1062287 I don't this has anything to do with readdirplus. Maybe something on similar lines here? Thanks, Anirban P.s. Please ignore the #Personal# in the subject line - we need to do that to push mails to the public domain past the email filter safely. From: Pranith Kumar Karampuri <pkarampu at redhat.com> To: A Ghoshal <a.ghoshal at tcs.com>, gluster-users at gluster.org Date: 01/22/2015 12:09 AM Subject: Re: [Gluster-users] In a replica 2 server, file-updates on one server missing on the other server hi, Responses inline. PS: You are chalkogen_oxygen? Pranith On 01/20/2015 05:34 PM, A Ghoshal wrote: Hello, I am using the following replicated volume: root at serv0:~> gluster v info replicated_vol Volume Name: replicated_vol Type: Replicate Volume ID: 26d111e3-7e4c-479e-9355-91635ab7f1c2 Status: Started Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: serv0:/mnt/bricks/replicated_vol/brick Brick2: serv1:/mnt/bricks/replicated_vol/brick Options Reconfigured: diagnostics.client-log-level: INFO network.ping-timeout: 10 nfs.enable-ino32: on cluster.self-heal-daemon: on nfs.disable: off replicated_vol is mounted at /mnt/replicated_vol on both serv0 and serv1. If I do the following on serv0: root at serv0:~>echo "cranberries" > /mnt/replicated_vol/testfile root at serv0:~>echo "tangerines" >> /mnt/replicated_vol/testfile And then I check for the state of the replicas in the bricks, then I find that root at serv0:~>cat /mnt/bricks/replicated_vol/brick/testfile cranberries tangerines root at serv0:~> root at serv1:~>cat /mnt/bricks/replicated_vol/brick/testfile root at serv1:~> As may be seen, the replica on serv1 is blank, when I write into testfile from serv0 (even though the file is created on both bricks). Interestingly, if I write something to the file at serv1, then the two replicas become identical. root at serv1:~>echo "artichokes" >> /mnt/replicated_vol/testfile root at serv1:~>cat /mnt/bricks/replicated_vol/brick/testfile cranberries tangerines artichokes root at serv1:~> root at serv0:~>cat /mnt/bricks/replicated_vol/brick/testfile cranberries tangerines artichokes root at serv0:~> So, I dabbled into the logs a little bit, after upping the diagnostic level, and this is what I saw: When I write on serv0 (bad case): [2015-01-20 09:21:52.197704] T [fuse-bridge.c:546:fuse_lookup_resume] 0-glusterfs-fuse: 53027: LOOKUP /testfl(f0a76987-8a42-47a2-b027-a823254b736b) [2015-01-20 09:21:52.197959] D [afr-common.c:131:afr_lookup_xattr_req_prepare] 0-replicated_vol-replicate-0: /testfl: failed to get the gfid from dict [2015-01-20 09:21:52.198006] T [rpc-clnt.c:1302:rpc_clnt_record] 0-replicated_vol-client-0: Auth Info: pid: 28151, uid: 0, gid: 0, owner: 0000000000000000 [2015-01-20 09:21:52.198024] T [rpc-clnt.c:1182:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 456, payload: 360, rpc hdr: 96 [2015-01-20 09:21:52.198108] T [rpc-clnt.c:1499:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x78163x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (replicated_vol-client-0) [2015-01-20 09:21:52.198565] T [rpc-clnt.c:669:rpc_clnt_reply_init] 0-replicated_vol-client-0: received rpc message (RPC XID: 0x78163x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) from rpc-transport (replicated_vol-client-0) [2015-01-20 09:21:52.198640] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-replicated_vol-replicate-0: pending_matrix: [ 0 3 ] [2015-01-20 09:21:52.198669] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-replicated_vol-replicate-0: pending_matrix: [ 0 0 ] [2015-01-20 09:21:52.198681] D [afr-self-heal-common.c:887:afr_mark_sources] 0-replicated_vol-replicate-0: Number of sources: 1 [2015-01-20 09:21:52.198694] D [afr-self-heal-data.c:825:afr_lookup_select_read_child_by_txn_type] 0-replicated_vol-replicate-0: returning read_child: 0 [2015-01-20 09:21:52.198705] D [afr-common.c:1380:afr_lookup_select_read_child] 0-replicated_vol-replicate-0: Source selected as 0 for /testfl [2015-01-20 09:21:52.198720] D [afr-common.c:1117:afr_lookup_build_response_params] 0-replicated_vol-replicate-0: Building lookup response from 0 [2015-01-20 09:21:52.198732] D [afr-common.c:1732:afr_lookup_perform_self_heal] 0-replicated_vol-replicate-0: Only 1 child up - do not attempt to detect self heal When I write on serv1 (good case): [2015-01-20 09:37:49.151506] T [fuse-bridge.c:546:fuse_lookup_resume] 0-glusterfs-fuse: 31212: LOOKUP /testfl(f0a76987-8a42-47a2-b027-a823254b736b) [2015-01-20 09:37:49.151683] D [afr-common.c:131:afr_lookup_xattr_req_prepare] 0-replicated_vol-replicate-0: /testfl: failed to get the gfid from dict [2015-01-20 09:37:49.151726] T [rpc-clnt.c:1302:rpc_clnt_record] 0-replicated_vol-client-0: Auth Info: pid: 7599, uid: 0, gid: 0, owner: 0000000000000000 [2015-01-20 09:37:49.151744] T [rpc-clnt.c:1182:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 456, payload: 360, rpc hdr: 96 [2015-01-20 09:37:49.151780] T [rpc-clnt.c:1499:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x39620x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (replicated_vol-client-0) [2015-01-20 09:37:49.151810] T [rpc-clnt.c:1302:rpc_clnt_record] 0-replicated_vol-client-1: Auth Info: pid: 7599, uid: 0, gid: 0, owner: 0000000000000000 [2015-01-20 09:37:49.151824] T [rpc-clnt.c:1182:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 456, payload: 360, rpc hdr: 96 [2015-01-20 09:37:49.151889] T [rpc-clnt.c:1499:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x39563x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (replicated_vol-client-1) [2015-01-20 09:37:49.152239] T [rpc-clnt.c:669:rpc_clnt_reply_init] 0-replicated_vol-client-1: received rpc message (RPC XID: 0x39563x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) from rpc-transport (replicated_vol-client-1) [2015-01-20 09:37:49.152484] T [rpc-clnt.c:669:rpc_clnt_reply_init] 0-replicated_vol-client-0: received rpc message (RPC XID: 0x39620x Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) from rpc-transport (replicated_vol-client-0) [2015-01-20 09:37:49.152582] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-replicated_vol-replicate-0: pending_matrix: [ 0 3 ] [2015-01-20 09:37:49.152596] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-replicated_vol-replicate-0: pending_matrix: [ 0 0 ] [2015-01-20 09:37:49.152621] D [afr-self-heal-common.c:887:afr_mark_sources] 0-replicated_vol-replicate-0: Number of sources: 1 [2015-01-20 09:37:49.152633] D [afr-self-heal-data.c:825:afr_lookup_select_read_child_by_txn_type] 0-replicated_vol-replicate-0: returning read_child: 0 [2015-01-20 09:37:49.152644] D [afr-common.c:1380:afr_lookup_select_read_child] 0-replicated_vol-replicate-0: Source selected as 0 for /testfl [2015-01-20 09:37:49.152657] D [afr-common.c:1117:afr_lookup_build_response_params] 0-replicated_vol-replicate-0: Building lookup response from 0 We see that when you write on serv1, the RPC request is sent to both replicated_vol-client-0 and replicated_vol-client-1, while when we write on serv0, the request is sent only to replicated_vol-client-0, and the FUse client is unaware of the presence of client-1 in the latter case. I checked a bit more in the logs. When I turn on my trace, I found many instances of these logs on serv0 but NOT on serv1: [2015-01-20 09:21:15.520784] T [fuse-bridge.c:681:fuse_attr_cbk] 0-glusterfs-fuse: 53011: LOOKUP() / => 1 [2015-01-20 09:21:17.683088] T [rpc-clnt.c:422:rpc_clnt_reconnect] 0-replicated_vol-client-1: attempting reconnect [2015-01-20 09:21:17.683159] D [name.c:155:client_fill_address_family] 0-replicated_vol-client-1: address-family not specified, guessing it to be inet from (remote-host: serv1) [2015-01-20 09:21:17.683178] T [name.c:225:af_inet_client_get_remote_sockaddr] 0-replicated_vol-client-1: option remote-port missing in volume replicated_vol-client-1. Defaulting to 24007 [2015-01-20 09:21:17.683191] T [common-utils.c:188:gf_resolve_ip6] 0-resolver: flushing DNS cache [2015-01-20 09:21:17.683202] T [common-utils.c:195:gf_resolve_ip6] 0-resolver: DNS cache not present, freshly probing hostname: serv1 [2015-01-20 09:21:17.683814] D [common-utils.c:237:gf_resolve_ip6] 0-resolver: returning ip-192.168.24.81 (port-24007) for hostname: serv1 and port: 24007 [2015-01-20 09:21:17.684139] D [common-utils.c:257:gf_resolve_ip6] 0-resolver: next DNS query will return: ip-192.168.24.81 port-24007 [2015-01-20 09:21:17.684164] T [socket.c:731:__socket_nodelay] 0-replicated_vol-client-1: NODELAY enabled for socket 10 [2015-01-20 09:21:17.684177] T [socket.c:790:__socket_keepalive] 0-replicated_vol-client-1: Keep-alive enabled for socket 10, interval 2, idle: 20 [2015-01-20 09:21:17.684236] W [common-utils.c:2247:gf_get_reserved_ports] 0-glusterfs: could not open the file /proc/sys/net/ipv4/ip_local_reserved_ports for getting reserved ports info (No such file or directory) [2015-01-20 09:21:17.684253] W [common-utils.c:2280:gf_process_reserved_ports] 0-glusterfs: Not able to get reserved ports, hence there is a possibility that glusterfs may consume reserved port Logs above suggest that mount process couldn't assign a reserved port because it couldn't find the file /proc/sys/net/ipv4/ip_local_reserved_ports I guess reboot of the machine fixed it. Wonder why it was not found in the first place. Pranith. [2015-01-20 09:21:17.684660] D [socket.c:605:__socket_shutdown] 0-replicated_vol-client-1: shutdown() returned -1. Transport endpoint is not connected [2015-01-20 09:21:17.684699] T [rpc-clnt.c:519:rpc_clnt_connection_cleanup] 0-replicated_vol-client-1: cleaning up state in transport object 0x68a630 [2015-01-20 09:21:17.684731] D [socket.c:486:__socket_rwv] 0-replicated_vol-client-1: EOF on socket [2015-01-20 09:21:17.684750] W [socket.c:514:__socket_rwv] 0-replicated_vol-client-1: readv failed (No data available) [2015-01-20 09:21:17.684766] D [socket.c:1962:__socket_proto_state_machine] 0-replicated_vol-client-1: reading from socket failed. Error (No data available), peer (192.168.24.81:49198) I could not find a 'remote-port' option in /var/lib/glusterd on either peer. Could somebody tell me where this configuration is looked up from? Also, sometime later, I rebooted serv0 and that seemed to solve the problem. However, stop+start of replicated_vol and restart of /etc/init.d/glusterd did NOT solve the problem. Ignore that log. If no port is given in that volfile, it picks 24007 as the port, which is the default port where glusterd 'listens' Any help on this matter will be greatly appreciated as I need to provide robustness assurances for our setup. Thanks a lot, Anirban P.s. Additional details: glusterfs version: 3.4.2 Linux kernel version: 2.6.34 =====-----=====-----====Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org http://www.gluster.org/mailman/listinfo/gluster-users _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org http://www.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://www.gluster.org/pipermail/gluster-users/attachments/20150205/06ccc444/attachment.html>