Atin Mukherjee
2015-Mar-31 08:25 UTC
[Gluster-users] Initial mount problem - all subvolumes are down
On 03/31/2015 01:03 PM, Pranith Kumar Karampuri wrote:
> On 03/31/2015 12:53 PM, Atin Mukherjee wrote:
>> On 03/31/2015 12:27 PM, Pranith Kumar Karampuri wrote:
>>> Atin,
>>>       Could it be because bricks are started with PROC_START_NO_WAIT?
>> That's the correct analysis, Pranith. The mount was attempted before the
>> bricks were started. If we can have a time lag of a few seconds between
>> mount and volume start, the problem will go away.
> Atin,
>       I think one way to solve this issue is to start the bricks with
> NO_WAIT so that we can handle pmap-signin, but wait for the pmap-signins
> to complete before responding to the cli/completing 'init'?

Logically it should solve the problem. We need to think it through more
from the existing design perspective.

~Atin

> Pranith
>>> Pranith
>>> On 03/31/2015 04:41 AM, Rumen Telbizov wrote:
>>>> Hello everyone,
>>>>
>>>> I have a problem that I am trying to resolve and am not sure which way
>>>> to go, so here I am asking for your advice.
>>>>
>>>> What it comes down to is that upon initial boot of all my GlusterFS
>>>> machines the shared volume doesn't get mounted. Nevertheless, the
>>>> volume is successfully created and started, and further attempts to
>>>> mount it manually succeed. I suspect what's happening is that the
>>>> gluster processes/bricks/etc. haven't fully started at the time the
>>>> /etc/fstab entry is read and the initial mount attempt is made. Again,
>>>> by the time I log in and run mount -a, the volume mounts without any
>>>> issues.
>>>>
>>>> _Details from the logs:_
>>>>
>>>> [2015-03-30 22:29:04.381918] I [MSGID: 100030] [glusterfsd.c:2018:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.2 (args: /usr/sbin/glusterfs --log-file=/var/log/glusterfs/glusterfs.log --attribute-timeout=0 --entry-timeout=0 --volfile-server=localhost --volfile-server=10.12.130.21 --volfile-server=10.12.130.22 --volfile-server=10.12.130.23 --volfile-id=/myvolume /opt/shared)
>>>> [2015-03-30 22:29:04.394913] E [socket.c:2267:socket_connect_finish] 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
>>>> [2015-03-30 22:29:04.394950] E [glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: localhost (Transport endpoint is not connected)
>>>> [2015-03-30 22:29:04.394964] I [glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server 10.12.130.21
>>>> [2015-03-30 22:29:08.390687] E [glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.12.130.21 (Transport endpoint is not connected)
>>>> [2015-03-30 22:29:08.390720] I [glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server 10.12.130.22
>>>> [2015-03-30 22:29:11.392015] E [glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.12.130.22 (Transport endpoint is not connected)
>>>> [2015-03-30 22:29:11.392050] I [glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server 10.12.130.23
>>>> [2015-03-30 22:29:14.406429] I [dht-shared.c:337:dht_init_regex] 0-brain-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$
>>>> [2015-03-30 22:29:14.408964] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-2: setting frame-timeout to 60
>>>> [2015-03-30 22:29:14.409183] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-1: setting frame-timeout to 60
>>>> [2015-03-30 22:29:14.409388] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-0: setting frame-timeout to 60
>>>> [2015-03-30 22:29:14.409430] I [client.c:2280:notify] 0-host-client-0: parent translators are ready, attempting connect on transport
>>>> [2015-03-30 22:29:14.409658] I [client.c:2280:notify] 0-host-client-1: parent translators are ready, attempting connect on transport
>>>> [2015-03-30 22:29:14.409844] I [client.c:2280:notify] 0-host-client-2: parent translators are ready, attempting connect on transport
>>>> Final graph:
>>>>
>>>> ....
>>>>
>>>> [2015-03-30 22:29:14.411045] I [client.c:2215:client_rpc_notify] 0-host-client-2: disconnected from host-client-2. Client process will keep trying to connect to glusterd until brick's port is available
>>>> [2015-03-30 22:29:14.411063] E [MSGID: 108006] [afr-common.c:3591:afr_notify] 0-myvolume-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
>>>> [2015-03-30 22:29:14.414871] I [fuse-bridge.c:5080:fuse_graph_setup] 0-fuse: switched to graph 0
>>>> [2015-03-30 22:29:14.415003] I [fuse-bridge.c:4009:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel 7.17
>>>> [2015-03-30 22:29:14.415101] I [afr-common.c:3722:afr_local_init] 0-myvolume-replicate-0: no subvolumes up
>>>> [2015-03-30 22:29:14.415215] I [afr-common.c:3722:afr_local_init] 0-myvolume-replicate-0: no subvolumes up
>>>> [2015-03-30 22:29:14.415236] W [fuse-bridge.c:779:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
>>>> [2015-03-30 22:29:14.419007] I [fuse-bridge.c:4921:fuse_thread_proc] 0-fuse: unmounting /opt/shared
>>>> [2015-03-30 22:29:14.420176] W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (15), shutting down
>>>> [2015-03-30 22:29:14.420192] I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting '/opt/shared'.
>>>>
>>>> _Relevant /etc/fstab entries are:_
>>>>
>>>> /dev/xvdb /opt/local xfs defaults,noatime,nodiratime 0 0
>>>>
>>>> localhost:/myvolume /opt/shared glusterfs defaults,_netdev,attribute-timeout=0,entry-timeout=0,log-file=/var/log/glusterfs/glusterfs.log,backup-volfile-servers=10.12.130.21:10.12.130.22:10.12.130.23 0 0
>>>>
>>>> _Volume configuration is:_
>>>>
>>>> Volume Name: myvolume
>>>> Type: Replicate
>>>> Volume ID: xxxx
>>>> Status: Started
>>>> Number of Bricks: 1 x 3 = 3
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: host1:/opt/local/brick
>>>> Brick2: host2:/opt/local/brick
>>>> Brick3: host3:/opt/local/brick
>>>> Options Reconfigured:
>>>> storage.health-check-interval: 5
>>>> network.ping-timeout: 5
>>>> nfs.disable: on
>>>> auth.allow: 10.12.130.21,10.12.130.22,10.12.130.23
>>>> cluster.quorum-type: auto
>>>> network.frame-timeout: 60
>>>>
>>>> I run Debian 7 and GlusterFS version 3.6.2-2.
>>>>
>>>> While I could put together some rc.local type of script which retries
>>>> to mount the volume for a while until it succeeds or times out, I was
>>>> wondering if there's a better way to solve this problem?
>>>>
>>>> Thank you for your help.
>>>>
>>>> Regards,
>>>> --
>>>> Rumen Telbizov
>>>> Unix Systems Administrator <http://telbizov.com>
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-users

--
~Atin
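As a stop-gap for the race described above, the rc.local-style retry that Rumen mentions could look roughly like the sketch below. The mount point /opt/shared and the fstab-driven mount come from the thread; the script itself, its retry count, and its sleep interval are assumptions, not a tested recommendation:

#!/bin/sh
# Sketch of an rc.local-style workaround: keep retrying the glusterfs mount
# until it succeeds or we give up. Retry count and interval are arbitrary.
MOUNTPOINT=/opt/shared
RETRIES=30
SLEEP=2

i=0
while [ "$i" -lt "$RETRIES" ]; do
    # Done if the volume is already mounted.
    if mountpoint -q "$MOUNTPOINT"; then
        exit 0
    fi
    # Retry everything in /etc/fstab that is not mounted yet,
    # including the localhost:/myvolume glusterfs entry.
    mount -a
    i=$((i + 1))
    sleep "$SLEEP"
done

echo "giving up: $MOUNTPOINT still not mounted after $RETRIES attempts" >&2
exit 1

This only papers over the timing problem diagnosed above (bricks started with PROC_START_NO_WAIT, so the client mounts before any brick has signed in); the proper fix is the glusterd-side change discussed in the thread.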
Pranith Kumar Karampuri
2015-Mar-31 09:53 UTC
[Gluster-users] Initial mount problem - all subvolumes are down
On 03/31/2015 01:55 PM, Atin Mukherjee wrote:
> On 03/31/2015 01:03 PM, Pranith Kumar Karampuri wrote:
>> On 03/31/2015 12:53 PM, Atin Mukherjee wrote:
>>> On 03/31/2015 12:27 PM, Pranith Kumar Karampuri wrote:
>>>> Atin,
>>>>       Could it be because bricks are started with PROC_START_NO_WAIT?
>>> That's the correct analysis, Pranith. The mount was attempted before the
>>> bricks were started. If we can have a time lag of a few seconds between
>>> mount and volume start, the problem will go away.
>> Atin,
>>       I think one way to solve this issue is to start the bricks with
>> NO_WAIT so that we can handle pmap-signin, but wait for the pmap-signins
>> to complete before responding to the cli/completing 'init'?
> Logically it should solve the problem. We need to think it through more
> from the existing design perspective.

Rumen,
      Feel free to log a bug. This should be fixed in a later release. We
can raise the bug and work on it as well, if you prefer it that way.

Pranith

> ~Atin
>> Pranith
>>>> Pranith
>>>> On 03/31/2015 04:41 AM, Rumen Telbizov wrote:
>>>>> [...]
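Following Atin's observation that a small delay between volume start and the mount makes the problem go away, a slightly more targeted variant of the same workaround is to wait until at least one brick of the volume reports as online before mounting. The sketch below is only an illustration under stated assumptions: the awk parsing of the "gluster volume status" output assumes the 3.6.x CLI table layout and may need adjusting, and the timeout is arbitrary:

#!/bin/sh
# Sketch: delay the mount until at least one brick of the volume shows
# Online = Y in "gluster volume status". The parsing below assumes the
# 3.6.x CLI table layout (Brick ... Port Online Pid) and may need tweaks.
VOLUME=myvolume
MOUNTPOINT=/opt/shared
TIMEOUT=60   # seconds, arbitrary

deadline=$(( $(date +%s) + TIMEOUT ))
while [ "$(date +%s)" -lt "$deadline" ]; do
    # Count brick rows whose Online column (second-to-last field) is "Y".
    online=$(gluster volume status "$VOLUME" 2>/dev/null \
        | awk '/^Brick/ { if ($(NF-1) == "Y") n++ } END { print n + 0 }')
    if [ "${online:-0}" -ge 1 ]; then
        exec mount "$MOUNTPOINT"   # uses the /etc/fstab entry for /opt/shared
    fi
    sleep 2
done

echo "timed out waiting for bricks of $VOLUME to come online" >&2
exit 1

Since this volume has cluster.quorum-type set to auto, waiting for a majority of bricks (2 of 3) instead of just one would be closer to what the client actually needs before writes can succeed.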