Pranith Kumar Karampuri
2015-Mar-31 09:53 UTC
[Gluster-users] Initial mount problem - all subvolumes are down
On 03/31/2015 01:55 PM, Atin Mukherjee wrote:
>
> On 03/31/2015 01:03 PM, Pranith Kumar Karampuri wrote:
>> On 03/31/2015 12:53 PM, Atin Mukherjee wrote:
>>> On 03/31/2015 12:27 PM, Pranith Kumar Karampuri wrote:
>>>> Atin,
>>>>       Could it be because bricks are started with PROC_START_NO_WAIT?
>>> That's the correct analysis, Pranith. The mount was attempted before the
>>> bricks were started. If we can have a time lag of a few seconds between
>>> mount and volume start, the problem will go away.
>> Atin,
>>       I think one way to solve this issue is to still start the bricks with
>> NO_WAIT so that we can handle pmap-signin, but wait for the pmap-signins
>> to complete before responding to the CLI / completing 'init'?
> Logically it should solve the problem. We need to think about it more
> from the existing design perspective.
Rumen,
      Feel free to log a bug. This should be fixed in a later release. We can
raise the bug and work on it as well if you prefer it that way.

Pranith

>
> ~Atin
>> Pranith
>>>
>>>> Pranith
>>>> On 03/31/2015 04:41 AM, Rumen Telbizov wrote:
>>>>> Hello everyone,
>>>>>
>>>>> I have a problem that I am trying to resolve and I am not sure which
>>>>> way to go, so I am asking for your advice.
>>>>>
>>>>> What it comes down to is that upon initial boot of all my GlusterFS
>>>>> machines the shared volume doesn't get mounted. Nevertheless, the
>>>>> volume is successfully created and started, and further attempts to
>>>>> mount it manually succeed. I suspect what's happening is that the
>>>>> gluster processes/bricks/etc. haven't fully started at the time the
>>>>> /etc/fstab entry is read and the initial mount attempt is made. Again,
>>>>> by the time I log in and run mount -a, the volume mounts without any
>>>>> issues.
>>>>>
>>>>> _Details from the logs:_
>>>>>
>>>>> [2015-03-30 22:29:04.381918] I [MSGID: 100030] [glusterfsd.c:2018:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.2 (args: /usr/sbin/glusterfs --log-file=/var/log/glusterfs/glusterfs.log --attribute-timeout=0 --entry-timeout=0 --volfile-server=localhost --volfile-server=10.12.130.21 --volfile-server=10.12.130.22 --volfile-server=10.12.130.23 --volfile-id=/myvolume /opt/shared)
>>>>> [2015-03-30 22:29:04.394913] E [socket.c:2267:socket_connect_finish] 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
>>>>> [2015-03-30 22:29:04.394950] E [glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: localhost (Transport endpoint is not connected)
>>>>> [2015-03-30 22:29:04.394964] I [glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server 10.12.130.21
>>>>> [2015-03-30 22:29:08.390687] E [glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.12.130.21 (Transport endpoint is not connected)
>>>>> [2015-03-30 22:29:08.390720] I [glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server 10.12.130.22
>>>>> [2015-03-30 22:29:11.392015] E [glusterfsd-mgmt.c:1811:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: 10.12.130.22 (Transport endpoint is not connected)
>>>>> [2015-03-30 22:29:11.392050] I [glusterfsd-mgmt.c:1838:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server 10.12.130.23
>>>>> [2015-03-30 22:29:14.406429] I [dht-shared.c:337:dht_init_regex] 0-brain-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$
>>>>> [2015-03-30 22:29:14.408964] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-2: setting frame-timeout to 60
>>>>> [2015-03-30 22:29:14.409183] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-1: setting frame-timeout to 60
>>>>> [2015-03-30 22:29:14.409388] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-host-client-0: setting frame-timeout to 60
>>>>> [2015-03-30 22:29:14.409430] I [client.c:2280:notify] 0-host-client-0: parent translators are ready, attempting connect on transport
>>>>> [2015-03-30 22:29:14.409658] I [client.c:2280:notify] 0-host-client-1: parent translators are ready, attempting connect on transport
>>>>> [2015-03-30 22:29:14.409844] I [client.c:2280:notify] 0-host-client-2: parent translators are ready, attempting connect on transport
>>>>> Final graph:
>>>>>
>>>>> ....
>>>>>
>>>>> [2015-03-30 22:29:14.411045] I [client.c:2215:client_rpc_notify] 0-host-client-2: disconnected from host-client-2. Client process will keep trying to connect to glusterd until brick's port is available
>>>>> [2015-03-30 22:29:14.411063] E [MSGID: 108006] [afr-common.c:3591:afr_notify] 0-myvolume-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
>>>>> [2015-03-30 22:29:14.414871] I [fuse-bridge.c:5080:fuse_graph_setup] 0-fuse: switched to graph 0
>>>>> [2015-03-30 22:29:14.415003] I [fuse-bridge.c:4009:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel 7.17
>>>>> [2015-03-30 22:29:14.415101] I [afr-common.c:3722:afr_local_init] 0-myvolume-replicate-0: no subvolumes up
>>>>> [2015-03-30 22:29:14.415215] I [afr-common.c:3722:afr_local_init] 0-myvolume-replicate-0: no subvolumes up
>>>>> [2015-03-30 22:29:14.415236] W [fuse-bridge.c:779:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
>>>>> [2015-03-30 22:29:14.419007] I [fuse-bridge.c:4921:fuse_thread_proc] 0-fuse: unmounting /opt/shared
>>>>> [2015-03-30 22:29:14.420176] W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (15), shutting down
>>>>> [2015-03-30 22:29:14.420192] I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting '/opt/shared'.
>>>>>
>>>>>
>>>>> _Relevant /etc/fstab entries are:_
>>>>>
>>>>> /dev/xvdb  /opt/local  xfs  defaults,noatime,nodiratime  0 0
>>>>>
>>>>> localhost:/myvolume  /opt/shared  glusterfs  defaults,_netdev,attribute-timeout=0,entry-timeout=0,log-file=/var/log/glusterfs/glusterfs.log,backup-volfile-servers=10.12.130.21:10.12.130.22:10.12.130.23  0 0
>>>>>
>>>>>
>>>>> _Volume configuration is:_
>>>>>
>>>>> Volume Name: myvolume
>>>>> Type: Replicate
>>>>> Volume ID: xxxx
>>>>> Status: Started
>>>>> Number of Bricks: 1 x 3 = 3
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: host1:/opt/local/brick
>>>>> Brick2: host2:/opt/local/brick
>>>>> Brick3: host3:/opt/local/brick
>>>>> Options Reconfigured:
>>>>> storage.health-check-interval: 5
>>>>> network.ping-timeout: 5
>>>>> nfs.disable: on
>>>>> auth.allow: 10.12.130.21,10.12.130.22,10.12.130.23
>>>>> cluster.quorum-type: auto
>>>>> network.frame-timeout: 60
>>>>>
>>>>>
>>>>> I run Debian 7 with GlusterFS version 3.6.2-2.
>>>>>
>>>>> While I could put together some rc.local-type script which retries to
>>>>> mount the volume for a while until it succeeds or times out, I was
>>>>> wondering if there's a better way to solve this problem?
>>>>>
>>>>> Thank you for your help.
>>>>>
>>>>> Regards,
>>>>> --
>>>>> Rumen Telbizov
>>>>> Unix Systems Administrator <http://telbizov.com>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users at gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
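For readers hitting the same race, the rc.local-style workaround Rumen describes above can be as small as the following sketch. It is only illustrative, not a tested recommendation: the mount point and fstab options are assumed to match this thread, the timeout and interval values are arbitrary, and it simply keeps calling mount until the filesystem appears or the deadline passes.

    #!/bin/sh
    # Retry mounting the GlusterFS volume until it succeeds or we give up.
    # Meant to be called from /etc/rc.local (Debian 7 / sysvinit) after
    # glusterd has been started; values match the example setup in this thread.

    MOUNTPOINT=/opt/shared
    TIMEOUT=120        # give up after two minutes
    INTERVAL=5         # seconds between attempts

    elapsed=0
    while ! mountpoint -q "$MOUNTPOINT"; do
        if [ "$elapsed" -ge "$TIMEOUT" ]; then
            echo "Giving up on mounting $MOUNTPOINT after ${TIMEOUT}s" >&2
            exit 1
        fi
        mount "$MOUNTPOINT"    # picks up the options already in /etc/fstab
        sleep "$INTERVAL"
        elapsed=$((elapsed + INTERVAL))
    done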
Rumen Telbizov
2015-Mar-31 17:17 UTC
[Gluster-users] Initial mount problem - all subvolumes are down
Pranith and Atin,

Thank you for looking into this and confirming it's a bug. Please log the
bug yourselves, since I am not familiar with the project's bug-tracking
system. Given its severity, and the fact that this effectively stops the
cluster from functioning properly after boot, what do you think the
timeline for fixing this issue would be? Which version do you expect it
to be fixed in?

In the meantime, is there another workaround you might suggest, besides
running a secondary mount later, after boot is over?

Thank you again for your help,
Rumen Telbizov

On Tue, Mar 31, 2015 at 2:53 AM, Pranith Kumar Karampuri <pkarampu at redhat.com> wrote:
> Rumen,
>       Feel free to log a bug. This should be fixed in a later release. We can
> raise the bug and work on it as well if you prefer it that way.
>
> Pranith
--
Rumen Telbizov
Unix Systems Administrator <http://telbizov.com>
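As a possible alternative to a blind retry loop, the mount can be delayed until glusterd is reachable and the volume reports Started, in line with Atin's observation upthread that a small lag between volume start and mount avoids the race. The sketch below is an assumption-laden illustration only: it reuses the volume name and mount point from this thread, polls "gluster volume info" for the "Status: Started" line shown above, and adds an arbitrary grace period for the bricks to sign in before mounting.

    #!/bin/sh
    # Wait until the local glusterd answers and the volume is started,
    # then mount it via the existing /etc/fstab entry. Sketch only;
    # adjust names and timeouts for your own setup.

    VOLUME=myvolume
    MOUNTPOINT=/opt/shared
    TIMEOUT=120

    elapsed=0
    while ! gluster volume info "$VOLUME" 2>/dev/null | grep -q '^Status: Started'; do
        if [ "$elapsed" -ge "$TIMEOUT" ]; then
            echo "glusterd/$VOLUME not ready after ${TIMEOUT}s" >&2
            exit 1
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done

    sleep 10                 # arbitrary grace period for the bricks to sign in
    mount "$MOUNTPOINT"      # uses the options from /etc/fstab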