thr3ads.net - Gluster users - [Gluster-users] Some bricks are offline after restart, how to bring them online gracefully? [Jul 2017]

Hari Gowtham

2017-Jun-30 10:36 UTC

[Gluster-users] Some bricks are offline after restart, how to bring them online gracefully?

Hi,

Jan, by "multiple times" I meant whether you were able to redo the whole
setup several times and hit the same issue each time,
so that we have a consistent reproducer to work on.

Since grepping shows that the brick process doesn't exist, the bug I
mentioned doesn't apply.
This seems to be a different issue, unrelated to that bug (the link to its
fix is given below).

When you say "too often", that suggests there is a way to reproduce it.
Please let us know the steps you performed to check. That said, this
shouldn't happen if you try again.

You shouldn't hit this issue often, and as Mani mentioned, do not write a
script to force-start the volume.
If the issue persists and you have a proper reproducer, we will take a look at it.

Sorry, I forgot to provide the link to the fix for the bug I mentioned:
Patch: https://review.gluster.org/#/c/17101/

If you find a reproducer, please file a bug at
https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS


On Fri, Jun 30, 2017 at 3:33 PM, Manikandan Selvaganesh
<manikandancs333 at gmail.com> wrote:
> Hi Jan,
>
> It is not recommended that you automate the script for 'volume start
> force'.
> Bricks do not go offline just like that. There will be some genuine issue
> which triggers this. Could you please attach the entire glusterd log and
> the brick logs from around that time so that someone can take a look?
>
> Just to make sure, please check whether you have any network outage (using
> iperf or some standard tool).
>
> @Hari, I think you forgot to provide the bug link. Please provide it so that
> Jan or someone else can check whether it is related.
>
>
> --
> Thanks & Regards,
> Manikandan Selvaganesan.
> (@Manikandan Selvaganesh on Web)
>
> On Fri, Jun 30, 2017 at 3:19 PM, Jan <jan.h.zak at gmail.com> wrote:
>>
>> Hi Hari,
>>
>> thank you for your support!
>>
>> Did I try to check offline bricks multiple times?
>> Yes - I gave it enough time (at least 20 minutes) to recover but it stayed
>> offline.
>>
>> Version?
>> All nodes are 100% equal - I tried a fresh installation several times during
>> my testing. Every time it is a CentOS Minimal install with all updates and
>> without any additional software:
>>
>> uname -r
>> 3.10.0-514.21.2.el7.x86_64
>>
>> yum list installed | egrep 'gluster|ganesha'
>> centos-release-gluster310.noarch     1.0-1.el7.centos         @extras
>> glusterfs.x86_64                     3.10.2-1.el7             @centos-gluster310
>> glusterfs-api.x86_64                 3.10.2-1.el7             @centos-gluster310
>> glusterfs-cli.x86_64                 3.10.2-1.el7             @centos-gluster310
>> glusterfs-client-xlators.x86_64      3.10.2-1.el7             @centos-gluster310
>> glusterfs-fuse.x86_64                3.10.2-1.el7             @centos-gluster310
>> glusterfs-ganesha.x86_64             3.10.2-1.el7             @centos-gluster310
>> glusterfs-libs.x86_64                3.10.2-1.el7             @centos-gluster310
>> glusterfs-server.x86_64              3.10.2-1.el7             @centos-gluster310
>> libntirpc.x86_64                     1.4.3-1.el7              @centos-gluster310
>> nfs-ganesha.x86_64                   2.4.5-1.el7              @centos-gluster310
>> nfs-ganesha-gluster.x86_64           2.4.5-1.el7              @centos-gluster310
>> userspace-rcu.x86_64                 0.7.16-3.el7             @centos-gluster310
>>
>> Grepping for the brick process?
>> I've just tried it again. Process doesn't exist when brick is offline.
>>
>> Force start command?
>> sudo gluster volume start MyVolume force
>>
>> That works! Thank you.
>>
>> If I have this issue too often then I can create a simple script that greps
>> all bricks on the local server and force-starts the volume when one is
>> offline. I can schedule such a script to run once, for example 5 minutes
>> after boot.
>>
>> But I'm not sure if it's a good idea to automate it. I'd be worried that I
>> could force it up even when the node doesn't "see" the other nodes and
>> cause a split-brain issue.
>>
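The worry above, force-starting while the node cannot see its peers, can be checked by hand before running `volume start force`. A minimal sketch, assuming the usual `gluster peer status` layout of `Hostname:` and `State:` lines; the function name `disconnected_peers` is hypothetical, and an empty result does not by itself rule out split brain:

```shell
# Hypothetical helper: print every peer whose State line is not "(Connected)".
# Reads `gluster peer status` text on stdin, so it also works on saved output.
disconnected_peers() {
    awk '
        /^Hostname:/ { host = $2 }                         # remember the current peer
        /^State:/ && $0 !~ /\(Connected\)/ { print host }  # report it if not connected
    '
}
```

Possible usage on a node: `sudo gluster peer status | disconnected_peers`. If anything is printed, it is probably better to investigate the peer connection first rather than force-start.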
>> Thank you!
>>
>> Kind regards,
>> Jan
>>
>>
>> On Fri, Jun 30, 2017 at 8:01 AM, Hari Gowtham <hgowtham at redhat.com> wrote:
>>>
>>> Hi Jan,
>>>
>>> comments inline.
>>>
>>> On Fri, Jun 30, 2017 at 1:31 AM, Jan <jan.h.zak at gmail.com> wrote:
>>> > Hi all,
>>> >
>>> > Gluster and Ganesha are amazing. Thank you for this great work!
>>> >
>>> > I'm struggling with one issue and I think that you might be able to
>>> > help me.
>>> >
>>> > I spent some time playing with Gluster and Ganesha, and after I gained
>>> > some experience I decided that I should go into production, but I'm
>>> > still struggling with one issue.
>>> >
>>> > I have 3 CentOS 7.3 nodes with the most current Gluster and Ganesha from
>>> > the centos-gluster310 repository (3.10.2-1.el7) with replicated bricks.
>>> >
>>> > The servers have a lot of resources and they run in a subnet on a stable
>>> > network.
>>> >
>>> > I didn't have any issues when I tested a single brick. But now I'd like
>>> > to set up 17 replicated bricks, and I realized that when I restart one
>>> > of the nodes, the result looks like this:
>>> >
>>> > sudo gluster volume status | grep ' N '
>>> >
>>> > Brick glunode0:/st/brick3/dir          N/A       N/A        N       N/A
>>> > Brick glunode1:/st/brick2/dir          N/A       N/A        N       N/A
>>> >
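The `grep ' N '` filter above keeps the whole row. To get just the brick names, for example to list them in a bug report, the Online column can be tested directly. A minimal sketch, assuming each brick row fits on one line as in the sample above (rows for long brick paths may wrap in real output); the function name `offline_bricks` is hypothetical:

```shell
# Hypothetical helper: print host:/path for every brick reported as offline.
# Reads `gluster volume status` text on stdin. Brick rows look like
#   Brick <host:/path> <TCP port> <RDMA port> <Online> <Pid>
# so the Online flag is the second-to-last field.
offline_bricks() {
    awk '$1 == "Brick" && $(NF-1) == "N" { print $2 }'
}
```

Possible usage: `sudo gluster volume status | offline_bricks`. This only reports; it deliberately does not start anything.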
>>>
>>> did you try it multiple times?
>>>
>>> > Some bricks just don't go online. Sometimes it's one brick, sometimes
>>> > three, and it's not the same brick - it's a random issue.
>>> >
>>> > I checked the log on the affected servers and this is an example:
>>> >
>>> > sudo tail /var/log/glusterfs/bricks/st-brick3-0.log
>>> >
>>> > [2017-06-29 17:59:48.651581] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on 10.2.44.23:24007 failed (No data available)
>>> > [2017-06-29 17:59:48.651622] E [glusterfsd-mgmt.c:2114:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: glunode0 (No data available)
>>> > [2017-06-29 17:59:48.651638] I [glusterfsd-mgmt.c:2133:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
>>> > [2017-06-29 17:59:49.944103] W [glusterfsd.c:1332:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f3158032dc5] -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x7f31596cbfd5] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x7f31596cbdfb] ) 0-:received signum (15), shutting down
>>> > [2017-06-29 17:59:50.397107] E [socket.c:3203:socket_connect] 0-glusterfs: connection attempt on 10.2.44.23:24007 failed, (Network is unreachable)
>>> > [2017-06-29 17:59:50.397138] I [socket.c:3507:socket_submit_request] 0-glusterfs: not connected (priv->connected = 0)
>>> > [2017-06-29 17:59:50.397162] W [rpc-clnt.c:1693:rpc_clnt_submit] 0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
>>> >
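Each entry in the excerpt above carries a one-letter severity (W, E, I) right after the timestamp, so the error-level entries can be pulled out of a long brick log before attaching it to a bug. A minimal sketch, assuming every entry sits on a single line in the real log file; the function name `brick_log_errors` is hypothetical:

```shell
# Hypothetical helper: keep only error-level ("E") entries from a brick log.
# Reads log text on stdin; entries are expected to start with
#   [YYYY-MM-DD HH:MM:SS.micros] E [source.c:line:function] ...
brick_log_errors() {
    grep -E '^\[[0-9-]+ [0-9:.]+\] E \['
}
```

Possible usage: `sudo cat /var/log/glusterfs/bricks/st-brick3-0.log | brick_log_errors`.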
>>> > I think that the important message is "Network is unreachable".
>>> >
>>> > Questions
>>> > 1. Could you please tell me, is that normal when you have many bricks?
>>> > The network is definitely stable, other servers use it without problems,
>>> > and all servers run on the same pair of switches. My assumption is that
>>> > many bricks try to connect at the same time and that doesn't work.
>>>
>>> No, it shouldn't happen just because there are multiple bricks.
>>> There was a bug related to this [1].
>>> To verify whether that was the issue I need to know a few things:
>>> 1) Are all the nodes on the same version?
>>> 2) Did you check for the brick process by grepping the output of the ps
>>> command? I need to verify whether the brick is still up and has merely
>>> lost its connection to glusterd.
>>>
>>>
>>> >
>>> > 2. Is there an option to configure a brick to enable some kind of
>>> > autoreconnect or add some timeout?
>>> > gluster volume set brick123 option456 abc ??
>>> If the brick process is not seen in the output of ps aux | grep glusterfsd,
>>> the way to start the brick is to use the volume start force command.
>>> If the brick is not started there is no point in configuring it, and we
>>> can't start a brick with the configure command.
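The check described above can be narrowed to a single brick. A minimal sketch, assuming the brick path appears in the `glusterfsd` command line (it is normally passed as an argument); the function name `brick_process_running` is hypothetical, and the force start is left as a manual step in line with the advice in this thread not to automate it:

```shell
# Hypothetical helper: succeed if a glusterfsd process for the given brick
# path shows up in the process list supplied on stdin.
#   $1     brick path, e.g. /st/brick3/dir
#   stdin  output of `ps -eo args`
brick_process_running() {
    # The first grep keeps glusterfsd processes only, the second looks for the path.
    grep -F -- 'glusterfsd' | grep -qF -- "$1"
}
```

Possible usage: `ps -eo args | brick_process_running /st/brick3/dir || echo "no brick process; consider: sudo gluster volume start MyVolume force"`.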
>>>
>>> >
>>> > 3. What is the recommended way to fix an offline brick on the affected
>>> > server? I don't want to use "gluster volume stop/start" since the
>>> > affected bricks are online on the other servers and there is no reason
>>> > to completely turn the volume off.
>>> gluster volume start force will not bring down the bricks that are
>>> already up and running.
>>>
>>> >
>>> > Thank you,
>>> > Jan
>>> >
>>> > _______________________________________________
>>> > Gluster-users mailing list
>>> > Gluster-users at gluster.org
>>> > http://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Hari Gowtham.
>>
>>
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-users
>
>


-- 
Regards,
Hari Gowtham.

Jan

2017-Jul-02 10:26 UTC

[Gluster-users] Some bricks are offline after restart, how to bring them online gracefully?

Thank you, I created a bug with all the logs:
https://bugzilla.redhat.com/show_bug.cgi?id=1467050

During testing I found a second bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1467057
There is something wrong with Ganesha when Gluster bricks are named "w0" or
"sw0".


