thr3ads.net - Gluster users - [Gluster-users] Some bricks are offline after restart, how to bring them online gracefully? [Jun 2017]

Jan

2017-Jun-29 20:01 UTC

[Gluster-users] Some bricks are offline after restart, how to bring them online gracefully?

Hi all,

Gluster and Ganesha are amazing. Thank you for this great work!

I'm struggling with one issue and I think that you might be able to help me.

I spent some time playing with Gluster and Ganesha, and after gaining some
experience I decided to go into production, but I'm still
struggling with one issue.

I have 3x node CentOS 7.3 with the most current Gluster and Ganesha from
centos-gluster310 repository (3.10.2-1.el7) with replicated bricks.

Servers have a lot of resources and they run in a subnet on a stable
network.

I didn't have any issues when I tested a single brick. But now I'd like to
set up 17 replicated bricks, and I realized that when I restart one of the nodes
the result looks like this:

sudo gluster volume status | grep ' N '

Brick glunode0:/st/brick3/dir          N/A       N/A        N       N/A
Brick glunode1:/st/brick2/dir          N/A       N/A        N       N/A

Some bricks just don't come online. Sometimes it's one brick, sometimes three,
and it's not the same brick each time; the issue is random.

I checked the log on the affected servers and this is an example:

sudo tail /var/log/glusterfs/bricks/st-brick3-0.log

[2017-06-29 17:59:48.651581] W [socket.c:593:__socket_rwv] 0-glusterfs:
readv on 10.2.44.23:24007 failed (No data available)
[2017-06-29 17:59:48.651622] E [glusterfsd-mgmt.c:2114:mgmt_rpc_notify]
0-glusterfsd-mgmt: failed to connect with remote-host: glunode0 (No data
available)
[2017-06-29 17:59:48.651638] I [glusterfsd-mgmt.c:2133:mgmt_rpc_notify]
0-glusterfsd-mgmt: Exhausted all volfile servers
[2017-06-29 17:59:49.944103] W [glusterfsd.c:1332:cleanup_and_exit]
(-->/lib64/libpthread.so.0(+0x7dc5) [0x7f3158032dc5]
-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x7f31596cbfd5]
-->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x7f31596cbdfb] )
0-:received signum (15), shutting down
[2017-06-29 17:59:50.397107] E [socket.c:3203:socket_connect] 0-glusterfs:
connection attempt on 10.2.44.23:24007 failed, (Network is unreachable)
[2017-06-29 17:59:50.397138] I [socket.c:3507:socket_submit_request]
0-glusterfs: not connected (priv->connected = 0)
[2017-06-29 17:59:50.397162] W [rpc-clnt.c:1693:rpc_clnt_submit]
0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster
Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)

I think the important message is "Network is unreachable".

Questions
1. Could you please tell me, is that normal when you have many bricks?
The network is definitely stable, other servers use it without problems, and
all servers run on the same pair of switches. My assumption is that many
bricks try to connect at the same time and that doesn't work.

2. Is there an option to configure a brick to enable some kind of
autoreconnect or add some timeout?
gluster volume set brick123 option456 abc ??

3. What is the recommended way to fix an offline brick on the affected server? I
don't want to use "gluster volume stop/start" since the affected bricks are
online on the other servers and there is no reason to turn the volume off completely.

Thank you,
Jan

Hari Gowtham

2017-Jun-30 07:01 UTC

Hi Jan,

comments inline.

On Fri, Jun 30, 2017 at 1:31 AM, Jan <jan.h.zak at gmail.com> wrote:
> Hi all,
>
> Gluster and Ganesha are amazing. Thank you for this great work!
>
> I'm struggling with one issue and I think that you might be able to help
> me.
>
> I spent some time by playing with Gluster and Ganesha and after I gain some
> experience I decided that I should go into production but I'm still
> struggling with one issue.
>
> I have 3x node CentOS 7.3 with the most current Gluster and Ganesha from
> centos-gluster310 repository (3.10.2-1.el7) with replicated bricks.
>
> Servers have a lot of resources and they run in a subnet on a stable
> network.
>
> I didn't have any issues when I tested a single brick. But now I'd like to
> setup 17 replicated bricks and I realized that when I restart one of nodes
> then the result looks like this:
>
> sudo gluster volume status | grep ' N '
>
> Brick glunode0:/st/brick3/dir          N/A       N/A        N       N/A
> Brick glunode1:/st/brick2/dir          N/A       N/A        N       N/A
>
Did you try it multiple times?

> Some bricks just don't go online. Sometime it's one brick, sometime three and
> it's not same brick - it's random issue.
>
> I checked log on affected servers and this is an example:
>
> sudo tail /var/log/glusterfs/bricks/st-brick3-0.log
>
> [2017-06-29 17:59:48.651581] W [socket.c:593:__socket_rwv] 0-glusterfs:
> readv on 10.2.44.23:24007 failed (No data available)
> [2017-06-29 17:59:48.651622] E [glusterfsd-mgmt.c:2114:mgmt_rpc_notify]
> 0-glusterfsd-mgmt: failed to connect with remote-host: glunode0 (No data
> available)
> [2017-06-29 17:59:48.651638] I [glusterfsd-mgmt.c:2133:mgmt_rpc_notify]
> 0-glusterfsd-mgmt: Exhausted all volfile servers
> [2017-06-29 17:59:49.944103] W [glusterfsd.c:1332:cleanup_and_exit]
> (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f3158032dc5]
> -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x7f31596cbfd5]
> -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x7f31596cbdfb] )
> 0-:received signum (15), shutting down
> [2017-06-29 17:59:50.397107] E [socket.c:3203:socket_connect] 0-glusterfs:
> connection attempt on 10.2.44.23:24007 failed, (Network is unreachable)
> [2017-06-29 17:59:50.397138] I [socket.c:3507:socket_submit_request]
> 0-glusterfs: not connected (priv->connected = 0)
> [2017-06-29 17:59:50.397162] W [rpc-clnt.c:1693:rpc_clnt_submit]
> 0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster
> Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
>
> I think that important message is "Network is unreachable".
>
> Question
> 1. Could you please tell me, is that normal when you have many bricks?
> Networks is definitely stable and other servers use it without problem and
> all servers run on a same pair of switches. My assumption is that in the
> same time many bricks try to connect and that doesn't work.

No, it shouldn't happen just because there are multiple bricks.
There was a bug related to this [1].
To verify whether that was the issue, I need to know a few things:
1) Are all the nodes running the same version?
2) Did you check for the brick process using the ps command?
I need to verify whether the brick process is still up and has only lost its
connection to glusterd.
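
The ps check requested above can be narrowed to a single brick. The following is only a sketch: it assumes the glusterfsd command line carries a "--brick-name <path>" argument, which is worth confirming on your own nodes, and the function name brick_process_running is made up for this example.

```shell
# Succeed if a glusterfsd process serving the given brick path appears in the
# process listing read from stdin (for example the output of `ps -eo pid,args`).
# Assumption: glusterfsd is started with "--brick-name <path>".
brick_process_running() {
    brick_path=$1
    awk -v path="$brick_path" '
        /glusterfsd/ {
            # Look for the brick path right after the --brick-name argument.
            for (i = 1; i < NF; i++)
                if ($i == "--brick-name" && $(i + 1) == path) found = 1
        }
        END { exit !found }'
}
```

Usage, with the brick path from the status output in this thread: ps -eo pid,args | brick_process_running /st/brick3/dir && echo "brick process is up"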

>
> 2. Is there an option to configure a brick to enable some kind of
> autoreconnect or add some timeout?
> gluster volume set brick123 option456 abc ??

If the brick process does not show up in the output of ps aux | grep glusterfsd,
the way to start the brick is the volume start force command.
If the brick is not started, there is no point in configuring it, and a
brick can't be started with a configuration (volume set) command.
>
> 3. What it the recommend way to fix offline brick on the affected server? I
> don't want to use "gluster volume stop/start" since affected bricks are
> online on other server and there is no reason to completely turn it off.

gluster volume start force will not bring down the bricks that are
already up and running.
>
> Thank you,
> Jan
>


-- 
Regards,
Hari Gowtham.

Jan

2017-Jun-30 09:49 UTC

Hi Hari,

thank you for your support!

Did I try to check offline bricks multiple times?
Yes, I gave it enough time (at least 20 minutes) to recover but it stayed
offline.

Version?
All nodes are 100% equal. I tried a fresh installation several times during
my testing. Every time it was a CentOS Minimal install with all updates and
without any additional software:

uname -r
3.10.0-514.21.2.el7.x86_64

yum list installed | egrep 'gluster|ganesha'
centos-release-gluster310.noarch     1.0-1.el7.centos     @extras
glusterfs.x86_64                     3.10.2-1.el7         @centos-gluster310
glusterfs-api.x86_64                 3.10.2-1.el7         @centos-gluster310
glusterfs-cli.x86_64                 3.10.2-1.el7         @centos-gluster310
glusterfs-client-xlators.x86_64      3.10.2-1.el7         @centos-gluster310
glusterfs-fuse.x86_64                3.10.2-1.el7         @centos-gluster310
glusterfs-ganesha.x86_64             3.10.2-1.el7         @centos-gluster310
glusterfs-libs.x86_64                3.10.2-1.el7         @centos-gluster310
glusterfs-server.x86_64              3.10.2-1.el7         @centos-gluster310
libntirpc.x86_64                     1.4.3-1.el7          @centos-gluster310
nfs-ganesha.x86_64                   2.4.5-1.el7          @centos-gluster310
nfs-ganesha-gluster.x86_64           2.4.5-1.el7          @centos-gluster310
userspace-rcu.x86_64                 0.7.16-3.el7         @centos-gluster310

Grepping for the brick process?
I've just tried it again. The process doesn't exist when the brick is offline.

Force start command?
sudo gluster volume start MyVolume force

That works! Thank you.

If I have this issue too often, I can create a simple script that checks
all bricks on the local server and force-starts the volume when one is offline.
I can schedule such a script to run once, for example 5 minutes after boot.

But I'm not sure if it's a good idea to automate it. I'd be worried that I
could force a brick up even when the node doesn't "see" the other nodes and
cause a split-brain issue.
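
A rough sketch of such a script, under stated assumptions: bricks appear on a single line of "gluster volume status" output as quoted earlier in this thread, and "gluster peer status" prints one "State: ... (Connected)" line per reachable peer. The function names are invented for this example, and the peer check only addresses the worry above; it is not a guarantee against split brain.

```shell
#!/bin/sh
# Sketch only: helper functions for a post-boot check, not an official Gluster tool.

# Print host:path of every brick whose Online column is "N".
# Reads `gluster volume status VOLNAME` output on stdin.
# Assumes each brick fits on one line, as in the output quoted in this thread.
offline_bricks() {
    awk '$1 == "Brick" && $(NF - 1) == "N" { print $2 }'
}

# Succeed only if at least one peer is listed and every listed peer is connected.
# Reads `gluster peer status` output on stdin.
all_peers_connected() {
    awk '/^State:/ { total++; if ($0 ~ /\(Connected\)/) ok++ }
         END { exit !(total > 0 && total == ok) }'
}

# Force-start VOLNAME if HOST has offline bricks and all peers are connected.
# Both arguments are supplied by the caller; nothing is started otherwise.
force_start_if_needed() {
    vol=$1
    host=$2
    gluster peer status | all_peers_connected || return 1
    if gluster volume status "$vol" | offline_bricks | grep -q "^$host:"; then
        gluster volume start "$vol" force
    fi
}
```

Called once some minutes after boot, for example as: force_start_if_needed MyVolume glunode0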

Thank you!

Kind regards,
Jan


Atin Mukherjee

2017-Jun-30 11:07 UTC

On Fri, Jun 30, 2017 at 1:31 AM, Jan <jan.h.zak at gmail.com> wrote:
> Hi all,
>
> Gluster and Ganesha are amazing. Thank you for this great work!
>
> I'm struggling with one issue and I think that you might be able to help
> me.
>
> I spent some time by playing with Gluster and Ganesha and after I gain
> some experience I decided that I should go into production but I'm still
> struggling with one issue.
>
> I have 3x node CentOS 7.3 with the most current Gluster and Ganesha from
> centos-gluster310 repository (3.10.2-1.el7) with replicated bricks.
>
> Servers have a lot of resources and they run in a subnet on a stable
> network.
>
> I didn't have any issues when I tested a single brick. But now I'd like to
> setup 17 replicated bricks and I realized that when I restart one of nodes
> then the result looks like this:
>
> sudo gluster volume status | grep ' N '
>
> Brick glunode0:/st/brick3/dir          N/A       N/A        N       N/A
> Brick glunode1:/st/brick2/dir          N/A       N/A        N       N/A
>
> Some bricks just don't go online. Sometime it's one brick, sometime three
> and it's not same brick - it's random issue.
>
> I checked log on affected servers and this is an example:
>
> sudo tail /var/log/glusterfs/bricks/st-brick3-0.log
>
> [2017-06-29 17:59:48.651581] W [socket.c:593:__socket_rwv] 0-glusterfs:
> readv on 10.2.44.23:24007 failed (No data available)
> [2017-06-29 17:59:48.651622] E [glusterfsd-mgmt.c:2114:mgmt_rpc_notify]
> 0-glusterfsd-mgmt: failed to connect with remote-host: glunode0 (No data
> available)
> [2017-06-29 17:59:48.651638] I [glusterfsd-mgmt.c:2133:mgmt_rpc_notify]
> 0-glusterfsd-mgmt: Exhausted all volfile servers
> [2017-06-29 17:59:49.944103] W [glusterfsd.c:1332:cleanup_and_exit]
> (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f3158032dc5]
> -->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xe5) [0x7f31596cbfd5]
> -->/usr/sbin/glusterfsd(cleanup_and_exit+0x6b) [0x7f31596cbdfb] )
> 0-:received signum (15), shutting down
> [2017-06-29 17:59:50.397107] E [socket.c:3203:socket_connect] 0-glusterfs:
> connection attempt on 10.2.44.23:24007 failed, (Network is unreachable)
>
This happens when the connect() syscall fails with errno ENETUNREACH, as per
the following code:

                if (ign_enoent) {
                        ret = connect_loop (priv->sock,
                                            SA (&this->peerinfo.sockaddr),
                                            this->peerinfo.sockaddr_len);
                } else {
                        ret = connect (priv->sock,
                                       SA (&this->peerinfo.sockaddr),
                                       this->peerinfo.sockaddr_len);
                }

                if (ret == -1 && errno == ENOENT && ign_enoent) {
                        gf_log (this->name, GF_LOG_WARNING,
                               "Ignore failed connection attempt on %s, (%s) ",
                                this->peerinfo.identifier, strerror (errno));

                        /* connect failed with some other error than EINPROGRESS
                        so, getsockopt (... SO_ERROR ...), will not catch any
                        errors and return them to us, we need to remember this
                        state, and take actions in socket_event_handler
                        appropriately */
                        /* TBD: What about ENOENT, we will do getsockopt there
                        as well, so how is that exempt from such a problem? */
                        priv->connect_failed = 1;
                        this->connect_failed = _gf_true;

                        goto handler;
                }

                if (ret == -1 && ((errno != EINPROGRESS) && (errno != ENOENT))) {
                        /* For unix path based sockets, the socket path is
                         * cryptic (md5sum of path) and may not be useful for
                         * the user in debugging so log it in DEBUG
                         */
                        /* <===== this is the log which gets generated */
                        gf_log (this->name, ((sa_family == AF_UNIX) ?
                                GF_LOG_DEBUG : GF_LOG_ERROR),
                                "connection attempt on %s failed, (%s)",
                                this->peerinfo.identifier, strerror (errno));

IMO, this can only happen if there is an intermittent n/w failure?

@Raghavendra G/ Mohit - do you have any other opinion?
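
If the cause is indeed a short network outage during boot, one possible workaround is to wait until the glusterd management port (24007 in the log above) accepts TCP connections before retrying anything. This is only a sketch: the function name wait_for_port and the default retry values are invented for illustration, and it relies on bash's /dev/tcp redirection and the coreutils timeout command being available.

```shell
# Return 0 as soon as a TCP connection to HOST:PORT succeeds,
# or 1 after TRIES failed attempts with DELAY seconds between them.
wait_for_port() {
    host=$1
    port=$2
    tries=${3:-30}
    delay=${4:-2}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT;
        # timeout bounds each attempt to 2 seconds.
        if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Usage, with the address from the log: wait_for_port 10.2.44.23 24007 30 2 || echo "glusterd port still unreachable"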

> [2017-06-29 17:59:50.397138] I [socket.c:3507:socket_submit_request]
> 0-glusterfs: not connected (priv->connected = 0)
> [2017-06-29 17:59:50.397162] W [rpc-clnt.c:1693:rpc_clnt_submit]
> 0-glusterfs: failed to submit rpc-request (XID: 0x3 Program: Gluster
> Portmap, ProgVers: 1, Proc: 5) to rpc-transport (glusterfs)
>
> I think that important message is "Network is unreachable".
>
> Question
> 1. Could you please tell me, is that normal when you have many bricks?
> Networks is definitely stable and other servers use it without problem and
> all servers run on a same pair of switches. My assumption is that in the
> same time many bricks try to connect and that doesn't work.
>
> 2. Is there an option to configure a brick to enable some kind of
> autoreconnect or add some timeout?
> gluster volume set brick123 option456 abc ??
>
> 3. What it the recommend way to fix offline brick on the affected server?
> I don't want to use "gluster volume stop/start" since affected bricks are
> online on other server and there is no reason to completely turn it off.
>
> Thank you,
> Jan
>