On 07/13/2015 10:45 PM, Tiemen Ruiten wrote:
> On 13 July 2015 at 19:06, Atin Mukherjee <amukherj at redhat.com> wrote:
>
>>
>>
>> On 07/13/2015 10:29 PM, Tiemen Ruiten wrote:
>>> OK, I found what's wrong. From the brick's log:
>>>
>>> [2015-07-12 02:32:01.542934] I [glusterfsd-mgmt.c:1512:mgmt_getspec_cbk]
>>> 0-glusterfs: No change in volfile, continuing
>>> [2015-07-13 14:21:06.722675] W [glusterfsd.c:1219:cleanup_and_exit] (-->
>>> 0-: received signum (15), shutting down
>>> [2015-07-13 14:21:35.168750] I [MSGID: 100030] [glusterfsd.c:2294:main]
>>> 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 3.7.1
>>> (args: /usr/sbin/glusterfsd -s 10.100.3.10 --volfile-id
>>> vmimage.10.100.3.10.export-gluster01-brick -p
>>> /var/lib/glusterd/vols/vmimage/run/10.100.3.10-export-gluster01-brick.pid
>>> -S /var/run/gluster/2bfe3a2242d586d0850775f601f1c3ee.socket --brick-name
>>> /export/gluster01/brick -l
>>> /var/log/glusterfs/bricks/export-gluster01-brick.log --xlator-option
>>> *-posix.glusterd-uuid=26186ec6-a8c7-4834-bcaa-24e30289dba3 --brick-port
>>> 49153 --xlator-option vmimage-server.listen-port=49153)
>>> [2015-07-13 14:21:35.178558] E [socket.c:823:__socket_server_bind]
>>> 0-socket.glusterfsd: binding to failed: Address already in use
>>> [2015-07-13 14:21:35.178624] E [socket.c:826:__socket_server_bind]
>>> 0-socket.glusterfsd: Port is already in use
>>> [2015-07-13 14:21:35.178649] W [rpcsvc.c:1602:rpcsvc_transport_create]
>>> 0-rpc-service: listening on transport failed
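
The bind errors in that log mean some process is still holding the brick port. As a quick aside (not from the original thread; availability of `ss` or `netstat` on the node is an assumption), the holder of port 49153 can be identified with:

```shell
# Hedged aside: find which process is still bound to brick port 49153
# (the port number comes from the log above). Falls back to netstat when
# ss is unavailable; prints a note if nothing holds the port.
(ss -tlnp 2>/dev/null || netstat -tlnp 2>/dev/null) | grep ':49153' \
  || echo "nothing listening on 49153"
```

The PID/name column of the matching line points at the stale brick process.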
>>>
>>>
>>> ps aux | grep gluster
>>> root   6417  0.0  0.2 753080 175016 ? Ssl May21 25:25
>>> /usr/sbin/glusterfs --volfile-server=10.100.3.10 --volfile-id=/wwwdata
>>> /mnt/gluster/web/wwwdata
>>> root   6742  0.0  0.0 622012  17624 ? Ssl May21 22:31
>>> /usr/sbin/glusterfs --volfile-server=10.100.3.10 --volfile-id=/conf
>>> /mnt/gluster/conf
>>> root  36575  0.2  0.0 589956  19228 ? Ssl 16:21  0:19
>>> /usr/sbin/glusterd --pid-file=/run/glusterd.pid
>>> root  36720  0.0  0.0 565140  55836 ? Ssl 16:21  0:02
>>> /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p
>>> /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S
>>> /var/run/gluster/8b9ce8bebfa8c1d2fabb62654bdc550e.socket
>>> root  36730  0.0  0.0 451016  22936 ? Ssl 16:21  0:01
>>> /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
>>> /var/lib/glusterd/glustershd/run/glustershd.pid -l
>>> /var/log/glusterfs/glustershd.log -S
>>> /var/run/gluster/c0d7454986c96eef463d028dc8bce9fe.socket --xlator-option
>>> *replicate*.node-uuid=26186ec6-a8c7-4834-bcaa-24e30289dba3
>>> root  37398  0.0  0.0 103248    916 pts/2 S+ 18:49 0:00 grep gluster
>>> root  40058  0.0  0.0 755216  60212 ? Ssl May21 22:06
>>> /usr/sbin/glusterfs --volfile-server=10.100.3.10 --volfile-id=/fl-webroot
>>> /mnt/gluster/web/flash/webroot
>>>
>>> So several leftover processes. What will happen if I do a
>>>
>>> /etc/init.d/glusterd stop
>>> /etc/init.d/glusterfsd stop
>>>
>>> kill all remaining gluster processes and restart gluster on this node?
>>>
>>> Will the volume stay online? What about split-brain? I suppose it would
>>> be best to disconnect all clients first...?
>> Can you double-check whether any brick process is already running? If so,
>> kill it and try 'gluster volume start <volname> force'.
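
That check can be sketched like this (a sketch, not verbatim from the thread; the vmimage pattern comes from the brick command line quoted above):

```shell
# Sketch: look for a leftover brick process for the vmimage volume.
# The [g] in the pattern keeps grep from matching its own process entry.
ps aux | grep '[g]lusterfsd.*vmimage' || echo "no leftover brick process"
# If one shows up, kill that PID, then force-start the volume:
#   kill <PID>
#   gluster volume start vmimage force
```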
>>>
>>>
>>> On 13 July 2015 at 18:25, Tiemen Ruiten <t.ruiten at rdmedia.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We have a two-node Gluster cluster, running version 3.7.1, that hosts
>>>> an oVirt storage domain. This afternoon I tried creating a template in
>>>> oVirt, but within a minute VMs stopped responding and Gluster started
>>>> generating errors like the following:
>>>>
>>>> [2015-07-13 14:09:51.772629] W [rpcsvc.c:270:rpcsvc_program_actor]
>>>> 0-rpc-service: RPC program not available (req 1298437 330) for
>>>> 10.100.3.40:1021
>>>> [2015-07-13 14:09:51.772675] E [rpcsvc.c:565:rpcsvc_check_and_reply_error]
>>>> 0-rpcsvc: rpc actor failed to complete successfully
>>>>
>>>> I managed to get things in working order again by restarting glusterd
>>>> and glusterfsd, but now one brick is down:
>>>>
>>>> $ sudo gluster volume status vmimage
>>>> Status of volume: vmimage
>>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>>> ------------------------------------------------------------------------------
>>>> Brick 10.100.3.10:/export/gluster01/brick   N/A       N/A        N       36736
>>>> Brick 10.100.3.11:/export/gluster01/brick   49153     0          Y       11897
>>>> NFS Server on localhost                     2049      0          Y       36720
>>>> Self-heal Daemon on localhost               N/A       N/A        Y       36730
>>>> NFS Server on 10.100.3.11                   2049      0          Y       11919
>>>> Self-heal Daemon on 10.100.3.11             N/A       N/A        Y       11924
>>>>
>>>> Task Status of Volume vmimage
>>>> ------------------------------------------------------------------------------
>>>> There are no active volume tasks
>>>>
>>>> $ sudo gluster peer status
>>>> Number of Peers: 1
>>>>
>>>> Hostname: 10.100.3.11
>>>> Uuid: f9872fea-47f5-41f6-8094-c9fabd3c1339
>>>> State: Peer in Cluster (Connected)
>>>>
>>>> Additionally, in etc-glusterfs-glusterd.vol.log I see these messages
>>>> repeating every 3 seconds:
>>>>
>>>> [2015-07-13 16:15:21.737044] W [socket.c:642:__socket_rwv] 0-management:
>>>> readv on /var/run/gluster/2bfe3a2242d586d0850775f601f1c3ee.socket failed
>>>> (Invalid argument)
>>>> The message "I [MSGID: 106005]
>>>> [glusterd-handler.c:4667:__glusterd_brick_rpc_notify] 0-management: Brick
>>>> 10.100.3.10:/export/gluster01/brick has disconnected from glusterd."
>>>> repeated 39 times between [2015-07-13 16:13:24.717611] and
>>>> [2015-07-13 16:15:21.737862]
>>>> [2015-07-13 16:15:24.737694] W [socket.c:642:__socket_rwv] 0-management:
>>>> readv on /var/run/gluster/2bfe3a2242d586d0850775f601f1c3ee.socket failed
>>>> (Invalid argument)
>>>> [2015-07-13 16:15:24.738498] I [MSGID: 106005]
>>>> [glusterd-handler.c:4667:__glusterd_brick_rpc_notify] 0-management: Brick
>>>> 10.100.3.10:/export/gluster01/brick has disconnected from glusterd.
>>>> [2015-07-13 16:15:27.738194] W [socket.c:642:__socket_rwv] 0-management:
>>>> readv on /var/run/gluster/2bfe3a2242d586d0850775f601f1c3ee.socket failed
>>>> (Invalid argument)
>>>> [2015-07-13 16:15:30.738991] W [socket.c:642:__socket_rwv] 0-management:
>>>> readv on /var/run/gluster/2bfe3a2242d586d0850775f601f1c3ee.socket failed
>>>> (Invalid argument)
>>>> [2015-07-13 16:15:33.739735] W [socket.c:642:__socket_rwv] 0-management:
>>>> readv on /var/run/gluster/2bfe3a2242d586d0850775f601f1c3ee.socket failed
>>>> (Invalid argument)
>>>>
>>>> Can I get this brick back up without bringing the volume/cluster down?
>>>>
>>>> --
>>>> Tiemen Ruiten
>>>> Systems Engineer
>>>> R&D Media
>>>>
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>> --
>> ~Atin
>>
>
>
> Hi Atin,
>
> I see brick processes for volumes wwwdata, conf and fl-webroot, judging
> from the ps aux | grep gluster output. These volumes are not started.
> There is no brick process for vmimage. So you're saying: kill those brick
> processes, then gluster volume start vmimage force?
No, I meant check whether any leftover brick process is still there for
vmimage. If there is, kill it and start the volume with force, or you
could probably try to stop the volume and then start it.
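
For the archive, the two options could look like this sketch (assumes the gluster CLI on a cluster node; the heal check at the end is an addition to address the split-brain question, not something specified above):

```shell
# Guarded sketch of both recovery options; run on the affected node.
if command -v gluster >/dev/null 2>&1; then
    # Option 1: kill the leftover vmimage brick PID first, then:
    gluster volume start vmimage force
    # Option 2 (instead of option 1; briefly interrupts clients):
    #   gluster volume stop vmimage
    #   gluster volume start vmimage
    # Verify the brick came back and check for pending heals:
    gluster volume status vmimage
    gluster volume heal vmimage info
else
    echo "gluster CLI not found; run this on a cluster node"
fi
```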
~Atin
>
> Thank you for your response.
>
--
~Atin