thr3ads.net - Gluster users - [Gluster-users] lost one replica after upgrading glusterfs from 3.7 to 3.10, please help [Apr 2017]

Mohammed Rafi K C

2017-Apr-28 09:41 UTC

[Gluster-users] lost one replica after upgrading glusterfs from 3.7 to 3.10, please help

Can you share the glusterd logs from the three nodes?


Rafi KC


On 04/28/2017 02:34 PM, Seva Gluschenko wrote:
> Dear Community,
>
>
> I call for your wisdom, as it appears that googling for keywords
> doesn't help much.
>
> I have a glusterfs volume with replica count 2, and I tried to perform the
> online upgrade procedure described in the docs
> (http://gluster.readthedocs.io/en/latest/Upgrade-Guide/upgrade_to_3.10/).
> Almost everything went fine on the first replica; the only problem was
> that the self-heal procedure refused to complete until I commented out all
> IPv6 entries in /etc/hosts.
>
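The /etc/hosts workaround described above could be sketched roughly as follows. This is only an illustration: the function name comment_out_ipv6 is made up here, and treating any entry whose first field contains a colon as IPv6 is an assumption. It prints the result instead of editing the file in place.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): comment out IPv6 entries in a
# hosts-style file. An entry is treated as IPv6 when its first field contains
# a colon, which is an assumption that fits the usual /etc/hosts layout.
comment_out_ipv6() {
    # $1: path to a hosts-style file; the result is written to stdout
    awk '
        /^[[:space:]]*#/ { print; next }          # already a comment
        NF > 0 && $1 ~ /:/ { print "#" $0; next } # IPv6 address in field 1
        { print }                                 # IPv4 entries, blank lines
    ' "$1"
}
```

One way to use it would be to run "comment_out_ipv6 /etc/hosts > /tmp/hosts.new", review the result, and only then replace the original file.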
> Being sure that it would all work on the 2nd replica much the same as on
> the 1st one, I proceeded with the upgrade on replica 2. All of a sudden,
> it told me that it doesn't see the first replica at all.
> The state before the upgrade was:
>
> sst2# gluster volume status
> Status of volume: gv0
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick sst0:/var/glusterfs                   49152     0          Y       3482
> Brick sst2:/var/glusterfs                   49152     0          Y       29863
> NFS Server on localhost                     2049      0          Y       25175
> Self-heal Daemon on localhost               N/A       N/A        Y       25283
> NFS Server on sst0                          N/A       N/A        N       N/A
> Self-heal Daemon on sst0                    N/A       N/A        Y       4827
> NFS Server on sst1                          N/A       N/A        N       N/A
> Self-heal Daemon on sst1                    N/A       N/A        Y       15009
>
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Peer in Cluster (Connected)
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Sent and Received peer request (Connected)
>
> sst2# gluster volume heal gv0 info
> Brick sst0:/var/glusterfs
> Number of entries: 0
>
> Brick sst2:/var/glusterfs
> Number of entries: 0
>
>
> After upgrade, it looked like this:
>
> sst2# gluster volume status
> Status of volume: gv0
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A
> NFS Server on localhost                     N/A       N/A        N       N/A
> NFS Server on localhost                     N/A       N/A        N       N/A
>
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Sent and Received peer request (Connected)
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Peer Rejected (Connected)
>
>
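A peer state like the one above is easy to miss in longer output. The following is a minimal sketch for flagging it, assuming the "Hostname:" / "State:" layout shown in this thread; the function name unhealthy_peers is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): read `gluster peer status`
# output on stdin and print "<hostname>: <state>" for every peer whose state
# is anything other than "Peer in Cluster".
unhealthy_peers() {
    awk '
        /^Hostname:/ { host = $2 }
        /^State:/ {
            state = $0
            sub(/^State:[[:space:]]*/, "", state)   # drop the "State: " label
            sub(/[[:space:]]*\(.*$/, "", state)     # drop "(Connected)" etc.
            if (state != "Peer in Cluster") print host ": " state
        }
    '
}
```

It would be used as "gluster peer status | unhealthy_peers"; empty output means every listed peer reports "Peer in Cluster".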
> Probably my biggest mistake: at that point I googled and found this article
> https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Resolving%20Peer%20Rejected/
> and followed its advice, removing everything in /var/lib/glusterd on sst2
> except the glusterd.info file. As a result, the node predictably lost all
> information about the volume.
>
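The cleanup step described above can be made reversible by archiving the directory first. The following is a sketch only, not the procedure from the linked article: the function name reset_glusterd_state and the backup path are hypothetical, and it assumes glusterd has already been stopped.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): archive a glusterd state
# directory, then remove everything in it except glusterd.info.
# Assumes glusterd is already stopped and that tar and find are available.
reset_glusterd_state() {
    state_dir=$1      # e.g. /var/lib/glusterd
    backup=$2         # e.g. /root/glusterd-backup.tar.gz (outside state_dir)

    [ -f "$state_dir/glusterd.info" ] || {
        echo "no glusterd.info in $state_dir, refusing to continue" >&2
        return 1
    }
    # Keep a full copy first so the volume definitions can be restored.
    tar -czf "$backup" -C "$state_dir" . || return 1
    # Delete every top-level entry except glusterd.info.
    find "$state_dir" -mindepth 1 -maxdepth 1 ! -name glusterd.info \
        -exec rm -rf {} +
}
```

With the archive in place, the vols directory can be unpacked again if the node does not get its volume information back from its peers.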
> sst2# gluster volume status
> No volumes present
>
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Accepted peer request (Connected)
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Accepted peer request (Connected)
>
> Okay, I thought, this might be high time to re-add the brick. Not that
> easy, Jack:
>
> sst0# gluster volume add-brick gv0 replica 2 'sst2:/var/glusterfs'
> volume add-brick: failed: Operation failed
>
> The reason seemed natural: sst0 still knows that there was a replica on
> sst2. What should I do then? At this point, I tried to recover the volume
> information on sst2 by taking it offline and copying all the volume info
> from sst0. Of course, it wasn't enough to just copy it as is, so I modified
> /var/lib/glusterd/vols/gv0/sst*\:-var-glusterfs, setting listen-port=0 for
> the remote brick (sst0) and listen-port=49152 for the local brick (sst2).
> It didn't help much, unfortunately. The final state I've reached is as
> follows:
>
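When brick files are edited by hand like this, it may help to print the recorded ports on each node and compare them. This is a minimal sketch, assuming the brick files are plain key=value text containing a "listen-port=" line as described above; the function name show_listen_ports and the directory argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the listen-port recorded
# in each key=value brick file under the given directory.
show_listen_ports() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue                      # skip subdirectories
        port=$(sed -n 's/^listen-port=//p' "$f")     # value after the "="
        printf '%s %s\n' "$(basename "$f")" "${port:-unset}"
    done
}
```

Running it against the directory that holds the brick files on sst0 and on sst2 gives two short listings that can be compared side by side.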
> sst2# gluster peer status
> Number of Peers: 2
>
> Hostname: sst1
> Uuid: 5a2198de-f536-4328-a278-7f746f276e35
> State: Sent and Received peer request (Connected)
>
> Hostname: sst0
> Uuid: 26b35bd7-ad7e-4a25-a3f9-70002771e1fc
> State: Sent and Received peer request (Connected)
>
> sst2# gluster volume info
>  
> Volume Name: gv0
> Type: Replicate
> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: sst0:/var/glusterfs
> Brick2: sst2:/var/glusterfs
> Options Reconfigured:
> cluster.self-heal-daemon: enable
> performance.readdir-ahead: on
> storage.owner-uid: 1000
> storage.owner-gid: 1000
>
> sst2# gluster volume status
> Status of volume: gv0
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick sst2:/var/glusterfs                   N/A       N/A        N       N/A
> NFS Server on localhost                     N/A       N/A        N       N/A
> NFS Server on localhost                     N/A       N/A        N       N/A
>
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
> Meanwhile, on sst0:
>
> sst0# gluster volume info
>  
> Volume Name: gv0
> Type: Replicate
> Volume ID: dd4996c0-04e6-4f9b-a04e-73279c4f112b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: sst0:/var/glusterfs
> Brick2: sst2:/var/glusterfs
> Options Reconfigured:
> storage.owner-gid: 1000
> storage.owner-uid: 1000
> performance.readdir-ahead: on
> cluster.self-heal-daemon: enable
>
> sst0 ~ # gluster volume status
> Status of volume: gv0
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick sst0:/var/glusterfs                   49152     0          Y       31263
> NFS Server on localhost                     N/A       N/A        N       N/A
> Self-heal Daemon on localhost               N/A       N/A        Y       31254
>
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
>
> Any ideas on how to bring sst2 back to normal are appreciated. As a last
> resort, I can schedule downtime, back up the data, kill the volume and
> start all over, but I would like to know if there is a shorter path. Thank
> you very much in advance.
>
> -- 
> Best Regards,
>
> Seva Gluschenko
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users

Seva Gluschenko

2017-Apr-28 09:52 UTC

[Gluster-users] lost one replica after upgrading glusterfs from 3.7 to 3.10, please help

Of course. Please find attached. Hope they can shed some light on this.


Thanks,

Seva


28.04.2017, 12:41, "Mohammed Rafi K C" <rkavunga at redhat.com>:
> Can you share the glusterd logs from the three nodes ?
>
> Rafi KC
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: glusterd-sst2.log
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20170428/aa41f57e/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: glusterd-sst1.log
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20170428/aa41f57e/attachment-0001.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: glusterd-sst0.log
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20170428/aa41f57e/attachment-0002.ksh>
