Hi all,
maybe I should add some more information:
The container which filled up the space was running on node x, which
still shows a nearly filled fs:
192.168.1.x:/gvol   2.6T  2.5T  149G  95% /gluster
nearly the same situation on the underlying brick partition on node x:
zdata/brick         2.6T  2.4T  176G  94% /zbrick
On node y, where the network card crashed, glusterfs shows the same
values:
192.168.1.y:/gvol   2.6T  2.5T  149G  95% /gluster
but different values on the brick:
zdata/brick         2.9T  1.6T  1.4T  54% /zbrick
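
To narrow down where the space went, the bricks can be compared
directly on both nodes (a rough check, run on node x and on node y):

du -xsh /zbrick/* /zbrick/.glusterfs | sort -h
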
I think this happened because glusterfs still has hardlinks (under
.glusterfs) to the files we deleted directly on node x. So I can find
these files with:
find /zbrick/.glusterfs -links 1 -ls | grep -v ' -> '
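
Each match lives under a path of the form
/zbrick/.glusterfs/ab/cd/<full gfid> (the standard .glusterfs layout),
so the GFID can be read off the file name and each entry inspected, for
example:

find /zbrick/.glusterfs -type f -links 1 | while read -r f; do
    echo "GFID: $(basename "$f")"   # the file name is the full GFID
    stat -c "%s bytes, mtime %y" "$f"
    file "$f"   # the content type may hint at the owning container
done
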
But now I am lost. How can I verify that these files really belong to
the right container? Or can I just delete these files, because there is
no way to access them anyway? Or does glusterfs offer a way to resolve
this situation?
Mathias
On 05.08.20 15:48, Mathias Waack wrote:
> Hi all,
>
> we are running a gluster setup with two nodes:
>
> Status of volume: gvol
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.1.x:/zbrick                   49152     0          Y       13350
> Brick 192.168.1.y:/zbrick                   49152     0          Y       5965
> Self-heal Daemon on localhost               N/A       N/A        Y       14188
> Self-heal Daemon on 192.168.1.93            N/A       N/A        Y       6003
>
> Task Status of Volume gvol
> ------------------------------------------------------------------------------
>
> There are no active volume tasks
>
> The glusterfs hosts a bunch of containers with their data volumes. The
> underlying fs is zfs. A few days ago one of the containers created a
> lot of files in one of its data volumes and in the end completely
> filled up the space of the glusterfs volume. But this happened only on
> one host; on the other host there was still enough space. We finally
> were able to identify this container and found that the size of its
> data on /zbrick differed between the two hosts. Then we made the big
> mistake of deleting these files on both hosts directly in the /zbrick
> brick, not on the mounted glusterfs volume.
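>
> In hindsight the difference is (the paths below are made up for
> illustration):
>
> # what we did: removing directly on the brick bypasses gluster, so
> # its internal .glusterfs hardlinks survive and no space is freed
> rm -rf /zbrick/containers/someid/data
> # what we should have done: removing via the mount point lets gluster
> # drop the hardlinks as well
> rm -rf /gluster/containers/someid/data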
>
> Later we found the reason for this behavior: the network driver on the
> second node had partially crashed at the same time as the failed
> container started to fill up the gluster volume (we were still able to
> log in on the node, so we assumed the network was running, but the
> card was already dropping packets at that point). After rebooting the
> second node the gluster volume became available again.
>
> Now the glusterfs volume is running again, but it is still (nearly)
> full: the files created by the container are not visible anymore, but
> they still count against the free space. How can we fix this?
>
> In addition there are some files which are no longer accessible since
> this accident:
>
> tail access.log.old
> tail: cannot open 'access.log.old' for reading: Input/output error
>
> It looks like the affected files are the ones which were changed
> during the incident. Is there a way to fix this too?
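>
> The I/O errors look like split-brain on the replica; as a first check
> I would try (standard gluster CLI, the source brick in the last
> command is only an example):
>
> gluster volume heal gvol info
> gluster volume heal gvol info split-brain
> # files listed as split-brain can then be resolved one by one,
> # e.g. by picking node y's copy as the good one:
> gluster volume heal gvol split-brain source-brick \
>     192.168.1.y:/zbrick /path/to/the/affected/file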
>
> Thanks
>     Mathias