If you read my previous email, you will see that I noted that the string IS a GFID
and not the name of the file :)
You can find the name following the procedure at:
https://docs.gluster.org/en/latest/Troubleshooting/gfid-to-path/
Of course, doing that for every entry in .glusterfs will be slow, so you will
need to script it to match each GFID to its brick path - see the sketch below.
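For a single GFID, the lookup from that page boils down to something like this
(a minimal sketch; the aux-gfid-mount mount point /mnt/gfid is my assumption,
the volume name and host are taken from your earlier mails):

mount -t glusterfs -o aux-gfid-mount 192.168.1.x:/gvol /mnt/gfid
getfattr -n trusted.glusterfs.pathinfo -e text /mnt/gfid/.gfid/b53c8e46-068b-4286-94a6-7cf54f711983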
I guess the fastest way to find the deleted files (as far as I understood, they
were deleted on the brick directly and their entries in .glusterfs were left
behind) is to run the following steps:
0. Create a RAM-backed filesystem for the working files (tmpfs here, as ramfs
silently ignores the size option):
findmnt /mnt || mount -t tmpfs -o size=128M tmpfs /mnt
1. Get all inodes together with their paths:
ionice -c 2 -n 7 nice -n 15 find /full/path/to/brick -type f -exec ls -i {} \; > /mnt/data
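(Optional shortcut: with GNU find, steps 1 and 2 collapse into one pass and you
skip forking ls once per file:
ionice -c 2 -n 7 nice -n 15 find /full/path/to/brick -type f -printf '%i\n' > /mnt/inode_only )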
2. Get only the inodes:
nice -n 15 awk '{print $1}' /mnt/data > /mnt/inode_only
3. Now the fun starts -> find the inodes that are not duplicated. A file that
still exists is hard-linked from both its named path and .glusterfs, so its
inode appears twice; an orphaned .glusterfs entry appears only once. Note that
uniq only collapses adjacent lines, so sort first:
nice -n 15 sort -n /mnt/inode_only | uniq -u > /mnt/gfid-only
4. Once you have the inodes, you can verify that they really exist only in the
.glusterfs dir:
for i in $(cat /mnt/gfid-only); do ionice -c 2 -n 7 nice -n 15 find /path/to/.glusterfs -inum $i ; echo ; echo ; done
5. If it's OK -> delete:
for i in $(cat /mnt/gfid-only); do ionice -c 2 -n 7 nice -n 15 find /path/to/brick -inum $i -delete ; done
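(If you want an extra dry run, use -print instead of -delete first and review
the output before the real pass:
for i in $(cat /mnt/gfid-only); do ionice -c 2 -n 7 nice -n 15 find /path/to/brick -inum $i -print ; done )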
Last, repeat on all bricks.
Good luck!
P.S.: Consider creating a gluster snapshot before that - just in case...
Better safe than sorry.
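Note that gluster snapshots need thin-provisioned LVM bricks; as your bricks
sit on ZFS (zdata/brick), a plain dataset snapshot should be the practical
equivalent - the snapshot name here is arbitrary:
zfs snapshot zdata/brick@before-cleanup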
P.S.: If you think you have enough resources, you can drop the ionice/nice
prefixes; they are only there to guarantee you won't eat too many resources.
Best Regards,
Strahil Nikolov
On 8 August 2020 at 18:02:10 GMT+03:00, Mathias Waack <mathias.waack at
seim-partner.de> wrote:
>So b53c8e46-068b-4286-94a6-7cf54f711983 is not a gfid? What else is it?
>
>Mathias
>
>On 08.08.20 09:00, Strahil Nikolov wrote:
>> In glusterfs the long string is called "gfid" and does not represent
>> the name.
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On Friday, 7 August 2020, 21:40:11 GMT+3, Mathias Waack
>> <mathias.waack at seim-partner.de> wrote:
>>
>> Hi Strahil,
>>
>> but I cannot find these files in the heal info:
>>
>> find /zbrick/.glusterfs -links 1 -ls | grep -v ' -> '
>> ...
>> 7443397 132463 -rw------- 1 999 docker 1073741824 Aug 3 10:35
>> /zbrick/.glusterfs/b5/3c/b53c8e46-068b-4286-94a6-7cf54f711983
>>
>> Now looking for this file in the heal info:
>>
>> gluster volume heal gvol info | grep b53c8e46-068b-4286-94a6-7cf54f711983
>>
>> shows nothing.
>>
>> So I do not know what I have to heal...
>>
>> Mathias
>>
>> On 07.08.20 14:32, Strahil Nikolov wrote:
>>> Have you tried a gluster heal and checked whether the files are back in
>>> their place?
>>>
>>> I always thought that those hard links are used by the healing
>>> mechanism, and if that is true - gluster should restore the files to
>>> their original location, and then wiping the correct files from FUSE
>>> will be easy.
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>>
>>> On 7 August 2020 at 10:24:38 GMT+03:00, Mathias Waack
>>> <mathias.waack at seim-partner.de> wrote:
>>>> Hi all,
>>>>
>>>> maybe I should add some more information:
>>>>
>>>> The container which filled up the space was running on node x, which
>>>> still shows a nearly filled fs:
>>>>
>>>> 192.168.1.x:/gvol  2.6T  2.5T  149G  95% /gluster
>>>>
>>>> nearly the same situation on the underlying brick partition on node x:
>>>>
>>>> zdata/brick  2.6T  2.4T  176G  94% /zbrick
>>>>
>>>> On node y the network card crashed, glusterfs shows the same values:
>>>>
>>>> 192.168.1.y:/gvol  2.6T  2.5T  149G  95% /gluster
>>>>
>>>> but different values on the brick:
>>>>
>>>> zdata/brick  2.9T  1.6T  1.4T  54% /zbrick
>>>>
>>>> I think this happened because glusterfs still has hardlinks to the
>>>> deleted files on node x? So I can find these files with:
>>>>
>>>> find /zbrick/.glusterfs -links 1 -ls | grep -v ' -> '
>>>>
>>>> But now I am lost. How can I verify these files really belong to the
>>>> right container? Or can I just delete these files because there is no
>>>> way to access them? Or does glusterfs offer a way to solve this
>>>> situation?
>>>>
>>>> Mathias
>>>>
>>>> On 05.08.20 15:48, Mathias Waack wrote:
>>>>> Hi all,
>>>>>
>>>>> we are running a gluster setup with two nodes:
>>>>>
>>>>> Status of volume: gvol
>>>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>>>> ------------------------------------------------------------------------------
>>>>> Brick 192.168.1.x:/zbrick                   49152     0          Y       13350
>>>>> Brick 192.168.1.y:/zbrick                   49152     0          Y       5965
>>>>> Self-heal Daemon on localhost               N/A       N/A        Y       14188
>>>>> Self-heal Daemon on 192.168.1.93            N/A       N/A        Y       6003
>>>>>
>>>>> Task Status of Volume gvol
>>>>> ------------------------------------------------------------------------------
>>>>> There are no active volume tasks
>>>>>
>>>>> The glusterfs hosts a bunch of containers with their data volumes. The
>>>>> underlying fs is zfs. A few days ago one of the containers created a lot
>>>>> of files in one of its data volumes, and in the end it completely
>>>>> filled up the space of the glusterfs volume. But this happened only on
>>>>> one host; on the other host there was still enough space. We finally
>>>>> were able to identify this container and found out that the sizes of the
>>>>> data on /zbrick were different on both hosts for this container. Now
>>>>> we made the big mistake of deleting these files on both hosts in the
>>>>> /zbrick volume, not on the mounted glusterfs volume.
>>>>>
>>>>> Later we found the reason for this behavior: the network driver on the
>>>>> second node partially crashed (which means we were able to log in on
>>>>> the node, so we assumed the network was running, but the card was
>>>>> already dropping packets at this time) at the same time as the failed
>>>>> container started to fill up the gluster volume. After rebooting the
>>>>> second node the gluster became available again.
>>>>>
>>>>> Now the glusterfs volume is running again - but it is still (nearly)
>>>>> full: the files created by the container are not visible, but they
>>>>> still count against the amount of free space. How can we fix this?
>>>>>
>>>>> In addition there are some files which are no longer accessible
>>>>> since this accident:
>>>>>
>>>>> tail access.log.old
>>>>> tail: cannot open 'access.log.old' for reading: Input/output error
>>>>>
>>>>> It looks like the files affected by this error are those which have
>>>>> been changed during the accident. Is there a way to fix this too?
>>>>>
>>>>> Thanks
>>>>> Mathias