thr3ads.net - Gluster users - [Gluster-users] [Stale file handle] in shard volume [Jan 2019]

Olaf Buitelaar

2019-Jan-13 16:40 UTC

[Gluster-users] [Stale file handle] in shard volume

@Krutika if you need any further information, please let me know.

Thanks Olaf

On Fri, 4 Jan 2019 at 07:51, Nithya Balachandran <nbalacha at redhat.com> wrote:
> Adding Krutika.
>
> On Wed, 2 Jan 2019 at 20:56, Olaf Buitelaar <olaf.buitelaar at gmail.com>
> wrote:
>
>> Hi Nithya,
>>
>> Thank you for your reply.
>>
>> The VMs using the gluster volumes keep getting paused/stopped on
>> errors like these:
>> [2019-01-02 02:33:44.469132] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on shard 101487 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c [Stale file handle]
>> [2019-01-02 02:33:44.563288] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard: Lookup on shard 101488 failed. Base file gfid = a38d64bc-a28b-4ee1-a0bb-f919e7a1022c [Stale file handle]
>>
>> Krutika, Can you take a look at this?
>
>
>>
>> What I'm trying to find out is whether I can purge all possible stale
>> file handles from all gluster volumes (and hopefully find a way to
>> prevent this in the future), so the VMs can run stably again.
>> For this I need to know when the "shard_common_lookup_shards_cbk"
>> function considers a file stale.
>> The statement "Stale file handle errors show up when a file with a
>> specified gfid is not found" doesn't seem to cover it all: as I've shown
>> in earlier mails, the shard file and the .glusterfs/xx/xx/uuid file both
>> exist and have the same inode.
>> If the criteria I'm using aren't correct, could you please tell me which
>> criteria I should use to determine whether a file is stale?
>> These criteria are based only on observations I made while moving the
>> stale files manually. After removing them I was able to start the VM
>> again, until some time later it unfortunately hung on another stale
>> shard file.
>>
>> Thanks Olaf
>>
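A minimal sketch for turning the "Lookup on shard N failed" log entries quoted above into the shard file names to check on each brick. It assumes the log line format shown in this thread and the `.shard/<base-gfid>.<shard-number>` naming seen in the later `getfattr` output; the function name `shards_from_log` is made up for illustration.

```shell
# Reads a gluster mount log on stdin and prints one brick-relative path per
# shard that appears in a "Lookup on shard N failed" entry (duplicates removed).
# Output form: .shard/<base-gfid>.<shard-number>
shards_from_log() {
    sed -n -E 's#.*Lookup on shard ([0-9]+) failed\. Base file gfid = ([0-9a-f-]{36}).*#.shard/\2.\1#p' | sort -u
}
```

Feeding it the mount log (its location depends on the mount point) gives a list that can then be inspected on every brick with `ls -l`, `stat` and `getfattr`, as done further down in this thread.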
>> On Wed, 2 Jan 2019 at 14:20, Nithya Balachandran <nbalacha at redhat.com>
>> wrote:
>>
>>>
>>>
>>> On Mon, 31 Dec 2018 at 01:27, Olaf Buitelaar <olaf.buitelaar at gmail.com>
>>> wrote:
>>>
>>>> Dear All,
>>>>
>>>> Until now, a select group of VMs still seems to produce new stale
>>>> files and gets paused because of this.
>>>> I've not updated gluster recently; however, I did change the op-version
>>>> from 31200 to 31202 about a week before this issue arose.
>>>> Looking at the .shard directory, I have 100,000+ files sharing the
>>>> same characteristics as the stale files found so far:
>>>> they all have the sticky bit set (file permissions ---------T),
>>>> are 0 KB in size, and have the trusted.glusterfs.dht.linkto attribute.
>>>>
>>>
>>> These are internal files used by gluster and do not necessarily mean
>>> they are stale. They "point" to data files which may be on different
>>> bricks (same name, gfid etc. but no linkto xattr and no ----T permissions).
>>>
>>>
>>>> These files range from long ago (the beginning of the year) until now,
>>>> which makes me suspect this was lying dormant for some time and
>>>> somehow surfaced recently.
>>>> The other sub-volumes also contain 0 KB files in the .shard
>>>> directory, but those don't have the sticky bit or the linkto attribute.
>>>>
>>>> Does anybody else experience this issue? Could this be a bug or an
>>>> environmental issue?
>>>>
>>> These are most likely valid files; please do not delete them without
>>> double-checking.
>>>
>>> Stale file handle errors show up when a file with a specified gfid is
>>> not found. You will need to debug the files for which you see this error
>>> by checking the bricks to see if they actually exist.
>>>
>>>>
>>>> Also, I wonder if there is any tool or gluster command to clean up all
>>>> stale file handles.
>>>> Otherwise I'm planning to write a simple bash script which iterates
>>>> over the .shard dir, checks each file against the above-mentioned
>>>> criteria, and (re)moves the file and the corresponding .glusterfs file.
>>>> If other criteria are needed to identify a stale file handle, I
>>>> would like to hear them.
>>>> That is, if this is a viable and safe operation, of course.
>>>>
>>>> Thanks Olaf
>>>>
>>>>
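A read-only sketch of the check described in the message above. It only lists files and deletes nothing, since the reply warns that such linkto files are usually valid. It assumes it runs as root on a brick, that the `attr` package's `getfattr` is installed, and that the brick root is passed as the first argument; both function names are hypothetical.

```shell
# List regular files under <brick>/.shard whose mode is exactly 1000
# (---------T) and whose size is zero. These are only candidates.
list_linkto_candidates() {
    find "$1/.shard" -type f -perm 1000 -size 0 -print
}

# For each candidate, print the subvolume named in its
# trusted.glusterfs.dht.linkto xattr. trusted.* xattrs are readable by root only.
show_linkto_targets() {
    list_linkto_candidates "$1" | while IFS= read -r f; do
        # The stored value ends in a NUL byte, so strip it before printing.
        target=$(getfattr --absolute-names --only-values \
                     -n trusted.glusterfs.dht.linkto "$f" 2>/dev/null | tr -d '\000')
        printf '%s -> %s\n' "$f" "${target:-<no linkto xattr>}"
    done
}
```

For example, `show_linkto_targets /data/gfs/bricks/brick1/ovirt-backbone-2` would print each candidate with the replicate subvolume it points to. Whether a given entry is actually stale still has to be decided by checking for the data file on that subvolume's bricks.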
>>>>
>>>> On Thu, 20 Dec 2018 at 13:43, Olaf Buitelaar <olaf.buitelaar at gmail.com>
>>>> wrote:
>>>>
>>>>> Dear All,
>>>>>
>>>>> I figured it out; it appeared to be the exact same issue as described
>>>>> here:
>>>>> https://lists.gluster.org/pipermail/gluster-users/2018-March/033785.html
>>>>> Another subvolume also had the shard file, only there it was 0 bytes and
>>>>> had the dht.linkto attribute.
>>>>>
>>>>> For reference:
>>>>> [root@lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d
>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100
>>>>>
>>>>> [root@lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>>>> # file: .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d
>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100
>>>>>
>>>>> [root@lease-04 ovirt-backbone-2]# stat .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>>>>   File: ‘.glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d’
>>>>>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
>>>>> Device: fd01h/64769d    Inode: 1918631406  Links: 2
>>>>> Access: (1000/---------T)  Uid: (    0/    root)   Gid: (    0/    root)
>>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>>> Access: 2018-12-17 21:43:36.405735296 +0000
>>>>> Modify: 2018-12-17 21:43:36.405735296 +0000
>>>>> Change: 2018-12-17 21:43:36.405735296 +0000
>>>>>  Birth: -
>>>>>
>>>>> Removing the shard file and the .glusterfs file from each node resolved
>>>>> the issue.
>>>>>
>>>>> I also found this thread:
>>>>> https://lists.gluster.org/pipermail/gluster-users/2018-December/035460.html
>>>>> Maybe he suffers from the same issue.
>>>>>
>>>>> Best Olaf
>>>>>
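Two small helpers that may make the hex output above easier to read. This is a sketch using only `sed` and `awk`, assuming the `0x...` value format printed by `getfattr -e hex`; the function names are made up for illustration.

```shell
# Map a trusted.gfid hex value to the brick-relative .glusterfs path of its
# hard link: .glusterfs/<first 2 hex digits>/<next 2>/<gfid in UUID form>.
gfid_to_backend_path() {
    printf '%s\n' "${1#0x}" | sed -E \
        's#^(..)(..)(.{4})(.{4})(.{4})(.{4})(.{12})$#.glusterfs/\1/\2/\1\2\3-\4-\5-\6-\7#'
}

# Decode a hex-encoded xattr value into text, dropping NUL bytes
# (the linkto value ends in one).
decode_hex_xattr() {
    printf '%s\n' "${1#0x}" | awk '{
        s = tolower($0); h = "0123456789abcdef"; out = ""
        for (i = 1; i < length(s); i += 2) {
            v = (index(h, substr(s, i, 1)) - 1) * 16 + index(h, substr(s, i + 1, 1)) - 1
            if (v > 0) out = out sprintf("%c", v)
        }
        print out
    }'
}
```

Applied to the values above, `gfid_to_backend_path 0x298147e49f9748b2baf1c8fff897244d` gives `.glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d`, and `decode_hex_xattr` on the linkto value gives `ovirt-backbone-2-replicate-1`, the subvolume the zero-byte file points to.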
>>>>>
>>>>> On Wed, 19 Dec 2018 at 21:56, Olaf Buitelaar <olaf.buitelaar at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> It appears I have stale file handles in one of the volumes, on 2 files.
>>>>>> These files are qemu images (1 raw and 1 qcow2).
>>>>>> I'll focus on just 1 file, since the situation with the other seems the
>>>>>> same.
>>>>>>
>>>>>> The VM gets paused more or less directly after being booted, with the
>>>>>> error:
>>>>>> [2018-12-18 14:05:05.275713] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-backbone-2-shard: Lookup on shard 51500 failed. Base file gfid = f28cabcb-d169-41fc-a633-9bef4c4a8e40 [Stale file handle]
>>>>>>
>>>>>> Investigating the shard:
>>>>>>
>>>>>> # on the arbiter node:
>>>>>>
>>>>>> [root@lease-05 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>> getfattr: Removing leading '/' from absolute path names
>>>>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>>>>
>>>>>> [root@lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>
>>>>>> [root@lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>
>>>>>> [root@lease-05 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>   File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>>>>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
>>>>>> Device: fd01h/64769d    Inode: 537277306   Links: 2
>>>>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>>>> Access: 2018-12-17 21:43:36.361984810 +0000
>>>>>> Modify: 2018-12-17 21:43:36.361984810 +0000
>>>>>> Change: 2018-12-18 20:55:29.908647417 +0000
>>>>>>  Birth: -
>>>>>>
>>>>>> [root@lease-05 ovirt-backbone-2]# find . -inum 537277306
>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>
>>>>>> # on the data nodes:
>>>>>>
>>>>>> [root@lease-08 ~]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>> getfattr: Removing leading '/' from absolute path names
>>>>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>>>>
>>>>>> [root@lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>
>>>>>> [root@lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>
>>>>>> [root@lease-08 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>   File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>>>>   Size: 2166784         Blocks: 4128       IO Block: 4096   regular file
>>>>>> Device: fd03h/64771d    Inode: 12893624759  Links: 3
>>>>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>>>> Access: 2018-12-18 18:52:38.070776585 +0000
>>>>>> Modify: 2018-12-17 21:43:36.388054443 +0000
>>>>>> Change: 2018-12-18 21:01:47.810506528 +0000
>>>>>>  Birth: -
>>>>>>
>>>>>> [root@lease-08 ovirt-backbone-2]# find . -inum 12893624759
>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>
>>>>>> ========================
>>>>>>
>>>>>> [root@lease-11 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>> getfattr: Removing leading '/' from absolute path names
>>>>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>>>>
>>>>>> [root@lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>
>>>>>> [root@lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>
>>>>>> [root@lease-11 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>   File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>>>>   Size: 2166784         Blocks: 4128       IO Block: 4096   regular file
>>>>>> Device: fd03h/64771d    Inode: 12956094809  Links: 3
>>>>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>>>> Access: 2018-12-18 20:11:53.595208449 +0000
>>>>>> Modify: 2018-12-17 21:43:36.391580259 +0000
>>>>>> Change: 2018-12-18 19:19:25.888055392 +0000
>>>>>>  Birth: -
>>>>>>
>>>>>> [root@lease-11 ovirt-backbone-2]# find . -inum 12956094809
>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>
>>>>>> ================
>>>>>>
>>>>>> I don't really see any inconsistencies, except the dates in the stat
>>>>>> output. However, this is only after I tried moving the file out of the
>>>>>> volumes to force a heal, which does happen on the data nodes, but not on
>>>>>> the arbiter node. Before that they were also the same.
>>>>>> I've also compared the file
>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 on the 2 nodes, and
>>>>>> the copies are exactly the same.
>>>>>>
>>>>>> Things I've further tried:
>>>>>> - gluster v heal ovirt-backbone-2 full => gluster v heal
>>>>>> ovirt-backbone-2 info reports 0 entries on all nodes
>>>>>>
>>>>>> - stop each glusterd and glusterfsd, pause around 40 sec and start
>>>>>> them again on each node, 1 at a time, waiting for the heal to recover
>>>>>> before moving to the next node
>>>>>>
>>>>>> - force a heal by stopping glusterd on a node and performing these steps:
>>>>>> mkdir /mnt/ovirt-backbone-2/trigger
>>>>>> rmdir /mnt/ovirt-backbone-2/trigger
>>>>>> setfattr -n trusted.non-existent-key -v abc /mnt/ovirt-backbone-2/
>>>>>> setfattr -x trusted.non-existent-key /mnt/ovirt-backbone-2/
>>>>>> start glusterd
>>>>>>
>>>>>> - gluster volume rebalance ovirt-backbone-2 start => success
>>>>>>
>>>>>> What's further interesting is that, according to the mount log, the
>>>>>> volume is in split-brain:
>>>>>> [2018-12-18 10:06:04.606870] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>>>>> [2018-12-18 10:06:04.606908] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>>>>> [2018-12-18 10:06:04.606927] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428090: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>>>>> [2018-12-18 10:06:05.107729] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>>>>> [2018-12-18 10:06:05.107770] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>>>>> [2018-12-18 10:06:05.107791] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428091: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>>>>> [2018-12-18 10:06:05.537244] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>>>>> [2018-12-18 10:06:05.538523] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing STAT on gfid 00000000-0000-0000-0000-000000000001: split-brain observed. [Input/output error]
>>>>>> [2018-12-18 10:06:05.538685] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>>>>> [2018-12-18 10:06:05.538794] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>>>>> [2018-12-18 10:06:05.539342] I [MSGID: 109063] [dht-layout.c:716:dht_layout_normalize] 0-ovirt-backbone-2-dht: Found anomalies in /b1c2c949-aef4-4aec-999b-b179efeef732 (gfid = 8c8598ce-1a52-418e-a7b4-435fee34bae8). Holes=2 overlaps=0
>>>>>> [2018-12-18 10:06:05.539372] W [MSGID: 109005] [dht-selfheal.c:2158:dht_selfheal_directory] 0-ovirt-backbone-2-dht: Directory selfheal failed: 2 subvolumes down.Not fixing. path = /b1c2c949-aef4-4aec-999b-b179efeef732, gfid = 8c8598ce-1a52-418e-a7b4-435fee34bae8
>>>>>> [2018-12-18 10:06:05.539694] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>>>>> [2018-12-18 10:06:05.540652] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>>>>> [2018-12-18 10:06:05.608612] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>>>>> [2018-12-18 10:06:05.608657] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>>>>> [2018-12-18 10:06:05.608672] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428096: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>>>>> [2018-12-18 10:06:06.109339] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>>>>> [2018-12-18 10:06:06.109378] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>>>>> [2018-12-18 10:06:06.109399] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428097: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>>>>>
>>>>>> # Note that I'm able to see
>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids:
>>>>>> [root@lease-11 ovirt-backbone-2]# stat /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids
>>>>>>   File: ‘/mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids’
>>>>>>   Size: 1048576         Blocks: 2048       IO Block: 131072 regular file
>>>>>> Device: 41h/65d Inode: 10492258721813610344  Links: 1
>>>>>> Access: (0660/-rw-rw----)  Uid: (   36/    vdsm)   Gid: (   36/     kvm)
>>>>>> Context: system_u:object_r:fusefs_t:s0
>>>>>> Access: 2018-12-19 20:07:39.917573869 +0000
>>>>>> Modify: 2018-12-19 20:07:39.928573917 +0000
>>>>>> Change: 2018-12-19 20:07:39.929573921 +0000
>>>>>>  Birth: -
>>>>>>
>>>>>> However, gluster v heal ovirt-backbone-2 info split-brain
>>>>>> reports no entries.
>>>>>>
>>>>>> I've also tried mounting the qemu image, and this works fine; I'm
>>>>>> able to see all contents:
>>>>>>  losetup /dev/loop0 /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>>  kpartx -a /dev/loop0
>>>>>>  vgscan
>>>>>>  vgchange -ay slave-data
>>>>>>  mkdir /mnt/slv01
>>>>>>  mount /dev/mapper/slave--data-lvol0 /mnt/slv01/
>>>>>>
>>>>>> Possible causes for this issue:
>>>>>> 1. The machine "lease-11" suffered from a faulty RAM module (ECC),
>>>>>> which halted the machine and caused an invalid state. (This machine also
>>>>>> hosts other volumes with similar configurations, which report no issues.)
>>>>>> 2. After the RAM module was replaced, the VM using the backing qemu
>>>>>> image was restored from a backup (the backup was file-based, within the
>>>>>> VM, in a different directory), because some files were corrupted. The
>>>>>> backup/recovery obviously causes extra IO, possibly introducing race
>>>>>> conditions? The machine did run for about 12h without issues, and in
>>>>>> total for about 36h.
>>>>>> 3. Since only the client (maybe only gfapi?) reports errors,
>>>>>> is something broken there?
>>>>>>
>>>>>> The volume info:
>>>>>> root@lease-06 ~# gluster v info ovirt-backbone-2
>>>>>>
>>>>>> Volume Name: ovirt-backbone-2
>>>>>> Type: Distributed-Replicate
>>>>>> Volume ID: 85702d35-62c8-4c8c-930d-46f455a8af28
>>>>>> Status: Started
>>>>>> Snapshot Count: 0
>>>>>> Number of Bricks: 3 x (2 + 1) = 9
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2
>>>>>> Brick2: 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2
>>>>>> Brick3: 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>>>>> Brick4: 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>>>>> Brick5: 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>>>>> Brick6: 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>>>>> Brick7: 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>>>>> Brick8: 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>>>>> Brick9: 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>>>>> Options Reconfigured:
>>>>>> nfs.disable: on
>>>>>> transport.address-family: inet
>>>>>> performance.quick-read: off
>>>>>> performance.read-ahead: off
>>>>>> performance.io-cache: off
>>>>>> performance.low-prio-threads: 32
>>>>>> network.remote-dio: enable
>>>>>> cluster.eager-lock: enable
>>>>>> cluster.quorum-type: auto
>>>>>> cluster.server-quorum-type: server
>>>>>> cluster.data-self-heal-algorithm: full
>>>>>> cluster.locking-scheme: granular
>>>>>> cluster.shd-max-threads: 8
>>>>>> cluster.shd-wait-qlength: 10000
>>>>>> features.shard: on
>>>>>> user.cifs: off
>>>>>> storage.owner-uid: 36
>>>>>> storage.owner-gid: 36
>>>>>> features.shard-block-size: 64MB
>>>>>> performance.write-behind-window-size: 512MB
>>>>>> performance.cache-size: 384MB
>>>>>> cluster.brick-multiplex: on
>>>>>>
>>>>>> The volume status:
>>>>>> root@lease-06 ~# gluster v status ovirt-backbone-2
>>>>>> Status of volume: ovirt-backbone-2
>>>>>> Gluster process                                             TCP Port  RDMA Port  Online  Pid
>>>>>> ------------------------------------------------------------------------------
>>>>>> Brick 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2    49152     0          Y       7727
>>>>>> Brick 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2    49152     0          Y       12620
>>>>>> Brick 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2    49152     0          Y       8794
>>>>>> Brick 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2   49161     0          Y       22333
>>>>>> Brick 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2  49152     0          Y       15030
>>>>>> Brick 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2    49166     0          Y       24592
>>>>>> Brick 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2   49153     0          Y       20148
>>>>>> Brick 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2  49154     0          Y       15413
>>>>>> Brick 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2    49152     0          Y       43120
>>>>>> Self-heal Daemon on localhost                               N/A       N/A        Y       44587
>>>>>> Self-heal Daemon on 10.201.0.2                              N/A       N/A        Y       8401
>>>>>> Self-heal Daemon on 10.201.0.5                              N/A       N/A        Y       11038
>>>>>> Self-heal Daemon on 10.201.0.8                              N/A       N/A        Y       9513
>>>>>> Self-heal Daemon on 10.32.9.4                               N/A       N/A        Y       23736
>>>>>> Self-heal Daemon on 10.32.9.20                              N/A       N/A        Y       2738
>>>>>> Self-heal Daemon on 10.32.9.3                               N/A       N/A        Y       25598
>>>>>> Self-heal Daemon on 10.32.9.5                               N/A       N/A        Y       511
>>>>>> Self-heal Daemon on 10.32.9.9                               N/A       N/A        Y       23357
>>>>>> Self-heal Daemon on 10.32.9.8                               N/A       N/A        Y       15225
>>>>>> Self-heal Daemon on 10.32.9.7                               N/A       N/A        Y       25781
>>>>>> Self-heal Daemon on 10.32.9.21                              N/A       N/A        Y       5034
>>>>>>
>>>>>> Task Status of Volume ovirt-backbone-2
>>>>>> ------------------------------------------------------------------------------
>>>>>> Task                 : Rebalance
>>>>>> ID                   : 6dfbac43-0125-4568-9ac3-a2c453faaa3d
>>>>>> Status               : completed
>>>>>>
>>>>>> gluster version is @3.12.15 and cluster.op-version=31202
>>>>>> ========================
>>>>>>
>>>>>> It would be nice to know whether it's possible to mark the files as not
>>>>>> stale, or whether I should investigate other things.
>>>>>> Or should we consider this volume lost?
>>>>>> Also, checking the code at
>>>>>> https://github.com/gluster/glusterfs/blob/master/xlators/features/shard/src/shard.c
>>>>>> it seems the functions have shifted quite a bit (line 1724 vs. 2243), so
>>>>>> maybe it's fixed in a later version?
>>>>>> Any thoughts are welcome.
>>>>>>
>>>>>> Thanks Olaf

Krutika Dhananjay

2019-Jan-14 07:15 UTC


[Gluster-users] [Stale file handle] in shard volume

Hi,

So the main issue is that certain VMs seem to be pausing? Did I understand
that right?
Could you share the gluster mount logs from around the time the pause was seen?
And the brick logs too, please?

As for the ESTALE errors, the real cause of the pauses can be determined from
the errors/warnings logged by fuse. The mere occurrence of ESTALE errors
against a shard function in the logs doesn't necessarily indicate that they are
the reason for the pause. Also, in this instance, the ESTALE errors seem to be
propagated by the lower translators (DHT? protocol/client? Or even the
bricks?), and shard is merely logging them.

-Krutika
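
One possible way to collect the requested log excerpt is sketched below. It assumes the timestamp and severity layout of the log lines quoted earlier in this thread (`[YYYY-MM-DD HH:MM:SS.usec] E ...`); the function name `log_errors_between` is hypothetical, and the log file location depends on the mount point and brick paths.

```shell
# Print only error (E) and warning (W) entries whose timestamp lies between
# $1 and $2 (inclusive), both given as "YYYY-MM-DD HH:MM:SS" in the same
# timezone as the log. The log is read from stdin.
log_errors_between() {
    awk -v from="$1" -v to="$2" '
        /^\[[0-9]/ && ($3 == "E" || $3 == "W") {
            # $1 is "[YYYY-MM-DD", $2 is "HH:MM:SS.usec]"
            ts = substr($1, 2) " " substr($2, 1, 8)
            if (ts >= from && ts <= to) print
        }'
}
```

Running it with a window of a minute or two around the moment a VM paused, once on the mount log and once on each brick log, keeps the fuse warnings and drops the informational entries.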


On Sun, Jan 13, 2019 at 10:11 PM Olaf Buitelaar <olaf.buitelaar at
gmail.com>
wrote:
> @Krutika if you need any further information, please let me know.
>
> Thanks Olaf
>
> Op vr 4 jan. 2019 om 07:51 schreef Nithya Balachandran <
> nbalacha at redhat.com>:
>
>> Adding Krutika.
>>
>> On Wed, 2 Jan 2019 at 20:56, Olaf Buitelaar <olaf.buitelaar at
gmail.com>
>> wrote:
>>
>>> Hi Nithya,
>>>
>>> Thank you for your reply.
>>>
>>> the VM's using the gluster volumes keeps on getting
paused/stopped on
>>> errors like these;
>>> [2019-01-02 02:33:44.469132] E [MSGID: 133010]
>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard:
Lookup on
>>> shard 101487 failed. Base file gfid =
a38d64bc-a28b-4ee1-a0bb-f919e7a1022c
>>> [Stale file handle]
>>> [2019-01-02 02:33:44.563288] E [MSGID: 133010]
>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-kube-shard:
Lookup on
>>> shard 101488 failed. Base file gfid =
a38d64bc-a28b-4ee1-a0bb-f919e7a1022c
>>> [Stale file handle]
>>>
>>> Krutika, Can you take a look at this?
>>
>>
>>>
>>> What i'm trying to find out, if i can purge all gluster volumes
from all
>>> possible stale file handles (and hopefully find a method to prevent
this in
>>> the future), so the VM's can start running stable again.
>>> For this i need to know when the
"shard_common_lookup_shards_cbk"
>>> function considers a file as stale.
>>> The statement; "Stale file handle errors show up when a file
with a
>>> specified gfid is not found." doesn't seem to cover it
all, as i've shown
>>> in earlier mails the shard file and glusterfs/xx/xx/uuid file do
both
>>> exist, and have the same inode.
>>> If the criteria i'm using aren't correct, could you please
tell me which
>>> criteria i should use to determine if a file is stale or not?
>>> these criteria are just based observations i made, moving the stale
>>> files manually. After removing them i was able to start the VM
again..until
>>> some time later it hangs on another stale shard file unfortunate.
>>>
>>> Thanks Olaf
>>>
>>> Op wo 2 jan. 2019 om 14:20 schreef Nithya Balachandran <
>>> nbalacha at redhat.com>:
>>>
>>>>
>>>>
>>>> On Mon, 31 Dec 2018 at 01:27, Olaf Buitelaar <olaf.buitelaar
at gmail.com>
>>>> wrote:
>>>>
>>>>> Dear All,
>>>>>
>>>>> Until now a select group of VMs still seems to produce new stale
>>>>> files and gets paused because of this.
>>>>> I've not updated gluster recently; however, I did change the op-version
>>>>> from 31200 to 31202 about a week before this issue arose.
>>>>> Looking at the .shard directory, I have 100,000+ files sharing the same
>>>>> characteristics as the stale files found so far:
>>>>> they all have the sticky bit set (e.g. file permissions ---------T),
>>>>> are 0 KB in size, and have the trusted.glusterfs.dht.linkto attribute.
>>>>>
>>>>
>>>> These are internal files used by gluster and do not necessarily mean
>>>> they are stale. They "point" to data files which may be on different
>>>> bricks (same name, gfid etc but no linkto xattr and no ----T
>>>> permissions).
>>>>
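To see which subvolume such a pointer file refers to, the hex value that `getfattr -e hex` prints for trusted.glusterfs.dht.linkto can be decoded back to text. The following is only a minimal sketch in plain bash; the function name `decode_hex_xattr` is made up for illustration, and the sample value is the linkto xattr quoted further down in this thread.

```bash
#!/usr/bin/env bash
# Decode a hex-encoded xattr value (as printed by `getfattr -e hex`) to text.
# decode_hex_xattr is a hypothetical helper name, not a gluster tool.
decode_hex_xattr() {
    local hex=${1#0x} out="" ch byte i
    for ((i = 0; i < ${#hex}; i += 2)); do
        byte=${hex:i:2}
        # Skip NUL bytes such as the trailing 00 terminator.
        if [[ $byte == 00 ]]; then
            continue
        fi
        printf -v ch "\\x$byte"
        out+=$ch
    done
    printf '%s\n' "$out"
}

# linkto value taken from the getfattr output quoted in this thread
decode_hex_xattr 0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100
# prints: ovirt-backbone-2-replicate-1
```

In other words, the zero-byte file on this brick points at the replicate-1 subvolume, which is where the data file would be expected to live.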
>>>>
>>>>> These files range from long ago (beginning of the year) until now,
>>>>> which makes me suspect this was lying dormant for some time and
>>>>> somehow surfaced recently.
>>>>> Checking other sub-volumes, they also contain 0 KB files in the .shard
>>>>> directory, but those don't have the sticky bit or the linkto attribute.
>>>>>
>>>>> Does anybody else experience this issue? Could this be a bug or an
>>>>> environmental issue?
>>>>>
>>>> These are most likely valid files - please do not delete them without
>>>> double-checking.
>>>>
>>>> Stale file handle errors show up when a file with a specified gfid is
>>>> not found. You will need to debug the files for which you see this error
>>>> by checking the bricks to see if they actually exist.
>>>>
>>>>>
>>>>> Also, I wonder if there is any tool or gluster command to clean all
>>>>> stale file handles?
>>>>> Otherwise I'm planning to make a simple bash script which iterates
>>>>> over the .shard dir, checks each file for the above-mentioned criteria,
>>>>> and (re)moves the file and the corresponding .glusterfs file.
>>>>> If there are other criteria needed to identify a stale file handle, I
>>>>> would like to hear them.
>>>>> That is, if this is a viable and safe operation to do, of course.
>>>>>
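A report-only version of that idea could look roughly like the sketch below. It assumes GNU find and the getfattr tool from the attr package, must be run as root on a brick (trusted.* xattrs are not readable otherwise), and uses made-up function names. It only lists files matching the criteria above (mode ---------T, zero size, linkto xattr present) and deliberately deletes nothing, since such files are normally valid DHT pointers.

```bash
#!/usr/bin/env bash
# Report-only sketch; function names are hypothetical, nothing is removed.

# List regular files directly under a directory whose mode is exactly 1000
# (---------T) and whose size is zero.
find_sticky_empty() {
    find "$1" -maxdepth 1 -type f -perm 1000 -size 0 -print
}

# Succeeds if the file carries the trusted.glusterfs.dht.linkto xattr.
# Needs root on the brick and the getfattr binary.
has_linkto() {
    getfattr --absolute-names -n trusted.glusterfs.dht.linkto "$1" \
        >/dev/null 2>&1
}

# Combine both checks and print the matching candidates.
list_linkto_candidates() {
    local f
    find_sticky_empty "$1" | while IFS= read -r f; do
        if has_linkto "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Example (brick path is an assumption, adjust to the local layout):
#   list_linkto_candidates /data/gfs/bricks/brick1/ovirt-backbone-2/.shard
```

Matching these criteria does not by itself make a file stale; each candidate would still need to be checked against the subvolume it points to before anything is moved.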
>>>>> Thanks Olaf
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 20 Dec 2018 at 13:43, Olaf Buitelaar <olaf.buitelaar at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> I figured it out; it appeared to be the exact same issue as described
>>>>>> here:
>>>>>> https://lists.gluster.org/pipermail/gluster-users/2018-March/033785.html
>>>>>> Another subvolume also had the shard file, only there the copies were
>>>>>> all 0 bytes and had the dht.linkto attribute.
>>>>>>
>>>>>> for reference;
>>>>>> [root@lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d
>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100
>>>>>>
>>>>>> [root@lease-04 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>>>>> # file: .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>> trusted.gfid=0x298147e49f9748b2baf1c8fff897244d
>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>> trusted.glusterfs.dht.linkto=0x6f766972742d6261636b626f6e652d322d7265706c69636174652d3100
>>>>>>
>>>>>> [root@lease-04 ovirt-backbone-2]# stat .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
>>>>>>   File: ‘.glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d’
>>>>>>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
>>>>>> Device: fd01h/64769d    Inode: 1918631406  Links: 2
>>>>>> Access: (1000/---------T)  Uid: (    0/    root)   Gid: (    0/    root)
>>>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>>>> Access: 2018-12-17 21:43:36.405735296 +0000
>>>>>> Modify: 2018-12-17 21:43:36.405735296 +0000
>>>>>> Change: 2018-12-17 21:43:36.405735296 +0000
>>>>>>  Birth: -
>>>>>>
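The relation between the trusted.gfid value and the .glusterfs path in the output above is mechanical: the first two hex digits and the next two form the directory levels, and the file name is the gfid in dashed UUID form. A small sketch of that mapping, with a made-up function name and no claim to cover every gluster version:

```bash
#!/usr/bin/env bash
# Map a trusted.gfid hex value to its .glusterfs backend path on a brick.
# gfid_to_backend_path is a hypothetical helper name.
gfid_to_backend_path() {
    local hex=${1#0x} re='^[0-9a-f]{32}$'
    hex=$(printf '%s' "$hex" | tr 'A-F' 'a-f')
    if [[ ! $hex =~ $re ]]; then
        echo "not a 32-digit hex gfid: $1" >&2
        return 1
    fi
    # Directory levels are hex digits 1-2 and 3-4; the name is the dashed UUID.
    printf '.glusterfs/%s/%s/%s-%s-%s-%s-%s\n' \
        "${hex:0:2}" "${hex:2:2}" \
        "${hex:0:8}" "${hex:8:4}" "${hex:12:4}" "${hex:16:4}" "${hex:20:12}"
}

gfid_to_backend_path 0x298147e49f9748b2baf1c8fff897244d
# prints: .glusterfs/29/81/298147e4-9f97-48b2-baf1-c8fff897244d
```

This gives the second path that has to be handled together with the .shard entry, since both are hard links to the same inode (Links: 2 in the stat output).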
>>>>>> Removing the shard file and the .glusterfs file from each node resolved
>>>>>> the issue.
>>>>>>
>>>>>> I also found this thread:
>>>>>> https://lists.gluster.org/pipermail/gluster-users/2018-December/035460.html
>>>>>> Maybe he suffers from the same issue.
>>>>>>
>>>>>> Best Olaf
>>>>>>
>>>>>>
>>>>>> On Wed, 19 Dec 2018 at 21:56, Olaf Buitelaar <olaf.buitelaar at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Dear All,
>>>>>>>
>>>>>>> It appears I have a stale file handle in one of the volumes, on 2 files.
>>>>>>> These files are qemu images (1 raw and 1 qcow2).
>>>>>>> I'll just focus on 1 file since the situation on the other seems the
>>>>>>> same.
>>>>>>>
>>>>>>> The VM gets paused more or less directly after being booted, with
>>>>>>> this error:
>>>>>>> [2018-12-18 14:05:05.275713] E [MSGID: 133010]
>>>>>>> [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-backbone-2-shard:
>>>>>>> Lookup on shard 51500 failed. Base file gfid =
>>>>>>> f28cabcb-d169-41fc-a633-9bef4c4a8e40 [Stale file handle]
>>>>>>>
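The shard file to inspect on the bricks follows from the log entry: it is named .shard/<base gfid>.<shard number>. A rough sketch that extracts this from client log lines, assuming each entry sits on a single line in the real log file and uses the "Base file gfid = <gfid>" form seen in the 2019-01-02 entries (the function name is made up):

```bash
#!/usr/bin/env bash
# Turn MSGID 133010 "Lookup on shard N failed" entries into .shard paths.
# shard_path_from_log is a hypothetical helper; it reads log lines on stdin.
shard_path_from_log() {
    sed -nE 's/.*Lookup on shard ([0-9]+) failed\. Base file gfid = ([0-9a-f-]+).*/.shard\/\2.\1/p'
}

printf '%s\n' '[2018-12-18 14:05:05.275713] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-backbone-2-shard: Lookup on shard 51500 failed. Base file gfid = f28cabcb-d169-41fc-a633-9bef4c4a8e40 [Stale file handle]' |
    shard_path_from_log
# prints: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
```

Piping a whole mount log through the function and then through `sort -u` would give the list of distinct shard files to check on each brick.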
>>>>>>> Investigating the shard:
>>>>>>>
>>>>>>> # on the arbiter node:
>>>>>>>
>>>>>>> [root@lease-05 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>>> getfattr: Removing leading '/' from absolute path names
>>>>>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>>>>>
>>>>>>> [root@lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>>
>>>>>>> [root@lease-05 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>>
>>>>>>> [root@lease-05 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>>   File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>>>>>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
>>>>>>> Device: fd01h/64769d    Inode: 537277306   Links: 2
>>>>>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>>>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>>>>> Access: 2018-12-17 21:43:36.361984810 +0000
>>>>>>> Modify: 2018-12-17 21:43:36.361984810 +0000
>>>>>>> Change: 2018-12-18 20:55:29.908647417 +0000
>>>>>>>  Birth: -
>>>>>>>
>>>>>>> [root@lease-05 ovirt-backbone-2]# find . -inum 537277306
>>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>>
>>>>>>> # on the data nodes:
>>>>>>>
>>>>>>> [root@lease-08 ~]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>>> getfattr: Removing leading '/' from absolute path names
>>>>>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>>>>>
>>>>>>> [root@lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>>
>>>>>>> [root@lease-08 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>>
>>>>>>> [root@lease-08 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>>   File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>>>>>   Size: 2166784         Blocks: 4128       IO Block: 4096   regular file
>>>>>>> Device: fd03h/64771d    Inode: 12893624759  Links: 3
>>>>>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>>>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>>>>> Access: 2018-12-18 18:52:38.070776585 +0000
>>>>>>> Modify: 2018-12-17 21:43:36.388054443 +0000
>>>>>>> Change: 2018-12-18 21:01:47.810506528 +0000
>>>>>>>  Birth: -
>>>>>>>
>>>>>>> [root@lease-08 ovirt-backbone-2]# find . -inum 12893624759
>>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>>
>>>>>>> =======================
>>>>>>> [root@lease-11 ovirt-backbone-2]# getfattr -n glusterfs.gfid.string /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>>> getfattr: Removing leading '/' from absolute path names
>>>>>>> # file: mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>>> glusterfs.gfid.string="f28cabcb-d169-41fc-a633-9bef4c4a8e40"
>>>>>>>
>>>>>>> [root@lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>> # file: .shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>>
>>>>>>> [root@lease-11 ovirt-backbone-2]# getfattr -d -m . -e hex .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>> # file: .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a6574635f72756e74696d655f743a733000
>>>>>>> trusted.afr.dirty=0x000000000000000000000000
>>>>>>> trusted.gfid=0x1f86b4328ec6424699aa48cc6d7b5da0
>>>>>>> trusted.gfid2path.b48064c78d7a85c9=0x62653331383633382d653861302d346336642d393737642d3761393337616138343830362f66323863616263622d643136392d343166632d613633332d3962656634633461386534302e3531353030
>>>>>>>
>>>>>>> [root@lease-11 ovirt-backbone-2]# stat .glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>>   File: ‘.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0’
>>>>>>>   Size: 2166784         Blocks: 4128       IO Block: 4096   regular file
>>>>>>> Device: fd03h/64771d    Inode: 12956094809  Links: 3
>>>>>>> Access: (0660/-rw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
>>>>>>> Context: system_u:object_r:etc_runtime_t:s0
>>>>>>> Access: 2018-12-18 20:11:53.595208449 +0000
>>>>>>> Modify: 2018-12-17 21:43:36.391580259 +0000
>>>>>>> Change: 2018-12-18 19:19:25.888055392 +0000
>>>>>>>  Birth: -
>>>>>>>
>>>>>>> [root@lease-11 ovirt-backbone-2]# find . -inum 12956094809
>>>>>>> ./.glusterfs/1f/86/1f86b432-8ec6-4246-99aa-48cc6d7b5da0
>>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500
>>>>>>>
>>>>>>> ===============
>>>>>>> I don't really see any inconsistencies, except the dates in the stat
>>>>>>> output. However, this is only after I tried moving the file out of the
>>>>>>> volumes to force a heal, which does happen on the data nodes, but not
>>>>>>> on the arbiter node. Before that they were also the same.
>>>>>>> I've also compared the file
>>>>>>> ./.shard/f28cabcb-d169-41fc-a633-9bef4c4a8e40.51500 on the 2 nodes and
>>>>>>> they are exactly the same.
>>>>>>>
>>>>>>> Things I've further tried:
>>>>>>> - gluster v heal ovirt-backbone-2 full => gluster v heal
>>>>>>> ovirt-backbone-2 info reports 0 entries on all nodes
>>>>>>>
>>>>>>> - stop each glusterd and glusterfsd, pause around 40 sec and start
>>>>>>> them again on each node, 1 at a time, waiting for the heal to recover
>>>>>>> before moving to the next node
>>>>>>>
>>>>>>> - force a heal by stopping glusterd on a node and performing these
>>>>>>> steps:
>>>>>>> mkdir /mnt/ovirt-backbone-2/trigger
>>>>>>> rmdir /mnt/ovirt-backbone-2/trigger
>>>>>>> setfattr -n trusted.non-existent-key -v abc /mnt/ovirt-backbone-2/
>>>>>>> setfattr -x trusted.non-existent-key /mnt/ovirt-backbone-2/
>>>>>>> start glusterd
>>>>>>>
>>>>>>> - gluster volume rebalance ovirt-backbone-2 start => success
>>>>>>>
>>>>>>> What's further interesting is that, according to the mount log, the
>>>>>>> volume is in split-brain:
>>>>>>> [2018-12-18 10:06:04.606870] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>>>>>> [2018-12-18 10:06:04.606908] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>>>>>> [2018-12-18 10:06:04.606927] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428090: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>>>>>> [2018-12-18 10:06:05.107729] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>>>>>> [2018-12-18 10:06:05.107770] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>>>>>> [2018-12-18 10:06:05.107791] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428091: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>>>>>> [2018-12-18 10:06:05.537244] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>>>>>> [2018-12-18 10:06:05.538523] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing STAT on gfid 00000000-0000-0000-0000-000000000001: split-brain observed. [Input/output error]
>>>>>>> [2018-12-18 10:06:05.538685] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>>>>>> [2018-12-18 10:06:05.538794] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>>>>>> [2018-12-18 10:06:05.539342] I [MSGID: 109063] [dht-layout.c:716:dht_layout_normalize] 0-ovirt-backbone-2-dht: Found anomalies in /b1c2c949-aef4-4aec-999b-b179efeef732 (gfid = 8c8598ce-1a52-418e-a7b4-435fee34bae8). Holes=2 overlaps=0
>>>>>>> [2018-12-18 10:06:05.539372] W [MSGID: 109005] [dht-selfheal.c:2158:dht_selfheal_directory] 0-ovirt-backbone-2-dht: Directory selfheal failed: 2 subvolumes down.Not fixing. path = /b1c2c949-aef4-4aec-999b-b179efeef732, gfid = 8c8598ce-1a52-418e-a7b4-435fee34bae8
>>>>>>> [2018-12-18 10:06:05.539694] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>>>>>> [2018-12-18 10:06:05.540652] I [MSGID: 108006] [afr-common.c:5494:afr_local_init] 0-ovirt-backbone-2-replicate-1: no subvolumes up
>>>>>>> [2018-12-18 10:06:05.608612] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>>>>>> [2018-12-18 10:06:05.608657] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>>>>>> [2018-12-18 10:06:05.608672] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428096: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>>>>>> [2018-12-18 10:06:06.109339] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-ovirt-backbone-2-replicate-2: Failing FSTAT on gfid 2a57d87d-fe49-4034-919b-fdb79531bf68: split-brain observed. [Input/output error]
>>>>>>> [2018-12-18 10:06:06.109378] E [MSGID: 133014] [shard.c:1248:shard_common_stat_cbk] 0-ovirt-backbone-2-shard: stat failed: 2a57d87d-fe49-4034-919b-fdb79531bf68 [Input/output error]
>>>>>>> [2018-12-18 10:06:06.109399] W [fuse-bridge.c:871:fuse_attr_cbk] 0-glusterfs-fuse: 428097: FSTAT() /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids => -1 (Input/output error)
>>>>>>>
>>>>>>> # note: I'm able to see
>>>>>>> /b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids:
>>>>>>> [root@lease-11 ovirt-backbone-2]# stat /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids
>>>>>>>   File: ‘/mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/dom_md/ids’
>>>>>>>   Size: 1048576         Blocks: 2048       IO Block: 131072 regular file
>>>>>>> Device: 41h/65d Inode: 10492258721813610344  Links: 1
>>>>>>> Access: (0660/-rw-rw----)  Uid: (   36/    vdsm)   Gid: (   36/     kvm)
>>>>>>> Context: system_u:object_r:fusefs_t:s0
>>>>>>> Access: 2018-12-19 20:07:39.917573869 +0000
>>>>>>> Modify: 2018-12-19 20:07:39.928573917 +0000
>>>>>>> Change: 2018-12-19 20:07:39.929573921 +0000
>>>>>>>  Birth: -
>>>>>>>
>>>>>>> However, checking gluster v heal ovirt-backbone-2 info split-brain
>>>>>>> reports no entries.
>>>>>>>
>>>>>>> I've also tried mounting the qemu image, and this works fine; I'm
>>>>>>> able to see all contents:
>>>>>>>  losetup /dev/loop0 /mnt/ovirt-backbone-2/b1c2c949-aef4-4aec-999b-b179efeef732/images/f6ac9660-a84e-469e-a17c-c6dbc538af4b/d6b09501-5b79-4c92-bf10-2ed3bda1b425
>>>>>>>  kpartx -a /dev/loop0
>>>>>>>  vgscan
>>>>>>>  vgchange -ay slave-data
>>>>>>>  mkdir /mnt/slv01
>>>>>>>  mount /dev/mapper/slave--data-lvol0 /mnt/slv01/
>>>>>>>
>>>>>>> Possible causes for this issue:
>>>>>>> 1. The machine "lease-11" suffered from a faulty RAM module (ECC),
>>>>>>> which halted the machine and caused an invalid state. (This machine
>>>>>>> also hosts other volumes, with similar configurations, which report no
>>>>>>> issue.)
>>>>>>> 2. After the RAM module was replaced, the VM using the backing qemu
>>>>>>> image was restored from a backup (the backup was file-based, within the
>>>>>>> VM, in a different directory), because some files were corrupted. The
>>>>>>> backup/recovery obviously causes extra IO, possibly introducing race
>>>>>>> conditions? The machine did run for about 12h without issues, and in
>>>>>>> total for about 36h.
>>>>>>> 3. Since only the client (maybe only gfapi?) reports errors, something
>>>>>>> is broken there?
>>>>>>>
>>>>>>> The volume info:
>>>>>>> root@lease-06 ~# gluster v info ovirt-backbone-2
>>>>>>>
>>>>>>> Volume Name: ovirt-backbone-2
>>>>>>> Type: Distributed-Replicate
>>>>>>> Volume ID: 85702d35-62c8-4c8c-930d-46f455a8af28
>>>>>>> Status: Started
>>>>>>> Snapshot Count: 0
>>>>>>> Number of Bricks: 3 x (2 + 1) = 9
>>>>>>> Transport-type: tcp
>>>>>>> Bricks:
>>>>>>> Brick1: 10.32.9.7:/data/gfs/bricks/brick1/ovirt-backbone-2
>>>>>>> Brick2: 10.32.9.3:/data/gfs/bricks/brick1/ovirt-backbone-2
>>>>>>> Brick3: 10.32.9.4:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>>>>>> Brick4: 10.32.9.8:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>>>>>> Brick5: 10.32.9.21:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>>>>>> Brick6: 10.32.9.5:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>>>>>> Brick7: 10.32.9.9:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>>>>>> Brick8: 10.32.9.20:/data0/gfs/bricks/brick1/ovirt-backbone-2
>>>>>>> Brick9: 10.32.9.6:/data/gfs/bricks/bricka/ovirt-backbone-2 (arbiter)
>>>>>>> Options Reconfigured:
>>>>>>> nfs.disable: on
>>>>>>> transport.address-family: inet
>>>>>>> performance.quick-read: off
>>>>>>> performance.read-ahead: off
>>>>>>> performance.io-cache: off
>>>>>>> performance.low-prio-threads: 32
>>>>>>> network.remote-dio: enable
>>>>>>> cluster.eager-lock: enable
>>>>>>> cluster.quorum-type: auto
>>>>>>> cluster.server-quorum-type: server
>>>>>>> cluster.data-self-heal-algorithm: full
>>>>>>> cluster.locking-scheme: granular
>>>>>>> cluster.shd-max-threads: 8
>>>>>>> cluster.shd-wait-qlength: 10000
>>>>>>> features.shard: on
>>>>>>> user.cifs: off
>>>>>>> storage.owner-uid: 36
>>>>>>> storage.owner-gid: 36
>>>>>>> features.shard-block-size: 64MB
>>>>>>> performance.write-behind-window-size: 512MB
>>>>>>> performance.cache-size: 384MB
>>>>>>> cluster.brick-multiplex: on
>>>>>>>
>>>>>>> The volume status:
>>>>>>> root@lease-06 ~# gluster v status ovirt-backbone-2
>>>>>>> Status of volume: ovirt-backbone-2
>>>>>>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> Brick 10.32.9.7:/data/gfs/bricks/brick1/ovi
>>>>>>> rt-backbone-2                               49152     0          Y       7727
>>>>>>> Brick 10.32.9.3:/data/gfs/bricks/brick1/ovi
>>>>>>> rt-backbone-2                               49152     0          Y       12620
>>>>>>> Brick 10.32.9.4:/data/gfs/bricks/bricka/ovi
>>>>>>> rt-backbone-2                               49152     0          Y       8794
>>>>>>> Brick 10.32.9.8:/data0/gfs/bricks/brick1/ov
>>>>>>> irt-backbone-2                              49161     0          Y       22333
>>>>>>> Brick 10.32.9.21:/data0/gfs/bricks/brick1/o
>>>>>>> virt-backbone-2                             49152     0          Y       15030
>>>>>>> Brick 10.32.9.5:/data/gfs/bricks/bricka/ovi
>>>>>>> rt-backbone-2                               49166     0          Y       24592
>>>>>>> Brick 10.32.9.9:/data0/gfs/bricks/brick1/ov
>>>>>>> irt-backbone-2                              49153     0          Y       20148
>>>>>>> Brick 10.32.9.20:/data0/gfs/bricks/brick1/o
>>>>>>> virt-backbone-2                             49154     0          Y       15413
>>>>>>> Brick 10.32.9.6:/data/gfs/bricks/bricka/ovi
>>>>>>> rt-backbone-2                               49152     0          Y       43120
>>>>>>> Self-heal Daemon on localhost               N/A       N/A        Y       44587
>>>>>>> Self-heal Daemon on 10.201.0.2              N/A       N/A        Y       8401
>>>>>>> Self-heal Daemon on 10.201.0.5              N/A       N/A        Y       11038
>>>>>>> Self-heal Daemon on 10.201.0.8              N/A       N/A        Y       9513
>>>>>>> Self-heal Daemon on 10.32.9.4               N/A       N/A        Y       23736
>>>>>>> Self-heal Daemon on 10.32.9.20              N/A       N/A        Y       2738
>>>>>>> Self-heal Daemon on 10.32.9.3               N/A       N/A        Y       25598
>>>>>>> Self-heal Daemon on 10.32.9.5               N/A       N/A        Y       511
>>>>>>> Self-heal Daemon on 10.32.9.9               N/A       N/A        Y       23357
>>>>>>> Self-heal Daemon on 10.32.9.8               N/A       N/A        Y       15225
>>>>>>> Self-heal Daemon on 10.32.9.7               N/A       N/A        Y       25781
>>>>>>> Self-heal Daemon on 10.32.9.21              N/A       N/A        Y       5034
>>>>>>>
>>>>>>> Task Status of Volume ovirt-backbone-2
>>>>>>> ------------------------------------------------------------------------------
>>>>>>> Task                 : Rebalance
>>>>>>> ID                   : 6dfbac43-0125-4568-9ac3-a2c453faaa3d
>>>>>>> Status               : completed
>>>>>>>
>>>>>>> The gluster version is 3.12.15 and cluster.op-version=31202.
>>>>>>>
>>>>>>> =======================
>>>>>>> It would be nice to know if it's possible to mark the files as not
>>>>>>> stale, or if I should investigate other things?
>>>>>>> Or should we consider this volume lost?
>>>>>>> Also, checking the code at
>>>>>>> https://github.com/gluster/glusterfs/blob/master/xlators/features/shard/src/shard.c
>>>>>>> it seems the functions have shifted quite a bit (line 1724 vs. 2243), so
>>>>>>> maybe it's fixed in a newer version?
>>>>>>> Any thoughts are welcome.
>>>>>>>
>>>>>>> Thanks Olaf