thr3ads.net - Gluster users - [Gluster-users] Sharding problem - multiple shard copies with mismatching gfids [Mar 2018]

Ian Halliday

2018-Mar-25 19:39 UTC

[Gluster-users] Sharding problem - multiple shard copies with mismatching gfids

Hello all,

We are having a rather interesting problem with one of our VM storage 
systems. The GlusterFS client is throwing errors relating to GFID 
mismatches. We traced this down to multiple shards being present on the 
gluster nodes, with different gfids.

Hypervisor gluster mount log:

[2018-03-25 18:54:19.261733] E [MSGID: 133010] [shard.c:1724:shard_common_lookup_shards_cbk] 0-ovirt-zone1-shard: Lookup on shard 7 failed. Base file gfid = 87137cac-49eb-492a-8f33-8e33470d8cb7 [Stale file handle]
The message "W [MSGID: 109009] [dht-common.c:2162:dht_lookup_linkfile_cbk] 0-ovirt-zone1-dht: /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid different on data file on ovirt-zone1-replicate-3, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56 " repeated 2 times between [2018-03-25 18:54:19.253748] and [2018-03-25 18:54:19.263576]
[2018-03-25 18:54:19.264349] W [MSGID: 109009] [dht-common.c:1901:dht_lookup_everywhere_cbk] 0-ovirt-zone1-dht: /.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7: gfid differs on subvolume ovirt-zone1-replicate-3, gfid local = fdf0813b-718a-4616-a51b-6999ebba9ec3, gfid node = 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56


On the storage nodes, we found this:

[root@n1 gluster]# find -name 87137cac-49eb-492a-8f33-8e33470d8cb7.7
./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7

[root@n1 gluster]# ls -lh ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
---------T. 2 root root 0 Mar 25 13:55 ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
[root@n1 gluster]# ls -lh ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
-rw-rw----. 2 root root 3.8G Mar 25 13:55 ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7

[root@n1 gluster]# getfattr -d -m . -e hex ./brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
# file: brick2/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0xfdf0813b718a4616a51b6999ebba9ec3
trusted.glusterfs.dht.linkto=0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300

[root@n1 gluster]# getfattr -d -m . -e hex ./brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
# file: brick4/brick/.shard/87137cac-49eb-492a-8f33-8e33470d8cb7.7
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x020000000000000059914190000ce672
trusted.gfid=0x57c6fcdf52bb4f7aaea402f0dc81ff56
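
The hex xattr values above can be lined up with the gfids in the mount log. A minimal sketch in POSIX shell, assuming only `sed` and `printf` are available; the helper names gfid_to_uuid and hex_to_text are made up for illustration and are not part of any Gluster tooling:

```shell
#!/bin/sh
# Convert a hex trusted.gfid value (as printed by `getfattr -e hex`)
# into the dashed UUID form that appears in the client log.
gfid_to_uuid() {
    hex=${1#0x}   # drop the leading 0x
    printf '%s\n' "$hex" |
        sed 's/^\(.\{8\}\)\(.\{4\}\)\(.\{4\}\)\(.\{4\}\)\(.\{12\}\)$/\1-\2-\3-\4-\5/'
}

# Decode a hex-encoded string xattr (such as trusted.glusterfs.dht.linkto)
# into text, skipping NUL bytes such as the trailing terminator.
hex_to_text() {
    hex=${1#0x}
    out=''
    while [ -n "$hex" ]; do
        rest=${hex#??}          # everything after the first two hex digits
        pair=${hex%"$rest"}     # the first two hex digits
        case $pair in
            ??) ;;
            *) return 1 ;;      # odd number of hex digits: refuse to guess
        esac
        hex=$rest
        [ "$pair" = "00" ] && continue
        # Turn the byte into an octal escape and let printf emit it.
        out="$out$(printf "\\$(printf '%03o' "0x$pair")")"
    done
    printf '%s\n' "$out"
}

gfid_to_uuid 0xfdf0813b718a4616a51b6999ebba9ec3   # zero-byte file on brick2
gfid_to_uuid 0x57c6fcdf52bb4f7aaea402f0dc81ff56   # 3.8G file on brick4
hex_to_text 0x6f766972742d3335302d7a6f6e65312d7265706c69636174652d3300
```

With the values from this thread, the brick2 file decodes to the "gfid local" value fdf0813b-718a-4616-a51b-6999ebba9ec3 from the last log line, the brick4 file decodes to the "gfid node" value 57c6fcdf-52bb-4f7a-aea4-02f0dc81ff56, and the linkto xattr decodes to ovirt-350-zone1-replicate-3.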


I'm wondering how they got created in the first place, and if anyone has 
any insight on how to fix it?

Storage nodes:
[root@n1 gluster]# gluster --version
glusterfs 4.0.0

[root@n1 gluster]# gluster volume info

Volume Name: ovirt-350-zone1
Type: Distributed-Replicate
Volume ID: 106738ed-9951-4270-822e-63c9bcd0a20e
Status: Started
Snapshot Count: 0
Number of Bricks: 7 x (2 + 1) = 21
Transport-type: tcp
Bricks:
Brick1: 10.0.6.100:/gluster/brick1/brick
Brick2: 10.0.6.101:/gluster/brick1/brick
Brick3: 10.0.6.102:/gluster/arbrick1/brick (arbiter)
Brick4: 10.0.6.100:/gluster/brick2/brick
Brick5: 10.0.6.101:/gluster/brick2/brick
Brick6: 10.0.6.102:/gluster/arbrick2/brick (arbiter)
Brick7: 10.0.6.100:/gluster/brick3/brick
Brick8: 10.0.6.101:/gluster/brick3/brick
Brick9: 10.0.6.102:/gluster/arbrick3/brick (arbiter)
Brick10: 10.0.6.100:/gluster/brick4/brick
Brick11: 10.0.6.101:/gluster/brick4/brick
Brick12: 10.0.6.102:/gluster/arbrick4/brick (arbiter)
Brick13: 10.0.6.100:/gluster/brick5/brick
Brick14: 10.0.6.101:/gluster/brick5/brick
Brick15: 10.0.6.102:/gluster/arbrick5/brick (arbiter)
Brick16: 10.0.6.100:/gluster/brick6/brick
Brick17: 10.0.6.101:/gluster/brick6/brick
Brick18: 10.0.6.102:/gluster/arbrick6/brick (arbiter)
Brick19: 10.0.6.100:/gluster/brick7/brick
Brick20: 10.0.6.101:/gluster/brick7/brick
Brick21: 10.0.6.102:/gluster/arbrick7/brick (arbiter)
Options Reconfigured:
cluster.min-free-disk: 50GB
performance.strict-write-ordering: off
performance.strict-o-direct: off
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
performance.cache-size: 1GB
features.shard: on
features.shard-block-size: 5GB
server.event-threads: 8
server.outstanding-rpc-limit: 128
storage.owner-uid: 36
storage.owner-gid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
performance.flush-behind: off
performance.write-behind-window-size: 8MB
client.event-threads: 8
server.allow-insecure: on
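
To see which bricks back a given replicate subvolume, one can count through the brick list. The sketch below assumes the usual convention that replica sets are formed from consecutive bricks in listing order and are numbered from 0; bricks_of_subvol is a hypothetical helper, not a gluster command.

```shell
#!/bin/sh
# Print the bricks that make up replicate-N, assuming bricks are grouped
# in listing order with GROUP bricks per replica set (2 data + 1 arbiter here).
# Usage: bricks_of_subvol N GROUP brick1 brick2 ...
bricks_of_subvol() {
    n=$1
    group=$2
    shift 2
    first=$((n * group + 1))        # 1-based number of the first brick in the set
    last=$((first + group - 1))     # 1-based number of the last brick in the set
    i=0
    for b in "$@"; do
        i=$((i + 1))
        if [ "$i" -ge "$first" ] && [ "$i" -le "$last" ]; then
            printf 'Brick%d: %s\n' "$i" "$b"
        fi
    done
}

# First twelve bricks from the volume info above, in the order listed.
bricks_of_subvol 3 3 \
    10.0.6.100:/gluster/brick1/brick 10.0.6.101:/gluster/brick1/brick 10.0.6.102:/gluster/arbrick1/brick \
    10.0.6.100:/gluster/brick2/brick 10.0.6.101:/gluster/brick2/brick 10.0.6.102:/gluster/arbrick2/brick \
    10.0.6.100:/gluster/brick3/brick 10.0.6.101:/gluster/brick3/brick 10.0.6.102:/gluster/arbrick3/brick \
    10.0.6.100:/gluster/brick4/brick 10.0.6.101:/gluster/brick4/brick 10.0.6.102:/gluster/arbrick4/brick
```

Under that assumption, replicate-3 maps to Brick10 through Brick12, i.e. the /gluster/brick4 directories where the 3.8G file was found, while /gluster/brick2, where the zero-byte file sits, belongs to replicate-1.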


Client version:
[root@kvm573 ~]# gluster --version
glusterfs 3.12.5


Thanks!

- Ian

Krutika Dhananjay

2018-Mar-26 07:10 UTC

The gfid mismatch here is between the shard and its "link-to" file, the
creation of which happens at a layer below that of the shard translator on
the stack.

Adding DHT devs to take a look.

-Krutika

Raghavendra Gowdappa

2018-Mar-26 07:25 UTC

On Mon, Mar 26, 2018 at 12:40 PM, Krutika Dhananjay <kdhananj at redhat.com> wrote:
> The gfid mismatch here is between the shard and its "link-to" file, the
> creation of which happens at a layer below that of shard translator on the
> stack.
>
> Adding DHT devs to take a look.

Thanks Krutika. I assume shard doesn't do any dentry operations like
rename, link, unlink on the path of the file (not the gfid handle based
path) internally while managing shards. Can you confirm? If it does these
operations, what fops does it do?

@Ian,

I can suggest the following way to fix the problem:
1. Since one of the files listed is a DHT linkto file, I am assuming there is
only one shard of the file. If not, please list the gfids of the other shards
and don't proceed with the healing procedure.
2. If the gfids of all shards happen to be the same and only the linkto file
has a different gfid, please proceed to step 3. Otherwise abort the healing
procedure.
3. If cluster.lookup-optimize is set to true, abort the healing procedure.
4. Delete the linkto file (the file with permissions ---------T and the xattr
trusted.glusterfs.dht.linkto) and do a lookup on the file from the mount
point after turning off readdirplus [1].
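
The check for the linkto file in the last step can be sketched as follows. This is only a pre-check under stated assumptions, not the healing procedure itself: looks_like_linkto is a hypothetical helper that tests for a regular, zero-byte file whose mode is exactly sticky-bit-only (---------T), matching the ls output earlier in the thread. The trusted.glusterfs.dht.linkto xattr remains the authoritative marker and still needs to be confirmed with getfattr as root on the brick.

```shell
#!/bin/sh
# Return 0 if FILE looks like a DHT linkto file judging by mode and size only:
# a regular file, zero bytes long, with permission bits exactly 1000 (---------T).
# -prune keeps find from descending if a directory is passed by mistake.
looks_like_linkto() {
    [ -n "$(find "$1" -prune -type f -perm 1000 -size 0 2>/dev/null)" ]
}

# Classify every path given on the command line; nothing is modified or deleted.
for f in "$@"; do
    if looks_like_linkto "$f"; then
        printf '%s: linkto candidate, confirm with getfattr before touching it\n' "$f"
    else
        printf '%s: not a linkto candidate\n' "$f"
    fi
done
```

Saved under an arbitrary name and run on a storage node with the two brick paths of the shard as arguments, this would flag only the zero-byte ---------T copy as a candidate; the 3.8G data file fails both the size and the mode test.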

As to how we ended up in this situation: can you explain the I/O pattern on
this file? For example, are there lots of entry operations like rename, link,
unlink etc. on the file? There have been known races in
rename/lookup-heal-creating-linkto where the linkto and data file have
different gfids. [2] fixes some of these cases.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html
[2] https://review.gluster.org/#/c/19547/

regards,
Raghavendra
