thr3ads.net - Gluster users - [Gluster-users] Stale File Handle Errors During Heavy Writes [Nov 2019]

Timothy Orme

2019-Nov-27 01:38 UTC

[Gluster-users] Stale File Handle Errors During Heavy Writes

Hi All,

I'm running a 3x2 cluster, v6.5.  Not sure if it's relevant, but I also have
sharding enabled.

I've found that when under heavy write load, clients start erroring out with
"stale file handle" errors, on files not related to the writes.

For instance, when a user runs a simple wc against a file, it bails partway
through that operation with "stale file".

When I check the client logs, I see errors like:

[2019-11-26 22:41:33.565776] E [MSGID: 109040] [dht-helper.c:1336:dht_migration_complete_check_task] 3-scratch-dht: 24d53a0e-c28d-41e0-9dbc-a75e823a3c7d: failed to lookup the file on scratch-dht [Stale file handle]
[2019-11-26 22:41:33.565853] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 33112038: READ => -1 gfid=147040e2-a6b8-4f54-8490-f0f3df29ee50 fd=0x7f95d8d0b3f8 (Stale file handle)
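
To see which files are affected and how often, the client log can be summarised per gfid. This is a minimal sketch, assuming log entries shaped like the two above; the function name `stale_gfids` and the regular expressions are illustrative assumptions derived only from these samples, not a documented gluster log grammar.

```python
import re
from collections import Counter

# Assumption: every log entry starts with a "[YYYY-MM-DD HH:MM:SS.ffffff]" stamp,
# as in the two sample entries. Entries may be wrapped over several lines.
ENTRY_START = re.compile(r"(?=\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\])")
UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def stale_gfids(log_text):
    """Count 'Stale file handle' entries per gfid found in a client log."""
    # Undo mail/terminal wrapping so each entry is one run of text.
    flat = " ".join(log_text.split())
    counts = Counter()
    for entry in ENTRY_START.split(flat):
        if "Stale file handle" not in entry:
            continue
        match = UUID.search(entry)
        if match:
            counts[match.group(0)] += 1
    return counts
```

Calling `stale_gfids(open(path).read())` on a client log (the path depends on the mount) gives a `Counter` whose most common keys are the gfids worth looking up on the bricks.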

I've seen some bugs or other threads referencing similar issues, but
couldn't really discern a solution from them.

Is this caused by some consistency issue with metadata while under load, or
something else?  I don't see the issue when heavy reads are occurring.

Any help is greatly appreciated!

Thanks!
Tim

Olaf Buitelaar

2019-Nov-27 17:50 UTC

[Gluster-users] Stale File Handle Errors During Heavy Writes

Hi Tim,

I've been suffering from this for a long time as well. I'm not sure it's
exactly the same situation, since your setup is different, but it seems
similar.
I've filed this bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1732961 which you might be
able to enrich.
To clear the stale files I wrote this bash script:
https://gist.github.com/olafbuitelaar/ff6fe9d4ab39696d9ad6ca689cc89986 (it's
slightly outdated), which you could use as inspiration. It basically removes
the stale files as suggested here:
https://lists.gluster.org/pipermail/gluster-users/2018-March/033785.html
Please be aware the script won't work if you have 2 (or more) bricks of
the same volume on the same server (since it always takes the first path
found).
I invoke the script via ansible like this (since the script needs to run on
all bricks):
- hosts: host1,host2,host3
  tasks:
    - shell: 'bash /root/clean-stale-gluster-fh.sh --host="{{ intif.ip | first }}" --volume=ovirt-data --backup="/backup/stale/gfs/ovirt-data" --shard="{{ item }}" --force'
      with_items:
        - 1b0ba5c2-dd2b-45d0-9c4b-a39b2123cc13.14451

Fortunately for me the issue seems to have disappeared: it's now been about
1 month since I last hit one, while before it was about every other day.
The biggest thing that seemed to resolve it was more disk space. There was
plenty of space before as well, but the gluster volume was about 85% full.
The individual disks had about 20-30% free on an 8TB disk array, and there
were also servers in the mix with smaller disk arrays but a similar
percentage of free space. I'm now at a much lower usage percentage.
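
To check whether bricks are near the fill level described above, a small script can report usage per brick mount. This is only a sketch: the helper names and the 85% default are assumptions taken from the figure in this message, not a gluster setting, and it has to be run on each server against its own brick paths.

```python
import shutil


def used_percent(path):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def bricks_over(brick_paths, threshold_percent=85.0):
    """Map each brick path at or above the threshold to its usage percentage."""
    report = {}
    for path in brick_paths:
        pct = used_percent(path)
        if pct >= threshold_percent:
            report[path] = round(pct, 1)
    return report
```

For example, `bricks_over(["/bricks/brick1"])` (a hypothetical brick path) returns an empty dict while that filesystem is below 85% used.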
So my latest running theory is that it has something to do with how gluster
allocates the shards. Since placement is based on the shard's hash, gluster
might want to place it in a certain sub-volume, but then comes to the
conclusion that there is not enough space there and writes a marker to
redirect it to another sub-volume (I think this marker is the stale file).
However, rebalances don't fix this issue. Also, this still doesn't explain
why most stale files always end up in the first sub-volume.
Unfortunately I have no proof this is actually the root cause, other than
that the symptom "disappeared" once gluster had more space to work with.
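
The redirect marker in this theory matches what DHT calls a linkto file. On a brick these normally appear as zero-byte regular files with only the sticky bit set (shown as ---------T by ls) and a trusted.glusterfs.dht.linkto xattr naming the sub-volume that holds the data. A minimal sketch for listing such candidates on one brick follows; the function names are illustrative, it needs Linux and root to read trusted.* xattrs, and it only lists files, it does not decide which ones are stale.

```python
import os
import stat

LINKTO_XATTR = "trusted.glusterfs.dht.linkto"


def looks_like_linkto(st):
    """True if a stat result has the shape of a DHT linkto file:
    regular file, zero bytes, permission bits exactly 01000 (sticky only)."""
    return (
        stat.S_ISREG(st.st_mode)
        and st.st_size == 0
        and stat.S_IMODE(st.st_mode) == stat.S_ISVTX
    )


def find_linkto_candidates(brick_root):
    """Yield (path, target_subvolume_or_None) for linkto-shaped files on a brick."""
    for dirpath, dirnames, filenames in os.walk(brick_root):
        # Skip the .glusterfs tree, which holds hard links to the same files.
        dirnames[:] = [d for d in dirnames if d != ".glusterfs"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not looks_like_linkto(os.lstat(path)):
                continue
            try:
                raw = os.getxattr(path, LINKTO_XATTR)
                target = raw.rstrip(b"\0").decode()
            except OSError:
                # xattr missing or not readable (e.g. not running as root).
                target = None
            yield path, target
```

Running `list(find_linkto_candidates("/bricks/brick1"))` (a hypothetical brick path) on each server gives the files to compare against the gfids reported in the client log.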

Best Olaf
