Brad Clemmons Jr
2023-Feb-14 17:57 UTC
[Gluster-users] failed to close Bad file descriptor on file creation after using setfattr to test latency?
Hi all,

Running into a problem with my Gluster here in a lab environment. Production is a slightly different build (distributed replicated with multiple arbiter bricks) and I don't see the same errors there...yet. I only seem to have this problem on the 2 client VMs where I ran "setfattr -n trusted.io-stats-dump -v output_file_id mount_point" while trying to test for latency issues. Curious if anyone has run into a similar situation with the bad file descriptor mess below and knows what I'm doing wrong or missing, besides everything...

I had built this Gluster in one location, only to receive notice months later that the clients were now going to be connecting from another location (different vCenter, but with a 10G backend). It's basically storing hundreds of thousands of what are usually small flat files.

I ran into what seemed to be latency issues (4+ second responses) even doing something as simple as an ls of a directory with only a few folders in it, mounted on the remote clients. Granted, some of those folders have thousands of small files... That led me down the setfattr path to test latency.

Reads and writes on that directory yesterday evening, after the setfattr was set, were fine (albeit slightly slow), and I got a latency report in /var/run/gluster/logfilename.
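For reference, that report comes from Gluster's io-stats translator, and the latency numbers generally only show up when the diagnostics options listed in the volume info further down are enabled. The sequence on a client looks roughly like this (a sketch only; the dump id and mount path are the same placeholders used above):

    # gluster volume set stor diagnostics.latency-measurement on
    # gluster volume set stor diagnostics.count-fop-hits on
    # setfattr -n trusted.io-stats-dump -v output_file_id /mnt/stor
    # ls /var/run/gluster/

The resulting per-FOP dump then shows up under /var/run/gluster/ on the machine where the setfattr was issued, as seen above.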
I figured we could work out the latency issues, but the problem encountered this morning is that I now cannot create a file via touch, dd, vi, etc. on that same mount point without the following errors. This happens now with a client in the same location as the Gluster and with a remote client:

    # touch /mnt/stor/test3
    touch: failed to close '/mnt/stor/test3': Bad file descriptor

Yet the file actually gets created...

    # ls -al /mnt/stor/test3
    -rw-r--r--. 1 root root 0 Feb 14 12:39 /mnt/stor/test3

If I mount the volume directly on one of the Gluster servers themselves and create a file that way (via dd, touch, etc.), I receive no error...

I've tried unmounting and remounting on the clients. No effect. I even tried rebooting a client. No effect. As far as I can tell from the Gluster side, gluster volume status, peer status, etc. all think the cluster is fine. Thoughts/ideas?

I also see the following in the /var/log/glusterfs/<mountpoint>.log file on the clients for each attempt to create a file. The errors are similar for the client in the same location as the Gluster and for the remote one:

    [2023-02-14 16:23:27.705871] I [MSGID: 122062] [ec.c:334:ec_up] 0-stor-disperse-1: Going UP
    [2023-02-14 16:23:27.707347] I [fuse-bridge.c:5271:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.33
    [2023-02-14 16:23:27.707374] I [fuse-bridge.c:5899:fuse_graph_sync] 0-fuse: switched to graph 0
    [2023-02-14 16:23:45.648560] E [MSGID: 122066] [ec-common.c:1273:ec_prepare_update_cbk] 0-stor-disperse-0: Unable to get config xattr. FOP : 'FXATTROP' failed on gfid 864dbaf8-1012-4b79-afe4-89de0ace2628 [No data available]
    [2023-02-14 16:23:45.648620] W [fuse-bridge.c:1697:fuse_setattr_cbk] 0-glusterfs-fuse: 98: SETATTR() /test8 => -1 (No data available)
    [2023-02-14 16:23:45.648653] E [MSGID: 122034] [ec-common.c:706:ec_child_select] 0-stor-disperse-0: Insufficient available children for this request (have 0, need 2). FOP : 'FXATTROP' failed on gfid 864dbaf8-1012-4b79-afe4-89de0ace2628
    [2023-02-14 16:23:45.648663] E [MSGID: 122037] [ec-common.c:2320:ec_update_size_version_done] 0-stor-disperse-0: Failed to update version and size. FOP : 'FXATTROP' failed on gfid 864dbaf8-1012-4b79-afe4-89de0ace2628 [Input/output error]
    [2023-02-14 16:23:45.648770] E [MSGID: 122077] [ec-generic.c:204:ec_flush] 0-stor-disperse-0: Failing FLUSH on 864dbaf8-1012-4b79-afe4-89de0ace2628 [Bad file descriptor]
    [2023-02-14 16:23:45.649022] W [fuse-bridge.c:1945:fuse_err_cbk] 0-glusterfs-fuse: 99: FLUSH() ERR => -1 (Bad file descriptor)
    [2023-02-14 16:23:45.648996] E [MSGID: 122077] [ec-generic.c:204:ec_flush] 0-stor-disperse-0: Failing FLUSH on 864dbaf8-1012-4b79-afe4-89de0ace2628 [Bad file descriptor]

Gluster volume info, sans hostname/IP specifics:

    # gluster volume info stor
    Volume Name: stor
    Type: Distributed-Disperse
    Volume ID: 6e94e3ce-cc15-494d-a2ad-3729e7589cdd
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 2 x (2 + 1) = 6
    Transport-type: tcp
    Bricks:
    Brick1: stor01:/data/brick01a/brick
    Brick2: stor02:/data/brick02a/brick
    Brick3: stor03:/data/brick03a/brick
    Brick4: stor01:/data/brick01b/brick
    Brick5: stor02:/data/brick02b/brick
    Brick6: stor03:/data/brick03b/brick
    Options Reconfigured:
    diagnostics.count-fop-hits: on
    diagnostics.latency-measurement: on
    features.ctime: off
    transport.address-family: inet
    storage.fips-mode-rchecksum: on
    nfs.disable: on

Thanks,
Brad
Xavi Hernandez
2023-Feb-15 09:17 UTC
[Gluster-users] failed to close Bad file descriptor on file creation after using setfattr to test latency?
Hi Brad,

Find my comments inline below.

On Tue, Feb 14, 2023 at 6:57 PM Brad Clemmons Jr <brad.clemmons at gmail.com> wrote:

> Running into a problem with my Gluster here in a lab environment.
> Production is a slightly different build (distributed replicated with
> multiple arbiter bricks) and I don't see the same errors there...yet. I
> only seem to have this problem on the 2 client VMs where I ran "setfattr
> -n trusted.io-stats-dump -v output_file_id mount_point" while trying to
> test for latency issues. [...]

So you only see the problem on the clients where you ran the setfattr command? Or was the command run on more clients and only some of them had the issue?

> I had built this Gluster in one location, only to receive notice months
> later that the clients were now going to be connecting from another
> location (different vCenter, but with a 10G backend). It's basically
> storing hundreds of thousands of what are usually small flat files.

For small files, the critical factor is the latency between the sites, not the bandwidth. Gluster needs to send several requests for each accessed file, so if the latency is significant, the delay is multiplied. For directory-based operations the effect is even bigger, because they may need many more internal operations.

> I ran into what seemed to be latency issues (4+ second responses) even
> doing something as simple as an ls of a directory with only a few folders
> in it, mounted on the remote clients. Granted, some of those folders have
> thousands of small files... That led me down the setfattr path to test
> latency.

Directory access operations are the worst-performing ones in Gluster, and they are very latency-sensitive.
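To put numbers on that, the per-brick FOP counts and latencies can be sampled with the volume profile commands, which rely on the same diagnostics counters already enabled on this volume (a rough sketch; run the slow ls or file workload from a client between "start" and "info"):

    # gluster volume profile stor start
    # gluster volume profile stor info
    # gluster volume profile stor stop

As a rough illustration of the multiplication effect: with a few milliseconds of round-trip time between sites and several FOPs per file, a single ls of a directory holding thousands of small files can easily add up to multiple seconds.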
> Reads and writes on that directory yesterday evening, after the setfattr
> was set, were fine (albeit slightly slow), and I got a latency report in
> /var/run/gluster/logfilename.
>
> I figured we could work out the latency issues, but the problem
> encountered this morning is that I now cannot create a file via touch,
> dd, vi, etc. on that same mount point without the following errors. This
> happens now with a client in the same location as the Gluster and with a
> remote client:
>
>     # touch /mnt/stor/test3
>     touch: failed to close '/mnt/stor/test3': Bad file descriptor
>
> Yet the file actually gets created...
>
> If I mount the volume directly on one of the Gluster servers themselves
> and create a file that way (via dd, touch, etc.), I receive no error...
>
> I've tried unmounting and remounting on the clients. No effect. I even
> tried rebooting a client. No effect. As far as I can tell from the
> Gluster side, gluster volume status, peer status, etc. all think the
> cluster is fine. Thoughts/ideas?

Does this happen for any file you create in any directory, or only for some files, or for files created in certain directories?

It's really weird that even after a reboot the same client cannot create files while another client can create them correctly.

> I also see the following in the /var/log/glusterfs/<mountpoint>.log file
> on the clients for each attempt to create a file:
>
>     [2023-02-14 16:23:45.648560] E [MSGID: 122066] [ec-common.c:1273:ec_prepare_update_cbk] 0-stor-disperse-0: Unable to get config xattr. FOP : 'FXATTROP' failed on gfid 864dbaf8-1012-4b79-afe4-89de0ace2628 [No data available]
>     [...]
>     [2023-02-14 16:23:45.649022] W [fuse-bridge.c:1945:fuse_err_cbk] 0-glusterfs-fuse: 99: FLUSH() ERR => -1 (Bad file descriptor)

I saw this problem once, but we were not able to identify the cause (or even reproduce it). There is a simple workaround that consists of recreating the missing config xattr, but if you are OK with it, I would like to try to identify the root cause so the problem can be avoided completely. Otherwise it could happen again and you would need to recreate the xattr manually each time.

Could you set the client log level to TRACE and try again?

    # gluster volume set <volname> client-log-level TRACE

Once done, you can reset the log level to its default value:

    # gluster volume reset <volname> client-log-level

Then provide the log of the mountpoint. Even if you don't get any visible error on a "good" client, can you check its logs to see whether there is any error? Some errors can be "hidden" if the update succeeds on enough bricks.

BTW, could you create a GitHub issue for this at https://github.com/gluster/glusterfs/issues? It will be easier to work on it through GitHub.
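If you want to see what the disperse translator is missing before changing anything, the extended attributes can be inspected directly on the brick backends. A minimal check (assuming the affected file sits at the root of the volume and using one of the brick paths from the volume info; the config xattr is normally trusted.ec.config) would be:

    # getfattr -d -m . -e hex /data/brick01a/brick/test3

Comparing this output for the failing file against a file that was created without errors should show whether trusted.ec.config (and the related trusted.ec.size / trusted.ec.version xattrs) is absent on some or all bricks of that subvolume.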
Best regards,

Xavi