> Yes if the dataset is small, you can try rm -rf of the dir
> from the mount (assuming no other application is accessing
> them on the volume) launch heal once so that the heal info
> becomes zero and then copy it over again.

I did approximately so; the rm -rf took its sweet time and the
number of entries to be healed kept diminishing as the deletion
progressed. At the end I was left with

Mon Mar 15 22:57:09 CET 2021
Gathering count of entries to be healed on volume gv0 has been successful

Brick node01:/gfs/gv0
Number of entries: 3

Brick mikrivouli:/gfs/gv0
Number of entries: 2

Brick nanosaurus:/gfs/gv0
Number of entries: 3
--------------

and that's where I've been ever since, for the past 20 hours.
SHD has kept trying to heal them all along and the log brings
us back to square one:

[2021-03-16 14:51:35.059593 +0000] I [MSGID: 108026] [afr-self-heal-entry.c:1053:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 94aefa13-9828-49e5-9bac-6f70453c100f
[2021-03-16 15:39:43.680380 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-03-16 15:39:43.769604 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-03-16 15:39:43.908425 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-1: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[...]

In other words, deleting and recreating the unhealable files and
directories was a workaround, but the underlying problem persists,
and I can't even begin to look for it when I have no clue what
errno 22 means in plain English.

In any case, glusterd.log is full of messages like

[2021-03-16 15:37:03.398619 +0000] I [MSGID: 106533] [glusterd-volume-ops.c:717:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume gv0
[2021-03-16 15:37:03.791452 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:260:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}]

Every single "received heal vol req" message is immediately followed
by a "dict get failed", always for server-quorum-type, for hours on
end. And I begin to smell a bug. The CLI can query the value OK:

# gluster volume get gv0 cluster.server-quorum-type
Option                                  Value
------                                  -----
cluster.server-quorum-type              off

Checking all quorum-related settings, I get

# gluster volume get gv0 all |grep quorum
cluster.quorum-type                     auto
cluster.quorum-count                    (null) (DEFAULT)
cluster.server-quorum-type              off
cluster.server-quorum-ratio             51
cluster.quorum-reads                    no (DEFAULT)
disperse.quorum-count                   0 (DEFAULT)

I never touched any of them and none of them appear in volume info
under "Options Reconfigured", so I don't know why three of them are
not marked as defaults.

Next, I tried setting server-quorum-type=server. The
server-quorum-type problem went away and I got a new kind of dict
get failure:

The message "E [MSGID: 106061] [glusterd-volgen.c:2564:brick_graph_add_pump] 0-management: Dict get failed [{Key=enable-pump}]" repeated 2 times between [2021-03-16 17:12:18.677594 +0000] and [2021-03-16 17:12:18.779859 +0000]

I tried rolling back server-quorum-type=server and got this error:

# gluster volume set gv0 cluster.server-quorum-type off
volume set: failed: option server-quorum-type off: 'off' is not valid (possible options are none, server.)
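(Aside: when setting a value back is rejected like this, an alternative
that might return the option to its built-in default is "gluster volume
reset". A minimal sketch, assuming reset handles server-quorum-type like
any other option on this version:

# gluster volume reset gv0 cluster.server-quorum-type
# gluster volume get gv0 cluster.server-quorum-type
)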
Aha, but previously and by default it was clearly "off", not "none".
That's a bug somewhere, and that is what was causing the dict get
failures on server-quorum-type. The missing dict key enable-pump that
server-quorum-type=server requires also looks like a bug, because
there is no such setting:

# gluster volume get gv0 all |grep pump
#

There are more similarly strange complaints in the glusterd log:

[2021-03-16 17:25:43.134207 +0000] E [MSGID: 106434] [glusterd-utils.c:13379:glusterd_get_value_for_vme_entry] 0-management: xlator_volopt_dynload error (-1)
[2021-03-16 17:25:43.141816 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for localtime-logging key
[2021-03-16 17:25:43.143185 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-seckey key
[2021-03-16 17:25:43.143340 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-keyid key
[2021-03-16 17:25:43.143484 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-bucketid key
[2021-03-16 17:25:43.143621 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-hostname key

If none of this stuff is used in the first place, it should not be
triggering errors and warnings. If the S3 plugin is not enabled, the
S3 keys should not even be checked. Both the checking of the keys and
the error logging are bugs.

Cool, I'm discovering more and more stuff that needs fixing, but I'm
making zero progress with my healing problem. I'm still stuck with
errno=22.
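(Incidentally, errno values can be translated into plain English from
the shell; 22 is EINVAL, the same "Invalid argument" already printed in
the log. A quick check, assuming python3 is available on the node:

# python3 -c 'import errno, os; print(errno.errorcode[22], os.strerror(22))'
EINVAL Invalid argument
)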
According to https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/server-quorum/ :

Server quorum is controlled by two parameters:

- cluster.server-quorum-type
  This value may be "server" to indicate that server quorum is
  enabled, or "none" to mean it's disabled.

So, try with 'none'.

Best Regards,
Strahil Nikolov

On Tue, Mar 16, 2021 at 20:16, Zenon Panoussis <oracle at provocation.net> wrote:
> [...]
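(For reference, what "try with 'none'" looks like on the CLI; a minimal
sketch, assuming the running version accepts 'none' as the "possible
options are none, server" error suggests:

# gluster volume set gv0 cluster.server-quorum-type none
# gluster volume get gv0 cluster.server-quorum-type
)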
On 16/03/21 11:45 pm, Zenon Panoussis wrote:

>> Yes if the dataset is small, you can try rm -rf of the dir
>> from the mount (assuming no other application is accessing
>> them on the volume) launch heal once so that the heal info
>> becomes zero and then copy it over again.
>
> I did approximately so; the rm -rf took its sweet time and the
> number of entries to be healed kept diminishing as the deletion
> progressed. At the end I was left with
>
> Mon Mar 15 22:57:09 CET 2021
> Gathering count of entries to be healed on volume gv0 has been successful
>
> Brick node01:/gfs/gv0
> Number of entries: 3
>
> Brick mikrivouli:/gfs/gv0
> Number of entries: 2
>
> Brick nanosaurus:/gfs/gv0
> Number of entries: 3
> --------------
>
> and that's where I've been ever since, for the past 20 hours.
> SHD has kept trying to heal them all along and the log brings
> us back to square one:
>
> [2021-03-16 14:51:35.059593 +0000] I [MSGID: 108026] [afr-self-heal-entry.c:1053:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 94aefa13-9828-49e5-9bac-6f70453c100f

Does this gfid correspond to the same directory path as last time?

> [2021-03-16 15:39:43.680380 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-03-16 15:39:43.769604 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-03-16 15:39:43.908425 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-1: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [...]

Wonder if you can attach gdb to the glustershd process at the function
client4_0_mkdir() and try to print args->loc->path to see which file
the mkdir is being attempted on.

> In other words, deleting and recreating the unhealable files and
> directories was a workaround, but the underlying problem persists,
> and I can't even begin to look for it when I have no clue what
> errno 22 means in plain English.
>
> In any case, glusterd.log is full of messages like

The server-quorum messages in the glusterd log are unrelated; you can
raise a separate github issue for that. (And you can leave it at 'off'.)

> [2021-03-16 15:37:03.398619 +0000] I [MSGID: 106533] [glusterd-volume-ops.c:717:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume gv0
> [2021-03-16 15:37:03.791452 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:260:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}]
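(A rough sketch of that gdb approach, assuming glustershd was built
with debug symbols (e.g. a glusterfs-debuginfo package installed) and
that args is in scope at the point where the breakpoint fires; note
that the daemon is paused while gdb holds it stopped:

# find the self-heal daemon's pid on one of the nodes
pgrep -f glustershd

# attach to it and break on the mkdir fop
gdb -p <glustershd-pid>
(gdb) break client4_0_mkdir
(gdb) continue
... wait for the next self-heal attempt to hit the breakpoint ...
(gdb) print args->loc->path
(gdb) detach
(gdb) quit
)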