> Yes if the dataset is small, you can try rm -rf of the dir
> from the mount (assuming no other application is accessing
> them on the volume) launch heal once so that the heal info
> becomes zero and then copy it over again.

I did approximately so; the rm -rf took its sweet time and the
number of entries to be healed kept diminishing as the deletion
progressed. At the end I was left with

Mon Mar 15 22:57:09 CET 2021
Gathering count of entries to be healed on volume gv0 has been successful

Brick node01:/gfs/gv0
Number of entries: 3

Brick mikrivouli:/gfs/gv0
Number of entries: 2

Brick nanosaurus:/gfs/gv0
Number of entries: 3
--------------

and that's where I've been ever since, for the past 20 hours.
SHD has kept trying to heal them all along and the log brings
us back to square one:

[2021-03-16 14:51:35.059593 +0000] I [MSGID: 108026] [afr-self-heal-entry.c:1053:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 94aefa13-9828-49e5-9bac-6f70453c100f
[2021-03-16 15:39:43.680380 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-03-16 15:39:43.769604 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-03-16 15:39:43.908425 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-1: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[...]

In other words, deleting and recreating the unhealable files and
directories was a workaround, but the underlying problem persists,
and I can't even begin to look for it when I have no clue what
errno 22 means in plain English.

In any case, glusterd.log is full of messages like

[2021-03-16 15:37:03.398619 +0000] I [MSGID: 106533] [glusterd-volume-ops.c:717:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume gv0
[2021-03-16 15:37:03.791452 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:260:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}]

Every single "received heal vol req" message is immediately followed
by a "dict get failed", always for server-quorum-type, for hours on
end. And I begin to smell a bug. The CLI can query the value OK:

# gluster volume get gv0 cluster.server-quorum-type
Option                                  Value
------                                  -----
cluster.server-quorum-type              off

Checking all quorum-related settings, I get

# gluster volume get gv0 all |grep quorum
cluster.quorum-type                     auto
cluster.quorum-count                    (null) (DEFAULT)
cluster.server-quorum-type              off
cluster.server-quorum-ratio             51
cluster.quorum-reads                    no (DEFAULT)
disperse.quorum-count                   0 (DEFAULT)

I never touched any of them and none of them appear in volume info
under "Options Reconfigured", so I don't know why three of them are
not marked as defaults.

Next, I tried setting server-quorum-type=server. The
server-quorum-type problem went away and I got a new kind of dict
get failure:

The message "E [MSGID: 106061] [glusterd-volgen.c:2564:brick_graph_add_pump] 0-management: Dict get failed [{Key=enable-pump}]" repeated 2 times between [2021-03-16 17:12:18.677594 +0000] and [2021-03-16 17:12:18.779859 +0000]

I tried rolling back server-quorum-type=server and got this error:

# gluster volume set gv0 cluster.server-quorum-type off
volume set: failed: option server-quorum-type off: 'off' is not valid (possible options are none, server.)
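(Aside: when setting a value back is rejected like this, an alternative
that might return the option to its built-in default is "gluster volume
reset". A minimal sketch, assuming reset handles server-quorum-type like
any other option on this version:

# gluster volume reset gv0 cluster.server-quorum-type
# gluster volume get gv0 cluster.server-quorum-type
)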
Aha, but previously and by default it was clearly "off", not "none".
That's a bug somewhere, and that is what was causing the dict get
failures on server-quorum-type. The missing dict key enable-pump that
server-quorum-type=server requires also looks like a bug, because
there is no such setting:

# gluster volume get gv0 all |grep pump
#

There are more similarly strange complaints in the glusterd log:

[2021-03-16 17:25:43.134207 +0000] E [MSGID: 106434] [glusterd-utils.c:13379:glusterd_get_value_for_vme_entry] 0-management: xlator_volopt_dynload error (-1)
[2021-03-16 17:25:43.141816 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for localtime-logging key
[2021-03-16 17:25:43.143185 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-seckey key
[2021-03-16 17:25:43.143340 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-keyid key
[2021-03-16 17:25:43.143484 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-bucketid key
[2021-03-16 17:25:43.143621 +0000] W [MSGID: 106332] [glusterd-utils.c:13390:glusterd_get_value_for_vme_entry] 0-management: Failed to get option for s3plugin-hostname key

If none of this stuff is used in the first place, it should not be
triggering errors and warnings. If the S3 plugin is not enabled, the
S3 keys should not even be checked. Both the checking of the keys and
the error logging are bugs.

Cool, I'm discovering more and more stuff that needs fixing, but I'm
making zero progress with my healing problem. I'm still stuck with
errno=22.
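(Incidentally, errno values can be translated into plain English from
the shell; 22 is EINVAL, the same "Invalid argument" already printed in
the log. A quick check, assuming python3 is available on the node:

# python3 -c 'import errno, os; print(errno.errorcode[22], os.strerror(22))'
EINVAL Invalid argument
)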
According to https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/server-quorum/ :

Server quorum is controlled by two parameters:

- cluster.server-quorum-type
  This value may be "server" to indicate that server quorum is
  enabled, or "none" to mean it's disabled.

So, try with 'none'.

Best Regards,
Strahil Nikolov

On Tue, Mar 16, 2021 at 20:16, Zenon Panoussis <oracle at provocation.net> wrote:
> [...]
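(For reference, what "try with 'none'" looks like on the CLI; a minimal
sketch, assuming the running version accepts 'none' as the "possible
options are none, server" error suggests:

# gluster volume set gv0 cluster.server-quorum-type none
# gluster volume get gv0 cluster.server-quorum-type
)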
On 16/03/21 11:45 pm, Zenon Panoussis wrote:

>> Yes if the dataset is small, you can try rm -rf of the dir
>> from the mount (assuming no other application is accessing
>> them on the volume) launch heal once so that the heal info
>> becomes zero and then copy it over again.
>
> I did approximately so; the rm -rf took its sweet time and the
> number of entries to be healed kept diminishing as the deletion
> progressed. At the end I was left with
>
> Mon Mar 15 22:57:09 CET 2021
> Gathering count of entries to be healed on volume gv0 has been successful
>
> Brick node01:/gfs/gv0
> Number of entries: 3
>
> Brick mikrivouli:/gfs/gv0
> Number of entries: 2
>
> Brick nanosaurus:/gfs/gv0
> Number of entries: 3
> --------------
>
> and that's where I've been ever since, for the past 20 hours.
> SHD has kept trying to heal them all along and the log brings
> us back to square one:
>
> [2021-03-16 14:51:35.059593 +0000] I [MSGID: 108026] [afr-self-heal-entry.c:1053:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 94aefa13-9828-49e5-9bac-6f70453c100f

Does this gfid correspond to the same directory path as last time?

> [2021-03-16 15:39:43.680380 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-03-16 15:39:43.769604 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-2: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [2021-03-16 15:39:43.908425 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 0-gv0-client-1: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
> [...]

Wonder if you can attach gdb to the glustershd process at the function
client4_0_mkdir() and try to print args->loc->path to see which file
the mkdir is being attempted on.

> In other words, deleting and recreating the unhealable files and
> directories was a workaround, but the underlying problem persists,
> and I can't even begin to look for it when I have no clue what
> errno 22 means in plain English.
>
> In any case, glusterd.log is full of messages like

The server-quorum messages in the glusterd log are unrelated; you can
raise a separate github issue for that. (And you can leave it at 'off'.)

> [2021-03-16 15:37:03.398619 +0000] I [MSGID: 106533] [glusterd-volume-ops.c:717:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume gv0
> [2021-03-16 15:37:03.791452 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:260:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}]
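(A rough sketch of that gdb approach, assuming glustershd was built
with debug symbols (e.g. a glusterfs-debuginfo package installed) and
that args is in scope at the point where the breakpoint fires; note
that the daemon is paused while gdb holds it stopped:

# find the self-heal daemon's pid on one of the nodes
pgrep -f glustershd

# attach to it and break on the mkdir fop
gdb -p <glustershd-pid>
(gdb) break client4_0_mkdir
(gdb) continue
... wait for the next self-heal attempt to hit the breakpoint ...
(gdb) print args->loc->path
(gdb) detach
(gdb) quit
)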