Artem Russakovskii
2021-Jul-23 00:05 UTC
[Gluster-users] Broken status, peer probe, "DNS resolution failed on host" and "Error disabling sockopt IPV6_V6ONLY: "Protocol not available" after updating from gluster 7.9 to 9.1
Hi all, I just filed this ticket https://github.com/gluster/glusterfs/issues/2648 and wanted to bring it to your attention. Any feedback would be appreciated.

Description of problem:

We have a 4-node replicate cluster running gluster 7.9. I'm currently setting up a new cluster on a new set of machines and went straight for gluster 9.1. However, I was unable to probe any servers due to this error:

[2021-07-17 00:31:05.228609 +0000] I [MSGID: 106487] [glusterd-handler.c:1160:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req nexus2 24007
[2021-07-17 00:31:05.229727 +0000] E [MSGID: 101075] [common-utils.c:3657:gf_is_local_addr] 0-management: error in getaddrinfo [{ret=Name or service not known}]
[2021-07-17 00:31:05.230785 +0000] E [MSGID: 106408] [glusterd-peer-utils.c:217:glusterd_peerinfo_find_by_hostname] 0-management: error in getaddrinfo: Name or service not known [Unknown error -2]
[2021-07-17 00:31:05.353971 +0000] I [MSGID: 106128] [glusterd-handler.c:3719:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: nexus2 (24007)
[2021-07-17 00:31:05.375871 +0000] W [MSGID: 106061] [glusterd-handler.c:3488:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2021-07-17 00:31:05.375903 +0000] I [rpc-clnt.c:1010:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2021-07-17 00:31:05.377021 +0000] E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Name or service not known}]
[2021-07-17 00:31:05.377043 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host nexus2
[2021-07-17 00:31:05.377147 +0000] I [MSGID: 106498] [glusterd-handler.c:3648:glusterd_friend_add] 0-management: connect returned 0
[2021-07-17 00:31:05.377201 +0000] I [MSGID: 106004] [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer <nexus2> (<00000000-0000-0000-0000-000000000000>), in state <Establishing Connection>, has disconnected from glusterd.
[2021-07-17 00:31:05.377453 +0000] E [MSGID: 101032] [store.c:464:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]

I then wiped the /var/lib/glusterd dir to start clean and downgraded to 7.9, then attempted to peer probe again. This time it worked fine, proving 7.9 works, same as it does on prod. At this point, I made a volume, started it, and tested to my satisfaction. Then I decided to see what would happen if I tried to upgrade this working volume from 7.9 to 9.1.

The end result is:
- gluster volume status is only showing the local gluster node and not any of the remote nodes
- data does seem to replicate, so the connection between the servers is actually established
- logs are now filled with constantly repeating messages like these:

[2021-07-22 23:29:31.039004 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host nexus2
[2021-07-22 23:29:31.039212 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host citadel
[2021-07-22 23:29:31.039304 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host hive
The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Name or service not known}]" repeated 119 times between [2021-07-22 23:27:34.025983 +0000] and [2021-07-22 23:29:31.039302 +0000]
[2021-07-22 23:29:34.039369 +0000] E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Name or service not known}]
[2021-07-22 23:29:34.039441 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host nexus2
[2021-07-22 23:29:34.039558 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host citadel
[2021-07-22 23:29:34.039659 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host hive
[2021-07-22 23:29:37.039741 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host nexus2
[2021-07-22 23:29:37.039921 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host citadel
[2021-07-22 23:29:37.040015 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host hive

When I issue a command in the CLI:

==> cli.log <==
[2021-07-22 23:38:11.802596 +0000] I [cli.c:840:main] 0-cli: Started running gluster with version 9.1
**[2021-07-22 23:38:11.804007 +0000] W [socket.c:3434:socket_connect] 0-glusterfs: Error disabling sockopt IPV6_V6ONLY: "Operation not supported"**
[2021-07-22 23:38:11.906865 +0000] I [MSGID: 101190] [event-epoll.c:670:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]

**Mandatory info:**

**- The output of the `gluster volume info` command**:

Volume Name: ap
Type: Replicate
Volume ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: nexus2:/mnt/nexus2_block1/ap
Brick2: forge:/mnt/forge_block1/ap
Brick3: hive:/mnt/hive_block1/ap
Brick4: citadel:/mnt/citadel_block1/ap
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
cluster.self-heal-daemon: enable
client.event-threads: 4
cluster.data-self-heal-algorithm: full
cluster.lookup-optimize: on
cluster.quorum-count: 1
cluster.quorum-type: fixed
cluster.readdir-optimize: on
cluster.heal-timeout: 1800
disperse.eager-lock: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
network.inode-lru-limit: 500000
network.ping-timeout: 7
network.remote-dio: enable
performance.cache-invalidation: on
performance.cache-size: 1GB
performance.io-thread-count: 4
performance.md-cache-timeout: 600
performance.rda-cache-limit: 256MB
performance.read-ahead: off
performance.readdir-ahead: on
performance.stat-prefetch: on
performance.write-behind-window-size: 32MB
server.event-threads: 4
cluster.background-self-heal-count: 1
performance.cache-refresh-timeout: 10
features.ctime: off
cluster.granular-entry-heal: enable

- The output of the gluster volume status command:

Status of volume: ap
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick forge:/mnt/forge_block1/ap            49152     0          Y       2622
Self-heal Daemon on localhost               N/A       N/A        N       N/A

Task Status of Volume ap
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:

gluster volume heal ap enable
Enable heal on volume ap has been successful

gluster volume heal ap
Launching heal operation to perform index self heal on volume ap has been unsuccessful:
Self-heal daemon is not running. Check self-heal daemon log file.

- The operating system / glusterfs version: OpenSUSE 15.2, glusterfs 9.1.

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>
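[Editor's note: the two error strings in the report can be reproduced outside gluster. A minimal Python sketch, assuming Linux numeric constants (the `family=10` in the getaddrinfo log line is the numeric value of AF_INET6 on Linux, and setting the IPv6-level option IPV6_V6ONLY on an IPv4 socket is typically rejected with ENOPROTOOPT, which strerror() renders as "Protocol not available"):]

```python
import errno
import socket

# Decode the log's "family=10": on Linux, AF_INET is 2 and AF_INET6 is 10,
# so the failing getaddrinfo() call was an IPv6-only lookup.
print(int(socket.AF_INET))   # 2 on Linux
print(int(socket.AF_INET6))  # 10 on Linux

# Reproduce the CLI warning: clearing IPV6_V6ONLY only makes sense on an
# AF_INET6 socket. On an AF_INET socket the IPPROTO_IPV6 option level does
# not apply, and the kernel rejects the setsockopt() call.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
except OSError as e:
    # On Linux this is ENOPROTOOPT ("Protocol not available").
    print(errno.errorcode.get(e.errno), e.strerror)
finally:
    s.close()
```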
Strahil Nikolov
2021-Jul-23 04:09 UTC
[Gluster-users] Broken status, peer probe, "DNS resolution failed on host" and "Error disabling sockopt IPV6_V6ONLY: "Protocol not available" after updating from gluster 7.9 to 9.1
Did you try with the latest 9.x? Based on the release notes, that should be 9.3.

Best Regards,
Strahil Nikolov

On Fri, Jul 23, 2021 at 3:06, Artem Russakovskii <archon810 at gmail.com> wrote:
> [snip: original message quoted in full]

________

Community Meeting Calendar:
Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users at gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
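[Editor's note: the resolver failure in the report can also be mimicked directly. The failing gf_resolve_ip6 log line shows an AF_INET6-only getaddrinfo() call, which raises "Name or service not known" for a name that only has IPv4 addresses, even though an AF_INET lookup for the same name would succeed. A minimal sketch of that try-IPv6-then-fall-back-to-IPv4 behavior, using "localhost" as a stand-in host (the nexus2/citadel/hive names are specific to the reporter's environment):]

```python
import socket

def resolve(host):
    """Try an IPv6-only lookup first, then fall back to an IPv4-only one.

    With family=socket.AF_INET6, getaddrinfo() raises gaierror for a name
    that has no IPv6 address, mirroring the "error in getaddrinfo
    [{family=10}, ...]" lines in the glusterd log.
    """
    for family in (socket.AF_INET6, socket.AF_INET):
        try:
            infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
            return [info[4][0] for info in infos]
        except socket.gaierror:
            continue  # no addresses in this family; try the next one
    return []

print(resolve("localhost"))
```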