Erik Jacobson
2021-Sep-21 14:59 UTC
[Gluster-users] gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)
> Don't forget to run the geo-replication fix script, if you missed to do it
> before the upgrade.

We don't use geo-replication YET, but thank you for this thoughtful
reminder.

Just a note on things like this -- we really try to do everything in a
package update, because that's how we'd have to deploy to customers in an
automated way. So having to run a script as part of the upgrade would be
very hard in a package-based workflow for a packaged solution. I'm not
complaining -- I love gluster -- but this is just food for thought. I can
hardly even say it with a straight face, because we suffer from similar
issues on the cluster management side: updating from one CM release to the
next is harder than it should be, so I'm certainly not judging. Updating
is always painful.

I LOVE that slowly updating our gluster servers is "just working". This
will allow a supercomputer to slowly update its infrastructure while
taking no compute nodes (using nfs-hosted squashfs images for root) down.
It's really remarkable since it's a big jump too, 7.9 to 9.3. I am
impressed by this part. It's a huge relief that I didn't have to do an
intermediate jump to gluster 8 in the middle, as that would have been
nearly impossible for us to get right.

Thank you all!!

PS: Frontier will have 21 leader nodes running gluster servers,
distributed/replicate in groups of 3, hosting nfs-exported squashfs image
objects for compute node root filesystems. Many thousands of nodes.

> Best Regards,
> Strahil Nikolov
>
>
>    On Tue, Sep 21, 2021 at 0:46, Erik Jacobson
>    <erik.jacobson at hpe.com> wrote:
>    I pretended I'm a low-level C programmer with network and filesystem
>    experience for a few hours.
>
>    I'm not sure what the right solution is, but what was happening was
>    that the code was trying to treat our IPV4 hosts as AF_INET6, and that
>    family is incompatible with our IPV4 IP addresses. Yes, we need to
>    move to IPV6, but we're hoping to do that on our own time (~50 years
>    like everybody else :)
>
>    I found a chunk of the code that seemed to be force-setting us to
>    AF_INET6.
>
>    While I'm sure it is not 100% the correct patch, the patch attached
>    and pasted below is working for me, so I'll integrate it with our
>    internal build to continue testing.
>
>    Please let me know if there is a configuration item I missed or a
>    different way to do this. I added -devel to this email.
>
>    In the previous thread, you would have seen that we're testing a
>    hopeful change that will upgrade our deployed customers from gluster
>    7.9 to gluster 9.3.
>
>    Thank you!! Advice on next steps would be appreciated!!
>
>
>    diff -Narup glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c
>    --- glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c    2021-06-29 00:27:44.381408294 -0500
>    +++ glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c    2021-09-20 16:34:28.969425361 -0500
>    @@ -252,9 +252,16 @@ af_inet_client_get_remote_sockaddr(rpc_t
>         /* Need to update transport-address family if address-family is not provided
>            to command-line arguments
>         */
>    +    /* HPE This is forcing our IPV4 servers in to to an IPV6 address
>    +     * family that is not compatible with IPV4. For now we will just set it
>    +     * to AF_INET.
>    +     */
>    +    /*
>         if (inet_pton(AF_INET6, remote_host, &serveraddr)) {
>             sockaddr->sa_family = AF_INET6;
>         }
>    +    */
>    +    sockaddr->sa_family = AF_INET;
>
>         /* TODO: gf_resolve is a blocking call. kick in some
>            non blocking dns techniques */
>
>
>    On Mon, Sep 20, 2021 at 11:35:35AM -0500, Erik Jacobson wrote:
>    > I missed the other important log snip:
>    >
>    > The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for hostname not supported}]" repeated 620 times between [2021-09-20 15:49:23.720633 +0000] and [2021-09-20 15:50:41.731542 +0000]
>    >
>    > So I will dig into the code some here.
>    >
>    >
>    > On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
>    > > Hello all! I hope you are well.
>    > >
>    > > We are starting a new software release cycle and I am trying to
>    > > find a way to upgrade customers from our build of gluster 7.9 to
>    > > our build of gluster 9.3.
>    > >
>    > > When we deploy gluster, we forcibly remove all references to any
>    > > host names and use only IP addresses. This is because, if for any
>    > > reason a DNS server is unreachable, even if the peer files have
>    > > both IPs and DNS names, glusterd is unable to reach its peers
>    > > properly. We can't really rely on /etc/hosts either, because
>    > > customers take artistic license with their /etc/hosts files and
>    > > don't realize the problems that can cause.
>    > >
>    > > So our deployed peer files look something like this:
>    > >
>    > > uuid=46a4b506-029d-4750-acfb-894501a88977
>    > > state=3
>    > > hostname1=172.23.0.16
>    > >
>    > > That is, with full intention, we avoid host names.
>    > >
>    > > When we upgrade to gluster 9.3, we fall over with these errors, and
>    > > gluster is now partitioned: the updated gluster servers can't reach
>    > > anybody:
>    > >
>    > > [2021-09-20 15:50:41.731543 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host 172.23.0.16
>    > >
>    > > As you can see, we have defined everything using IPs on purpose,
>    > > but in 9.3 it appears this method fails. Are there any suggestions
>    > > short of putting real host names in the peer files?
>    > >
>    > >
>    > > FYI
>    > >
>    > > This supercomputer will be using gluster for part of its system
>    > > management. It is how we deploy the Image Objects (squashfs images)
>    > > hosted on NFS today, served by gluster leader nodes, and also how
>    > > we store system logs, console logs, and other data.
>    > >
>    > > https://www.olcf.ornl.gov/frontier/
>    > >
>    > >
>    > > Erik
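For readers following along, a minimal standalone C sketch (a hypothetical
test program, af_check.c, not part of the glusterfs tree) illustrates the
address-family mismatch described above: an IPv4 literal such as
172.23.0.16 does not parse as an AF_INET6 address, and name resolution
restricted to AF_INET6 (family=10 on Linux) fails for it, matching the
"Address family for hostname not supported" getaddrinfo error in the log.

/* af_check.c - hypothetical standalone sketch, not from the glusterfs tree.
 * Build: cc -o af_check af_check.c
 * Shows how an IPv4 literal behaves under the checks discussed above.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netdb.h>
#include <arpa/inet.h>

int main(void)
{
    const char *remote_host = "172.23.0.16"; /* example IPv4 peer address */
    struct in6_addr v6;
    struct in_addr v4;
    struct addrinfo hints, *res = NULL;
    int rc;

    /* An IPv4 dotted-quad is not a valid IPv6 presentation string,
     * so this returns 0 ... */
    printf("inet_pton(AF_INET6, \"%s\") = %d\n", remote_host,
           inet_pton(AF_INET6, remote_host, &v6));

    /* ... while the same string parses fine as AF_INET (returns 1). */
    printf("inet_pton(AF_INET,  \"%s\") = %d\n", remote_host,
           inet_pton(AF_INET, remote_host, &v4));

    /* If resolution is nevertheless restricted to AF_INET6 (family=10 on
     * Linux), getaddrinfo() on the IPv4 literal fails; on glibc this is
     * typically EAI_ADDRFAMILY, i.e. "Address family for hostname not
     * supported", which is what the gf_resolve_ip6 log message reports. */
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET6;
    rc = getaddrinfo(remote_host, NULL, &hints, &res);
    if (rc != 0)
        printf("getaddrinfo(AF_INET6): %s\n", gai_strerror(rc));
    else
        freeaddrinfo(res);

    return 0;
}

A less invasive variant of the patch above might check
inet_pton(AF_INET, remote_host, ...) first and only fall back to AF_INET6
when the string really is an IPv6 literal; whether that is the right fix
for glusterd is a question for the maintainers.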
Strahil Nikolov
2021-Sep-21 16:18 UTC
[Gluster-users] gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)
As far as I know, a fix was introduced recently, so even missing to run the
script won't be so critical - you can run it afterwards.

I would use Ansible to roll out such updates on a set of nodes - this will
prevent human errors and give you the opportunity to handle tiny details
like the geo-rep modifying script.

P.S.: Out of curiosity, are you using distributed-replicated or
distributed-dispersed volumes?

Best Regards,
Strahil Nikolov