Erik Jacobson
2021-Sep-21 14:59 UTC
[Gluster-users] gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)
> Don't forget to run the geo-replication fix script, if you missed to do it
> before the upgrade.

We don't use geo-replication YET, but thank you for this thoughtful
reminder.

Just a note on things like this -- we really try to do everything in a
package update, because that's how we'd have to deploy to customers in an
automated way. So having to run a script as part of the upgrade would be
very hard in a package-based workflow for a packaged solution. I'm not
complaining -- I love gluster -- but this is just food for thought. I can
hardly even say it with a straight face, because we suffer from similar
issues on the cluster management side: updating from one CM release to the
next is harder than it should be, so I'm certainly not judging. Updating
is always painful.

I LOVE that slowly updating our gluster servers is "just working". This
will allow a supercomputer to slowly update its infrastructure while
taking no compute nodes (using nfs-hosted squashfs images for root) down.
It's really remarkable since it's a big jump too, 7.9 to 9.3. I am
impressed by this part. It's a huge relief that I didn't have to do an
intermediate jump to gluster 8 in the middle, as that would have been
nearly impossible for us to get right.

Thank you all!!

PS: Frontier will have 21 leader nodes running gluster servers,
distributed/replicate in groups of 3, hosting nfs-exported squashfs image
objects for compute node root filesystems. Many thousands of nodes.

> Best Regards,
> Strahil Nikolov
>
>
>    On Tue, Sep 21, 2021 at 0:46, Erik Jacobson
>    <erik.jacobson at hpe.com> wrote:
>    I pretended I'm a low-level C programmer with network and filesystem
>    experience for a few hours.
>
>    I'm not sure what the right solution is, but what was happening was
>    that the code was trying to treat our IPV4 hosts as AF_INET6, and that
>    family is incompatible with our IPV4 IP addresses. Yes, we need to
>    move to IPV6, but we're hoping to do that on our own time (~50 years
>    like everybody else :)
>
>    I found a chunk of the code that seemed to be force-setting us to
>    AF_INET6.
>
>    While I'm sure it is not 100% the correct patch, the patch attached
>    and pasted below is working for me, so I'll integrate it with our
>    internal build to continue testing.
>
>    Please let me know if there is a configuration item I missed or a
>    different way to do this. I added -devel to this email.
>
>    In the previous thread, you would have seen that we're testing a
>    hopeful change that will upgrade our deployed customers from gluster
>    7.9 to gluster 9.3.
>
>    Thank you!! Advice on next steps would be appreciated!!
>
>
>    diff -Narup glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c
>    --- glusterfs-9.3-ORIG/rpc/rpc-transport/socket/src/name.c    2021-06-29 00:27:44.381408294 -0500
>    +++ glusterfs-9.3-NEW/rpc/rpc-transport/socket/src/name.c    2021-09-20 16:34:28.969425361 -0500
>    @@ -252,9 +252,16 @@ af_inet_client_get_remote_sockaddr(rpc_t
>         /* Need to update transport-address family if address-family is not provided
>            to command-line arguments
>         */
>    +    /* HPE This is forcing our IPV4 servers in to to an IPV6 address
>    +     * family that is not compatible with IPV4. For now we will just set it
>    +     * to AF_INET.
>    +     */
>    +    /*
>         if (inet_pton(AF_INET6, remote_host, &serveraddr)) {
>             sockaddr->sa_family = AF_INET6;
>         }
>    +    */
>    +    sockaddr->sa_family = AF_INET;
>
>         /* TODO: gf_resolve is a blocking call. kick in some
>            non blocking dns techniques */
>
>
>    On Mon, Sep 20, 2021 at 11:35:35AM -0500, Erik Jacobson wrote:
>    > I missed the other important log snip:
>    >
>    > The message "E [MSGID: 101075] [common-utils.c:520:gf_resolve_ip6] 0-resolver: error in getaddrinfo [{family=10}, {ret=Address family for hostname not supported}]" repeated 620 times between [2021-09-20 15:49:23.720633 +0000] and [2021-09-20 15:50:41.731542 +0000]
>    >
>    > So I will dig into the code some here.
>    >
>    >
>    > On Mon, Sep 20, 2021 at 10:59:30AM -0500, Erik Jacobson wrote:
>    > > Hello all! I hope you are well.
>    > >
>    > > We are starting a new software release cycle and I am trying to
>    > > find a way to upgrade customers from our build of gluster 7.9 to
>    > > our build of gluster 9.3.
>    > >
>    > > When we deploy gluster, we forcibly remove all references to any
>    > > host names and use only IP addresses. This is because, if for any
>    > > reason a DNS server is unreachable, even if the peer files have
>    > > both IPs and DNS names, glusterd is unable to reach its peers
>    > > properly. We can't really rely on /etc/hosts either, because
>    > > customers take artistic license with their /etc/hosts files and
>    > > don't realize the problems that can cause.
>    > >
>    > > So our deployed peer files look something like this:
>    > >
>    > > uuid=46a4b506-029d-4750-acfb-894501a88977
>    > > state=3
>    > > hostname1=172.23.0.16
>    > >
>    > > That is, with full intention, we avoid host names.
>    > >
>    > > When we upgrade to gluster 9.3, we fall over with these errors, and
>    > > gluster is now partitioned: the updated gluster servers can't reach
>    > > anybody:
>    > >
>    > > [2021-09-20 15:50:41.731543 +0000] E [name.c:265:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host 172.23.0.16
>    > >
>    > > As you can see, we have defined everything using IPs on purpose,
>    > > but in 9.3 it appears this method fails. Are there any suggestions
>    > > short of putting real host names in the peer files?
>    > >
>    > >
>    > > FYI
>    > >
>    > > This supercomputer will be using gluster for part of its system
>    > > management. It is how we deploy the Image Objects (squashfs images)
>    > > hosted on NFS today, served by gluster leader nodes, and also how
>    > > we store system logs, console logs, and other data.
>    > >
>    > > https://www.olcf.ornl.gov/frontier/
>    > >
>    > >
>    > > Erik
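For readers following along, a minimal standalone C sketch (a hypothetical
test program, af_check.c, not part of the glusterfs tree) illustrates the
address-family mismatch described above: an IPv4 literal such as
172.23.0.16 does not parse as an AF_INET6 address, and name resolution
restricted to AF_INET6 (family=10 on Linux) fails for it, matching the
"Address family for hostname not supported" getaddrinfo error in the log.

/* af_check.c - hypothetical standalone sketch, not from the glusterfs tree.
 * Build: cc -o af_check af_check.c
 * Shows how an IPv4 literal behaves under the checks discussed above.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netdb.h>
#include <arpa/inet.h>

int main(void)
{
    const char *remote_host = "172.23.0.16"; /* example IPv4 peer address */
    struct in6_addr v6;
    struct in_addr v4;
    struct addrinfo hints, *res = NULL;
    int rc;

    /* An IPv4 dotted-quad is not a valid IPv6 presentation string,
     * so this returns 0 ... */
    printf("inet_pton(AF_INET6, \"%s\") = %d\n", remote_host,
           inet_pton(AF_INET6, remote_host, &v6));

    /* ... while the same string parses fine as AF_INET (returns 1). */
    printf("inet_pton(AF_INET,  \"%s\") = %d\n", remote_host,
           inet_pton(AF_INET, remote_host, &v4));

    /* If resolution is nevertheless restricted to AF_INET6 (family=10 on
     * Linux), getaddrinfo() on the IPv4 literal fails; on glibc this is
     * typically EAI_ADDRFAMILY, i.e. "Address family for hostname not
     * supported", which is what the gf_resolve_ip6 log message reports. */
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET6;
    rc = getaddrinfo(remote_host, NULL, &hints, &res);
    if (rc != 0)
        printf("getaddrinfo(AF_INET6): %s\n", gai_strerror(rc));
    else
        freeaddrinfo(res);

    return 0;
}

A less invasive variant of the patch above might check
inet_pton(AF_INET, remote_host, ...) first and only fall back to AF_INET6
when the string really is an IPv6 literal; whether that is the right fix
for glusterd is a question for the maintainers.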
Strahil Nikolov
2021-Sep-21 16:18 UTC
[Gluster-users] gluster forcing IPV6 on our IPV4 servers, glusterd fails (was gluster update question regarding new DNS resolution requirement)
As far as I know, a fix was introduced recently, so even missing to run the
script won't be so critical - you can run it afterwards.

I would use Ansible to roll out such updates on a set of nodes - this will
prevent human errors and give you the opportunity to handle tiny details
like the geo-rep modifying script.

P.S.: Out of curiosity, are you using distributed-replicated or
distributed-dispersed volumes?

Best Regards,
Strahil Nikolov