Glomski, Patrick
2016-Apr-06 20:32 UTC
[Gluster-users] Gluster + Infiniband + 3.x kernel -> hard crash?
We run gluster 3.7 in a distributed replicated setup. Infiniband (tcp)
links the gluster peers together and clients use the ethernet interface.

This setup is stable running CentOS 6.x and using the most recent
infiniband drivers provided by Mellanox. Uptime was 170 days when we took
it down to wipe the systems and update to CentOS 7.

When the exact same setup is loaded onto a CentOS 7 machine (minor setup
differences, but basically the same; setup is handled by ansible), the
peers will (seemingly randomly) experience a hard crash and need to be
power-cycled. There is no output on the screen and nothing in the logs.
After rebooting, the peer reconnects, heals whatever files it missed, and
everything is happy again. Maximum uptime for any given peer is 20 days.
Thanks to the replication, clients maintain connectivity, but from a system
administration perspective it's driving me crazy!

We run other storage servers with the same infiniband and CentOS 7 setup
except that they use NFS instead of gluster. NFS shares are served through
infiniband to some machines and ethernet to others.

Is it possible that gluster's (and only gluster's) use of the infiniband
kernel module to send tcp packets to its peers on a 3.x kernel is causing
the system to have a hard crash? It's a pretty specific problem and it
doesn't make much sense to me, but that's sure where the evidence seems
to point.

Anyone running CentOS 7 gluster arrays with infiniband out there to confirm
that it works fine for them? Gluster devs care to chime in with a better
theory? I'd love for this random crashing to stop.

Thanks,
Patrick
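[For anyone hitting the same "nothing on screen, nothing in the logs" symptom: panic messages usually never make it to disk, so it can help to make the journal persistent and/or stream kernel messages to another box. A rough sketch for CentOS 7 follows; the interface name and collector IP are placeholders, not details from this thread.]

```shell
# Make the systemd journal persistent so messages written shortly before
# a crash survive the reboot (CentOS 7 defaults to a volatile journal
# under /run/log/journal).
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
systemctl restart systemd-journald

# Optionally stream kernel messages over UDP with netconsole.
# 192.168.1.5 is a placeholder for a collector box running
# something like:  nc -u -l 6666
modprobe netconsole netconsole=@/eth0,6666@192.168.1.5/
```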
Niels de Vos
2016-Apr-08 11:09 UTC
[Gluster-users] [Gluster-devel] Gluster + Infiniband + 3.x kernel -> hard crash?
On Wed, Apr 06, 2016 at 04:32:40PM -0400, Glomski, Patrick wrote:
> We run gluster 3.7 in a distributed replicated setup. Infiniband (tcp)
> links the gluster peers together and clients use the ethernet interface.
>
> This setup is stable running CentOS 6.x and using the most recent
> infiniband drivers provided by Mellanox. Uptime was 170 days when we took
> it down to wipe the systems and update to CentOS 7.
>
> When the exact same setup is loaded onto a CentOS 7 machine (minor setup
> differences, but basically the same; setup is handled by ansible), the
> peers will (seemingly randomly) experience a hard crash and need to be
> power-cycled. There is no output on the screen and nothing in the logs.
> After rebooting, the peer reconnects, heals whatever files it missed, and
> everything is happy again. Maximum uptime for any given peer is 20 days.
> Thanks to the replication, clients maintain connectivity, but from a system
> administration perspective it's driving me crazy!
>
> We run other storage servers with the same infiniband and CentOS7 setup
> except that they use NFS instead of gluster. NFS shares are served through
> infiniband to some machines and ethernet to others.
>
> Is it possible that gluster's (and only gluster's) use of the infiniband
> kernel module to send tcp packets to its peers on a 3 kernel is causing the
> system to have a hard crash? Pretty specific problem and it doesn't make
> much sense to me, but that's sure where the evidence seems to point.
>
> Anyone running CentOS 7 gluster arrays with infiniband out there to confirm
> that it works fine for them? Gluster devs care to chime in with a better
> theory? I'd love for this random crashing to stop.

Giving suggestions for this is extremely difficult without knowing the
technicality of the crashes. Gluster should not cause kernel panics, so
this is most likely an issue in the kernel or the drivers that are used.
I can only advise you to install and configure kdump.

This makes it possible to capture a vmcore after a kernel panic. The
system will 'kexec' into a special kdump kernel and it copies /proc/vmcore
to stable storage. There will also be a dmesg or similar log in the same
directory as the vmcore image itself. With the details from the kernel
panic, we can probably make some more useful suggestions (or get a
certain bug fixed).

Thanks,
Niels
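[For reference, a minimal kdump setup on CentOS 7 looks roughly like the sketch below. The crashkernel size and dump location are the stock defaults, not values specific to these machines; tune both for real use.]

```shell
# Install the kdump tooling.
yum install -y kexec-tools

# Reserve memory for the capture kernel by adding crashkernel=auto to
# the kernel command line of every installed kernel.
grubby --update-kernel=ALL --args="crashkernel=auto"

# /etc/kdump.conf controls where the vmcore lands; the default target
# is /var/crash on the local disk. Enable the service, then reboot so
# the crashkernel memory reservation takes effect.
systemctl enable kdump

# After a panic and reboot, the dump (vmcore plus a vmcore-dmesg.txt)
# should appear under a timestamped directory here:
ls /var/crash/
```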
Raghavendra G
2016-Apr-29 12:07 UTC
[Gluster-users] [Gluster-devel] Gluster + Infiniband + 3.x kernel -> hard crash?
On Thu, Apr 7, 2016 at 2:02 AM, Glomski, Patrick
<patrick.glomski at corvidtec.com> wrote:
> We run gluster 3.7 in a distributed replicated setup. Infiniband (tcp)
> links the gluster peers together and clients use the ethernet interface.
>
> This setup is stable running CentOS 6.x and using the most recent
> infiniband drivers provided by Mellanox. Uptime was 170 days when we took
> it down to wipe the systems and update to CentOS 7.
>
> When the exact same setup is loaded onto a CentOS 7 machine (minor setup
> differences, but basically the same; setup is handled by ansible), the
> peers will (seemingly randomly) experience a hard crash and need to be
> power-cycled. There is no output on the screen and nothing in the logs.
> After rebooting, the peer reconnects, heals whatever files it missed, and
> everything is happy again. Maximum uptime for any given peer is 20 days.
> Thanks to the replication, clients maintain connectivity, but from a system
> administration perspective it's driving me crazy!
>
> We run other storage servers with the same infiniband and CentOS7 setup
> except that they use NFS instead of gluster. NFS shares are served through
> infiniband to some machines and ethernet to others.
>
> Is it possible that gluster's (and only gluster's) use of the infiniband
> kernel module to send tcp packets to its peers on a 3 kernel is causing the
> system to have a hard crash?

Please note that Gluster is only a "userspace" consumer of infiniband.
So, at least in theory, it shouldn't result in a kernel panic. However,
infiniband also allows userspace programs to do some things which can
normally be done only by the kernel (like pinning pages to a specific
address). I am not very familiar with the internals of infiniband and
hence cannot authoritatively comment on whether a kernel panic is
possible or impossible. Someone with an understanding of infiniband
internals would be in a better position to comment on this.

> Pretty specific problem and it doesn't make much sense to me, but that's
> sure where the evidence seems to point.
>
> Anyone running CentOS 7 gluster arrays with infiniband out there to
> confirm that it works fine for them? Gluster devs care to chime in with a
> better theory? I'd love for this random crashing to stop.
>
> Thanks,
> Patrick
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel

--
Raghavendra G
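[As a quick sanity check on the "userspace consumer" point, it may be worth confirming the volume really is using the tcp transport over IPoIB rather than the rdma transport, and noting which infiniband modules the kernel has loaded. The volume name below is a placeholder.]

```shell
# Transport-type should read 'tcp' if gluster is only doing TCP over
# IPoIB; 'rdma' or 'tcp,rdma' would mean the ibverbs path is in play.
gluster volume info myvol | grep -i transport

# Infiniband-related kernel modules currently loaded: ib_ipoib carries
# the TCP-over-IB traffic, while mlx4_*/mlx5_* come from the HCA driver.
lsmod | grep -E 'ib_|ipoib|mlx'
```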