Olaf Buitelaar
2021-Jul-08 13:29 UTC
[Gluster-users] glusterfs health-check failed, (brick) going down
Hi Jiri,

your problem looks pretty similar to mine, see:
https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
Any chance you also see the xfs errors in the brick logs? For me the
situation improved once I disabled brick multiplexing, but I don't see
that in your volume configuration.

Cheers Olaf

On Thu, 8 Jul 2021 at 12:28, Jiří Sléžka <jiri.slezka at slu.cz> wrote:

> Hello gluster community,
>
> I am new to this list but have been using glusterfs for a long time as
> our SDS solution for storing 80+TiB of data. I'm also using glusterfs
> for a small 3 node HCI cluster with oVirt 4.4.6 and CentOS 8 (not
> Stream yet). Glusterfs version here is 8.5-2.el8.x86_64.
>
> From time to time (I believe) a random brick on a random host goes
> down because of a failed health-check. It looks like
>
> [root@ovirt-hci02 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408184] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408407] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.518971] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.519200] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still alive! -> SIGTERM
>
> on the other host
>
> [root@ovirt-hci01 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983327] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983728] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix:
> still alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769129] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769819] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still alive! -> SIGTERM
>
> I cannot link these errors to any storage/fs issue (in dmesg or
> /var/log/messages), and the brick devices look healthy (smartd).
>
> I can force start a brick with
>
> gluster volume start vms|engine force
>
> and after some healing all works fine for a few days.
>
> Did anybody observe this behavior?
>
> vms volume has this structure (two bricks per host, each a separate
> JBOD ssd disk), engine volume has one brick on each host...
>
> gluster volume info vms
>
> Volume Name: vms
> Type: Distributed-Replicate
> Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: 10.0.4.11:/gluster_bricks/vms/vms
> Brick2: 10.0.4.13:/gluster_bricks/vms/vms
> Brick3: 10.0.4.12:/gluster_bricks/vms/vms
> Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
> Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
> Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
> Options Reconfigured:
> cluster.granular-entry-heal: enable
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> user.cifs: off
> network.ping-timeout: 30
> network.remote-dio: off
> performance.strict-o-direct: on
> performance.low-prio-threads: 32
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> nfs.disable: on
> performance.client-io-threads: off
>
> Cheers,
>
> Jiri
>
> ________
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
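For context on the multiplexing suggestion: cluster.brick-multiplex is a
cluster-wide option rather than a per-volume one, which is why it does
not show up in the "gluster volume info" output above. A minimal sketch
for checking and disabling it with the stock gluster CLI (new brick
processes pick the setting up on their next start):

  # show the current cluster-wide setting
  gluster volume get all cluster.brick-multiplex

  # disable multiplexing for all volumes
  gluster volume set all cluster.brick-multiplex off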
Jiří Sléžka
2021-Jul-08 14:52 UTC
[Gluster-users] glusterfs health-check failed, (brick) going down
Hi Olaf,

thanks for the reply.

On 7/8/21 3:29 PM, Olaf Buitelaar wrote:
> Hi Jiri,
>
> your problem looks pretty similar to mine, see:
> https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
> Any chance you also see the xfs errors in the brick logs?

yes, I can see these log lines related to the "health-check failed" items

[root@ovirt-hci02 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
07:13:37.408010] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on
/gluster_bricks/vms2/vms2/.glusterfs/health_check returned ret is -1
error is Structure needs cleaning
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
16:11:14.518844] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on
/gluster_bricks/vms2/vms2/.glusterfs/health_check returned ret is -1
error is Structure needs cleaning

[root@ovirt-hci01 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
13:15:51.982938] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-engine-posix:
aio_read_cmp_buf() on
/gluster_bricks/engine/engine/.glusterfs/health_check returned ret is -1
error is Structure needs cleaning
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
01:53:35.768534] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on
/gluster_bricks/vms2/vms2/.glusterfs/health_check returned ret is -1
error is Structure needs cleaning

it looks very similar to your issue, but in my case I don't use LVM
cache and the brick disks are JBOD (though connected through a Broadcom
/ LSI MegaRAID SAS-3 3008 [Fury] (rev 02) controller).

> For me the situation improved once I disabled brick multiplexing, but
> I don't see that in your volume configuration.

probably important is your note...

> When I kill the brick process and start with "gluster v start x force"
> the issue seems much more unlikely to occur, but when started from a
> fresh reboot, or when killing the process and letting it be started by
> glusterd (e.g. service glusterd start), the error seems to arise after
> a couple of minutes.

...because in the oVirt list Jayme replied this

https://lists.ovirt.org/archives/list/users at ovirt.org/message/BZRONK53OGWSOPUSGQ76GIXUM7J6HHMJ/

and it looks to me like something you also observe.

Cheers,

Jiri

> [...]
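A note on the error string itself: "Structure needs cleaning" is the
kernel's EUCLEAN error text, i.e. XFS is failing the read, not gluster.
A minimal sketch to retry the same read outside of gluster, assuming
the brick path from the logs above (the health_check file is the small
file the posix translator periodically rewrites and reads back):

  # direct-I/O read of the health-check file, similar to what the
  # brick's health thread does; EUCLEAN here implicates the filesystem
  dd if=/gluster_bricks/vms2/vms2/.glusterfs/health_check \
     of=/dev/null iflag=direct bs=4096 count=1

  # look for matching XFS complaints in the kernel log
  dmesg | grep -i xfs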
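If such a read fails the same way, a read-only xfs_repair pass on the
brick filesystem would be the usual next check. A sketch assuming a
hypothetical brick device /dev/sdX1; the brick process must be stopped
and the filesystem unmounted first:

  # find the PID of the affected brick, then stop it
  gluster volume status vms
  kill <brick-pid>          # <brick-pid> taken from the status output

  # unmount the brick and run a no-modify check (-n reports only)
  umount /gluster_bricks/vms2
  xfs_repair -n /dev/sdX1   # hypothetical device, adjust to your layout

  # remount before bringing the brick back
  mount /gluster_bricks/vms2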
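On the recovery side, the force-start-and-heal cycle described in the
thread can be followed from the CLI; a small sketch for the vms volume:

  # bring the downed brick back online
  gluster volume start vms force

  # verify every brick shows Online "Y"
  gluster volume status vms

  # watch self-heal progress until the pending-entry counts reach 0
  gluster volume heal vms info summary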