Jiří Sléžka
2021-Jul-08 10:22 UTC
[Gluster-users] glusterfs health-check failed, (brick) going down
Hello gluster community,

I am new to this list but have been using glusterfs for a long time as our SDS solution, storing 80+ TiB of data. I'm also using glusterfs for a small 3-node HCI cluster with oVirt 4.4.6 and CentOS 8 (not Stream yet). The glusterfs version here is 8.5-2.el8.x86_64.

From time to time a (I believe) random brick on a random host goes down because of a failed health-check. It looks like this:

[root at ovirt-hci02 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408184] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 07:13:37.408407] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.518971] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07 16:11:14.519200] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM

and on another host:

[root at ovirt-hci01 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983327] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05 13:15:51.983728] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix: still alive! -> SIGTERM
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769129] M [MSGID: 113075] [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix: health-check failed, going down
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05 01:53:35.769819] M [MSGID: 113075] [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix: still alive! -> SIGTERM

I cannot link these errors to any storage/fs issue (nothing in dmesg or /var/log/messages), and the brick devices look healthy (smartd).

I can force-start the brick with

gluster volume start vms|engine force

and after some healing everything works fine for a few days.

Did anybody observe this behavior?

The vms volume has this structure (two bricks per host, each a separate JBOD SSD disk); the engine volume has one brick on each host:
gluster volume info vms

Volume Name: vms
Type: Distributed-Replicate
Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.0.4.11:/gluster_bricks/vms/vms
Brick2: 10.0.4.13:/gluster_bricks/vms/vms
Brick3: 10.0.4.12:/gluster_bricks/vms/vms
Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
Options Reconfigured:
cluster.granular-entry-heal: enable
performance.stat-prefetch: off
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
user.cifs: off
network.ping-timeout: 30
network.remote-dio: off
performance.strict-o-direct: on
performance.low-prio-threads: 32
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
performance.client-io-threads: off

Cheers,

Jiri
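A minimal sketch of the check-and-recover cycle described above, assuming the GlusterFS 8 names of the posix health-check options and the volume names from this setup:

# which bricks are currently online?
gluster volume status vms

# how aggressive is the posix health-check? (interval in seconds, 0 disables it;
# the timeout option may not be present in older releases)
gluster volume get vms storage.health-check-interval
gluster volume get vms storage.health-check-timeout

# bring a killed brick back and watch the heal queue drain
gluster volume start vms force
watch -n 60 'gluster volume heal vms info'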
Olaf Buitelaar
2021-Jul-08 13:29 UTC
[Gluster-users] glusterfs health-check failed, (brick) going down
Hi Jiri,

your problem looks pretty similar to mine, see:
https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html

Any chance you also see the xfs errors in the brick logs?

For me the situation improved once I disabled brick multiplexing, but I don't see that in your volume configuration.

Cheers,

Olaf
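A rough sketch of how one might check both points, assuming the standard cluster.brick-multiplex option name and the brick log paths shown above:

# look for XFS / I/O errors around the time the health-check fired
grep -iE 'xfs|i/o error' /var/log/glusterfs/bricks/*.log
dmesg -T | grep -iE 'xfs|i/o error'

# check whether brick multiplexing is enabled, and disable it cluster-wide if so
gluster volume get all cluster.brick-multiplex
gluster volume set all cluster.brick-multiplex off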