Olaf Buitelaar
2021-Jul-08 13:29 UTC
[Gluster-users] glusterfs health-check failed, (brick) going down
Hi Jiri,

your problem looks pretty similar to mine, see:
https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
Any chance you also see the xfs errors in the brick logs? For me the
situation improved once I disabled brick multiplexing, but I don't see
that in your volume configuration.

Cheers Olaf

On Thu, 8 Jul 2021 at 12:28, Jiří Sléžka <jiri.slezka at slu.cz> wrote:

> Hello gluster community,
>
> I am new to this list but have been using glusterfs for a long time as
> our SDS solution for storing 80+TiB of data. I'm also using glusterfs
> for a small 3 node HCI cluster with oVirt 4.4.6 and CentOS 8 (not
> Stream yet). Glusterfs version here is 8.5-2.el8.x86_64.
>
> From time to time (I believe) a random brick on a random host goes
> down because of a failed health-check. It looks like
>
> [root@ovirt-hci02 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408184] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 07:13:37.408407] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.518971] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
> 16:11:14.519200] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still alive! -> SIGTERM
>
> on the other host
>
> [root@ovirt-hci01 ~]# grep "posix_health_check" /var/log/glusterfs/bricks/*
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983327] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-engine-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
> 13:15:51.983728] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-engine-posix:
> still alive! -> SIGTERM
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769129] M [MSGID: 113075]
> [posix-helpers.c:2214:posix_health_check_thread_proc] 0-vms-posix:
> health-check failed, going down
> /var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
> 01:53:35.769819] M [MSGID: 113075]
> [posix-helpers.c:2232:posix_health_check_thread_proc] 0-vms-posix:
> still alive! -> SIGTERM
>
> I cannot link these errors to any storage/fs issue (in dmesg or
> /var/log/messages), and the brick devices look healthy (smartd).
>
> I can force start a brick with
>
> gluster volume start vms|engine force
>
> and after some healing all works fine for a few days.
>
> Did anybody observe this behavior?
>
> vms volume has this structure (two bricks per host, each a separate
> JBOD ssd disk), engine volume has one brick on each host...
>
> gluster volume info vms
>
> Volume Name: vms
> Type: Distributed-Replicate
> Volume ID: 52032ec6-99d4-4210-8fb8-ffbd7a1e0bf7
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 3 = 6
> Transport-type: tcp
> Bricks:
> Brick1: 10.0.4.11:/gluster_bricks/vms/vms
> Brick2: 10.0.4.13:/gluster_bricks/vms/vms
> Brick3: 10.0.4.12:/gluster_bricks/vms/vms
> Brick4: 10.0.4.11:/gluster_bricks/vms2/vms2
> Brick5: 10.0.4.13:/gluster_bricks/vms2/vms2
> Brick6: 10.0.4.12:/gluster_bricks/vms2/vms2
> Options Reconfigured:
> cluster.granular-entry-heal: enable
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> user.cifs: off
> network.ping-timeout: 30
> network.remote-dio: off
> performance.strict-o-direct: on
> performance.low-prio-threads: 32
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> transport.address-family: inet
> storage.fips-mode-rchecksum: on
> nfs.disable: on
> performance.client-io-threads: off
>
> Cheers,
>
> Jiri
>
> ________
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
> Gluster-users mailing list
> Gluster-users at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
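For context on the multiplexing suggestion: cluster.brick-multiplex is a
cluster-wide option rather than a per-volume one, which is why it does
not show up in the "gluster volume info" output above. A minimal sketch
for checking and disabling it with the stock gluster CLI (new brick
processes pick the setting up on their next start):

  # show the current cluster-wide setting
  gluster volume get all cluster.brick-multiplex

  # disable multiplexing for all volumes
  gluster volume set all cluster.brick-multiplex off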
Jiří Sléžka
2021-Jul-08 14:52 UTC
[Gluster-users] glusterfs health-check failed, (brick) going down
Hi Olaf,

thanks for the reply.

On 7/8/21 3:29 PM, Olaf Buitelaar wrote:
> Hi Jiri,
>
> your problem looks pretty similar to mine, see:
> https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
> Any chance you also see the xfs errors in the brick logs?

yes, I can see these log lines related to the "health-check failed" items

[root@ovirt-hci02 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
07:13:37.408010] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on
/gluster_bricks/vms2/vms2/.glusterfs/health_check returned ret is -1
error is Structure needs cleaning
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-07
16:11:14.518844] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on
/gluster_bricks/vms2/vms2/.glusterfs/health_check returned ret is -1
error is Structure needs cleaning

[root@ovirt-hci01 ~]# grep "aio_read" /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/gluster_bricks-engine-engine.log:[2021-07-05
13:15:51.982938] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-engine-posix:
aio_read_cmp_buf() on
/gluster_bricks/engine/engine/.glusterfs/health_check returned ret is -1
error is Structure needs cleaning
/var/log/glusterfs/bricks/gluster_bricks-vms2-vms2.log:[2021-07-05
01:53:35.768534] W [MSGID: 113075]
[posix-helpers.c:2135:posix_fs_health_check] 0-vms-posix:
aio_read_cmp_buf() on
/gluster_bricks/vms2/vms2/.glusterfs/health_check returned ret is -1
error is Structure needs cleaning

it looks very similar to your issue, but in my case I don't use LVM
cache and the brick disks are JBOD (though connected through a Broadcom
/ LSI MegaRAID SAS-3 3008 [Fury] (rev 02) controller).

> For me the situation improved once I disabled brick multiplexing, but
> I don't see that in your volume configuration.

probably important is your note...

> When I kill the brick process and start with "gluster v start x force"
> the issue seems much more unlikely to occur, but when started from a
> fresh reboot, or when killing the process and letting it be started by
> glusterd (e.g. service glusterd start), the error seems to arise after
> a couple of minutes.

...because in the oVirt list Jayme replied this

https://lists.ovirt.org/archives/list/users at ovirt.org/message/BZRONK53OGWSOPUSGQ76GIXUM7J6HHMJ/

and it looks to me like something you also observe.

Cheers,

Jiri

> [...]
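A note on the error string itself: "Structure needs cleaning" is the
kernel's EUCLEAN error text, i.e. XFS is failing the read, not gluster.
A minimal sketch to retry the same read outside of gluster, assuming
the brick path from the logs above (the health_check file is the small
file the posix translator periodically rewrites and reads back):

  # direct-I/O read of the health-check file, similar to what the
  # brick's health thread does; EUCLEAN here implicates the filesystem
  dd if=/gluster_bricks/vms2/vms2/.glusterfs/health_check \
     of=/dev/null iflag=direct bs=4096 count=1

  # look for matching XFS complaints in the kernel log
  dmesg | grep -i xfs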
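If such a read fails the same way, a read-only xfs_repair pass on the
brick filesystem would be the usual next check. A sketch assuming a
hypothetical brick device /dev/sdX1; the brick process must be stopped
and the filesystem unmounted first:

  # find the PID of the affected brick, then stop it
  gluster volume status vms
  kill <brick-pid>          # <brick-pid> taken from the status output

  # unmount the brick and run a no-modify check (-n reports only)
  umount /gluster_bricks/vms2
  xfs_repair -n /dev/sdX1   # hypothetical device, adjust to your layout

  # remount before bringing the brick back
  mount /gluster_bricks/vms2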
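On the recovery side, the force-start-and-heal cycle described in the
thread can be followed from the CLI; a small sketch for the vms volume:

  # bring the downed brick back online
  gluster volume start vms force

  # verify every brick shows Online "Y"
  gluster volume status vms

  # watch self-heal progress until the pending-entry counts reach 0
  gluster volume heal vms info summary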