Erik Jacobson
2020-Jan-26 00:51 UTC
[Gluster-users] gluster NFS hang observed mounting or umounting at scale
> The gluster NFS log has this entry:
> [2020-01-25 19:07:33.085806] E [MSGID: 109040] [dht-helper.c:1388:dht_migration_complete_check_task] 0-cm_shared-dht: 19bd72f0-6863-4f1d-80dc-a426db8670b8: failed to lookup the file on cm_shared-dht [Stale file handle]
> [2020-01-25 19:07:33.085848] W [MSGID: 112199] [nfs3-helpers.c:3578:nfs3_log_commit_res] 0-nfs-nfsv3: /image/images_rw_nfs/r41c4t1n1/rhel8.0/xfs-test.img => (XID: cb501b58, COMMIT: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), wverf: 1579988225

I've done more digging. I now have access to an actual system that is failing (instead of my test case above). It appears to be the same issue, so that's good. (My access goes away in a couple of hours.)

The nodes don't hang at the mount, but rather at a check in the code for the existence of the image file. I'm not sure whether the "holes" message I share below is a problem or not; the file does indeed start out sparse.

Restarting 'glusterd' on the problem server allows the node to boot. However, the problem image file seems to disappear from the face of the earth as far as I can tell (it does not exist at the same path in the gluster mount).

Searching nfs.log for all messages related to r17c3t6n3 (the problem node with the problem xfs.img file), I see:

[root at leader1 glusterfs]# grep r17c3t6n3 nfs.log
[2020-01-24 12:29:42.412019] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: ca68a5fc, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-25 04:57:10.199988] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 1ec43ce0, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973 [Invalid argument]
[2020-01-25 04:57:10.200431] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 20c43ce0, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-25 04:57:10.200695] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 21c43ce0, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-25 04:57:10.200827] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 1fc43ce0, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-25 04:57:10.201808] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: 22c43ce0, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973 [Invalid argument]
[2020-01-25 23:32:09.629807] I [MSGID: 109063] [dht-layout.c:693:dht_layout_normalize] 0-cm_shared-dht: Found anomalies in /image/images_rw_nfs/r17c3t6n3/rhel8.0 (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2020-01-26 02:42:33.712684] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img => (XID: a0ca8fc3, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973

r17c3t4n1 is another case:

[2020-01-25 23:19:46.729427] I [MSGID: 109063] [dht-layout.c:693:dht_layout_normalize] 0-cm_shared-dht: Found anomalies in /image/images_rw_nfs/r17c3t4n1/rhel8.0 (gfid = 00000000-0000-0000-0000-000000000000). Holes=1 overlaps=0
[2020-01-26 02:42:43.907163] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a77e7e7d, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-26 02:42:43.908045] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a87e7e7d, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
[2020-01-26 02:42:43.908194] W [MSGID: 112199] [nfs3-helpers.c:3494:nfs3_log_write_res] 0-nfs-nfsv3: /image/images_rw_nfs/r17c3t4n1/rhel8.0/xfs.img => (XID: a67e7e7d, WRITE: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), count: 0, STABLE,wverf: 1579664973
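(To make the existence check concrete: a manual version of it would look roughly like the lines below. The mount point and brick path are only placeholders for illustration, and the getfattr on the parent directory is just there because of the dht-layout "Holes=1" message above.)

  # on a node with the gluster volume mounted (mount point is an example path)
  ls -l /mnt/cm_shared/image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img

  # on each of the three servers, directly against the brick (brick path is an example path)
  ls -l /data/brick/cm_shared/image/images_rw_nfs/r17c3t6n3/rhel8.0/xfs.img
  getfattr -n trusted.glusterfs.dht -e hex /data/brick/cm_shared/image/images_rw_nfs/r17c3t6n3/rhel8.0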
Erik Jacobson
2020-Jan-26 01:11 UTC
[Gluster-users] gluster NFS hang observed mounting or umounting at scale
One last reply to myself. I checked the bricks, and the file wasn't there on any of the three. However, I see nothing recent or interesting in the brick logs.

I forcibly created the file in the brick path, confirmed it showed up in the gluster mount, and force-rebooted a problem node. I did not restart glusterd, and the problem remained. I suppose that makes sense, since at the very least it would be a different inode. I then restarted glusterd on the server used by the failing client, and the node took off (as expected): it detected the forcibly-created xfs image in the existence check and used it.

So the core issue seems to be that the xfs.img file for the node is either removed or never makes it into the actual back-end storage bricks for gluster. Not that I'm an expert, but I'd guess this happens behind GNFS's back somehow, resulting in the hangs and error messages. These files would normally be written via NFS.

I do have these values set:

performance.nfs.write-behind-trickling-writes on
performance.write-behind-trickling-writes off
performance.nfs.write-behind-window-size 1MB
performance.write-behind on
performance.nfs.write-behind on
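(For completeness, those correspond to the usual volume-set commands, roughly as below; I'm assuming 'cm_shared' is the volume name, going by the 0-cm_shared-dht prefix in the logs.)

  gluster volume set cm_shared performance.nfs.write-behind-trickling-writes on
  gluster volume set cm_shared performance.write-behind-trickling-writes off
  gluster volume set cm_shared performance.nfs.write-behind-window-size 1MB
  gluster volume set cm_shared performance.write-behind on
  gluster volume set cm_shared performance.nfs.write-behind on

  # the effective values can be double-checked with:
  gluster volume get cm_shared all | grep write-behind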
Erik Jacobson
Software Engineer
erik.jacobson at hpe.com
+1 612 851 0550 Office
Eagan, MN
hpe.com