Changwei Ge
2020-May-20 02:52 UTC
[Ocfs2-devel] NFS clients crash OCFS2 nodes (general protection fault: 0000 [#1] SMP PTI)
Hi Pierre,

You have provided very detailed information on this problem, and the research looks solid.

On 5/19/20 6:48 PM, Pierre Dinh-van wrote:
> Hi,
>
> I'm experiencing quite reproducible crashes on my OCFS2 cluster in the last
> few weeks, and I think I have found part of the problem. But before I file a
> bug report (I'm not sure where), I thought I would post my research here
> in case someone understands it better than I do.
>
> When it happens, the kernel gives me a 'general protection fault:
> 0000 [#1] SMP PTI' or 'BUG: unable to handle page fault for address:'
> message, local access to the OCFS2 volumes hangs forever, and the load
> gets so high that only a hard reset "solves" the problem. It happened
> the same way on both nodes (always on the active node serving NFS).
> It happens when both nodes are up, and when only one is active.

This makes me a little puzzled: do you mean you crash the kernel manually, or does ocfs2 crash the kernel by itself? From the backtrace you posted, I had assumed that you crash the kernel manually in order to migrate the NFS export IP to the other standby node, right?

> First, a few words about my setup:
>
> - storage is Fibre Channel LUNs served by an EMC VNX5200 (4x 5TB).
>
> - LUNs are encrypted with LUKS
>
> - LUKS volumes are formatted with OCFS2
>
> - heartbeat=local
>
> - The OCFS2 cluster communicates on a dedicated IP over an InfiniBand
>   direct link (actually a fail-over bonding of the 2 ib interfaces).
>
> - nodes are 2 Debian buster hardware servers (24 cores, 128GB RAM).
>
> - I use corosync for cluster management, but I don't think it's related
>
> - The cluster serves 4 NFS exports (3 are NFSv3, and one is NFSv4 with
>   Kerberos) and Samba shares
>
> - The Samba shares are served in parallel by the 2 nodes, managed by ctdb
>
> - Each NFS export is served by the node that currently holds the dedicated
>   IP for that export.

In fact, I am not very experienced with this kind of setup, where ocfs2 serves as the backend filesystem for NFS/Samba, but the use case looks fine to me. Considering that you have hundreds of clients connected to a single NFS server, though, do you hit any performance limitation? This has nothing to do with the problem, just my personal curiosity.

> - When corosync moves the IP of an NFS export to the other node, it
>   restarts nfsd to issue a grace time of 10s on the new active node.

Sure, this looks like a normal step.

> - The bug appeared with the default Debian buster kernel 4.19.0-9-amd64
>   and with the backported kernel 5.5.0-0.bpo.2-amd64 as well
>
> My cluster went live a few weeks ago, and it first worked with both
> nodes active. During some maintenance work where the active NFS node had
> to switch, I hit a bug that I was able to reproduce against my will
> at least 20 times in the last nights:
>
> 1) around 100 NFS clients are connected to the exports and happily
>    working with them
>
> 2) The IPs of the NFS shares are switched from nodeA to nodeB
>
> 3) nodeB's kernel says "general protection fault: 0000 [#1] SMP PTI"
>
> 4) nodeB's load gets high

I suppose the reason is that ocfs2 can't successfully acquire the spin_lock of a system inode (an ocfs2-internal mechanism used to store its metadata, not visible to end users). As you know, the locking path won't stop trying until it gets the lock, so it consumes a lot of CPU cycles; that is probably the high system load you observed.
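(A tiny user-space analogy of that last point, in case it helps picture the symptom. This is plain pthreads, not ocfs2 code: a spinlock whose owner never releases it, or whose memory is garbage, does not produce an error; its waiters just burn CPU, which is exactly what a climbing system load with no forward progress looks like.)

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* User-space analogy only (not ocfs2 code): a spinlock that is never
 * released does not fail -- every waiter just spins at 100% CPU, which
 * is how a stuck kernel spin_lock shows up as "load" on the node. */

static pthread_spinlock_t lock;

static void *waiter(void *arg)
{
	(void)arg;
	pthread_spin_lock(&lock);   /* never returns: the holder never unlocks */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_lock(&lock);   /* main takes the lock and keeps it */
	pthread_create(&t, NULL, waiter, NULL);

	sleep(5);                   /* watch one core sit at 100% in top */
	printf("waiter still spinning after 5s; no error, just burned CPU\n");
	return 0;
}

(Build with "gcc -pthread"; the point is only that a lost or corrupted spinlock manifests as load, not as a clean failure.)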
> 5) access to the OCFS2 volumes hangs (a clean reboot never finishes, the
>    system cannot unmount the OCFS2 volumes)
>
> 6) I power cycle nodeB
>
> 7) I start nodeA again
>
> 8) nodeA shows the same symptoms as soon as some NFS clients make certain
>    special requests
>
> Most of the time, when the IP jumps to the other node, I get some of
> these messages:
>
> [  501.707566] (nfsd,2133,14):ocfs2_test_inode_bit:2841 ERROR: unable to
> get alloc inode in slot 65535

This is very strange. I think ocfs2 passes the invalid slot number on to the consuming function, which then uses a dangling pointer to a fake/illegal system inode. In other words, this is why ocfs2 can't acquire the spin_lock of the system inode.

Can you try fixing the logic in `ocfs2_get_suballoc_slot_bit` so that it checks whether the slot number is valid, and then run your test case once more?

Feel free to ask for my help if you need it. :-)

> [  501.716660] (nfsd,2133,14):ocfs2_test_inode_bit:2867 ERROR: status = -22

Why this happens, and whether it is a separate issue, needs further investigation.

> [  501.726585] (nfsd,2133,6):ocfs2_test_inode_bit:2841 ERROR: unable to
> get alloc inode in slot 65535
> [  501.735579] (nfsd,2133,6):ocfs2_test_inode_bit:2867 ERROR: status = -22
>
> But it also happens when the node is not crashing, so I think it's
> another problem.
>
> Last night, I saved a few of the kernel outputs while the servers were
> crashing one after the other like in a ping-pong game:
>
> -----8< on nodeB 8<-----
>
> [  502.070431] nfsd: nfsv4 idmapping failing: has idmapd not been started?
> [  502.475747] general protection fault: 0000 [#1] SMP PTI
> [  502.481027] CPU: 6 PID: 2104 Comm: nfsd Not tainted 5.5.0-0.bpo.2-amd64 #1 Debian 5.5.17-1~bpo10+1
> [  502.490016] Hardware name: IBM System x3650 M3 -[7945UHV]-/94Y7614, BIOS -[D6E150CUS-1.11]- 02/08/2011
> [  502.499821] RIP: 0010:_raw_spin_lock+0xc/0x20
> [  502.504206] Code: 66 90 66 66 90 31 c0 ba ff 00 00 00 f0 0f b1 17 75 05 48 89 d8 5b c3 e8 12 5f 91 ff eb f4 66 66 66 66 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 76 46 91 ff 66 90 c3 0f 1f 00 66 66
> [  502.523018] RSP: 0018:ffffa5af4838bac0 EFLAGS: 00010246
> [  502.528277] RAX: 0000000000000000 RBX: 0fdbbeded053c1d6 RCX: 0000000000000000
> [  502.535433] RDX: 0000000000000001 RSI: 0000000000000009 RDI: 0fdbbeded053c25e
> [  502.542588] RBP: 0fdbbeded053c25e R08: ffff9672ffa73068 R09: 000000000002c340
> [  502.549743] R10: 00000184d03a2115 R11: 0000000000000000 R12: ffff9672eff43fd0
> [  502.556899] R13: ffff9672fff66bc0 R14: 000000000000ffff R15: ffff9672fff660c8
> [  502.564055] FS:  0000000000000000(0000) GS:ffff968203a80000(0000) knlGS:0000000000000000
> [  502.572166] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  502.577928] CR2: 00007f0a4bb5df50 CR3: 0000001e4169a004 CR4: 00000000000206e0
> [  502.585082] Call Trace:
> [  502.587546]  igrab+0x19/0x50
> [  502.590484]  ocfs2_get_system_file_inode+0x65/0x2c0 [ocfs2]
> [  502.596094]  ? ocfs2_read_blocks_sync+0x159/0x330 [ocfs2]
> [  502.601534]  ocfs2_test_inode_bit+0xe8/0x900 [ocfs2]
> [  502.606536]  ? ocfs2_free_dir_lookup_result+0x24/0x50 [ocfs2]
> [  502.612321]  ocfs2_get_parent+0xa3/0x300 [ocfs2]
> [  502.616960]  reconnect_path+0xa1/0x2c0
> [  502.620742]  ? nfsd_proc_setattr+0x1b0/0x1b0 [nfsd]
> [  502.625636]  exportfs_decode_fh+0x111/0x2d0
> [  502.629844]  ? exp_find_key+0xcd/0x160 [nfsd]
> [
502.634218]? ? __kmalloc+0x180/0x270 > [? 502.637810]? ? security_prepare_creds+0x6f/0xa0 > [? 502.642356]? ? tomoyo_write_self+0x1b0/0x1b0 > [? 502.646642]? ? security_prepare_creds+0x49/0xa0 > [? 502.651995]? fh_verify+0x3e5/0x5f0 [nfsd] > [? 502.656737]? nfsd3_proc_getattr+0x6b/0x100 [nfsd] > [? 502.662152]? nfsd_dispatch+0x9e/0x210 [nfsd] > [? 502.667136]? svc_process_common+0x386/0x6f0 [sunrpc] > [? 502.672791]? ? svc_sock_secure_port+0x12/0x30 [sunrpc] > [? 502.678628]? ? svc_recv+0x2ff/0x9c0 [sunrpc] > [? 502.683597]? ? nfsd_svc+0x2c0/0x2c0 [nfsd] > [? 502.688395]? ? nfsd_destroy+0x50/0x50 [nfsd] > [? 502.693377]? svc_process+0xd1/0x110 [sunrpc] > [? 502.698361]? nfsd+0xe3/0x140 [nfsd] > [? 502.702561]? kthread+0x112/0x130 > [? 502.706504]? ? kthread_park+0x80/0x80 > [? 502.710886]? ret_from_fork+0x35/0x40 > [? 502.715181] Modules linked in: rpcsec_gss_krb5 nfsd nfs_acl lockd > grace ocfs2 quota_tree sctp libcrc32c nft_counter ipt_REJECT > nf_reject_ipv4 xt_tcpudp nft_compat nf_tables nfnetlink binfmt_misc > ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager > cpufreq_userspace ocfs2_stackglue cpufreq_powersave cpufreq_conservative > configfs bonding dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc > scsi_dh_alua intel_powerclamp coretemp kvm_intel ipmi_ssif kvm cdc_ether > irqbypass usbnet mii tpm_tis intel_cstate tpm_tis_core intel_uncore > joydev tpm ipmi_si ioatdma sg iTCO_wdt ib_mthca iTCO_vendor_support > watchdog dca pcspkr ipmi_devintf rng_core ib_uverbs ipmi_msghandler > i5500_temp i7core_edac evdev ib_umad ib_ipoib ib_cm ib_core auth_rpcgss > sunrpc ip_tables x_tables autofs4 algif_skcipher af_alg ext4 crc16 > mbcache jbd2 crc32c_generic dm_crypt dm_mod sd_mod hid_generic usbhid > hid sr_mod cdrom crct10dif_pclmul crc32_pclmul crc32c_intel > ghash_clmulni_intel mgag200 drm_vram_helper drm_ttm_helper > [? 502.715218]? i2c_algo_bit ttm qla2xxx drm_kms_helper drm ata_generic > ata_piix aesni_intel libata nvme_fc ehci_pci uhci_hcd crypto_simd > ehci_hcd nvme_fabrics cryptd glue_helper lpc_ich nvme_core megaraid_sas > mfd_core usbcore scsi_transport_fc e1000e scsi_mod i2c_i801 usb_common > ptp pps_core bnx2 button > [? 502.837917] ---[ end trace 6e1a8e507ac30d8d ]--- > [? 502.843379] RIP: 0010:_raw_spin_lock+0xc/0x20 > [? 502.848584] Code: 66 90 66 66 90 31 c0 ba ff 00 00 00 f0 0f b1 17 75 > 05 48 89 d8 5b c3 e8 12 5f 91 ff eb f4 66 66 66 66 90 31 c0 ba 01 00 00 > 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 76 46 91 ff 66 90 c3 0f 1f 00 66 66 > [? 502.869295] RSP: 0018:ffffa5af4838bac0 EFLAGS: 00010246 > [? 502.875404] RAX: 0000000000000000 RBX: 0fdbbeded053c1d6 RCX: > 0000000000000000 > [? 502.883423] RDX: 0000000000000001 RSI: 0000000000000009 RDI: > 0fdbbeded053c25e > [? 502.891436] RBP: 0fdbbeded053c25e R08: ffff9672ffa73068 R09: > 000000000002c340 > [? 502.899446] R10: 00000184d03a2115 R11: 0000000000000000 R12: > ffff9672eff43fd0 > [? 502.907455] R13: ffff9672fff66bc0 R14: 000000000000ffff R15: > ffff9672fff660c8 > [? 502.915473] FS:? 0000000000000000(0000) GS:ffff968203a80000(0000) > knlGS:0000000000000000 > [? 502.924496] CS:? 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [? 502.931147] CR2: 00007f0a4bb5df50 CR3: 0000001e4169a004 CR4: > 00000000000206e0 > > - -----8< cut here 8<----- > > > shortly after when trying to start nodeA while nodeB was down : > > - -----8< nodeA 8<----- > > [? 728.927750] general protection fault: 0000 [#1] SMP PTI > [? 728.934647] CPU: 8 PID: 12156 Comm: nfsd Not tainted > 5.5.0-0.bpo.2-amd64 #1 Debian 5.5.17-1~bpo10+1 > [? 
728.944800] Hardware name: IBM System x3650 M3 -[7945H2G]-/69Y4438, > BIOS -[D6E158AUS-1.16]- 11/26/2012 > [? 728.955176] RIP: 0010:_raw_spin_lock+0xc/0x20 > [? 728.960595] Code: 66 90 66 66 90 31 c0 ba ff 00 00 00 f0 0f b1 17 75 > 05 48 89 d8 5b c3 e8 12 5f 91 ff eb f4 66 66 66 66 90 31 c0 ba 01 00 00 > 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 76 46 91 ff 66 90 c3 0f 1f 00 66 66 > [? 728.981559] RSP: 0018:ffffa24d09187ac0 EFLAGS: 00010246 > [? 728.987885] RAX: 0000000000000000 RBX: 625f656369766564 RCX: > 0000000000000000 > [? 728.996135] RDX: 0000000000000001 RSI: 0000000000000009 RDI: > 625f6563697665ec > [? 729.004384] RBP: 625f6563697665ec R08: ffff91f1272e3680 R09: > 000000000002c340 > [? 729.012628] R10: 000002114c00d22b R11: 0000000000000000 R12: > ffff91e13858bbd0 > [? 729.020877] R13: ffff91f13c03bbc0 R14: 000000000000ffff R15: > ffff91f13c03b0c8 > [? 729.029108] FS:? 0000000000000000(0000) GS:ffff91f13fa80000(0000) > knlGS:0000000000000000 > [? 729.038308] CS:? 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [? 729.045161] CR2: 000055fe95d1d9a8 CR3: 0000001c8420a003 CR4: > 00000000000206e0 > [? 729.053415] Call Trace: > [? 729.056978]? igrab+0x19/0x50 > [? 729.061023]? ocfs2_get_system_file_inode+0x65/0x2c0 [ocfs2] > [? 729.067736]? ? ocfs2_read_blocks_sync+0x159/0x330 [ocfs2] > [? 729.074284]? ocfs2_test_inode_bit+0xe8/0x900 [ocfs2] > [? 729.080386]? ? ocfs2_free_dir_lookup_result+0x24/0x50 [ocfs2] > [? 729.087268]? ocfs2_get_parent+0xa3/0x300 [ocfs2] > [? 729.092995]? reconnect_path+0xa1/0x2c0 > [? 729.097863]? ? nfsd_proc_setattr+0x1b0/0x1b0 [nfsd] > [? 729.103840]? exportfs_decode_fh+0x111/0x2d0 > [? 729.109133]? ? exp_find_key+0xcd/0x160 [nfsd] > [? 729.114574]? ? dbs_update_util_handler+0x16/0x90 > [? 729.120273]? ? __kmalloc+0x180/0x270 > [? 729.124920]? ? security_prepare_creds+0x6f/0xa0 > [? 729.130521]? ? tomoyo_write_self+0x1b0/0x1b0 > [? 729.135858]? ? security_prepare_creds+0x49/0xa0 > [? 729.141464]? fh_verify+0x3e5/0x5f0 [nfsd] > [? 729.146536]? nfsd3_proc_getattr+0x6b/0x100 [nfsd] > [? 729.152297]? nfsd_dispatch+0x9e/0x210 [nfsd] > [? 729.157647]? svc_process_common+0x386/0x6f0 [sunrpc] > [? 729.163675]? ? svc_sock_secure_port+0x12/0x30 [sunrpc] > [? 729.169873]? ? svc_recv+0x2ff/0x9c0 [sunrpc] > [? 729.175190]? ? nfsd_svc+0x2c0/0x2c0 [nfsd] > [? 729.180323]? ? nfsd_destroy+0x50/0x50 [nfsd] > [? 729.185631]? svc_process+0xd1/0x110 [sunrpc] > [? 729.190925]? nfsd+0xe3/0x140 [nfsd] > [? 729.195421]? kthread+0x112/0x130 > [? 729.199636]? ? kthread_park+0x80/0x80 > [? 729.204279]? ret_from_fork+0x35/0x40 > [? 
729.208818] Modules linked in: ocfs2 quota_tree dm_crypt crypto_simd > cryptd glue_helper algif_skcipher af_alg nfsd nfs_acl lockd grace sctp > libcrc32c nft_counter xt_tcpudp nft_compat nf_tables nfnetlink > binfmt_misc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager > ocfs2_stackglue configfs bonding dm_round_robin dm_multipath > scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_powerclamp ipmi_ssif > coretemp kvm_intel kvm irqbypass cdc_ether intel_cstate usbnet ib_mthca > mii joydev intel_uncore ib_uverbs iTCO_wdt sg ioatdma > iTCO_vendor_support i7core_edac watchdog i5500_temp pcspkr ipmi_si > tpm_tis tpm_tis_core tpm ipmi_devintf ipmi_msghandler rng_core evdev > acpi_cpufreq ib_umad auth_rpcgss ib_ipoib ib_cm sunrpc ib_core ip_tables > x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic dm_mod > hid_generic sd_mod usbhid hid sr_mod cdrom uas usb_storage mgag200 > drm_vram_helper drm_ttm_helper qla2xxx ata_generic i2c_algo_bit ttm > nvme_fc ixgbe nvme_fabrics ata_piix xfrm_algo ehci_pci uhci_hcd > [? 729.208868]? drm_kms_helper nvme_core ehci_hcd dca libata > megaraid_sas scsi_transport_fc libphy ptp drm usbcore scsi_mod > crc32c_intel lpc_ich i2c_i801 pps_core mfd_core usb_common bnx2 mdio button > [? 729.323900] ---[ end trace 212cefe75207e17f ]--- > [? 729.329644] RIP: 0010:_raw_spin_lock+0xc/0x20 > [? 729.335121] Code: 66 90 66 66 90 31 c0 ba ff 00 00 00 f0 0f b1 17 75 > 05 48 89 d8 5b c3 e8 12 5f 91 ff eb f4 66 66 66 66 90 31 c0 ba 01 00 00 > 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 76 46 91 ff 66 90 c3 0f 1f 00 66 66 > [? 729.356192] RSP: 0018:ffffa24d09187ac0 EFLAGS: 00010246 > [? 729.362584] RAX: 0000000000000000 RBX: 625f656369766564 RCX: > 0000000000000000 > [? 729.370889] RDX: 0000000000000001 RSI: 0000000000000009 RDI: > 625f6563697665ec > [? 729.379184] RBP: 625f6563697665ec R08: ffff91f1272e3680 R09: > 000000000002c340 > [? 729.387475] R10: 000002114c00d22b R11: 0000000000000000 R12: > ffff91e13858bbd0 > [? 729.395767] R13: ffff91f13c03bbc0 R14: 000000000000ffff R15: > ffff91f13c03b0c8 > [? 729.404066] FS:? 0000000000000000(0000) GS:ffff91f13fa80000(0000) > knlGS:0000000000000000 > [? 729.413336] CS:? 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [? 729.420262] CR2: 000055fe95d1d9a8 CR3: 0000001c8420a003 CR4: > 00000000000206e0 > > - -----8< cut here 8<----- > > Since the backtrace is speaking about nfsd, I tried to isolate the > Problem by droping all NFS packet entering to the active node before > starting the services : > > iptables -I INPUT -p tcp --dport 2049 -j DROP > > Everything fine, access over samba were doing fine. > > I started to allow NFS for part of my network with iptables, until the > crash occurs. > > On this one, the crash occured after allowing only a NFSv3 client. > > - -----8< nodeB 8<----- > > [ 2494.893697] BUG: unable to handle page fault for address: > 00000000000fab7d > [ 2494.900626] #PF: supervisor write access in kernel mode > [ 2494.905873] #PF: error_code(0x0002) - not-present page > [ 2494.911042] PGD 0 P4D 0 > [ 2494.913596] Oops: 0002 [#1] SMP PTI > [ 2494.917106] CPU: 11 PID: 21342 Comm: nfsd Not tainted > 5.5.0-0.bpo.2-amd64 #1 Debian 5.5.17-1~bpo10+1 > [ 2494.926285] Hardware name: IBM System x3650 M3 > - -[7945UHV]-/94Y7614???? 
, BIOS -[D6E150CUS-1.11]- 02/08/2011 > [ 2494.936077] RIP: 0010:_raw_spin_lock+0xc/0x20 > [ 2494.940461] Code: 66 90 66 66 90 31 c0 ba ff 00 00 00 f0 0f b1 17 75 > 05 48 89 d8 5b c3 e8 12 5f 91 ff eb f4 66 66 66 66 90 31 c0 ba 01 00 00 > 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 76 46 91 ff 66 90 c3 0f 1f 00 66 66 > [ 2494.959269] RSP: 0018:ffffb949cca7fac0 EFLAGS: 00010246 > [ 2494.964511] RAX: 0000000000000000 RBX: 00000000000faaf5 RCX: > 0000000000000000 > [ 2494.971664] RDX: 0000000000000001 RSI: 0000000000000009 RDI: > 00000000000fab7d > [ 2494.978815] RBP: 00000000000fab7d R08: ffff99c8f12737b8 R09: > 000000000002c340 > [ 2494.985968] R10: 000005dc16e333d8 R11: 0000000000000000 R12: > ffff99c903066bd0 > [ 2494.993119] R13: ffff99c9014c7bc0 R14: 000000000000ffff R15: > ffff99c9014c70c8 > [ 2495.000271] FS:? 0000000000000000(0000) GS:ffff99ba03bc0000(0000) > knlGS:0000000000000000 > [ 2495.008382] CS:? 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 2495.014141] CR2: 00000000000fab7d CR3: 0000000f38df2002 CR4: > 00000000000206e0 > [ 2495.021291] Call Trace: > [ 2495.023753]? igrab+0x19/0x50 > [ 2495.026693]? ocfs2_get_system_file_inode+0x65/0x2c0 [ocfs2] > [ 2495.032301]? ? ocfs2_read_blocks_sync+0x159/0x330 [ocfs2] > [ 2495.037741]? ocfs2_test_inode_bit+0xe8/0x900 [ocfs2] > [ 2495.042742]? ? ocfs2_free_dir_lookup_result+0x24/0x50 [ocfs2] > [ 2495.048525]? ocfs2_get_parent+0xa3/0x300 [ocfs2] > [ 2495.053163]? reconnect_path+0xa1/0x2c0 > [ 2495.056943]? ? nfsd_proc_setattr+0x1b0/0x1b0 [nfsd] > [ 2495.061838]? exportfs_decode_fh+0x111/0x2d0 > [ 2495.066042]? ? exp_find_key+0xcd/0x160 [nfsd] > [ 2495.070417]? ? __kmalloc+0x180/0x270 > [ 2495.074007]? ? security_prepare_creds+0x6f/0xa0 > [ 2495.078551]? ? tomoyo_write_self+0x1b0/0x1b0 > [ 2495.082834]? ? security_prepare_creds+0x49/0xa0 > [ 2495.087387]? fh_verify+0x3e5/0x5f0 [nfsd] > [ 2495.092125]? nfsd3_proc_getattr+0x6b/0x100 [nfsd] > [ 2495.097517]? nfsd_dispatch+0x9e/0x210 [nfsd] > [ 2495.102489]? svc_process_common+0x386/0x6f0 [sunrpc] > [ 2495.108134]? ? svc_sock_secure_port+0x12/0x30 [sunrpc] > [ 2495.113946]? ? svc_recv+0x2ff/0x9c0 [sunrpc] > [ 2495.118883]? ? nfsd_svc+0x2c0/0x2c0 [nfsd] > [ 2495.123651]? ? nfsd_destroy+0x50/0x50 [nfsd] > [ 2495.128604]? svc_process+0xd1/0x110 [sunrpc] > [ 2495.133563]? nfsd+0xe3/0x140 [nfsd] > [ 2495.137733]? kthread+0x112/0x130 > [ 2495.141643]? ? kthread_park+0x80/0x80 > [ 2495.145993]? ret_from_fork+0x35/0x40 > [ 2495.150262] Modules linked in: ocfs2 quota_tree nfsd nfs_acl lockd > grace sctp libcrc32c nft_counter xt_tcpudp nft_compat nf_tables > nfnetlink binfmt_misc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm > ocfs2_nodemanager ocfs2_stackglue cpufreq_userspace cpufreq_powersave > cpufreq_conservative configfs bonding dm_round_robin dm_multipath > scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_powerclamp coretemp > kvm_intel kvm ipmi_ssif irqbypass cdc_ether usbnet intel_cstate mii > ib_mthca intel_uncore evdev iTCO_wdt joydev ipmi_si tpm_tis sg > tpm_tis_core tpm ipmi_devintf iTCO_vendor_support ib_uverbs pcspkr > watchdog ioatdma rng_core i7core_edac dca i5500_temp ipmi_msghandler > ib_umad ib_ipoib auth_rpcgss ib_cm sunrpc ib_core ip_tables x_tables > autofs4 algif_skcipher af_alg ext4 crc16 mbcache jbd2 crc32c_generic > dm_crypt dm_mod sr_mod cdrom sd_mod hid_generic usbhid hid > crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mgag200 > drm_vram_helper drm_ttm_helper i2c_algo_bit ttm qla2xxx drm_kms_helper > [ 2495.150298]? 
ata_generic drm ata_piix aesni_intel libata nvme_fc uhci_hcd crypto_simd nvme_fabrics ehci_pci ehci_hcd megaraid_sas cryptd nvme_core glue_helper usbcore scsi_transport_fc lpc_ich mfd_core e1000e scsi_mod i2c_i801 usb_common bnx2 ptp pps_core button
> [ 2495.269073] CR2: 00000000000fab7d
> [ 2495.273205] ---[ end trace 44f000505f987296 ]---
> [ 2495.278647] RIP: 0010:_raw_spin_lock+0xc/0x20
> [ 2495.283825] Code: 66 90 66 66 90 31 c0 ba ff 00 00 00 f0 0f b1 17 75 05 48 89 d8 5b c3 e8 12 5f 91 ff eb f4 66 66 66 66 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 89 c6 e8 76 46 91 ff 66 90 c3 0f 1f 00 66 66
> [ 2495.304287] RSP: 0018:ffffb949cca7fac0 EFLAGS: 00010246
> [ 2495.310368] RAX: 0000000000000000 RBX: 00000000000faaf5 RCX: 0000000000000000
> [ 2495.318358] RDX: 0000000000000001 RSI: 0000000000000009 RDI: 00000000000fab7d
> [ 2495.326339] RBP: 00000000000fab7d R08: ffff99c8f12737b8 R09: 000000000002c340
> [ 2495.334314] R10: 000005dc16e333d8 R11: 0000000000000000 R12: ffff99c903066bd0
> [ 2495.342297] R13: ffff99c9014c7bc0 R14: 000000000000ffff R15: ffff99c9014c70c8
> [ 2495.350279] FS:  0000000000000000(0000) GS:ffff99ba03bc0000(0000) knlGS:0000000000000000
> [ 2495.359224] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2495.365825] CR2: 00000000000fab7d CR3: 0000000f38df2002 CR4: 00000000000206e0
>
> -----8< cut here 8<-----
>
> I also made some recordings of NFS traffic with tcpdump while the crashes
> were occurring, but I haven't had time to analyse them yet. I can provide
> them if it helps (but I might have to clean them up first to avoid some
> data leaks :)
>
> After this happened a few times, I had one case where one of the few
> NFSv4 clients was involved, and as a workaround I migrated the NFSv4 share
> to XFS, since I thought it could be NFSv4-related. Now that I have looked
> at the kernel dumps, I think it is NFSv3-related as well.
>
> Is it a kernel bug or am I missing something?

Probably an ocfs2-internal error-handling bug. :-(

> For now, I will switch back to XFS with a failover setup of my cluster,
> because I cannot afford this kind of crash on a central file server.

Interesting, do you mean you can use XFS as an NFSv4 backend filesystem, formatted on a shared SCSI LUN with more than one host attached, without filesystem corruption?

Thanks,
Changwei

> Cheers
>
> Pierre
>
> PS: I'll try to update with what I found in the PCAP files in some days.
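(A note for anyone following the thread: below is a rough sketch of the kind of slot-number check Changwei suggests. The names used here -- ocfs2_get_suballoc_slot_bit, ocfs2_test_inode_bit, ocfs2_get_system_file_inode, osb->max_slots, OCFS2_INVALID_SLOT, mlog() -- are taken from the in-tree fs/ocfs2 sources of that era, but the exact placement and how it interacts with the checks already present in the tree are assumptions. Treat it as an untested illustration, not a patch.)

/*
 * Illustrative only: validate the suballocator slot read off disk before
 * it is used to look up a per-slot system inode.  In the traces above the
 * slot arrives as 65535, which is (u16)OCFS2_INVALID_SLOT.
 */
static bool ocfs2_suballoc_slot_is_valid(struct ocfs2_super *osb,
					 u64 blkno, u16 suballoc_slot)
{
	/*
	 * The sentinel value is legal, but it means "allocated from the
	 * global allocator" and must be handled explicitly by the caller;
	 * it must never be used as an index into a per-slot table.
	 */
	if (suballoc_slot == (u16)OCFS2_INVALID_SLOT)
		return true;

	/* Anything else has to name one of the slots the volume was formatted with. */
	if (suballoc_slot >= osb->max_slots) {
		mlog(ML_ERROR,
		     "inode %llu claims suballoc slot %u, volume only has %u slots\n",
		     (unsigned long long)blkno, suballoc_slot, osb->max_slots);
		return false;
	}

	return true;
}

A caller such as ocfs2_test_inode_bit() would then bail out with -EINVAL when this returns false, instead of handing the bogus slot to ocfs2_get_system_file_inode(), whose per-slot lookup is presumably what produces the wild pointer that igrab()/_raw_spin_lock() trip over in the traces above.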
Pierre Dinh-van
2020-May-21 17:02 UTC
[Ocfs2-devel] NFS clients crash OCFS2 nodes (general protection fault: 0000 [#1] SMP PTI)
On 5/20/20 4:52 AM, Changwei Ge wrote:
> Hi Pierre,

Hi Changwei,

> You have provided very detailed information on this problem, and the
> research looks solid.

Thank you.

> On 5/19/20 6:48 PM, Pierre Dinh-van wrote:
>> Hi,
>>
>> I'm experiencing quite reproducible crashes on my OCFS2 cluster in the last
>> few weeks, and I think I have found part of the problem. But before I file a
>> bug report (I'm not sure where), I thought I would post my research here
>> in case someone understands it better than I do.
>>
>> When it happens, the kernel gives me a 'general protection fault:
>> 0000 [#1] SMP PTI' or 'BUG: unable to handle page fault for address:'
>> message, local access to the OCFS2 volumes hangs forever, and the load
>> gets so high that only a hard reset "solves" the problem. It happened
>> the same way on both nodes (always on the active node serving NFS).
>> It happens when both nodes are up, and when only one is active.
>
> This makes me a little puzzled: do you mean you crash the kernel manually,
> or does ocfs2 crash the kernel by itself? From the backtrace you posted, I
> had assumed that you crash the kernel manually to migrate the NFS export IP
> to the other standby node, right?

I'm not crashing the kernel manually (I wouldn't know how to). Ocfs2 is crashing the kernel as soon as some NFS requests come in, once the node has started serving NFS. Maybe I'm using the term "crash the kernel" the wrong way: actually, only the OCFS2 part of the system stops responding after this crash. I can still log in with SSH and try to do things.

>> First, a few words about my setup:
>
> In fact, I am not very experienced with this kind of setup, where ocfs2
> serves as the backend filesystem for NFS/Samba, but the use case looks fine
> to me. Considering that you have hundreds of clients connected to a single
> NFS server, though, do you hit any performance limitation? This has nothing
> to do with the problem, just my personal curiosity.

Most of the NFS clients are Linux desktop clients, some MacOS X too. The accesses are not really I/O intensive, so I don't have performance issues from having only one active node. Actually, the first performance problem we had was o2cb speaking over a 1Gb Ethernet link, which increased lstat() times a lot compared to the XFS we had before. The cluster setup was mostly designed to give us an easy failover system, and OCFS2 gave the impression of reducing the risk of a corrupt filesystem (and allowed Samba to be load-balanced too).

>> 4) nodeB's load gets high
>
> I suppose the reason is that ocfs2 can't successfully acquire the
> spin_lock of a system inode (an ocfs2-internal mechanism used to store its
> metadata, not visible to end users). As you know, the locking path won't
> stop trying until it gets the lock, so it consumes a lot of CPU cycles;
> that is probably the high system load you observed.

So a possible path would be something like this? (I don't know much about the internals of NFS locking and POSIX locking, so it's a guess): an NFS client holds a lock -> the server is rebooted and forgets about the locks -> the NFS client requests the lock again (maybe ignoring the grace time, or racing with the nfsd restart) -> nfsd asks the filesystem about an unknown lock and somehow crashes the OCFS2 module.

>> Most of the time, when the IP jumps to the other node, I get some of
>> these messages:
>>
>> [  501.707566] (nfsd,2133,14):ocfs2_test_inode_bit:2841 ERROR: unable to
>> get alloc inode in slot 65535
>
> This is very strange. I think ocfs2 passes the invalid slot number on to
> the consuming function, which then uses a dangling pointer to a
> fake/illegal system inode. In other words, this is why ocfs2 can't acquire
> the spin_lock of the system inode.
>
> Can you try fixing the logic in `ocfs2_get_suballoc_slot_bit` so that it
> checks whether the slot number is valid, and then run your test case once
> more?
>
> Feel free to ask for my help if you need it. :-)

I have never looked at the OCFS2 code and I'm not familiar with filesystem internals. (I actually wrote to ocfs2-devel because ocfs2-users is not letting my mails go through :).) For testing the case, I will try to do something on the standby node with the original ocfs2 volumes, as soon as I'm sure it is safe for the production server. As for checking the logic in ocfs2_get_suballoc_slot_bit, I looked at the code and don't really understand it.

>> [  501.716660] (nfsd,2133,14):ocfs2_test_inode_bit:2867 ERROR: status = -22
>
> Why this happens, and whether it is a separate issue, needs further
> investigation.
>
>> [  501.726585] (nfsd,2133,6):ocfs2_test_inode_bit:2841 ERROR: unable to
>> get alloc inode in slot 65535
>> [  501.735579] (nfsd,2133,6):ocfs2_test_inode_bit:2867 ERROR: status = -22
>>
>> But it also happens when the node is not crashing, so I think it's
>> another problem.
>>
>> Last night, I saved a few of the kernel outputs while the servers were
>> crashing one after the other like in a ping-pong game:
>>
>> -----8< on nodeB 8<-----
>>
>> [  502.070431] nfsd: nfsv4 idmapping failing: has idmapd not been started?
>> [  502.475747] general protection fault: 0000 [#1] SMP PTI
>> [  502.481027] CPU: 6 PID: 2104 Comm: nfsd Not tainted
>> 5.5.0-0.bpo.2-amd64 #1 Debian 5.5.17-1~bpo10+1
>> []...
>> I also made some recordings of NFS traffic with tcpdump while the crashes
>> were occurring, but I haven't had time to analyse them yet. I can provide
>> them if it helps (but I might have to clean them up first to avoid some
>> data leaks :)
>>
>> After this happened a few times, I had one case where one of the few
>> NFSv4 clients was involved, and as a workaround I migrated the NFSv4 share
>> to XFS, since I thought it could be NFSv4-related. Now that I have looked
>> at the kernel dumps, I think it is NFSv3-related as well.
>>
>> Is it a kernel bug or am I missing something?
>
> Probably an ocfs2-internal error-handling bug. :-(

I'll check whether I can reproduce it in a safe testing environment. Is there anything I should also do while reproducing it to get a better understanding of the failure path? I have never done kernel debugging before.

>> For now, I will switch back to XFS with a failover setup of my cluster,
>> because I cannot afford this kind of crash on a central file server.
>
> Interesting, do you mean you can use XFS as an NFSv4 backend filesystem,
> formatted on a shared SCSI LUN with more than one host attached, without
> filesystem corruption?

I switched back to a single-active-node setup. The second node is in standby and doesn't touch the XFS volumes until the first node is no longer using them. That should be safe for now.

Cheers

Pierre
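(One last aside on the traces, which may help whoever digs further: in the nodeA oops from the first mail, the value the spinlock code faulted on -- RBX 625f656369766564, RBP/RDI 625f6563697665ec -- consists almost entirely of printable ASCII, which fits the dangling/wild-pointer theory rather than, say, bad hardware. Decoding it needs nothing beyond standard C:)

#include <stdio.h>
#include <stdint.h>

/*
 * Decode a register value from the oops as the bytes it would occupy in
 * memory (x86-64 is little-endian).  The value below is RBX from the
 * nodeA "general protection fault" trace earlier in this thread.
 */
int main(void)
{
	uint64_t rbx = 0x625f656369766564ULL;

	printf("RBX bytes as chars: ");
	for (int i = 0; i < 8; i++) {
		unsigned char c = (rbx >> (8 * i)) & 0xff;
		printf("%c", (c >= 0x20 && c < 0x7f) ? c : '.');
	}
	printf("\n");   /* prints: device_b */
	return 0;
}

It prints "device_b", i.e. the "pointer" looks like a fragment of string data rather than a struct inode address. The corresponding registers in the other two oopses do not decode to anything readable, so treat this as a hint that supports the invalid-slot/dangling-pointer explanation, not as proof of it.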