Frederik Ferner
2009-Oct-12 16:06 UTC
[Lustre-discuss] soft lockups on NFS server/Lustre client
Hi List,

on our NFS server exporting our Lustre file system to a number of NFS
clients, we've recently started to see "kernel: BUG: soft lockup"
messages. As the locked processes include nfsd, our users are obviously
not happy.

Around the time when the soft lockup occurs we also see a lot of
"kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags()"
messages, but I don't know if this is related.

We are using Lustre 1.6.6 on all machines (MDS, OSS, clients). The NFS
server/Lustre client with the lockups is running RHEL5.4 with an
unpatched Red Hat kernel (kernel-2.6.18-92.1.10.el5) and the Lustre
modules from Sun.

See below for sample logs from the Lustre client/NFS server. I can
provide more logs if required.

I'm not sure if this is a Lustre issue, but I would appreciate it if
someone could help. We've not seen it on any other NFS server so far,
and there seems to be at least some Lustre-related stuff in the stack
trace.

Is this a known issue, and how can we avoid it? I have not found
anything using Google or the search on bugzilla.lustre.org. At least
the BUG warning seems to be a known issue on this kernel.

I hope the logs below are readable enough; I tried to find entries where
the stack traces don't overlap, but this seems to be the best I can find.

Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags() (Tainted: G )
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed7d1>] set_dentry_child_flags+0xef/0x14d
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed867>] remove_watch_no_event+0x38/0x47
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed88e>] inotify_remove_watch_locked+0x18/0x3b
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed97c>] inotify_rm_wd+0x7e/0xa1
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ede6e>] sys_inotify_rm_watch+0x46/0x63
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags() (Tainted: G )
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed7d1>] set_dentry_child_flags+0xef/0x14d
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed867>] remove_watch_no_event+0x38/0x47
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed88e>] inotify_remove_watch_locked+0x18/0x3b
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ed97c>] inotify_rm_wd+0x7e/0xa1
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [<ffffffff800ede6e>] sys_inotify_rm_watch+0x46/0x63
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [nfsd:22221]
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CPU 5:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Modules linked in: vfat fat usb_storage dell_rbu mptctl ipmi_devintf ipmi_si ipmi_msghandler nfs fscache nfsd exportfs lockd nfs_acl auth_rpcgss autofs4 hidp mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api mlx4_en(U) dm_multipath video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sr_mod cdrom mlx4_core(U) bnx2 serio_raw pcspkr sg dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Pid: 22221, comm: nfsd Tainted: G 2.6.18-92.1.10.el5 #1
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RIP: 0010:[<ffffffff80064ba7>]  [<ffffffff80064ba7>] .text.lock.spinlock+0x5/0x30
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RSP: 0018:ffff810044241ac8  EFLAGS: 00000286
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RAX: ffff81006cb6a1a8 RBX: ffff81006cb6a178 RCX: ffff810044241b50
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RDX: 0000000000000000 RSI: ffff810044241c90 RDI: ffffffff803c7480
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RBP: ffff81005d609e90 R08: 0000000000000001 R09: ffff810044241b50
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: R10: ffffffff887cf72a R11: 00000000000189ef R12: 000000a800000000
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: R13: ffff810044241c90 R14: 0000000000000000 R15: ffffffff8001d54c
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: FS:  00002b637558e6e0(0000) GS:ffff810037c0c540(0000) knlGS:0000000000000000
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CR2: 00002b473a3a4000 CR3: 000000006934d000 CR4: 00000000000006e0
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Call Trace:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8004fc47>] d_find_alias+0x1c/0x38
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff800e271e>] d_alloc_anon+0xc/0xf8
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff88703338>] :lustre:ll_iget_for_nfs+0x608/0x7e0
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887c2366>] :exportfs:find_exported_dentry+0x43/0x47b
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cf72a>] :nfsd:nfsd_acceptable+0x0/0xd8
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d35ff>] :nfsd:exp_get_by_name+0x5b/0x71
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d3bee>] :nfsd:exp_find_key+0x89/0x9c
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8002e28d>] __wake_up+0x38/0x4f
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8009884c>] set_current_groups+0x159/0x164
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887c27e9>] :exportfs:export_decode_fh+0x4b/0x52
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cfaa4>] :nfsd:fh_verify+0x2a2/0x4c6
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8008a788>] __activate_task+0x27/0x39
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d00ca>] :nfsd:nfsd_access+0x29/0xfc
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887d76d0>] :nfsd:nfsd3_proc_access+0xa4/0xb0
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd1db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8839a4fb>] :sunrpc:svc_process+0x454/0x71b
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff800645ec>] __down_read+0x12/0x92
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd5a1>] :nfsd:nfsd+0x0/0x2cb
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd746>] :nfsd:nfsd+0x1a5/0x2cb
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd5a1>] :nfsd:nfsd+0x0/0x2cb
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff887cd5a1>] :nfsd:nfsd+0x0/0x2cb
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:

Kind regards,
Frederik
--
Frederik Ferner
Computer Systems Administrator        phone: +44 1235 77 8624
Diamond Light Source Ltd.             mob:   +44 7917 08 5110
(Apologies in advance for the lines below. Some bits are a legal
requirement and I have no control over them.)
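[For context: the setup described above is an ordinary Lustre client that
re-exports its mount point over NFS. A minimal sketch of such an export;
the mount point, subnet and fsid below are hypothetical placeholders, and
an explicit fsid is commonly needed when exporting a network filesystem
since there is no local block device to derive one from:

    # /etc/exports on the Lustre client / NFS server (placeholders throughout)
    /mnt/lustre01  192.168.100.0/24(rw,sync,no_subtree_check,fsid=101)

    # re-read the exports table without restarting nfsd
    exportfs -ra
]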
Robin Humble
2009-Oct-19 04:31 UTC
[Lustre-discuss] soft lockups on NFS server/Lustre client
On Mon, Oct 12, 2009 at 05:06:28PM +0100, Frederik Ferner wrote:
>Hi List,
>
>on our NFS server exporting our Lustre file system to a number of NFS
>clients, we've recently started to see "kernel: BUG: soft lockup"
>messages. As the locked processes include nfsd, our users are obviously
>not happy.
>
>Around the time when the soft lockup occurs we also see a lot of
>"kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags()"
>messages, but I don't know if this is related.

probably not related. we were seeing this too (no NFS involved at all)
  https://bugzilla.lustre.org/show_bug.cgi?id=20904
and the upshot is that I'm pretty sure it's harmless and a RHEL bug.
I filed
  https://bugzilla.redhat.com/show_bug.cgi?id=526853
but it's probably being ignored. if you have a rhel support contract
maybe you can kick it along a bit...

dunno about your soft lockups. as I understand it soft lockups
themselves aren't harmful as long as they progress eventually.

Lustre 1.6.6 isn't exactly recent. have you tried 1.6.7.2 on your NFS
exporter?

presumably soft lockups could also be saying your re-exporter or OSS's
are overloaded or that you have a slow disk or 3 in a RAID... without
NFS involved are all your OSTs up to speed?

do you still get problems after
  echo 60 > /proc/sys/kernel/softlockup_thresh

cheers,
robin

>We are using Lustre 1.6.6 on all machines (MDS, OSS, clients). The NFS
>server/Lustre client with the lockups is running RHEL5.4 with an
>unpatched Red Hat kernel (kernel-2.6.18-92.1.10.el5) and the Lustre
>modules from Sun.

[snip: remainder of quoted message and kernel logs]
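[Robin's two suggestions can be tried non-destructively from a shell on
the re-exporting client. A sketch for a RHEL5-era 2.6.18 kernel like the
one in this thread; the sysctl key name kernel.softlockup_thresh is
assumed from the /proc path, and lfs check is the 1.6-era client-side
server health check:

    # current soft-lockup detection threshold, in seconds
    cat /proc/sys/kernel/softlockup_thresh

    # raise it to 60s for the running kernel, as suggested above
    echo 60 > /proc/sys/kernel/softlockup_thresh

    # to survive reboots, the same setting can go in /etc/sysctl.conf
    # (assumed key name): kernel.softlockup_thresh = 60

    # quick sanity check that the MDS and all OSTs are active as seen
    # from this client, for the "are all your OSTs up to speed?" question
    lfs check servers
]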
Frederik Ferner
2009-Oct-20 12:00 UTC
[Lustre-discuss] soft lockups on NFS server/Lustre client
Robin Humble wrote:
> On Mon, Oct 12, 2009 at 05:06:28PM +0100, Frederik Ferner wrote:
>> on our NFS server exporting our Lustre file system to a number of NFS
>> clients, we've recently started to see "kernel: BUG: soft lockup"
>> messages. As the locked processes include nfsd, our users are obviously
>> not happy.
>>
>> Around the time when the soft lockup occurs we also see a lot of
>> "kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags()"
>> messages, but I don't know if this is related.
>
> probably not related. we were seeing this too (no NFS involved at all)

I may have been looking at slightly the wrong thing here. It was first
reported by our users as an NFS problem, but it now seems to be
triggered by Samba access to some directories on Lustre. We've
separated the Samba server from the NFS server, and now we only see
this on the Samba server and not on the NFS server.

> https://bugzilla.redhat.com/show_bug.cgi?id=526853
> but it's probably being ignored. if you have a rhel support contract
> maybe you can kick it along a bit...

I see this has been closed as a duplicate of
https://bugzilla.redhat.com/show_bug.cgi?id=499019
which is unfortunately not accessible to me.

On the other hand, Red Hat support have just pointed me at this bug as
well and confirmed that it is not yet fixed in RHEL5.4.

> dunno about your soft lockups. as I understand it soft lockups
> themselves aren't harmful as long as they progress eventually.

Well, they are not harmful as such; my problem is that they seem to
block the machine for some time, and users complained about
applications timing out when this affected the file system.

> Lustre 1.6.6 isn't exactly recent. have you tried 1.6.7.2 on your NFS
> exporter?

I know; until recently we did not have any real problems with 1.6.6,
and the machines are in production. I'm currently trying to reproduce
it in our test setup and may try 1.6.7.2 with an additional test
machine on the production system as Samba exporter during the next
maintenance window. On the other hand, it's now really looking like a
RHEL bug, so I'm not too sure how much it would help.

> presumably soft lockups could also be saying your re-exporter or OSS's
> are overloaded or that you have a slow disk or 3 in a RAID... without
> NFS involved are all your OSTs up to speed?

I think that the OSTs are not the problem here, as I'm not experiencing
any problems on any of my other Lustre clients, and now not anymore on
the NFS server, which is seeing more load than the Samba server.

> do you still get problems after
>   echo 60 > /proc/sys/kernel/softlockup_thresh

After applying this on the Samba server, I only see the BUG warnings
and not the soft lockups in syslog. Still, my Windows clients seem to
freeze occasionally for about a minute when browsing the exported file
system, so no change on the client side.

Cheers,
Frederik
--
Frederik Ferner
Computer Systems Administrator        phone: +44 1235 77 8624
Diamond Light Source Ltd.             mob:   +44 7917 08 5110
(Apologies in advance for the lines below. Some bits are a legal
requirement and I have no control over them.)
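[The Samba side of this kind of setup is nothing exotic: just an
ordinary share sitting on the Lustre client mount. A minimal sketch;
the share name and path below are hypothetical placeholders:

    # /etc/samba/smb.conf fragment (placeholders)
    [lustre01]
        path = /mnt/lustre01
        read only = no

    # check that the configuration parses cleanly
    testparm -s
]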
Frederik Ferner
2009-Nov-20 16:17 UTC
[Lustre-discuss] soft lockups on NFS server/Lustre client
Hi All,

just a quick follow-up on this.

Frederik Ferner wrote:
> Robin Humble wrote:
[snip]
> I see this has been closed as a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=499019
> which is unfortunately not accessible to me.
>
> On the other hand, Red Hat support have just pointed me at this bug as
> well and confirmed that it is not yet fixed in RHEL5.4.

As you can see in this bug, Red Hat have provided a test kernel which
we've been using on a number of machines without being able to
reproduce the problem.

>> Lustre 1.6.6 isn't exactly recent. have you tried 1.6.7.2 on your NFS
>> exporter?

Now we've tried 1.6.7.2 on the NFS/Samba exporter, and we still saw the
soft lockups until we upgraded to the test kernel mentioned above.

NB we've had to upgrade the Samba exporters to 1.6.7.2 anyway after we
turned on flock and hit an LBUG there that is fixed in 1.6.7.2.

So to summarize: the soft lockups are a bug in the RHEL kernel and are
hopefully going to be fixed in an official update.

Kind regards,
Frederik
--
Frederik Ferner
Computer Systems Administrator        phone: +44 1235 77 8624
Diamond Light Source Ltd.             mob:   +44 7917 08 5110
(Apologies in advance for the lines below. Some bits are a legal
requirement and I have no control over them.)
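[The flock referred to above is the Lustre client mount option that
enables coherent, cluster-wide flock semantics, which Samba relies on.
A sketch; the MGS node name, filesystem name and mount point below are
hypothetical placeholders:

    # coherent flock across all clients (needed e.g. by Samba)
    mount -t lustre -o flock mgs01@tcp0:/lustre01 /mnt/lustre01

    # alternative: localflock gives client-local, non-coherent flock
    # with less overhead, if cross-client locking isn't required
    # mount -t lustre -o localflock mgs01@tcp0:/lustre01 /mnt/lustre01
]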