Hi all,

in a recent shutdown of our Lustre cluster (net reconfig, version upgrade to 1.6.7_patched), I decided to try to switch on quotas - this had failed when the cluster went operational last year.

Again, I ran into the same error as last year - failure with "device/resource busy". This time I was sure there was no activity at all on the system. But on the MDS I observed a steep increase of the machine load, up to values of 70, and the machine reacted very slowly. It is, however, an 8-core Xeon server with 32 GB RAM and Raptor disks, and in normal operation this machine has never shown any sign of overloading, no matter what our users do.

Nevertheless, the Lustre log complained about connection losses to some OSTs (at least one was set inactive), Heartbeat, which controls the IP of the MGS, complained about timeouts, and so did DRBD, which mirrors the MGS and MDT disks to a slave server. Probably the machine simply lost its own eth0/1/2/3/4 network interfaces, which are used by these services.

After 30 minutes, the "lfs quotacheck -ug /lustre" command aborted with the said errors. The same thing happened again today, when we gave it another try. This time we unmounted Lustre, of course removed all Lustre modules, mounted it again and repeated the quotacheck. Similar behavior on the MDS, but this time the command ran through, the services recovered, Lustre survived and was mountable, and - the quotas seem to work.

So, after this lengthy intro, my question: is this extreme loading or overloading of the MDS during quotacheck a "normal" feature?

Is there a connection to the fact that the filesystem is already 75% full, at 128 TB?

We have 68 OSTs, half of them 2.3 TB, half of them 2.7 TB. All servers run Debian Etch 64-bit, kernel 2.6.22.

Regards,
Thomas
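For reference, a minimal sketch of the sequence being attempted here, assuming a Lustre 1.6 filesystem named "lustre" mounted at /lustre; the conf_param lines follow the usual 1.6 quota setup and are an assumption, they are not taken from the message above:

  # On the MGS: enable user/group quota accounting for MDT and OSTs
  # (assumed 1.6-style parameters, not quoted from this thread)
  lctl conf_param lustre.mdt.quota_type=ug
  lctl conf_param lustre.ost.quota_type=ug

  # From an otherwise idle client: build the quota files by scanning
  # the whole filesystem - this is the step that loads the MDS and OSTs
  lfs quotacheck -ug /lustre

  # Afterwards, usage can be inspected per user
  lfs quota -u someuser /lustre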
Thomas,

Were there other clients mounted when you ran "lfs quotacheck"? Could there have been a difference in client activity between your two attempts?

-Nathan

----- Original Message -----
From: Thomas Roth <t.roth at gsi.de>
Date: Friday, April 24, 2009 7:28 am
Subject: [Lustre-discuss] quotacheck blows up MDT
No, in all cases - 2 days ago and today's 2 attempts - I made sure there were no client mounts except for my one "administration" host. I checked all exports on the MDS, logged in to potential clients, unmounted and removed the Lustre modules, and informed the users - and our batch farm was down anyhow, so the number of potential clients was rather small.

Thomas

Nathan.Dauchy at noaa.gov wrote:
> Were there other clients mounted when you ran "lfs quotacheck"? Could
> there have been a difference in client activity between your two attempts?

--
Thomas Roth
Department: Informationstechnologie
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, D-64291 Darmstadt
www.gsi.de
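A rough sketch of the kind of pre-flight check described above, assuming the client mount point is /lustre and the MDT device is named lustre-MDT0000 (both are placeholders; the /proc path is the typical 1.6 layout and may differ on other builds):

  # On every potential client: unmount and unload the Lustre modules
  umount /lustre
  lustre_rmmod

  # On the MDS: list the NIDs that still hold an export to the MDT
  # (assumed 1.6 /proc layout)
  ls /proc/fs/lustre/mds/lustre-MDT0000/exports/

  # Only then run the quotacheck from the single remaining admin client
  lfs quotacheck -ug /lustre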
I was not able to get quota working on Lustre 1.6.7.1 either. I got the following stack trace on all the OSS servers when I was trying to run a quotacheck with the latest Lustre-patched kernel.

Apr 21 15:49:11 lustre2 kernel: LustreError: 15591:0:(client.c:547:ptlrpc_prep_req_pool()) ASSERTION(imp != LP_POISON) failed
Apr 21 15:49:11 lustre2 kernel: LustreError: 15591:0:(client.c:547:ptlrpc_prep_req_pool()) LBUG
Apr 21 15:49:11 lustre2 kernel: Lustre: 15591:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 15591
Apr 21 15:49:11 lustre2 kernel: quotacheck R running task 0 15591 1 15592 15590 (L-TLB)
Apr 21 15:49:11 lustre2 kernel: 0000000000000000 0000000000000001 0000000000000086 0000000000000001
Apr 21 15:49:11 lustre2 kernel: ffff8100881d19d0 ffffffff8002e15a 0000000000000046 ffff8100881d1a60
Apr 21 15:49:11 lustre2 kernel: ffffffff887c43b0 ffffffff8049a5c0 0000000000000060 ffffffff887df9b0
Apr 21 15:49:11 lustre2 kernel: Call Trace:
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8002e15a>] __wake_up+0x38/0x4f
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8009ddc3>] autoremove_wake_function+0x9/0x2e
Apr 21 15:49:11 lustre2 kernel: [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8002e15a>] __wake_up+0x38/0x4f
Apr 21 15:49:11 lustre2 last message repeated 2 times
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8008fdc9>] printk+0x52/0xbd
Apr 21 15:49:11 lustre2 kernel: [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Apr 21 15:49:11 lustre2 kernel: [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8008fd2b>] vprintk+0x290/0x2dc
Apr 21 15:49:11 lustre2 kernel: [<ffffffff800a54eb>] kallsyms_lookup+0xc2/0x17b
Apr 21 15:49:11 lustre2 last message repeated 3 times
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8006b77d>] printk_address+0x9f/0xab
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8008fd00>] vprintk+0x265/0x2dc
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8008fdc9>] printk+0x52/0xbd
Apr 21 15:49:11 lustre2 kernel: [<ffffffff800a3072>] module_text_address+0x33/0x3c
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8009c34c>] kernel_text_address+0x1a/0x26
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8006b463>] dump_trace+0x211/0x23a
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8006b4c0>] show_trace+0x34/0x47
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8006b5c5>] _show_stack+0xdb/0xea
Apr 21 15:49:11 lustre2 kernel: [<ffffffff887b9ada>] :libcfs:lbug_with_loc+0x7a/0xd0
Apr 21 15:49:11 lustre2 kernel: [<ffffffff887c1c40>] :libcfs:tracefile_init+0x0/0x110
Apr 21 15:49:11 lustre2 kernel: [<ffffffff88904739>] :ptlrpc:ptlrpc_prep_req_pool+0xc9/0x6b0
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8002ca21>] mntput_no_expire+0x19/0x89
Apr 21 15:49:11 lustre2 kernel: [<ffffffff88904d31>] :ptlrpc:ptlrpc_prep_req+0x11/0x20
Apr 21 15:49:11 lustre2 kernel: [<ffffffff88a0d43e>] :lquota:target_quotacheck_thread+0x15e/0x3e0
Apr 21 15:49:11 lustre2 kernel: [<ffffffff80031caa>] end_buffer_read_sync+0x0/0x22
Apr 21 15:49:11 lustre2 kernel: [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Apr 21 15:49:11 lustre2 kernel: [<ffffffff88a0d2e0>] :lquota:target_quotacheck_thread+0x0/0x3e0
Apr 21 15:49:11 lustre2 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

Nirmal
On Fri, Apr 24, 2009 at 10:48:11AM -0500, Nirmal Seenu wrote:
> I was not able to get quota working on Lustre 1.6.7.1 either. I got the
> following stack trace on all the OSS servers when I was trying to run a
> quotacheck with the latest Lustre-patched kernel.
>
> Apr 21 15:49:11 lustre2 kernel: LustreError:
> 15591:0:(client.c:547:ptlrpc_prep_req_pool()) ASSERTION(imp !=
> LP_POISON) failed

Have you finally applied the patch from bug 18126? As I said earlier, it should fix this problem.

Johann
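For anyone following along, a rough sketch of how such a fix would typically be applied to a 1.6.7.1 source tree; the patch file name is only a placeholder for the attachment on bug 18126, and the configure flags are an assumption, not taken from this thread:

  # Unpack the Lustre 1.6.7.1 source and apply the fix from bug 18126
  # ("bz18126.patch" is a placeholder name for the bug attachment)
  cd lustre-1.6.7.1
  patch -p1 < /path/to/bz18126.patch

  # Rebuild against the already-patched server kernel, then reinstall
  # on the OSS/MDS nodes (kernel source path is assumed)
  ./configure --with-linux=/usr/src/linux-2.6.22-lustre
  make
  make install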
I had quota working on 1.6.6 with that patch. I assumed that this patch would have landed in 1.6.7 and 1.6.7.1; I didn't realize that it landed only in 1.6.8 and 1.8.0. I will try applying this patch again to see if it fixes the problem.

Thanks
Nirmal

Johann Lombardi wrote:
> Have you finally applied the patch from bug 18126?
> As I said earlier, it should fix this problem.
On Fri, Apr 24, 2009 at 03:28:39PM +0200, Thomas Roth wrote:
> So, after this lengthy intro, my question: is this extreme loading or
> overloading of the MDS during quotacheck a "normal" feature?

Yes, quotacheck is quite an expensive operation. It scans the whole filesystem (sub-quotachecks are run on both the MDS and the OSTs) to recompute the disk usage (for both inodes and blocks) on a per-uid/gid basis. Fortunately, we don't have to run quotacheck often.

> Is there a connection to the fact that the filesystem is already 75%
> full, at 128 TB?

Yes, the bigger the OSTs/MDS, the longer it takes :(

Cheers,
Johann
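Once quotacheck has completed, the recomputed accounting can be checked and limits assigned from a client. A minimal sketch, assuming a user "alice", a group "users", and the mount point /lustre (all placeholders), with the limit values given purely as examples:

  # Check the recomputed usage for a user and a group
  lfs quota -u alice /lustre
  lfs quota -g users /lustre

  # Assign limits: block soft/hard limits in kB, then inode soft/hard
  # limits (positional syntax as used in the 1.6 series; newer releases
  # take -b/-B/-i/-I options instead - treat this as an assumption)
  lfs setquota -u alice 1000000 1100000 10000 11000 /lustre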