Hi,

I have a machine with 20 TB of local storage (internal SATA drives). It has three 6.7 TB arrays, each of which is configured as an OST and part of a larger LOV. I'll call this "machine B". Machine B has 16 GB of memory. Not only does it act as an OSS, but it also has this single large LOV mounted, and I/O-intensive jobs run on this system reading and writing heavily from this single LOV. The LOV is also mounted on another system that we'll call "machine A". Machine A does not act as an OSS and is not serving out any disk via Lustre.

Last night I noticed that any of my jobs running on machine B that were reading/writing from this LOV were running at about 60% of CPU capacity, while the remaining 40% was being used by the "system". I got those numbers from iostat. Note that the EXACT same jobs run on machine A were running at 100% CPU. I couldn't figure out what system calls were hogging 40 percent of the total CPU.

I stumbled across an article describing how, on large-memory systems, if the page size isn't set right the system can end up spending more time swapping out pages than actually doing user work, which would account for this 40% system usage. (I don't understand this.) After reading another article about virtual memory that I found on Red Hat's site, I issued the following command:

    echo "10" > cat /proc/sys/vm/max_queue_depth

Four hours later the system crashed, and I'm not sure why. I'm not asking you guys to debug my crash except as it relates to Lustre. I have netdump running and was able to get a bit of information about the crash. The only reason I'm posting here is that the information I received about the dump shows quite a few calls to various Lustre functions right before the crash/panic. I've attached the rest of it as a text file.

I have read about there being some type of memory leak (or bug?) in Lustre which can show up when you run a machine as both a Lustre OSS and a Lustre filesystem client. Could this be the bug that I'm seeing?
Many thanks in advance,

-Aaron

PS: please correct me if my acronyms are off :)

-------------- next part --------------
Kernel BUG at panic:75
invalid operand: 0000 [1] SMP
CPU 6
Modules linked in: ipt_REJECT(U) ipt_state(U) ip_conntrack(U) ipt_multiport(U) iptable_filter(U) ip_tables(U) llite(U) mdc(U) lov(U) osc(U) obdfilter(U) fsfilt_ldiskfs(U) ldiskfs(U) ost(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sg(U) st(U) loop(U) md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) netconsole(U) netdump(U) autofs4(U) i2c_dev(U) i2c_core(U) nfs(U) lockd(U) nfs_acl(U) sunrpc(U) ds(U) yenta_socket(U) pcmcia_core(U) dm_mirror(U) dm_mod(U) button(U) battery(U) ac(U) ohci_hcd(U) ehci_hcd(U) e1000(U) floppy(U) ext3(U) jbd(U) sata_nv(U) libata(U) 3w_9xxx(U) aic79xx(U) sd_mod(U) scsi_mod(U)
Pid: 120, comm: kswapd3 Tainted: G   M  2.6.9-42.0.2.EL_lustre.1.4.7.1smp
RIP: 0010:[<ffffffff80136d8e>] <ffffffff80136d8e>{panic+211}
RSP: 0018:00000101fc72bd18  EFLAGS: 00010086
RAX: 000000000000002d RBX: ffffffff8031df9b RCX: 0000000000000046
RDX: 0000000000012e6d RSI: 0000000000000046 RDI: ffffffff80373dc0
RBP: 0000000000000900 R08: 00000000ffffffff R09: ffffffff8031df9b
R10: 0000000000000061 R11: 000000000000002a R12: 00000000ffffffff
R13: ffffffff8036ac80 R14: 00137b5ff61df5cb R15: ffffffff8031df9b
FS:  0000002a95573380(0000) GS:ffffffff80477e80(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018
CR0: 000000008005003b CR2: 0000002aa6dee000 CR3: 00000003030d8000 CR4: 00000000000006e0
Process kswapd3 (pid: 120, threadinfo 0000010100146000, task 00000100080c3800)
Stack: 0000003000000008 00000101fc72bdf8 00000101fc72bd38 000000031770fa68
       0000000000000046 0000000000000046 0000000000012e40 0000000000000046
       00000000ffffffff 00000101fc72be78
Call Trace:
       <ffffffff80117930>{print_mce+159} <ffffffff801179f1>{mce_available+0}
       <ffffffff80117ce6>{do_machine_check+731} <ffffffff8011135b>{machine_check+127}
       <ffffffffa0521447>{:llite:ll_releasepage+0} <ffffffffa05159fc>{:llite:llap_from_page+546}
       <EOE> <ffffffffa046d91a>{:lov:lov_teardown_async_page+711}
       <ffffffffa0516411>{:llite:ll_removepage+455} <ffffffffa0521455>{:llite:ll_releasepage+14}
       <ffffffff80163353>{shrink_zone+3363} <ffffffff80131555>{recalc_task_prio+337}
       <ffffffff80163b5f>{balance_pgdat+506} <ffffffff80163da9>{kswapd+252}
       <ffffffff80134b66>{autoremove_wake_function+0} <ffffffff80134b66>{autoremove_wake_function+0}
       <ffffffff80110e23>{child_rip+8} <ffffffff80163cad>{kswapd+0}
       <ffffffff80110e1b>{child_rip+0}

Code: 0f 0b 40 d4 31 80 ff ff ff ff 4b 00 31 ff e8 47 c1 fe ff e8
RIP <ffffffff80136d8e>{panic+211} RSP <00000101fc72bd18>

CPU 6: Machine Check Exception: 4 Bank 4: f603200100000813
TSC 137b5ff61e0204 ADDR 31770fa68
Kernel panic - not syncing: Machine check
----------- [cut here ] --------- [please bite here ] ---------
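(A quick aside on the /proc write quoted above: because of how the shell parses the redirection, the command as typed sends echo's output into a file literally named "cat" in the current directory rather than into the /proc entry. A minimal sketch of the usual ways to write a vm tunable follows; the path is copied verbatim from the post and has not been verified to exist on a 2.6.9 kernel, so substitute a real tunable on your system.)

    # Direct write into the proc entry (no "cat" in between):
    echo 10 > /proc/sys/vm/max_queue_depth

    # Equivalent, and easier to audit, via sysctl:
    sysctl -w vm.max_queue_depth=10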
On Tue, 21 Nov 2006, Aaron Knister wrote:

> I have a machine with 20 TB of local storage (internal SATA drives). It
> has three 6.7 TB arrays, each of which is configured as an OST and part
> of a larger LOV. I'll call this "machine B". Machine B has 16 GB of
> memory. Not only does it act as an OSS, but it also has this single
> large LOV mounted, and I/O-intensive jobs run on this system reading
> and writing heavily from this single LOV.

If machine B acts both as client and server for the same Lustre filesystem, you may run into recovery problems in case of a crash. CFS recommends against using such configurations. I'm not sure how dangerous this is, however.

> The LOV is also mounted on another system that we'll call "machine A".
> Machine A does not act as an OSS and is not serving out any disk via
> Lustre. Last night I noticed that any of my jobs running on machine B
> that were reading/writing from this LOV were running at about 60% of
> CPU capacity, while the remaining 40% was being used by the "system".
> I got those numbers from iostat. Note that the EXACT same jobs run on
> machine A were running at 100% CPU.

Can you describe the load your job generates on Lustre? Metadata-intensive programs can generate a lot of activity on Lustre servers.

Also check that Lustre debugging is set to zero (I suspect many innocent users get caught by this setting).

> I couldn't figure out what system calls were hogging 40 percent of the
> total CPU. I stumbled across an article describing how, on large-memory
> systems, if the page size isn't set right the system can end up
> spending more time swapping out pages than actually doing user work,
> which would account for this 40% system usage. (I don't understand
> this.)

I guess this was not really about swapping, but rather about page fault handling. :) You could probably get a rough idea of what's going on with a kernel profiling tool such as oprofile.

[snip]

> I've attached the rest of it as a text file.

The panic message states it's an MCE (Machine Check Exception); usually this is a hardware problem (memory, CPU, etc.).

-- 
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
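(A minimal sketch of the two checks suggested above, assuming a Lustre 1.4.x node where the debug mask lives under /proc/sys/lnet (older releases used /proc/sys/portals) and assuming oprofile is installed with a matching uncompressed vmlinux; the vmlinux path below is only an example and may differ on your distribution.)

    # Check the current Lustre debug mask; a long list of flags means debug
    # logging is enabled and can cost noticeable CPU on busy servers.
    cat /proc/sys/lnet/debug

    # Turn Lustre debugging off entirely.
    echo 0 > /proc/sys/lnet/debug

    # Rough kernel profile with oprofile: start, run the I/O job, then report.
    opcontrol --vmlinux=/boot/vmlinux-`uname -r`   # example path; adjust to your system
    opcontrol --start
    #  ... run the I/O-intensive job for a while ...
    opcontrol --stop
    opreport --symbols | head -30                  # top kernel/module symbols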
Aaron Knister wrote:

...snipped...

> Code: 0f 0b 40 d4 31 80 ff ff ff ff 4b 00 31 ff e8 47 c1 fe ff e8
> RIP <ffffffff80136d8e>{panic+211} RSP <00000101fc72bd18>
>
> CPU 6: Machine Check Exception: 4 Bank 4: f603200100000813
> TSC 137b5ff61e0204 ADDR 31770fa68

I do believe that MCEs are typically a sign of some bit of hardware gone bad.

http://en.wikipedia.org/wiki/Machine_Check_Exception

Nic
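(If you want to decode the bank/status bits from a record like that, the mcelog tool can parse console-format MCE text; a sketch follows, assuming the mcelog package is installed and the MCE lines have been saved to a file, here hypothetically named mce.txt.)

    # Paste the "Machine Check Exception" lines from the console/netdump
    # output into mce.txt, then let mcelog translate the raw fields.
    mcelog --ascii < mce.txt

    # On a running machine, mcelog with no arguments reads /dev/mcelog
    # and reports any machine check events logged since the last run.
    mcelog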
Thank you all for your help. The machine had panicked several times since, but the issue went away when I took the client away from the server. A round of "I told you so" was in order for my higher-ups ;-) because I recommended against this setup in the first place... but I did what I was told.

-Aaron

Jean-Marc Saffroy wrote:

> If machine B acts both as client and server for the same Lustre
> filesystem, you may run into recovery problems in case of a crash. CFS
> recommends against using such configurations. I'm not sure how
> dangerous this is, however.

[snip]