Hi all,

I've experienced reproducible OSS crashes with 1.6.5 but also with 1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22. The OSSs are file servers with two OSTs each.

I'm now testing with just one OSS in the system (but I first encountered the problem with 9 OSSs), mounting Lustre on 4 clients and writing to it with the stress utility: "stress -d 2 --hdd-noclean --hdd-bytes 5M".

Once the OSTs are filled to more than 60%, the machine simply stops working. There are no traces in any of the logs that relate directly to the moment of failure.

I have repeated the procedure with 9 of these machines, all of them SuperMicro X7DB8 16-slot file servers with 2 Intel Xeon E5320 quad-cores and 8 GB RAM, and with one older SuperMicro X7DB8 with 2 dual-core Xeons and 4 GB RAM on a Lustre 1.6.4.2 test system. All of these machines have two 3ware 9650 RAID controllers with 500 GB WD disks in RAID 5.

Subsequently I reformatted the OST with ext3 and ran the stress test locally on the machine: no failure, the partition filled to 100% without problems. All of this seems to indicate that it is not (solely) a hardware problem.

Prior to the most recent crash the following is found in /var/log/kern.log:

Jul 22 21:23:52 kernel: Lustre: 25706:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 30s
Jul 22 21:24:10 kernel: Lustre: 25692:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0000: slow journal start 37s
Jul 22 21:24:10 kernel: Lustre: 25692:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0000: slow brw_start 37s
Jul 22 21:24:10 kernel: Lustre: 25697:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 37s
Jul 22 21:46:55 kernel: Lustre: 25680:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 31s
Jul 22 21:46:55 kernel: Lustre: 25680:0:(filter_io_26.c:700:filter_commitrw_write()) Skipped 2 previous similar messages
Jul 22 21:47:06 kernel: Lustre: 25733:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 30s
Jul 22 21:47:10 kernel: Lustre: 25744:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 31s
Jul 22 21:47:15 kernel: Lustre: 25729:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001: slow journal start 30s
Jul 22 21:47:15 kernel: Lustre: 25729:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001: slow brw_start 30s
Jul 22 21:47:54 kernel: Lustre: 25662:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 36s
Jul 22 21:48:30 kernel: Lustre: 25721:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001: slow journal start 33s
Jul 22 21:48:30 kernel: Lustre: 25721:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001: slow brw_start 33s
Jul 22 21:48:30 kernel: Lustre: 25736:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 33s

However, these messages might just as well have appeared while the machine was still working.

Some of my OSSs I managed to crash with a trace in kern.log; I think this is a known bug in the ext3/ext4 code:

Jul 14 21:41:19 kernel: uh! busy PA
Jul 14 21:41:19 kernel:
Jul 14 21:41:19 kernel: Call Trace:
Jul 14 21:41:19 kernel: [<ffffffff8857c5ce>] :ldiskfs:ldiskfs_mb_discard_group_preallocations+0x2ae/0x400
Jul 14 21:41:19 kernel: [<ffffffff8857c75a>] :ldiskfs:ldiskfs_mb_discard_preallocations+0x3a/0x70
Jul 14 21:41:19 kernel: [<ffffffff8857cc0d>] :ldiskfs:ldiskfs_mb_new_blocks+0x24d/0x270
Jul 14 21:41:19 kernel: [<ffffffff802a1025>] __find_get_block_slow+0x2f/0xf1
Jul 14 21:41:19 kernel: [<ffffffff885c90ef>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x4ef/0x640
Jul 14 21:41:19 kernel: [<ffffffff802a1428>] __getblk+0x1d/0x20d
Jul 14 21:41:19 kernel: [<ffffffff88577801>] :ldiskfs:ldiskfs_ext_walk_space+0x131/0x250
Jul 14 21:41:19 kernel: [<ffffffff885c8c00>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x640
Jul 14 21:41:19 kernel: [<ffffffff885c48c4>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xa4/0x150
Jul 14 21:41:19 kernel: [<ffffffff8845095c>] :ksocklnd:ksocknal_alloc_tx+0x2c/0x2a0
Jul 14 21:41:19 kernel: [<ffffffff885f24db>] :obdfilter:filter_direct_io+0x12b/0xd60
Jul 14 21:41:19 kernel: [<ffffffff885f3e9d>] :obdfilter:filter_commitrw_write+0x7bd/0x2640
Jul 14 21:41:19 kernel: [<ffffffff883bdbbe>] :ptlrpc:ldlm_resource_foreach+0x6e/0x3a0
Jul 14 21:41:19 kernel: [<ffffffff8023a0a4>] lock_timer_base+0x26/0x4b
Jul 14 21:41:19 kernel: [<ffffffff885aec4e>] :ost:ost_brw_write+0x15be/0x2990
Jul 14 21:41:19 kernel: [<ffffffff8022c506>] default_wake_function+0x0/0xe
Jul 14 21:41:19 kernel: [<ffffffff885b32f5>] :ost:ost_handle+0x2745/0x5ed0
Jul 14 21:41:19 kernel: [<ffffffff8027db5c>] cache_alloc_refill+0x94/0x1e8
Jul 14 21:41:19 kernel: [<ffffffff8022a52c>] find_busiest_group+0x255/0x6cf
Jul 14 21:41:19 kernel: [<ffffffff80246a60>] do_gettimeofday+0x2e/0x9e
Jul 14 21:41:19 kernel: [<ffffffff8023a0a4>] lock_timer_base+0x26/0x4b
Jul 14 21:41:19 kernel: [<ffffffff8834d088>] :obdclass:class_handle2object+0x88/0x180
Jul 14 21:41:19 kernel: [<ffffffff883e41d0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
Jul 14 21:41:19 kernel: [<ffffffff883e1c9e>] :ptlrpc:lustre_swab_buf+0xbe/0xf0
Jul 14 21:41:19 kernel: [<ffffffff8023a1ef>] __mod_timer+0xb6/0xc4
Jul 14 21:41:19 kernel: [<ffffffff883ec38f>] :ptlrpc:ptlrpc_main+0x130f/0x1ce0
Jul 14 21:41:19 kernel: [<ffffffff8022c506>] default_wake_function+0x0/0xe
Jul 14 21:41:19 kernel: [<ffffffff8020aba8>] child_rip+0xa/0x12
Jul 14 21:41:19 kernel: [<ffffffff883eb080>] :ptlrpc:ptlrpc_main+0x0/0x1ce0
Jul 14 21:41:19 kernel: [<ffffffff8020ab9e>] child_rip+0x0/0x12

Neither the "slow ..." messages nor the "uh! busy PA" trace show up in all cases of crashed OSSs, so I have no idea whether this is related at all.

In any case, any hints would be appreciated.
Thomas
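For reference, the whole reproduction boils down to something like the sketch below on each of the 4 clients. The MGS NID, file system name and mount point are placeholders, not our real ones; the stress invocation is the one quoted above.

  # Mount the file system (MGS NID, fsname and mount point are examples)
  mount -t lustre mgs@tcp0:/gsilust /mnt/gsilust
  mkdir -p /mnt/gsilust/$(hostname)
  cd /mnt/gsilust/$(hostname)
  # Two disk-writer processes; each writes 5 MB files in a loop and, because of
  # --hdd-noclean, never deletes them, so the OSTs steadily fill up.
  stress -d 2 --hdd-noclean --hdd-bytes 5M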
On Wed, 2008-07-23 at 13:56 +0200, Thomas Roth wrote:
> Hi all,

Hi,

> I've experienced reproducible OSS crashes with 1.6.5 but also with
> 1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22.
> The OSSs are file servers with two OSTs each.
> I'm now testing with just one OSS in the system (but I first encountered
> the problem with 9 OSSs), mounting Lustre on 4 clients and writing to it
> with the stress utility: "stress -d 2 --hdd-noclean --hdd-bytes 5M".
>
> Once the OSTs are filled to more than 60%, the machine simply stops working.

Hrm. "Stops working" and "crash" are two different things. Can we get clarification or more detail on exactly what happens to the OSS at this point? Is the OSS still up and running? Can you log into it? Can you run an "ls -l /" and have it return successfully?

> Jul 22 21:23:52 kernel: Lustre: 25706:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 30s
> [...]
> Jul 22 21:48:30 kernel: Lustre: 25736:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 33s

These indicate that your OSTs are too slow. Maybe you have oversubscribed the number of OST threads your hardware can handle, or maybe the OST hardware has slowed down or degraded by the point this happens.

> Some of my OSSs I managed to crash with a trace in kern.log; I think this
> is a known bug in the ext3/ext4 code:
> Jul 14 21:41:19 kernel: uh! busy PA
> Jul 14 21:41:19 kernel: Call Trace:
> Jul 14 21:41:19 kernel: [<ffffffff8857c5ce>] :ldiskfs:ldiskfs_mb_discard_group_preallocations+0x2ae/0x400
> [...]
> Jul 14 21:41:19 kernel: [<ffffffff8020ab9e>] child_rip+0x0/0x12

This looks like bug 14322, fixed in 1.6.5.

b.
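On the thread-count question: the running OST I/O service threads show up on the OSS as kernel threads named ll_ost_io_*, and in 1.6 their number can be capped with a module option. A rough sketch; the value 128 is only an example, and the option name and config file location should be checked against the manual for the installed version:

  # Count the OST I/O service threads currently running on the OSS
  ps ax | grep '[l]l_ost_io' | wc -l

  # Cap the number of OSS service threads via a module option
  # (/etc/modprobe.d/lustre or /etc/modprobe.conf, depending on the distribution;
  # takes effect after the Lustre modules are reloaded or the OSS is rebooted)
  echo 'options ost oss_num_threads=128' >> /etc/modprobe.d/lustre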
Hi,

Brian J. Murrell wrote:
> On Wed, 2008-07-23 at 13:56 +0200, Thomas Roth wrote:
>> Hi all,
>
> Hi,
>
>> I've experienced reproducible OSS crashes with 1.6.5 but also with
>> 1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22.
>> [...]
>> Once the OSTs are filled to more than 60%, the machine simply stops working.
>
> Hrm. "Stops working" and "crash" are two different things. Can we get
> clarification or more detail on exactly what happens to the OSS at this
> point? Is the OSS still up and running? Can you log into it? Can you run
> an "ls -l /" and have it return successfully?

Well, in these cases the machine is simply dead: the jobs writing via Lustre have stopped with "write failed: Input/output error", and I can't get into the machine via ssh or the console; the only thing I can do is a hard reset. That's why I suspected the hardware first.

>> Jul 22 21:23:52 kernel: Lustre: 25706:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 30s
>> [...]
>> Jul 22 21:48:30 kernel: Lustre: 25736:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 33s
>
> These indicate that your OSTs are too slow. Maybe you have oversubscribed
> the number of OST threads your hardware can handle, or maybe the OST
> hardware has slowed down or degraded by the point this happens.

Interesting. These were 32 single processes "spinning on write()/unlink()", as the man page of stress puts it. I had only looked at the network traffic, and it was not as high as in other tests: the servers are connected via 1 Gbit Ethernet links, but I saw no more than 20-30 MB/s. Internally, the RAID controllers and disks can handle much more. The load on the servers was less than 8. And as I mentioned, I ran the same test on the machines locally, although I do not remember how many parallel stress jobs I used.

Where else could I look for overloaded hardware? Any way to find out how many OST threads our hardware can handle?

So far I have not tried any other pattern/utility for these tests: our users are well known to be more demanding than any test program. That's why we want to deploy Lustre in the first place: to let the users fire from hundreds of clients, concurrently, at a large data file space (instead of killing NFS servers).

>> Some of my OSSs I managed to crash with a trace in kern.log; I think this
>> is a known bug in the ext3/ext4 code:
>> Jul 14 21:41:19 kernel: uh! busy PA
>
> This looks like bug 14322, fixed in 1.6.5.
>
> b.

Yeah, I think I have not seen this trace on the 1.6.5 system. I just wasn't sure whether the servers might have died before they were able to write to the logs.

Thanks for your reply,
Thomas
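One thing that might be worth capturing during the fill-up phase, to see whether the RAID volumes rather than the network are what falls behind, is per-device utilization on the OSS itself. A simple sketch using standard tools; the log file paths are arbitrary:

  # Extended per-device I/O statistics every 5 seconds
  # (watch await and %util for the two 3ware volumes)
  iostat -x -k 5 | tee /var/tmp/oss-iostat.log
  # Run queue, blocked tasks, memory and swap pressure
  vmstat 5 | tee /var/tmp/oss-vmstat.log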
On Wed, 2008-07-23 at 17:20 +0200, Thomas Roth wrote:
> Hi,

Hello,

> Well, in these cases the machine is simply dead: the jobs writing via
> Lustre have stopped with "write failed: Input/output error",

Are there any messages on the console of such a machine when it's hung? Can you get a stack trace (i.e. sysrq-t) of the processes on the hung machine?

> and I can't get into the machine via ssh or the console; the only thing
> I can do is a hard reset. That's why I suspected the hardware first.

Indeed. The (serial) console is the best source of information in this sort of case. Hopefully you are logging it and can retrieve the messages prior to the hang.

> Where else could I look for overloaded hardware?

Not sure. That's quite hardware specific.

> Any way to find out how many OST threads our hardware can handle?

Well, you could run some iokit benchmarks and find out where your performance plateaus as you increase the number of threads to a single OST.

b.
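For the sysrq-t part: assuming magic SysRq is enabled in the kernel (it usually is on stock Debian kernels), the task dump can be armed and triggered from a shell or the console like this; these are standard kernel interfaces, nothing Lustre-specific:

  # Allow all magic SysRq functions (can also go into /etc/sysctl.conf as kernel.sysrq = 1)
  echo 1 > /proc/sys/kernel/sysrq
  # Dump the stack of every task to the kernel log / console,
  # same as pressing Alt-SysRq-t on the local keyboard
  echo t > /proc/sysrq-trigger

If the box is too wedged for even that, the same dump can often still be triggered over a serial console by sending a break followed by "t".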
>> Where else could I look for overloaded hardware?
>
> Not sure. That's quite hardware specific.

You could run collectl and then, after you reset the system, log back in and look at what was happening right before the reset. This will let you look at CPU, interrupts, memory, network and a variety of other things, including Lustre-level stats such as I/O rates and even RPC stats. You'll also be able to see which processes were running, in a format similar to ps, or you can just play back the data with the --top switch. If you feel 10-second samples aren't frequent enough, you can always set your interval down to 1 second or even lower...

-mark
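To make that concrete, a playback along those lines might look like the sketch below; the file name, host name and time window are made up, and the exact switches may differ between collectl versions:

  # Replay CPU, network and per-OST Lustre stats from the daemon's log,
  # restricted to the minutes around the hang (times are examples)
  collectl -p /var/log/collectl/lxfs89-20080724-000100.raw.gz -scnL -oT --from 05:15 --thru 05:25
  # Replay the process data top-style to see what was running just before the reset
  collectl -p /var/log/collectl/lxfs89-20080724-000100.raw.gz --top --from 05:15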
Well, guess what - I did that. These OSSs are all running collectl already ;-)

At first I had, in addition to the collectl daemon, an xterm with "collectl -sL -od", which at least gave me the point in time when the machine stopped. It didn't make me any wiser; at least there was no Lustre write activity any more at the time of the crash.

On the last try I added 'c' and 'n' and found that in the last minute the CPU load had risen to 78. No Lustre or network activity, though. That's a bit much, but it would be a pity if that were sufficient to crash a server:

### RECORD 27098 >>> lxfs89 <<< (1216869685.007) (Thu Jul 24 05:21:25 2008) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
#               USER NICE  SYS WAIT IRQ SOFT STEAL IDLE INTR CTXSW PROC RUNQ RUN  AVG1  AVG5 AVG15
07/24 05:21:25     0    0   37   10   0    0     0   51  462   281    0  538  16 79.11 78.64 76.93

# LUSTRE FILESYSTEM SINGLE OST STATISTICS
#               Ost      KBRead Reads KBWrite Writes
07/24 05:21:25  OST0015       0     0       0      0
07/24 05:21:25  OST0016       0     0       0      0

# NETWORK STATISTICS (/sec)
#               Num  Name  KBIn PktIn SizeIn MultI CmpI ErrIn KBOut PktOut SizeO CmpO ErrOut
07/24 05:21:25    0   lo:     0     0      0     0    0     0     0      0     0    0      0
07/24 05:21:25    1 eth0:    28   385     76     0    0     0     6     29   230    0      0
07/24 05:21:25    2 eth1:     0     0      0     0    0     0     0      0     0    0      0
07/24 05:21:25    3 eth2:     0     0      0     0    0     0     0      0     0    0      0

Then I actually started reading man pages and was able to extract some info from the collectl log file. Playing it back with -scnL told me that writing to the disk (to the log) stopped 5 hours before the last activity I'd seen in the xterm. The CPU load was 37 at that moment, and there were still packets coming in and being written to the two OSTs:

### RECORD 126 >>> lxfs89 <<< (1216851732.479) (Thu Jul 24 00:22:12 2008) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# USER NICE  SYS WAIT IRQ SOFT STEAL IDLE INTR CTXSW PROC RUNQ RUN  AVG1  AVG5 AVG15
     0    0    4   19   0    0     0   75  729  3302    0  471   9 36.87 35.96 34.36

# LUSTRE FILESYSTEM SINGLE OST STATISTICS
# Ost     KBRead Reads KBWrite Writes
OST0015        0     0    2751      2
OST0016        0     0    2955      2

# NETWORK STATISTICS (/sec)
# Num  Name  KBIn PktIn SizeIn MultI CmpI ErrIn KBOut PktOut SizeO CmpO ErrOut
    0   lo:     0     0      0     0    0     0     0      0     0    0      0
    1 eth0:  5139  3559   1478     0    0     0   116   1475    81    0      0
    2 eth1:     0     0      0     0    0     0     0      0     0    0      0
    3 eth2:     0     0      0     0    0     0     0      0     0    0      0

However, I have not yet gotten further in learning the abilities of collectl or the interpretation of its output.

In another xterm window I had htop running, though. It stopped with three 100% processes on top, ll_ost_io_42, ll_ost_io_59 and ll_ost_io_01, each of which had been running for 4h 58m. That fits the 5-hour gap mentioned above.

Still I don't have a clue as to what actually causes this behavior or how to avoid it. On the next crash I'll try to get a stack trace, and logging the console to more than the xterm buffer is surely something we ought to do as well.

Thanks for your advice,
Thomas

Mark Seger wrote:
>>> Where else could I look for overloaded hardware?
>>
>> Not sure. That's quite hardware specific.
>
> You could run collectl and then, after you reset the system, log back in
> and look at what was happening right before the reset. This will let you
> look at CPU, interrupts, memory, network and a variety of other things,
> including Lustre-level stats such as I/O rates and even RPC stats. You'll
> also be able to see which processes were running, in a format similar to
> ps, or you can just play back the data with the --top switch. If you feel
> 10-second samples aren't frequent enough, you can always set your interval
> down to 1 second or even lower...
>
> -mark

--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453
Fax: +49-6159-71 2986

Gesellschaft für Schwerionenforschung mbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528
Geschäftsführer: Professor Dr. Horst Stöcker
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
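Regarding the console logging: if a serial console is not readily available, the kernel's stock netconsole module can stream the kernel log over UDP to another machine, which would catch a panic or lockup message even with the box otherwise dead. A rough sketch; all IP addresses, the MAC and the interface name are made-up placeholders:

  # On the OSS: forward printk output to a log host (addresses are examples)
  modprobe netconsole netconsole=6665@10.10.0.89/eth0,6666@10.10.0.1/00:11:22:33:44:55
  # Raise the console log level so nothing gets filtered out before the hang
  dmesg -n 8

  # On the log host: capture whatever arrives (traditional netcat syntax)
  nc -u -l -p 6666 | tee lxfs89-console.log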
Hi Thomas,

On Thursday 24 July 2008 09:24:11 am Thomas Roth wrote:
> On the next crash I'll try to get a stack trace, and logging the console
> to more than the xterm buffer is surely something we ought to do as well.

If you don't know it or use it already, maybe you could give netdump a try:
http://www.redhat.com/support/wpapers/redhat/netdump/

It basically allows you to get crash dumps and stack traces from a remote machine. Very useful for gathering Lustre debug information.

Cheers,
--
Kilian