Hi all,

I've experienced reproducible OSS crashes with 1.6.5 but also with 1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22. The OSSs are file servers with two OSTs each.

I'm now testing with just one OSS in the system (but I first encountered the problem with 9 OSSs), mounting Lustre on 4 clients and writing to it with the stress utility: "stress -d 2 --hdd-noclean --hdd-bytes 5M".

Once the OSTs are filled to more than 60%, the machine simply stops working. There are no traces in any of the logs that relate directly to the moment of failure.

I have repeated the procedure with 9 of these machines, all of them SuperMicro X7DB8 16-slot file servers with 2 Intel Xeon E5320 quad-cores and 8 GB RAM, and with one older SuperMicro X7DB8 with 2 dual-core Xeons and 4 GB RAM on a Lustre 1.6.4.2 test system. All of these machines have two 3ware 9650 RAID controllers with 500 GB WD disks in RAID 5.

Subsequently I reformatted the OST with ext3 and ran the stress test locally on the machine: no failure, the partition filled to 100% without problems. All of this seems to indicate that it is not (solely) a hardware problem.

Prior to the most recent crash the following is found in /var/log/kern.log:

Jul 22 21:23:52 kernel: Lustre: 25706:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 30s
Jul 22 21:24:10 kernel: Lustre: 25692:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0000: slow journal start 37s
Jul 22 21:24:10 kernel: Lustre: 25692:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0000: slow brw_start 37s
Jul 22 21:24:10 kernel: Lustre: 25697:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 37s
Jul 22 21:46:55 kernel: Lustre: 25680:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 31s
Jul 22 21:46:55 kernel: Lustre: 25680:0:(filter_io_26.c:700:filter_commitrw_write()) Skipped 2 previous similar messages
Jul 22 21:47:06 kernel: Lustre: 25733:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 30s
Jul 22 21:47:10 kernel: Lustre: 25744:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 31s
Jul 22 21:47:15 kernel: Lustre: 25729:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001: slow journal start 30s
Jul 22 21:47:15 kernel: Lustre: 25729:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001: slow brw_start 30s
Jul 22 21:47:54 kernel: Lustre: 25662:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 36s
Jul 22 21:48:30 kernel: Lustre: 25721:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001: slow journal start 33s
Jul 22 21:48:30 kernel: Lustre: 25721:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001: slow brw_start 33s
Jul 22 21:48:30 kernel: Lustre: 25736:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 33s

However, these messages might just as well have appeared while the machine was still working.

Some of my OSSs I managed to crash with a trace in kern.log; I think this is a known bug in the ext3/ext4 code:

Jul 14 21:41:19 kernel: uh! busy PA
Jul 14 21:41:19 kernel:
Jul 14 21:41:19 kernel: Call Trace:
Jul 14 21:41:19 kernel: [<ffffffff8857c5ce>] :ldiskfs:ldiskfs_mb_discard_group_preallocations+0x2ae/0x400
Jul 14 21:41:19 kernel: [<ffffffff8857c75a>] :ldiskfs:ldiskfs_mb_discard_preallocations+0x3a/0x70
Jul 14 21:41:19 kernel: [<ffffffff8857cc0d>] :ldiskfs:ldiskfs_mb_new_blocks+0x24d/0x270
Jul 14 21:41:19 kernel: [<ffffffff802a1025>] __find_get_block_slow+0x2f/0xf1
Jul 14 21:41:19 kernel: [<ffffffff885c90ef>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x4ef/0x640
Jul 14 21:41:19 kernel: [<ffffffff802a1428>] __getblk+0x1d/0x20d
Jul 14 21:41:19 kernel: [<ffffffff88577801>] :ldiskfs:ldiskfs_ext_walk_space+0x131/0x250
Jul 14 21:41:19 kernel: [<ffffffff885c8c00>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x640
Jul 14 21:41:19 kernel: [<ffffffff885c48c4>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xa4/0x150
Jul 14 21:41:19 kernel: [<ffffffff8845095c>] :ksocklnd:ksocknal_alloc_tx+0x2c/0x2a0
Jul 14 21:41:19 kernel: [<ffffffff885f24db>] :obdfilter:filter_direct_io+0x12b/0xd60
Jul 14 21:41:19 kernel: [<ffffffff885f3e9d>] :obdfilter:filter_commitrw_write+0x7bd/0x2640
Jul 14 21:41:19 kernel: [<ffffffff883bdbbe>] :ptlrpc:ldlm_resource_foreach+0x6e/0x3a0
Jul 14 21:41:19 kernel: [<ffffffff8023a0a4>] lock_timer_base+0x26/0x4b
Jul 14 21:41:19 kernel: [<ffffffff885aec4e>] :ost:ost_brw_write+0x15be/0x2990
Jul 14 21:41:19 kernel: [<ffffffff8022c506>] default_wake_function+0x0/0xe
Jul 14 21:41:19 kernel: [<ffffffff885b32f5>] :ost:ost_handle+0x2745/0x5ed0
Jul 14 21:41:19 kernel: [<ffffffff8027db5c>] cache_alloc_refill+0x94/0x1e8
Jul 14 21:41:19 kernel: [<ffffffff8022a52c>] find_busiest_group+0x255/0x6cf
Jul 14 21:41:19 kernel: [<ffffffff80246a60>] do_gettimeofday+0x2e/0x9e
Jul 14 21:41:19 kernel: [<ffffffff8023a0a4>] lock_timer_base+0x26/0x4b
Jul 14 21:41:19 kernel: [<ffffffff8834d088>] :obdclass:class_handle2object+0x88/0x180
Jul 14 21:41:19 kernel: [<ffffffff883e41d0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
Jul 14 21:41:19 kernel: [<ffffffff883e1c9e>] :ptlrpc:lustre_swab_buf+0xbe/0xf0
Jul 14 21:41:19 kernel: [<ffffffff8023a1ef>] __mod_timer+0xb6/0xc4
Jul 14 21:41:19 kernel: [<ffffffff883ec38f>] :ptlrpc:ptlrpc_main+0x130f/0x1ce0
Jul 14 21:41:19 kernel: [<ffffffff8022c506>] default_wake_function+0x0/0xe
Jul 14 21:41:19 kernel: [<ffffffff8020aba8>] child_rip+0xa/0x12
Jul 14 21:41:19 kernel: [<ffffffff883eb080>] :ptlrpc:ptlrpc_main+0x0/0x1ce0
Jul 14 21:41:19 kernel: [<ffffffff8020ab9e>] child_rip+0x0/0x12

Neither the "slow ..." messages nor the "uh! busy PA" trace show up in all cases of crashed OSSs, so I have no idea whether this is related at all.

In any case, any hints would be appreciated.
Thomas
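For reference, the whole reproduction boils down to something like the sketch below on each of the 4 clients. The MGS NID, file system name and mount point are placeholders, not our real ones; the stress invocation is the one quoted above.

  # Mount the file system (MGS NID, fsname and mount point are examples)
  mount -t lustre mgs@tcp0:/gsilust /mnt/gsilust
  mkdir -p /mnt/gsilust/$(hostname)
  cd /mnt/gsilust/$(hostname)
  # Two disk-writer processes; each writes 5 MB files in a loop and, because of
  # --hdd-noclean, never deletes them, so the OSTs steadily fill up.
  stress -d 2 --hdd-noclean --hdd-bytes 5M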
On Wed, 2008-07-23 at 13:56 +0200, Thomas Roth wrote:
> Hi all,

Hi,

> I've experienced reproducible OSS crashes with 1.6.5 but also with
> 1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22.
> The OSSs are file servers with two OSTs each.
> I'm now testing with just one OSS in the system (but I first encountered
> the problem with 9 OSSs), mounting Lustre on 4 clients and writing to it
> with the stress utility: "stress -d 2 --hdd-noclean --hdd-bytes 5M".
>
> Once the OSTs are filled to more than 60%, the machine simply stops working.

Hrm. "Stops working" and "crash" are two different things. Can we get clarification or more detail on exactly what happens to the OSS at this point? Is the OSS still up and running? Can you log into it? Can you run an "ls -l /" and have it return successfully?

> Jul 22 21:23:52 kernel: Lustre: 25706:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 30s
> [...]
> Jul 22 21:48:30 kernel: Lustre: 25736:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 33s

These indicate that your OSTs are too slow. Maybe you have oversubscribed the number of OST threads your hardware can handle, or maybe the OST hardware has slowed down or degraded by the point this happens.

> Some of my OSSs I managed to crash with a trace in kern.log; I think this
> is a known bug in the ext3/ext4 code:
> Jul 14 21:41:19 kernel: uh! busy PA
> Jul 14 21:41:19 kernel: Call Trace:
> Jul 14 21:41:19 kernel: [<ffffffff8857c5ce>] :ldiskfs:ldiskfs_mb_discard_group_preallocations+0x2ae/0x400
> [...]
> Jul 14 21:41:19 kernel: [<ffffffff8020ab9e>] child_rip+0x0/0x12

This looks like bug 14322, fixed in 1.6.5.

b.
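On the thread-count question: the running OST I/O service threads show up on the OSS as kernel threads named ll_ost_io_*, and in 1.6 their number can be capped with a module option. A rough sketch; the value 128 is only an example, and the option name and config file location should be checked against the manual for the installed version:

  # Count the OST I/O service threads currently running on the OSS
  ps ax | grep '[l]l_ost_io' | wc -l

  # Cap the number of OSS service threads via a module option
  # (/etc/modprobe.d/lustre or /etc/modprobe.conf, depending on the distribution;
  # takes effect after the Lustre modules are reloaded or the OSS is rebooted)
  echo 'options ost oss_num_threads=128' >> /etc/modprobe.d/lustre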
Hi,

Brian J. Murrell wrote:
> On Wed, 2008-07-23 at 13:56 +0200, Thomas Roth wrote:
>> Hi all,
>
> Hi,
>
>> I've experienced reproducible OSS crashes with 1.6.5 but also with
>> 1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22.
>> [...]
>> Once the OSTs are filled to more than 60%, the machine simply stops working.
>
> Hrm. "Stops working" and "crash" are two different things. Can we get
> clarification or more detail on exactly what happens to the OSS at this
> point? Is the OSS still up and running? Can you log into it? Can you run
> an "ls -l /" and have it return successfully?

Well, in these cases the machine is simply dead: the jobs writing via Lustre have stopped with "write failed: Input/output error", and I can't get into the machine via ssh or the console; the only thing I can do is a hard reset. That's why I suspected the hardware first.

>> Jul 22 21:23:52 kernel: Lustre: 25706:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 30s
>> [...]
>> Jul 22 21:48:30 kernel: Lustre: 25736:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 33s
>
> These indicate that your OSTs are too slow. Maybe you have oversubscribed
> the number of OST threads your hardware can handle, or maybe the OST
> hardware has slowed down or degraded by the point this happens.

Interesting. These were 32 single processes "spinning on write()/unlink()", as the man page of stress puts it. I had only looked at the network traffic, and it was not as high as in other tests: the servers are connected via 1 Gbit Ethernet links, but I saw no more than 20-30 MB/s. Internally, the RAID controllers and disks can handle much more. The load on the servers was less than 8. And as I mentioned, I ran the same test on the machines locally, although I do not remember how many parallel stress jobs I used.

Where else could I look for overloaded hardware? Any way to find out how many OST threads our hardware can handle?

So far I have not tried any other pattern/utility for these tests: our users are well known to be more demanding than any test program. That's why we want to deploy Lustre in the first place: to let the users fire from hundreds of clients, concurrently, at a large data file space (instead of killing NFS servers).

>> Some of my OSSs I managed to crash with a trace in kern.log; I think this
>> is a known bug in the ext3/ext4 code:
>> Jul 14 21:41:19 kernel: uh! busy PA
>
> This looks like bug 14322, fixed in 1.6.5.
>
> b.

Yeah, I think I have not seen this trace on the 1.6.5 system. I just wasn't sure whether the servers might have died before they were able to write to the logs.

Thanks for your reply,
Thomas
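One thing that might be worth capturing during the fill-up phase, to see whether the RAID volumes rather than the network are what falls behind, is per-device utilization on the OSS itself. A simple sketch using standard tools; the log file paths are arbitrary:

  # Extended per-device I/O statistics every 5 seconds
  # (watch await and %util for the two 3ware volumes)
  iostat -x -k 5 | tee /var/tmp/oss-iostat.log
  # Run queue, blocked tasks, memory and swap pressure
  vmstat 5 | tee /var/tmp/oss-vmstat.log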
On Wed, 2008-07-23 at 17:20 +0200, Thomas Roth wrote:
> Hi,

Hello,

> Well, in these cases the machine is simply dead: the jobs writing via
> Lustre have stopped with "write failed: Input/output error",

Are there any messages on the console of such a machine when it's hung? Can you get a stack trace (i.e. sysrq-t) of the processes on the hung machine?

> and I can't get into the machine via ssh or the console; the only thing
> I can do is a hard reset. That's why I suspected the hardware first.

Indeed. The (serial) console is the best source of information in this sort of case. Hopefully you are logging it and can retrieve the messages prior to the hang.

> Where else could I look for overloaded hardware?

Not sure. That's quite hardware specific.

> Any way to find out how many OST threads our hardware can handle?

Well, you could run some iokit benchmarks and find out where your performance plateaus as you increase the number of threads to a single OST.

b.
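For the sysrq-t part: assuming magic SysRq is enabled in the kernel (it usually is on stock Debian kernels), the task dump can be armed and triggered from a shell or the console like this; these are standard kernel interfaces, nothing Lustre-specific:

  # Allow all magic SysRq functions (can also go into /etc/sysctl.conf as kernel.sysrq = 1)
  echo 1 > /proc/sys/kernel/sysrq
  # Dump the stack of every task to the kernel log / console,
  # same as pressing Alt-SysRq-t on the local keyboard
  echo t > /proc/sysrq-trigger

If the box is too wedged for even that, the same dump can often still be triggered over a serial console by sending a break followed by "t".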
>> Where else could I look for overloaded hardware?
>
> Not sure. That's quite hardware specific.

You could run collectl and then, after you reset the system, log back in and look at what was happening right before the reset. This will let you look at CPU, interrupts, memory, network and a variety of other things, including Lustre-level stats such as I/O rates and even RPC stats. You'll also be able to see which processes were running, in a format similar to ps, or you can just play back the data with the --top switch. If you feel 10-second samples aren't frequent enough, you can always set your interval down to 1 second or even lower...

-mark
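To make that concrete, a playback along those lines might look like the sketch below; the file name, host name and time window are made up, and the exact switches may differ between collectl versions:

  # Replay CPU, network and per-OST Lustre stats from the daemon's log,
  # restricted to the minutes around the hang (times are examples)
  collectl -p /var/log/collectl/lxfs89-20080724-000100.raw.gz -scnL -oT --from 05:15 --thru 05:25
  # Replay the process data top-style to see what was running just before the reset
  collectl -p /var/log/collectl/lxfs89-20080724-000100.raw.gz --top --from 05:15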
Well, guess what - I did that. These OSSs are all running collectl already ;-)

At first I had, in addition to the collectl daemon, an xterm with "collectl -sL -od", which at least gave me the point in time when the machine stopped. It didn't make me any wiser; at least there was no Lustre write activity any more at the time of the crash.

On the last try I added 'c' and 'n' and found that in the last minute the CPU load had risen to 78. No Lustre or network activity, though. That's a bit much, but it would be a pity if that were sufficient to crash a server:

### RECORD 27098 >>> lxfs89 <<< (1216869685.007) (Thu Jul 24 05:21:25 2008) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
#               USER NICE  SYS WAIT IRQ SOFT STEAL IDLE INTR CTXSW PROC RUNQ RUN  AVG1  AVG5 AVG15
07/24 05:21:25     0    0   37   10   0    0     0   51  462   281    0  538  16 79.11 78.64 76.93

# LUSTRE FILESYSTEM SINGLE OST STATISTICS
#               Ost      KBRead Reads KBWrite Writes
07/24 05:21:25  OST0015       0     0       0      0
07/24 05:21:25  OST0016       0     0       0      0

# NETWORK STATISTICS (/sec)
#               Num  Name  KBIn PktIn SizeIn MultI CmpI ErrIn KBOut PktOut SizeO CmpO ErrOut
07/24 05:21:25    0   lo:     0     0      0     0    0     0     0      0     0    0      0
07/24 05:21:25    1 eth0:    28   385     76     0    0     0     6     29   230    0      0
07/24 05:21:25    2 eth1:     0     0      0     0    0     0     0      0     0    0      0
07/24 05:21:25    3 eth2:     0     0      0     0    0     0     0      0     0    0      0

Then I actually started reading man pages and was able to extract some info from the collectl log file. Playing it back with -scnL told me that writing to the disk (to the log) stopped 5 hours before the last activity I'd seen in the xterm. The CPU load was 37 at that moment, and there were still packets coming in and being written to the two OSTs:

### RECORD 126 >>> lxfs89 <<< (1216851732.479) (Thu Jul 24 00:22:12 2008) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# USER NICE  SYS WAIT IRQ SOFT STEAL IDLE INTR CTXSW PROC RUNQ RUN  AVG1  AVG5 AVG15
     0    0    4   19   0    0     0   75  729  3302    0  471   9 36.87 35.96 34.36

# LUSTRE FILESYSTEM SINGLE OST STATISTICS
# Ost     KBRead Reads KBWrite Writes
OST0015        0     0    2751      2
OST0016        0     0    2955      2

# NETWORK STATISTICS (/sec)
# Num  Name  KBIn PktIn SizeIn MultI CmpI ErrIn KBOut PktOut SizeO CmpO ErrOut
    0   lo:     0     0      0     0    0     0     0      0     0    0      0
    1 eth0:  5139  3559   1478     0    0     0   116   1475    81    0      0
    2 eth1:     0     0      0     0    0     0     0      0     0    0      0
    3 eth2:     0     0      0     0    0     0     0      0     0    0      0

However, I have not yet gotten further in learning the abilities of collectl or the interpretation of its output.

In another xterm window I had htop running, though. It stopped with three 100% processes on top, ll_ost_io_42, ll_ost_io_59 and ll_ost_io_01, each of which had been running for 4h 58m. That fits the 5-hour gap mentioned above.

Still I don't have a clue as to what actually causes this behavior or how to avoid it. On the next crash I'll try to get a stack trace, and logging the console to more than the xterm buffer is surely something we ought to do as well.

Thanks for your advice,
Thomas

Mark Seger wrote:
>>> Where else could I look for overloaded hardware?
>>
>> Not sure. That's quite hardware specific.
>
> You could run collectl and then, after you reset the system, log back in
> and look at what was happening right before the reset. This will let you
> look at CPU, interrupts, memory, network and a variety of other things,
> including Lustre-level stats such as I/O rates and even RPC stats. You'll
> also be able to see which processes were running, in a format similar to
> ps, or you can just play back the data with the --top switch. If you feel
> 10-second samples aren't frequent enough, you can always set your interval
> down to 1 second or even lower...
>
> -mark

--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453
Fax: +49-6159-71 2986

Gesellschaft für Schwerionenforschung mbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528
Geschäftsführer: Professor Dr. Horst Stöcker
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
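Regarding the console logging: if a serial console is not readily available, the kernel's stock netconsole module can stream the kernel log over UDP to another machine, which would catch a panic or lockup message even with the box otherwise dead. A rough sketch; all IP addresses, the MAC and the interface name are made-up placeholders:

  # On the OSS: forward printk output to a log host (addresses are examples)
  modprobe netconsole netconsole=6665@10.10.0.89/eth0,6666@10.10.0.1/00:11:22:33:44:55
  # Raise the console log level so nothing gets filtered out before the hang
  dmesg -n 8

  # On the log host: capture whatever arrives (traditional netcat syntax)
  nc -u -l -p 6666 | tee lxfs89-console.log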
Hi Thomas,

On Thursday 24 July 2008 09:24:11 am Thomas Roth wrote:
> On the next crash I'll try to get a stack trace, and logging the console
> to more than the xterm buffer is surely something we ought to do as well.

If you don't know it or use it already, maybe you could give netdump a try:
http://www.redhat.com/support/wpapers/redhat/netdump/

It basically allows you to get crash dumps and stack traces from a remote machine. Very useful for gathering Lustre debug information.

Cheers,
--
Kilian