Peter Kjellström
2006-May-19 07:36 UTC
[Lustre-discuss] OSS hangs (almost) after a few hours
Thanks for a very detailed answer, see inline comments below.

On Friday 03 March 2006 18:39, Andreas Dilger wrote:
> On Mar 03, 2006 12:13 +0100, Peter Kjellström wrote:
> > * I have OSTs larger than 2 TiB (the whole point with this test), I will
> > soon rerun this with identical config and limited OSTs.

This I have now done; the setup was identical but with 1.98 TiB OSTs.
Results were the same: one OSS dropped to pathetic performance after
~600 gigs written.

> > However, I sent this directly since I don't think it's a 2 TiB problem,
> > because: 1) it stopped after just about 200 gigs (per OST) 2) >2 TiB ext3
> > was tested before lustre.
>
> Even with < 200GB written, it is possible that the files are being written
> beyond the 2TB limit. If you run "lfs getstripe" on the file that was
> causing the slowness and find the objid for the slow OST, then on that
> OST run 'OBJ={objid} debugfs -R "stat O/0/d$((OBJ % 32))/$OBJ" /dev/{ostdev}'
> this will dump the block allocation for that object.

Neat trick, I'll save this one. I have, however, already overwritten that
fs with a new one (see above).

> > 1) A lustre process spins like crazy when the OSS stops:
> >   PID USER PR NI VIRT RES SHR S %CPU %MEM   TIME+   COMMAND
> >  2996 root 15  0    0   0   0 R 99.9  0.0 13:44.89 ll_ost_21
>
> It appears from the logs that the "slow" process is stuck allocating
> blocks. The below trace is typical of the stuck processes (all of the
> other ones are blocked waiting on an allocation semaphore that this one
> is holding).
>
> ll_ost_00   R  running task   0  3605  1  3606  3604 (L-TLB)
> 000001000000d780 000001000000db10 0000000000000000 0000000000000246
> 0000000000000046 0000000000000246 0000010075f368c0 00000100610d0240
> 00000100044d9090 00000000ffffffff
> Call Trace:
>   <ffffffff80156c9c>{find_get_page+65}
>   <ffffffff80156c9c>{find_get_page+65}
>   <ffffffff8017721b>{__find_get_block_slow+62}
>   <ffffffff801779be>{__find_get_block+162}
>   <ffffffffa0048852>{:jbd:journal_get_undo_access+258}
>   <ffffffff80179d71>{__getblk+20}
>   <ffffffff80179dae>{__bread+6}
>   <ffffffffa0710d10>{:ldiskfs:read_block_bitmap+50}
>   <ffffffffa0711d8f>{:ldiskfs:ldiskfs_new_block_old+631}
>   <ffffffffa072ad63>{:ldiskfs:ldiskfs_new_block+27}
>   <ffffffffa07145fe>{:ldiskfs:ldiskfs_alloc_block+7}
>   <ffffffffa0716078>{:ldiskfs:ldiskfs_get_block_handle+881}
>   <ffffffffa0717f5f>{:ldiskfs:ldiskfs_map_inode_page+340}
>   <ffffffffa07436cd>{:fsfilt_ldiskfs:fsfilt_ldiskfs_map_bm_inode_pages+101}
>   <ffffffffa0743975>{:fsfilt_ldiskfs:fsfilt_ldiskfs_map_inode_pages+122}
>   <ffffffffa077de17>{:obdfilter:filter_direct_io+1765}
>   <ffffffffa078018e>{:obdfilter:filter_commitrw_write+6367}
>   <ffffffffa06eb978>{:ost:obd_commitrw+1110}
>   <ffffffffa06f37c8>{:ost:ost_brw_write+14405}
>   <ffffffff80132599>{default_wake_function+0}
>   <ffffffffa06ebace>{:ost:ost_bulk_timeout+0}
>   <ffffffffa06fb75f>{:ost:ost_handle+25399}
>   <ffffffff80131379>{finish_task_switch+55}
>   <ffffffff80302918>{thread_return+42}
>   <ffffffffa0666963>{:ptlrpc:ptlrpc_main+7123}
>
> It would probably be useful to get an oprofile dump

I'll try to get you one during the day. Thanks once again for your detailed
answer; my next plan after oprofile is to try 1.4.6.

Regards,
 Peter

> while the system is so slow to see what is actually consuming so much CPU.
> The routines in question are really part of ext3 itself so it is a bit
> puzzling why this would show up under Lustre and not ext3. The thread is
> allocating 256 blocks at one time, but this should not take 120s.
> It appears that "find_get_page" is always at the top of the stack for the
> process that is actually running, so this is suspicious.
>
> > 2) dmesg is filled with lines like:
> >   ll_ost_29 D 0000000102378fc4 0 3634 1 3635 3633 (L-TLB)
> > and:
> >   LustreError: dumping log to /tmp/lustre-log-d9.1141280676.3634
>
> This is lustre diagnostics dumping the stacks of processes that are
> not responsive (i.e. taking > 100s to process a request). This allows
> debugging, such as we are doing here, instead of "system is hung".
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.

--
------------------------------------------------------------
 Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se
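As a rough sketch of how such a dump can be collected with the opcontrol/opreport
tools of that era; the vmlinux path and the 60-second window are assumptions,
not taken from the thread:

    # assumed path to the uncompressed kernel image with debug symbols
    opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
    opcontrol --reset
    opcontrol --start
    sleep 60          # sample while the OSS is in its slow state
    opcontrol --stop
    # kernel-only report, sorted by sample count
    opreport -l
    # include module symbols (ldiskfs, jbd, lustre) by adding the module path
    opreport -l -p /lib/modules/$(uname -r)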
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] OSS hangs (almost) after a few hours
On Mar 03, 2006 12:13 +0100, Peter Kjellström wrote:
> * I have OSTs larger than 2 TiB (the whole point with this test), I will soon
> rerun this with identical config and limited OSTs. However, I sent this
> directly since I don't think it's a 2 TiB problem, because: 1) it stopped
> after just about 200 gigs (per OST) 2) >2 TiB ext3 was tested before lustre.

Even with < 200GB written, it is possible that the files are being written
beyond the 2TB limit. If you run "lfs getstripe" on the file that was
causing the slowness and find the objid for the slow OST, then on that
OST run 'OBJ={objid} debugfs -R "stat O/0/d$((OBJ % 32))/$OBJ" /dev/{ostdev}'
this will dump the block allocation for that object.

> 1) A lustre process spins like crazy when the OSS stops:
>   PID USER PR NI VIRT RES SHR S %CPU %MEM   TIME+   COMMAND
>  2996 root 15  0    0   0   0 R 99.9  0.0 13:44.89 ll_ost_21

It appears from the logs that the "slow" process is stuck allocating
blocks. The below trace is typical of the stuck processes (all of the other
ones are blocked waiting on an allocation semaphore that this one is
holding).

ll_ost_00   R  running task   0  3605  1  3606  3604 (L-TLB)
000001000000d780 000001000000db10 0000000000000000 0000000000000246
0000000000000046 0000000000000246 0000010075f368c0 00000100610d0240
00000100044d9090 00000000ffffffff
Call Trace:
  <ffffffff80156c9c>{find_get_page+65}
  <ffffffff80156c9c>{find_get_page+65}
  <ffffffff8017721b>{__find_get_block_slow+62}
  <ffffffff801779be>{__find_get_block+162}
  <ffffffffa0048852>{:jbd:journal_get_undo_access+258}
  <ffffffff80179d71>{__getblk+20}
  <ffffffff80179dae>{__bread+6}
  <ffffffffa0710d10>{:ldiskfs:read_block_bitmap+50}
  <ffffffffa0711d8f>{:ldiskfs:ldiskfs_new_block_old+631}
  <ffffffffa072ad63>{:ldiskfs:ldiskfs_new_block+27}
  <ffffffffa07145fe>{:ldiskfs:ldiskfs_alloc_block+7}
  <ffffffffa0716078>{:ldiskfs:ldiskfs_get_block_handle+881}
  <ffffffffa0717f5f>{:ldiskfs:ldiskfs_map_inode_page+340}
  <ffffffffa07436cd>{:fsfilt_ldiskfs:fsfilt_ldiskfs_map_bm_inode_pages+101}
  <ffffffffa0743975>{:fsfilt_ldiskfs:fsfilt_ldiskfs_map_inode_pages+122}
  <ffffffffa077de17>{:obdfilter:filter_direct_io+1765}
  <ffffffffa078018e>{:obdfilter:filter_commitrw_write+6367}
  <ffffffffa06eb978>{:ost:obd_commitrw+1110}
  <ffffffffa06f37c8>{:ost:ost_brw_write+14405}
  <ffffffff80132599>{default_wake_function+0}
  <ffffffffa06ebace>{:ost:ost_bulk_timeout+0}
  <ffffffffa06fb75f>{:ost:ost_handle+25399}
  <ffffffff80131379>{finish_task_switch+55}
  <ffffffff80302918>{thread_return+42}
  <ffffffffa0666963>{:ptlrpc:ptlrpc_main+7123}

It would probably be useful to get an oprofile dump while the system is
so slow to see what is actually consuming so much CPU. The routines in
question are really part of ext3 itself so it is a bit puzzling why this
would show up under Lustre and not ext3. The thread is allocating 256
blocks at one time, but this should not take 120s. It appears that
"find_get_page" is always at the top of the stack for the process that
is actually running, so this is suspicious.

> 2) dmesg is filled with lines like:
>   ll_ost_29 D 0000000102378fc4 0 3634 1 3635 3633 (L-TLB)
> and:
>   LustreError: dumping log to /tmp/lustre-log-d9.1141280676.3634

This is lustre diagnostics dumping the stacks of processes that are
not responsive (i.e. taking > 100s to process a request). This allows
debugging, such as we are doing here, instead of "system is hung".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
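Spelled out as a step-by-step sketch of the lookup Andreas describes; the
file name, objid, OST index, and device path below are placeholders for
illustration:

    # on a client: list the objects backing the slow file
    lfs getstripe /mnt/lustre/sob2/testfile.6
    # suppose the slow OST is obdidx 1 and its objid for this file is 9;
    # then on that OSS, dump the object's block allocation from the raw device
    OBJ=9
    debugfs -R "stat O/0/d$((OBJ % 32))/$OBJ" /dev/vg0/lv0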
Peter Kjellström
2006-May-19 07:36 UTC
[Lustre-discuss] OSS hangs (almost) after a few hours
Hello

Short description of problem:

After successful creation of a new lustre filesystem on my testrig I started
writing 100 gig files from the single configured client. After seven files
one OSS dumped ugly stuff on dmesg and almost died (write speed now
<<1 MiB/s). After a restart the write process continued ok until file 35,
when the other OSS did a similar thing.

Things I did that could be part of the problem:

* I have OSTs larger than 2 TiB (the whole point with this test), I will soon
  rerun this with identical config and limited OSTs. However, I sent this
  directly since I don't think it's a 2 TiB problem, because: 1) it stopped
  after just about 200 gigs (per OST) 2) >2 TiB ext3 was tested before lustre.
* I have rebuilt both the lustre kernel and userspace+modules; I had to,
  since otherwise I can't build my 3ware driver (see thread "lustre kernel
  package is misbuilt for RHEL4/centos-4").

Config:

1 client, 1 mds, 2 oss, all four machines with the same software
os: centos-4.2 (rhel4u2 rebuild) x86_64
hw: dual xeon 3.2 GHz, gigabit eth only, 2-4 GiB ram per node
oss1 has:
  (ost1) /dev/vg0/lv0  5.09 TiB  striped over two 3ware 9500 cards
oss2 has:
  (ost2) /dev/vg0/lv0  3.18 TiB  single 3ware 9550-SX card
  (ost3) /dev/vg0/lv1  3.18 TiB  single 3ware 9550-SX card
kernel: direct rebuild of shipped 1.4.5.2 lustre kernel
lustre modules+userspace: rebuilt against my kernel, 1.4.5.2
lfs used to make files striped over all OSTs (see details)

Details (in no particular order):

1) A lustre process spins like crazy when the OSS stops:
  PID USER PR NI VIRT RES SHR S %CPU %MEM   TIME+   COMMAND
 2996 root 15  0    0   0   0 R 99.9  0.0 13:44.89 ll_ost_21

2) dmesg is filled with lines like:
  ll_ost_29 D 0000000102378fc4 0 3634 1 3635 3633 (L-TLB)
and:
  LustreError: dumping log to /tmp/lustre-log-d9.1141280676.3634

3) full dmesg, lustre-log files (and for oss2 vmstat, top, config.xml)
available as tar.gz files at:
oss1/incident1:
  www.nsc.liu.se/~cap/lustre-error-2006-03-02-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix.tar.gz
oss2/incident2:
  www.nsc.liu.se/~cap/lustre-error-2006-03-03-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix.tar.gz

4) lfs getstripe for the directory with the testfiles:
OBDS:
0: ost1_UUID ACTIVE
1: ost2_UUID ACTIVE
2: ost3_UUID ACTIVE
sob2/
default stripe_count: -1 stripe_size: 0 stripe_offset: -1
sob2/testfile.0
        obdidx   objid   objid   group
             2       3     0x3       0
             0       3     0x3       0
             1       3     0x3       0
sob2/testfile.1
        obdidx   objid   objid   group
             0       4     0x4       0
             1       4     0x4       0
             2       4     0x4       0
... and so on for about 35 files

Maybe this makes sense to somebody,
 Peter (who still hopes to be able to run lustre on his new servers and not nfs...)

--
------------------------------------------------------------
 Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se
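The report doesn't include the exact write commands; a minimal sketch of the
kind of workload described (as many 100 gig files as fit, written sequentially
into one striped directory), with the client mount point and the use of dd
being assumptions:

    # write 100 GiB files into one directory on the lustre mount until it fills
    mkdir -p /mnt/lustre/sob2
    i=0
    while dd if=/dev/zero of=/mnt/lustre/sob2/testfile.$i bs=1M count=102400; do
        i=$((i + 1))
    done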
Peter Kjellström
2006-May-19 07:36 UTC
[Lustre-discuss] OSS hangs (almost) after a few hours
On Friday 03 March 2006 18:39, Andreas Dilger wrote:
> On Mar 03, 2006 12:13 +0100, Peter Kjellström wrote:
> ...
> It would probably be useful to get an oprofile dump while the system is
> so slow to see what is actually consuming so much CPU.

Here's an oprofile dump with only kernel symbols (I'll see if I can dig up
lustre and jbd symbols too):

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        app name                                     symbol name
59788    51.5640  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  mwait_idle
15213    13.1204  jbd                                          (no symbols)
14579    12.5736  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  find_get_page
10001     8.6253  ldiskfs                                      (no symbols)
 5160     4.4502  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  __find_get_block
 2934     2.5304  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  __find_get_block_slow
 2798     2.4131  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  __brelse
 2288     1.9733  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  wake_up_buffer
  987     0.8512  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  unlock_buffer
  867     0.7477  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  put_page
  398     0.3433  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  __might_sleep
  181     0.1561  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  __getblk
  131     0.1130  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  __bread
   94     0.0811  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  bh_waitq_head
   76     0.0655  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  kfree
   74     0.0638  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  mark_page_accessed
   49     0.0423  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  buffered_rmqueue
   27     0.0233  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  __kmalloc
   22     0.0190  e1000                                        (no symbols)
   22     0.0190  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  kmem_cache_alloc
   18     0.0155  libcrypto.so.0.9.7a                          (no symbols)
   18     0.0155  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  kmem_cache_free
   17     0.0147  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  copy_user_generic
   15     0.0129  oprofiled                                    (no symbols)
   13     0.0112  vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix  cache_alloc_refill
...

(output from opreport -l after a 60 sec sampling window)

/Peter

> ...

--
------------------------------------------------------------
 Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se
Peter Kjellström
2006-May-19 07:36 UTC
[Lustre-discuss] OSS hangs (almost) after a few hours
On Monday 13 March 2006 18:49, Andreas Dilger wrote:
> On Mar 13, 2006 08:35 +0100, Peter Kjellström wrote:
> > I'm gearing up for a bugreport to redhat and/or lkml and/or ext2-devel
> > but I've yet to decide what to include and such... The two worst things
> > right now are that 1) it's not 100% reproducible, the best I have is 9 out
> > of 10 on a specific setup 2) I can't reproduce at all on a 2.6.15.6
> > kernel.org kernel. (I've been running 2.6.9-22smp, 2.6.9-22.0.2smp with
> > and without lustre patches, all on x86_64)
>
> This sounds just like a bug previously filed against the RHEL kernel,
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=156437

It does indeed; too bad I didn't find that bugzilla entry when I did my
initial trawl :-/ I'll verify it asap and report back.

/Peter

> That was fixed in the just-released 2.6.9-34 kernel I now see, with patches
> backported from the vanilla kernel. As a workaround you can also mount the
> filesystem with the "noreservation" option.
>
> > Some things I now know:
> > * I trigger the bug by writing lots of data to one directory (typically
> >   as many 100G files as I can fit)
> > * the bug can bite after as little as 38G written
> > * when the bug hits performance is reduced by ~ a factor of 1000
> > * when in bug mode the oprofile always looks as previously reported
> > * the bug seems to ignore file boundaries but it's hard to tell
> > * when in bug mode there's something wrong with block allocations, they
> >   seem to take a looong time
> > * when in bug mode if you write a 2nd file in that dir then the two files
> >   take every 2nd block on disk (block rsv very broken)
> > * when in bug mode if you write a 2nd file in another dir it runs normally
> > * if you are patient the bug disappears, that is, there is a limited part
> >   of the fs that is "cursed"
> > * if you remove the files and try again the bug hits at the same block
> > * if you remove the files, reboot and try again the bug hits at the same
> >   block
> > * if you recreate the fs the bug hits at another place
> > * depending on where the bug hits you get different performance (I've
> >   seen everything from 50 KiB/s to 500 KiB/s)
> > * there is never any file corruption, never any problems reported to dmesg
>
> Cheers, Andreas

--
------------------------------------------------------------
 Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se
Peter Kjellström
2006-May-19 07:36 UTC
[Lustre-discuss] OSS hangs (almost) after a few hours
On Monday 13 March 2006 19:22, Peter Kjellström wrote:
> On Monday 13 March 2006 18:49, Andreas Dilger wrote:
> > On Mar 13, 2006 08:35 +0100, Peter Kjellström wrote:
> > > I'm gearing up for a bugreport to redhat and/or lkml and/or ext2-devel
> > > but I've yet to decide what to include and such... The two worst things
> > > right now are that 1) it's not 100% reproducible, the best I have is 9
> > > out of 10 on a specific setup 2) I can't reproduce at all on a 2.6.15.6
> > > kernel.org kernel. (I've been running 2.6.9-22smp, 2.6.9-22.0.2smp with
> > > and without lustre patches, all on x86_64)
> >
> > This sounds just like a bug previously filed against the RHEL kernel,
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=156437
>
> It does indeed; too bad I didn't find that bugzilla entry when I did my
> initial trawl :-/ I'll verify it asap and report back.

It has been verified that this bug is what makes lustre unusable on these
machines. The follow-up questions are of course:

* when will there be a 1.4.x release for/with the 2.6.9-34 (update3) kernel?
or
* how does one tell lustre to use -o noreservation?

/Peter

> > That was fixed in the just-released 2.6.9-34 kernel I now see, with
> > patches backported from the vanilla kernel. As a workaround you can also
> > mount the filesystem with the "noreservation" option.

--
------------------------------------------------------------
 Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] OSS hangs (almost) after a few hours
On Mar 20, 2006 14:42 +0100, Peter Kjellström wrote:
> * when will there be a 1.4.x release for/with the 2.6.9-34 (update3) kernel?

This is already in progress, and will be released shortly.

> * how does one tell lustre to use -o noreservation?

lmc --add ost ... --mountfsoptions "noreservation" ...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
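A sketch of how that could look in a complete 1.4-style config script for this
setup; apart from --mountfsoptions, the node, device, and lov names (and the
exact spelling of the other flags) are illustrative and should be checked
against the local lmc --help:

    # minimal sketch of adding an OST with reservations disabled (config.sh)
    lmc -m config.xml --add net --node oss2 --nid oss2 --nettype tcp
    lmc -m config.xml --add ost --node oss2 --lov lov1 --fstype ldiskfs \
        --dev /dev/vg0/lv0 --mountfsoptions "noreservation"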
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] OSS hangs (almost) after a few hours
On Mar 13, 2006 08:35 +0100, Peter Kjellström wrote:
> I'm gearing up for a bugreport to redhat and/or lkml and/or ext2-devel but
> I've yet to decide what to include and such... The two worst things right
> now are that 1) it's not 100% reproducible, the best I have is 9 out of 10
> on a specific setup 2) I can't reproduce at all on a 2.6.15.6 kernel.org
> kernel. (I've been running 2.6.9-22smp, 2.6.9-22.0.2smp with and without
> lustre patches, all on x86_64)

This sounds just like a bug previously filed against the RHEL kernel,
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=156437

That was fixed in the just-released 2.6.9-34 kernel I now see, with patches
backported from the vanilla kernel. As a workaround you can also mount the
filesystem with the "noreservation" option.

> Some things I now know:
> * I trigger the bug by writing lots of data to one directory (typically as
>   many 100G files as I can fit)
> * the bug can bite after as little as 38G written
> * when the bug hits performance is reduced by ~ a factor of 1000
> * when in bug mode the oprofile always looks as previously reported
> * the bug seems to ignore file boundaries but it's hard to tell
> * when in bug mode there's something wrong with block allocations, they
>   seem to take a looong time
> * when in bug mode if you write a 2nd file in that dir then the two files
>   take every 2nd block on disk (block rsv very broken)
> * when in bug mode if you write a 2nd file in another dir it runs normally
> * if you are patient the bug disappears, that is, there is a limited part
>   of the fs that is "cursed"
> * if you remove the files and try again the bug hits at the same block
> * if you remove the files, reboot and try again the bug hits at the same block
> * if you recreate the fs the bug hits at another place
> * depending on where the bug hits you get different performance (I've seen
>   everything from 50 KiB/s to 500 KiB/s)
> * there is never any file corruption, never any problems reported to dmesg

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
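For the plain-ext3 reproduction discussed elsewhere in the thread, the same
workaround is just an ext3 mount option; the device and mount point below are
placeholders:

    # mount the ext3 filesystem with block reservations disabled
    mount -t ext3 -o noreservation /dev/vg0/lv0 /mnt/ost1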
Peter Kjellström
2006-May-19 07:36 UTC
[Lustre-discuss] OSS hangs (almost) after a few hours
On Monday 06 March 2006 14:05, Peter Kjellström wrote:
> On Friday 03 March 2006 18:39, Andreas Dilger wrote:
> > On Mar 03, 2006 12:13 +0100, Peter Kjellström wrote:
> > ...
> > It would probably be useful to get an oprofile dump while the system is
> > so slow to see what is actually consuming so much CPU.
>
> Here's an oprofile dump with only kernel symbols (I'll see if I can dig up
> lustre and jbd symbols too):

I did manage to get debug symbols right for the modules too.

comment1: 50% idle is expected since it's an smp system
comment2: I cut the output when it went below 0.01%
comment3: 60 sec sample window, report by "opreport -l -p /lib/mod..."
comment4: vmlinux-lustre is shorthand for
          vmlinux-2.6.9-22.0.2.EL_lustre.1.4.5.2_cfix

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name      app name        symbol name
59759    49.8395  vmlinux-lustre  vmlinux-lustre  mwait_idle
15347    12.7995  vmlinux-lustre  vmlinux-lustre  find_get_page
 5638     4.7021  vmlinux-lustre  vmlinux-lustre  __find_get_block
 5379     4.4861  ldiskfs.ko      ldiskfs         ldiskfs_try_to_allocate_with_rsv
 4423     3.6888  jbd.ko          jbd             do_get_write_access
 4170     3.4778  ldiskfs.ko      ldiskfs         ldiskfs_get_group_desc
 3972     3.3127  jbd.ko          jbd             journal_add_journal_head
 3292     2.7456  vmlinux-lustre  vmlinux-lustre  __find_get_block_slow
 3037     2.5329  jbd.ko          jbd             journal_get_undo_access
 2894     2.4136  vmlinux-lustre  vmlinux-lustre  __brelse
 2497     2.0825  jbd.ko          jbd             journal_put_journal_head
 2439     2.0341  vmlinux-lustre  vmlinux-lustre  wake_up_buffer
 2380     1.9849  jbd.ko          jbd             journal_cancel_revoke
 1141     0.9516  vmlinux-lustre  vmlinux-lustre  unlock_buffer
  890     0.7423  vmlinux-lustre  vmlinux-lustre  put_page
  551     0.4595  ldiskfs.ko      ldiskfs         ldiskfs_new_block_old
  423     0.3528  vmlinux-lustre  vmlinux-lustre  __might_sleep
  305     0.2544  ldiskfs.ko      ldiskfs         goal_in_my_reservation
  220     0.1835  ldiskfs.ko      ldiskfs         read_block_bitmap
  184     0.1535  vmlinux-lustre  vmlinux-lustre  __getblk
  150     0.1251  vmlinux-lustre  vmlinux-lustre  __bread
  101     0.0842  vmlinux-lustre  vmlinux-lustre  bh_waitq_head
   80     0.0667  vmlinux-lustre  vmlinux-lustre  kfree
   75     0.0626  vmlinux-lustre  vmlinux-lustre  mark_page_accessed
   50     0.0417  ldiskfs.ko      ldiskfs         ldiskfs_group_sparse
   33     0.0275  vmlinux-lustre  vmlinux-lustre  __kmalloc
   33     0.0275  vmlinux-lustre  vmlinux-lustre  buffered_rmqueue
   26     0.0217  jbd.ko          jbd             journal_commit_transaction
   24     0.0200  vmlinux-lustre  vmlinux-lustre  kmem_cache_free
   23     0.0192  jbd.ko          jbd             __journal_file_buffer
   22     0.0183  vmlinux-lustre  vmlinux-lustre  copy_user_generic
   21     0.0175  jbd.ko          jbd             journal_release_buffer
   21     0.0175  vmlinux-lustre  vmlinux-lustre  cache_alloc_refill
   18     0.0150  vmlinux-lustre  vmlinux-lustre  kmem_cache_alloc
   17     0.0142  vmlinux-lustre  vmlinux-lustre  kmem_getpages
   14     0.0117  jbd.ko          jbd             journal_refile_buffer
   13     0.0108  jbd.ko          jbd             __journal_remove_journal_head
   13     0.0108  jbd.ko          jbd             __journal_unfile_buffer

/Peter

--
------------------------------------------------------------
 Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se
Peter Kjellström
2006-May-19 07:36 UTC
[Lustre-discuss] OSS hangs (almost) after a few hours
Thought it was time for a status update here. I've been hard at work trying
to figure out when I can and when I can't reproduce this. So far I've at
least been able to reproduce it with devices < 2.0 TiB and without lustre
(on plain ext3).

I'm gearing up for a bugreport to redhat and/or lkml and/or ext2-devel but
I've yet to decide what to include and such... The two worst things right
now are that 1) it's not 100% reproducible, the best I have is 9 out of 10
on a specific setup 2) I can't reproduce at all on a 2.6.15.6 kernel.org
kernel. (I've been running 2.6.9-22smp, 2.6.9-22.0.2smp with and without
lustre patches, all on x86_64)

If you have any recommendations on where to send it and what to include I'd
be grateful.

Some things I now know:
* I trigger the bug by writing lots of data to one directory (typically as
  many 100G files as I can fit)
* the bug can bite after as little as 38G written
* when the bug hits performance is reduced by ~ a factor of 1000
* when in bug mode the oprofile always looks as previously reported
* the bug seems to ignore file boundaries but it's hard to tell
* when in bug mode there's something wrong with block allocations, they seem
  to take a looong time
* when in bug mode if you write a 2nd file in that dir then the two files
  take every 2nd block on disk (block rsv very broken); a sketch of how to
  check this follows after this message
* when in bug mode if you write a 2nd file in another dir it runs normally
* if you are patient the bug disappears, that is, there is a limited part of
  the fs that is "cursed"
* if you remove the files and try again the bug hits at the same block
* if you remove the files, reboot and try again the bug hits at the same block
* if you recreate the fs the bug hits at another place
* depending on where the bug hits you get different performance (I've seen
  everything from 50 KiB/s to 500 KiB/s)
* there is never any file corruption, never any problems reported to dmesg

I also have lots of data from lots of experiments (vmstat, debugfs output,
oprofile, timelines, etc.)

Best Regards,
 Peter

--
------------------------------------------------------------
 Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se
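A minimal sketch of how the every-other-block interleaving could be confirmed
on the plain-ext3 reproducer, using dd plus filefrag from e2fsprogs; the mount
point, directory, and sizes are placeholders, not taken from the report:

    # with the fs already in "bug mode", write two new files into the affected dir
    dd if=/dev/zero of=/mnt/test/dir1/a bs=1M count=1024
    dd if=/dev/zero of=/mnt/test/dir1/b bs=1M count=1024
    # list the block layout of both files; heavily fragmented, alternating
    # allocations would point at broken block reservations
    filefrag -v /mnt/test/dir1/a /mnt/test/dir1/b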