nic@cray.com
2007-Jan-17 13:51 UTC
[Lustre-devel] [Bug 11540] SLES10 + 1.4.9pre: obdfilter shows large write performance degradation
Please don't reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11540
I still owe you folks the obdfilter studies, but we have some tangential data
that might prove interesting.
We have been chasing a slow NFS write problem in parallel with this issue --
namely that parallel compiles on an NFS-exported ext3 filesystem were taking
nearly 10 minutes instead of 5 seconds.
We were able to capture server-side stack traces during one of these slow
periods, and it appears that all of the nfsd threads are stuck in
log_wait_commit(), looping on their transaction id (tid) and comparing it
against journal->j_commit_sequence. My view is that they are waiting for their
transaction id to be committed to disk.
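For reference, the wait those threads are sitting in looks roughly like the
following (a paraphrase of the 2.6-era fs/jbd/journal.c log_wait_commit();
exact details vary by kernel version) -- each caller simply blocks until
kjournald advances j_commit_sequence past its tid:

int log_wait_commit(journal_t *journal, tid_t tid)
{
        int err = 0;

        spin_lock(&journal->j_state_lock);
        while (tid_gt(tid, journal->j_commit_sequence)) {
                /* poke kjournald, then sleep until the commit
                 * sequence catches up to the tid we care about */
                wake_up(&journal->j_wait_commit);
                spin_unlock(&journal->j_state_lock);
                wait_event(journal->j_wait_done_commit,
                           !tid_gt(tid, journal->j_commit_sequence));
                spin_lock(&journal->j_state_lock);
        }
        spin_unlock(&journal->j_state_lock);

        if (unlikely(is_journal_aborted(journal)))
                err = -EIO;
        return err;
}

So every nfsd thread is effectively serialized behind kjournald finishing the
current commit.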
That leaves kjournald as the only remaining suspect, and it appears to be stuck
in sync_dirty_buffer(), called from journal_write_commit_record():
kjournald D ffff810072dabbb8 0 2849 1 2866 2784 (L-TLB)
ffff810072dabbb8 ffff810072dabb78 ffffffff8028e48a ffff81007c6f4340
0000000000000561 ffffffff803ca180 ffff81007c6f4558 0000000000000008
0000000000000086 0000000000000086
Call Trace: <ffffffff8028e48a>{scsi_request_fn+858}
<ffffffff80356d25>{io_schedule+53}
<ffffffff8017e1fb>{sync_buffer+59}
<ffffffff80356f05>{__wait_on_bit+69}
<ffffffff8017e1c0>{sync_buffer+0}
<ffffffff8017e1c0>{sync_buffer+0}
<ffffffff80357006>{out_of_line_wait_on_bit+198}
<ffffffff80147b20>{wake_bit_function+0}
<ffffffff80147b20>{wake_bit_function+0}
<ffffffff8017e277>{__wait_on_buffer+39}
<ffffffff80181be2>{sync_dirty_buffer+162}
<ffffffff80205e2f>{journal_write_commit_record+159}
<ffffffff8020b566>{journal_free_journal_head+22}
<ffffffff8020b90e>{__journal_remove_journal_head+430}
<ffffffff80206bb2>{journal_commit_transaction+3314}
<ffffffff8013acac>{lock_timer_base+44}
<ffffffff8013af3f>{try_to_del_timer_sync+95}
<ffffffff8020948e>{kjournald+254}
<ffffffff80147ae0>{autoremove_wake_function+0}
<ffffffff80147ae0>{autoremove_wake_function+0}
<ffffffff80209380>{commit_timeout+0}
<ffffffff8010ce52>{child_rip+8}
<ffffffff80209390>{kjournald+0}
Reading through journal_write_commit_record() shows the error message we often
see on SLES9:
"JBD: barrier-based sync failed on %s - "
"disabling barriers\n",
We noticed this message was NOT being printed under SLES10. As a test, we
mounted the ext3 filesystem with "-o barrier=0" to see if this was the issue,
and lo and behold, the compiles are back down to 5 seconds.
What is not clear is (1) why barrier-based syncs are now working on SLES10 and
(2) why oh why they are so dang slow.
For the interested, here is an sd_iostats snapshot of the disk during the
barrier on/off test:
barrier on (SLOW):
eelc0-0c0s0n3:~ # cat /tmp/barrier_on_sdiostats
index: 0 snapshot_time: 386.275018
size reads total writes total
512 0 0 0 0
1024 0 0 0 0
2048 0 0 0 0
4096 0 0 14911 119288
8192 0 0 1699 30800
16384 0 0 801 33288
32768 0 0 821 71080
65536 0 0 129 20336
131072 0 0 4 1144
total 0 0 18365 275936
barrier off (FAST):
eelc0-0c0s0n3:~ # cat /proc/scsi/sd_iostats/sda
index: 0 snapshot_time: 57.314699
size reads total writes total
512 0 0 0 0
1024 0 0 0 0
2048 0 0 0 0
4096 11 88 5066 40528
8192 1 24 399 7384
16384 0 0 1575 67696
32768 0 0 286 22608
65536 0 0 86 14808
131072 0 0 69 24728
262144 0 0 74 50688
524288 0 0 63 64512
total 12 112 7618 292952
While this hasn't been tested on Lustre yet (lconf --mountfsoptions barrier=0),
we could sure use some help diagnosing this. I will be running the Lustre test
later today to see if this is the source of our woes there as well.