nic@cray.com
2007-Jan-17 13:51 UTC
[Lustre-devel] [Bug 11540] SLES10 + 1.4.9pre: obdfilter shows large write performance degradation
Please don't reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11540
I still owe you folks the obdfilter studies, but we have some tangential data
that might prove interesting.
We have been chasing a slow NFS write problem in parallel with this issue --
namely that parallel compiles on an NFS-exported ext3 filesystem were taking
nearly 10 minutes instead of 5 seconds.
We were able to capture server-side stack traces during one of these slow
periods, and it appears that all of the nfsd threads are stuck in
log_wait_commit(), looping on their transaction id (tid) and comparing it
against journal->j_commit_sequence. My view is that they are waiting for their
transaction id to be committed to disk.
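For reference, the wait those threads are sitting in looks roughly like the
following (a paraphrase of the 2.6-era fs/jbd/journal.c log_wait_commit();
exact details vary by kernel version) -- each caller simply blocks until
kjournald advances j_commit_sequence past its tid:

int log_wait_commit(journal_t *journal, tid_t tid)
{
        int err = 0;

        spin_lock(&journal->j_state_lock);
        while (tid_gt(tid, journal->j_commit_sequence)) {
                /* poke kjournald, then sleep until the commit
                 * sequence catches up to the tid we care about */
                wake_up(&journal->j_wait_commit);
                spin_unlock(&journal->j_state_lock);
                wait_event(journal->j_wait_done_commit,
                           !tid_gt(tid, journal->j_commit_sequence));
                spin_lock(&journal->j_state_lock);
        }
        spin_unlock(&journal->j_state_lock);

        if (unlikely(is_journal_aborted(journal)))
                err = -EIO;
        return err;
}

So every nfsd thread is effectively serialized behind kjournald finishing the
current commit.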
That leaves kjournald as the only remaining suspect, and it appears to be stuck
in sync_dirty_buffer(), called from journal_write_commit_record():
kjournald D ffff810072dabbb8 0 2849 1 2866 2784 (L-TLB)
ffff810072dabbb8 ffff810072dabb78 ffffffff8028e48a ffff81007c6f4340
0000000000000561 ffffffff803ca180 ffff81007c6f4558 0000000000000008
0000000000000086 0000000000000086
Call Trace: <ffffffff8028e48a>{scsi_request_fn+858}
<ffffffff80356d25>{io_schedule+53}
<ffffffff8017e1fb>{sync_buffer+59}
<ffffffff80356f05>{__wait_on_bit+69}
<ffffffff8017e1c0>{sync_buffer+0}
<ffffffff8017e1c0>{sync_buffer+0}
<ffffffff80357006>{out_of_line_wait_on_bit+198}
<ffffffff80147b20>{wake_bit_function+0}
<ffffffff80147b20>{wake_bit_function+0}
<ffffffff8017e277>{__wait_on_buffer+39}
<ffffffff80181be2>{sync_dirty_buffer+162}
<ffffffff80205e2f>{journal_write_commit_record+159}
<ffffffff8020b566>{journal_free_journal_head+22}
<ffffffff8020b90e>{__journal_remove_journal_head+430}
<ffffffff80206bb2>{journal_commit_transaction+3314}
<ffffffff8013acac>{lock_timer_base+44}
<ffffffff8013af3f>{try_to_del_timer_sync+95}
<ffffffff8020948e>{kjournald+254}
<ffffffff80147ae0>{autoremove_wake_function+0}
<ffffffff80147ae0>{autoremove_wake_function+0}
<ffffffff80209380>{commit_timeout+0}
<ffffffff8010ce52>{child_rip+8}
<ffffffff80209390>{kjournald+0}
Reading through journal_write_commit_record() shows the error message we often
see on SLES9:
"JBD: barrier-based sync failed on %s - "
"disabling barriers\n",
We noticed this message was NOT being printed under SLES10. As a test, we
mounted the ext3 filesystem with "-o barrier=0" to see if this was the issue,
and lo and behold, the compiles are back down to 5 seconds.
What is not clear is (1) why barrier-based syncs are now working on SLES10 and
(2) why oh why they are so dang slow.
For the interested, here is an sd_iostats snapshot of the disk during the
barrier on/off test:
barrier on (SLOW):
eelc0-0c0s0n3:~ # cat /tmp/barrier_on_sdiostats
index: 0 snapshot_time: 386.275018
size reads total writes total
512 0 0 0 0
1024 0 0 0 0
2048 0 0 0 0
4096 0 0 14911 119288
8192 0 0 1699 30800
16384 0 0 801 33288
32768 0 0 821 71080
65536 0 0 129 20336
131072 0 0 4 1144
total 0 0 18365 275936
barrier off (FAST):
eelc0-0c0s0n3:~ # cat /proc/scsi/sd_iostats/sda
index: 0 snapshot_time: 57.314699
size reads total writes total
512 0 0 0 0
1024 0 0 0 0
2048 0 0 0 0
4096 11 88 5066 40528
8192 1 24 399 7384
16384 0 0 1575 67696
32768 0 0 286 22608
65536 0 0 86 14808
131072 0 0 69 24728
262144 0 0 74 50688
524288 0 0 63 64512
total 12 112 7618 292952
While this hasn't been tested on Lustre yet (lconf --mountfsoptions barrier=0),
we could sure use some help diagnosing this. I will be running the Lustre test
later today to see if this is the source of our woes there as well.