On Thursday June 6, neilb@cse.unsw.edu.au wrote:
> I'm not in a position at the moment to spend time testing whether the
> inefficacy of bdflush is ext3 specific or applies to ext2 as well. I
> may try that next week (we have a long weekend coming up). If it
> applies to ext2, it could be a showstopper for 2.4.19....
Actually, I did find some time.
I added the attached patch to 2.4.19-pre8 + ext3 0.9.18.
It checks the "age" of buffers being written by bdflush and shows me
the high water mark.
By "age" I mean the time between the b_flushtime and now, if now is
after the b_flushtime. This is the
"Age-since-should-have-been-flushed"
rather than "Age-since-dirtied".
The high water mark is maintained separately for each scsi drive and
decays 1 second in 5 so that I can get a good idea of what is
happening.
What should I see?
Given the default bdflush settings of flush 30 second old buffers
every 5 seconds, I should see buffers with "ages" up to 5 seconds
being flushed.
Also, given that the lru_list[BUF_DIRTY] is kept in oldest-first
order, the first buffer on the list should set the high water mark,
and so the remaining buffers should not cause a message to be printed,
at least until the next run of write_some_buffers, by which time the
high water mark will have decayed.
And this is exactly what I see when using ext2:
Jun 6 12:22:35 elfman kernel: buffer on 4 has age 325
Jun 6 12:22:36 elfman kernel: buffer on 0 has age 32
Jun 6 12:22:41 elfman kernel: buffer on 4 has age 498
Jun 6 12:22:41 elfman kernel: buffer on 0 has age 57
Jun 6 12:22:46 elfman kernel: buffer on 4 has age 487
Jun 6 12:22:51 elfman kernel: buffer on 4 has age 446
Jun 6 12:22:56 elfman kernel: buffer on 4 has age 454
Jun 6 12:23:01 elfman kernel: buffer on 4 has age 486
Jun 6 12:23:06 elfman kernel: buffer on 4 has age 466
Jun 6 12:23:06 elfman kernel: buffer on 0 has age 57
Jun 6 12:23:11 elfman kernel: buffer on 4 has age 401
Jun 6 12:23:16 elfman kernel: buffer on 4 has age 440
Jun 6 12:23:16 elfman kernel: buffer on 0 has age 10
Jun 6 12:23:21 elfman kernel: buffer on 4 has age 411
Jun 6 12:23:46 elfman kernel: buffer on 4 has age 118
Jun 6 12:23:46 elfman kernel: buffer on 0 has age 4
Jun 6 12:23:51 elfman kernel: buffer on 4 has age 499
"0" == sda which has my root filesystem (ext2 at this point) and
"4" == sde which has a test ext2 filesystem.
But with ext3 I see different numbers.
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9511
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9512
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9513
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9514
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9515
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9516
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9517
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9518
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9520
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9521
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9529
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9530
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9531
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9532
Jun 6 12:04:44 elfman kernel: buffer on 5 has age 9533
5 == sdf with a test ext3 filesystem.
Two things to note:
1/ Ages up to 95 seconds (the numbers are in jiffies). This is the
worst I got in a little bit of testing; 20-30 seconds was quite common.
2/ Consecutive buffers have increasing age... young buffers are
in front of old buffers in the lru.
(I suspect that having a large journal, or using data=writeback, will
show the largest times.)
I tried the same test on 2.4.16 and got results like:
Jun 6 13:09:30 elfman kernel: buffer on 5 has age 529
Jun 6 13:09:31 elfman kernel: buffer on 5 has age 558
Jun 6 13:09:31 elfman kernel: buffer on 5 has age 560
Jun 6 13:09:31 elfman kernel: buffer on 5 has age 561
Jun 6 13:09:31 elfman kernel: buffer on 5 has age 563
Jun 6 13:09:36 elfman kernel: buffer on 5 has age 475
Jun 6 13:09:37 elfman kernel: buffer on 5 has age 471
Jun 6 13:09:37 elfman kernel: buffer on 5 has age 472
Jun 6 13:09:37 elfman kernel: buffer on 5 has age 475
Jun 6 13:09:37 elfman kernel: buffer on 5 has age 476
Jun 6 13:09:37 elfman kernel: buffer on 5 has age 477
Jun 6 13:09:37 elfman kernel: buffer on 5 has age 478
Jun 6 13:09:37 elfman kernel: buffer on 5 has age 488
Jun 6 13:09:37 elfman kernel: buffer on 5 has age 489
I never got the same loooong times (9 seconds was the max), but the
buffers are still in the wrong order.
My guess is that ext3 (actually jbd) is making buffers young without
pushing them to the end of the BUF_DIRTY queue (possibly by calling
set_buffer_flushtime - jbd is the only caller).
This causes bdflush to stop processing when it sees these buffers, and
to leave old buffers that are later on in the queue unflushed.
I don't know why the effect became so much worse in the latest code,
but it definitely did. In real life, my servers are occasionally
holding dirty buffers that must be over 10 minutes old, still playing
Pin the Tail on the Journal.
NeilBrown
---------------------------
--- fs/buffer.c	2002/05/26 23:13:05	1.2
+++ fs/buffer.c	2002/06/06 01:41:33
@@ -219,6 +219,28 @@
 		__refile_buffer(bh);
 		get_bh(bh);
 		array[count++] = bh;
+		if (MAJOR(bh->b_dev) == 8 && (MINOR(bh->b_dev)>>4) < 10) {
+			static unsigned long age[10];
+			static unsigned long then[10];
+			int i = MINOR(bh->b_dev)>>4;
+			unsigned long j = jiffies;
+			if (time_before(then[i]+20, j)) {
+				if (age[i] > (j-then[i])/5)
+					age[i] -= (j-then[i])/5;
+				else age[i] = 0;
+				then[i] = j;
+			}
+			if (time_before(bh->b_flushtime, j)) {
+				unsigned long a;
+				a = j - bh->b_flushtime;
+				if (a > age[i]) {
+					printk("buffer on %d has age %ld\n",
+						i, a);
+					age[i] = a;
+					then[i] = j;
+				}
+			}
+		}
 		if (count < NRSYNC)
 			continue;