Hi, as I mentioned in earlier mail to ext3-users I have been getting some corruption on an ext3 filesystem that has been serving NFS. I am now confident that I fully understand the problem and have a patch. It only affects data=journal mode and I wonder if it might also be the cause of the corruption noted by a number of people on linux-kernel. First I will explain the problem. Then display the patch. When a file that contains dirty pages is truncated, those pages must be dealt with. They can either be flushed to disc, or they can be invalidated. Either way they must be non-dirty before their location on disc is allocated to another file. If not, they might be flushed out to disc much later and overwrite something that has since been allocate the same location on disc. The pages are dealt with by truncate_inode_pages which ultimately calls journal_flushpage on each page which inturn calls journal_unmap_buffer. For data=writeback the data block will not be attached to the journal at all and the buffer is "Zapped" (marked as clean). For data=ordered the data block can only be dirty if it is part of the current transaction, or the committing transaction. (This is a consequence of the "order"ing.) If the current transaction, then it can be simply disposed on and removed from the current transaction. Again it is Zapped. If the committing transaction, then it will have to go to disc, and it will be flushed out before that transaction fully commits, and hence before any data in a subsequent transaction - that may be allocated to the same place of disc - has a chance of being written out. So there is no problem here. But for data=journal the situation is a bit different. Normally with data=journal, data blocks are written to the journal and then left in the pagecache in a dirty state to be eventually flushed to disc. Either by bdflush or by a journal checkpoint. journal_unmap_buffer will find that the buffer is attached to the journal, either via b_transaction or b_cp_transaction (i.e. an active part of a transaction, or a buffer that must be checkpointed before that transaction be be forgotten) If the latter case, the buffer gets reattached to the current transaction (if there is one) so it will be disposed of at when it completes. otherwise it is immedately Zapped. If the transaction it is attached to is the running transaction, then it is safe to just remove it from the transaction and never write it at all. But (and here we get to the meat of it all) if the transaction it is attached to is the committing transaction then it is not safe to discard it. We must leave it alone until that transaction has completed. We must arrange that when the transaction completes committing, the buffer then gets discards (and importantly, not left lying around dirty). But this *DOES*NOT*HAPPEN*. The code in question is at about line 1831 of fs/jbd/transaction.c and has the comment: /* If it is committing, we simply cannot touch it. We * can remove it's next_transaction pointer from the * running transaction if that is set, but nothing * else. */ This code really must arrange that a data=journal datablock never gets marked as dirty again (note that it isn't actualy dirty at this point, it is JBDDirty instead, to stop if from being written to disc prematurey). One option might be to simply clear the JBDDirty flag at this point. However this causes a number of assertions later on to fail, so it isn't a good idea. The approach that I have taken in the patch is to make use of the BH_Freed flag which is currently unused but is defined in jbd.h as BH_Freed, /* 1 if buffer has been freed (truncated) */ It seems perfect for the job. At this point in journal_unmap_buffer, we set BH_Freed on the buffer. Then in phase 7 of the commit, when we are looking at the t_forget buffers and considering if they are jdirty, and hence should be made really dirty again, we avoid doing so if BH_Freed is set. I am yet to actually try this patch out on a production system, partly because I wanted to be sure my understanding was good enough to be able to convincingly explain it in Email first, and partly because it is late in the week, and I would rather start using it on a monday morning when I can watch what is happening (for now, I am running data=ordered to avoid further problems). This patch is against 2.4.19-pre8 plus ext3 0.9.18. I have not been able to reproduce the problem except be letting 300 students loose on a live server, which isn't the most forgiving testing environment, so at the moment this patch is largely theoretical. If anyone has a test situation where that can reproduce any sort of ext3 corruption with data=journal, I would be keen to here how this patch affects things. The corruption that I see that I blame on this problem is directories begin overwritten with what looks like file contents and in one case (which required 90 minutes downtime for a fsck) some indirect blocks being over-written by file contents. NeilBrown --- ./fs/jbd/commit.c 2002/05/28 04:15:18 1.1 +++ ./fs/jbd/commit.c 2002/05/28 22:44:48 @@ -663,12 +663,13 @@ * there's no point in keeping a checkpoint record for * it. */ bh = jh2bh(jh); - if (buffer_jdirty(bh)) { + if (buffer_jdirty(bh) && !__buffer_state(bh, Freed)) { JBUFFER_TRACE(jh, "add to new checkpointing trans"); __journal_insert_checkpoint(jh, commit_transaction); JBUFFER_TRACE(jh, "refile for checkpoint writeback"); __journal_refile_buffer(jh); } else { + clear_bit(BH_Freed, &bh->b_state); J_ASSERT_BH(bh, !buffer_dirty(bh)); J_ASSERT_JH(jh, jh->b_next_transaction == NULL); __journal_unfile_buffer(jh); --- ./fs/jbd/transaction.c 2002/05/26 23:13:05 1.2 +++ ./fs/jbd/transaction.c 2002/05/28 09:24:45 @@ -1834,6 +1834,7 @@ * running transaction if that is set, but nothing * else. */ JBUFFER_TRACE(jh, "on committing transaction"); + set_bit(BH_Freed, &bh->b_state); if (jh->b_next_transaction) { J_ASSERT(jh->b_next_transaction = journal->j_running_transaction);
Stephen C. Tweedie
2002-May-31 15:03 UTC
Re: PATCH for filesys corruption in ext3 with data=journal
Hi, On Fri, May 31, 2002 at 12:33:23PM +1000, Neil Brown wrote:> as I mentioned in earlier mail to ext3-users I have been getting some > corruption on an ext3 filesystem that has been serving NFS. I am now > confident that I fully understand the problem and have a patch.The analysis looks right, but there's a slight problem with the patch --- it doesn't deal with a file on a 1k or 2k blocksize filesystem being partially truncated and then extended. I'll have a think about that, but this patch looks like it's on the right lines. Cheers, Stephen
Neil Brown
2002-Jun-04 02:48 UTC
Re: PATCH for filesys corruption in ext3 with data=journal
On Friday May 31, neilb@cse.unsw.edu.au wrote:> > Hi, > > as I mentioned in earlier mail to ext3-users I have been getting some > corruption on an ext3 filesystem that has been serving NFS. I am now > confident that I fully understand the problem and have a patch. > > It only affects data=journal mode and I wonder if it might also be the > cause of the corruption noted by a number of people on linux-kernel. > > First I will explain the problem. Then display the patch.I should follow-up to say that the patch isn't quite complete. Just next to + clear_bit(BH_Freed, &bh->b_state); we need + clear_bit(BH_JBDDirty, &bh->b_state); This seems to be working for me. NeilBrown> > > --- ./fs/jbd/commit.c 2002/05/28 04:15:18 1.1 > +++ ./fs/jbd/commit.c 2002/05/28 22:44:48 > @@ -663,12 +663,13 @@ > * there's no point in keeping a checkpoint record for > * it. */ > bh = jh2bh(jh); > - if (buffer_jdirty(bh)) { > + if (buffer_jdirty(bh) && !__buffer_state(bh, Freed)) { > JBUFFER_TRACE(jh, "add to new checkpointing trans"); > __journal_insert_checkpoint(jh, commit_transaction); > JBUFFER_TRACE(jh, "refile for checkpoint writeback"); > __journal_refile_buffer(jh); > } else { > + clear_bit(BH_Freed, &bh->b_state); > J_ASSERT_BH(bh, !buffer_dirty(bh)); > J_ASSERT_JH(jh, jh->b_next_transaction == NULL); > __journal_unfile_buffer(jh); > --- ./fs/jbd/transaction.c 2002/05/26 23:13:05 1.2 > +++ ./fs/jbd/transaction.c 2002/05/28 09:24:45 > @@ -1834,6 +1834,7 @@ > * running transaction if that is set, but nothing > * else. */ > JBUFFER_TRACE(jh, "on committing transaction"); > + set_bit(BH_Freed, &bh->b_state); > if (jh->b_next_transaction) { > J_ASSERT(jh->b_next_transaction => journal->j_running_transaction);