Hello.
After reading some mails from mailing list archives and other sources,
ext3 seems to having troubles in deleting files w/o destroying the data
neccessary to undelete them. Ok, I know what journaling means, and in any
way, the ext3 fs driver needs to keep track of which blocks have actually
been marked as being free. But now my question: Who say, that this
information must be stored in the original inode data? The answer is: Is
isn't neccessary at all. Ok, how then?
My prescription:
1.1. We need a fixed inode, I'll refer to that as the "deletion
inode"
here.
1.2. We need three blocks, whichs block numbers could be stored in the
super block. I'll refer to them as "deletion blocks" here.
Now, lets assume, a user enters rm foo.img where foo.img is a very big
file. The steps, the ext3fs driver currently does is (IIRC, at least):
2.1. Link the inode to a special list of "inodes to delete". -- no
change
here.
2.2. Check, if any inode is being delete in the moment. If not start it
now (go on with 3.1.). -- no change here.
Actual deletion: (That action initiated in step 2.2.)
Here, the procedure will completely changed:
3.1. Copy the inode (the 128 bytes, not the conten of the file ;-) to the
space for the "deletion inode", marking the original inode as
being
deleted + set dtime. This is one transaction.
-> If the "deletion inode" has a dtime != 0 on mount-time after a
crash,
recovery starts here.
3.2. Mark a bunch of blocks at the end of the file as free in the block
bitmaps. Check which indirect block, double indirect block and
trible indirect block needs to be updated, and copy them each in one
of the "deletion blocks", but only if they are not already in one
of
the "deletion blocks". Update the copies in the "deletion
blocks"
leaving the original ind, dind and tind blocks alone. All this is
done in one transaction.
3.3. Repeat 3.2. as long as there are still blocks beyond the 12 blocks
boundary.
3.4. Free the last blocks of the inode, and set dtime to zero (to prevent
later recoveries from staring over again). Again, one transaction.
3.5. Check if another inode is to be deleted. If yes, go on at 3.1.
A question which I would expect from you is, why it is enough to have
three blocks and why it's not neccessary to zero out all ind, dind and
tind blocks. For the answer, I want to summarize, how big files are
stored.
4.1. The first 12 block numbers are stored in the inode itself. This is a
good way to speed up access to small files.
4.2. The next 256 (512; 1024) block numbers are stored in a separate
block, which's blocknumber will be stored as 13. block number in the
inode.
4.3. The next 256^2 (512^2; 1024^2) block numbers are stored in 256 (512;
1024) separate blocks, which's block numbers are collected together
in a double indirect block, which's block number is stored as 14.
block number in the inode.
4.4. For the next 256^3 (512^3; 1024^3) blocks the same happens, but only
with another nexting level.
Ok, but how do we truncate? We take e.g. the last 256 blocks, and collect
their block numbers. Independantly of which block is the first and which
is the last to free in this step, there is one boundary. All blocks
before will not freed in this cycle, all blocks beyound it will be freed
in this cycle. When removing the 256 blocks at the end of the file, at
least one ind block needs to be updated. There are two cases in this
stage:
5.1. If all 256 blocks are managed by the same ind block, the whole ind
block would being zeroed out. In this case, we don't need to touch
the ind block as we simply remove it's entry in the dind block.
5.2. So let us assume that we delete some last blocks of one ind block
and some first blocks of another ind block. (This is the typical
case when freeing 256 blocks. The last n are managed by the last ind
block and 256-n blocks are managed by the previous ind block). In
this szenario, the latter ind block would be completely zeroed out
and thus is unused now, so we don't take further attention to it.
Only the first ind block needs to be updated. This is, what we need
the first "deletion block" for. But wait. The other ind block
(this
what would get zeroed out now) must be unlinked in it's maintaining
dind block. This is, where this case and case 1 meet again.
Now, the goal is to mark an ind block in a dind block as unused. (But
this is the same procedure like marking 256 blocks as free, but only by
removing one block instead of 256 blocks from the list).
Note: The worst case (3 "deletion blocks" needed) will even be the
worst
case when freeing the last allocated blocks of a file containing holes.
(Thus even, if that means deleting 256 blocks spread over hundreds of GB
in the inode's address space.)
I hope this was comprehensible, and if you still have questions, feel
free to ask me.
Regards, Bodo
--
And you are sure I should enter "format c:"?
Yes, "format c:" will help.