thr3ads.net - Btrfs devel - Rename+crash behaviour of btrfs

Jakob Unterwurzacher

2010-May-17 18:04 UTC

Rename+crash behaviour of btrfs - nearly ext3!

Hi!

Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
solve them all. And it nearly does! Now I wonder if the remaining 0.2
seconds window of exposing 0-size files could be closed too.

I tested using two simple scripts (attached for reference) on kernel
2.6.34-rc7:
- rentest creates files $i.tmp and renames to $i.cur,
- owtest does the same but overwrites existing $i.cur files,
letting them run for 30-50 seconds then resetting the virtual machine.
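
The attached scripts are not reproduced in this archive. As a rough sketch only, one rentest iteration might amount to something like the following C function. The helper name `rentest_step`, the zero-padded file names and the 20-byte payload are assumptions chosen to match the listing further down, not the original script.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: one iteration of a rentest-style loop.
 * Writes a small payload to <dir>/<i>.tmp, then renames it to
 * <dir>/<i>.cur WITHOUT calling fsync(), which is the case under test.
 * Returns 0 on success, -1 on error. */
int rentest_step(const char *dir, int i)
{
    char tmp[4096], cur[4096];
    static const char payload[] = "0123456789012345678\n"; /* 20 bytes */
    size_t len = strlen(payload);

    snprintf(tmp, sizeof tmp, "%s/%05d.tmp", dir, i);
    snprintf(cur, sizeof cur, "%s/%05d.cur", dir, i);

    int fd = open(tmp, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, payload, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* No fsync() here: after a crash, the new name may have reached
     * the disk before the data blocks did. */
    return rename(tmp, cur);
}
```

An owtest-style iteration would be the same call sequence, except that `<i>.cur` already exists and is replaced by the rename.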

The results for ext3 are as expected: 0-size files are never exposed as
$i.cur, overwrites are atomic.

ext4 overwrites are /almost/ atomic (I get one 0-size file in owtest),
lots of 0-size files are exposed in rentest (30 seconds window).

btrfs *nearly* does as well as ext3. Overwrites are atomic.

The rentest exposes only a 0.2 seconds window of 0-size $i.cur files,
so that a "ls --full-time" after the crash looks like this (notice the
time between 01281.cur and 01292.tmp, only 0.2 seconds):
[...]
-rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
-rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
-rw-r--r-- 1 root root  0 2010-05-17 17:06:25.868035485 +0200 01282.cur
[...]
-rw-r--r-- 1 root root  0 2010-05-17 17:06:26.080003626 +0200 01291.cur
-rw-rw-rw- 1 root root  0 2010-05-17 17:06:26.108010083 +0200 01292.tmp


Finally, xfs kills lots of existing files in owtest and exposes lots of
0-size files in rentest (both 40 seconds window).

If anybody is interested, the bunch of trimmed "ls --full-time" output
for all filesystems is attached.


Thanks,
Jakob

Ric Wheeler

2010-May-17 19:12 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On 05/17/2010 02:04 PM, Jakob Unterwurzacher wrote:
> Hi!
>
> Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
> solve them all. And it nearly does! Now I wonder if the remaining 0.2
> seconds window of exposing 0-size files could be closed too.
>
Nearly does not seem that reassuring. What would happen if the server 
was under an intense load, swapping away crazily and running multiple 
writers to that same file system?

ric
> I tested using two simple scripts (attached for reference) on kernel
> 2.6.34-rc7:
> - rentest creates files $i.tmp and renames to $i.cur,
> - owtest does the same but overwrites existing $i.cur files,
> letting them run for 30-50 seconds then resetting the virtual machine.
>
> The results for ext3 are as expected: 0-size files are never exposed as
> $i.cur, overwrites are atomic.
>
> ext4 overwrites are /almost/ atomic (I get one 0-size file in owtest),
> lots of 0-size files are exposed in rentest (30 seconds window).
>
> btrfs *nearly* does as well as ext3. Overwrites are atomic.
>
> The rentest exposes only a 0.2 seconds window of 0-size $i.cur files,
> so that a "ls --full-time" after the crash looks like this (notice the
> time between 01281.cur and 01292.tmp, only 0.2 seconds):
> [...]
> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
> -rw-r--r-- 1 root root  0 2010-05-17 17:06:25.868035485 +0200 01282.cur
> [...]
> -rw-r--r-- 1 root root  0 2010-05-17 17:06:26.080003626 +0200 01291.cur
> -rw-rw-rw- 1 root root  0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
>
>
> Finally, xfs kills lots of existing files in owtest and exposes lots of
> 0-size files in rentest (both 40 seconds window).
>
> If anybody is interested, the bunch of trimmed "ls --full-time" output
> for all filesystems is attached.
>
>
> Thanks,
> Jakob
>    
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Josef Bacik

2010-May-17 19:25 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On Mon, May 17, 2010 at 08:04:21PM +0200, Jakob Unterwurzacher wrote:
> Hi!
> 
> Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
> solve them all. And it nearly does! Now I wonder if the remaining 0.2
> seconds window of exposing 0-size files could be closed too.
> 
> I tested using two simple scripts (attached for reference) on kernel
> 2.6.34-rc7:
> - rentest creates files $i.tmp and renames to $i.cur,
> - owtest does the same but overwrites existing $i.cur files,
> letting them run for 30-50 seconds then resetting the virtual machine.
> 
> The results for ext3 are as expected: 0-size files are never exposed as
> $i.cur, overwrites are atomic.
> 
> ext4 overwrites are /almost/ atomic (I get one 0-size file in owtest),
> lots of 0-size files are exposed in rentest (30 seconds window).
> 
> btrfs *nearly* does as well as ext3. Overwrites are atomic.
> 
> The rentest exposes only a 0.2 seconds window of 0-size $i.cur files,
> so that a "ls --full-time" after the crash looks like this (notice the
> time between 01281.cur and 01292.tmp, only 0.2 seconds):
> [...]
> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
> -rw-r--r-- 1 root root  0 2010-05-17 17:06:25.868035485 +0200 01282.cur
> [...]
> -rw-r--r-- 1 root root  0 2010-05-17 17:06:26.080003626 +0200 01291.cur
> -rw-rw-rw- 1 root root  0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
> 
This isn't actually true.  There is no window, the inode isn't written to disk
until all of the data is flushed to disk.  So the in-memory inode will be
updated, and therefore show an i_size of 0 since the io hasn't finished, but if
you were to crash at this point, when you came back up you'd have the old data
in place because the new inode data wasn't written to disk.  I have a feeling
ext4 is the same way, but I'd have to check for sure.  Thanks,

Josef

Chris Mason

2010-May-17 19:36 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On Mon, May 17, 2010 at 08:04:21PM +0200, Jakob Unterwurzacher wrote:
> Hi!
> 
> Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
> solve them all. And it nearly does! Now I wonder if the remaining 0.2
> seconds window of exposing 0-size files could be closed too.
That should be a zero second window, we try to force things to disk
during renames.

Could you please try this patch:

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index c9f1020..9370a71 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -806,7 +806,7 @@ int btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
 	 * if this file hasn't been changed since the last transaction
 	 * commit, we can safely return without doing anything
 	 */
-	if (last_mod < root->fs_info->last_trans_committed)
+	if (0 && last_mod < root->fs_info->last_trans_committed)
 		return 0;
 
 	/*

Chris Mason

2010-May-17 20:09 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On Mon, May 17, 2010 at 03:25:54PM -0400, Josef Bacik wrote:
> On Mon, May 17, 2010 at 08:04:21PM +0200, Jakob Unterwurzacher wrote:
> > Hi!
> > 
> > Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
> > solve them all. And it nearly does! Now I wonder if the remaining 0.2
> > seconds window of exposing 0-size files could be closed too.
> > 
> > I tested using two simple scripts (attached for reference) on kernel
> > 2.6.34-rc7:
> > - rentest creates files $i.tmp and renames to $i.cur,
> > - owtest does the same but overwrites existing $i.cur files,
> > letting them run for 30-50 seconds then resetting the virtual machine.
> > 
> > The results for ext3 are as expected: 0-size files are never exposed as
> > $i.cur, overwrites are atomic.
> > 
> > ext4 overwrites are /almost/ atomic (I get one 0-size file in owtest),
> > lots of 0-size files are exposed in rentest (30 seconds window).
> > 
> > btrfs *nearly* does as well as ext3. Overwrites are atomic.
> > 
> > The rentest exposes only a 0.2 seconds window of 0-size $i.cur files,
> > so that a "ls --full-time" after the crash looks like this (notice the
> > time between 01281.cur and 01292.tmp, only 0.2 seconds):
> > [...]
> > -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
> > -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
> > -rw-r--r-- 1 root root  0 2010-05-17 17:06:25.868035485 +0200 01282.cur
> > [...]
> > -rw-r--r-- 1 root root  0 2010-05-17 17:06:26.080003626 +0200 01291.cur
> > -rw-rw-rw- 1 root root  0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
> > 
> 
> This isn't actually true.  There is no window, the inode isn't written to disk
> until all of the data is flushed to disk.  So the in-memory inode will be
> updated, and therefore show an i_size of 0 since the io hasn't finished, but if
> you were to crash at this point, when you came back up you'd have the old data
> in place because the new inode data wasn't written to disk.  I have a feeling
> ext4 is the same way, but I'd have to check for sure.  Thanks,

Jakob, could you please confirm if your test includes a crash?

-chris


Jakob Unterwurzacher

2010-May-17 20:30 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On 17/05/10 22:09, Chris Mason wrote:
>>> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
>>> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
>>> -rw-r--r-- 1 root root  0 2010-05-17 17:06:25.868035485 +0200 01282.cur
>>> [...]
>>> -rw-r--r-- 1 root root  0 2010-05-17 17:06:26.080003626 +0200 01291.cur
>>> -rw-rw-rw- 1 root root  0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
>>>
>>
>> This isn't actually true.  There is no window, the inode isn't written to disk
>> until all of the data is flushed to disk.  So the in-memory inode will be
>> updated, and therefore show an i_size of 0 since the io hasn't finished, but if
>> you were to crash at this point, when you came back up you'd have the old data
>> in place because the new inode data wasn't written to disk.  I have a feeling
>> ext4 is the same way, but I'd have to check for sure.  Thanks,
> 
> Jakob, could you please confirm if your test includes a crash?
> 
> -chris

Yes, I crash the VM by pressing reset in VirtualBox.
Note that the "ls" above is from the rename test that does NOT overwrite
existing files.

Jakob

Jakob Unterwurzacher

2010-May-18 00:14 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On 17/05/10 21:36, Chris Mason wrote:
> 
> That should be a zero second window, we try to force things to disk
> during renames.
> 
> Could you please try this patch:
> 
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index c9f1020..9370a71 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -806,7 +806,7 @@ int btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
>  	 * if this file hasn't been changed since the last transaction
>  	 * commit, we can safely return without doing anything
>  	 */
> -	if (last_mod < root->fs_info->last_trans_committed)
> +	if (0 && last_mod < root->fs_info->last_trans_committed)

Ok, I upgraded to 2.6.34 final and switched to defconfig.
I only did the rename test ( i.e. no overwrite ), the window is now
1.1s, both with vanilla and with the patch.

Jakob


Chris Mason

2010-May-18 00:30 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On Tue, May 18, 2010 at 02:14:05AM +0200, Jakob Unterwurzacher wrote:
> On 17/05/10 21:36, Chris Mason wrote:
> > 
> > That should be a zero second window, we try to force things to disk
> > during renames.
> > 
> > Could you please try this patch:
> > 
> > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > index c9f1020..9370a71 100644
> > --- a/fs/btrfs/ordered-data.c
> > +++ b/fs/btrfs/ordered-data.c
> > @@ -806,7 +806,7 @@ int btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
> >  	 * if this file hasn't been changed since the last transaction
> >  	 * commit, we can safely return without doing anything
> >  	 */
> > -	if (last_mod < root->fs_info->last_trans_committed)
> > +	if (0 && last_mod < root->fs_info->last_trans_committed)
> 
> 
> Ok, I upgraded to 2.6.34 final and switched to defconfig.
> I only did the rename test ( i.e. no overwrite ), the window is now
> 1.1s, both with vanilla and with the patch.
Thanks, so much for the easy fix.  I'll take a look.

-chris


Chris Mason

2010-May-18 00:59 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On Mon, May 17, 2010 at 08:30:32PM -0400, Chris Mason wrote:
> On Tue, May 18, 2010 at 02:14:05AM +0200, Jakob Unterwurzacher wrote:
> > On 17/05/10 21:36, Chris Mason wrote:
> > > 
> > > That should be a zero second window, we try to force things to disk
> > > during renames.
> > > 
> > > Could you please try this patch:
> > > 
> > > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > > index c9f1020..9370a71 100644
> > > --- a/fs/btrfs/ordered-data.c
> > > +++ b/fs/btrfs/ordered-data.c
> > > @@ -806,7 +806,7 @@ int btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
> > >  	 * if this file hasn't been changed since the last transaction
> > >  	 * commit, we can safely return without doing anything
> > >  	 */
> > > -	if (last_mod < root->fs_info->last_trans_committed)
> > > +	if (0 && last_mod < root->fs_info->last_trans_committed)
> > 
> > 
> > Ok, I upgraded to 2.6.34 final and switched to defconfig.
> > I only did the rename test ( i.e. no overwrite ), the window is now
> > 1.1s, both with vanilla and with the patch.
> 
> Thanks, so much for the easy fix.  I'll take a look.

Ohhhhh, I read your initial email wrong, I'm sorry.  The test we're
failing, the rentest, doesn't overwrite one file with another.  It is
just creating a file and then renaming it.

Btrfs is explicitly choosing not to sync the file in this case because
the rename isn't replacing good old data with new unwritten data.  The
rename is taking new unwritten data and giving it a different name.

Are there applications that rely on this? 

-chris


Jakob Unterwurzacher

2010-May-18 12:03 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On 18/05/10 02:59, Chris Mason wrote:
>>> Ok, I upgraded to 2.6.34 final and switched to defconfig.
>>> I only did the rename test ( i.e. no overwrite ), the window is now
>>> 1.1s, both with vanilla and with the patch.
>>
>> Thanks, so much for the easy fix.  I'll take a look.
> 
> Ohhhhh, I read your initial email wrong, I'm sorry.  The test we're
> failing, the rentest, doesn't overwrite one file with another.  It is
> just creating a file and then renaming it.

Yes, the overwrite test goes perfectly fine.

> Btrfs is explicitly choosing not to sync the file in this case because
> the rename isn't replacing good old data with new unwritten data.  The
> rename is taking new unwritten data and giving it a different name.
> 
> Are there applications that rely on this? 
> 
> -chris

Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
fsync()ing everything and is about 2x slower than it was with ext3 [2].
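
For reference, the fsync()-before-rename pattern that applications fall back on can be sketched as follows. This is a minimal illustration and not dpkg's actual code; the helper name `replace_file_durably`, its parameters and the `.tmp` suffix are all hypothetical.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: atomically replace `path` with `len` bytes
 * from `buf`, forcing the data to disk before the new name appears.
 * `dir` is the directory that contains `path`.
 * Returns 0 on success, -1 on error. */
int replace_file_durably(const char *dir, const char *path,
                         const void *buf, size_t len)
{
    char tmp[4096];
    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
        return -1;

    int fd = open(tmp, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    /* A short write is treated as an error to keep the sketch small. */
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }

    /* fsync() the directory so that the rename itself is durable. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

The cost discussed in this thread comes from the fsync() on the data file: each call waits for the disk, which is why doing it for every unpacked file is slow.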

Btrfs is so close to getting it "right" that I wondered whether the new
file name hitting the disk could be delayed that one second for the data
to make it to disk first.

Anyway, btrfs is still a factor 30 better than ext4 or xfs!

Thanks,
Jakob






[1] https://bugs.launchpad.net/ubuntu/+source/dpkg/+bug/512096 (notice
the massive duplicate list on the right!)

[2] https://bugs.launchpad.net/ubuntu/+source/dpkg/+bug/537241

Chris Mason

2010-May-18 13:13 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On Tue, May 18, 2010 at 02:03:49PM +0200, Jakob Unterwurzacher wrote:
> On 18/05/10 02:59, Chris Mason wrote:
> >>> Ok, I upgraded to 2.6.34 final and switched to defconfig.
> >>> I only did the rename test ( i.e. no overwrite ), the window is now
> >>> 1.1s, both with vanilla and with the patch.
> >>
> >> Thanks, so much for the easy fix.  I'll take a look.
> > 
> > Ohhhhh, I read your initial email wrong, I'm sorry.  The test we're
> > failing, the rentest, doesn't overwrite one file with another.  It is
> > just creating a file and then renaming it.
> 
> Yes, the overwrite test goes perfectly fine.
> 
> > Btrfs is explicitly choosing not to sync the file in this case because
> > the rename isn't replacing good old data with new unwritten data.  The
> > rename is taking new unwritten data and giving it a different name.
> > 
> > Are there applications that rely on this? 
> > 
> > -chris
> 
> Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
> default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
> fsync()ing everything and is about 2x slower than it was with ext3 [2].
> 
> Btrfs is so close to getting it "right" that I wondered whether the new
> file name hitting the disk could be delayed that one second for the data
> to make it to disk first.
> 
The thing is that different apps have a different version of 'right'.  Rename
is atomically replacing one file with another, and I completely agree
that when we have an established file on disk, we shouldn't replace it
with something that is potentially garbage.

But for the zeros case we have a file that isn't on disk and we're just
giving it a new name.  I can see a different class of applications
getting upset about renames slowing the system down dramatically because
they suddenly imply a lot of IO.

I'm more than open to discussion on this one, but I don't see how:

rm -f foo2
dd if=/dev/zero of=foo bs=1M count=1000
mv foo foo2

Should be expected to write 1GB of data.

-chris

Oystein Viggen

2010-May-18 13:28 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

* [Chris Mason] 
> I'm more than open to discussion on this one, but I don't see how:
>
> rm -f foo2
> dd if=/dev/zero of=foo bs=1M count=1000
> mv foo foo2
>
> Should be expected to write 1GB of data.
IIRC, the answer you're looking for is "it did with ext3 in the default
data=ordered mode".  Combine that with the ext3 data=ordered fsync()
escalation where (again IIRC) fsync() tended to force a full sync() of
the file system, and it's not that difficult to see why someone would
program with the expectation above.

Anyway, there's still a question of if a new file system should emulate
the quirks of the old file system (read: be bug compatible), or if you
can just expect to be popular enough that userspace adapts to the new
order and lets you do The Right Thing instead.

Øystein
-- 
Outgoing mail is certified Virus Free.
..of course, the virus would tell you the same thing..


Aidan Van Dyk

2010-May-18 13:39 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

* Chris Mason <chris.mason@oracle.com> [100518 09:13]:
> I'm more than open to discussion on this one, but I don't see how:
> Should be expected to write 1GB of data.

++

Please don't mess up BTRFS because older, less better things are messed
up in certain ways.  If we're just going to continually perpetuate the
idea that broken-by-design apps are "right", we might as well just give
up on a better FS, and stick to "what broken apps are expecting" (i.e.
ext3).


-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Jakob Unterwurzacher

2010-May-18 14:06 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On 18/05/10 15:13, Chris Mason wrote:
> 
> The thing is that different apps have a different version of 'right'.  Rename
> is atomically replacing one file with another, and I completely agree
> that when we have an established file on disk, we shouldn't replace it
> with something that is potentially garbage.
> 
> But for the zeros case we have a file that isn't on disk and we're just
> giving it a new name.  I can see a different class of applications
> getting upset about renames slowing the system down dramatically because
> they suddenly imply a lot of IO.
> 
> I'm more than open to discussion on this one, but I don't see how:
> 
> rm -f foo2
> dd if=/dev/zero of=foo bs=1M count=1000
> mv foo foo2
> 
> Should be expected to write 1GB of data.
> 
> -chris
The idea would be to delay the rename hitting the disk until the data
has been written anyway.
The mv would return immediately, and someday, after the data has been
written to disk, the rename would be written to disk.

Jakob

Chris Mason

2010-May-18 14:36 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On Tue, May 18, 2010 at 04:06:45PM +0200, Jakob Unterwurzacher wrote:
> On 18/05/10 15:13, Chris Mason wrote:
> > 
> > The thing is that different apps have a different version of 'right'.  Rename
> > is atomically replacing one file with another, and I completely agree
> > that when we have an established file on disk, we shouldn't replace it
> > with something that is potentially garbage.
> > 
> > But for the zeros case we have a file that isn't on disk and we're just
> > giving it a new name.  I can see a different class of applications
> > getting upset about renames slowing the system down dramatically because
> > they suddenly imply a lot of IO.
> > 
> > I'm more than open to discussion on this one, but I don't see how:
> > 
> > rm -f foo2
> > dd if=/dev/zero of=foo bs=1M count=1000
> > mv foo foo2
> > 
> > Should be expected to write 1GB of data.
> > 
> > -chris
> 
> The idea would be to delay the rename hitting the disk until the data
> has been written anyway.
> The mv would return immediately, and someday, after the data has been
> written to disk, the rename would be written to disk.
This is possible, but we have to choose between consuming unbounded
resources while we queue up all the mvs or sometimes forcing the things
to disk.  At the end of the day, disks are so slow that eventually you
do end up waiting on them.

-chris


Thomas Bellman

2010-May-18 14:47 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On 05/18/10 15:28, Oystein Viggen wrote:
> * [Chris Mason]
>
>> I'm more than open to discussion on this one, but I don't see how:
>>
>> rm -f foo2
>> dd if=/dev/zero of=foo bs=1M count=1000
>> mv foo foo2
>>
>> Should be expected to write 1GB of data.
>
> IIRC, the answer you're looking for is "it did with ext3 in the default
> data=ordered mode".  Combine that with the ext3 data=ordered fsync()
> escalation where (again IIRC) fsync() tended to force a full sync() of
> the file system, and it's not that difficult to see why someone would
> program with the expectation above.
>
> Anyway, there's still a question of if a new file system should emulate
> the quirks of the old file system (read: be bug compatible), or if you
> can just expect to be popular enough that userspace adapts to the new
> order and lets you do The Right Thing instead.
So what *is* the right thing?  What kind of API should userspace have?
If the obvious thing for an application programmer to do is wrong, and
the right thing requires going through more hoops, that will ensure
that the majority of applications will be buggy.  We should strive
to make it easy to get things right.

It's easy for the kernel, and the filesystem, to just ask the userspace
programmers to jump through the hoops, and declare those programs that
don't to be broken.

On the other hand, if you go *too* far in absolving applications of
responsibility for making things safe, you would end up making all
filesystem operations synchronous, and that obviously hurts performance
in big ways.  So we need some kind of compromise, and where that
compromise should end up being, I don't really have the answer to.
It's just that I feel that often only the kernel programmers' view is
represented here.

The pattern of writing to a file and then changing its name *without*
overwriting an existing file, is quite common when you write files to
a spool directory, and have another program that picks up files from
that directory and processes them.  You do:

     fd = open("foo4711.tmp", O_CREAT|O_EXCL|O_RDWR, 0666);
     write(fd, "data", strlen("data"));
     close(fd);
     link("foo4711.tmp", "foo4711");
     unlink("foo4711.tmp");

(And note that careful programs don't use rename() here, because that
would risk clobbering a file some other process has written, and instead
use link()+unlink().  And I really wish a "safe_rename()" syscall that
didn't clobber existing files existed.)

The programs I personally have written that did this, also had an fsync()
there, because I received data from another system and didn't want to ACK
until I knew it was safely on disk at my end.  But I am a fairly careful
programmer.
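
Putting the snippet above together with the fsync() and with error checks gives roughly the following. This is a sketch under assumptions rather than code from any real spooler: the helper name `spool_publish` and its parameters are hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `data` to `tmp`, force it to disk, then
 * publish it as `dst` without clobbering an existing file.
 * Returns 0 on success, -1 on error (for example if `dst` exists). */
int spool_publish(const char *tmp, const char *dst, const char *data)
{
    size_t len = strlen(data);

    /* O_EXCL: fail rather than reuse somebody else's temp file. */
    int fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* Unlike rename(), link() fails with EEXIST if `dst` exists. */
    if (link(tmp, dst) != 0) {
        unlink(tmp);
        return -1;
    }
    return unlink(tmp);
}
```

Because the data is fsync()ed before link() is called, a crash can leave behind a stale temp file but not a final name pointing at unwritten data.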

Note that in my previous life I was a userspace programmer, and in my
current life I'm a sysadmin.  I'm speaking as an interested user of
Btrfs, not as a kernel programmer.

	/Thomas Bellman

Jakob Unterwurzacher

2010-May-18 15:57 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On 18/05/10 16:36, Chris Mason wrote:
>>
>> The idea would be to delay the rename hitting the disk until the data
>> has been written anyway.
>> The mv would return immediately, and someday, after the data has been
>> written to disk, the rename would be written to disk.
> 
> This is possible, but we have to choose between consuming unbounded
> resources while we queue up all the mvs or sometimes forcing the things
> to disk.  At the end of the day, disks are so slow that eventually you
> do end up waiting on them.
> 
> -chris
> 
I'm not sure how much memory a queued rename takes up, but the time that
would be spent flushing it to disk would then be spent flushing file
data, draining the write buffer and freeing memory, no?

That would be writing to disk

 [Data..................][Rename]  or
 [Rename][Data..................]

Whether you drain the file data queue or the rename queue first, in the
end you'd have to write it all....

I thought the problem of delaying the renames was complexity, well, at
least Ts'o said it was [1] - I'm not sure if this applies to btrfs as well.


Thanks,
Jakob



[1] https://bugzilla.kernel.org/show_bug.cgi?id=15910#c9
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2010-May-18 16:10 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On Tue, May 18, 2010 at 05:57:49PM +0200, Jakob Unterwurzacher wrote:
> On 18/05/10 16:36, Chris Mason wrote:
> >>
> >> The idea would be to delay the rename hitting the disk until the data
> >> has been written anyway.
> >> The mv would return immediately, and someday, after the data has been
> >> written to disk, the rename would be written to disk.
> > 
> > This is possible, but we have to choose between consuming unbounded
> > resources while we queue up all the mvs or sometimes forcing the things
> > to disk.  At the end of the day, disks are so slow that eventually you
> > do end up waiting on them.
> > 
> > -chris
> > 
> 
> I'm not sure how much memory a queued rename takes up, but the time that
> would be spent flushing it to disk would then be spent flushing file
> data, draining the write buffer and freeing memory, no?
> 
> That would be writing to disk
> 
>  [Data..................][Rename]  or
>  [Rename][Data..................]
Actually it is:

[Data..................][allow the transaction commit to complete]  or
[allow the transaction commit to complete][Data..................]

The problem is that people think of the rename as a tiny thing, but it
is really bundled in with all of the other metadata operations that were
done in the current transaction.   The space that was allocated to hold
the new file name, the space that was freed to remove the old file name,
the directory entries, the directory inode etc etc.

This means that holding back that one rename requires holding back every
operation done to the filesystem.

In btrfs, we're still able to do fsyncs quickly in this case
because we have a dedicated log for that.  But there are a few different
types of operations (like disk management) that require us to wait for
the transaction to complete even when we use the dedicated log.
> 
> Whether you drain the file data queue or the rename queue first, in the
> end you'd have to write it all....

It's about latency.  The latency required to write the entire file is
unbounded (the size of the file is unbounded).  The latency required to
commit the transaction without the file data is bounded because we are
able to control the amount of metadata in each transaction.

See the firefox vs ext3 wars for an example of all of this, it's the
latency the firefox people were (rightly) complaining about.
> 
> I thought the problem of delaying the renames was complexity, well, at
> least Ts'o said it was [1] - I'm not sure if this applies to btrfs as well.

I'm afraid there are lots and lots of different issues at play.  The
most important way to look at it is that forcing data to disk is very
slow, which is why we try to avoid it whenever we can.

Applications can request that the data go to disk via lots of different
ways.  Rename was never ever meant to be one of them, but it really does
make sense to provide atomic replacement of old good data with new good
data, so we've implemented that extra syncing.

Implementing syncing when userland doesn't expect extra syncing usually
just makes userland very unhappy.  It's not that we can't do it; it's that
doing it has implications for every application that uses rename.

-chris

Goffredo Baroncelli

2010-May-18 18:01 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On Tuesday, May 18, 2010, Chris Mason wrote:
> On Tue, May 18, 2010 at 05:57:49PM +0200, Jakob Unterwurzacher wrote:
> > On 18/05/10 16:36, Chris Mason wrote:
[...]
> > 
> > I thought the problem of delaying the renames was complexity, well, at
> > least Ts'o said it was [1] - I'm not sure if this applies to btrfs as well.
> 
> I'm afraid there are lots and lots of different issues at play.  The
> most important way to look at it is that forcing data to disk is very
> slow, which is why we try to avoid it whenever we can.
> 
> Applications can request that the data go to disk via lots of different
> ways.  Rename was never ever meant to be one of them, but it really does
> make sense to provide atomic replacement of old good data with new good
> data, so we've implemented that extra syncing.
> 
> Implementing syncing when userland doesn't expect extra syncing usually
> just makes userland very unhappy.  It's not that we can't do it; it's that
> doing it has implications for every application that uses rename.
> 
> -chris

Funny, the first thing that comes to mind reading this thread is that this
kind of complaint is being raised about a filesystem that can support a
full rollback via snapshots.

I think that a "right" solution would be to integrate the package manager
with the btrfs snapshot capability (as Nexenta does with ZFS [1]). But it is
clear that this is a long-term solution (IIRC Fedora is working on this).

In the meantime, what would be the "right" way to solve the dpkg problem
(and, more generally, the package manager problem) with btrfs?

[1] http://www.nexenta.org/os/TransactionalZFSUpgrades

-- 
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo)
<kreijackATinwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512

Jakob Unterwurzacher

2010-May-18 18:24 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On 18/05/10 18:10, Chris Mason wrote:
>>
>> I'm not sure how much memory a queued rename takes up, but the time that
>> would be spent flushing it to disk would then be spent flushing file
>> data, draining the write buffer and freeing memory, no?
>>
>> That would be writing to disk
>>
>>  [Data..................][Rename]  or
>>  [Rename][Data..................]
> 
> Actually it is:
> 
> [Data..................][allow the transaction commit to complete]  or
> [allow the transaction commit to complete][Data..................]
> 
> The problem is that people think of the rename as a tiny thing, but it
> is really bundled in with all of the other metadata operations that were
> done in the current transaction.   The space that was allocated to hold
> the new file name, the space that was freed to remove the old file name,
> the directory entries, the directory inode etc etc.
> 
> This means that holding back that one rename requires holding back every
> operation done to the filesystem.
> 
> In btrfs, we're still able to do fsyncs quickly in this case
> because we have a dedicated log for that.  But there are a few different
> types of operations (like disk management) that require us to wait for
> the transaction to complete even when we use the dedicated log.
> 
>>
>> Whether you drain the file data queue or the rename queue first, in the
>> end you'd have to write it all....
> 
> It's about latency.  The latency required to write the entire file is
> unbounded (the size of the file is unbounded).  The latency required to
> commit the transaction without the file data is bounded because we are
> able to control the amount of metadata in each transaction.
> 
> See the firefox vs ext3 wars for an example of all of this, it's the
> latency the firefox people were (rightly) complaining about.
> 
>>
>> I thought the problem of delaying the renames was complexity, well, at
>> least Ts'o said it was [1] - I'm not sure if this applies to btrfs as well.
> 
> I'm afraid there are lots and lots of different issues at play.  The
> most important way to look at it is that forcing data to disk is very
> slow, which is why we try to avoid it whenever we can.
> 
> Applications can request that the data go to disk via lots of different
> ways.  Rename was never ever meant to be one of them, but it really does
> make sense to provide atomic replacement of old good data with new good
> data, so we've implemented that extra syncing.
> 
> Implementing syncing when userland doesn't expect extra syncing usually
> just makes userland very unhappy.  It's not that we can't do it; it's that
> doing it has implications for every application that uses rename.
> 
> -chris
Thanks for all the insight.

I will update the wiki FAQ to make clear what "data=ordered" in btrfs
means, what it does not mean, and why (or something like that).


Jakob

Ric Wheeler

2010-May-18 23:00 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On 05/18/2010 09:13 AM, Chris Mason wrote:
> On Tue, May 18, 2010 at 02:03:49PM +0200, Jakob Unterwurzacher wrote:
>
>> On 18/05/10 02:59, Chris Mason wrote:
>>
>>>>> Ok, I upgraded to 2.6.34 final and switched to defconfig.
>>>>> I only did the rename test ( i.e. no overwrite ), the window is now
>>>>> 1.1s, both with vanilla and with the patch.
>>>>>            
>>>> Thanks, so much for the easy fix.  I'll take a look.
>>>>
>>> Ohhhhh, I read your initial email wrong, I'm sorry.  The test we're
>>> failing, the rentest, doesn't overwrite one file with another.  It is
>>> just creating a file and then renaming it.
>>>        
>> Yes, the overwrite test goes perfectly fine.
>>
>>      
>>> Btrfs is explicitly choosing not to sync the file in this case because
>>> the rename isn't replacing good old data with new unwritten data.  The
>>> rename is taking new unwritten data and giving it a different name.
>>>
>>> Are there applications that rely on this?
>>>
>>> -chris
>>>        
>> Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
>> default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
>> fsync()ing everything and is about 2x slower than it was with ext3 [2].
>>
>> Btrfs is so close to getting it "right" that i wondered whether the new
>> file name hitting the disk could be delayed that one second for the data
>> to make it to disk first.
>>
>>      
> The thing is that different apps have a different version of 'right'.  Rename
> is atomically replacing one file with another, and I completely agree
> that when we have an established file on disk, we shouldn't replace it
> with something that is potentially garbage.
>
> But for the zeros case we have a file that isn't on disk and we're just
> giving it a new name.  I can see a different class of applications
> getting upset about renames slowing the system down dramatically because
> they suddenly imply a lot of IO.
>
> I'm more than open to discussion on this one, but I don't see how:
>
> rm -f foo2
> dd if=/dev/zero of=foo bs=1M count=1000
> mv foo foo2
>
> Should be expected to write 1GB of data.
>
> -chris
>    
Just to weigh in here, I think that you have the right behaviour 
already. If an application wants to force this to sync the data to disk, 
it should use fsync() after the rename.

Having applications depend on semantics that only ext3 provided is not an 
excuse for making a rename take multiple seconds....

Thanks!

Ric


Bruce Guenter

2010-May-19 01:05 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

On Tue, May 18, 2010 at 07:00:57PM -0400, Ric Wheeler wrote:
> Just to weigh in here, I think that you have the right behaviour 
> already. If an application wants to force this to sync the data to disk, 
> it should use fsync() after the rename.
Actually, it pretty much has to fsync before the rename (to ensure the
contents are on disk) and possibly fsync the directory after to ensure
the rename hits the disk.  If you fsync after the rename, there is still
no guarantee that a crash won't cause partial data on disk with the new
filename, unless you assume the filesystem orders the writes so the
rename happens after the data hits the disk.  AFAIK most filesystems
make no such guarantee.
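
The ordering described above (fsync the file, rename it, then fsync the
directory) can be sketched as follows. This is a minimal illustration for a
POSIX system, not code from dpkg or from this thread; the function name
replace_durably and the ".tmp-" prefix are made up for the example.

```python
import contextlib
import os
import tempfile


def replace_durably(path: str, data: bytes) -> None:
    """Hypothetical helper: after a crash, `path` holds the old or the new data."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so that rename() stays
    # within one filesystem (mkstemp creates it with mode 0600).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # move Python's buffer into the kernel
            os.fsync(tmp.fileno())  # 1. file contents reach the disk
        os.rename(tmp_path, path)   # 2. the name switches atomically
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    # 3. fsync the directory so the rename itself is on disk
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling replace_durably("01280.cur", b"payload") blocks until the data is on
disk, which is exactly the latency cost discussed earlier in the thread.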

-- 
Bruce Guenter <bruce@untroubled.org>                http://untroubled.org/

Andy Lutomirski

2010-May-19 01:34 UTC

Re: Rename+crash behaviour of btrfs - nearly ext3!

Chris Mason wrote:
> On Tue, May 18, 2010 at 02:03:49PM +0200, Jakob Unterwurzacher wrote:
>> On 18/05/10 02:59, Chris Mason wrote:
>>>>> Ok, I upgraded to 2.6.34 final and switched to defconfig.
>>>>> I only did the rename test ( i.e. no overwrite ), the window is now
>>>>> 1.1s, both with vanilla and with the patch.
>>>> Thanks, so much for the easy fix.  I'll take a look.
>>> Ohhhhh, I read your initial email wrong, I'm sorry.  The test we're
>>> failing, the rentest, doesn't overwrite one file with another.  It is
>>> just creating a file and then renaming it.
>> Yes, the overwrite test goes perfectly fine.
>>
>>> Btrfs is explicitly choosing not to sync the file in this case because
>>> the rename isn't replacing good old data with new unwritten data.  The
>>> rename is taking new unwritten data and giving it a different name.
>>>
>>> Are there applications that rely on this? 
>>>
>>> -chris
>> Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
>> default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
>> fsync()ing everything and is about 2x slower than it was with ext3 [2].
>>
>> Btrfs is so close to getting it "right" that i wondered whether the new
>> file name hitting the disk could be delayed that one second for the data
>> to make it to disk first.
>>
> 
> The thing is that different apps have a different version of 'right'.  Rename
> is atomically replacing one file with another, and I completely agree
> that when we have an established file on disk, we shouldn't replace it
> with something that is potentially garbage.
> 
> But for the zeros case we have a file that isn't on disk and we're just
> giving it a new name.  I can see a different class of applications
> getting upset about renames slowing the system down dramatically because
> they suddenly imply a lot of IO.
> 
> I'm more than open to discussion on this one, but I don't see how:
> 
> rm -f foo2
> dd if=/dev/zero of=foo bs=1M count=1000
> mv foo foo2
> 
> Should be expected to write 1GB of data.
[disclaimer: I don't know much about btrfs internals]

foo2 being gone after a crash is, of course, fine.  But, depending on 
the programmer, there are a few answers:

1. I want foo2 to either not exist or to contain the data I just wrote. 
  So please wait for it to hit disk.

2. I want foo2 to either not exist or to contain the data I just wrote. 
  So, btrfs, please learn how to make sure that the metadata doesn't get
written until the data gets written.  Presumably this means that the 
rename needs to go into a log somewhere (in memory) but not become a 
part of the current transaction to avoid all kinds of latency.

3. I want speed.  Do whatever's fastest.

Of course, there's a harder case:

dd if=/dev/zero of=foo bs=1M count=1000
mv foo foo2
dd if=<something else> of=foo2 bs=1k count=1

Now what?


A lot of application programmers probably want the metadata to happen 
after the data, but they don't want to use fsync because they don't want
to wait for anything to hit disk.  It would be nice to ask the FS for 
help, but that might be distinctly nontrivial.
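
Without filesystem support, an application can approximate "rename only
after the data is on disk, but don't make me wait" in userland by moving the
fsync and the rename onto a worker thread. The sketch below assumes a POSIX
system; rename_after_data and the single-worker pool are hypothetical names
and choices made for illustration, and the directory is not fsynced here.

```python
import os
from concurrent.futures import Future, ThreadPoolExecutor

# One worker keeps the renames in submission order (an illustrative choice).
_pool = ThreadPoolExecutor(max_workers=1)


def _sync_then_rename(tmp_path: str, final_path: str) -> str:
    fd = os.open(tmp_path, os.O_WRONLY)
    try:
        os.fsync(fd)  # wait, on the worker thread, until the data is on disk
    finally:
        os.close(fd)
    os.rename(tmp_path, final_path)  # only now expose the final name
    return final_path


def rename_after_data(tmp_path: str, final_path: str) -> Future:
    """Hypothetical helper: returns at once; the rename happens after fsync."""
    return _pool.submit(_sync_then_rename, tmp_path, final_path)
```

The caller gets a Future back immediately and must not assume the final name
exists until future.result() returns. The disk still does the same amount of
work; only the waiting moves out of the caller's path.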

--Andy