Hi!

Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
solve them all. And it nearly does! Now I wonder if the remaining 0.2
seconds window of exposing 0-size files could be closed too.

I tested using two simple scripts (attached for reference) on kernel
2.6.34-rc7:
- rentest creates files $i.tmp and renames them to $i.cur,
- owtest does the same but overwrites existing $i.cur files,
letting them run for 30-50 seconds and then resetting the virtual machine.

The results for ext3 are as expected: 0-size files are never exposed as
$i.cur, overwrites are atomic.

ext4 overwrites are /almost/ atomic (I get one 0-size file in owtest),
lots of 0-size files are exposed in rentest (30 seconds window).

btrfs *nearly* does as well as ext3. Overwrites are atomic.

The rentest exposes only a 0.2 seconds window of 0-size $i.cur files,
so that a "ls --full-time" after the crash looks like this (notice the
time between 01281.cur and 01292.tmp, only 0.2 seconds):
[...]
-rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
-rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
-rw-r--r-- 1 root root 0 2010-05-17 17:06:25.868035485 +0200 01282.cur
[...]
-rw-r--r-- 1 root root 0 2010-05-17 17:06:26.080003626 +0200 01291.cur
-rw-rw-rw- 1 root root 0 2010-05-17 17:06:26.108010083 +0200 01292.tmp

Finally, xfs kills lots of existing files in owtest and exposes lots of
0-size files in rentest (both 40 seconds window).

If anybody is interested, the bunch of trimmed "ls --full-time" output
for all filesystems is attached.

Thanks,
Jakob
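The two test scripts were attachments and are not reproduced in this archive. A minimal sketch of what the message describes them as doing might look like the following. The function names, the 20-byte payload (chosen only to match the size shown in the ls output) and the post-crash checker are assumptions for illustration, not the original scripts.

```python
import os

# Assumption: 20 bytes, to match the file size seen in the ls output above.
PAYLOAD = b"x" * 19 + b"\n"


def write_and_rename(directory, i):
    """Create NNNNN.tmp, write the payload, rename it to NNNNN.cur (no fsync)."""
    tmp = os.path.join(directory, f"{i:05d}.tmp")
    cur = os.path.join(directory, f"{i:05d}.cur")
    with open(tmp, "wb") as f:
        f.write(PAYLOAD)
    # rename() atomically replaces cur if it already exists
    os.rename(tmp, cur)
    return cur


def rentest(directory, count):
    """Rename case: every target name is new."""
    for i in range(count):
        write_and_rename(directory, i)


def owtest(directory, count, passes=2):
    """Overwrite case: passes after the first replace existing .cur files."""
    for _ in range(passes):
        for i in range(count):
            write_and_rename(directory, i)


def zero_size_cur_files(directory):
    """Post-crash check: names of .cur files that came back with length 0."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith(".cur")
        and os.path.getsize(os.path.join(directory, name)) == 0
    )
```

In the experiment described above, the loops would run on the filesystem under test until the VM is reset, and zero_size_cur_files() would be run after reboot.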
On 05/17/2010 02:04 PM, Jakob Unterwurzacher wrote:
> Hi!
>
> Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
> solve them all. And it nearly does! Now I wonder if the remaining 0.2
> seconds window of exposing 0-size files could be closed too.

"Nearly" does not seem that reassuring. What would happen if the server
was under an intense load, swapping away crazily and running multiple
writers to that same file system?

ric

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, May 17, 2010 at 08:04:21PM +0200, Jakob Unterwurzacher wrote:
> [...]
> The rentest exposes only a 0.2 seconds window of 0-size $i.cur files,
> so that a "ls --full-time" after the crash looks like this (notice the
> time between 01281.cur and 01292.tmp, only 0.2 seconds):
> [...]
> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
> -rw-r--r-- 1 root root 0 2010-05-17 17:06:25.868035485 +0200 01282.cur
> [...]
> -rw-r--r-- 1 root root 0 2010-05-17 17:06:26.080003626 +0200 01291.cur
> -rw-rw-rw- 1 root root 0 2010-05-17 17:06:26.108010083 +0200 01292.tmp

This isn't actually true. There is no window, the inode isn't written to
disk until all of the data is flushed to disk. So the in-memory inode will
be updated, and therefore show an i_size of 0 since the io hasn't finished,
but if you were to crash at this point, when you came back up you'd have
the old data in place because the new inode data wasn't written to disk.
I have a feeling ext4 is the same way, but I'd have to check for sure.
Thanks,

Josef
On Mon, May 17, 2010 at 08:04:21PM +0200, Jakob Unterwurzacher wrote:
> Hi!
>
> Following Ubuntu's dpkg+ext4 problems I wanted to see if btrfs would
> solve them all. And it nearly does! Now I wonder if the remaining 0.2
> seconds window of exposing 0-size files could be closed too.

That should be a zero second window, we try to force things to disk
during renames.

Could you please try this patch:

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index c9f1020..9370a71 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -806,7 +806,7 @@ int btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
 	 * if this file hasn't been changed since the last transaction
 	 * commit, we can safely return without doing anything
 	 */
-	if (last_mod < root->fs_info->last_trans_committed)
+	if (0 && last_mod < root->fs_info->last_trans_committed)
 		return 0;
 
 	/*
On Mon, May 17, 2010 at 03:25:54PM -0400, Josef Bacik wrote:
> On Mon, May 17, 2010 at 08:04:21PM +0200, Jakob Unterwurzacher wrote:
> > [...]
> > The rentest exposes only a 0.2 seconds window of 0-size $i.cur files,
> > so that a "ls --full-time" after the crash looks like this (notice the
> > time between 01281.cur and 01292.tmp, only 0.2 seconds):
> > [...]
> > -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
> > -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
> > -rw-r--r-- 1 root root 0 2010-05-17 17:06:25.868035485 +0200 01282.cur
> > [...]
> > -rw-r--r-- 1 root root 0 2010-05-17 17:06:26.080003626 +0200 01291.cur
> > -rw-rw-rw- 1 root root 0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
>
> This isn't actually true. There is no window, the inode isn't written to
> disk until all of the data is flushed to disk. So the in-memory inode will
> be updated, and therefore show an i_size of 0 since the io hasn't finished,
> but if you were to crash at this point, when you came back up you'd have
> the old data in place because the new inode data wasn't written to disk.
> I have a feeling ext4 is the same way, but I'd have to check for sure.
> Thanks,

Jakob, could you please confirm if your test includes a crash?

-chris
Jakob Unterwurzacher
2010-May-17 20:30 UTC
Re: Rename+crash behaviour of btrfs - nearly ext3!
On 17/05/10 22:09, Chris Mason wrote:
>>> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.812016407 +0200 01280.cur
>>> -rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
>>> -rw-r--r-- 1 root root 0 2010-05-17 17:06:25.868035485 +0200 01282.cur
>>> [...]
>>> -rw-r--r-- 1 root root 0 2010-05-17 17:06:26.080003626 +0200 01291.cur
>>> -rw-rw-rw- 1 root root 0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
>>>
>>
>> This isn't actually true. There is no window, the inode isn't written to
>> disk until all of the data is flushed to disk. So the in-memory inode will
>> be updated, and therefore show an i_size of 0 since the io hasn't finished,
>> but if you were to crash at this point, when you came back up you'd have
>> the old data in place because the new inode data wasn't written to disk.
>> I have a feeling ext4 is the same way, but I'd have to check for sure.
>> Thanks,
>
> Jakob, could you please confirm if your test includes a crash?
>
> -chris

Yes, I crash the VM by pressing reset in VirtualBox.
Note that the "ls" above is from the rename test that does NOT overwrite
existing files.

Jakob
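For reference, the window being discussed can be read straight off the post-crash listing. The following is only a sketch of that arithmetic: it assumes the `ls --full-time` column layout shown above, that all timestamps fall on the same day, and that the window is measured from the last intact .cur file to the newest file on disk. The helper names are made up for illustration.

```python
from decimal import Decimal

# Four of the lines quoted above (the first intact/zero-size boundary and the tail).
LS_LINES = """\
-rw-r--r-- 1 root root 20 2010-05-17 17:06:25.835999490 +0200 01281.cur
-rw-r--r-- 1 root root 0 2010-05-17 17:06:25.868035485 +0200 01282.cur
-rw-r--r-- 1 root root 0 2010-05-17 17:06:26.080003626 +0200 01291.cur
-rw-rw-rw- 1 root root 0 2010-05-17 17:06:26.108010083 +0200 01292.tmp
""".splitlines()


def parse(line):
    """Return (name, size, seconds since midnight) for one listing line."""
    fields = line.split()
    size, clock, name = int(fields[4]), fields[6], fields[8]
    hours, minutes, seconds = clock.split(":")
    # Decimal keeps the nanosecond digits that float or datetime would round.
    return name, size, int(hours) * 3600 + int(minutes) * 60 + Decimal(seconds)


def exposure_window(lines):
    """Seconds between the last non-empty .cur file and the newest file."""
    entries = [parse(line) for line in lines]
    last_good = max(
        t for name, size, t in entries if name.endswith(".cur") and size > 0
    )
    newest = max(t for _, _, t in entries)
    return newest - last_good


window = exposure_window(LS_LINES)
```

For the excerpt this gives a little over a quarter of a second, in line with the "0.2 seconds" figure in the original report.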
Jakob Unterwurzacher
2010-May-18 00:14 UTC
Re: Rename+crash behaviour of btrfs - nearly ext3!
On 17/05/10 21:36, Chris Mason wrote:
>
> That should be a zero second window, we try to force things to disk
> during renames.
>
> Could you please try this patch:
>
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index c9f1020..9370a71 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -806,7 +806,7 @@ int btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
>  	 * if this file hasn't been changed since the last transaction
>  	 * commit, we can safely return without doing anything
>  	 */
> -	if (last_mod < root->fs_info->last_trans_committed)
> +	if (0 && last_mod < root->fs_info->last_trans_committed)

Ok, I upgraded to 2.6.34 final and switched to defconfig.
I only did the rename test (i.e. no overwrite), the window is now
1.1s, both with vanilla and with the patch.

Jakob
On Tue, May 18, 2010 at 02:14:05AM +0200, Jakob Unterwurzacher wrote:
> On 17/05/10 21:36, Chris Mason wrote:
> > [...]
>
> Ok, I upgraded to 2.6.34 final and switched to defconfig.
> I only did the rename test (i.e. no overwrite), the window is now
> 1.1s, both with vanilla and with the patch.

Thanks, so much for the easy fix. I'll take a look.

-chris
On Mon, May 17, 2010 at 08:30:32PM -0400, Chris Mason wrote:
> On Tue, May 18, 2010 at 02:14:05AM +0200, Jakob Unterwurzacher wrote:
> > [...]
> > Ok, I upgraded to 2.6.34 final and switched to defconfig.
> > I only did the rename test (i.e. no overwrite), the window is now
> > 1.1s, both with vanilla and with the patch.
>
> Thanks, so much for the easy fix. I'll take a look.

Ohhhhh, I read your initial email wrong, I'm sorry. The test we're
failing, the rentest, doesn't overwrite one file with another. It is
just creating a file and then renaming it.

Btrfs is explicitly choosing not to sync the file in this case because
the rename isn't replacing good old data with new unwritten data. The
rename is taking new unwritten data and giving it a different name.

Are there applications that rely on this?

-chris
Jakob Unterwurzacher
2010-May-18 12:03 UTC
Re: Rename+crash behaviour of btrfs - nearly ext3!
On 18/05/10 02:59, Chris Mason wrote:
>>> Ok, I upgraded to 2.6.34 final and switched to defconfig.
>>> I only did the rename test (i.e. no overwrite), the window is now
>>> 1.1s, both with vanilla and with the patch.
>>
>> Thanks, so much for the easy fix. I'll take a look.
>
> Ohhhhh, I read your initial email wrong, I'm sorry. The test we're
> failing, the rentest, doesn't overwrite one file with another. It is
> just creating a file and then renaming it.

Yes, the overwrite test goes perfectly fine.

> Btrfs is explicitly choosing not to sync the file in this case because
> the rename isn't replacing good old data with new unwritten data. The
> rename is taking new unwritten data and giving it a different name.
>
> Are there applications that rely on this?
>
> -chris

Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
fsync()ing everything and is about 2x slower than it was with ext3 [2].

Btrfs is so close to getting it "right" that I wondered whether the new
file name hitting the disk could be delayed that one second for the data
to make it to disk first.

Anyway, btrfs is still a factor 30 better than ext4 or xfs!

Thanks,
Jakob

[1] https://bugs.launchpad.net/ubuntu/+source/dpkg/+bug/512096
    (notice the massive duplicate list on the right!)
[2] https://bugs.launchpad.net/ubuntu/+source/dpkg/+bug/537241
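The "fsync() everything" approach mentioned above is usually written as write, fsync, rename. The following is a minimal sketch of that general pattern on a POSIX system, not dpkg's actual code; the function name and the ".tmp" suffix are assumptions for illustration.

```python
import os


def durable_replace(path, data):
    """Write data to path so a crash leaves either the old or the new content."""
    tmp = path + ".tmp"  # hypothetical naming convention
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the kernel
        os.fsync(f.fileno())  # force the file data to disk before the rename
    os.rename(tmp, path)      # atomically swap the name to the new file
    # Also fsync the containing directory so the rename itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The cost discussed in this thread is the os.fsync() call on the file: it makes the caller wait for the disk, which is where the reported slowdown comes from.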
On Tue, May 18, 2010 at 02:03:49PM +0200, Jakob Unterwurzacher wrote:
> On 18/05/10 02:59, Chris Mason wrote:
> > [...]
> > Btrfs is explicitly choosing not to sync the file in this case because
> > the rename isn't replacing good old data with new unwritten data. The
> > rename is taking new unwritten data and giving it a different name.
> >
> > Are there applications that rely on this?
> >
> > -chris
>
> Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
> default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
> fsync()ing everything and is about 2x slower than it was with ext3 [2].
>
> Btrfs is so close to getting it "right" that I wondered whether the new
> file name hitting the disk could be delayed that one second for the data
> to make it to disk first.

The thing is that different apps have a different version of 'right'.
Rename is atomically replacing one file with another, and I completely
agree that when we have an established file on disk, we shouldn't replace
it with something that is potentially garbage.

But for the zeros case we have a file that isn't on disk and we're just
giving it a new name. I can see a different class of applications
getting upset about renames slowing the system down dramatically because
they suddenly imply a lot of IO.

I'm more than open to discussion on this one, but I don't see how:

    rm -f foo2
    dd if=/dev/zero of=foo bs=1M count=1000
    mv foo foo2

Should be expected to write 1GB of data.

-chris
* [Chris Mason]

> I'm more than open to discussion on this one, but I don't see how:
>
>     rm -f foo2
>     dd if=/dev/zero of=foo bs=1M count=1000
>     mv foo foo2
>
> Should be expected to write 1GB of data.

IIRC, the answer you're looking for is "it did with ext3 in the default
data=ordered mode". Combine that with the ext3 data=ordered fsync()
escalation where (again IIRC) fsync() tended to force a full sync() of
the file system, and it's not that difficult to see why someone would
program with the expectation above.

Anyway, there's still a question of if a new file system should emulate
the quirks of the old file system (read: be bug compatible), or if you
can just expect to be popular enough that userspace adapts to the new
order and lets you do The Right Thing instead.

Øystein
--
Outgoing mail is certified Virus Free.
..of course, the virus would tell you the same thing..
* Chris Mason <chris.mason@oracle.com> [100518 09:13]:

> I'm more than open to discussion on this one, but I don't see how:

> Should be expected to write 1GB of data.

++

Please don't mess up BTRFS because older, less better things are messed
up in certain ways.

If we're just going to continually perpetuate the idea that
broken-by-design apps are "right", we might as well just give up on a
better FS, and stick to "what broken apps are expecting" (i.e. ext3).

--
Aidan Van Dyk                                   Create like a god,
aidan@highrise.ca                               command like a king,
http://www.highrise.ca/                         work like a slave.
Jakob Unterwurzacher
2010-May-18 14:06 UTC
Re: Rename+crash behaviour of btrfs - nearly ext3!
On 18/05/10 15:13, Chris Mason wrote:
>
> The thing is that different apps have a different version of 'right'.
> Rename is atomically replacing one file with another, and I completely
> agree that when we have an established file on disk, we shouldn't replace
> it with something that is potentially garbage.
>
> But for the zeros case we have a file that isn't on disk and we're just
> giving it a new name. I can see a different class of applications
> getting upset about renames slowing the system down dramatically because
> they suddenly imply a lot of IO.
>
> I'm more than open to discussion on this one, but I don't see how:
>
>     rm -f foo2
>     dd if=/dev/zero of=foo bs=1M count=1000
>     mv foo foo2
>
> Should be expected to write 1GB of data.
>
> -chris

The idea would be to delay the rename hitting the disk until the data
has been written anyway.
The mv would return immediately, and someday, after the data has been
written to disk, the rename would be written to disk.

Jakob
On Tue, May 18, 2010 at 04:06:45PM +0200, Jakob Unterwurzacher wrote:
> On 18/05/10 15:13, Chris Mason wrote:
> > [...]
>
> The idea would be to delay the rename hitting the disk until the data
> has been written anyway.
> The mv would return immediately, and someday, after the data has been
> written to disk, the rename would be written to disk.

This is possible, but we have to choose between consuming unbounded
resources while we queue up all the mvs or sometimes forcing the things
to disk. At the end of the day, disks are so slow that eventually you
do end up waiting on them.

-chris
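The trade-off described here (queue renames without bound, or sometimes make the caller wait) can be illustrated with a toy model. This is purely a sketch of the argument and has nothing to do with how btrfs is implemented; the class, its limit parameter and the counters are all invented for illustration.

```python
from collections import deque


class DeferredRenames:
    """Toy model: a rename is held back until its file data is on disk,
    but at most `limit` renames may be held back at once."""

    def __init__(self, limit):
        self.limit = limit
        self.pending = deque()   # renames whose data is not yet on disk
        self.committed = []      # renames that have reached the disk
        self.forced_flushes = 0  # times a caller had to wait for the disk

    def rename(self, name):
        if len(self.pending) >= self.limit:
            # Queue is full: force the oldest file's data out now.
            # This is the point where the mv stops "returning immediately".
            self.forced_flushes += 1
            self.committed.append(self.pending.popleft())
        self.pending.append(name)

    def data_written(self):
        """Background writeback finished for the oldest pending file."""
        if self.pending:
            self.committed.append(self.pending.popleft())


queue = DeferredRenames(limit=2)
for name in ["a", "b", "c"]:
    queue.rename(name)
```

With a limit of 2, the third rename finds the queue full and has to wait for a flush; raising the limit avoids the wait only by holding more state in memory.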
On 05/18/10 15:28, Oystein Viggen wrote:
> * [Chris Mason]
>
>> I'm more than open to discussion on this one, but I don't see how:
>>
>>     rm -f foo2
>>     dd if=/dev/zero of=foo bs=1M count=1000
>>     mv foo foo2
>>
>> Should be expected to write 1GB of data.
>
> IIRC, the answer you're looking for is "it did with ext3 in the default
> data=ordered mode". Combine that with the ext3 data=ordered fsync()
> escalation where (again IIRC) fsync() tended to force a full sync() of
> the file system, and it's not that difficult to see why someone would
> program with the expectation above.
>
> Anyway, there's still a question of if a new file system should emulate
> the quirks of the old file system (read: be bug compatible), or if you
> can just expect to be popular enough that userspace adapts to the new
> order and lets you do The Right Thing instead.

So what *is* the right thing? What kind of API should userspace have?

If the obvious thing for an application programmer to do is wrong, and
the right thing requires going through more hoops, that will ensure that
the majority of applications will be buggy. We should strive to make it
easy to get things right.

It's easy for the kernel, and the filesystem, to just ask the userspace
programmers to jump through the hoops, and declare those programs that
don't to be broken.

On the other hand, if you go *too* far in absolving applications of
responsibility for making things safe, you would end up making all
filesystem operations synchronous, and that obviously hurts performance
in big ways.

So we need some kind of compromise, and where that compromise should end
up being, I don't really have the answer to. It's just that I feel that
often only the kernel programmer's view is represented here.

The pattern of writing to a file and then changing its name *without*
overwriting an existing file is quite common when you write files to a
spool directory, and have another program that picks up files from that
directory and processes them. You do:

    fd = open("foo4711.tmp", O_CREAT|O_EXCL|O_RDWR, 0600);
    write(fd, "data", strlen("data"));
    close(fd);
    link("foo4711.tmp", "foo4711");
    unlink("foo4711.tmp");

(And note that careful programs don't use rename() here, because that
would risk clobbering a file some other process has written, and instead
use link()+unlink(). And I really wish a "safe_rename()" syscall that
didn't clobber existing files existed.)

The programs I personally have written that did this also had an
fsync() there, because I received data from another system and didn't
want to ACK until I knew it was safely on disk at my end. But I am a
fairly careful programmer.

Note that in my previous life I was a userspace programmer, and in my
current life I'm a sysadmin. I'm speaking as an interested user of
Btrfs, not as a kernel programmer.

/Thomas Bellman
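The spool-directory pattern described above, with the fsync() that the message says careful programs add, could be sketched as follows on a POSIX system. The function name and its parameters are hypothetical; only the open/write/fsync/link/unlink sequence comes from the message.

```python
import os


def spool_publish(tmp_path, final_path, data):
    """Write data to tmp_path, then give it its final name without clobbering."""
    # O_EXCL: fail rather than reuse a temp file some other process created.
    fd = os.open(tmp_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than asked for, so loop.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # data is on disk before the final name appears
    finally:
        os.close(fd)
    # link() raises FileExistsError if final_path already exists, which is
    # the no-clobber behaviour that rename() lacks. On failure the temp
    # file is left in place for the caller to deal with.
    os.link(tmp_path, final_path)
    os.unlink(tmp_path)
```

A consumer scanning the spool directory then only ever sees final names whose contents were already flushed.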
Jakob Unterwurzacher
2010-May-18 15:57 UTC
Re: Rename+crash behaviour of btrfs - nearly ext3!
On 18/05/10 16:36, Chris Mason wrote:
>>
>> The idea would be to delay the rename hitting the disk until the data
>> has been written anyway.
>> The mv would return immediately, and someday, after the data has been
>> written to disk, the rename would be written to disk.
>
> This is possible, but we have to choose between consuming unbounded
> resources while we queue up all the mvs or sometimes forcing the things
> to disk. At the end of the day, disks are so slow that eventually you
> do end up waiting on them.
>
> -chris
>

I'm not sure how much memory a queued rename takes up, but the time that
would be spent flushing it to disk would then be spent flushing file
data, draining the write buffer and freeing memory, no?

That would be writing to disk

    [Data..................][Rename] or
    [Rename][Data..................]

Whether you drain the file data queue or the rename queue first, in the
end you'd have to write it all....

I thought the problem of delaying the renames was complexity, well, at
least T'Tso said it was [1] - I'm not sure if this applies to btrfs as well.

Thanks,
Jakob

[1] https://bugzilla.kernel.org/show_bug.cgi?id=15910#c9
On Tue, May 18, 2010 at 05:57:49PM +0200, Jakob Unterwurzacher wrote:
> On 18/05/10 16:36, Chris Mason wrote:
> >>
> >> The idea would be to delay the rename hitting the disk until the data
> >> has been written anyway.
> >> The mv would return immediately, and someday, after the data has been
> >> written to disk, the rename would be written to disk.
> >
> > This is possible, but we have to choose between consuming unbounded
> > resources while we queue up all the mvs or sometimes forcing the things
> > to disk. At the end of the day, disks are so slow that eventually you
> > do end up waiting on them.
> >
> > -chris
> >
>
> I'm not sure how much memory a queued rename takes up, but the time that
> would be spent flushing it to disk would then be spent flushing file
> data, draining the write buffer and freeing memory, no?
>
> That would be writing to disk
>
>     [Data..................][Rename] or
>     [Rename][Data..................]

Actually it is:

    [Data..................][allow the transaction commit to complete] or
    [allow the transaction commit to complete][Data..................]

The problem is that people think of the rename as a tiny thing, but it
is really bundled in with all of the other metadata operations that were
done in the current transaction. The space that was allocated to hold
the new file name, the space that was freed to remove the old file name,
the directory entries, the directory inode etc etc.

This means that holding back that one rename requires holding back every
operation done to the filesystem.

In btrfs, we're still able to do fsyncs quickly in this case
because we have a dedicated log for that. But there are a few different
types of operations (like disk management) that require us to wait for
the transaction to complete even when we use the dedicated log.

>
> Whether you drain the file data queue or the rename queue first, in the
> end you'd have to write it all....

It's about latency. The latency required to write the entire file is
unbounded (the size of the file is unbounded). The latency required to
commit the transaction without the file data is bounded because we are
able to control the amount of metadata in each transaction.

See the firefox vs ext3 wars for an example of all of this, it's the
latency the firefox people were (rightly) complaining about.

>
> I thought the problem of delaying the renames was complexity, well, at
> least T'Tso said it was [1] - I'm not sure if this applies to btrfs as well.

I'm afraid there are lots and lots of different issues at play. The
most important way to look at it is that forcing data to disk is very
slow, which is why we try to avoid it whenever we can.

Applications can request that the data go to disk in lots of different
ways. Rename was never ever meant to be one of them, but it really does
make sense to provide atomic replacement of old good data with new good
data, so we've implemented that extra syncing.

Implementing syncing when userland doesn't expect extra syncing usually
just makes userland very unhappy. It's not that we can't do it, it's that
doing it has implications for every application that uses rename.

-chris
Goffredo Baroncelli
2010-May-18 18:01 UTC
Re: Rename+crash behaviour of btrfs - nearly ext3!
On Tuesday, May 18, 2010, Chris Mason wrote:
> On Tue, May 18, 2010 at 05:57:49PM +0200, Jakob Unterwurzacher wrote:
> > On 18/05/10 16:36, Chris Mason wrote:
[...]
> >
> > I thought the problem of delaying the renames was complexity, well, at
> > least T'Tso said it was [1] - I'm not sure if this applies to btrfs as
> > well.
>
> I'm afraid there are lots and lots of different issues at play. The
> most important way to look at it is that forcing data to disk is very
> slow, which is why we try to avoid it whenever we can.
>
> Applications can request that the data go to disk in lots of different
> ways. Rename was never ever meant to be one of them, but it really does
> make sense to provide atomic replacement of old good data with new good
> data, so we've implemented that extra syncing.
>
> Implementing syncing when userland doesn't expect extra syncing usually
> just makes userland very unhappy. It's not that we can't do it, it's that
> doing it has implications for every application that uses rename.
>
> -chris

Funny, the first thing that comes to my mind reading this thread is that
this kind of complaint is raised about a file system which is able to
support a full rollback via snapshots.

I think that a "right" solution would be to integrate the package manager
with the btrfs snapshot capability (as Nexenta does [1]). But it is clear
that this is a long-term solution (IIRC Fedora is working on this).

In the meantime, what would be the "right" way to solve the dpkg problem
(and, more generally, the package manager problem) with btrfs?

[1] http://www.nexenta.org/os/TransactionalZFSUpgrades

--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijackATinwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E C054 BF04 F161 3DC5 0512
Jakob Unterwurzacher
2010-May-18 18:24 UTC
Re: Rename+crash behaviour of btrfs - nearly ext3!
On 18/05/10 18:10, Chris Mason wrote:
>> I'm not sure how much memory a queued rename takes up, but the time that
>> would be spent flushing it to disk would then be spent flushing file
>> data, draining the write buffer and freeing memory, no?
>>
>> That would be writing to disk
>>
>> [Data..................][Rename] or
>> [Rename][Data..................]
>
> Actually it is:
>
> [Data..................][allow the transaction commit to complete] or
> [allow the transaction commit to complete][Data..................]
>
> The problem is that people think of the rename as a tiny thing, but it
> is really bundled in with all of the other metadata operations that were
> done in the current transaction. The space that was allocated to hold
> the new file name, the space that was freed to remove the old file name,
> the directory entries, the directory inode etc etc.
>
> This means that holding back that one rename requires holding back every
> operation done to the filesystem.
>
> In btrfs, we're still able to do fsyncs quickly in this case
> because we have a dedicated log for that. But there are a few different
> types of operations (like disk management) that require us to wait for
> the transaction to complete even when we use the dedicated log.
>
>> Whether you drain the file data queue or the rename queue first, in the
>> end you'd have to write it all....
>
> It's about latency. The latency required to write the entire file is
> unbounded (the size of the file is unbounded). The latency required to
> commit the transaction without the file data is bounded because we are
> able to control the amount of metadata in each transaction.
>
> See the Firefox vs ext3 wars for an example of all of this; it's the
> latency the Firefox people were (rightly) complaining about.
>
>> I thought the problem of delaying the renames was complexity, well, at
>> least Ts'o said it was [1] - I'm not sure if this applies to btrfs as well.
>
> I'm afraid there are lots and lots of different issues at play. The
> most important way to look at it is that forcing data to disk is very
> slow, which is why we try to avoid it whenever we can.
>
> Applications can request that the data go to disk in lots of different
> ways. Rename was never meant to be one of them, but it really does
> make sense to provide atomic replacement of old good data with new good
> data, so we've implemented that extra syncing.
>
> Implementing syncing when userland doesn't expect it usually
> just makes userland very unhappy. It's not that we can't do it; it's that
> doing it has implications for every application that uses rename.
>
> -chris

Thanks for all the insight. I will update the wiki FAQ to make clear what "data=ordered" means in btrfs, what it does not mean, and why (or something like that).

Jakob
On 05/18/2010 09:13 AM, Chris Mason wrote:
> On Tue, May 18, 2010 at 02:03:49PM +0200, Jakob Unterwurzacher wrote:
>> On 18/05/10 02:59, Chris Mason wrote:
>>>>> Ok, I upgraded to 2.6.34 final and switched to defconfig.
>>>>> I only did the rename test (i.e. no overwrite), the window is now
>>>>> 1.1s, both with vanilla and with the patch.
>>>> Thanks, so much for the easy fix. I'll take a look.
>>> Ohhhhh, I read your initial email wrong, I'm sorry. The test we're
>>> failing, the rentest, doesn't overwrite one file with another. It is
>>> just creating a file and then renaming it.
>> Yes, the overwrite test goes perfectly fine.
>>
>>> Btrfs is explicitly choosing not to sync the file in this case because
>>> the rename isn't replacing good old data with new unwritten data. The
>>> rename is taking new unwritten data and giving it a different name.
>>>
>>> Are there applications that rely on this?
>>>
>>> -chris
>> Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
>> default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
>> fsync()ing everything and is about 2x slower than it was with ext3 [2].
>>
>> Btrfs is so close to getting it "right" that I wondered whether the new
>> file name hitting the disk could be delayed that one second for the data
>> to make it to disk first.
>
> The thing is that different apps have a different version of 'right'. Rename
> is atomically replacing one file with another, and I completely agree
> that when we have an established file on disk, we shouldn't replace it
> with something that is potentially garbage.
>
> But for the zeros case we have a file that isn't on disk and we're just
> giving it a new name. I can see a different class of applications
> getting upset about renames slowing the system down dramatically because
> they suddenly imply a lot of IO.
>
> I'm more than open to discussion on this one, but I don't see how:
>
> rm -f foo2
> dd if=/dev/zero of=foo bs=1M count=1000
> mv foo foo2
>
> Should be expected to write 1GB of data.
>
> -chris

Just to weigh in here, I think that you have the right behaviour already. If an application wants to force this to sync the data to disk, it should use fsync() after the rename.

Having applications depend on semantics that only ext3 provided is not an excuse for making a rename take multiple seconds....

Thanks!

Ric
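To put a rough number on the "multiple seconds" above, here is a back-of-the-envelope sketch of what an implied flush would cost for the dd example. The throughput figure is purely an assumption for illustration (a hypothetical disk sustaining 100 MB/s), not a measurement from this thread, and flush_seconds is a made-up helper name.

```python
MIB = 1024 * 1024


def flush_seconds(num_bytes, bytes_per_second):
    """Time needed to push num_bytes to disk at a fixed sustained rate."""
    return num_bytes / bytes_per_second


# dd if=/dev/zero of=foo bs=1M count=1000 writes 1000 blocks of 1 MiB each.
file_bytes = 1000 * MIB

# ASSUMPTION: hypothetical sequential throughput of 100 MB/s.
assumed_throughput = 100 * 10**6

delay = flush_seconds(file_bytes, assumed_throughput)
print(f"{file_bytes} bytes at {assumed_throughput} B/s: {delay:.2f} s")
```

Under that assumption the mv would block for roughly ten seconds if the rename implied a data flush, and the delay grows linearly with the file size, which is the unbounded latency being discussed.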
On Tue, May 18, 2010 at 07:00:57PM -0400, Ric Wheeler wrote:
> Just to weigh in here, I think that you have the right behaviour
> already. If an application wants to force this to sync the data to disk,
> it should use fsync() after the rename.

Actually, it pretty much has to fsync before the rename (to ensure the contents are on disk) and possibly fsync the directory afterwards to ensure the rename hits the disk. If you fsync after the rename, there is still no guarantee that a crash won't leave partial data on disk under the new filename, unless you assume the filesystem orders the writes so that the rename happens after the data hits the disk. AFAIK most filesystems make no such guarantee.

--
Bruce Guenter <bruce@untroubled.org> http://untroubled.org/
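The ordering described here (fsync the file, then rename, then fsync the directory) can be sketched as follows. This is a minimal illustration assuming a POSIX system where a directory can be opened and fsynced; atomic_write is a hypothetical helper name, not something dpkg or any tool in this thread provides.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace path with data using write, fsync, rename, fsync(dir).

    The intent is that after a crash path is either absent/old or holds
    the complete new contents, provided the filesystem honours fsync.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so that the
    # rename never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # move Python's buffer into the kernel
            os.fsync(tmp.fileno())  # data reaches disk BEFORE the rename
        os.rename(tmp_path, path)   # switch the name over atomically
    except BaseException:
        # Best-effort cleanup of the temporary file on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling atomic_write("00001.cur", b"payload") pays for one data flush per file, which is the cost dpkg took on after the ext4 reports; dropping the first fsync brings back the window for 0-size files that the rentest exposes.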
Chris Mason wrote:
> On Tue, May 18, 2010 at 02:03:49PM +0200, Jakob Unterwurzacher wrote:
>> On 18/05/10 02:59, Chris Mason wrote:
>>>>> Ok, I upgraded to 2.6.34 final and switched to defconfig.
>>>>> I only did the rename test (i.e. no overwrite), the window is now
>>>>> 1.1s, both with vanilla and with the patch.
>>>> Thanks, so much for the easy fix. I'll take a look.
>>> Ohhhhh, I read your initial email wrong, I'm sorry. The test we're
>>> failing, the rentest, doesn't overwrite one file with another. It is
>>> just creating a file and then renaming it.
>> Yes, the overwrite test goes perfectly fine.
>>
>>> Btrfs is explicitly choosing not to sync the file in this case because
>>> the rename isn't replacing good old data with new unwritten data. The
>>> rename is taking new unwritten data and giving it a different name.
>>>
>>> Are there applications that rely on this?
>>>
>>> -chris
>> Well, dpkg (the Debian/Ubuntu package manager) did. Then ext4 became the
>> default fs in Ubuntu and massive breakage was reported [1]. Now dpkg is
>> fsync()ing everything and is about 2x slower than it was with ext3 [2].
>>
>> Btrfs is so close to getting it "right" that I wondered whether the new
>> file name hitting the disk could be delayed that one second for the data
>> to make it to disk first.
>
> The thing is that different apps have a different version of 'right'. Rename
> is atomically replacing one file with another, and I completely agree
> that when we have an established file on disk, we shouldn't replace it
> with something that is potentially garbage.
>
> But for the zeros case we have a file that isn't on disk and we're just
> giving it a new name. I can see a different class of applications
> getting upset about renames slowing the system down dramatically because
> they suddenly imply a lot of IO.
>
> I'm more than open to discussion on this one, but I don't see how:
>
> rm -f foo2
> dd if=/dev/zero of=foo bs=1M count=1000
> mv foo foo2
>
> Should be expected to write 1GB of data.

[disclaimer: I don't know much about btrfs internals]

foo2 being gone after a crash is, of course, fine. But, depending on the programmer, there are a few answers:

1. I want foo2 to either not exist or to contain the data I just wrote. So please wait for it to hit disk.

2. I want foo2 to either not exist or to contain the data I just wrote. So, btrfs, please learn how to make sure that the metadata doesn't get written until the data gets written. Presumably this means that the rename needs to go into a log somewhere (in memory) but not become a part of the current transaction, to avoid all kinds of latency.

3. I want speed. Do whatever's fastest.

Of course, there's a harder case:

    dd if=/dev/zero of=foo bs=1M count=1000
    mv foo foo2
    dd if=<something else> of=foo2 bs=1k count=1

Now what?

A lot of application programmers probably want the metadata to happen after the data, but they don't want to use fsync because they don't want to wait for anything to hit disk. It would be nice to ask the FS for help, but that might be distinctly nontrivial.

--Andy