I often get zero-length files in btrfs snapshots (when the original
files were not zero-length).  The shell script below reproduces this
problem on two Ubuntu machines, with Ubuntu kernels 2.6.31-17.54 and
2.6.32-12.17.  Is there some mistaken assumption I'm making here in
terms of how btrfsctl works?

Nickolai.

---

root@sahara:/# cat /tmp/btrbug.sh
#!/bin/sh -x

if id | grep -qv uid=0; then
    echo "Must run setup as root"
    exit 1
fi

if losetup -a | grep -q /dev/shm/fs.img; then
    echo "Loopback FS mounted, unmounting.."
    umount /mnt/x || exit 2
fi

rmmod btrfs
rmmod zlib_deflate
rmmod libcrc32c
modprobe btrfs

dd if=/dev/zero of=/dev/shm/fs.img bs=1024k count=256 || exit 2
mkfs -t btrfs /dev/shm/fs.img || exit 2
mkdir -p /mnt/x || exit 2
mount -o loop -t btrfs /dev/shm/fs.img /mnt/x || exit 2

mkdir /mnt/x/d || exit 2
echo x1 > /mnt/x/d/foo.txt || exit 2
btrfsctl -s /mnt/x/snap /mnt/x/d
wc -l /mnt/x/d/foo.txt
wc -l /mnt/x/snap/d/foo.txt

root@sahara:/# /tmp/btrbug.sh
+ id
+ grep -qv uid=0
+ losetup -a
+ grep -q /dev/shm/fs.img
+ echo Loopback FS mounted, unmounting..
Loopback FS mounted, unmounting..
+ umount /mnt/x
+ rmmod btrfs
+ rmmod zlib_deflate
+ rmmod libcrc32c
+ modprobe btrfs
+ dd if=/dev/zero of=/dev/shm/fs.img bs=1024k count=256
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 0.231684 s, 1.2 GB/s
+ mkfs -t btrfs /dev/shm/fs.img

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/shm/fs.img
        nodesize 4096 leafsize 4096 sectorsize 4096 size 256.00MB
Btrfs Btrfs v0.19
+ mkdir -p /mnt/x
+ mount -o loop -t btrfs /dev/shm/fs.img /mnt/x
+ mkdir /mnt/x/d
+ echo x1
+ btrfsctl -s /mnt/x/snap /mnt/x/d
operation complete
Btrfs Btrfs v0.19
+ wc -l /mnt/x/d/foo.txt
1 /mnt/x/d/foo.txt
+ wc -l /mnt/x/snap/d/foo.txt
0 /mnt/x/snap/d/foo.txt
root@sahara:/#
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hi,

   > I often get zero-length files in btrfs snapshots (when the
   > original files were not zero-length).  The shell script below
   > reproduces this problem on two Ubuntu machines, with Ubuntu
   > kernels 2.6.31-17.54 and 2.6.32-12.17.  Is there some mistaken
   > assumption I'm making here in terms of how btrfsctl works?
   > [..]
   >
   > echo x1 > /mnt/x/d/foo.txt || exit 2
   > btrfsctl -s /mnt/x/snap /mnt/x/d

You're just missing a sync/fsync() between these two lines.

We argued on IRC a while ago about whether this is a sensible default;
cmason wants the no-sync version of snapshot creation to be available,
but was amenable to the idea of changing the default to be sync before
snapshot, since it was pointed out that no-one other than him had
understood we were supposed to be running sync first.

- Chris.
--
Chris Ball   <cjb@laptop.org>
One Laptop Per Child
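
A minimal sketch of the ordering Chris describes, using the paths from
the original script.  The function name and the RUN variable are
hypothetical helpers, not part of btrfsctl: RUN defaults to "echo", so
the commands are only printed, and setting RUN to the empty string
(as root) would actually run them.

```shell
#!/bin/sh
# Sketch: flush dirty data first, then take the snapshot.
# RUN is a hypothetical dry-run hook; by default commands are printed,
# not executed. Use RUN= (empty) to execute them for real.
RUN=${RUN-echo}

snapshot_after_sync() {
    # $1 = snapshot path, $2 = directory inside the subvolume
    $RUN sync || return 1
    $RUN btrfsctl -s "$1" "$2"
}

snapshot_after_sync /mnt/x/snap /mnt/x/d
```

Whether a plain sync is sufficient on a given kernel is something the
thread itself debates, so treat this as an illustration of the ordering
only.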

On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball <cjb@laptop.org> wrote:
> > echo x1 > /mnt/x/d/foo.txt || exit 2
> > btrfsctl -s /mnt/x/snap /mnt/x/d
>
> You're just missing a sync/fsync() between these two lines.
>
> We argued on IRC a while ago about whether this is a sensible default;
> cmason wants the no-sync version of snapshot creation to be available,
> but was amenable to the idea of changing the default to be sync before
> snapshot, since it was pointed out that no-one other than him had
> understood we were supposed to be running sync first.
>

You're saying that it only snapshots the on-disk data structures and
not the in-memory versions?  That can only lead to pain.  What do you
do if something else during this race condition?  What would a sync do
to solve this?  Have the semantics of sync been changed in btrfs from
"sync everything that hasn't been written yet" to "sync this
subvolume"?

From what I understand what should be happening is much like what LVM
should do:

step 1: defer all other writes to subvolume (userspace processes get
        stuck in D state until step 4)
step 2: sync all changes not already committed to subvolume
step 3: create snapshot
step 4: resume writes from userspace

Now if all 4 steps can be done with in-memory data structures without
forcing data (not necessarily meta-data) to disk, so much the better.

On Thu, Feb 11, 2010 at 08:50:48PM -0800, Mike Fedyk wrote:
> [..]
> You're saying that it only snapshots the on-disk data structures and
> not the in-memory versions?  That can only lead to pain.  What do you
> do if something else during this race condition?  What would a sync do
> to solve this?  Have the semantics of sync been changed in btrfs from
> "sync everything that hasn't been written yet" to "sync this
> subvolume"?
>

Welcome to delalloc.  You either get fast writes or you get all of your
data on the disk every 5 seconds.  If you don't like delalloc, use ext3.
The data you've written to memory doesn't go down to disk unless
explicitly told to, such as

1) fsync - this is obvious
2) vm - the vm has decided that this dirty page has been sitting around
   long enough and should be written back to the disk, could happen now,
   could happen 10 years from now.
3) sync - this is not as obvious.  sync doesn't mean anything other than
   "start writing back dirty data to the fs", and returns before it's
   done.  For btrfs what that means is we run through _every_ inode that
   has delalloc pages associated with them and start writeback on them.
   This will get most of your data into the current transaction, which
   is when the snapshot happens.

If you don't want empty files, do something like this

btrfsctl -c /dir/to/volume
btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume

this is what we do with yum and its rollback plugin, and it works out
quite well.  Thanks,

Josef
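
For a single file, the fsync route (item 1 in the list above) can also
be taken from the shell.  A hedged sketch, assuming GNU dd, whose
conv=fsync option makes dd call fsync() on the output file before it
exits; the function name and the temporary directory are illustrative
only.

```shell
#!/bin/sh
# Sketch: write a file and fsync it before the writer exits.
# Assumes GNU dd (conv=fsync). write_fsynced is a hypothetical helper.
write_fsynced() {
    # $1 = destination file; the content is read from stdin
    dd of="$1" conv=fsync 2>/dev/null
}

dir=$(mktemp -d) || exit 1
printf 'x1\n' | write_fsynced "$dir/foo.txt"
wc -l < "$dir/foo.txt"
rm -rf "$dir"
```

This only covers files the script writes itself; data written by other
processes still needs the filesystem-wide btrfsctl -c step shown above.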

On Fri, Feb 12, 2010 at 7:19 AM, Josef Bacik <josef@redhat.com> wrote:
> [..]
> This will get most of your data into the current transaction, which
> is when the snapshot happens.
>
> If you don't want empty files, do something like this
>
> btrfsctl -c /dir/to/volume
> btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume
>
> this is what we do with yum and its rollback plugin, and it works out
> quite well.  Thanks,
>

Then you broke your ordering guarantee.  If the data isn't there, the
meta-data shouldn't be there either.  So the snapshots made before the
data hits a transaction shouldn't have the file at all.

On Fri, Feb 12, 2010 at 08:18:01AM -0800, Mike Fedyk wrote:
> [..]
> Then you broke your ordering guarantee.  If the data isn't there, the
> meta-data shouldn't be there either.  So the snapshots made before the
> data hits a transaction shouldn't have the file at all.

Nope, what is happening is

fd = creat("file")  <- this is metadata that needs to be written
write(fd, buf)      <- because of delalloc there is no metadata that is
                       created for this operation, therefore it doesn't
                       need to be written out.
close(fd)

so the file has metadata created for it, which needs to be written out.
Because of delalloc there are no extents created or anything for the
data, therefore there is nothing to write.  Thanks,

Josef

On Fri, Feb 12, 2010 at 8:22 AM, Josef Bacik <josef@redhat.com> wrote:
> [..]
> Nope, what is happening is
>
> fd = creat("file")  <- this is metadata that needs to be written
> write(fd, buf)      <- because of delalloc there is no metadata that is
>                        created for this operation, therefore it doesn't
>                        need to be written out.
> close(fd)
>
> so the file has metadata created for it, which needs to be written out.
> Because of delalloc there are no extents created or anything for the
> data, therefore there is nothing to write.  Thanks,
>

So file creation is effectively synchronous?  So I could create a
benchmark that creates millions of files and it would be limited to
the IO OP performance of the disks?

Why does file creation need to hit the disk before the contents (with
limits to size of data that can fit in one transaction)?

On Fri, Feb 12, 2010 at 08:27:00AM -0800, Mike Fedyk wrote:
> [..]
> So file creation is effectively synchronous?  So I could create a
> benchmark that creates millions of files and it would be limited to
> the IO OP performance of the disks?
>
> Why does file creation need to hit the disk before the contents (with
> limits to size of data that can fit in one transaction)?

File creation isn't synchronous, it just modifies metadata, which needs
to be committed when the transaction commits.  So if you create millions
of files you are going to be held up every 30 seconds as the transaction
commits and writes all the files you were able to create within that 30
seconds, same as _any_ filesystem that does ordered mode.

Creating a file is a metadata operation, and _any_ metadata operation
has to be committed to disk when the transaction commits in order to
maintain a coherent fs.  Thanks,

Josef
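
The benchmark Mike has in mind could be sketched roughly as follows.
This is an illustration only: the helper name, the file count of 1000
and the temporary directory are assumptions, and `date +%s` gives
whole-second resolution at best.

```shell
#!/bin/sh
# Sketch: create N empty files (metadata-only operations) and time it.
# create_files is a hypothetical helper, not an existing tool.
create_files() {
    # $1 = target directory, $2 = number of files to create
    i=0
    while [ "$i" -lt "$2" ]; do
        : > "$1/f$i" || return 1
        i=$((i + 1))
    done
}

dir=$(mktemp -d) || exit 1
start=$(date +%s)
create_files "$dir" 1000
end=$(date +%s)
echo "created $(ls "$dir" | wc -l) files in $((end - start)) s"
rm -rf "$dir"
```

To observe the periodic stall Josef describes, the target directory
would have to be on the filesystem under test and the count large
enough for the run to span several transaction commits.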

On Fri, Feb 12, 2010 at 8:32 AM, Josef Bacik <josef@redhat.com> wrote:
> [..]
> File creation isn't synchronous, it just modifies metadata, which needs
> to be committed when the transaction commits.  So if you creat millions
> of files you are going to be held up every 30 seconds as the transaction
> commits and writes all the files you were able to create within that 30
> seconds, same as _any_ filesystem that does ordered mode.
>
> Creating a file is a metadata operation, and _any_ metadata operation
> has to be committed to disk when the transaction commits in order to
> maintain a coherent fs.  Thanks,
>

Thanks, I understand better now.

What I still don't understand though is that the create could have
taken up to 30 seconds to commit and the same for the few bytes of
data, but a few ms later a snapshot was made and the metadata change
was there and the data change was not.  Could it have happened that
the snapshot would not have the newly created file and this was just a
timing issue that should not be relied upon?

I'm just wondering why that file was there at all.

On 02/12/10 09:19, Josef Bacik wrote:
> [..]
> If you don't want empty files, do something like this
>
> btrfsctl -c /dir/to/volume
> btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume
>
> this is what we do with yum and its rollback plugin, and it works out
> quite well.  Thanks,
>
> Josef
>

Is there a race in there?  It seems like if a process starts modifying a
file between the sync and the snapshot, data could still be lost.  Is
there something else going on here that I'm missing that would prevent
this race?

--Ravi

On Fri, Feb 12, 2010 at 12:22:12PM -0600, Ravi Pinjala wrote:
> [..]
> Is there a race in there?  It seems like if a process starts modifying a
> file between the sync and the snapshot, data could still be lost.  Is
> there something else going on here that I'm missing that would prevent
> this race?
>

Data won't be lost, it just won't be there in the snapshot, and will be
there in the source.  Thanks,

Josef

Hi,

   > Is there a race in there?  It seems like if a process starts
   > modifying a file between the sync and the snapshot, data could
   > still be lost.  Is there something else going on here that I'm
   > missing that would prevent this race?

AIUI, you're correct that a writer process concurrent to a snapshot
leads to a race with data that doesn't make it into the snapshot,
but I think you're wrong in thinking that there's much we could do
about it -- either you miss writes between sync and snapshot, as we
do now, or we do sync-and-snapshot atomically, and the concurrent
writes are missed because we decided to block further writes from
that process before we took the snapshot.

The only real answer is to quiesce the writer process before you begin.
Does that make sense?

- Chris.
--
Chris Ball   <cjb@laptop.org>
One Laptop Per Child
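
One way to read "quiesce the writer" from the shell is to stop the
writing process with SIGSTOP, flush, snapshot, and then resume it.  The
sketch below is an assumption about how that could look, not something
proposed in the thread: the function name, the PID 12345 and the RUN
dry-run hook (default "echo", so nothing is executed) are all
hypothetical.

```shell
#!/bin/sh
# Sketch: pause a writer, flush dirty data, snapshot, resume the writer.
# RUN is a hypothetical dry-run hook; use RUN= (empty) to execute for real.
RUN=${RUN-echo}

snapshot_quiesced() {
    # $1 = writer PID, $2 = snapshot path, $3 = directory in the subvolume
    $RUN kill -STOP "$1" || return 1
    $RUN sync
    $RUN btrfsctl -s "$2" "$3"
    status=$?
    # Resume the writer even if the snapshot step failed.
    $RUN kill -CONT "$1"
    return "$status"
}

snapshot_quiesced 12345 /mnt/x/snap /mnt/x/d
```

Stopping a process this way only guarantees that no new writes arrive;
it says nothing about whether the files are in an application-consistent
state at that moment.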
On Fri, Feb 12, 2010 at 10:19:40AM -0500, Josef Bacik wrote:
> 3) sync - this is not as obvious.  sync doesn't mean anything other than "start
> writing back dirty data to the fs", and returns before it's done.  For btrfs
> what that means is we run through _every_ inode that has delalloc pages
> associated with them and start writeback on them.  This will get most of your
> data into the current transaction, which is when the snapshot happens.

sync does return synchronously on Linux, even if that is not guaranteed by
POSIX.
Mike Fedyk wrote (ao):
> On Fri, Feb 12, 2010 at 8:32 AM, Josef Bacik <josef@redhat.com> wrote:
> > Creating a file is a metadata operation, and _any_ metadata operation has to be
> > committed to disk when the transaction commits in order to maintain a coherent
> > fs.  Thanks,
>
> What I still don't understand though is that the create could have
> taken up to 30 seconds to commit and the same for the few bytes of
> data, but a few ms later a snapshot was made and the metadata change
> was there and the data change was not.  Could it have happened that
> the snapshot would not have the newly created file and this was just a
> timing issue that should not be relied upon?
>
> I'm just wondering why that file was there at all.

I would say that is because the moment the file got created, the
resulting metadata was committed immediately. The data was not yet.

	With kind regards, Sander

--
Humilis IT Services and Solutions
http://www.humilis.net
On Sat, Feb 13, 2010 at 3:25 AM, Sander <sander@humilis.net> wrote:
> Mike Fedyk wrote (ao):
>> On Fri, Feb 12, 2010 at 8:32 AM, Josef Bacik <josef@redhat.com> wrote:
>> > Creating a file is a metadata operation, and _any_ metadata operation has to be
>> > committed to disk when the transaction commits in order to maintain a coherent
>> > fs.  Thanks,
>>
>> What I still don't understand though is that the create could have
>> taken up to 30 seconds to commit and the same for the few bytes of
>> data, but a few ms later a snapshot was made and the metadata change
>> was there and the data change was not.  Could it have happened that
>> the snapshot would not have the newly created file and this was just a
>> timing issue that should not be relied upon?
>>
>> I'm just wondering why that file was there at all.
>
> I would say that is because the moment the file got created, the
> resulting metadata was committed immediately. The data was not yet.
>

Josef explained it to me on IRC.  Meta-data changes like file creation
get added to the current transaction, and snapshots start a new
transaction, so that is why the empty file is in the snapshot.

The file is empty because with delayed allocation, the data has not
hit the filesystem yet and thus has no representation in filesystem
operations like snapshots.
On Sat, 13 Feb 2010, Mike Fedyk wrote:
> On Sat, Feb 13, 2010 at 3:25 AM, Sander <sander@humilis.net> wrote:
> > Mike Fedyk wrote (ao):
> >> On Fri, Feb 12, 2010 at 8:32 AM, Josef Bacik <josef@redhat.com> wrote:
> >> > Creating a file is a metadata operation, and _any_ metadata operation has to be
> >> > committed to disk when the transaction commits in order to maintain a coherent
> >> > fs.  Thanks,
> >>
> >> What I still don't understand though is that the create could have
> >> taken up to 30 seconds to commit and the same for the few bytes of
> >> data, but a few ms later a snapshot was made and the metadata change
> >> was there and the data change was not.  Could it have happened that
> >> the snapshot would not have the newly created file and this was just a
> >> timing issue that should not be relied upon?
> >>
> >> I'm just wondering why that file was there at all.
> >
> > I would say that is because the moment the file got created, the
> > resulting metadata was committed immediately. The data was not yet.
>
> Josef explained it to me on IRC.  Meta-data changes like file creation
> get added to the current transaction and snapshots start a new
> transaction so that is why the empty file is in the snapshot.
>
> The file is empty because with delayed allocation, the data has not
> hit the filesystem yet and thus has no representation in filesystem
> operations like snapshots.

You can make btrfs include the file data in the snapshot along with the
metadata with the 'flushoncommit' mount option.  The problem is that this
will make _all_ btrfs commits more expensive, as they'll block new
operations during the commit while old data is being flushed out.

We could trivially make this happen only when there is a new snapshot, to
get the behavior you expect (see patch below).  If the goal is to make a
perfectly consistent snapshot of the file system, this is better than

	sync ; btrfsctl -s snap whatever

because there wouldn't be a window where metadata changes make it into the
snapshot but file data does not.

Is there really a use case for the sort of 'lazy' snapshots with
out-of-sync data and metadata (like 0-byte files)?  If so, we should add
another ioctl for a full-blown snapshot so that users who _do_ want a
fully consistent snapshot can get it.

If not, something like the below should be sufficient to make all
snapshots fully consistent...

sage

---

From: Sage Weil <sage@newdream.net>
Date: Fri, 19 Feb 2010 14:13:50 -0800
Subject: [PATCH] Btrfs: flush data on snapshot creation

Flush any delalloc extents when we create a snapshot, so that recently
written file data is always included in the snapshot.

Signed-off-by: Sage Weil <sage@newdream.net>
---
 fs/btrfs/transaction.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index e83d4e1..f5b7029 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1084,13 +1084,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
 		mutex_unlock(&root->fs_info->trans_mutex);
 
-		if (flush_on_commit) {
+		if (flush_on_commit || snap_pending) {
 			btrfs_start_delalloc_inodes(root, 1);
 			ret = btrfs_wait_ordered_extents(root, 0, 1);
 			BUG_ON(ret);
-		} else if (snap_pending) {
-			ret = btrfs_wait_ordered_extents(root, 0, 1);
-			BUG_ON(ret);
 		}
 
 		/*
-- 
1.6.6.1
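To see whether a given kernel or mount option produces the snapshot contents you expect, the reproducer from the start of the thread can be reduced to a byte-for-byte comparison of the live file and its snapshot copy. The sketch below is an assumption-laden illustration: snapshot_matches is a made-up helper name, and the commented usage simply reuses the paths from the original script with 'flushoncommit' added to the mount options as described above.

```shell
# Hypothetical check: succeed only if the snapshot copy of a file is
# byte-for-byte identical to the live file.
snapshot_matches() {
    live=$1
    snap=$2
    cmp -s "$live" "$snap"
}

# Assumed usage, reusing the paths from the original reproducer:
#   mount -o loop,flushoncommit -t btrfs /dev/shm/fs.img /mnt/x
#   mkdir /mnt/x/d && echo x1 > /mnt/x/d/foo.txt
#   btrfsctl -s /mnt/x/snap /mnt/x/d
#   snapshot_matches /mnt/x/d/foo.txt /mnt/x/snap/d/foo.txt ||
#       echo "snapshot copy differs from the live file"
```

A zero-length file in the snapshot, as in the original report, makes the comparison fail.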
On Friday 19 February 2010, Sage Weil wrote:
[...]
> We could trivially make this happen only when there is a new snapshot, to
> get the behavior you expect (see patch below).  If the goal is to make a
> perfectly consistent snapshot of the file system, this is better than
>
> 	sync ; btrfsctl -s snap whatever
>
> because there wouldn't be a window where metadata changes make it into the
> snapshot but file data does not.

I don't have the knowledge to say whether your patch is good from a
performance point of view, but to me the behaviour of your patch seems a
reasonable default.

I can accept that a crash may break the expected sequence of writes to the
disk, so that data which should be on the disk never reaches it. But I can
reduce that risk with a UPS.

A snapshot that may not contain the latest data, on the other hand, seems
to me unacceptable behaviour.

Worse, this behaviour may lead people to write code like

	do_sync();
	do_snapshot();

which is difficult to optimise at the kernel level. If instead we put the
sync before the snapshot in the core of btrfs, then even though there is a
performance problem at present, it may be possible to optimise it later
(even in the far future).

Regards
Goffredo

> Is there really a use case for the sort of 'lazy' snapshots with
> out-of-sync data and metadata (like 0-byte files)?  If so, we should add
> another ioctl for a full-blown snapshot so that users who _do_ want a
> fully consistent snapshot can get it.
>
> If not, something like the below should be sufficient to make all
> snapshots fully consistent...
>
> sage
>
> ---
>
> From: Sage Weil <sage@newdream.net>
> Date: Fri, 19 Feb 2010 14:13:50 -0800
> Subject: [PATCH] Btrfs: flush data on snapshot creation
>
> Flush any delalloc extents when we create a snapshot, so that recently
> written file data is always included in the snapshot.
>
> Signed-off-by: Sage Weil <sage@newdream.net>
> ---
>  fs/btrfs/transaction.c |    5 +----
>  1 files changed, 1 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index e83d4e1..f5b7029 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1084,13 +1084,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
>
>  		mutex_unlock(&root->fs_info->trans_mutex);
>
> -		if (flush_on_commit) {
> +		if (flush_on_commit || snap_pending) {
>  			btrfs_start_delalloc_inodes(root, 1);
>  			ret = btrfs_wait_ordered_extents(root, 0, 1);
>  			BUG_ON(ret);
> -		} else if (snap_pending) {
> -			ret = btrfs_wait_ordered_extents(root, 0, 1);
> -			BUG_ON(ret);
>  		}
>
>  		/*
> --
> 1.6.6.1
>

--
gpg key@ keyserver.linux.it: Goffredo Baroncelli (ghigo) <kreijackAtinwind.it>
Key fingerprint = 4769 7E51 5293 D36C 814E  C054 BF04 F161 3DC5 0512