Hi Chris-
I noticed some data and metadata getting out of sync on disk, despite 
wrapping my writes with btrfs transactions.  After digging into it a bit, 
it appears to be a larger problem with inode size/data getting written 
during a regular commit.
I have a test program append a few bytes at a time to a few different 
files, in a loop.  I let it run until I see a btrfs transaction commit 
(via a printk at the bottom of btrfs_commit_transaction).  Then ''reboot
-f
-n''.  After remounting, all files exist but are 0 bytes, and debug-tree
shows a bunch of empty files.  I would expect to see either the sizes when 
the commit happend (a few hundred KB in my case), or no files at all; 
there was actually no point in time when any of the files were 0 bytes.
Similarly, if I do the same but wait for a few commits to happen, after 
remount the file sizes reflect the size from around the next-to-last 
commit, not the last commit.
This is probably more information than you need, but my original test was 
a bit more complicated, with weirder results.  Append to each file, then 
write it''s size to an xattr on another file.  Wrap both operations in a
transaction.  Start it up, run ''sync'', then reboot -f -n. 
When I remount
the size and xattr are out of sync by exactly one iteration: the xattr 
reflects the size that resulted from _two_ writes back, not the 
immediately preceeding write.  If anything I would expect to see a larger 
actual size than xattr value (for example if the start/end transaction 
ioctls weren''t working)...
sage
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int main(int argc, char **argv)
{
	while (1) {
		int r, fd, pos, i = rand() % 10;
		char a[20];
		
		sprintf(a, "%d.log", i);
		fd = open(a, O_CREAT|O_APPEND|O_WRONLY, 0600);
		r = write(fd, "foobarfoo\n", 10);
		pos = lseek(fd, 0, SEEK_CUR);
		printf("write %s = %d, size = %d\n", a, r, pos);
		close(fd);
	}
}
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
2008/12/19 Sage Weil <sage@newdream.net>:> Hi Chris- > > I noticed some data and metadata getting out of sync on disk, despite > wrapping my writes with btrfs transactions. After digging into it a bit, > it appears to be a larger problem with inode size/data getting written > during a regular commit. > > I have a test program append a few bytes at a time to a few different > files, in a loop. I let it run until I see a btrfs transaction commit > (via a printk at the bottom of btrfs_commit_transaction). Then ''reboot -f > -n''. After remounting, all files exist but are 0 bytes, and debug-tree > shows a bunch of empty files. I would expect to see either the sizes when > the commit happend (a few hundred KB in my case), or no files at all; > there was actually no point in time when any of the files were 0 bytes. > > Similarly, if I do the same but wait for a few commits to happen, after > remount the file sizes reflect the size from around the next-to-last > commit, not the last commit. > > This is probably more information than you need, but my original test was > a bit more complicated, with weirder results. Append to each file, then > write it''s size to an xattr on another file. Wrap both operations in a > transaction. Start it up, run ''sync'', then reboot -f -n. When I remount > the size and xattr are out of sync by exactly one iteration: the xattr > reflects the size that resulted from _two_ writes back, not the > immediately preceeding write. If anything I would expect to see a larger > actual size than xattr value (for example if the start/end transaction > ioctls weren''t working)... > > sage > > > > #include <stdio.h> > #include <stdlib.h> > #include <unistd.h> > #include <stdlib.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > > int main(int argc, char **argv) > { > while (1) { > int r, fd, pos, i = rand() % 10; > char a[20]; > > sprintf(a, "%d.log", i); > fd = open(a, O_CREAT|O_APPEND|O_WRONLY, 0600); > r = write(fd, "foobarfoo\n", 10); > pos = lseek(fd, 0, SEEK_CUR); > printf("write %s = %d, size = %d\n", a, r, pos); > close(fd); > } > } >This is the desired behaviour of data=ordered. Btrfs transaction commit don''t flush data, and metadata wont get updated until data IO complete. http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code Regards Yan Zheng -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 19 Dec 2008, Yan Zheng wrote:> > I noticed some data and metadata getting out of sync on disk, despite > > wrapping my writes with btrfs transactions. After digging into it a bit, > > it appears to be a larger problem with inode size/data getting written > > during a regular commit. > > [...] > > This is the desired behaviour of data=ordered. Btrfs transaction commit > don''t flush data, and metadata wont get updated until data IO complete. > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+codeAh, right, so it is. I think what I''m looking for then is a mount mode to get the old behavior, such that each commit flushes previously written data. Probably a call to btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something along those lines... sage -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote:> On Fri, 19 Dec 2008, Yan Zheng wrote: > > > I noticed some data and metadata getting out of sync on disk, despite > > > wrapping my writes with btrfs transactions. After digging into it a bit, > > > it appears to be a larger problem with inode size/data getting written > > > during a regular commit. > > > [...] > > > > This is the desired behaviour of data=ordered. Btrfs transaction commit > > don''t flush data, and metadata wont get updated until data IO complete. > > > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code > > Ah, right, so it is. > > I think what I''m looking for then is a mount mode to get the old behavior, > such that each commit flushes previously written data. Probably a call to > btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something > along those lines...Could you describe the end goal a bit? I''m happy to make modes where it''ll do what you need. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 19 Dec 2008, Chris Mason wrote:> On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote: > > On Fri, 19 Dec 2008, Yan Zheng wrote: > > > > I noticed some data and metadata getting out of sync on disk, despite > > > > wrapping my writes with btrfs transactions. After digging into it a bit, > > > > it appears to be a larger problem with inode size/data getting written > > > > during a regular commit. > > > > [...] > > > > > > This is the desired behaviour of data=ordered. Btrfs transaction commit > > > don''t flush data, and metadata wont get updated until data IO complete. > > > > > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code > > > > Ah, right, so it is. > > > > I think what I''m looking for then is a mount mode to get the old behavior, > > such that each commit flushes previously written data. Probably a call to > > btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something > > along those lines... > > Could you describe the end goal a bit? I''m happy to make modes where > it''ll do what you need.The end goal is for data to flush and commit with the transaction that was running when the write() occured. So, after a sequence like write A setxattr B <crash> you should always see A if you see B. And after a sequence like ioctl(fd, BTRFS_IOC_TRANS_START) write A setxattr B close(fd) <crash> you should see either both A and B or neither A nor B. fsync() isn''t really appropriate since it forces a commit (or a tree log entry?), and it would still be better to roll lots of operations up together. Either a mount mode that includes dirty data in each transaction commit (and probably disables the tree log?), or a per-file fsync-like operation that commits an individual file''s dirty data to the running transaction would do the trick. sage -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2008-12-19 at 10:48 -0800, Sage Weil wrote:> On Fri, 19 Dec 2008, Chris Mason wrote: > > On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote: > > > On Fri, 19 Dec 2008, Yan Zheng wrote: > > > > > I noticed some data and metadata getting out of sync on disk, despite > > > > > wrapping my writes with btrfs transactions. After digging into it a bit, > > > > > it appears to be a larger problem with inode size/data getting written > > > > > during a regular commit. > > > > > [...] > > > > > > > > This is the desired behaviour of data=ordered. Btrfs transaction commit > > > > don''t flush data, and metadata wont get updated until data IO complete. > > > > > > > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code > > > > > > Ah, right, so it is. > > > > > > I think what I''m looking for then is a mount mode to get the old behavior, > > > such that each commit flushes previously written data. Probably a call to > > > btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something > > > along those lines... > > > > Could you describe the end goal a bit? I''m happy to make modes where > > it''ll do what you need. > > The end goal is for data to flush and commit with the transaction that was > running when the write() occured. > > So, after a sequence like > write A > setxattr B > <crash> > you should always see A if you see B. > > And after a sequence like > ioctl(fd, BTRFS_IOC_TRANS_START) > write A > setxattr B > close(fd) > <crash> > you should see either both A and B or neither A nor B. > > fsync() isn''t really appropriate since it forces a commit (or a tree log > entry?), and it would still be better to roll lots of operations up > together. Either a mount mode that includes dirty data in each > transaction commit (and probably disables the tree log?), or a per-file > fsync-like operation that commits an individual file''s dirty data to the > running transaction would do the trick.A third option is a different type of xattr operation that doesn''t go to disk until the metadata updates done at IO end time.>From a performance point of view, it''ll be much faster than slowing downcommit with data writes. Can that work for you? -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 19 Dec 2008, Chris Mason wrote:> On Fri, 2008-12-19 at 10:48 -0800, Sage Weil wrote: > > On Fri, 19 Dec 2008, Chris Mason wrote: > > > On Thu, 2008-12-18 at 21:21 -0800, Sage Weil wrote: > > > > On Fri, 19 Dec 2008, Yan Zheng wrote: > > > > > > I noticed some data and metadata getting out of sync on disk, despite > > > > > > wrapping my writes with btrfs transactions. After digging into it a bit, > > > > > > it appears to be a larger problem with inode size/data getting written > > > > > > during a regular commit. > > > > > > [...] > > > > > > > > > > This is the desired behaviour of data=ordered. Btrfs transaction commit > > > > > don''t flush data, and metadata wont get updated until data IO complete. > > > > > > > > > > http://article.gmane.org/gmane.comp.file-systems.btrfs/869/match=new+data+ordered+code > > > > > > > > Ah, right, so it is. > > > > > > > > I think what I''m looking for then is a mount mode to get the old behavior, > > > > such that each commit flushes previously written data. Probably a call to > > > > btrfs_wait_ordered_extents() in btrfs_commit_transaction(), or something > > > > along those lines... > > > > > > Could you describe the end goal a bit? I''m happy to make modes where > > > it''ll do what you need. > > > > The end goal is for data to flush and commit with the transaction that was > > running when the write() occured. > > > > So, after a sequence like > > write A > > setxattr B > > <crash> > > you should always see A if you see B. > > > > And after a sequence like > > ioctl(fd, BTRFS_IOC_TRANS_START) > > write A > > setxattr B > > close(fd) > > <crash> > > you should see either both A and B or neither A nor B. > > > > fsync() isn''t really appropriate since it forces a commit (or a tree log > > entry?), and it would still be better to roll lots of operations up > > together. Either a mount mode that includes dirty data in each > > transaction commit (and probably disables the tree log?), or a per-file > > fsync-like operation that commits an individual file''s dirty data to the > > running transaction would do the trick. > > A third option is a different type of xattr operation that doesn''t go to > disk until the metadata updates done at IO end time. > > >From a performance point of view, it''ll be much faster than slowing down > commit with data writes. > > Can that work for you?I suspect not, since multiple files are involved. It''s usually something like write A setxattr A write B setxattr C and all need to be committed atomically. The model really is a bundle of arbitrary operations that commit atomically. Slower commit times aren''t as much of a concern because this is on the storage backend, behind client caches and so forth. I think it''s a reasonable price to pay for the stronger consistency. Hopefully it''s not throwing too big a wrench into the data=ordered machinery? It sort of looks like this is already what you get when taking a snapshot (I see the call to wait_ordered_extnets in commit_transaction when snaps_pending). sage -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2008-12-19 at 12:08 -0800, Sage Weil wrote: [ trigger data=ordered flush at commit time ]> > > A third option is a different type of xattr operation that doesn''t go to > > disk until the metadata updates done at IO end time. > > > > >From a performance point of view, it''ll be much faster than slowing down > > commit with data writes. > > > > Can that work for you? > > I suspect not, since multiple files are involved. It''s usually something > like > > write A > setxattr A > write B > setxattr C > > and all need to be committed atomically. The model really is a bundle of > arbitrary operations that commit atomically. > > Slower commit times aren''t as much of a concern because this is on the > storage backend, behind client caches and so forth. I think it''s > a reasonable price to pay for the stronger consistency. > > Hopefully it''s not throwing too big a wrench into the data=ordered > machinery? It sort of looks like this is already what you get when taking > a snapshot (I see the call to wait_ordered_extnets in commit_transaction > when snaps_pending).If we make it optional, its fine by me to added a data=ordered flush at commit time. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html