Hi,

Does btrfs support atomic file data replaces? Basically, the atomic
variant of this:

  // old state
  open(O_TRUNC)
  write()  // 0+ times
  close()
  // new state

-- 
Olaf
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 6 January 2011 20:01, Olaf van der Spek <olafvdspek@gmail.com> wrote:
> Hi,
>
> Does btrfs support atomic file data replaces?

Hi Olaf,

Yes, btrfs does support atomic replace, since kernel 2.6.30, circa June
2009. [1]

Special handling was added to ext3, ext4, btrfs (and probably other
Linux FSs) for your replace-via-truncate and the alternative
replace-via-rename application patterns. Try reading the "Delayed
allocation and the zero-length file problem" article and comments by
Ted Ts'o for further discussion. [2]

Mike
--
[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5a3f23d515a2ebf0c750db80579ca57b28cbce6d
[2] http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
On Fri, Jan 7, 2011 at 2:55 PM, Mike Fleetwood
<mike.fleetwood@googlemail.com> wrote:
> On 6 January 2011 20:01, Olaf van der Spek <olafvdspek@gmail.com> wrote:
>> Hi,
>>
>> Does btrfs support atomic file data replaces?
>
> Hi Olaf,
>
> Yes, btrfs does support atomic replace, since kernel 2.6.30, circa June
> 2009. [1]
>
> Special handling was added to ext3, ext4, btrfs (and probably other
> Linux FSs) for your replace-via-truncate and the alternative
> replace-via-rename application patterns. Try reading the "Delayed
> allocation and the zero-length file problem" article and comments by
> Ted Ts'o for further discussion. [2]

According to Ted, via-truncate and via-rename are unsafe. Only fsync +
rename is safe.
The disadvantages of rename are that it resets the file owner (if the
process is non-root) and has issues with meta-data and other attributes.

My proposal was for an open flag, O_ATOMIC, to be introduced to tell
the FS that the whole file update should be done atomically.
Ted says this is too hard in ext4, so I was wondering if this would be
possible in btrfs.

Olaf
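The fsync-then-rename sequence Olaf refers to as the only safe pattern
can be sketched in C roughly as follows; the helper name and temp-file
naming scheme are illustrative, not part of any standard API:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Atomically replace the contents of `path` with `data`: write a temp
 * file in the same directory, fsync it so the data is on disk *before*
 * the rename, then rename over the target. A crash at any point leaves
 * either the complete old contents or the complete new contents. */
static int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path);

    int fd = mkstemp(tmp);        /* temp file in the same directory */
    if (fd < 0)
        return -1;

    /* The fsync here is the step Ted insists on: without it, delayed
     * allocation can let the rename hit disk before the data does,
     * leaving a zero-length file after a crash. */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* rename() atomically replaces the directory entry. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

Note the caveat Olaf raises: the renamed file carries the temp file's
owner and mode, so a careful implementation would also fstat() the
original and restore its metadata with fchown()/fchmod() before the
rename.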
On Fri, Jan 7, 2011 at 3:01 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote:
> According to Ted, via-truncate and via-rename are unsafe. Only fsync +
> rename is safe.
> The disadvantages of rename are that it resets the file owner (if the
> process is non-root) and has issues with meta-data and other attributes.
>
> My proposal was for an open flag, O_ATOMIC, to be introduced to tell
> the FS that the whole file update should be done atomically.
> Ted says this is too hard in ext4, so I was wondering if this would be
> possible in btrfs.

http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2082
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2089
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2090
Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
> Hi,
>
> Does btrfs support atomic file data replaces? Basically, the atomic
> variant of this:
> // old state
> open(O_TRUNC)
> write() // 0+ times
> close()
> // new state

Yes and no. We have a best-effort mechanism where we try to guess that
since you've done the truncate and the write, you want the writes to
show up quickly. But it's a guess.

The problem is the write() // 0+ times. The kernel has no idea what
new result you want the file to contain because the application isn't
telling us.

What btrfs can do (but we haven't yet implemented) is make sure that the
results of a single write call are on disk atomically, even if they are
replacing existing bytes in the file.

Because we cow and because we don't update metadata pointers until the
IO is complete, we can wait until all the IO for a given write call is
on disk before we update any of the metadata.

This isn't hard, it's on my TODO list.

-chris
On Fri, Jan 7, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote:
> Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
>> Hi,
>>
>> Does btrfs support atomic file data replaces? Basically, the atomic
>> variant of this:
>> // old state
>> open(O_TRUNC)
>> write() // 0+ times
>> close()
>> // new state
>
> Yes and no. We have a best-effort mechanism where we try to guess that
> since you've done the truncate and the write, you want the writes to
> show up quickly. But it's a guess.
>
> The problem is the write() // 0+ times. The kernel has no idea what
> new result you want the file to contain because the application isn't
> telling us.

Isn't it safe for the kernel to wait until the first write or close
before writing anything to disk?

> What btrfs can do (but we haven't yet implemented) is make sure that the
> results of a single write call are on disk atomically, even if they are
> replacing existing bytes in the file.
>
> Because we cow and because we don't update metadata pointers until the
> IO is complete, we can wait until all the IO for a given write call is
> on disk before we update any of the metadata.
>
> This isn't hard, it's on my TODO list.

What about a new flag, O_ATOMIC, that'd take the guesswork out of the
kernel?

Olaf
Excerpts from Olaf van der Spek's message of 2011-01-07 10:01:59 -0500:
> On Fri, Jan 7, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
>>> Hi,
>>>
>>> Does btrfs support atomic file data replaces? Basically, the atomic
>>> variant of this:
>>> // old state
>>> open(O_TRUNC)
>>> write() // 0+ times
>>> close()
>>> // new state
>>
>> Yes and no. We have a best-effort mechanism where we try to guess that
>> since you've done the truncate and the write, you want the writes to
>> show up quickly. But it's a guess.
>>
>> The problem is the write() // 0+ times. The kernel has no idea what
>> new result you want the file to contain because the application isn't
>> telling us.
>
> Isn't it safe for the kernel to wait until the first write or close
> before writing anything to disk?

I'm afraid not. Picture an application that opens a thousand files and
writes 1MB to each of them, and then doesn't close any. If we waited
until close, you'd have 1GB of memory pinned or staged somehow.

>> What btrfs can do (but we haven't yet implemented) is make sure that the
>> results of a single write call are on disk atomically, even if they are
>> replacing existing bytes in the file.
>>
>> Because we cow and because we don't update metadata pointers until the
>> IO is complete, we can wait until all the IO for a given write call is
>> on disk before we update any of the metadata.
>>
>> This isn't hard, it's on my TODO list.
>
> What about a new flag, O_ATOMIC, that'd take the guesswork out of the
> kernel?

We can't guess beyond a single write call. Otherwise we get into
the problem above where an application can force the kernel to wait
forever. I'm not against O_ATOMIC to enable the new btrfs
functionality, but it will still be limited to one write.
-chris
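Chris's "limited to one write" constraint means an application using the
proposed flag would have to submit its entire replacement payload in a
single write() call. A hedged sketch of what that usage might look like
(O_ATOMIC is only a proposal in this thread, so it is defined to 0 here
purely so the sketch compiles; it has no effect on current kernels):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef O_ATOMIC
#define O_ATOMIC 0   /* proposed flag, NOT a real kernel interface */
#endif

/* Replace a file's contents with one write() call, the only unit of
 * atomicity Chris says btrfs could guarantee. */
static int replace_in_one_write(const char *path, const char *data,
                                size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_ATOMIC, 0644);
    if (fd < 0)
        return -1;

    /* One write() covering the entire new contents; issuing it in
     * pieces would put us back in the "kernel has to guess" case. */
    ssize_t n = write(fd, data, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```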
On Fri, Jan 7, 2011 at 4:05 PM, Chris Mason <chris.mason@oracle.com> wrote:
>>> The problem is the write() // 0+ times. The kernel has no idea what
>>> new result you want the file to contain because the application isn't
>>> telling us.
>>
>> Isn't it safe for the kernel to wait until the first write or close
>> before writing anything to disk?
>
> I'm afraid not. Picture an application that opens a thousand files and
> writes 1MB to each of them, and then doesn't close any. If we waited
> until close, you'd have 1GB of memory pinned or staged somehow.

That's not what I asked. ;)
I asked to wait until the first write (or close). That way, you don't
get unintentional empty files.
One step further: you don't have to keep the data in memory, you're
free to write it to disk. You just wouldn't update the meta-data (yet).

>>> This isn't hard, it's on my TODO list.
>>
>> What about a new flag, O_ATOMIC, that'd take the guesswork out of the
>> kernel?
>
> We can't guess beyond a single write call. Otherwise we get into
> the problem above where an application can force the kernel to wait
> forever. I'm not against O_ATOMIC to enable the new btrfs
> functionality, but it will still be limited to one write.

-- 
Olaf
Excerpts from Olaf van der Spek's message of 2011-01-07 10:08:24 -0500:
> On Fri, Jan 7, 2011 at 4:05 PM, Chris Mason <chris.mason@oracle.com> wrote:
>>>> The problem is the write() // 0+ times. The kernel has no idea what
>>>> new result you want the file to contain because the application isn't
>>>> telling us.
>>>
>>> Isn't it safe for the kernel to wait until the first write or close
>>> before writing anything to disk?
>>
>> I'm afraid not. Picture an application that opens a thousand files and
>> writes 1MB to each of them, and then doesn't close any. If we waited
>> until close, you'd have 1GB of memory pinned or staged somehow.
>
> That's not what I asked. ;)
> I asked to wait until the first write (or close). That way, you don't
> get unintentional empty files.
> One step further: you don't have to keep the data in memory, you're
> free to write it to disk. You just wouldn't update the meta-data (yet).

Sorry ;) Picture an application that truncates 1024 files without
closing any of them. Basically any operation that includes the kernel
waiting for applications because they promise to do something soon is a
denial-of-service attack, or a really easy way to run out of memory on
the box.

-chris
On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> That's not what I asked. ;)
>> I asked to wait until the first write (or close). That way, you don't
>> get unintentional empty files.
>> One step further: you don't have to keep the data in memory, you're
>> free to write it to disk. You just wouldn't update the meta-data (yet).
>
> Sorry ;) Picture an application that truncates 1024 files without
> closing any of them. Basically any operation that includes the kernel
> waiting for applications because they promise to do something soon is a
> denial-of-service attack, or a really easy way to run out of memory on
> the box.

I'm not sure why you would run out of memory in that case.

O_ATOMIC would be the solution for the rename workaround (write temp
file, rename), with advantages like a much simpler API, no issues with
resetting meta-data, no issues with the temp file, and maybe better
performance.

Olaf
Excerpts from Olaf van der Spek's message of 2011-01-07 10:17:31 -0500:
> On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote:
>>> That's not what I asked. ;)
>>> I asked to wait until the first write (or close). That way, you don't
>>> get unintentional empty files.
>>> One step further: you don't have to keep the data in memory, you're
>>> free to write it to disk. You just wouldn't update the meta-data (yet).
>>
>> Sorry ;) Picture an application that truncates 1024 files without
>> closing any of them. Basically any operation that includes the kernel
>> waiting for applications because they promise to do something soon is a
>> denial-of-service attack, or a really easy way to run out of memory on
>> the box.
>
> I'm not sure why you would run out of memory in that case.

Well, let's make sure I've got a good handle on the proposed interface:

1) fd = open(some_file, O_ATOMIC)
2) truncate(fd, 0)
3) write(fd, new data)

The semantics are that we promise not to let the truncate hit the disk
until the application does the write.

We have a few choices on how we do this:

1) Leave the disk untouched, but keep something in memory that says this
inode is really truncated.

2) Record on disk that we've done our atomic truncate but it is still
pending. We'd need some way to remove or invalidate this record after a
crash.

3) Go ahead and do the operation but don't allow the transaction to
commit until the write is done.

Option #1: keep something in memory. Well, any time we have a
requirement to pin something in memory until userland decides to do a
write, we risk OOM.

Option #2: disk format change. Actually somewhat complex, because if we
haven't crashed, we need to be able to read the inode in again without
invalidating the record, but if we do crash, we have to invalidate the
record. Not impossible, but not trivial.

Option #3: pin the whole transaction. Depending on the FS this may be
impossible. Certain operations require us to commit the transaction to
reclaim space, and we cannot allow userland to put that on hold without
deadlocking.

What most people don't realize about the crash-safe filesystems is that
they don't have fine-grained transactions. There is one single
transaction for all the operations done. This is mostly because it is
less complex and much faster, but it also makes any 'pin the whole
transaction' type system unusable.

-chris
On Fri, Jan 7, 2011 at 5:12 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> I'm not sure why you would run out of memory in that case.
>
> Well, let's make sure I've got a good handle on the proposed interface:
>
> 1) fd = open(some_file, O_ATOMIC)

No, O_TRUNC should be used in open. Maybe it works with a separate
truncate too.

> 2) truncate(fd, 0)
> 3) write(fd, new data)
>
> The semantics are that we promise not to let the truncate hit the disk
> until the application does the write.
>
> We have a few choices on how we do this:
>
> 1) Leave the disk untouched, but keep something in memory that says this
> inode is really truncated.
>
> 2) Record on disk that we've done our atomic truncate but it is still
> pending. We'd need some way to remove or invalidate this record after a
> crash.
>
> 3) Go ahead and do the operation but don't allow the transaction to
> commit until the write is done.
>
> Option #1: keep something in memory. Well, any time we have a
> requirement to pin something in memory until userland decides to do a
> write, we risk OOM.

Since the file is open, you have to keep something in memory anyway,
right? Adding a bit (or bool) does not make a difference IMO.
Isn't this comparable to opening a temp file?

> Option #2: disk format change. Actually somewhat complex, because if we
> haven't crashed, we need to be able to read the inode in again without
> invalidating the record, but if we do crash, we have to invalidate the
> record. Not impossible, but not trivial.
>
> Option #3: pin the whole transaction. Depending on the FS this may be
> impossible. Certain operations require us to commit the transaction to
> reclaim space, and we cannot allow userland to put that on hold without
> deadlocking.

#1 is the only one that makes sense.

> What most people don't realize about the crash-safe filesystems is that
> they don't have fine-grained transactions. There is one single
> transaction for all the operations done. This is mostly because it is
> less complex and much faster, but it also makes any 'pin the whole
> transaction' type system unusable.

AFAIK the cost is mostly more complex code / runtime. The cost is not
disk performance.

-- 
Olaf
On Friday, January 07, 2011 17:12:11 Chris Mason wrote:
> Well, let's make sure I've got a good handle on the proposed interface:
>
> 1) fd = open(some_file, O_ATOMIC)
> 2) truncate(fd, 0)
> 3) write(fd, new data)
>
> The semantics are that we promise not to let the truncate hit the disk
> until the application does the write.
>
> We have a few choices on how we do this:
>
> 1) Leave the disk untouched, but keep something in memory that says this
> inode is really truncated.
>
> 2) Record on disk that we've done our atomic truncate but it is still
> pending. We'd need some way to remove or invalidate this record after a
> crash.
>
> 3) Go ahead and do the operation but don't allow the transaction to
> commit until the write is done.
>
> Option #1: keep something in memory. Well, any time we have a
> requirement to pin something in memory until userland decides to do a
> write, we risk OOM.

Userland already has a file descriptor allocated (which can fail anyway
because of OOM). I see no problem in increasing kernel memory usage by
4 bytes (if not less) just to note that the application wants to see the
file as truncated (1 bit) and that the next write has to be atomic
(2nd bit?).

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
Are you suggesting to do:

1) open with O_TRUNC, O_ATOMIC: returns an fd to a temporary file
2) the application writes to that fd, with one or more system calls, in
a short time or a long time, at its will
3) at close (or even at fsync), atomically swap the "data pointer" of
the "real file" with the "temp file", then delete the temp, all
transparently to userland (something similar to e4defrag)

Is this summary correct?

Massimo Maggi

On 07/01/2011 16:17, Olaf van der Spek wrote:
> On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote:
>>> That's not what I asked. ;)
>>> I asked to wait until the first write (or close). That way, you don't
>>> get unintentional empty files.
>>> One step further: you don't have to keep the data in memory, you're
>>> free to write it to disk. You just wouldn't update the meta-data (yet).
>>
>> Sorry ;) Picture an application that truncates 1024 files without
>> closing any of them. Basically any operation that includes the kernel
>> waiting for applications because they promise to do something soon is a
>> denial-of-service attack, or a really easy way to run out of memory on
>> the box.
>
> I'm not sure why you would run out of memory in that case.
>
> O_ATOMIC would be the solution for the rename workaround (write temp
> file, rename), with advantages like a much simpler API, no issues with
> resetting meta-data, no issues with the temp file, and maybe better
> performance.
>
> Olaf
On Fri, Jan 7, 2011 at 5:32 PM, Massimo Maggi <massimo@mmmm.it> wrote:
> Are you suggesting to do:
> 1) open with O_TRUNC, O_ATOMIC: returns an fd to a temporary file
> 2) the application writes to that fd, with one or more system calls, in
> a short time or a long time, at its will
> 3) at close (or even at fsync), atomically swap the "data pointer" of
> the "real file" with the "temp file", then delete the temp, all
> transparently to userland (something similar to e4defrag)
> Is this summary correct?

Almost. The swap should probably not be done at fsync time.
Other open references (for example, running executables) should be
swapped too. The new-file case has to be handled too.

Olaf
Olaf van der Spek wrote:
> On Fri, Jan 7, 2011 at 5:32 PM, Massimo Maggi <massimo@mmmm.it> wrote:
>> Are you suggesting to do:
>> 1) open with O_TRUNC, O_ATOMIC: returns an fd to a temporary file
>> 2) the application writes to that fd, with one or more system calls, in
>> a short time or a long time, at its will
>> 3) at close (or even at fsync), atomically swap the "data pointer" of
>> the "real file" with the "temp file", then delete the temp, all
>> transparently to userland (something similar to e4defrag)
>> Is this summary correct?
>
> Almost. The swap should probably not be done at fsync time.
> Other open references (for example, running executables) should be
> swapped too.

What is the visibility of the changes for other processes supposed
to be in the meantime? I.e., if things happen in this order:

1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC)
2. Process B does fdb = open("foo.txt", O_RDONLY)
3. B does read(fdb, buf, 4096)
4. A does write(fda, "NEW DATA\n", 9)
5. Process C comes in and does fdc = open("foo.txt", O_RDONLY)
6. C does read(fdc, buf, 4096)
7. A calls close(fda)

Does B see an empty file, or does it see the old contents of
the file? Does C see "NEW DATA\n", or does it see the old
contents of the file, or perhaps an empty file?

/Bellman
Excerpts from Hubert Kario's message of 2011-01-07 11:26:02 -0500:
> On Friday, January 07, 2011 17:12:11 Chris Mason wrote:
>> Well, let's make sure I've got a good handle on the proposed interface:
>>
>> 1) fd = open(some_file, O_ATOMIC)
>> 2) truncate(fd, 0)
>> 3) write(fd, new data)
>>
>> The semantics are that we promise not to let the truncate hit the disk
>> until the application does the write.
>>
>> We have a few choices on how we do this:
>>
>> 1) Leave the disk untouched, but keep something in memory that says this
>> inode is really truncated.
>>
>> 2) Record on disk that we've done our atomic truncate but it is still
>> pending. We'd need some way to remove or invalidate this record after a
>> crash.
>>
>> 3) Go ahead and do the operation but don't allow the transaction to
>> commit until the write is done.
>>
>> Option #1: keep something in memory. Well, any time we have a
>> requirement to pin something in memory until userland decides to do a
>> write, we risk OOM.
>
> Userland already has a file descriptor allocated (which can fail anyway
> because of OOM). I see no problem in increasing kernel memory usage by
> 4 bytes (if not less) just to note that the application wants to see the
> file as truncated (1 bit) and that the next write has to be atomic
> (2nd bit?).

The exact amount of tracking is going to vary. The reason why is that
actually doing the truncate is an O(size of the file) operation, so
you can't just flip a switch when the write or the close comes in. You
have to run through all the metadata of the file and do something
temporary with each part that is only completed when the file IO is
actually done.

Honestly, there are many different ways to solve this in the
application. Requiring high-speed atomic replacement of individual file
contents is a recipe for frustration.

-chris
On 01/07/2011 09:58 AM, Chris Mason wrote:
> Yes and no. We have a best-effort mechanism where we try to guess that
> since you've done the truncate and the write, you want the writes to
> show up quickly. But it's a guess.

It is a pretty good guess, and one that the NT kernel has been making
for 15 years or so. I've been following this issue for some time and I
still don't understand why Ted is so hostile to this and can't make it
work right on ext4.

When you get a rename(), you just need to check if there are outstanding
journal transactions and/or dirty cache pages, and hang the rename()
transaction on the end of those. That way, if the system crashes after
the new file has fully hit the disk, the old file is gone and you only
have the new one, but if it crashes before, you still have the old one
in place. Both the writes and the rename can be delayed in the cache to
an arbitrary point in the future; what matters is that their order is
preserved.
On Fri, Jan 7, 2011 at 8:29 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
> What is the visibility of the changes for other processes supposed
> to be in the meantime? I.e., if things happen in this order:

It should be atomic too, at close time.

> 1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC)
> 2. Process B does fdb = open("foo.txt", O_RDONLY)
> 3. B does read(fdb, buf, 4096)
> 4. A does write(fda, "NEW DATA\n", 9)
> 5. Process C comes in and does fdc = open("foo.txt", O_RDONLY)
> 6. C does read(fdc, buf, 4096)
> 7. A calls close(fda)
>
> Does B see an empty file, or does it see the old contents of
> the file?

The old file; otherwise A wouldn't be atomic.

> Does C see "NEW DATA\n", or does it see the old
> contents of the file, or perhaps an empty file?

The old file again, as the 'transaction' isn't finished until close.

-- 
Olaf
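These visibility rules can already be emulated with the rename-based
workaround, which gives similar reader isolation: a reader holding an fd
opened before the swap keeps seeing the old inode, and only opens after
the swap see the new contents. A sketch (all paths and contents here are
illustrative):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Fills b_sees with what reader B (opened before the swap) reads, and
 * c_sees with what reader C (opened after the swap) reads. */
static void visibility_demo(char *b_sees, char *c_sees, size_t bufsz)
{
    const char *path = "/tmp/atomic_vis_demo.txt";
    const char *tmp  = "/tmp/atomic_vis_demo.tmp";

    FILE *f = fopen(path, "w");      /* the old contents */
    fputs("OLD DATA", f);
    fclose(f);

    f = fopen(tmp, "w");             /* writer A's pending new contents */
    fputs("NEW DATA", f);
    fclose(f);

    int b = open(path, O_RDONLY);    /* B opens before the swap */

    rename(tmp, path);               /* A "closes": atomic swap */

    memset(b_sees, 0, bufsz);
    read(b, b_sees, bufsz - 1);      /* B still reads the old inode */
    close(b);

    int c = open(path, O_RDONLY);    /* C opens after the swap */
    memset(c_sees, 0, bufsz);
    read(c, c_sees, bufsz - 1);      /* C sees the new contents */
    close(c);
}
```

Note this matches Olaf's answer for B but not for C: with plain rename,
a C that opens after the swap already sees the new data, whereas in the
proposed O_ATOMIC scheme nothing becomes visible until A's close.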
On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote:
> The exact amount of tracking is going to vary. The reason why is that
> actually doing the truncate is an O(size of the file) operation, so
> you can't just flip a switch when the write or the close comes in. You
> have to run through all the metadata of the file and do something
> temporary with each part that is only completed when the file IO is
> actually done.

That's true. Maybe the proper way, via O_ATOMIC, is better.

> Honestly, there are many different ways to solve this in the
> application. Requiring high-speed atomic replacement of individual file
> contents is a recipe for frustration.

Did you see Massimo's message? That'd be the ideal way from an app
point of view. Not solving this properly in the FS moves the problem to
userspace, where it's even harder to solve and not as performant.
Replacing file data is a common operation that IMO the FS should
support in a safe way.

-- 
Olaf
Olaf van der Spek wrote:
> On Fri, Jan 7, 2011 at 8:29 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
>> What is the visibility of the changes for other processes supposed
>> to be in the meantime? I.e., if things happen in this order:
>
> It should be atomic too, at close time.
>
>> 1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC)
>> 2. Process B does fdb = open("foo.txt", O_RDONLY)
>> 3. B does read(fdb, buf, 4096)
>> 4. A does write(fda, "NEW DATA\n", 9)
>> 5. Process C comes in and does fdc = open("foo.txt", O_RDONLY)
>> 6. C does read(fdc, buf, 4096)
>> 7. A calls close(fda)
>>
>> Does B see an empty file, or does it see the old contents of
>> the file?
>
> The old file; otherwise A wouldn't be atomic.
>
>> Does C see "NEW DATA\n", or does it see the old
>> contents of the file, or perhaps an empty file?
>
> The old file again, as the 'transaction' isn't finished until close.

So, basically database transactions with an isolation level of
"committed read", for file operations. That's something I have
wanted for a long time, especially if I also get a rollback()
operation, but I have never heard of any Unix that implemented it.

A separate commit() operation would be better than conflating it
with close(). And as I said, we want a rollback() as well. And
a process that terminates without committing the transaction it
is performing should have the transaction automatically rolled
back.

I only have a very shallow knowledge of the internals of the
Linux kernel in regards to filesystems, but I suspect that this
could be implemented almost entirely within the VFS, without
needing to touch the actual filesystems, as long as you are
satisfied with a limited amount of transaction space (what fits
in RAM + swap).

I'm looking forward to your implementation. :-) Even though I
suspect that it would be a rather large undertaking to implement...
/Bellman
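The window in Thomas's A/B/C scenario is easy to observe with today's non-atomic replace-via-truncate. A minimal C sketch (the helper names and the file path are illustrative; O_ATOMIC does not exist, so this demonstrates only the current behavior an atomic interface would have to hide):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* The size a fresh reader opening `path` right now would observe. */
static off_t observed_size(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? st.st_size : -1;
}

/* The plain replace-via-truncate from the start of the thread,
 * instrumented: between open(O_TRUNC) and the write, a concurrent
 * reader sees an empty file.  *mid receives that in-window size. */
static off_t replace_via_trunc(const char *path, const char *data,
                               size_t len, off_t *mid)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    *mid = observed_size(path);      /* 0: old contents already gone */
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    close(fd);
    return observed_size(path);      /* size after the final write */
}
```

A reader (or a crash) that lands between the open and the final write sees the zero-length file; that empty-file state is exactly what processes B and C in the scenario above must never observe under the proposed semantics.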
On Sat, Jan 8, 2011 at 10:43 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
> So, basically database transactions with an isolation level of
> "committed read", for file operations.  That's something I have
> wanted for a long time, especially if I also get a rollback()
> operation, but have never heard of any Unix that implemented it.

True, that's why this feature request is here.
Note that it's (ATM) only about single-file data replace.

> A separate commit() operation would be better than conflating it
> with close().  And as I said, we want a rollback() as well.  And
> a process that terminates without committing the transaction that
> it is performing, should have the transaction automatically rolled
> back.

What could you do between commit and close?

> I only have a very shallow knowledge about the internals of the
> Linux kernel in regards to filesystems, but I suspect that this
> could be implemented almost entirely within the VFS, and not need
> to touch the actual filesystems, as long as you are satisfied
> with a limited amount of transaction space (what fits in RAM +
> swap).
>
> I'm looking forward to your implementation. :-)  Even though I
> suspect that it would be a rather large undertaking to implement...

I have no plans to work on an implementation.

--
Olaf
Olaf van der Spek wrote:
> On Sat, Jan 8, 2011 at 10:43 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
>> So, basically database transactions with an isolation level of
>> "committed read", for file operations.  That's something I have
>> wanted for a long time, especially if I also get a rollback()
>> operation, but have never heard of any Unix that implemented it.
>
> True, that's why this feature request is here.
> Note that it's (ATM) only about single file data replace.

That particular problem was solved with the introduction of the
rename(2) system call in 4.2BSD, a bit more than a quarter of a
century ago.  There is no need to introduce another, less flexible,
API for doing the same thing.

>> A separate commit() operation would be better than conflating it
>> with close().  And as I said, we want a rollback() as well.  And
>> a process that terminates without committing the transaction that
>> it is performing, should have the transaction automatically rolled
>> back.
>
> What could you do between commit and close?

More write() operations, of course.  Just like you can continue with
more transactions after a COMMIT WORK statement in SQL without having
to close and re-open the database.

/Bellman
On Sun, Jan 9, 2011 at 7:56 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
>> True, that's why this feature request is here.
>> Note that it's (ATM) only about single file data replace.
>
> That particular problem was solved with the introduction of the
> rename(2) system call in 4.2BSD a bit more than a quarter of a
> century ago.  There is no need to introduce another, less flexible,
> API for doing the same thing.

You might want to read about the problems with that workaround.

>> What could you do between commit and close?
>
> More write() operations, of course.  Just like you can continue
> with more transactions after a COMMIT WORK call without having
> to close and re-open the database in SQL.

The transaction is defined as beginning with open and ending with close.

--
Olaf
On 01/09/2011 01:56 PM, Thomas Bellman wrote:
> That particular problem was solved with the introduction of the
> rename(2) system call in 4.2BSD a bit more than a quarter of a
> century ago.  There is no need to introduce another, less flexible,
> API for doing the same thing.

I'm curious whether there are any BSD specifications stating that
rename() has this behavior.  Ted Ts'o has been claiming that POSIX
does not require this behavior in the face of a crash, and that as a
result an application that relies on such behavior is broken and
needs to fsync() before rename().  This, of course, makes replacing
numerous files much slower, glacially so on btrfs.  There has been a
great deal of discussion on the dpkg mailing lists about it, since
plenty of people are upset that dpkg runs much slower these days than
it used to, because it now calls fsync() before rename() in order to
avoid breakage on ext4.

You can read more, including the rationale for why POSIX does not
require this behavior, at http://lwn.net/Articles/323607/.

I still say that preserving the order of the writes and the rename is
the only sane thing to do, whether POSIX requires it or not.
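The fsync-before-rename pattern under discussion can be wrapped in a small C helper. A sketch (the function name is mine, not from any library; note that, as raised earlier in the thread, this does not preserve the original file's owner or mode):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Replace path's contents via a temp file in the same directory:
 * write, fsync (so the new data is durable before the rename),
 * close, then rename over the target.  rename(2) within one
 * filesystem atomically replaces the destination name, so readers
 * -- and any post-crash state -- see either the complete old file
 * or the complete new one, never a truncated mix. */
static int replace_file_contents(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    if (snprintf(tmp, sizeof tmp, "%s.XXXXXX", path) >= (int)sizeof tmp)
        return -1;
    int fd = mkstemp(tmp);               /* unique temp name, same dir */
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

mkstemp() creates the temp file with mode 0600, so a production version would also fstat() the original and fchmod()/fchown() the temp file before the rename; that metadata juggling is precisely the per-application boilerplate being objected to in this thread.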
On Sat, Jan 8, 2011 at 3:40 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote:
> On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> The exact amount of tracking is going to vary.  The reason why is that
>> actually doing the truncate is an O(size of the file) operation and so
>> you can't just flip a switch when the write or the close comes in.  You
>> have to run through all the metadata of the file and do something
>> temporary with each part that is only completed when the file IO is
>> actually done.
>
> That's true. Maybe the proper way, via O_ATOMIC, is better.
>
>> Honestly, there are many different ways to solve this in the application.
>> Requiring high speed atomic replacement of individual file contents is a
>> recipe for frustration.
>
> Did you see Massimo's message? That'd be the ideal way from an app
> point of view.
> Not solving this properly in the FS moves the problem to userspace,
> where it's even harder to solve and is not as performant.
>
> Replacing file data is a common operation that IMO the FS should
> support in a safe way.

Chris?

--
Olaf
Excerpts from Olaf van der Spek's message of 2011-01-26 13:30:08 -0500:
> On Sat, Jan 8, 2011 at 3:40 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote:
>> On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote:
>>> The exact amount of tracking is going to vary.  The reason why is that
>>> actually doing the truncate is an O(size of the file) operation and so
>>> you can't just flip a switch when the write or the close comes in.  You
>>> have to run through all the metadata of the file and do something
>>> temporary with each part that is only completed when the file IO is
>>> actually done.
>>
>> That's true. Maybe the proper way, via O_ATOMIC, is better.
>>
>>> Honestly, there are many different ways to solve this in the application.
>>> Requiring high speed atomic replacement of individual file contents is a
>>> recipe for frustration.
>>
>> Did you see Massimo's message? That'd be the ideal way from an app
>> point of view.
>> Not solving this properly in the FS moves the problem to userspace,
>> where it's even harder to solve and is not as performant.
>>
>> Replacing file data is a common operation that IMO the FS should
>> support in a safe way.
>
> Chris?

My answer hasn't really changed ;)  Replacing file data is a common
operation, but it is still surprisingly complex.  Again, the truncate
is O(size of the file) and it is actually impossible to do this
atomically in most filesystems.

You don't notice this because xfs/ext3/ext4/btrfs (and many others)
have code that makes sure a truncate is restarted if you crash.  So
it appears to be atomic even though we're really just restarting the
operation.  In order to have a truncate + replacement-of-data
operation, we'd have to make a disk format change that records both
the truncate and the new data.

It would look a lot like echo data > file.new ; truncate file ; mv
file.new file, but recorded in the FS metadata.

I don't have this in the btrfs roadmap.  It would be nice, but most
people use databases for things that require atomic operations.  I
think what ext4 and btrfs do today falls into the category of best
effort and least surprise, and I think it is as good as we can get
without huge performance penalties for normal use.

Now, if you want to talk about atomic replacement of file data
without changing the file size, that's much easier.  At least it's
easier for those of us with cows in our pockets.

-chris
On Wed, Jan 26, 2011 at 8:30 PM, Chris Mason <chris.mason@oracle.com> wrote:
> My answer hasn't really changed ;)  Replacing file data is a common
> operation, but it is still surprisingly complex.  Again, the truncate is
> O(size of the file) and it is actually impossible to do this atomically
> in most filesystems.

Unfortunately, life isn't trivial. ;)
Given that it's common, it doesn't make sense to duplicate code in
lots of apps to implement the temp-file rename pattern. If it's too
complex to implement in the FS (ATM), would it be possible to
implement it in a higher layer?

> You don't notice this because xfs/ext3/ext4/btrfs (and many others)
> have code that makes sure a truncate is restarted if you crash.  So
> it appears to be atomic even though we're really just restarting the
> operation.  In order to have a truncate + replacement-of-data
> operation, we'd have to make a disk format change that records both
> the truncate and the new data.

I'm not sure why the disk format would have to change. Conceptually,
just like in the temp file case, you'd write the new data to newly
allocated blocks. After they're safely on disk (and I guess that's
the complex part), you update the metadata in an atomic way.

> It would look a lot like echo data > file.new ; truncate file ; mv
> file.new file, but recorded in the FS metadata.
>
> I don't have this in the btrfs roadmap.  It would be nice, but most
> people use databases for things that require atomic operations.

Executables and files shouldn't be in a DB.

Olaf