Hi,

Does btrfs support atomic file data replaces? Basically, the atomic
variant of this:

  // old state
  open(O_TRUNC)
  write()  // 0+ times
  close()
  // new state

-- 
Olaf
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 6 January 2011 20:01, Olaf van der Spek <olafvdspek@gmail.com> wrote:
> Hi,
>
> Does btrfs support atomic file data replaces?

Hi Olaf,

Yes, btrfs does support atomic replace, since kernel 2.6.30, circa June
2009. [1]

Special handling was added to ext3, ext4, btrfs (and probably other
Linux FSs) for your replace-via-truncate and the alternative
replace-via-rename application patterns. Try reading the "Delayed
allocation and the zero-length file problem" article and comments by
Ted Ts'o for further discussion. [2]

Mike
--
[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5a3f23d515a2ebf0c750db80579ca57b28cbce6d
[2] http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
On Fri, Jan 7, 2011 at 2:55 PM, Mike Fleetwood
<mike.fleetwood@googlemail.com> wrote:
> On 6 January 2011 20:01, Olaf van der Spek <olafvdspek@gmail.com> wrote:
>> Hi,
>>
>> Does btrfs support atomic file data replaces?
>
> Hi Olaf,
>
> Yes, btrfs does support atomic replace, since kernel 2.6.30, circa June
> 2009. [1]
>
> Special handling was added to ext3, ext4, btrfs (and probably other
> Linux FSs) for your replace-via-truncate and the alternative
> replace-via-rename application patterns. Try reading the "Delayed
> allocation and the zero-length file problem" article and comments by
> Ted Ts'o for further discussion. [2]

According to Ted, via-truncate and via-rename are unsafe. Only fsync +
rename is safe.
The disadvantages of rename are that it resets the file owner (if the
process is non-root) and has issues with meta-data and other attributes.

My proposal was for an open flag, O_ATOMIC, to be introduced to tell
the FS that the whole file update should be done atomically.
Ted says this is too hard in ext4, so I was wondering if this would be
possible in btrfs.

Olaf
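The fsync-then-rename sequence Olaf refers to as the only safe pattern
can be sketched in C roughly as follows; the helper name and temp-file
naming scheme are illustrative, not part of any standard API:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Atomically replace the contents of `path` with `data`: write a temp
 * file in the same directory, fsync it so the data is on disk *before*
 * the rename, then rename over the target. A crash at any point leaves
 * either the complete old contents or the complete new contents. */
static int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path);

    int fd = mkstemp(tmp);        /* temp file in the same directory */
    if (fd < 0)
        return -1;

    /* The fsync here is the step Ted insists on: without it, delayed
     * allocation can let the rename hit disk before the data does,
     * leaving a zero-length file after a crash. */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* rename() atomically replaces the directory entry. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

Note the caveat Olaf raises: the renamed file carries the temp file's
owner and mode, so a careful implementation would also fstat() the
original and restore its metadata with fchown()/fchmod() before the
rename.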
On Fri, Jan 7, 2011 at 3:01 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote:
> According to Ted, via-truncate and via-rename are unsafe. Only fsync +
> rename is safe.
> The disadvantages of rename are that it resets the file owner (if the
> process is non-root) and has issues with meta-data and other attributes.
>
> My proposal was for an open flag, O_ATOMIC, to be introduced to tell
> the FS that the whole file update should be done atomically.
> Ted says this is too hard in ext4, so I was wondering if this would be
> possible in btrfs.

http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2082
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2089
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2090
Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
> Hi,
>
> Does btrfs support atomic file data replaces? Basically, the atomic
> variant of this:
> // old state
> open(O_TRUNC)
> write() // 0+ times
> close()
> // new state

Yes and no. We have a best-effort mechanism where we try to guess that
since you've done the truncate and the write, you want the writes to
show up quickly. But it's a guess.

The problem is the write() // 0+ times. The kernel has no idea what
new result you want the file to contain because the application isn't
telling us.

What btrfs can do (but we haven't yet implemented) is make sure that the
results of a single write call are on disk atomically, even if they are
replacing existing bytes in the file.

Because we cow and because we don't update metadata pointers until the
IO is complete, we can wait until all the IO for a given write call is
on disk before we update any of the metadata.

This isn't hard, it's on my TODO list.

-chris
On Fri, Jan 7, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote:
> Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
>> Hi,
>>
>> Does btrfs support atomic file data replaces? Basically, the atomic
>> variant of this:
>> // old state
>> open(O_TRUNC)
>> write() // 0+ times
>> close()
>> // new state
>
> Yes and no. We have a best-effort mechanism where we try to guess that
> since you've done the truncate and the write, you want the writes to
> show up quickly. But it's a guess.
>
> The problem is the write() // 0+ times. The kernel has no idea what
> new result you want the file to contain because the application isn't
> telling us.

Isn't it safe for the kernel to wait until the first write or close
before writing anything to disk?

> What btrfs can do (but we haven't yet implemented) is make sure that the
> results of a single write call are on disk atomically, even if they are
> replacing existing bytes in the file.
>
> Because we cow and because we don't update metadata pointers until the
> IO is complete, we can wait until all the IO for a given write call is
> on disk before we update any of the metadata.
>
> This isn't hard, it's on my TODO list.

What about a new flag, O_ATOMIC, that'd take the guesswork out of the
kernel?

Olaf
Excerpts from Olaf van der Spek's message of 2011-01-07 10:01:59 -0500:
> On Fri, Jan 7, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
>>> Hi,
>>>
>>> Does btrfs support atomic file data replaces? Basically, the atomic
>>> variant of this:
>>> // old state
>>> open(O_TRUNC)
>>> write() // 0+ times
>>> close()
>>> // new state
>>
>> Yes and no. We have a best-effort mechanism where we try to guess that
>> since you've done the truncate and the write, you want the writes to
>> show up quickly. But it's a guess.
>>
>> The problem is the write() // 0+ times. The kernel has no idea what
>> new result you want the file to contain because the application isn't
>> telling us.
>
> Isn't it safe for the kernel to wait until the first write or close
> before writing anything to disk?

I'm afraid not. Picture an application that opens a thousand files and
writes 1MB to each of them, and then doesn't close any. If we waited
until close, you'd have 1GB of memory pinned or staged somehow.

>> What btrfs can do (but we haven't yet implemented) is make sure that the
>> results of a single write call are on disk atomically, even if they are
>> replacing existing bytes in the file.
>>
>> Because we cow and because we don't update metadata pointers until the
>> IO is complete, we can wait until all the IO for a given write call is
>> on disk before we update any of the metadata.
>>
>> This isn't hard, it's on my TODO list.
>
> What about a new flag, O_ATOMIC, that'd take the guesswork out of the
> kernel?

We can't guess beyond a single write call. Otherwise we get into
the problem above where an application can force the kernel to wait
forever. I'm not against O_ATOMIC to enable the new btrfs
functionality, but it will still be limited to one write.
-chris
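Chris's "limited to one write" constraint means an application using the
proposed flag would have to submit its entire replacement payload in a
single write() call. A hedged sketch of what that usage might look like
(O_ATOMIC is only a proposal in this thread, so it is defined to 0 here
purely so the sketch compiles; it has no effect on current kernels):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef O_ATOMIC
#define O_ATOMIC 0   /* proposed flag, NOT a real kernel interface */
#endif

/* Replace a file's contents with one write() call, the only unit of
 * atomicity Chris says btrfs could guarantee. */
static int replace_in_one_write(const char *path, const char *data,
                                size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_ATOMIC, 0644);
    if (fd < 0)
        return -1;

    /* One write() covering the entire new contents; issuing it in
     * pieces would put us back in the "kernel has to guess" case. */
    ssize_t n = write(fd, data, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```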
On Fri, Jan 7, 2011 at 4:05 PM, Chris Mason <chris.mason@oracle.com> wrote:
>>> The problem is the write() // 0+ times. The kernel has no idea what
>>> new result you want the file to contain because the application isn't
>>> telling us.
>>
>> Isn't it safe for the kernel to wait until the first write or close
>> before writing anything to disk?
>
> I'm afraid not. Picture an application that opens a thousand files and
> writes 1MB to each of them, and then doesn't close any. If we waited
> until close, you'd have 1GB of memory pinned or staged somehow.

That's not what I asked. ;)
I asked to wait until the first write (or close). That way, you don't
get unintentional empty files.
One step further: you don't have to keep the data in memory, you're
free to write it to disk. You just wouldn't update the meta-data (yet).

>>> This isn't hard, it's on my TODO list.
>>
>> What about a new flag, O_ATOMIC, that'd take the guesswork out of the
>> kernel?
>
> We can't guess beyond a single write call. Otherwise we get into
> the problem above where an application can force the kernel to wait
> forever. I'm not against O_ATOMIC to enable the new btrfs
> functionality, but it will still be limited to one write.

-- 
Olaf
Excerpts from Olaf van der Spek's message of 2011-01-07 10:08:24 -0500:
> On Fri, Jan 7, 2011 at 4:05 PM, Chris Mason <chris.mason@oracle.com> wrote:
>>>> The problem is the write() // 0+ times. The kernel has no idea what
>>>> new result you want the file to contain because the application isn't
>>>> telling us.
>>>
>>> Isn't it safe for the kernel to wait until the first write or close
>>> before writing anything to disk?
>>
>> I'm afraid not. Picture an application that opens a thousand files and
>> writes 1MB to each of them, and then doesn't close any. If we waited
>> until close, you'd have 1GB of memory pinned or staged somehow.
>
> That's not what I asked. ;)
> I asked to wait until the first write (or close). That way, you don't
> get unintentional empty files.
> One step further: you don't have to keep the data in memory, you're
> free to write it to disk. You just wouldn't update the meta-data (yet).

Sorry ;) Picture an application that truncates 1024 files without
closing any of them. Basically any operation that includes the kernel
waiting for applications because they promise to do something soon is a
denial-of-service attack, or a really easy way to run out of memory on
the box.

-chris
On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> That's not what I asked. ;)
>> I asked to wait until the first write (or close). That way, you don't
>> get unintentional empty files.
>> One step further: you don't have to keep the data in memory, you're
>> free to write it to disk. You just wouldn't update the meta-data (yet).
>
> Sorry ;) Picture an application that truncates 1024 files without
> closing any of them. Basically any operation that includes the kernel
> waiting for applications because they promise to do something soon is a
> denial-of-service attack, or a really easy way to run out of memory on
> the box.

I'm not sure why you would run out of memory in that case.

O_ATOMIC would be the solution for the rename workaround (write temp
file, rename), with advantages like a much simpler API, no issues with
resetting meta-data, no issues with the temp file, and maybe better
performance.

Olaf
Excerpts from Olaf van der Spek's message of 2011-01-07 10:17:31 -0500:
> On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote:
>>> That's not what I asked. ;)
>>> I asked to wait until the first write (or close). That way, you don't
>>> get unintentional empty files.
>>> One step further: you don't have to keep the data in memory, you're
>>> free to write it to disk. You just wouldn't update the meta-data (yet).
>>
>> Sorry ;) Picture an application that truncates 1024 files without
>> closing any of them. Basically any operation that includes the kernel
>> waiting for applications because they promise to do something soon is a
>> denial-of-service attack, or a really easy way to run out of memory on
>> the box.
>
> I'm not sure why you would run out of memory in that case.

Well, let's make sure I've got a good handle on the proposed interface:

1) fd = open(some_file, O_ATOMIC)
2) truncate(fd, 0)
3) write(fd, new data)

The semantics are that we promise not to let the truncate hit the disk
until the application does the write.

We have a few choices on how we do this:

1) Leave the disk untouched, but keep something in memory that says this
inode is really truncated.

2) Record on disk that we've done our atomic truncate but it is still
pending. We'd need some way to remove or invalidate this record after a
crash.

3) Go ahead and do the operation but don't allow the transaction to
commit until the write is done.

Option #1: keep something in memory. Well, any time we have a
requirement to pin something in memory until userland decides to do a
write, we risk OOM.

Option #2: disk format change. Actually somewhat complex, because if we
haven't crashed, we need to be able to read the inode in again without
invalidating the record, but if we do crash, we have to invalidate the
record. Not impossible, but not trivial.

Option #3: pin the whole transaction. Depending on the FS this may be
impossible. Certain operations require us to commit the transaction to
reclaim space, and we cannot allow userland to put that on hold without
deadlocking.

What most people don't realize about the crash-safe filesystems is that
they don't have fine-grained transactions. There is one single
transaction for all the operations done. This is mostly because it is
less complex and much faster, but it also makes any 'pin the whole
transaction' type system unusable.

-chris
On Fri, Jan 7, 2011 at 5:12 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> I'm not sure why you would run out of memory in that case.
>
> Well, let's make sure I've got a good handle on the proposed interface:
>
> 1) fd = open(some_file, O_ATOMIC)

No, O_TRUNC should be used in open. Maybe it works with a separate
truncate too.

> 2) truncate(fd, 0)
> 3) write(fd, new data)
>
> The semantics are that we promise not to let the truncate hit the disk
> until the application does the write.
>
> We have a few choices on how we do this:
>
> 1) Leave the disk untouched, but keep something in memory that says this
> inode is really truncated.
>
> 2) Record on disk that we've done our atomic truncate but it is still
> pending. We'd need some way to remove or invalidate this record after a
> crash.
>
> 3) Go ahead and do the operation but don't allow the transaction to
> commit until the write is done.
>
> Option #1: keep something in memory. Well, any time we have a
> requirement to pin something in memory until userland decides to do a
> write, we risk OOM.

Since the file is open, you have to keep something in memory anyway,
right? Adding a bit (or bool) does not make a difference IMO.
Isn't this comparable to opening a temp file?

> Option #2: disk format change. Actually somewhat complex, because if we
> haven't crashed, we need to be able to read the inode in again without
> invalidating the record, but if we do crash, we have to invalidate the
> record. Not impossible, but not trivial.
>
> Option #3: pin the whole transaction. Depending on the FS this may be
> impossible. Certain operations require us to commit the transaction to
> reclaim space, and we cannot allow userland to put that on hold without
> deadlocking.

#1 is the only one that makes sense.

> What most people don't realize about the crash-safe filesystems is that
> they don't have fine-grained transactions. There is one single
> transaction for all the operations done. This is mostly because it is
> less complex and much faster, but it also makes any 'pin the whole
> transaction' type system unusable.

AFAIK the cost is mostly more complex code / runtime. The cost is not
disk performance.

-- 
Olaf
On Friday, January 07, 2011 17:12:11 Chris Mason wrote:
> Well, let's make sure I've got a good handle on the proposed interface:
>
> 1) fd = open(some_file, O_ATOMIC)
> 2) truncate(fd, 0)
> 3) write(fd, new data)
>
> The semantics are that we promise not to let the truncate hit the disk
> until the application does the write.
>
> We have a few choices on how we do this:
>
> 1) Leave the disk untouched, but keep something in memory that says this
> inode is really truncated.
>
> 2) Record on disk that we've done our atomic truncate but it is still
> pending. We'd need some way to remove or invalidate this record after a
> crash.
>
> 3) Go ahead and do the operation but don't allow the transaction to
> commit until the write is done.
>
> Option #1: keep something in memory. Well, any time we have a
> requirement to pin something in memory until userland decides to do a
> write, we risk OOM.

Userland already has a file descriptor allocated (which can fail anyway
because of OOM). I see no problem in increasing kernel memory usage by
4 bytes (if not less) just to note that the application wants to see the
file as truncated (1 bit) and that the next write has to be atomic
(2nd bit?).

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
Are you suggesting to do:

1) open with O_TRUNC, O_ATOMIC: returns an fd to a temporary file
2) the application writes to that fd, with one or more system calls, in
a short time or a long time, at its will
3) at close (or even at fsync), atomically swap the "data pointer" of
the "real file" with the "temp file", then delete the temp, all
transparently to userland (something similar to e4defrag)

Is this summary correct?

Massimo Maggi

On 07/01/2011 16:17, Olaf van der Spek wrote:
> On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote:
>>> That's not what I asked. ;)
>>> I asked to wait until the first write (or close). That way, you don't
>>> get unintentional empty files.
>>> One step further: you don't have to keep the data in memory, you're
>>> free to write it to disk. You just wouldn't update the meta-data (yet).
>>
>> Sorry ;) Picture an application that truncates 1024 files without
>> closing any of them. Basically any operation that includes the kernel
>> waiting for applications because they promise to do something soon is a
>> denial-of-service attack, or a really easy way to run out of memory on
>> the box.
>
> I'm not sure why you would run out of memory in that case.
>
> O_ATOMIC would be the solution for the rename workaround (write temp
> file, rename), with advantages like a much simpler API, no issues with
> resetting meta-data, no issues with the temp file, and maybe better
> performance.
>
> Olaf
On Fri, Jan 7, 2011 at 5:32 PM, Massimo Maggi <massimo@mmmm.it> wrote:
> Are you suggesting to do:
> 1) open with O_TRUNC, O_ATOMIC: returns an fd to a temporary file
> 2) the application writes to that fd, with one or more system calls, in
> a short time or a long time, at its will
> 3) at close (or even at fsync), atomically swap the "data pointer" of
> the "real file" with the "temp file", then delete the temp, all
> transparently to userland (something similar to e4defrag)
> Is this summary correct?

Almost. The swap should probably not be done at fsync time.
Other open references (for example, running executables) should be
swapped too. The new-file case has to be handled too.

Olaf
Olaf van der Spek wrote:
> On Fri, Jan 7, 2011 at 5:32 PM, Massimo Maggi <massimo@mmmm.it> wrote:
>> Are you suggesting to do:
>> 1) open with O_TRUNC, O_ATOMIC: returns an fd to a temporary file
>> 2) the application writes to that fd, with one or more system calls, in
>> a short time or a long time, at its will
>> 3) at close (or even at fsync), atomically swap the "data pointer" of
>> the "real file" with the "temp file", then delete the temp, all
>> transparently to userland (something similar to e4defrag)
>> Is this summary correct?
>
> Almost. The swap should probably not be done at fsync time.
> Other open references (for example, running executables) should be
> swapped too.

What is the visibility of the changes for other processes supposed
to be in the meantime? I.e., if things happen in this order:

1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC)
2. Process B does fdb = open("foo.txt", O_RDONLY)
3. B does read(fdb, buf, 4096)
4. A does write(fda, "NEW DATA\n", 9)
5. Process C comes in and does fdc = open("foo.txt", O_RDONLY)
6. C does read(fdc, buf, 4096)
7. A calls close(fda)

Does B see an empty file, or does it see the old contents of
the file? Does C see "NEW DATA\n", or does it see the old
contents of the file, or perhaps an empty file?

/Bellman
Excerpts from Hubert Kario's message of 2011-01-07 11:26:02 -0500:
> On Friday, January 07, 2011 17:12:11 Chris Mason wrote:
>> Well, let's make sure I've got a good handle on the proposed interface:
>>
>> 1) fd = open(some_file, O_ATOMIC)
>> 2) truncate(fd, 0)
>> 3) write(fd, new data)
>>
>> The semantics are that we promise not to let the truncate hit the disk
>> until the application does the write.
>>
>> We have a few choices on how we do this:
>>
>> 1) Leave the disk untouched, but keep something in memory that says this
>> inode is really truncated.
>>
>> 2) Record on disk that we've done our atomic truncate but it is still
>> pending. We'd need some way to remove or invalidate this record after a
>> crash.
>>
>> 3) Go ahead and do the operation but don't allow the transaction to
>> commit until the write is done.
>>
>> Option #1: keep something in memory. Well, any time we have a
>> requirement to pin something in memory until userland decides to do a
>> write, we risk OOM.
>
> Userland already has a file descriptor allocated (which can fail anyway
> because of OOM). I see no problem in increasing kernel memory usage by
> 4 bytes (if not less) just to note that the application wants to see the
> file as truncated (1 bit) and that the next write has to be atomic
> (2nd bit?).

The exact amount of tracking is going to vary. The reason why is that
actually doing the truncate is an O(size of the file) operation, so
you can't just flip a switch when the write or the close comes in. You
have to run through all the metadata of the file and do something
temporary with each part that is only completed when the file IO is
actually done.

Honestly, there are many different ways to solve this in the
application. Requiring high-speed atomic replacement of individual file
contents is a recipe for frustration.

-chris
On 01/07/2011 09:58 AM, Chris Mason wrote:
> Yes and no. We have a best-effort mechanism where we try to guess that
> since you've done the truncate and the write, you want the writes to
> show up quickly. But it's a guess.

It is a pretty good guess, and one that the NT kernel has been making
for 15 years or so. I've been following this issue for some time and I
still don't understand why Ted is so hostile to this and can't make it
work right on ext4.

When you get a rename(), you just need to check if there are outstanding
journal transactions and/or dirty cache pages, and hang the rename()
transaction on the end of those. That way, if the system crashes after
the new file has fully hit the disk, the old file is gone and you only
have the new one, but if it crashes before, you still have the old one
in place. Both the writes and the rename can be delayed in the cache to
an arbitrary point in the future; what matters is that their order is
preserved.
On Fri, Jan 7, 2011 at 8:29 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
> What is the visibility of the changes for other processes supposed
> to be in the meantime? I.e., if things happen in this order:

It should be atomic too, at close time.

> 1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC)
> 2. Process B does fdb = open("foo.txt", O_RDONLY)
> 3. B does read(fdb, buf, 4096)
> 4. A does write(fda, "NEW DATA\n", 9)
> 5. Process C comes in and does fdc = open("foo.txt", O_RDONLY)
> 6. C does read(fdc, buf, 4096)
> 7. A calls close(fda)
>
> Does B see an empty file, or does it see the old contents of
> the file?

The old file; otherwise A wouldn't be atomic.

> Does C see "NEW DATA\n", or does it see the old
> contents of the file, or perhaps an empty file?

The old file again, as the 'transaction' isn't finished until close.

-- 
Olaf
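These visibility rules can already be emulated with the rename-based
workaround, which gives similar reader isolation: a reader holding an fd
opened before the swap keeps seeing the old inode, and only opens after
the swap see the new contents. A sketch (all paths and contents here are
illustrative):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Fills b_sees with what reader B (opened before the swap) reads, and
 * c_sees with what reader C (opened after the swap) reads. */
static void visibility_demo(char *b_sees, char *c_sees, size_t bufsz)
{
    const char *path = "/tmp/atomic_vis_demo.txt";
    const char *tmp  = "/tmp/atomic_vis_demo.tmp";

    FILE *f = fopen(path, "w");      /* the old contents */
    fputs("OLD DATA", f);
    fclose(f);

    f = fopen(tmp, "w");             /* writer A's pending new contents */
    fputs("NEW DATA", f);
    fclose(f);

    int b = open(path, O_RDONLY);    /* B opens before the swap */

    rename(tmp, path);               /* A "closes": atomic swap */

    memset(b_sees, 0, bufsz);
    read(b, b_sees, bufsz - 1);      /* B still reads the old inode */
    close(b);

    int c = open(path, O_RDONLY);    /* C opens after the swap */
    memset(c_sees, 0, bufsz);
    read(c, c_sees, bufsz - 1);      /* C sees the new contents */
    close(c);
}
```

Note this matches Olaf's answer for B but not for C: with plain rename,
a C that opens after the swap already sees the new data, whereas in the
proposed O_ATOMIC scheme nothing becomes visible until A's close.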
On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote:
> The exact amount of tracking is going to vary. The reason why is that
> actually doing the truncate is an O(size of the file) operation, so
> you can't just flip a switch when the write or the close comes in. You
> have to run through all the metadata of the file and do something
> temporary with each part that is only completed when the file IO is
> actually done.

That's true. Maybe the proper way, via O_ATOMIC, is better.

> Honestly, there are many different ways to solve this in the
> application. Requiring high-speed atomic replacement of individual file
> contents is a recipe for frustration.

Did you see Massimo's message? That'd be the ideal way from an app
point of view. Not solving this properly in the FS moves the problem to
userspace, where it's even harder to solve and not as performant.
Replacing file data is a common operation that IMO the FS should
support in a safe way.

-- 
Olaf
Olaf van der Spek wrote:
> On Fri, Jan 7, 2011 at 8:29 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
>> What is the visibility of the changes for other processes supposed
>> to be in the meantime? I.e., if things happen in this order:
>
> It should be atomic too, at close time.
>
>> 1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC)
>> 2. Process B does fdb = open("foo.txt", O_RDONLY)
>> 3. B does read(fdb, buf, 4096)
>> 4. A does write(fda, "NEW DATA\n", 9)
>> 5. Process C comes in and does fdc = open("foo.txt", O_RDONLY)
>> 6. C does read(fdc, buf, 4096)
>> 7. A calls close(fda)
>>
>> Does B see an empty file, or does it see the old contents of
>> the file?
>
> The old file; otherwise A wouldn't be atomic.
>
>> Does C see "NEW DATA\n", or does it see the old
>> contents of the file, or perhaps an empty file?
>
> The old file again, as the 'transaction' isn't finished until close.

So, basically database transactions with an isolation level of
"committed read", for file operations. That's something I have
wanted for a long time, especially if I also get a rollback()
operation, but I have never heard of any Unix that implemented it.

A separate commit() operation would be better than conflating it
with close(). And as I said, we want a rollback() as well. And
a process that terminates without committing the transaction it
is performing should have the transaction automatically rolled
back.

I only have a very shallow knowledge of the internals of the
Linux kernel in regards to filesystems, but I suspect that this
could be implemented almost entirely within the VFS, without
needing to touch the actual filesystems, as long as you are
satisfied with a limited amount of transaction space (what fits
in RAM + swap).

I'm looking forward to your implementation. :-) Even though I
suspect that it would be a rather large undertaking to implement...
/Bellman
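The window in Thomas's A/B/C scenario is easy to observe with today's non-atomic replace-via-truncate. A minimal C sketch (the helper names and the file path are illustrative; O_ATOMIC does not exist, so this demonstrates only the current behavior an atomic interface would have to hide):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* The size a fresh reader opening `path` right now would observe. */
static off_t observed_size(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? st.st_size : -1;
}

/* The plain replace-via-truncate from the start of the thread,
 * instrumented: between open(O_TRUNC) and the write, a concurrent
 * reader sees an empty file.  *mid receives that in-window size. */
static off_t replace_via_trunc(const char *path, const char *data,
                               size_t len, off_t *mid)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    *mid = observed_size(path);      /* 0: old contents already gone */
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    close(fd);
    return observed_size(path);      /* size after the final write */
}
```

A reader (or a crash) that lands between the open and the final write sees the zero-length file; that empty-file state is exactly what processes B and C in the scenario above must never observe under the proposed semantics.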
On Sat, Jan 8, 2011 at 10:43 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
> So, basically database transactions with an isolation level of
> "committed read", for file operations.  That's something I have
> wanted for a long time, especially if I also get a rollback()
> operation, but have never heard of any Unix that implemented it.

True, that's why this feature request is here.
Note that it's (ATM) only about single-file data replace.

> A separate commit() operation would be better than conflating it
> with close().  And as I said, we want a rollback() as well.  And
> a process that terminates without committing the transaction that
> it is performing, should have the transaction automatically rolled
> back.

What could you do between commit and close?

> I only have a very shallow knowledge about the internals of the
> Linux kernel in regards to filesystems, but I suspect that this
> could be implemented almost entirely within the VFS, and not need
> to touch the actual filesystems, as long as you are satisfied
> with a limited amount of transaction space (what fits in RAM +
> swap).
>
> I'm looking forward to your implementation. :-)  Even though I
> suspect that it would be a rather large undertaking to implement...

I have no plans to work on an implementation.

--
Olaf
Olaf van der Spek wrote:
> On Sat, Jan 8, 2011 at 10:43 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
>> So, basically database transactions with an isolation level of
>> "committed read", for file operations.  That's something I have
>> wanted for a long time, especially if I also get a rollback()
>> operation, but have never heard of any Unix that implemented it.
>
> True, that's why this feature request is here.
> Note that it's (ATM) only about single file data replace.

That particular problem was solved with the introduction of the
rename(2) system call in 4.2BSD, a bit more than a quarter of a
century ago.  There is no need to introduce another, less flexible,
API for doing the same thing.

>> A separate commit() operation would be better than conflating it
>> with close().  And as I said, we want a rollback() as well.  And
>> a process that terminates without committing the transaction that
>> it is performing, should have the transaction automatically rolled
>> back.
>
> What could you do between commit and close?

More write() operations, of course.  Just like you can continue with
more transactions after a COMMIT WORK statement in SQL without having
to close and re-open the database.

/Bellman
On Sun, Jan 9, 2011 at 7:56 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
>> True, that's why this feature request is here.
>> Note that it's (ATM) only about single file data replace.
>
> That particular problem was solved with the introduction of the
> rename(2) system call in 4.2BSD a bit more than a quarter of a
> century ago.  There is no need to introduce another, less flexible,
> API for doing the same thing.

You might want to read about the problems with that workaround.

>> What could you do between commit and close?
>
> More write() operations, of course.  Just like you can continue
> with more transactions after a COMMIT WORK call without having
> to close and re-open the database in SQL.

The transaction is defined as beginning with open and ending with close.

--
Olaf
On 01/09/2011 01:56 PM, Thomas Bellman wrote:
> That particular problem was solved with the introduction of the
> rename(2) system call in 4.2BSD a bit more than a quarter of a
> century ago.  There is no need to introduce another, less flexible,
> API for doing the same thing.

I'm curious whether there are any BSD specifications stating that
rename() has this behavior.  Ted Ts'o has been claiming that POSIX
does not require this behavior in the face of a crash, and that as a
result an application that relies on such behavior is broken and
needs to fsync() before rename().  This, of course, makes replacing
numerous files much slower, glacially so on btrfs.  There has been a
great deal of discussion on the dpkg mailing lists about it, since
plenty of people are upset that dpkg runs much slower these days than
it used to, because it now calls fsync() before rename() in order to
avoid breakage on ext4.

You can read more, including the rationale for why POSIX does not
require this behavior, at http://lwn.net/Articles/323607/.

I still say that preserving the order of the writes and the rename is
the only sane thing to do, whether POSIX requires it or not.
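The fsync-before-rename pattern under discussion can be wrapped in a small C helper. A sketch (the function name is mine, not from any library; note that, as raised earlier in the thread, this does not preserve the original file's owner or mode):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Replace path's contents via a temp file in the same directory:
 * write, fsync (so the new data is durable before the rename),
 * close, then rename over the target.  rename(2) within one
 * filesystem atomically replaces the destination name, so readers
 * -- and any post-crash state -- see either the complete old file
 * or the complete new one, never a truncated mix. */
static int replace_file_contents(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    if (snprintf(tmp, sizeof tmp, "%s.XXXXXX", path) >= (int)sizeof tmp)
        return -1;
    int fd = mkstemp(tmp);               /* unique temp name, same dir */
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

mkstemp() creates the temp file with mode 0600, so a production version would also fstat() the original and fchmod()/fchown() the temp file before the rename; that metadata juggling is precisely the per-application boilerplate being objected to in this thread.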
On Sat, Jan 8, 2011 at 3:40 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote:
> On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> The exact amount of tracking is going to vary.  The reason why is that
>> actually doing the truncate is an O(size of the file) operation and so
>> you can't just flip a switch when the write or the close comes in.  You
>> have to run through all the metadata of the file and do something
>> temporary with each part that is only completed when the file IO is
>> actually done.
>
> That's true. Maybe the proper way, via O_ATOMIC, is better.
>
>> Honestly, there are many different ways to solve this in the application.
>> Requiring high speed atomic replacement of individual file contents is a
>> recipe for frustration.
>
> Did you see Massimo's message? That'd be the ideal way from an app
> point of view.
> Not solving this properly in the FS moves the problem to userspace,
> where it's even harder to solve and is not as performant.
>
> Replacing file data is a common operation that IMO the FS should
> support in a safe way.

Chris?

--
Olaf
Excerpts from Olaf van der Spek's message of 2011-01-26 13:30:08 -0500:
> On Sat, Jan 8, 2011 at 3:40 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote:
>> On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote:
>>> The exact amount of tracking is going to vary.  The reason why is that
>>> actually doing the truncate is an O(size of the file) operation and so
>>> you can't just flip a switch when the write or the close comes in.  You
>>> have to run through all the metadata of the file and do something
>>> temporary with each part that is only completed when the file IO is
>>> actually done.
>>
>> That's true. Maybe the proper way, via O_ATOMIC, is better.
>>
>>> Honestly, there are many different ways to solve this in the application.
>>> Requiring high speed atomic replacement of individual file contents is a
>>> recipe for frustration.
>>
>> Did you see Massimo's message? That'd be the ideal way from an app
>> point of view.
>> Not solving this properly in the FS moves the problem to userspace,
>> where it's even harder to solve and is not as performant.
>>
>> Replacing file data is a common operation that IMO the FS should
>> support in a safe way.
>
> Chris?

My answer hasn't really changed ;)  Replacing file data is a common
operation, but it is still surprisingly complex.  Again, the truncate
is O(size of the file) and it is actually impossible to do this
atomically in most filesystems.

You don't notice this because xfs/ext3/ext4/btrfs (and many others)
have code that makes sure a truncate is restarted if you crash.  So
it appears to be atomic even though we're really just restarting the
operation.  In order to have a truncate + replacement-of-data
operation, we'd have to make a disk format change that records both
the truncate and the new data.

It would look a lot like echo data > file.new ; truncate file ; mv
file.new file, but recorded in the FS metadata.

I don't have this in the btrfs roadmap.  It would be nice, but most
people use databases for things that require atomic operations.  I
think what ext4 and btrfs do today falls into the category of best
effort and least surprise, and I think it is as good as we can get
without huge performance penalties for normal use.

Now, if you want to talk about atomic replacement of file data
without changing the file size, that's much easier.  At least it's
easier for those of us with cows in our pockets.

-chris
On Wed, Jan 26, 2011 at 8:30 PM, Chris Mason <chris.mason@oracle.com> wrote:
> My answer hasn't really changed ;)  Replacing file data is a common
> operation, but it is still surprisingly complex.  Again, the truncate is
> O(size of the file) and it is actually impossible to do this atomically
> in most filesystems.

Unfortunately, life isn't trivial. ;)
Given that it's common, it doesn't make sense to duplicate code in
lots of apps to implement the temp-file rename pattern. If it's too
complex to implement in the FS (ATM), would it be possible to
implement it in a higher layer?

> You don't notice this because xfs/ext3/ext4/btrfs (and many others)
> have code that makes sure a truncate is restarted if you crash.  So
> it appears to be atomic even though we're really just restarting the
> operation.  In order to have a truncate + replacement-of-data
> operation, we'd have to make a disk format change that records both
> the truncate and the new data.

I'm not sure why the disk format would have to change. Conceptually,
just like in the temp file case, you'd write the new data to newly
allocated blocks. After they're safely on disk (and I guess that's
the complex part), you update the metadata in an atomic way.

> It would look a lot like echo data > file.new ; truncate file ; mv
> file.new file, but recorded in the FS metadata.
>
> I don't have this in the btrfs roadmap.  It would be nice, but most
> people use databases for things that require atomic operations.

Executables and files shouldn't be in a DB.

Olaf