thr3ads.net - Btrfs devel - user transactions and ENOSPC... [Sep 2009]

Sage Weil

2009-Sep-25 21:10 UTC

user transactions and ENOSPC...

Hi everyone,

So, the btrfs user transaction ioctls work like so

 ioctl(fd, BTRFS_IOC_TRANS_START);
 /* do many operations: write(), setxattr(), rmdir(), whatever. */
 ioctl(fd, BTRFS_IOC_TRANS_END);    /* or close(fd); */

and allow an application to ensure some number of operations commit to
disk together.  Ceph's storage daemon uses this to avoid the overhead of
maintaining a write-ahead journal for complex updates.  I can see this
being useful for lots of other services too, since it can avoid all kinds
of (often slow) atomicity games.
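
For illustration, here is a minimal sketch of a wrapper around the two
ioctls.  The ioctl numbers are the ones I recall from the historical btrfs
ioctl header and should be checked against your own headers; the names
run_user_transaction() and txn_body_fn are made up for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Assumed historical values from the btrfs ioctl header; verify locally. */
#ifndef BTRFS_IOCTL_MAGIC
#define BTRFS_IOCTL_MAGIC 0x94
#endif
#ifndef BTRFS_IOC_TRANS_START
#define BTRFS_IOC_TRANS_START _IO(BTRFS_IOCTL_MAGIC, 6)
#endif
#ifndef BTRFS_IOC_TRANS_END
#define BTRFS_IOC_TRANS_END _IO(BTRFS_IOCTL_MAGIC, 7)
#endif

/* Hypothetical callback: performs the batched operations, 0 on success. */
typedef int (*txn_body_fn)(int fd, void *arg);

/*
 * Run body() between TRANS_START and TRANS_END on fd.
 * Returns 0 on success, -1 with errno set on failure.
 * A failing body() is NOT rolled back; END only closes the handle.
 */
int run_user_transaction(int fd, txn_body_fn body, void *arg)
{
    if (ioctl(fd, BTRFS_IOC_TRANS_START) < 0)
        return -1;

    int rc = body(fd, arg);
    int saved = errno;          /* keep body()'s errno across the END ioctl */

    if (ioctl(fd, BTRFS_IOC_TRANS_END) < 0)
        return -1;
    if (rc != 0) {
        errno = saved;
        return -1;
    }
    return 0;
}
```

The caller needs CAP_SYS_ADMIN and an fd on a btrfs mount; anything body()
has already done stays done even if it fails partway through.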

But there are two problems with the user transaction ioctls as 
implemented...

The first is that we may get ENOSPC somewhere between START and END
without any prior warning.  The patch below is intended to fix that by
adding a new reservation category used only by a new TRANS_RESV_START
ioctl.  It'll allow an application to specify the total amount of data
it wants to write when the transaction starts, and get ENOSPC right
away before it starts making changes.
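
The argument layout of the new ioctl is defined by the patch and is not
repeated here.  As a rough userspace stand-in for the "fail early" idea
only, an application could ask the file system how much space it reports
before starting.  This is a sketch under the assumption that fstatvfs()
numbers are good enough for a hint; it is advisory, racy, and not a
reservation, and fs_has_room() is a made-up name.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Returns 1 if the file system backing fd currently reports at least
 * `needed` bytes available to unprivileged users, 0 if not, -1 on error.
 * Advisory only: nothing stops another writer from using the space.
 */
int fs_has_room(int fd, uint64_t needed)
{
    struct statvfs sv;

    if (fstatvfs(fd, &sv) < 0)
        return -1;

    /* f_bavail is counted in units of f_frsize. */
    uint64_t avail = (uint64_t)sv.f_bavail * (uint64_t)sv.f_frsize;
    return avail >= needed ? 1 : 0;
}
```

A real reservation has to live in the kernel, which is what the patch is
for; the check above only moves the most obvious failures to the front.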

This isn't a perfect solution: a mix of a transaction workload and a
regular workload will violate the reservations, and we can't really fix
that without knowing whether any given write() or whatever belongs to a
user transaction or not.

The second problem is that the application may die between START and
END. The current ioctls are "safe" in that the transaction handle is
closed when the struct file is released, so the fs won't get wedged if
you, say, segfault.  On the other hand, they're "unsafe" in that a process
that is killed or segfaults will result in an incomplete transaction
making it to disk, which leaves the file system in an inconsistent state
(from the point of view of the application).

One possibility is to (optionally) disable that safety mechanism with a 
mount option, so that the file system will wedge if the process dies.  
That's probably better than nothing.  A cautious app may prefer a wedged
system to a partial transaction reaching disk. (Remember these ioctls
are already dangerous and require CAP_SYS_ADMIN.  A process can 
similarly wedge the fs by simply holding a transaction open.)


An alternative approach would be to describe the full contents of the 
user transaction, and submit the entire thing to btrfs at once using a 
single ioctl().  This makes for an awkward data structure to define the 
whole thing, but it would allow us to determine which operations belong 
to a user transaction and reserve/account for free space accordingly.  
It would also solve the problem of committing partial user transactions, 
since we could run the full transaction to completion even if the 
process gets a SIGKILL or segfaults or something.
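
To make the "awkward data structure" a bit more concrete, here is one
possible shape for it.  Every name below (utx_op, utx_desc,
utx_payload_bytes) is hypothetical and only meant to show how a complete
description lets the space requirement be computed before anything runs.

```c
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; no such ABI is defined in this thread. */
enum utx_op_type { UTX_WRITE, UTX_SETXATTR, UTX_RMDIR };

struct utx_op {
    enum utx_op_type type;
    const char *path;   /* target file or directory */
    const char *xattr;  /* attribute name (UTX_SETXATTR only) */
    const void *data;   /* payload for write or setxattr */
    uint64_t offset;    /* file offset (UTX_WRITE only) */
    uint64_t len;       /* payload length in bytes */
};

struct utx_desc {
    size_t nr_ops;
    const struct utx_op *ops;
};

/* Sum the payload bytes so space could be reserved before any op runs. */
uint64_t utx_payload_bytes(const struct utx_desc *d)
{
    uint64_t total = 0;

    for (size_t i = 0; i < d->nr_ops; i++) {
        if (d->ops[i].type == UTX_WRITE || d->ops[i].type == UTX_SETXATTR)
            total += d->ops[i].len;
    }
    return total;
}
```

Metadata overhead is ignored here; a real implementation would have to
account for btree updates as well as payload bytes.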

I had some rough patches for this a while back that just called into vfs_* 
methods, but they ran up against those methods not being exported to 
modules.  If exporting those is not a deal breaker, then I can use 
filp->private_data to mark operations contained by the transaction so that 
the reservation accounting works, while still taking advantage of the 
generic vfs_* code.

It would also potentially let us make these non-privileged operations, 
since submitting the transaction as a unit would avoid the current 
situation where a misbehaving process can hold a transaction open 
and wedge the system.

I'm partial to the latter approach, but it'd be nice to have some
confidence that it won't be shot down out of hand on principle ("modules
should call vfs_*", etc.) before spending too much time on it...

In the meantime, patches to add reservations to the current ioctl 
approach follow.  Any feedback on how these might be improved is 
welcome, too!

Thanks-
sage

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Daniel J Blueman

2009-Sep-26 14:08 UTC

Re: user transactions and ENOSPC...

On Fri, Sep 25, 2009 at 10:10 PM, Sage Weil <sage@newdream.net> wrote:
> So, the btrfs user transaction ioctls work like so
>
>  ioctl(fd, BTRFS_IOC_TRANS_START);
>  /* do many operations: write(), setxattr(), rmdir(), whatever. */
>  ioctl(fd, BTRFS_IOC_TRANS_END);    /* or close(fd); */
...
> The second problem is that the application may die between START and
> END. The current ioctls are "safe" in that the transaction handle is
> closed when the struct file is released, so the fs won't get wedged if
> you, say, segfault.  On the other hand, they're "unsafe" in that a process
> that is killed or segfaults will result in an incomplete transaction
> making it to disk, which leaves the file system in an inconsistent state
> (from the point of view of the application).

With COW, where a transaction is incomplete due to application exit
without closing the transaction, is there a way to drop the
reference/'deallocate' the new tree nodes, thus moving back to the
prior state? Presumably the new tree nodes would get linked in when
the transaction is closed.
-- 
Daniel J Blueman

Sage Weil

2009-Sep-28 16:05 UTC

Re: user transactions and ENOSPC...

On Sat, 26 Sep 2009, Daniel J Blueman wrote:
> On Fri, Sep 25, 2009 at 10:10 PM, Sage Weil <sage@newdream.net> wrote:
> > So, the btrfs user transaction ioctls work like so
> >
> >  ioctl(fd, BTRFS_IOC_TRANS_START);
> >  /* do many operations: write(), setxattr(), rmdir(), whatever. */
> >  ioctl(fd, BTRFS_IOC_TRANS_END);    /* or close(fd); */
> ...
> > The second problem is that the application may die between START and
> > END. The current ioctls are "safe" in that the transaction handle is
> > closed when the struct file is released, so the fs won't get wedged if
> > you, say, segfault.  On the other hand, they're "unsafe" in that a
> > process that is killed or segfaults will result in an incomplete
> > transaction making it to disk, which leaves the file system in an
> > inconsistent state (from the point of view of the application).
> 
> With COW, where a transaction is incomplete due to application exit
> without closing the transaction, is there a way to drop the
> reference/'deallocate' the new tree nodes, thus moving back to the
> prior state? Presumably the new tree nodes would get linked in when
> the transaction is closed.

Not quite.  These are not full ACID transactions... they're only giving you
atomicity and durability.  Adding the rollback would mean a huge addition 
of complexity, both in rolling back the btrfs and VFS inode/dentry/page 
cache states.  The current ioctls are quite simple because they hook into 
the existing btrfs transaction infrastructure that lets multiple btree 
updates (and, optionally, file data) commit to disk together.

For many applications, atomicity is enough, because they aren't sharing
the files/directories they're working on with other applications.  They
know whether an update should succeed.  We just need to ensure common 
failures like ENOSPC and crashes do not violate that atomicity.

sage

Valerie Aurora

2009-Oct-07 21:58 UTC

Re: user transactions and ENOSPC...

On Fri, Sep 25, 2009 at 02:10:14PM -0700, Sage Weil wrote:
> Hi everyone,
> 
> So, the btrfs user transaction ioctls work like so
> 
>  ioctl(fd, BTRFS_IOC_TRANS_START);
>  /* do many operations: write(), setxattr(), rmdir(), whatever. */
>  ioctl(fd, BTRFS_IOC_TRANS_END);    /* or close(fd); */
> 
> and allow an application to ensure some number of operations commit to 
> disk together.  Ceph's storage daemon uses this to avoid the overhead of
> maintaining a write-ahead journal for complex updates.  I can see this 
> being useful for lots of other services too, since it can avoid all kinds 
> of (often slow) atomicity games.
> 
> But there are two problems with the user transaction ioctls as 
> implemented...
> The first is that we may get ENOSPC somewhere between START and END
> without any prior warning.  The patch below is intended to fix that by
> adding a new reservation category used only by a new TRANS_RESV_START
> ioctl.  It'll allow an application to specify the total amount of data
> it wants to write when the transaction starts, and get ENOSPC right
> away before it starts making changes.
> 
> This isn't a perfect solution: a mix of a transaction workload and a
> regular workload will violate the reservations, and we can't really fix
> that without knowing whether any given write() or whatever belongs to a
> user transaction or not.
> 
> The second problem is that the application may die between START and 
> END. The current ioctls are "safe" in that the transaction handle is
> closed when the struct file is released, so the fs won't get wedged if
> you, say, segfault.  On the other hand, they're "unsafe" in that a process
> that is killed or segfaults will result in an incomplete transaction 
> making it to disk, which leaves the file system in an inconsistent state 
> (from the point of view of the application).

This is a pet peeve of mine - exporting file system transactions to
user space usually has these problems.

I would be quite interested in seeing the Featherstitch-style
patchgroups implemented on btrfs.  Do you think the ordering
guarantees they give would work for Ceph's storage daemon?

http://featherstitch.cs.ucla.edu/
http://lwn.net/Articles/354861/

-VAL

Sage Weil

2009-Oct-08 04:07 UTC

Re: user transactions and ENOSPC...

Hi Val!

On Wed, 7 Oct 2009, Valerie Aurora wrote:
> On Fri, Sep 25, 2009 at 02:10:14PM -0700, Sage Weil wrote:
> > Hi everyone,
> > 
> > So, the btrfs user transaction ioctls work like so
> > 
> >  ioctl(fd, BTRFS_IOC_TRANS_START);
> >  /* do many operations: write(), setxattr(), rmdir(), whatever. */
> >  ioctl(fd, BTRFS_IOC_TRANS_END);    /* or close(fd); */
> > 
> > and allow an application to ensure some number of operations commit to
> > disk together.  Ceph's storage daemon uses this to avoid the overhead of
> > maintaining a write-ahead journal for complex updates.  I can see this
> > being useful for lots of other services too, since it can avoid all kinds
> > of (often slow) atomicity games.
> > 
> > But there are two problems with the user transaction ioctls as 
> > implemented...
> > The first is that we may get ENOSPC somewhere between START and END
> > without any prior warning.  The patch below is intended to fix that by
> > adding a new reservation category used only by a new TRANS_RESV_START
> > ioctl.  It'll allow an application to specify the total amount of data
> > it wants to write when the transaction starts, and get ENOSPC right
> > away before it starts making changes.
> > 
> > This isn't a perfect solution: a mix of a transaction workload and a
> > regular workload will violate the reservations, and we can't really fix
> > that without knowing whether any given write() or whatever belongs to a
> > user transaction or not.
> > 
> > The second problem is that the application may die between START and 
> > END. The current ioctls are "safe" in that the transaction handle is
> > closed when the struct file is released, so the fs won't get wedged if
> > you, say, segfault.  On the other hand, they're "unsafe" in that a
> > process that is killed or segfaults will result in an incomplete
> > transaction making it to disk, which leaves the file system in an
> > inconsistent state (from the point of view of the application).
> 
> This is a pet peeve of mine - exporting file system transactions to
> user space usually has these problems.
> 
> I would be quite interested in seeing the Featherstitch-style
> patchgroups implemented on btrfs.  Do you think the ordering
> > guarantees they give would work for Ceph's storage daemon?
> 
> http://featherstitch.cs.ucla.edu/
> http://lwn.net/Articles/354861/

It sounds to me like the patchgroups give you a slick way to describe 
how you want operations ordered, but don't give you a general way to 
atomically commit multiple operations. At the end of the day, I think 
atomicity is much simpler to provide, and all that the Ceph storage daemon 
needs.  The typical update pattern is:

 - write some (fragment of a) file
 - update the file's xattr with a new version #
 - write a log entry

The logs are there to let nodes quickly resynchronize any changes when 
they fail/restart. 
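
As a sketch of that pattern in plain syscalls (not the actual Ceph code):
the xattr name "user.version", the function name apply_update() and the
log format are all assumptions made for illustration.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/*
 * Apply one update: data fragment, version xattr, log entry.
 * Returns 0 on success, -1 on the first failure or short write.
 * Nothing here is atomic on its own; that is what the surrounding
 * user transaction is supposed to provide.
 */
int apply_update(int data_fd, int log_fd,
                 const void *buf, size_t len, off_t off,
                 uint64_t version,
                 const void *log_entry, size_t log_len)
{
    /* 1. write some (fragment of a) file */
    if (pwrite(data_fd, buf, len, off) != (ssize_t)len)
        return -1;

    /* 2. update the file's xattr with a new version number */
    if (fsetxattr(data_fd, "user.version", &version, sizeof(version), 0) < 0)
        return -1;

    /* 3. write a log entry (log_fd is assumed to be opened O_APPEND) */
    if (write(log_fd, log_entry, log_len) != (ssize_t)log_len)
        return -1;

    return 0;
}
```

If the process dies between steps 1 and 3 outside a transaction, the data,
the version and the log can disagree, which is the failure being discussed.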

This _could_ be accomplished with ordering, if e.g. the log entry is 
forced to commit before the data update, and if the data is written twice 
(i.e. data=journal).

Or if there is an efficient way to swap bytes into a file (say from the 
journal into the file).  The clone range ioctl can actually do this, but 
requires that the data is first flushed to disk, and invalidates the page 
cache in the process, and that's not good for read/write workloads.

Or if we limit ourselves to an atomic pwrite + xattr update (on the same 
file), we could order an intent log record, and then the actual write, and 
it could detect which writes committed during recovery.
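
A rough sketch of what that recovery check might look like, assuming the
version number only ever increases and that the pwrite and the xattr
update really do commit together.  The intent_rec layout and the
intent_committed() name are hypothetical.

```c
#include <stdint.h>

/* Hypothetical intent record, written and ordered before the data write. */
struct intent_rec {
    uint64_t offset;       /* where the pwrite will land */
    uint64_t len;          /* how many bytes it covers */
    uint64_t new_version;  /* version the xattr will hold afterwards */
};

/*
 * Recovery-time check: the write is treated as committed if the version
 * stored on the file has reached the version named in the intent record.
 * Returns 1 if committed, 0 if the write never made it to disk.
 */
int intent_committed(uint64_t file_version, const struct intent_rec *rec)
{
    return file_version >= rec->new_version;
}
```

During recovery, each intent record whose check returns 0 describes a write
that was logged but never reached the file.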

I'm not sure it's an improvement over the current (proposed) approach to
user transactions, though.  Handing the kernel a description of the entire
transaction should eliminate the usual problems with userspace
transactions you're referring to.

(And I'm a bit lazy; the ceph storage daemon was built on a transaction
primitive, and it's used throughout in other convenient but not
necessarily necessary ways.  Originally it was all done using a
userspace file system and O_DIRECT, like any other database with
transactions, but implementing yet another COW file system is exactly
what I'm trying to avoid.  :)

But, I'm certainly open to other ideas!  I think both user transactions
and patchgroups would be generally useful tools for applications...

sage
