Tobias Oberstein wrote:

> I'm looking for efficient producer-consumer queues with large (100s MB)
> persistent queues, 10-100 distributed producers, producing some hundred
> kB up to a couple of MBs for each insert.
>
> For this reason I'm considering files in a clustered FS, where multiple
> clients concurrently issue write-append operations, each of which must
> execute atomically and with minimal synchronization overhead. Admittedly,
> such semantics are kind of special, but very useful in certain scenarios.
>
> Does Lustre provide such semantics? I've read some bits in lustre.pdf
> and from my understanding, it doesn't.

Certainly Lustre will do what you want in a consistent manner, but there
will probably be significant locking overhead if all clients are using
O_APPEND.

We have discussed internally a couple of implementation alternatives for
fast O_APPEND. It is certainly possible to add optimizations for that
case to Lustre, although there is no customer yet to drive this work.

> Where is this lock provided? On the MDS or on the OSDs? Would it be
> possible to provide, or optionally switch on, the semantics described?
> By, e.g., telling the MDS that certain files are "special" PC queues?
> In any case, I don't care about up-to-the-millisecond correct file sizes
> when the files are used as PC queues.

File data locks are taken on the OSTs where the data resides. You don't
care about stat() having the correct file size, I agree, but you do
depend on each client knowing how large the file is when it goes to
append the next bytes, if you're using O_APPEND.

> (btw, the lustre.pdf on the website seems to be damaged .. Acrobat Reader
> bails out. I had to grab a probably outdated version elsewhere:
> 04/28/2003, HEAD)

The book should be fixed on the web now.

Thanks--

-Phil
Hi,

> > For this reason I'm considering files in a clustered FS, where multiple
> > clients concurrently issue write-append operations, each of which must
> > execute atomically and with minimal synchronization overhead. Admittedly,
> > such semantics are kind of special, but very useful in certain scenarios.
> >
> > Does Lustre provide such semantics? I've read some bits in lustre.pdf
> > and from my understanding, it doesn't.
>
> Certainly Lustre will do what you want in a consistent manner, but there
> will probably be significant locking overhead if all clients are using
> O_APPEND.

Would you mind correcting my understanding of the details in the outlined
situation?

Let's say 10 processes on different machines open the same file with
"fd = open(file, O_APPEND | O_WRONLY)". At this point, no lock has been
issued yet.

Then 1 out of the 10 processes issues a "write(fd, foo, sizeof(foo))"
where "sizeof(foo)" is large, e.g. 100 MB. At this point, an EOF lock is
taken on the OST where the data resides. An EOF lock is an exclusive
lock.

[Sideline: If data is replicated, are locks taken on each OST holding a
replica? If so, how does Lustre avoid deadlocks, since the locks in this
case are not centrally managed on the MDS but distributed across all
participating OSTs? Is there a primary replica which orchestrates the
acquisition of all the locks?]

Now, while the first process is still writing its 100 MB payload to the
file, a second process decides to issue a "write()" with its own payload
of another 100 MB on the file. That call/process is now blocked, since it
can't get an EOF lock on the file. In other words, the lock is held for
the duration of the complete write operation. This results in full
serialization of the 10 appending processes.

> We have discussed internally a couple of implementation alternatives for
> fast O_APPEND. It is certainly possible to add optimizations for that
> case to Lustre, although there is no customer yet to drive this work.

I've got pretty much zero know-how on FS implementation. Would it be
naive to do this: when a client has opened a file with O_APPEND and
issues a write operation, the OST first writes the incoming data to
"unnamed" data blocks on disk. Only when all the data has been written
does the OST acquire a lock on the file to chain the new data blocks into
the file's inode. The time span during which a lock is needed is much
shorter: only the final chaining-in of new data blocks must be
serialized. That way, multiple writers could do long, voluminous
write-appends to the same file.

> > Where is this lock provided? On the MDS or on the OSDs? Would it be
> > possible to provide, or optionally switch on, the semantics described?
> > By, e.g., telling the MDS that certain files are "special" PC queues?
> > In any case, I don't care about up-to-the-millisecond correct file sizes
> > when the files are used as PC queues.
>
> File data locks are taken on the OSTs where the data resides. You don't
> care about stat() having the correct file size, I agree, but you do
> depend on each client knowing how large the file is when it goes to
> append the next bytes, if you're using O_APPEND.

Why does the client need to know the file size to append to the file? My
naive view: appending to an array requires knowledge of the current array
length (and a lock on the current "end of array"). Appending to a linked
list of arrays does not require knowledge of the current sum of the
lengths of all linked arrays. Are file systems not implemented like the
latter?

Thanks for clarifying,
Tobias
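For concreteness, here is a minimal POSIX sketch of the access pattern
described above. The mount point, payload size, and error handling are
illustrative assumptions rather than anything from the thread, and nothing
in the code is Lustre-specific:

    /* Minimal sketch of the scenario above: each producer process opens
     * the shared file with O_APPEND and issues one large write().
     * Plain POSIX; the path and payload size are made-up examples. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const char  *path    = "/mnt/lustre/pc-queue.dat";  /* hypothetical */
        const size_t payload = 100UL * 1024 * 1024;         /* 100 MB, as in the example */

        char *foo = malloc(payload);
        if (foo == NULL)
            return 1;

        /* No lock is taken at open time. */
        int fd = open(path, O_APPEND | O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* The exclusive EOF lock under discussion is only needed once the
         * write is issued; while one appender holds it, the others block. */
        if (write(fd, foo, payload) < 0)
            perror("write");

        close(fd);
        free(foo);
        return 0;
    }

With 10 such producers running concurrently, the appends would serialize
exactly as described in the message above.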
Tobias Oberstein wrote:

> Would you mind correcting my understanding of the details in the outlined
> situation?
>
> Let's say 10 processes on different machines open the same file with
> "fd = open(file, O_APPEND | O_WRONLY)". At this point, no lock has been
> issued yet.

Correct.

> Then 1 out of the 10 processes issues a "write(fd, foo, sizeof(foo))"
> where "sizeof(foo)" is large, e.g. 100 MB. At this point, an EOF lock is
> taken on the OST where the data resides. An EOF lock is an exclusive
> lock.

In fact, we need to take an EOF lock on all OSTs -- because we don't know
which OST holds the end of the file until we check them all. This is why
O_APPEND is particularly costly.

> [Sideline: If data is replicated, are locks taken on each OST holding a
> replica? If so, how does Lustre avoid deadlocks, since the locks in this
> case are not centrally managed on the MDS but distributed across all
> participating OSTs? Is there a primary replica which orchestrates the
> acquisition of all the locks?]

When multiple locks are required, they are taken in a known, consistent
order to avoid deadlocks.

> Now, while the first process is still writing its 100 MB payload to the
> file, a second process decides to issue a "write()" with its own payload
> of another 100 MB on the file. That call/process is now blocked, since it
> can't get an EOF lock on the file. In other words, the lock is held for
> the duration of the complete write operation. This results in full
> serialization of the 10 appending processes.

Indeed.

> I've got pretty much zero know-how on FS implementation. Would it be
> naive to do this: when a client has opened a file with O_APPEND and
> issues a write operation, the OST first writes the incoming data to
> "unnamed" data blocks on disk. Only when all the data has been written
> does the OST acquire a lock on the file to chain the new data blocks into
> the file's inode. The time span during which a lock is needed is much
> shorter: only the final chaining-in of new data blocks must be
> serialized. That way, multiple writers could do long, voluminous
> write-appends to the same file.

It's not inherently naive, but two considerations make this impractical.
First, your O_APPEND writes would need to be block-aligned and a multiple
of the block size -- otherwise, the chaining would leave little
zero-filled holes in your file. Second, for a striped file this is not
really possible: until the client has locks on all OSTs (i.e., until the
previous writer has finished), it can't know which OSTs are going to hold
the next piece.

> Why does the client need to know the file size to append to the file? My
> naive view: appending to an array requires knowledge of the current array
> length (and a lock on the current "end of array"). Appending to a linked
> list of arrays does not require knowledge of the current sum of the
> lengths of all linked arrays. Are file systems not implemented like the
> latter?

Most file systems are not implemented that way, no. The usual way is that
each inode contains a list of blocks or extents, along with a file size.
I.e.: if the block size is 1000 bytes and the file size is 500, there is
one half-full block allocated. The next O_APPEND write will first fill in
the last 500 bytes of that block, and then start allocating new blocks.

Hope this helps--

-Phil
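To make the last point concrete, here is a small worked example of the
block arithmetic described above, under the stated assumptions (1000-byte
blocks, a 500-byte file); the function and numbers are purely illustrative:

    /* Sketch of the usual inode layout described above: an O_APPEND
     * write first fills the tail of the last allocated block, then
     * allocates new blocks. The 1000-byte block size matches the
     * example in the reply; the function itself is illustrative. */
    #include <stdio.h>

    #define BLOCK_SIZE 1000UL

    static void plan_append(unsigned long file_size, unsigned long nbytes)
    {
        unsigned long used       = file_size % BLOCK_SIZE;
        unsigned long tail_room  = used ? BLOCK_SIZE - used : 0;
        unsigned long fill       = nbytes < tail_room ? nbytes : tail_room;
        unsigned long new_blocks = (nbytes - fill + BLOCK_SIZE - 1) / BLOCK_SIZE;

        printf("file size %lu: fill %lu byte(s) of the last block, "
               "then allocate %lu new block(s)\n",
               file_size, fill, new_blocks);
    }

    int main(void)
    {
        plan_append(500, 1500);   /* fills 500 bytes, allocates 1 new block */
        return 0;
    }

This is exactly why the appender needs the current file size: the offset
of the first new byte determines how much of the existing last block gets
reused.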
I'm looking for efficient producer-consumer queues with large (100s MB)
persistent queues, 10-100 distributed producers, producing some hundred
kB up to a couple of MBs for each insert.

For this reason I'm considering files in a clustered FS, where multiple
clients concurrently issue write-append operations, each of which must
execute atomically and with minimal synchronization overhead. Admittedly,
such semantics are kind of special, but very useful in certain scenarios.

Does Lustre provide such semantics? I've read some bits in lustre.pdf and
from my understanding, it doesn't. (btw, the lustre.pdf on the website
seems to be damaged .. Acrobat Reader bails out. I had to grab a probably
outdated version elsewhere: 04/28/2003, HEAD)

19.4.2. System Calls Requiring File Locks:

"..If the file is opened with O_APPEND flag, this will need the lock on
the file size [0, -1]. In other cases, an extent lock suffices. .."

Where is this lock provided? On the MDS or on the OSDs? Would it be
possible to provide, or optionally switch on, the semantics described?
By, e.g., telling the MDS that certain files are "special" PC queues? In
any case, I don't care about up-to-the-millisecond correct file sizes
when the files are used as PC queues.

That said, let me say that Lustre is incredible. Also, going through all
the (also great) VAXclusters stuff, taking the good and leaving the bad,
was kind of heroic.

Enough ..

tob
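As a rough sketch of how one such producer might frame its inserts so that
each one is a single write-append, here is a minimal example; the
length-prefixed record layout, file path, and helper name are assumptions
made for illustration, not anything Lustre or this thread specifies:

    /* Sketch of a producer for the file-backed PC queue described above:
     * each insert is packed into one buffer (length prefix + payload) and
     * submitted as a single write() on an O_APPEND descriptor, so the
     * whole record is appended in one call. The layout is an assumption. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static int queue_append(int fd, const void *payload, uint32_t len)
    {
        size_t total = sizeof(uint32_t) + len;
        char *rec = malloc(total);
        if (rec == NULL)
            return -1;

        memcpy(rec, &len, sizeof(uint32_t));           /* length prefix */
        memcpy(rec + sizeof(uint32_t), payload, len);  /* payload       */

        ssize_t n = write(fd, rec, total);             /* one append per insert */
        free(rec);
        return (n == (ssize_t)total) ? 0 : -1;
    }

    int main(void)
    {
        /* Hypothetical queue file on a Lustre mount. */
        int fd = open("/mnt/lustre/queue.dat",
                      O_APPEND | O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        const char msg[] = "one insert (in practice some hundred kB to a few MB)";
        if (queue_append(fd, msg, sizeof(msg)) != 0)
            fprintf(stderr, "append failed\n");

        close(fd);
        return 0;
    }

A consumer could then walk the file by reading the 4-byte length followed
by the payload, record by record; how efficiently many producers can
append concurrently is exactly the locking question discussed in the
replies above.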