Tobias Oberstein wrote:

> I'm looking for efficient producer-consumer queues with large (100s MB)
> persistent queues, 10-100 distributed producers, producing some hundred
> kB up to a couple of MBs for each insert.
>
> For this reason I'm considering files in a clustered FS, where multiple
> clients concurrently issue write-append operations, each of which must
> execute atomically and with minimal synchronization overhead. Admittedly,
> such semantics are kind of special, but very useful in certain scenarios.
>
> Does Lustre provide such semantics? I've read some bits in lustre.pdf
> and from my understanding, it doesn't.

Certainly Lustre will do what you want in a consistent manner, but there
will probably be significant locking overhead if all clients are using
O_APPEND.

We have discussed internally a couple of implementation alternatives for
fast O_APPEND. It is certainly possible to add optimizations for that
case to Lustre, although there is no customer yet to drive this work.

> Where is this lock provided? On the MDS or on the OSDs? Would it be
> possible to provide, or optionally switch on, the semantics described?
> By, e.g., telling the MDS that certain files are "special" PC queues?
> In any case, I don't care about up-to-the-millisecond correct file sizes
> when the files are used as PC queues.

File data locks are taken on the OSTs where the data resides. You don't
care about stat() having the correct file size, I agree, but you do
depend on each client knowing how large the file is when it goes to
append the next bytes, if you're using O_APPEND.

> (btw, the lustre.pdf on the website seems to be damaged .. Acrobat Reader
> bails out. I had to grab a probably outdated version elsewhere:
> 04/28/2003, HEAD)

The book should be fixed on the web now.

Thanks--

-Phil
Hi,

> > For this reason I'm considering files in a clustered FS, where multiple
> > clients concurrently issue write-append operations, each of which must
> > execute atomically and with minimal synchronization overhead. Admittedly,
> > such semantics are kind of special, but very useful in certain scenarios.
> >
> > Does Lustre provide such semantics? I've read some bits in lustre.pdf
> > and from my understanding, it doesn't.
>
> Certainly Lustre will do what you want in a consistent manner, but there
> will probably be significant locking overhead if all clients are using
> O_APPEND.

Would you mind correcting my understanding of the details in the outlined
situation?

Let's say 10 processes on different machines open the same file with
"fd = open(file, O_APPEND | O_WRONLY)". At this point, no lock has been
issued yet.

Then 1 out of the 10 processes issues a "write(fd, foo, sizeof(foo))"
where "sizeof(foo)" is large, e.g. 100 MB. At this point, an EOF lock is
taken on the OST where the data resides. An EOF lock is an exclusive
lock.

[Sideline: If data is replicated, are locks taken on each OST holding a
replica? If so, how does Lustre avoid deadlocks, since the locks in this
case are not centrally managed on the MDS but distributed across all
participating OSTs? Is there a primary replica which orchestrates the
acquisition of all the locks?]

Now, while the first process is still writing its 100 MB payload to the
file, a second process decides to issue a "write()" with its own payload
of another 100 MB on the file. That call/process is now blocked, since it
can't get an EOF lock on the file. In other words, the lock is held for
the duration of the complete write operation. This results in full
serialization of the 10 appending processes.

> We have discussed internally a couple of implementation alternatives for
> fast O_APPEND. It is certainly possible to add optimizations for that
> case to Lustre, although there is no customer yet to drive this work.

I've got pretty much zero know-how on FS implementation. Would it be
naive to do this: when a client has opened a file with O_APPEND and
issues a write operation, the OST first writes the incoming data to
"unnamed" data blocks on disk. Only when all the data has been written
does the OST acquire a lock on the file to chain the new data blocks into
the file's inode. The time span during which a lock is needed is much
shorter: only the final chaining-in of new data blocks must be
serialized. That way, multiple writers could do long, voluminous
write-appends to the same file.

> > Where is this lock provided? On the MDS or on the OSDs? Would it be
> > possible to provide, or optionally switch on, the semantics described?
> > By, e.g., telling the MDS that certain files are "special" PC queues?
> > In any case, I don't care about up-to-the-millisecond correct file sizes
> > when the files are used as PC queues.
>
> File data locks are taken on the OSTs where the data resides. You don't
> care about stat() having the correct file size, I agree, but you do
> depend on each client knowing how large the file is when it goes to
> append the next bytes, if you're using O_APPEND.

Why does the client need to know the file size to append to the file? My
naive view: appending to an array requires knowledge of the current array
length (and a lock on the current "end of array"). Appending to a linked
list of arrays does not require knowledge of the current sum of the
lengths of all linked arrays. Are file systems not implemented like the
latter?

Thanks for clarifying,
Tobias
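For concreteness, here is a minimal POSIX sketch of the access pattern
described above. The mount point, payload size, and error handling are
illustrative assumptions rather than anything from the thread, and nothing
in the code is Lustre-specific:

    /* Minimal sketch of the scenario above: each producer process opens
     * the shared file with O_APPEND and issues one large write().
     * Plain POSIX; the path and payload size are made-up examples. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const char  *path    = "/mnt/lustre/pc-queue.dat";  /* hypothetical */
        const size_t payload = 100UL * 1024 * 1024;         /* 100 MB, as in the example */

        char *foo = malloc(payload);
        if (foo == NULL)
            return 1;

        /* No lock is taken at open time. */
        int fd = open(path, O_APPEND | O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* The exclusive EOF lock under discussion is only needed once the
         * write is issued; while one appender holds it, the others block. */
        if (write(fd, foo, payload) < 0)
            perror("write");

        close(fd);
        free(foo);
        return 0;
    }

With 10 such producers running concurrently, the appends would serialize
exactly as described in the message above.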
Tobias Oberstein wrote:

> Would you mind correcting my understanding of the details in the outlined
> situation?
>
> Let's say 10 processes on different machines open the same file with
> "fd = open(file, O_APPEND | O_WRONLY)". At this point, no lock has been
> issued yet.

Correct.

> Then 1 out of the 10 processes issues a "write(fd, foo, sizeof(foo))"
> where "sizeof(foo)" is large, e.g. 100 MB. At this point, an EOF lock is
> taken on the OST where the data resides. An EOF lock is an exclusive
> lock.

In fact, we need to take an EOF lock on all OSTs -- because we don't know
which OST holds the end of the file until we check them all. This is why
O_APPEND is particularly costly.

> [Sideline: If data is replicated, are locks taken on each OST holding a
> replica? If so, how does Lustre avoid deadlocks, since the locks in this
> case are not centrally managed on the MDS but distributed across all
> participating OSTs? Is there a primary replica which orchestrates the
> acquisition of all the locks?]

When multiple locks are required, they are taken in a known, consistent
order to avoid deadlocks.

> Now, while the first process is still writing its 100 MB payload to the
> file, a second process decides to issue a "write()" with its own payload
> of another 100 MB on the file. That call/process is now blocked, since it
> can't get an EOF lock on the file. In other words, the lock is held for
> the duration of the complete write operation. This results in full
> serialization of the 10 appending processes.

Indeed.

> I've got pretty much zero know-how on FS implementation. Would it be
> naive to do this: when a client has opened a file with O_APPEND and
> issues a write operation, the OST first writes the incoming data to
> "unnamed" data blocks on disk. Only when all the data has been written
> does the OST acquire a lock on the file to chain the new data blocks into
> the file's inode. The time span during which a lock is needed is much
> shorter: only the final chaining-in of new data blocks must be
> serialized. That way, multiple writers could do long, voluminous
> write-appends to the same file.

It's not inherently naive, but two considerations make this impractical.
First, your O_APPEND writes would need to be block-aligned and a multiple
of the block size -- otherwise, the chaining would leave little
zero-filled holes in your file. Second, for a striped file this is not
really possible: until the client has locks on all OSTs (i.e., until the
previous writer has finished), it can't know which OSTs are going to hold
the next piece.

> Why does the client need to know the file size to append to the file? My
> naive view: appending to an array requires knowledge of the current array
> length (and a lock on the current "end of array"). Appending to a linked
> list of arrays does not require knowledge of the current sum of the
> lengths of all linked arrays. Are file systems not implemented like the
> latter?

Most file systems are not implemented that way, no. The usual way is that
each inode contains a list of blocks or extents, along with a file size.
I.e.: if the block size is 1000 bytes and the file size is 500, there is
one half-full block allocated. The next O_APPEND write will first fill in
the last 500 bytes of that block, and then start allocating new blocks.

Hope this helps--

-Phil
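To make the last point concrete, here is a small worked example of the
block arithmetic described above, under the stated assumptions (1000-byte
blocks, a 500-byte file); the function and numbers are purely illustrative:

    /* Sketch of the usual inode layout described above: an O_APPEND
     * write first fills the tail of the last allocated block, then
     * allocates new blocks. The 1000-byte block size matches the
     * example in the reply; the function itself is illustrative. */
    #include <stdio.h>

    #define BLOCK_SIZE 1000UL

    static void plan_append(unsigned long file_size, unsigned long nbytes)
    {
        unsigned long used       = file_size % BLOCK_SIZE;
        unsigned long tail_room  = used ? BLOCK_SIZE - used : 0;
        unsigned long fill       = nbytes < tail_room ? nbytes : tail_room;
        unsigned long new_blocks = (nbytes - fill + BLOCK_SIZE - 1) / BLOCK_SIZE;

        printf("file size %lu: fill %lu byte(s) of the last block, "
               "then allocate %lu new block(s)\n",
               file_size, fill, new_blocks);
    }

    int main(void)
    {
        plan_append(500, 1500);   /* fills 500 bytes, allocates 1 new block */
        return 0;
    }

This is exactly why the appender needs the current file size: the offset
of the first new byte determines how much of the existing last block gets
reused.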
I'm looking for efficient producer-consumer queues with large (100s MB)
persistent queues, 10-100 distributed producers, producing some hundred
kB up to a couple of MBs for each insert.

For this reason I'm considering files in a clustered FS, where multiple
clients concurrently issue write-append operations, each of which must
execute atomically and with minimal synchronization overhead. Admittedly,
such semantics are kind of special, but very useful in certain scenarios.

Does Lustre provide such semantics? I've read some bits in lustre.pdf and
from my understanding, it doesn't. (btw, the lustre.pdf on the website
seems to be damaged .. Acrobat Reader bails out. I had to grab a probably
outdated version elsewhere: 04/28/2003, HEAD)

19.4.2. System Calls Requiring File Locks:

"..If the file is opened with O_APPEND flag, this will need the lock on
the file size [0, -1]. In other cases, an extent lock suffices. .."

Where is this lock provided? On the MDS or on the OSDs? Would it be
possible to provide, or optionally switch on, the semantics described?
By, e.g., telling the MDS that certain files are "special" PC queues? In
any case, I don't care about up-to-the-millisecond correct file sizes
when the files are used as PC queues.

That said, let me say that Lustre is incredible. Also, going through all
the (also great) VAXclusters stuff, taking the good and leaving the bad,
was kind of heroic.

Enough ..

tob
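As a rough sketch of how one such producer might frame its inserts so that
each one is a single write-append, here is a minimal example; the
length-prefixed record layout, file path, and helper name are assumptions
made for illustration, not anything Lustre or this thread specifies:

    /* Sketch of a producer for the file-backed PC queue described above:
     * each insert is packed into one buffer (length prefix + payload) and
     * submitted as a single write() on an O_APPEND descriptor, so the
     * whole record is appended in one call. The layout is an assumption. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static int queue_append(int fd, const void *payload, uint32_t len)
    {
        size_t total = sizeof(uint32_t) + len;
        char *rec = malloc(total);
        if (rec == NULL)
            return -1;

        memcpy(rec, &len, sizeof(uint32_t));           /* length prefix */
        memcpy(rec + sizeof(uint32_t), payload, len);  /* payload       */

        ssize_t n = write(fd, rec, total);             /* one append per insert */
        free(rec);
        return (n == (ssize_t)total) ? 0 : -1;
    }

    int main(void)
    {
        /* Hypothetical queue file on a Lustre mount. */
        int fd = open("/mnt/lustre/queue.dat",
                      O_APPEND | O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        const char msg[] = "one insert (in practice some hundred kB to a few MB)";
        if (queue_append(fd, msg, sizeof(msg)) != 0)
            fprintf(stderr, "append failed\n");

        close(fd);
        return 0;
    }

A consumer could then walk the file by reading the 4-byte length followed
by the payload, record by record; how efficiently many producers can
append concurrently is exactly the locking question discussed in the
replies above.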