Alex Zhuravlev
2009-Mar-10 06:41 UTC
[Lustre-devel] some observations about metadata writeback cache
Hello,

I spent quite some time thinking about the WBC problem and I'd like to share my thoughts.

For WBC we store metadata in local memory for two purposes:
1) later reintegration
2) read access (lookup, getattr, readdir) without server involvement

For (2) it makes sense to store everything as "state", e.g. a directory contains all live entries, an inode contains the last valid attributes, etc. Let's call this the state cache.

In theory reintegration can be done from the state cache, and this is probably the most efficient way (in terms of network traffic and memory footprint), but for a simpler implementation we can introduce a log of changes for (1). In turn, the log can be per-object or just a global log for the given filesystem.

It's hard to implement the state cache in terms of operations because a usual operation involves more than one object (e.g. parent directory + file). It's much simpler when the state cache is per-object; literally the best example is Linux's dcache and inode cache.

It's also fairly simple to maintain such a cache at the level where a single object is being modified. For our purposes this matches the layer implementing the OSD API, because all operations in the OSD API are per single object.

The same applies to reintegration because:
* we need to break complex operations up to be sent to different servers anyway
* if we need to optimize the log (e.g. create/unlink), it's simpler to collapse log entries when they are basic operations
* when we want to reintegrate from the state cache

We also need a layer to take metadata operations and translate them into per-object basic operations (updates). The responsibility of this layer is:
* to grab all required ldlm locks, as the layer understands the operation's nature, locking rules, etc.
* to check the current state: whether the name already exists (for create), permissions
* to apply updates to the state cache (and the reintegration backend, if required)
* to release the ldlm locks

Essentially this is what the current metadata server does. The differences are:
* locks may have to be acquired on a remote node
* the current state can be on a remote node (not in the local state cache)
* updates can be stored in local memory for later reintegration (perhaps this applies to the usual MDS as well)

It looks quite obvious that it would make sense to use the metadata server code to implement WBC:
* ldlm hides where the lock is being mastered
* a dedicated OSD layer below the metadata server can maintain the state cache needed to check existing names, attributes, permissions, etc.
* a dedicated OSD layer below the metadata server can take care of reintegration

The implementation would look like a set of the following modules (see the sketch after the list):
* mdf - metadata filter: a location-free metadata server operating on top of the OSD API; it grabs ldlm locks, checks the current state and applies changes.
* cosd - caching osd: a dedicated layer with the OSD API; it maintains the state cache and all data needed for reintegration. It also tries to use the network efficiently: a regular lookup can be implemented via an underlying readdir, etc.
* gosd - global osd: a very specific module allowing a node to talk to remote storage over the OSD API; it is stateless, something similar to the current mdc, but using different APIs.
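To illustrate the layering, here is a minimal user-space C sketch of how the mdf layer could decompose one metadata operation into per-object updates against an OSD-like API. All names here (mdf_create, osd_ops, osd_object) are hypothetical, for illustration only, not the actual Lustre interfaces:

    #include <errno.h>

    /* hypothetical per-object OSD interface: every call touches one object only */
    struct osd_object;
    struct osd_ops {
            int (*index_lookup)(struct osd_object *dir, const char *name);
            int (*index_insert)(struct osd_object *dir, const char *name,
                                unsigned long long fid);
            int (*object_create)(struct osd_object *obj, unsigned int mode);
    };

    /* mdf: location-free layer that knows locking rules and operation semantics */
    int mdf_create(struct osd_ops *osd, struct osd_object *parent,
                   struct osd_object *child, const char *name, unsigned int mode)
    {
            int rc;

            /* 1. grab ldlm locks on parent and child (stubbed out here);
             *    ldlm hides whether the lock is mastered locally or remotely */

            /* 2. check current state: the name must not exist yet
             *    (index_lookup returning 0 means the entry was found) */
            if (osd->index_lookup(parent, name) == 0)
                    return -EEXIST;

            /* 3. apply per-object updates; a caching OSD (cosd) would record
             *    them in the state cache and the reintegration log, while a
             *    global OSD (gosd) would ship them to the remote server */
            rc = osd->object_create(child, mode);
            if (rc == 0)
                    rc = osd->index_insert(parent, name, /* fid of child */ 0);

            /* 4. release ldlm locks */
            return rc;
    }

The same mdf code would then run on a WBC client (on top of cosd) and on the metadata server (on top of a disk OSD), which is exactly the reuse argued for above.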
Some obvious pros of this approach:
* the implementation doesn't rely on any system-specific thing like dcache/icache
* we can unify the code and re-use it to implement the regular metadata server, WBC and a metadata proxy server
* overall simplicity: the inter-layer interaction is well defined and simple, and the same goes for each layer's functionality
* clustered metadata fits this model very well because the metadata server doesn't need to know whether some update is local or remote

Any comments and suggestions are very welcome!

thanks, Alex
Robert Read
2009-Mar-24 23:53 UTC
[Lustre-devel] some observations about metadata writeback cache
Hi Alex,

I'm trying to figure out how untrusted (what I'm calling simple) clients and trusted WBC-type clients will work together at the same time. Simple clients will need to participate in the oldest volatile epoch calculation, but will also need to retain operations for replay. I've drawn a simplified picture of how I think things are beginning to fit together, but more thought is needed here.

Simple clients
- don't participate in global epochs
- don't have a node epoch or add epochs to messages
- send operations to the MD server
- replies include an extended opaque "replay" data field
- replayed operations include the replay data
- the replay list is flushed based on "transno" (which may actually be the epoch; the replay data contains the actual transnos)
- multiple operations can have the same "transno"

Trusted clients
- participate in global epochs
- have a capability that allows them to participate
- send updates to OSD servers with epochs
- replay data contains only a single reply, which could be the same as today
- when all update replies are received the operation is placed on the redo list
- the redo list is flushed based on the OVE

MD server
- MDT/MDD receives operations without epochs
- sets the operation epoch to the node's current epoch
- all updates executed for that operation will use the same epoch
- replies are gathered and sent in the "replay data" field
- participates in the OVE - how much state does it need to retain to do this on behalf of the clients?

OSD
- receives updates with epochs
- handled locally
- normal reply returned

(a rough sketch of these message fields follows after this message)

robert

[Attachment: cmd-recovery.pdf (application/pdf, 50831 bytes): http://lists.lustre.org/pipermail/lustre-devel/attachments/20090324/f9b54ac6/attachment-0001.pdf]
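To make the simple/trusted distinction above concrete, here is a minimal C sketch of the per-message fields being discussed. All struct and field names are assumptions for illustration only, not actual Lustre wire formats:

    #include <stdint.h>

    /* request as sent by a client; simple clients leave epoch at 0 */
    struct md_request {
            uint64_t xid;        /* client-generated request id */
            uint64_t epoch;      /* 0 for simple clients; node epoch for
                                  * trusted WBC clients */
            /* ... operation body (create, unlink, setattr, ...) ... */
    };

    /* reply returned by the MD server to a simple client */
    struct md_reply {
            uint64_t transno;    /* for simple clients this may actually be
                                  * the epoch the server assigned */
            uint32_t replay_len; /* length of the opaque replay blob */
            /* followed by replay_len bytes of opaque "replay" data produced
             * by the server (e.g. the per-update replies and real transnos);
             * the client stores it and sends it back verbatim on replay */
    };

A simple client would keep (request, replay blob) pairs on its replay list and drop them once the reported "transno"/epoch becomes stable; a trusted client would instead keep per-update state and flush its redo list based on the OVE.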
Alex Zhuravlev
2009-Mar-25 16:59 UTC
[Lustre-devel] some observations about metadata writeback cache
>>>>> Robert Read (RR) writes:

RR> Hi Alex,
RR> I'm trying to figure out how untrusted (what I'm calling simple)
RR> clients and trusted WBC-type clients will work together at the same
RR> time. Simple clients will need to participate in the oldest volatile
RR> epoch calculation, but will also need to retain operations for replay.
RR> I've drawn a simplified picture of how I think things are beginning to
RR> fit together, but more thought is needed here.

RR> Simple clients
RR> - don't participate in global epochs

hmm. if a committed (in terms of transno) request can be reverted during global recovery, then even a simple client has to retain the request on the replay list until it's stable in terms of epochs?

RR> - don't have a node epoch or add epochs to messages
RR> - send operations to the MD server
RR> - replies include an extended opaque "replay" data field

probably we could simplify the code a lot if we didn't need to put the reply into the request in order to do replay? IOW, make all of the request's fields client-generated?

thanks, Alex
Oleg Drokin
2009-Mar-25 17:48 UTC
[Lustre-devel] some observations about metadata writeback cache
Hello!

On Mar 25, 2009, at 12:59 PM, Alex Zhuravlev wrote:
> RR> Simple clients
> RR> - don't participate in global epochs
> hmm. if a committed (in terms of transno) request can be reverted
> during global recovery, then even a simple client has to retain
> the request on the replay list until it's stable in terms of epochs?

Supposedly, the server that performed the operation on behalf of the client can do this? So the simple client semantics do not change: the moment the server has some stable record of the operation, the client can throw the data away (otherwise simple clients would need to know how to participate in rollback/replay even when the server the operation was sent to did not go down).

Bye,
    Oleg
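In other words, the simple client keeps its existing rule: prune the replay list up to whatever "committed/stable" watermark the server reports in replies. A minimal sketch of that client-side rule follows (names are illustrative, not the actual Lustre structures; the assumption is that in this model the server only advances the watermark once the operation is stable in terms of epochs, not merely locally committed):

    #include <stdint.h>
    #include <stdlib.h>

    struct replay_entry {
            uint64_t transno;              /* value the server reported in the reply */
            struct replay_entry *next;     /* list kept in ascending transno order */
            /* ... saved request and opaque replay blob ... */
    };

    /* drop everything the server says is now stable; the client does not need
     * to know whether "stable" means locally committed or epoch-stable */
    static void prune_replay_list(struct replay_entry **list, uint64_t last_stable)
    {
            while (*list != NULL && (*list)->transno <= last_stable) {
                    struct replay_entry *e = *list;
                    *list = e->next;
                    free(e);
            }
    }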
Alex Zhuravlev
2009-Mar-25 17:52 UTC
[Lustre-devel] some observations about metadata writeback cache
>>>>> Oleg Drokin (OD) writes:

OD> Hello!
OD> On Mar 25, 2009, at 12:59 PM, Alex Zhuravlev wrote:
RR> Simple clients
RR> - don't participate in global epochs
>> hmm. if a committed (in terms of transno) request can be reverted
>> during global recovery, then even a simple client has to retain
>> the request on the replay list until it's stable in terms of epochs?
OD> Supposedly, the server that performed the operation on behalf of the client
OD> can do this? So the simple client semantics do not change: the moment
OD> the server has some stable record of the operation, the client can throw
OD> the data away (otherwise simple clients would need to know how to
OD> participate in rollback/replay even when the server the operation was
OD> sent to did not go down).

hmm. then wouldn't it be simpler to do replay before global recovery and then do the global replay from the server's undo logs?

--
thanks, Alex
Oleg Drokin
2009-Mar-25 17:59 UTC
[Lustre-devel] some observations about metadata writeback cache
Hello!

On Mar 25, 2009, at 1:52 PM, Alex Zhuravlev wrote:
> RR> Simple clients
> RR> - don't participate in global epochs
>>> hmm. if a committed (in terms of transno) request can be reverted
>>> during global recovery, then even a simple client has to retain
>>> the request on the replay list until it's stable in terms of epochs?
> OD> Supposedly, the server that performed the operation on behalf of the client
> OD> can do this? So the simple client semantics do not change: the moment
> OD> the server has some stable record of the operation, the client can throw
> OD> the data away (otherwise simple clients would need to know how to
> OD> participate in rollback/replay even when the server the operation was
> OD> sent to did not go down).
> hmm. then wouldn't it be simpler to do replay before global recovery
> and then do the global replay from the server's undo logs?

Yes, but aside from that, losing a caching client leads to global recovery, while losing a simple client does not, since the server tracks its status.

Bye,
    Oleg
Robert Read
2009-Mar-25 18:03 UTC
[Lustre-devel] some observations about metadata writeback cache
On Mar 25, 2009, at 09:59, Alex Zhuravlev wrote:

>>>>>> Robert Read (RR) writes:
>
> RR> Hi Alex,
> RR> I'm trying to figure out how untrusted (what I'm calling simple)
> RR> clients and trusted WBC-type clients will work together at the same
> RR> time. Simple clients will need to participate in the oldest volatile
> RR> epoch calculation, but will also need to retain operations for replay.
> RR> I've drawn a simplified picture of how I think things are beginning to
> RR> fit together, but more thought is needed here.
>
> RR> Simple clients
> RR> - don't participate in global epochs
>
> hmm. if a committed (in terms of transno) request can be reverted
> during global recovery, then even a simple client has to retain
> the request on the replay list until it's stable in terms of epochs?

Yes, this is why these clients won't see the actual transno(s). Those would only be in the replay data blob the server returns with the reply. Instead, the MD server would send the epoch as the transno in the reply to these clients.

> RR> - don't have a node epoch or add epochs to messages
> RR> - send operations to the MD server
> RR> - replies include an extended opaque "replay" data field
>
> probably we could simplify the code a lot if we didn't need to put
> the reply into the request in order to do replay? IOW, make all of the
> request's fields client-generated?

True, but a request could contain multiple replies (one for each update), and the client doesn't need to be aware of that. I was thinking it would be better if the server managed this field. This means the client can replay the request as it was originally sent and include the additional data at the end, so the server can replay the updates in the correct order (see the sketch below).

cheers,
robert
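A minimal sketch of that replay path, assuming hypothetical helpers (none of these names are the real Lustre API); the point is only that the client treats the server-managed blob as opaque and appends it unchanged to the original request:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct saved_op {
            void     *req_buf;      /* request exactly as originally sent */
            uint32_t  req_len;
            void     *replay_blob;  /* opaque data from the server's reply,
                                     * e.g. per-update replies and transnos */
            uint32_t  blob_len;
    };

    /* build the resend buffer: original request followed by the opaque blob,
     * so the server can replay its updates in the original order */
    static void *build_replay_request(const struct saved_op *op, uint32_t *out_len)
    {
            uint32_t len = op->req_len + op->blob_len;
            char *buf = malloc(len);

            if (buf == NULL)
                    return NULL;
            memcpy(buf, op->req_buf, op->req_len);
            memcpy(buf + op->req_len, op->replay_blob, op->blob_len);
            *out_len = len;
            return buf;
    }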