Alexander Zarochentsev
2009-Apr-05 20:50 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
Hello,

There are ideas about the WBC client MD stack, the WBC protocol and the
changes needed on the server side. They are Global OSD and another idea
(let's name it CMD3+) explained in the WBC HLD outline draft.

Brief descriptions of the ideas:

GOSD:

A portable component (called MDS in Alex's presentation) translates MD
operations into OSD operations (updates).

The MDS may be on the client side (WBC client), a proxy server or an MD
server.

The MDS component is very similar to the current MDD (local MD server)
layer in the CMD3 server stack. I.e. it works like a local MD server,
but the OSD layer below is not local, it is GOSD.

It is as simple as the local MD server and it simplifies the MD server
stack a lot. The current MD stack processes MD operations at every
level: MDT, CMM and MDD. The first two levels have to understand what
CMD is, and the MDD layer has to understand that some MD operations can
be partial. That sounds like an unneeded complication. With GOSD those
layers will be replaced by a single layer as simple as MDD! (However,
LDLM locking would have to be added.)

CMD3+:

The component running on the WBC client is based on MDT, excluding the
transport parts. Code reuse is possible.

The WBC protocol is logically the current MD protocol plus the partial
MD operations (object create without a name, for example). Partial
operations are already used between MD servers for distributed MD
operations. MD operations will be packed into batches.

Both ideas (GOSD and CMD3+) assume a cache manager on the WBC client to
do caching and redo-logging of operations.

I think CMD3+ has minimal impact on the current Lustre 2.x design. It is
closer to the original goal of just implementing the WBC feature. But
GOSD is an attractive idea and may be potentially better.

With GOSD I am worried about making Lustre 2.x unstable for some period
of time. It would be good to think about a plan for incremental
integration of the new stack into the existing code.

This is a request for comments and new ideas, because design mistakes
here would be too costly.

Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
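To make the distinction between "operations" and "updates" a bit more
concrete, here is a rough sketch of how the WBC cache manager's redo log
might batch either form. This is my own illustration only, not code from
either HLD: the record layout and all names are invented, and the usual
Lustre/kernel types (struct lu_fid, __u32/__u64) are assumed from the
existing headers.

    /* Hypothetical redo-log records for the WBC cache manager: a single
     * mkdir is logged either as one logical MD operation (CMD3+ view) or
     * as the set of OSD updates it decomposes into (GOSD view). */
    enum wbc_rec_type {
            WBC_REC_MD_OP,          /* CMD3+: logical (possibly partial) MD op */
            WBC_REC_OSD_UPDATE,     /* GOSD:  low-level OSD update             */
    };

    struct wbc_osd_update {
            struct lu_fid   u_fid;      /* object the update applies to       */
            __u32           u_op;       /* create/index insert/attr set/...   */
            __u32           u_buflen;   /* length of packed name/attr payload */
            char            u_buf[0];
    };

    struct wbc_redo_rec {
            __u64           r_epoch;    /* ordering for replay and recovery */
            __u32           r_type;     /* enum wbc_rec_type                */
            __u32           r_count;    /* records batched together         */
            /* followed by one packed MD operation or r_count OSD updates */
    };

Under GOSD a cached mkdir would append something like an object create
plus a directory-entry insert as two WBC_REC_OSD_UPDATE entries; under
CMD3+ the same mkdir stays a single WBC_REC_MD_OP and is only split into
partial operations when the server side has to distribute it.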
Alexander Zarochentsev
2009-Apr-06 09:39 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
... lustre-devel@ doesn't want to deliver the message, so I am adding the
CC list this time.
Andreas Dilger
2009-Apr-06 10:03 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
On Apr 06, 2009 13:39 +0400, Alexander Zarochentsev wrote:
> There are ideas about the WBC client MD stack, the WBC protocol and the
> changes needed on the server side. They are Global OSD and another idea
> (let's name it CMD3+) explained in the WBC HLD outline draft.
>
> Brief descriptions of the ideas:
>
> GOSD:
>
> A portable component (called MDS in Alex's presentation) translates MD
> operations into OSD operations (updates).
>
> The MDS may be on the client side (WBC client), a proxy server or an MD
> server.
>
> The MDS component is very similar to the current MDD (local MD server)
> layer in the CMD3 server stack. I.e. it works like a local MD server,
> but the OSD layer below is not local, it is GOSD.
>
> It is as simple as the local MD server and it simplifies the MD server
> stack a lot. The current MD stack processes MD operations at every
> level: MDT, CMM and MDD. The first two levels have to understand what
> CMD is, and the MDD layer has to understand that some MD operations can
> be partial. That sounds like an unneeded complication. With GOSD those
> layers will be replaced by a single layer as simple as MDD! (However,
> LDLM locking would have to be added.)

My internal thoughts (in the absence of ever having taken a close look
at the HEAD MD stack) have always been that we would essentially be
moving the CMM to the client, and have it always connect to remote
MDTs (i.e. no local MDD) if we want to split "operations" into "updates".
I'd always visualized that the MDT accepts "operations" (as it does
today) and CMM is the component that decides which parts of the operation
are local (passed to MDD) and which are remote (passed to MDC). Maybe
the MD stack layering isn't quite as clean as this?

> CMD3+:
>
> The component running on the WBC client is based on MDT, excluding the
> transport parts. Code reuse is possible.
>
> The WBC protocol is logically the current MD protocol plus the partial
> MD operations (object create without a name, for example). Partial
> operations

partial operations == updates?

> are already used between MD servers for distributed MD operations. MD
> operations will be packed into batches.
>
> Both ideas (GOSD and CMD3+) assume a cache manager on the WBC client to
> do caching and redo-logging of operations.
>
> I think CMD3+ has minimal impact on the current Lustre 2.x design. It is
> closer to the original goal of just implementing the WBC feature. But
> GOSD is an attractive idea and may be potentially better.
>
> With GOSD I am worried about making Lustre 2.x unstable for some period
> of time. It would be good to think about a plan for incremental
> integration of the new stack into the existing code.

Wouldn't GOSD just end up being a new ptlrpc interface that exports the
OSD protocol to the network? This would mean that we need to be able
to have multiple services working on the same OSD (both MDD for classic
clients, and GOSD for WBC clients). That isn't a terrible idea, because
we have also discussed having both MDT and OST exports of the same OSD
so that we can efficiently store small files directly on the MDT and/or
scale the number of MDTs == OSTs for massive metadata performance.

I'd like to keep this kind of layering in mind also: whether it makes
sense to export yet another network protocol to clients, or instead to
add new operations to the existing service handlers so that they can
handle all of the operation types (with efficient passthrough to lower
layers as needed) and be able to multiplex the underlying device to
clients.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
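To illustrate the multiplexing idea, here is a purely hypothetical sketch
in C (none of this is the real lu_device/obd_device API; the struct and
function names are invented): several network-facing services attach to
the same backend OSD and each unpacks its own request format onto that
one object store.

    /* Hypothetical: one backend OSD shared by several network services. */
    struct osd_backend;                     /* the single local object store */

    struct net_service {
            const char  *svc_name;          /* "mdt", "ost", "gosd", ... */
            int        (*svc_handle)(struct osd_backend *osd, void *req);
    };

    int mdt_handle_operation(struct osd_backend *osd, void *req);
    int ost_handle_io(struct osd_backend *osd, void *req);
    int gosd_handle_update(struct osd_backend *osd, void *req);

    /* Classic MD clients keep sending "operations" to the MDT handler,
     * WBC clients send raw OSD "updates" to the GOSD handler, and an OST
     * handler can export the same device for data - all of them end up
     * calling into the one osd_backend. */
    static struct net_service services[] = {
            { .svc_name = "mdt",  .svc_handle = mdt_handle_operation },
            { .svc_name = "ost",  .svc_handle = ost_handle_io },
            { .svc_name = "gosd", .svc_handle = gosd_handle_update },
    };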
Alex Zhuravlev
2009-Apr-06 10:26 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
>>>>> Andreas Dilger (AD) writes:

AD> My internal thoughts (in the absence of ever having taken a close look
AD> at the HEAD MD stack) have always been that we would essentially be
AD> moving the CMM to the client, and have it always connect to remote
AD> MDTs (i.e. no local MDD) if we want to split "operations" into "updates".
AD> I'd always visualized that the MDT accepts "operations" (as it does
AD> today) and CMM is the component that decides which parts of the operation
AD> are local (passed to MDD) and which are remote (passed to MDC).

few thoughts here:

1) in order to organize a local cache with all this you'd need to do the
translation once more before the MD stack (you can't cache a create, you
can cache directory entries and objects). at the same time you need the
local cache to access just-made changes. that translation is already done
by MDD; if you don't run MDD locally you have to duplicate that code (to
some extent) for WBC

2) "create w/o name" (this is what MDT accepts these days) isn't an
operation, it's a partial operation. but for partial operations we
already have OSD - clear, simple and generic. having one more kind of
"partial operation" adds nothing besides confusion, IMHO

3) a local MDD is meaningless with CMD. CMD is a distributed thing and I
think any implementation of CMD using "metadata operations" (even partial
ones, in contrast with updates in terms of the OSD API) is a hack -
exactly like we did in CMD1/CMD2, implementing local operations with
calls to vfs_create() and distributed operations with special entries in
fsfilt. instead of all this we should just use OSD, always and properly.

4) the only rational reason behind the current design in CMD3 was that
rollback required making remote operations before any local one (to align
the epoch) - but it's very likely we don't need this any more. thank god
(some will understand what I meant ;)

5) running MDD on the MDS for WBC clients also adds nothing in terms of
functionality or clarity, but it adds code that duplicates what the OSD
already provides

>> are already used between MD servers for distributed MD operations. MD
>> operations will be packed into batches.
>>
>> Both ideas (GOSD and CMD3+) assume a cache manager on the WBC client to
>> do caching and redo-logging of operations.
>>
>> I think CMD3+ has minimal impact on the current Lustre 2.x design. It is
>> closer to the original goal of just implementing the WBC feature. But
>> GOSD is an attractive idea and may be potentially better.
>>
>> With GOSD I am worried about making Lustre 2.x unstable for some period
>> of time. It would be good to think about a plan for incremental
>> integration of the new stack into the existing code.

AD> Wouldn't GOSD just end up being a new ptlrpc interface that exports the
AD> OSD protocol to the network? This would mean that we need to be able
AD> to have multiple services working on the same OSD (both MDD for classic
AD> clients, and GOSD for WBC clients). That isn't a terrible idea, because
AD> we have also discussed having both MDT and OST exports of the same OSD
AD> so that we can efficiently store small files directly on the MDT and/or
AD> scale the number of MDTs == OSTs for massive metadata performance.

yes, with gosd you essentially have your object storage exported in terms
of the same API as the local storage. you can use that to implement
remote services (proxy, wbc).

AD> I'd like to keep this kind of layering in mind also: whether it makes
AD> sense to export yet another network protocol to clients, or instead to
AD> add new operations to the existing service handlers so that they can
AD> handle all of the operation types (with efficient passthrough to lower
AD> layers as needed) and be able to multiplex the underlying device
AD> to clients.

I think it's not "another" network protocol, I think it's the right
low-level protocol. meaning that instead of having a very limited set of
partial metadata operations like "create w/o name", "link w/o inode",
etc., we may have a very simple, generic protocol allowing us to do
anything with remote storage. for example, the core of replication with
this protocol could look like this: at one node you log osd operations
(an optional module in between the regular disk osd and the upper layers
like mdd), then you just send those operations to virtually any node in
the cluster and execute them there - you've got things replicated.

--
thanks, Alex
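A rough sketch of that replication idea (entirely hypothetical: the
record layout and the osd_apply_update() dispatcher are invented, and
struct lu_fid plus the kernel integer types are assumed from the existing
headers): an interposed module logs each OSD update, ships the log to a
peer, and the peer replays it against its own local OSD.

    /* Hypothetical log record for one OSD update. */
    struct osd_update {
            struct lu_fid  ou_fid;      /* target object                        */
            __u16          ou_op;       /* OU_CREATE, OU_INDEX_INSERT, ...      */
            __u16          ou_len;      /* length of packed arguments           */
            char           ou_args[0];
    };

    struct osd_device;
    int osd_apply_update(struct osd_device *osd, const struct osd_update *u);

    /* On the replica: apply the shipped updates in log order. */
    int replay_updates(struct osd_device *osd, const struct osd_update *u,
                       int count)
    {
            int i, rc = 0;

            for (i = 0; i < count && rc == 0; i++) {
                    rc = osd_apply_update(osd, u);
                    /* records are packed back to back: step over the args */
                    u = (const void *)((const char *)(u + 1) + u->ou_len);
            }
            return rc;
    }

The same machinery would serve a proxy or a WBC client: what crosses the
wire is always the generic update stream rather than named metadata
operations.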
di wang
2009-Apr-06 22:02 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
> 2) "create w/o name" (this is what MDT accepts these days) isn''t operation, > it''s partial operation. but for partial operations we already have OSD > - clear, simple and generic. having one more "partial operations" adds > nothing besides confusion, IMHO >I am not sure you can( or should) translate all the MD partial operation into object RPC for these partial MD operation. For example rename, (a/b ---> c/d, a/b in MDS1, c/d in MDS2). RPC goes to MDS1. 1) delete d (entry and object) from c in MDS2. 2) create b entry under c in MDS2. 3) delete b entry under a in MDS1. So if you do 1) and 2) by object rpc (skip mdd), then you might need create create all 4 objects (a and b are local object, c and d are remote object), and permission check locally (whether you can delete d under c). Not sure it is a good way. And also some quota stuff are handled in these partial operation in remote MDD, so I am not sure we should skip mdd totally here. Am I miss sth? Thanks WangDi
Alex Zhuravlev
2009-Apr-07 04:27 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
>>>>> di wang (dw) writes:

dw> I am not sure you can (or should) translate all the partial MD
dw> operations into object RPCs. For example rename (a/b ---> c/d, with
dw> a/b on MDS1 and c/d on MDS2). The RPC goes to MDS1, then:
dw> 1) delete d (entry and object) from c on MDS2.
dw> 2) create the b entry under c on MDS2.
dw> 3) delete the b entry under a on MDS1.
dw> So if you do 1) and 2) by object RPCs (skipping MDD), then you might
dw> need to create all 4 objects (a and b are local objects, c and d are
dw> remote objects) and do the permission check locally (whether you can
dw> delete d under c). I am not sure that is a good way.

sorry, I don't quite understand what you mean. for this rename you'd
need to:

if you're worried about additional RPCs, then we can (should) optimize
this with an intents-like mechanism: mdd_rename() (or its caller)
initializes an intent describing the rename in terms of fids and names,
then a network-aware layer (something like the current MDC, accepting the
OSD API) can put additional osd calls into the enqueue RPC. for example,
when it finds an enqueue on c, it can form a compound RPC consisting of:
the enqueue itself, osd's attr_get(c) and osd's lookup(c, d).

also notice that for such a rename you need to check that you won't
create a disconnected subtree, which is much worse than just a few
additional RPCs.

dw> And also some quota stuff is handled in these partial operations in
dw> the remote MDD, so I am not sure we should skip MDD totally here.
dw> Am I missing something?

can you explain why we need MDD on a server for quota? given we're about
to use OSD for data and metadata, I'd think that:

1) for chown/chgrp, MDD (wherever it runs) finds the LOV EA and uses the
epoch (for example) to change uid/gid on the MDS and all OSTs in an
atomic manner

2) the quota code isn't part of the data or metadata stack, rather it's
a side service like ldlm

3) the quota code registers hooks (or probably adds a very small module
right above OSD) to see all quota-related activity: attr_set, write,
index insert, etc.

also I think we still don't have a proper design for quota; this is yet
to be done to implement quota for the DMU.

thanks, Alex
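A sketch of that intent/compound-RPC optimization (hypothetical helpers
only; this is not the real ptlrpc or MDC interface): the network-aware
layer sees the rename intent and piggybacks the extra OSD calls onto the
lock enqueue, so they cost no additional round trips.

    struct lu_fid;
    struct compound_req;

    /* hypothetical builders for a compound request */
    struct compound_req *compound_start(void);
    void compound_add_enqueue(struct compound_req *r, const struct lu_fid *obj);
    void compound_add_attr_get(struct compound_req *r, const struct lu_fid *obj);
    void compound_add_lookup(struct compound_req *r, const struct lu_fid *dir,
                             const char *name);
    int  compound_send(struct compound_req *r);

    /* When the enqueue on c is issued for rename(a/b -> c/d), add the OSD
     * calls the client will need anyway. */
    int enqueue_rename_target(const struct lu_fid *c, const char *d_name)
    {
            struct compound_req *req = compound_start();

            compound_add_enqueue(req, c);         /* the enqueue itself */
            compound_add_attr_get(req, c);        /* osd attr_get(c)    */
            compound_add_lookup(req, c, d_name);  /* osd lookup(c, d)   */
            return compound_send(req);
    }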
Eric Barton
2009-May-18 21:01 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
Zam,

A couple of things to consider when splitting up operations into
updates...

1. Each update must contain some information about its peer updates, so
that in the absence of the client (e.g. on client eviction) we can check
that all of the operation's updates have been applied and apply a
correction if not. I think there is an advantage if every update includes
sufficient information to reconstruct all of its peer updates.

2. The current security design grants capabilities to clients to perform
operations on Lustre objects. If you allow remote "raw" OSD ops, you're
effectively distributing the Lustre clustered server further - i.e. nodes
allowed to do such operations are being trusted just as much as servers
to keep the filesystem consistent.

Cheers,
Eric
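To make point 1 concrete, here is one hypothetical shape for such a
self-describing update (the layout is invented, not taken from any design
document; struct lu_fid and the kernel integer types are assumed): each
update names its sibling updates so that a recovery pass can find them
and complete or undo the whole operation without the client.

    /* Hypothetical: an update record that describes its peer updates. */
    struct peer_update_desc {
            struct lu_fid   pd_fid;     /* object touched by the peer update */
            __u32           pd_mds;     /* MDS index where it was applied    */
            __u32           pd_op;      /* what the peer update does         */
    };

    struct recoverable_update {
            __u64                   ru_op_id;    /* id of the whole MD operation */
            __u32                   ru_index;    /* this update's slot           */
            __u32                   ru_nr_peers; /* number of peer updates       */
            struct peer_update_desc ru_peers[0]; /* enough to locate/redo them   */
            /* the update payload itself follows */
    };

On client eviction, each server holding one piece of the operation could
then cross-check the listed peers and apply a correction if some of them
are missing, which is the property asked for above.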