Alexander Zarochentsev
2009-Apr-05 20:50 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
Hello,

There are ideas about the WBC client MD stack, the WBC protocol and the
changes needed on the server side. They are Global OSD and another idea
(let's name it CMD3+) explained in the WBC HLD outline draft.

Brief descriptions of the ideas:

GOSD:

A portable component (called MDS in Alex's presentation) translates MD
operations into OSD operations (updates).

The MDS may be on the client side (WBC client), a proxy server or an MD
server.

The MDS component is very similar to the current MDD (local MD server)
layer in the CMD3 server stack. I.e. it works like a local MD server,
but the OSD layer below is not local, it is GOSD.

It is as simple as the local MD server and it simplifies the MD server
stack a lot. The current MD stack processes MD operations at every
level: MDT, CMM and MDD. The first two levels have to understand what
CMD is, and the MDD layer has to understand that some MD operations can
be partial. That sounds like an unneeded complication. With GOSD those
layers will be replaced by a single layer as simple as MDD! (However,
LDLM locking would have to be added.)

CMD3+:

The component running on the WBC client is based on MDT, excluding the
transport parts. Code reuse is possible.

The WBC protocol is logically the current MD protocol plus the partial
MD operations (object create without a name, for example). Partial
operations are already used between MD servers for distributed MD
operations. MD operations will be packed into batches.

Both ideas (GOSD and CMD3+) assume a cache manager on the WBC client to
do caching and redo-logging of operations.

I think CMD3+ has minimal impact on the current Lustre 2.x design. It is
closer to the original goal of just implementing the WBC feature. But
GOSD is an attractive idea and may be potentially better.

With GOSD I am worried about making Lustre 2.x unstable for some period
of time. It would be good to think about a plan for incremental
integration of the new stack into the existing code.

This is a request for comments and new ideas, because design mistakes
here would be too costly.

Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
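To make the distinction between "operations" and "updates" a bit more
concrete, here is a rough sketch of how the WBC cache manager's redo log
might batch either form. This is my own illustration only, not code from
either HLD: the record layout and all names are invented, and the usual
Lustre/kernel types (struct lu_fid, __u32/__u64) are assumed from the
existing headers.

    /* Hypothetical redo-log records for the WBC cache manager: a single
     * mkdir is logged either as one logical MD operation (CMD3+ view) or
     * as the set of OSD updates it decomposes into (GOSD view). */
    enum wbc_rec_type {
            WBC_REC_MD_OP,          /* CMD3+: logical (possibly partial) MD op */
            WBC_REC_OSD_UPDATE,     /* GOSD:  low-level OSD update             */
    };

    struct wbc_osd_update {
            struct lu_fid   u_fid;      /* object the update applies to       */
            __u32           u_op;       /* create/index insert/attr set/...   */
            __u32           u_buflen;   /* length of packed name/attr payload */
            char            u_buf[0];
    };

    struct wbc_redo_rec {
            __u64           r_epoch;    /* ordering for replay and recovery */
            __u32           r_type;     /* enum wbc_rec_type                */
            __u32           r_count;    /* records batched together         */
            /* followed by one packed MD operation or r_count OSD updates */
    };

Under GOSD a cached mkdir would append something like an object create
plus a directory-entry insert as two WBC_REC_OSD_UPDATE entries; under
CMD3+ the same mkdir stays a single WBC_REC_MD_OP and is only split into
partial operations when the server side has to distribute it.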
Alexander Zarochentsev
2009-Apr-06 09:39 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
... lustre-devel@ doesn't want to deliver the message, so I am adding the
CC list this time.
Andreas Dilger
2009-Apr-06 10:03 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
On Apr 06, 2009 13:39 +0400, Alexander Zarochentsev wrote:
> There are ideas about the WBC client MD stack, the WBC protocol and the
> changes needed on the server side. They are Global OSD and another idea
> (let's name it CMD3+) explained in the WBC HLD outline draft.
>
> Brief descriptions of the ideas:
>
> GOSD:
>
> A portable component (called MDS in Alex's presentation) translates MD
> operations into OSD operations (updates).
>
> The MDS may be on the client side (WBC client), a proxy server or an MD
> server.
>
> The MDS component is very similar to the current MDD (local MD server)
> layer in the CMD3 server stack. I.e. it works like a local MD server,
> but the OSD layer below is not local, it is GOSD.
>
> It is as simple as the local MD server and it simplifies the MD server
> stack a lot. The current MD stack processes MD operations at every
> level: MDT, CMM and MDD. The first two levels have to understand what
> CMD is, and the MDD layer has to understand that some MD operations can
> be partial. That sounds like an unneeded complication. With GOSD those
> layers will be replaced by a single layer as simple as MDD! (However,
> LDLM locking would have to be added.)

My internal thoughts (in the absence of ever having taken a close look
at the HEAD MD stack) have always been that we would essentially be
moving the CMM to the client, and have it always connect to remote
MDTs (i.e. no local MDD) if we want to split "operations" into "updates".
I'd always visualized that the MDT accepts "operations" (as it does
today) and CMM is the component that decides which parts of the operation
are local (passed to MDD) and which are remote (passed to MDC). Maybe
the MD stack layering isn't quite as clean as this?

> CMD3+:
>
> The component running on the WBC client is based on MDT, excluding the
> transport parts. Code reuse is possible.
>
> The WBC protocol is logically the current MD protocol plus the partial
> MD operations (object create without a name, for example). Partial
> operations

partial operations == updates?

> are already used between MD servers for distributed MD operations. MD
> operations will be packed into batches.
>
> Both ideas (GOSD and CMD3+) assume a cache manager on the WBC client to
> do caching and redo-logging of operations.
>
> I think CMD3+ has minimal impact on the current Lustre 2.x design. It is
> closer to the original goal of just implementing the WBC feature. But
> GOSD is an attractive idea and may be potentially better.
>
> With GOSD I am worried about making Lustre 2.x unstable for some period
> of time. It would be good to think about a plan for incremental
> integration of the new stack into the existing code.

Wouldn't GOSD just end up being a new ptlrpc interface that exports the
OSD protocol to the network? This would mean that we need to be able
to have multiple services working on the same OSD (both MDD for classic
clients, and GOSD for WBC clients). That isn't a terrible idea, because
we have also discussed having both MDT and OST exports of the same OSD
so that we can efficiently store small files directly on the MDT and/or
scale the number of MDTs == OSTs for massive metadata performance.

I'd like to keep this kind of layering in mind also: whether it makes
sense to export yet another network protocol to clients, or instead to
add new operations to the existing service handlers so that they can
handle all of the operation types (with efficient passthrough to lower
layers as needed) and be able to multiplex the underlying device to
clients.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
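To illustrate the multiplexing idea, here is a purely hypothetical sketch
in C (none of this is the real lu_device/obd_device API; the struct and
function names are invented): several network-facing services attach to
the same backend OSD and each unpacks its own request format onto that
one object store.

    /* Hypothetical: one backend OSD shared by several network services. */
    struct osd_backend;                     /* the single local object store */

    struct net_service {
            const char  *svc_name;          /* "mdt", "ost", "gosd", ... */
            int        (*svc_handle)(struct osd_backend *osd, void *req);
    };

    int mdt_handle_operation(struct osd_backend *osd, void *req);
    int ost_handle_io(struct osd_backend *osd, void *req);
    int gosd_handle_update(struct osd_backend *osd, void *req);

    /* Classic MD clients keep sending "operations" to the MDT handler,
     * WBC clients send raw OSD "updates" to the GOSD handler, and an OST
     * handler can export the same device for data - all of them end up
     * calling into the one osd_backend. */
    static struct net_service services[] = {
            { .svc_name = "mdt",  .svc_handle = mdt_handle_operation },
            { .svc_name = "ost",  .svc_handle = ost_handle_io },
            { .svc_name = "gosd", .svc_handle = gosd_handle_update },
    };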
Alex Zhuravlev
2009-Apr-06 10:26 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
>>>>> Andreas Dilger (AD) writes:

AD> My internal thoughts (in the absence of ever having taken a close look
AD> at the HEAD MD stack) have always been that we would essentially be
AD> moving the CMM to the client, and have it always connect to remote
AD> MDTs (i.e. no local MDD) if we want to split "operations" into "updates".
AD> I'd always visualized that the MDT accepts "operations" (as it does
AD> today) and CMM is the component that decides which parts of the operation
AD> are local (passed to MDD) and which are remote (passed to MDC).

few thoughts here:

1) in order to organize a local cache with all this you'd need to do the
translation once more before the MD stack (you can't cache a create, you
can cache directory entries and objects). at the same time you need the
local cache to access just-made changes. that translation is already done
by MDD; if you don't run MDD locally you have to duplicate that code (to
some extent) for WBC

2) "create w/o name" (this is what MDT accepts these days) isn't an
operation, it's a partial operation. but for partial operations we
already have OSD - clear, simple and generic. having one more kind of
"partial operation" adds nothing besides confusion, IMHO

3) a local MDD is meaningless with CMD. CMD is a distributed thing and I
think any implementation of CMD using "metadata operations" (even partial
ones, in contrast with updates in terms of the OSD API) is a hack -
exactly like we did in CMD1/CMD2, implementing local operations with
calls to vfs_create() and distributed operations with special entries in
fsfilt. instead of all this we should just use OSD, always and properly.

4) the only rational reason behind the current design in CMD3 was that
rollback required making remote operations before any local one (to align
the epoch) - but it's very likely we don't need this any more. thank god
(some will understand what I meant ;)

5) running MDD on the MDS for WBC clients also adds nothing in terms of
functionality or clarity, but it adds code that duplicates what the OSD
already provides

>> are already used between MD servers for distributed MD operations. MD
>> operations will be packed into batches.
>>
>> Both ideas (GOSD and CMD3+) assume a cache manager on the WBC client to
>> do caching and redo-logging of operations.
>>
>> I think CMD3+ has minimal impact on the current Lustre 2.x design. It is
>> closer to the original goal of just implementing the WBC feature. But
>> GOSD is an attractive idea and may be potentially better.
>>
>> With GOSD I am worried about making Lustre 2.x unstable for some period
>> of time. It would be good to think about a plan for incremental
>> integration of the new stack into the existing code.

AD> Wouldn't GOSD just end up being a new ptlrpc interface that exports the
AD> OSD protocol to the network? This would mean that we need to be able
AD> to have multiple services working on the same OSD (both MDD for classic
AD> clients, and GOSD for WBC clients). That isn't a terrible idea, because
AD> we have also discussed having both MDT and OST exports of the same OSD
AD> so that we can efficiently store small files directly on the MDT and/or
AD> scale the number of MDTs == OSTs for massive metadata performance.

yes, with gosd you essentially have your object storage exported in terms
of the same API as the local storage. you can use that to implement
remote services (proxy, wbc).

AD> I'd like to keep this kind of layering in mind also: whether it makes
AD> sense to export yet another network protocol to clients, or instead to
AD> add new operations to the existing service handlers so that they can
AD> handle all of the operation types (with efficient passthrough to lower
AD> layers as needed) and be able to multiplex the underlying device
AD> to clients.

I think it's not "another" network protocol, I think it's the right
low-level protocol. meaning that instead of having a very limited set of
partial metadata operations like "create w/o name", "link w/o inode",
etc., we may have a very simple, generic protocol allowing us to do
anything with remote storage. for example, the core of replication with
this protocol could look like this: at one node you log osd operations
(an optional module in between the regular disk osd and the upper layers
like mdd), then you just send those operations to virtually any node in
the cluster and execute them there - you've got things replicated.

--
thanks, Alex
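A rough sketch of that replication idea (entirely hypothetical: the
record layout and the osd_apply_update() dispatcher are invented, and
struct lu_fid plus the kernel integer types are assumed from the existing
headers): an interposed module logs each OSD update, ships the log to a
peer, and the peer replays it against its own local OSD.

    /* Hypothetical log record for one OSD update. */
    struct osd_update {
            struct lu_fid  ou_fid;      /* target object                        */
            __u16          ou_op;       /* OU_CREATE, OU_INDEX_INSERT, ...      */
            __u16          ou_len;      /* length of packed arguments           */
            char           ou_args[0];
    };

    struct osd_device;
    int osd_apply_update(struct osd_device *osd, const struct osd_update *u);

    /* On the replica: apply the shipped updates in log order. */
    int replay_updates(struct osd_device *osd, const struct osd_update *u,
                       int count)
    {
            int i, rc = 0;

            for (i = 0; i < count && rc == 0; i++) {
                    rc = osd_apply_update(osd, u);
                    /* records are packed back to back: step over the args */
                    u = (const void *)((const char *)(u + 1) + u->ou_len);
            }
            return rc;
    }

The same machinery would serve a proxy or a WBC client: what crosses the
wire is always the generic update stream rather than named metadata
operations.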
di wang
2009-Apr-06 22:02 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
> 2) "create w/o name" (this is what MDT accepts these days) isn''t operation, > it''s partial operation. but for partial operations we already have OSD > - clear, simple and generic. having one more "partial operations" adds > nothing besides confusion, IMHO >I am not sure you can( or should) translate all the MD partial operation into object RPC for these partial MD operation. For example rename, (a/b ---> c/d, a/b in MDS1, c/d in MDS2). RPC goes to MDS1. 1) delete d (entry and object) from c in MDS2. 2) create b entry under c in MDS2. 3) delete b entry under a in MDS1. So if you do 1) and 2) by object rpc (skip mdd), then you might need create create all 4 objects (a and b are local object, c and d are remote object), and permission check locally (whether you can delete d under c). Not sure it is a good way. And also some quota stuff are handled in these partial operation in remote MDD, so I am not sure we should skip mdd totally here. Am I miss sth? Thanks WangDi
Alex Zhuravlev
2009-Apr-07 04:27 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
>>>>> di wang (dw) writes:

dw> I am not sure you can (or should) translate all the partial MD
dw> operations into object RPCs. For example rename (a/b ---> c/d, with
dw> a/b on MDS1 and c/d on MDS2). The RPC goes to MDS1, then:
dw> 1) delete d (entry and object) from c on MDS2.
dw> 2) create the b entry under c on MDS2.
dw> 3) delete the b entry under a on MDS1.
dw> So if you do 1) and 2) by object RPCs (skipping MDD), then you might
dw> need to create all 4 objects (a and b are local objects, c and d are
dw> remote objects) and do the permission check locally (whether you can
dw> delete d under c). I am not sure that is a good way.

sorry, I don't quite understand what you mean. for this rename you'd
need to:

if you're worried about additional RPCs, then we can (should) optimize
this with an intents-like mechanism: mdd_rename() (or its caller)
initializes an intent describing the rename in terms of fids and names,
then a network-aware layer (something like the current MDC, accepting the
OSD API) can put additional osd calls into the enqueue RPC. for example,
when it finds an enqueue on c, it can form a compound RPC consisting of:
the enqueue itself, osd's attr_get(c) and osd's lookup(c, d).

also notice that for such a rename you need to check that you won't
create a disconnected subtree, which is much worse than just a few
additional RPCs.

dw> And also some quota stuff is handled in these partial operations in
dw> the remote MDD, so I am not sure we should skip MDD totally here.
dw> Am I missing something?

can you explain why we need MDD on a server for quota? given we're about
to use OSD for data and metadata, I'd think that:

1) for chown/chgrp, MDD (wherever it runs) finds the LOV EA and uses the
epoch (for example) to change uid/gid on the MDS and all OSTs in an
atomic manner

2) the quota code isn't part of the data or metadata stack, rather it's
a side service like ldlm

3) the quota code registers hooks (or probably adds a very small module
right above OSD) to see all quota-related activity: attr_set, write,
index insert, etc.

also I think we still don't have a proper design for quota; this is yet
to be done to implement quota for the DMU.

thanks, Alex
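A sketch of that intent/compound-RPC optimization (hypothetical helpers
only; this is not the real ptlrpc or MDC interface): the network-aware
layer sees the rename intent and piggybacks the extra OSD calls onto the
lock enqueue, so they cost no additional round trips.

    struct lu_fid;
    struct compound_req;

    /* hypothetical builders for a compound request */
    struct compound_req *compound_start(void);
    void compound_add_enqueue(struct compound_req *r, const struct lu_fid *obj);
    void compound_add_attr_get(struct compound_req *r, const struct lu_fid *obj);
    void compound_add_lookup(struct compound_req *r, const struct lu_fid *dir,
                             const char *name);
    int  compound_send(struct compound_req *r);

    /* When the enqueue on c is issued for rename(a/b -> c/d), add the OSD
     * calls the client will need anyway. */
    int enqueue_rename_target(const struct lu_fid *c, const char *d_name)
    {
            struct compound_req *req = compound_start();

            compound_add_enqueue(req, c);         /* the enqueue itself */
            compound_add_attr_get(req, c);        /* osd attr_get(c)    */
            compound_add_lookup(req, c, d_name);  /* osd lookup(c, d)   */
            return compound_send(req);
    }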
Eric Barton
2009-May-18 21:01 UTC
[Lustre-devel] [RFC] two ideas for Meta Data Write Back Cache
Zam,

A couple of things to consider when splitting up operations into
updates...

1. Each update must contain some information about its peer updates, so
that in the absence of the client (e.g. on client eviction) we can check
that all of the operation's updates have been applied and apply a
correction if not. I think there is an advantage if every update includes
sufficient information to reconstruct all of its peer updates.

2. The current security design grants capabilities to clients to perform
operations on Lustre objects. If you allow remote "raw" OSD ops, you're
effectively distributing the Lustre clustered server further - i.e. nodes
allowed to do such operations are being trusted just as much as servers
to keep the filesystem consistent.

Cheers,
Eric
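To make point 1 concrete, here is one hypothetical shape for such a
self-describing update (the layout is invented, not taken from any design
document; struct lu_fid and the kernel integer types are assumed): each
update names its sibling updates so that a recovery pass can find them
and complete or undo the whole operation without the client.

    /* Hypothetical: an update record that describes its peer updates. */
    struct peer_update_desc {
            struct lu_fid   pd_fid;     /* object touched by the peer update */
            __u32           pd_mds;     /* MDS index where it was applied    */
            __u32           pd_op;      /* what the peer update does         */
    };

    struct recoverable_update {
            __u64                   ru_op_id;    /* id of the whole MD operation */
            __u32                   ru_index;    /* this update's slot           */
            __u32                   ru_nr_peers; /* number of peer updates       */
            struct peer_update_desc ru_peers[0]; /* enough to locate/redo them   */
            /* the update payload itself follows */
    };

On client eviction, each server holding one piece of the operation could
then cross-check the listed peers and apply a correction if some of them
are missing, which is the property asked for above.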