Alex Zhuravlev
2009-Mar-10 06:41 UTC
[Lustre-devel] some observations about metadata writeback cache
Hello,

I spent quite some time thinking about the WBC problem and I'd like to share my thoughts.

For WBC we store metadata in local memory for two purposes:
1) later reintegration
2) read access (lookup, getattr, readdir) without server involvement

For (2) it makes sense to store everything as "state", e.g. a directory contains all live entries, an inode contains the last valid attributes, etc. Let's call this the state cache.

In theory reintegration can be done from the state cache, and this is probably the most efficient way (in terms of network traffic and memory footprint), but for a simpler implementation we can introduce a log of changes for (1). In turn, the log can be per-object or just a global log for the given filesystem.

It's hard to implement the state cache in terms of operations because a usual operation involves more than one object (e.g. parent directory + file). It's much simpler when the state cache is per-object; literally the best example is Linux's dcache and inode cache.

It's also fairly simple to maintain such a cache at the level where a single object is being modified. For our purposes this matches the layer implementing the OSD API, because all operations in the OSD API are per single object.

The same applies to reintegration because:
* we need to break complex operations up to be sent to different servers anyway
* if we need to optimize the log (e.g. create/unlink), it's simpler to collapse log entries when they are basic operations
* when we want to reintegrate from the state cache

We also need a layer to take metadata operations and translate them into per-object basic operations (updates). The responsibility of this layer is:
* to grab all required ldlm locks, as the layer understands the operation's nature, locking rules, etc.
* to check the current state: whether the name already exists (for create), permissions
* to apply updates to the state cache (and the reintegration backend, if required)
* to release the ldlm locks

Essentially this is what the current metadata server does. The differences are:
* locks may have to be acquired on a remote node
* the current state can be on a remote node (not in the local state cache)
* updates can be stored in local memory for later reintegration (perhaps this applies to the usual MDS as well)

It looks quite obvious that it would make sense to use the metadata server code to implement WBC:
* ldlm hides where the lock is being mastered
* a dedicated OSD layer below the metadata server can maintain the state cache needed to check existing names, attributes, permissions, etc.
* a dedicated OSD layer below the metadata server can take care of reintegration

The implementation would look like a set of the following modules (see the sketch after the list):
* mdf - metadata filter: a location-free metadata server operating on top of the OSD API; it grabs ldlm locks, checks the current state and applies changes.
* cosd - caching osd: a dedicated layer with the OSD API; it maintains the state cache and all data needed for reintegration. It also tries to use the network efficiently: a regular lookup can be implemented via an underlying readdir, etc.
* gosd - global osd: a very specific module allowing a node to talk to remote storage over the OSD API; it is stateless, something similar to the current mdc, but using different APIs.
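To illustrate the layering, here is a minimal user-space C sketch of how the mdf layer could decompose one metadata operation into per-object updates against an OSD-like API. All names here (mdf_create, osd_ops, osd_object) are hypothetical, for illustration only, not the actual Lustre interfaces:

    #include <errno.h>

    /* hypothetical per-object OSD interface: every call touches one object only */
    struct osd_object;
    struct osd_ops {
            int (*index_lookup)(struct osd_object *dir, const char *name);
            int (*index_insert)(struct osd_object *dir, const char *name,
                                unsigned long long fid);
            int (*object_create)(struct osd_object *obj, unsigned int mode);
    };

    /* mdf: location-free layer that knows locking rules and operation semantics */
    int mdf_create(struct osd_ops *osd, struct osd_object *parent,
                   struct osd_object *child, const char *name, unsigned int mode)
    {
            int rc;

            /* 1. grab ldlm locks on parent and child (stubbed out here);
             *    ldlm hides whether the lock is mastered locally or remotely */

            /* 2. check current state: the name must not exist yet
             *    (index_lookup returning 0 means the entry was found) */
            if (osd->index_lookup(parent, name) == 0)
                    return -EEXIST;

            /* 3. apply per-object updates; a caching OSD (cosd) would record
             *    them in the state cache and the reintegration log, while a
             *    global OSD (gosd) would ship them to the remote server */
            rc = osd->object_create(child, mode);
            if (rc == 0)
                    rc = osd->index_insert(parent, name, /* fid of child */ 0);

            /* 4. release ldlm locks */
            return rc;
    }

The same mdf code would then run on a WBC client (on top of cosd) and on the metadata server (on top of a disk OSD), which is exactly the reuse argued for above.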
Some obvious pros of this approach:
* the implementation doesn't rely on any system-specific thing like dcache/icache
* we can unify the code and re-use it to implement the regular metadata server, WBC and a metadata proxy server
* overall simplicity: the inter-layer interaction is well defined and simple, and the same goes for each layer's functionality
* clustered metadata fits this model very well because the metadata server doesn't need to know whether some update is local or remote

Any comments and suggestions are very welcome!

thanks, Alex
Robert Read
2009-Mar-24 23:53 UTC
[Lustre-devel] some observations about metadata writeback cache
Hi Alex,

I'm trying to figure out how untrusted (what I'm calling simple) clients and trusted WBC-type clients will work together at the same time. Simple clients will need to participate in the oldest volatile epoch calculation, but will also need to retain operations for replay. I've drawn a simplified picture of how I think things are beginning to fit together, but more thought is needed here.

Simple clients
- don't participate in global epochs
- don't have a node epoch or add epochs to messages
- send operations to the MD server
- replies include an extended opaque "replay" data field
- replayed operations include the replay data
- the replay list is flushed based on "transno" (which may actually be the epoch; the replay data contains the actual transnos)
- multiple operations can have the same "transno"

Trusted clients
- participate in global epochs
- have a capability that allows them to participate
- send updates to OSD servers with epochs
- replay data contains only a single reply, which could be the same as today
- when all update replies are received the operation is placed on the redo list
- the redo list is flushed based on the OVE

MD server
- MDT/MDD receives operations without epochs
- sets the operation epoch to the node's current epoch
- all updates executed for that operation will use the same epoch
- replies are gathered and sent in the "replay data" field
- participates in the OVE - how much state does it need to retain to do this on behalf of the clients?

OSD
- receives updates with epochs
- handled locally
- normal reply returned

(a rough sketch of these message fields follows after this message)

robert

[Attachment: cmd-recovery.pdf (application/pdf, 50831 bytes): http://lists.lustre.org/pipermail/lustre-devel/attachments/20090324/f9b54ac6/attachment-0001.pdf]
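To make the simple/trusted distinction above concrete, here is a minimal C sketch of the per-message fields being discussed. All struct and field names are assumptions for illustration only, not actual Lustre wire formats:

    #include <stdint.h>

    /* request as sent by a client; simple clients leave epoch at 0 */
    struct md_request {
            uint64_t xid;        /* client-generated request id */
            uint64_t epoch;      /* 0 for simple clients; node epoch for
                                  * trusted WBC clients */
            /* ... operation body (create, unlink, setattr, ...) ... */
    };

    /* reply returned by the MD server to a simple client */
    struct md_reply {
            uint64_t transno;    /* for simple clients this may actually be
                                  * the epoch the server assigned */
            uint32_t replay_len; /* length of the opaque replay blob */
            /* followed by replay_len bytes of opaque "replay" data produced
             * by the server (e.g. the per-update replies and real transnos);
             * the client stores it and sends it back verbatim on replay */
    };

A simple client would keep (request, replay blob) pairs on its replay list and drop them once the reported "transno"/epoch becomes stable; a trusted client would instead keep per-update state and flush its redo list based on the OVE.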
Alex Zhuravlev
2009-Mar-25 16:59 UTC
[Lustre-devel] some observations about metadata writeback cache
>>>>> Robert Read (RR) writes:

RR> Hi Alex,
RR> I'm trying to figure out how untrusted (what I'm calling simple)
RR> clients and trusted WBC-type clients will work together at the same
RR> time. Simple clients will need to participate in the oldest volatile
RR> epoch calculation, but will also need to retain operations for replay.
RR> I've drawn a simplified picture of how I think things are beginning to
RR> fit together, but more thought is needed here.

RR> Simple clients
RR> - don't participate in global epochs

hmm. if a committed (in terms of transno) request can be reverted during global recovery, then even a simple client has to retain the request on the replay list until it's stable in terms of epochs?

RR> - don't have a node epoch or add epochs to messages
RR> - send operations to the MD server
RR> - replies include an extended opaque "replay" data field

probably we could simplify the code a lot if we didn't need to put the reply into the request in order to do replay? IOW, make all of the request's fields client-generated?

thanks, Alex
Oleg Drokin
2009-Mar-25 17:48 UTC
[Lustre-devel] some observations about metadata writeback cache
Hello!

On Mar 25, 2009, at 12:59 PM, Alex Zhuravlev wrote:
> RR> Simple clients
> RR> - don't participate in global epochs
> hmm. if a committed (in terms of transno) request can be reverted
> during global recovery, then even a simple client has to retain
> the request on the replay list until it's stable in terms of epochs?

Supposedly, the server that performed the operation on behalf of the client can do this? So the simple client semantics do not change: the moment the server has some stable record of the operation, the client can throw the data away (otherwise simple clients would need to know how to participate in rollback/replay even when the server the operation was sent to did not go down).

Bye,
    Oleg
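In other words, the simple client keeps its existing rule: prune the replay list up to whatever "committed/stable" watermark the server reports in replies. A minimal sketch of that client-side rule follows (names are illustrative, not the actual Lustre structures; the assumption is that in this model the server only advances the watermark once the operation is stable in terms of epochs, not merely locally committed):

    #include <stdint.h>
    #include <stdlib.h>

    struct replay_entry {
            uint64_t transno;              /* value the server reported in the reply */
            struct replay_entry *next;     /* list kept in ascending transno order */
            /* ... saved request and opaque replay blob ... */
    };

    /* drop everything the server says is now stable; the client does not need
     * to know whether "stable" means locally committed or epoch-stable */
    static void prune_replay_list(struct replay_entry **list, uint64_t last_stable)
    {
            while (*list != NULL && (*list)->transno <= last_stable) {
                    struct replay_entry *e = *list;
                    *list = e->next;
                    free(e);
            }
    }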
Alex Zhuravlev
2009-Mar-25 17:52 UTC
[Lustre-devel] some observations about metadata writeback cache
>>>>> Oleg Drokin (OD) writes:

OD> Hello!
OD> On Mar 25, 2009, at 12:59 PM, Alex Zhuravlev wrote:
RR> Simple clients
RR> - don't participate in global epochs
>> hmm. if a committed (in terms of transno) request can be reverted
>> during global recovery, then even a simple client has to retain
>> the request on the replay list until it's stable in terms of epochs?
OD> Supposedly, the server that performed the operation on behalf of the client
OD> can do this? So the simple client semantics do not change: the moment
OD> the server has some stable record of the operation, the client can throw
OD> the data away (otherwise simple clients would need to know how to
OD> participate in rollback/replay even when the server the operation was
OD> sent to did not go down).

hmm. then wouldn't it be simpler to do replay before global recovery and then do the global replay from the server's undo logs?

--
thanks, Alex
Oleg Drokin
2009-Mar-25 17:59 UTC
[Lustre-devel] some observations about metadata writeback cache
Hello!

On Mar 25, 2009, at 1:52 PM, Alex Zhuravlev wrote:
> RR> Simple clients
> RR> - don't participate in global epochs
>>> hmm. if a committed (in terms of transno) request can be reverted
>>> during global recovery, then even a simple client has to retain
>>> the request on the replay list until it's stable in terms of epochs?
> OD> Supposedly, the server that performed the operation on behalf of the client
> OD> can do this? So the simple client semantics do not change: the moment
> OD> the server has some stable record of the operation, the client can throw
> OD> the data away (otherwise simple clients would need to know how to
> OD> participate in rollback/replay even when the server the operation was
> OD> sent to did not go down).
> hmm. then wouldn't it be simpler to do replay before global recovery
> and then do the global replay from the server's undo logs?

Yes, but aside from that, losing a caching client leads to global recovery, while losing a simple client does not, since the server tracks its status.

Bye,
    Oleg
Robert Read
2009-Mar-25 18:03 UTC
[Lustre-devel] some observations about metadata writeback cache
On Mar 25, 2009, at 09:59, Alex Zhuravlev wrote:

>>>>>> Robert Read (RR) writes:
>
> RR> Hi Alex,
> RR> I'm trying to figure out how untrusted (what I'm calling simple)
> RR> clients and trusted WBC-type clients will work together at the same
> RR> time. Simple clients will need to participate in the oldest volatile
> RR> epoch calculation, but will also need to retain operations for replay.
> RR> I've drawn a simplified picture of how I think things are beginning to
> RR> fit together, but more thought is needed here.
>
> RR> Simple clients
> RR> - don't participate in global epochs
>
> hmm. if a committed (in terms of transno) request can be reverted
> during global recovery, then even a simple client has to retain
> the request on the replay list until it's stable in terms of epochs?

Yes, this is why these clients won't see the actual transno(s). Those would only be in the replay data blob the server returns with the reply. Instead, the MD server would send the epoch as the transno in the reply to these clients.

> RR> - don't have a node epoch or add epochs to messages
> RR> - send operations to the MD server
> RR> - replies include an extended opaque "replay" data field
>
> probably we could simplify the code a lot if we didn't need to put
> the reply into the request in order to do replay? IOW, make all of the
> request's fields client-generated?

True, but a request could contain multiple replies (one for each update), and the client doesn't need to be aware of that. I was thinking it would be better if the server managed this field. This means the client can replay the request as it was originally sent and include the additional data at the end, so the server can replay the updates in the correct order (see the sketch below).

cheers,
robert
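A minimal sketch of that replay path, assuming hypothetical helpers (none of these names are the real Lustre API); the point is only that the client treats the server-managed blob as opaque and appends it unchanged to the original request:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct saved_op {
            void     *req_buf;      /* request exactly as originally sent */
            uint32_t  req_len;
            void     *replay_blob;  /* opaque data from the server's reply,
                                     * e.g. per-update replies and transnos */
            uint32_t  blob_len;
    };

    /* build the resend buffer: original request followed by the opaque blob,
     * so the server can replay its updates in the original order */
    static void *build_replay_request(const struct saved_op *op, uint32_t *out_len)
    {
            uint32_t len = op->req_len + op->blob_len;
            char *buf = malloc(len);

            if (buf == NULL)
                    return NULL;
            memcpy(buf, op->req_buf, op->req_len);
            memcpy(buf + op->req_len, op->replay_blob, op->blob_len);
            *out_len = len;
            return buf;
    }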