Some thoughts on SOM safety...

The MDS must guarantee that any SOM attributes it provides to its
clients are valid at the moment they are requested - i.e. that no file
stripes were updated while the SOM attributes were computed and
cached. This guarantee must hold in the presence of all possible
failures.

Clients notify the MDS before they could possibly update any stripe of
a file (e.g. on OPEN) so that the MDS can invalidate any cached SOM
attributes. Clients also notify the MDS with "done writing" when all
their stripe updates have committed so that the MDS can determine when
it may resume caching SOM attributes.

This protocol breaks down when the MDS evicts a client which is
updating files. The client may not be aware of the eviction and can
continue to update the file's stripes. Since it is not safe to cache
SOM attributes for this file again until we can guarantee that all
stripe updates by the evicted client have ceased, we must...

R1: Invalidate SOM attributes cached on the MDS

and/or

R2: Prevent further stripe updates by the evicted client

...until the client has reconnected to the MDS and the protocol is
back in synch.

R3: R1 and R2 must hold irrespective of any server (MDS or OSS) crash
    or restart.

The following requirements are also needed for performance...

R4: The MDS must avoid doing a synchronous disk I/O when receiving
    notification of possible stripe updates.

R5: O(# files * # clients) persistent state must be avoided (e.g. it's
    not OK to keep a persistent list of open files for each client).

This means the MDS can't track which files are vulnerable to stripe
updates if it crashes and then restarts or fails over. A client that
had files open for update before the crash could fail to reconnect,
and since the OST logs only tell the MDS which files have been updated
already, files previously opened for update but not yet actually
updated by this client are not accounted for.

Therefore without (R2), SOM attribute caching cannot be re-enabled for
_any_ files on a restarted MDS while any clients remain evicted.

Here are some alternative proposals to implement (R2)...

1. Timeouts

   A timeout can be used to guarantee (R2) by ensuring clients
   discover they have been evicted by the MDS and cease updates within
   a bounded interval. This relies on...

   a. Clients and the MDS agree on the timeout.

   b. Clients detect they have been evicted by the MDS and stop
      sending stripe updates to any OST until they have reconnected
      to the MDS.

   Note...

   1. Configuration errors could invalidate the timeout agreement
      unless it is confirmed by explicit message passing.

   2. Guaranteeing all in-flight stripe updates have completed within
      the timeout is tricky. It requires a maximum latency bound
      either from LNET or ptlrpc.

   3. Clients will have to ping the MDS regularly in the absence of
      other traffic to bound the time it takes to detect eviction.
      Shorter timeouts will lead to shorter ping intervals and a
      corresponding increase in MDS load.

   4. On startup, the MDS cannot enable SOM attributes until the
      timeout has expired to ensure all clients have detected the
      restart.

   5. A buggy or malicious client can disregard the timeout.
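For illustration, here is a minimal client-side sketch of the timeout
rule in proposal 1 (hypothetical names throughout - this is not Lustre
code): the client fences its own stripe updates once the agreed
interval has elapsed without word from the MDS, and lifts the fence
only on reconnect.

#include <stdbool.h>
#include <time.h>

/* Hypothetical per-client state; not actual Lustre structures. */
struct som_client {
        time_t last_mds_ack;    /* last time the MDS answered us */
        time_t timeout;         /* interval agreed with the MDS (1a) */
        bool   fenced;          /* true: no stripe updates allowed (1b) */
};

/* Called on every successful MDS reply (ping or otherwise). */
static void som_note_mds_ack(struct som_client *cl)
{
        cl->last_mds_ack = time(NULL);
}

/* Gate every stripe update: refuse to send once we may be evicted. */
static bool som_may_send_stripe_update(struct som_client *cl)
{
        if (cl->fenced)
                return false;
        if (time(NULL) - cl->last_mds_ack >= cl->timeout) {
                /* We can no longer prove the MDS still knows about
                 * our opens: fence ourselves until reconnect (1b). */
                cl->fenced = true;
                return false;
        }
        return true;
}

/* On successful reconnect to the MDS the fence is lifted. */
static void som_on_mds_reconnect(struct som_client *cl)
{
        cl->fenced = false;
        som_note_mds_ack(cl);
}

Note that this only embodies (1b); caveat 2 above still requires a
separate latency bound on updates already in flight.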
2. OST eviction

   An alternative to timeouts is to evict clients from the OSTs when
   they are evicted from the MDS. This prevents clients from
   performing further stripe updates after eviction from the MDS and
   notifies them to reconnect.

   Note however that this requires client connection/eviction to
   proceed in lockstep across all servers to ensure that stripe
   updates arriving at any OST were sent in the context of the current
   client/MDS connection and not an earlier one.

3. Ordered Keys

   Using ordered keys to verify stripe updates eliminates the lockstep
   requirement of OST eviction. The MDS and OSTs maintain a key for
   every client which uniquely identifies a particular client/MDS
   connection instance and can be compared with other keys for the
   same client/MDS connection to determine which one is older.
   Clients receive this key when they connect to the MDS and pass it
   on every stripe update. OSTs check the key and reject updates with
   an "old" key, which forces the client to reconnect to the MDS to
   obtain a new key.

   Note...

   1. The only requirement on keys is that they increase monotonically
      for a given client. The same key can be in use by many
      different clients, so a single clock could be used to generate
      keys for all clients provided it never goes backwards
      (persistently) and an individual client is not permitted to
      reconnect before the clock ticks.

   2. When a client is evicted, the MDS must continue to disable SOM
      attribute caching for the client's writeable files until the new
      key has been sent to all OSTs backing those files. This can be
      done individually for each file.

      Clients may reconnect and continue with stripe updates before
      all OSTs have received their new key since OSTs only reject old
      keys. This allows OST notification to be relatively lazy -
      i.e. the MDS can buffer pending client/key updates for all OSTs
      and send them periodically. Increasing this period only
      increases the time that SOM attribute caching must remain
      disabled for affected files.

   3. When the MDS restarts or fails over, it must resynchronise with
      all OSTs - i.e. install keys to limit stripe updates to
      actively connected clients and read the OST logs to discover
      files that were updated without persistently invalidating SOM
      attributes cached on the MDS. Since it only needs a single key
      for all clients at this time, resynchronisation should be cheap.

   4. When an OST restarts or fails over, it must recover its
      client/key state from the MDS before it can continue with normal
      operation to ensure that it continues to reject stripe updates
      that the MDS had already disabled with the previous OST
      instance. For a long-running MDS, this client/key state could
      be one key for every client, which might best be sent as bulk
      data.

      Alternatively, key state could be stored persistently on the
      OST so that recovery could use existing code to replay
      uncommitted key updates from the MDS.

      It seems safe to allow client replay to proceed concurrently
      with key state recovery since clients should only replay updates
      that were not rejected the first time round. Also the MDS knows
      which files are volatile through an OST restart if clients only
      send "done writing" when all updates have committed.

--

Cheers,
Eric
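As an aside, the OST-side key check in proposal 3 could look roughly
like the following sketch (invented names; a real implementation would
key off the client export rather than a dense index):

#include <stdint.h>

/* Hypothetical key: a single monotonic clock value shared by all
 * clients is enough (note 1), provided it never goes backwards. */
typedef uint64_t som_key_t;

#define MAX_CLIENTS 4096

/* Per-client newest key known to this OST, indexed by a dense
 * client id for brevity. */
static som_key_t ost_client_key[MAX_CLIENTS];

/* Called when the MDS pushes a new key for a client (lazily, note 2). */
void ost_install_key(unsigned client, som_key_t key)
{
        if (key > ost_client_key[client])
                ost_client_key[client] = key;
}

/* Called for every incoming stripe update.  Returns 0 to accept,
 * -1 to reject; rejection forces the client back to the MDS for a
 * fresh key. */
int ost_check_stripe_update(unsigned client, som_key_t key)
{
        if (key < ost_client_key[client])
                return -1;      /* stale connection instance */
        /* A newer key proves a newer client/MDS connection: adopt it
         * so still-older updates are rejected from now on. */
        ost_client_key[client] = key;
        return 0;
}

Accepting (and adopting) a newer-than-known key is what lets clients
reconnect and resume before the MDS's lazy key push has reached every
OST, as note 2 above describes.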
On 2010-01-05, at 11:39, Eric Barton wrote:
> The MDS must guarantee that any SOM attributes it provides to its
> clients are valid at the moment they are requested - i.e. that no file
> stripes were updated while the SOM attributes were computed and
> cached. This guarantee must hold in the presence of all possible
> failures.
>
> Clients notify the MDS before they could possibly update any stripe of
> a file (e.g. on OPEN) so that the MDS can invalidate any cached SOM
> attributes. Clients also notify the MDS with "done writing" when all
> their stripe updates have committed so that the MDS can determine when
> it may resume caching SOM attributes.

This brings up an interesting question. When the client does a lookup
on a file, or first opens it, the client gets the cached size from the
MDS (assuming the SOM cache is valid). However, after this initial
update, what guarantee does the client have that the size is still
valid? Must it do further MDS getattr or OST glimpse operations in
order to revalidate the size? I don't recall any lock bit that the
MDS gives the client that tells the client that the file size it has
is still valid.

In this regard, it seems that SOM would only provide an improvement on
the initial "ls -l" operation, and subsequent "ls -l" operations would
be slower than the current "readdir + statahead + DLM lock cache"
(which would not need to do any RPCs for the second "ls -l").

> [...]
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Tue, Jan 05, 2010 at 02:50:51PM -0700, Andreas Dilger wrote:
> On 2010-01-05, at 11:39, Eric Barton wrote:
> > [...]
>
> This brings up an interesting question. When the client does a lookup
> on a file, or first opens it, the client gets the cached size from the
> MDS (assuming the SOM cache is valid). However, after this initial
> update, what guarantee does the client have that the size is still
> valid? Must it do further MDS getattr or OST glimpse operations in
> order to revalidate the size? I don't recall any lock bit that the
> MDS gives the client that tells the client that the file size it has
> is still valid.

Well, the guarantee should be from the time the MDS responds to the
client until the stat() call returns to the application. After all,
POSIX talks about system calls, not client/server messaging. That
means that the client effectively holds a lock on the cached
attributes for a bit.

> In this regard, it seems that SOM would only provide an improvement on
> the initial "ls -l" operation, and subsequent "ls -l" operations would
> be slower than the current "readdir + statahead + DLM lock cache"
> (which would not need to do any RPCs for the second "ls -l").

If the client can hold that lock for as long as no one is writing,
then the client can cache that information for that long.

Nico
--
Hello!

On Jan 5, 2010, at 4:50 PM, Andreas Dilger wrote:
> On 2010-01-05, at 11:39, Eric Barton wrote:
>> [...]
>
> This brings up an interesting question. When the client does a lookup
> on a file, or first opens it, the client gets the cached size from the
> MDS (assuming the SOM cache is valid). However, after this initial
> update, what guarantee does the client have that the size is still
> valid? Must it do further MDS getattr or OST glimpse operations in
> order to revalidate the size? I don't recall any lock bit that the
> MDS gives the client that tells the client that the file size it has
> is still valid.

Actually, when I discussed this with Vitaly it was obvious that once
the size on the MDS becomes invalid, the server will revoke the UPDATE
lock on the inode. A subsequent stat will refetch the attributes (with
the size missing this time). I am not sure if this is actually
implemented and I do not see anything about it in the latest TOI. But
in the simplest case any open for write (or the presence of such an
open at the time of the getattr RPC) would drop the lock (or cause us
not to return it). How that (additional lock taking on a server) would
impact the cpu utilization on the server is unknown, though.

Bye,
    Oleg
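For concreteness, a schematic sketch of the server-side behaviour Oleg
describes - invented types and names; the real MDS code paths differ:

#include <stdbool.h>

/* Hypothetical per-inode state on the MDS; not real Lustre types. */
struct mds_inode {
        bool som_valid;         /* cached SOM attributes usable */
        int  open_writers;      /* current opens for write */
};

static void mds_revoke_update_lock(struct mds_inode *ino)
{
        (void)ino;      /* stand-in: would send DLM blocking ASTs */
}

/* Open for write: the cached size can no longer be trusted, so stop
 * handing it out and pull back any client still relying on it. */
void mds_open_for_write(struct mds_inode *ino)
{
        if (ino->open_writers++ == 0 && ino->som_valid) {
                ino->som_valid = false;
                /* Clients holding the UPDATE lock believe their
                 * cached size is current: revoke it.  Only needed
                 * for the first writer; a 2nd open already finds
                 * the SOM cache invalid. */
                mds_revoke_update_lock(ino);
        }
}

/* "Done writing": once the last writer's updates have committed the
 * MDS may recompute the attributes and re-enable the SOM cache. */
void mds_done_writing(struct mds_inode *ino)
{
        if (--ino->open_writers == 0)
                ino->som_valid = true;  /* after attributes rebuilt */
}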
Eric Barton wrote:
> [...]
>
> 2. OST eviction
>
>    An alternative to timeouts is to evict clients from the OSTs when
>    they are evicted from the MDS.

This would be a step towards adding a notion of cluster membership to
Lustre.
Wouldn't there be other benefits from that in solving other races,
when a client is evicted from one of the servers but is not evicted
from the others?

-Alex
On Wed, Jan 06, 2010 at 12:09:41PM -0500, Aleksandr Guzovskiy wrote:
> Eric Barton wrote:
> > 2. OST eviction
> >
> >    An alternative to timeouts is to evict clients from the OSTs when
> >    they are evicted from the MDS.
>
> This would be a step towards adding a notion of cluster membership to
> Lustre. Wouldn't there be other benefits from that in solving other
> races, when a client is evicted from one of the servers but is not
> evicted from the others?

The health network will allow for eviction notices to be spread around
the cluster quickly.

I think we'll need a separate cluster membership capability for
reasons having to do with optimizing the health network: if you see a
peer C that's got a membership capability issued at time T_a, and
you're a server S_n that's been in the cluster since before T_a, and
you've not heard any eviction notices for C, then C is still a member
of the cluster. Without a cluster membership capability we'd need to
ask the health network if C is a member, and while that can happen
quickly, in a mostly-stateless health network (the current design)
having every server ask about the membership/liveness status of every
peer client could result in a load spike.

Nico
--
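A sketch of that local decision rule, with invented types since
neither the capability format nor the health network interface is
specified here:

#include <time.h>
#include <stdbool.h>

/* Hypothetical server-local view of the cluster. */
struct server_state {
        time_t joined_cluster;  /* when this server S_n joined */
};

enum membership { MEMBER, EVICTED, MUST_ASK_HEALTH_NET };

/* Decide client C's membership from local state where possible. */
enum membership check_membership(const struct server_state *srv,
                                 time_t cap_issued_at,   /* T_a */
                                 bool heard_eviction_for_client)
{
        if (heard_eviction_for_client)
                return EVICTED;
        if (srv->joined_cluster < cap_issued_at)
                return MEMBER;          /* any eviction of C would
                                         * have reached us as a
                                         * notice by now */
        return MUST_ASK_HEALTH_NET;     /* we joined after T_a, so we
                                         * may have missed notices */
}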
Nicolas Williams wrote:
> The health network will allow for eviction notices to be spread around
> the cluster quickly.
>
> [...]

I think this can be easily implemented as a bitmap on every server
(both OSS and MDS) that keeps track of the alive and evicted clients.
Once you have processed all the work that needs to be done for the
client that has been evicted, that bit in the evicted-clients map is
cleared. Instead of sending information about every client, servers
can exchange the bitmaps, and doing an XOR on the bitmaps would allow
you to easily see the discrepancies in client evictions.

I think the same idea of bitmaps can be implemented for objects, if we
want to track which client is updating the object/stripe.

Dmitry
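A toy illustration of the bitmap comparison, assuming a dense
per-client bit index (hypothetical; real code would map client UUIDs
to bits):

#include <stdint.h>
#include <stdio.h>

#define MAX_CLIENTS 256
#define WORDS (MAX_CLIENTS / 64)

/* Report every client whose eviction state differs between two
 * servers' bitmaps: these are the evictions still to be propagated. */
static void report_eviction_discrepancies(const uint64_t a[WORDS],
                                          const uint64_t b[WORDS])
{
        for (int w = 0; w < WORDS; w++) {
                uint64_t diff = a[w] ^ b[w];    /* differing bits only */
                while (diff != 0) {
                        /* GCC/Clang builtin: index of lowest set bit */
                        int bit = __builtin_ctzll(diff);
                        printf("client %d: eviction state differs\n",
                               w * 64 + bit);
                        diff &= diff - 1;       /* clear that bit */
                }
        }
}

int main(void)
{
        uint64_t oss_map[WORDS] = { 0 }, mds_map[WORDS] = { 0 };

        mds_map[0] |= 1ULL << 5;   /* MDS evicted client 5 ... */
                                   /* ... but the OSS hasn't heard */
        report_eviction_discrepancies(oss_map, mds_map);
        return 0;
}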
On Jan 6, 2010, at 12:50 AM, Andreas Dilger wrote:
> This brings up an interesting question. When the client does a lookup
> on a file, or first opens it, the client gets the cached size from the
> MDS (assuming the SOM cache is valid). However, after this initial
> update, what guarantee does the client have that the size is still
> valid? Must it do further MDS getattr or OST glimpse operations in
> order to revalidate the size? I don't recall any lock bit that the
> MDS gives the client that tells the client that the file size it has
> is still valid.

The MDS returns the size to the client if it gives the client the
UPDATE lock and the SOM cache is valid (the file is not opened and the
SOM cache has been rebuilt). If the file is opened for write, the MDS
revokes the UPDATE locks given to clients, provided the SOM cache is
valid at the time of the open (thus a 2nd open will not revoke them).

If the client gets the size from the MDS, it considers it valid as
long as the current UPDATE lock is cached on the client. If the client
gets no size from the MDS (i.e. for the next getattr when the file is
already opened for write), it has to go to the OSTs for their part of
the attributes.

> In this regard, it seems that SOM would only provide an improvement on
> the initial "ls -l" operation, and subsequent "ls -l" operations would
> be slower than the current "readdir + statahead + DLM lock cache"
> (which would not need to do any RPCs for the second "ls -l").

Subsequent "ls -l" operations will be faster because we do not need to
re-obtain attributes from the MDS if we have the UPDATE lock. As SOM
cancels this lock on open for write and on attribute update, an extra
RPC will be sent only on the very first getattr after either of these
two cases occurs; subsequent "ls -l" operations while the SOM cache
stays enabled or disabled involve no extra RPC.

--
Vitaly
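A sketch of the client-side getattr path Vitaly describes, with
invented names; the real client uses DLM lock handles and inode flags
rather than these fields:

#include <stdbool.h>

/* Hypothetical client-side per-inode state. */
struct client_inode {
        bool   update_lock_held; /* cached UPDATE lock from the MDS */
        bool   size_from_mds;    /* size arrived with the UPDATE lock */
        long   cached_size;
};

long ost_glimpse_size(struct client_inode *ino);    /* stub below */

/* getattr path as described above: the cached size stays valid
 * exactly as long as the UPDATE lock does. */
long client_getattr_size(struct client_inode *ino)
{
        if (ino->update_lock_held && ino->size_from_mds)
                return ino->cached_size;        /* no RPC needed */

        /* Either the MDS withheld the size (file open for write) or
         * the UPDATE lock was revoked: ask the OSTs. */
        return ost_glimpse_size(ino);
}

long ost_glimpse_size(struct client_inode *ino)
{
        (void)ino;
        return -1;      /* stand-in for per-stripe glimpse RPCs */
}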
On Jan 5, 2010, at 9:39 PM, Eric Barton wrote:
> Some thoughts on SOM safety...
>
> [...]
>
> 1. Timeouts
>
>    [...]
>
>    2. Guaranteeing all in-flight stripe updates have completed within
>       the timeout is tricky. It requires a maximum latency bound
>       either from LNET or ptlrpc.

It also requires time synchronisation between nodes to check whether
an RPC has been in flight for too long; this is not mandatory yet
AFAIK, and it looks poorly controllable...
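One way to read this caveat: a server-side age check on an incoming
RPC is only meaningful if client and server clocks agree. A
hypothetical sketch:

#include <time.h>
#include <stdbool.h>

/* Reject a stripe update that has been in flight longer than the
 * agreed timeout.  The client stamps the RPC with its own clock, so
 * the comparison only works if client and server clocks are
 * synchronised - the dependency raised above. */
bool rpc_too_old(time_t client_send_time, time_t timeout)
{
        return time(NULL) - client_send_time > timeout;
}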
> [...]
>
>       Alternatively, key state could be stored persistently on the
>       OST so that recovery could use existing code to replay
>       uncommitted key updates from the MDS.

We will need to give the OSTs fresh identifiers, because an OST could
be down long enough for them to have changed.
Another possibility is not to try to get a common solution for MDS
failover and client eviction if it becomes complex, but instead to get
2 separate simple solutions for them.

For MDS failover it could be as simple as:

1.1. What you described in ordered keys;

1.2. What I described in the SC09 presentation with an IOEpoch
     delimiter;

1.3. It could be just an MDS mountid; this unique identifier is
     propagated to the OSTs on MDS-OST synchronisation, and an OST
     compares it with the id packed into IO from clients.

Client eviction could be resolved through incrementing some per-file
identifier in the OST objects, stored in an EA. It could be:

2.1. IOEpoch; the MDS can send notifications to the OSTs in the
     stripeset by itself (batched to save on RPC count); we will have:
     - extra load on the MDS, but in the client eviction case only;
     - an extra write on the OST, but in the client eviction case only;
     - extra RPCs between MDS & OST, could be batched so not a problem;
     - separate mechanisms for MDS failover & client eviction;

2.2. IOEpoch; the client may do it for us on obd_getattr when the MDS
     asks it to rebuild the cache; the client will have to wait until
     the IOEpoch has committed to the EA and then send a setattr to
     the MDS (batched to save on RPC count); we will have:
     - an extra write on the OST, but in the client eviction case only;
     - extra RPCs between client & MDS, could be batched so not a
       problem;
     - separate mechanisms for MDS failover & client eviction;
     + the client needs to gather attributes and merge them anyway, so
       it is a natural solution to send them to the MDS as well;

     Note: this could be a common solution for MDS failover as well,
     but there would be no way to distinguish between SOM-disabled and
     SOM-recovery inodes, i.e. for each inode with disabled SOM we
     would propagate the IOEpoch to the OSTs on getattr and write it
     to the EA; in this case:
     - an extra write on the OST, 1 per file stripe; does it look
       dangerous?
     - extra RPCs between client & MDS, could be batched so not a
       problem;
     + common mechanisms for MDS failover & client eviction;
     + the client needs to gather attributes and merge them anyway, so
       it is a natural solution to send them to the MDS as well;

2.3. Layout generation. Can we re-use the Layout Lock mechanism here,
     blocking IO from previous generations? If so, this way seems
     cheap.
     - extra load on the MDS, but in the client eviction case only;
     - an extra write on the OST, but in the client eviction case only;
     - extra RPCs between MDS & OST, could be batched so not a problem;
     + a common solution for layout lock & client eviction;
     + the layout lock is being done for HSM, so it is the cheapest
       solution, isn't it?

2.4. Client-MDS connection ID (suggested above by Eric).
     - extra load on the MDS, but in the client eviction and OST
       failover cases only;
     - extra RPCs between MDS & OST, could be batched so not a problem;
     + common mechanisms for MDS failover & client eviction;

--
Vitaly
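To illustrate the simplest of these, option 1.3, here is a
hypothetical sketch of the mountid check on the OST (invented names;
as Vitaly notes, this addresses MDS failover only, not client
eviction):

#include <stdint.h>

/* Hypothetical: the MDS "mountid" changes on every MDS restart or
 * failover and is pushed to each OST during MDS-OST
 * synchronisation. */
static uint64_t ost_known_mountid;

/* Called during MDS-OST synchronisation after an MDS restart. */
void ost_sync_mountid(uint64_t mds_mountid)
{
        ost_known_mountid = mds_mountid;
}

/* Every IO carries the mountid the client got from the MDS it is
 * currently connected to.  IO from a pre-failover connection carries
 * the old mountid and is rejected, forcing the client to
 * reconnect. */
int ost_check_io(uint64_t io_mountid)
{
        return io_mountid == ost_known_mountid ? 0 : -1;
}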