Hello,

here is a WBC HLD outline. Please take a look.

==============================================
WBC HLD OUTLINE

* Definitions

WBC (MD WBC): (Meta Data) Write Back Cache.

MD operation: a whole MD operation over an object:
rename/create/open/close/setattr/getattr/link/unlink/mkdir/rmdir +
readdir.

Reintegration: the process of applying accumulated MD operations to the
MD servers.

MDS/RAW: MDS API extension to do "raw" fs operations: inserting a dir
entry without creating an inode, and so on.

MD update: the part of an MD operation to be executed on one server;
contains one or more MDS/RAW operations.

MD batch: a collection of per-server MD updates.

MDTR: MD translator: translates MD operations into MDS/RAW ones.

* Requirements

A client application is able to create 64k files/second.

Reintegration moves the fs from one consistent state to another
consistent state.

Non-WBC client support without visible overhead.

Avoid MDS code rewrite if possible.

* Design outline

** Overall picture

    [Application]
          |
      = syscalls =
          |
          V
        [VFS]
          |
      = vfs hooks =
          |
          V
     [LLITE/MDC]
          |
      = MD (non-WBC) proto =
          |
          V
    [MD CACHE MANAGER] ---> [LDLM]
          |
          V
        [MDTR]
          +------------+-----------+
          |            |           |
      ========= WBC proto =========
          |            |           |
          V            V           V
     [MDS1/RAW]   [MDS2/RAW]  [MDS3/RAW]

** WBC

A WBC client has an MDTR running on the client side; it can also be a
proxy server, acting as a server for non-WBC clients and as a client
for the MD servers.

*** WBC vs non-WBC

When processing an MD operation request (lock enqueue + op intent, per
Alex's suggestion), the MD server may decide to execute it by itself,
or grant only a lock (a subtree one) and allow the client to continue
in WBC mode.

*** Locks

Needed LDLM locks are taken before an operation starts and held until
the corresponding batch is re-integrated.

*** Local cache management

The WBC client executes operations locally, modifying local in-memory
objects. The WBC client keeps a (redo-)log of all operations.

The cache manager controls the process of MD cache re-integration.

*** MDS/RAW operations

Managing directory entries and inodes, without maintaining fs
consistency automatically.

create/update/delete methods for directory entries and inodes.

*** MDTR

The MDTR is responsible for converting MD operations into a set of
per-server MDS/RAW operations.

*** Client re-integration

Periodically, or because of (sub-)lock releasing, dirty memory flushing
or so, the WBC client submits batches to all MD servers involved in the
operations.

The process of re-integration is protected by LDLM locks. The MD
servers are updated using the WBC protocol.

*** WBC protocol

A WBC request contains a set of MDS/RAW operations, tagged with one
epoch number. Bulk transfers are used.
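To make the update/batch terms concrete, here is a rough sketch of what
a reintegration batch might look like on the wire. All structure and
field names are invented for illustration (this is not an existing
Lustre API), and the epoch tag is shown per update, anticipating the
point made later in this thread that a batch need not be single-epoch:

    /* Illustrative wire layout only -- not an existing Lustre API.
     * __u32/__u64 as in <linux/types.h>. */

    struct wbc_raw_op {             /* one MDS/RAW operation */
            __u32 ro_opcode;        /* e.g. insert dir entry, create
                                     * inode, delete dir entry, ... */
            __u32 ro_buflen;        /* length of packed arguments:
                                     * fids, name, attributes follow */
    };

    struct wbc_update {             /* per-server part of one MD op */
            __u64 u_epoch;          /* epoch the update belongs to */
            __u32 u_nr_raw_ops;     /* wbc_raw_op records that follow */
            __u32 u_padding;
    };

    struct wbc_batch {              /* all updates for one MD server;
                                     * moved by a bulk transfer */
            __u64 b_batch_id;
            __u32 b_nr_updates;     /* wbc_update records that follow */
            __u32 b_padding;
    };

Packing variable-length payloads after fixed headers keeps the batch a
single contiguous buffer, which is what makes the bulk transfer cheap.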
*** File data

Flushing file data to the OST servers is delayed until the file
creation is re-integrated.

*** Recovery

The redo-log is preserved until it is no longer needed for recovery
(i.e. the epoch gets stable).

The client replays the log and re-executes all operations from it,
repeating the MDTR processing (dispatching the operations between MD
servers).

**** WBC client eviction, uncompleted updates

If the client dies before re-integration is completed, there are three
choices:

a) Cluster-wide rollback: all servers roll back to the last globally
stable epoch, then clients replay their redo-logs.

This scenario should be avoided because a single client failure may
stop the whole cluster for recovery.

b) All servers participating in re-integration coordinate to undo
uncompleted updates.

c) The servers have all information needed to complete re-integration
without the client.

The recovery strategy is a subject of the CMD Recovery Design document,
but the possibility of (c) needs support in the WBC protocol.

** non-WBC

*** MD protocol

The MD (non-WBC) protocol remains the same as now.

** Use cases

*** WBC / non-WBC decision

1. Check whether the server and client can operate in WBC-mode through
connect flags.

2. If they can, a lock enqueue request may contain a request for
WBC-mode; the server may respond by granting WBC-mode and an STL or PW
lock on the directory. The MD server accepts or rejects the WBC-mode
request depending on server rules and per-object access statistics.

*** File creation

The client gets a PW lock on the directory.

The client fetches the directory content.

The client does the file creation locally, in cache; the operation
record is added to the client redo-log.

Another client wants to read the directory; the lock conflict triggers
re-integration.

The MD cache manager processes the redo-log, prepares batches with
MDS/RAW operations and submits them to the MD servers.

The MD servers integrate the batches.

The MD cache manager frees the local cache content and cancels the
directory lock.

** Questions

Q: Can several WBC clients work in one directory simultaneously?
A: If extent locks for directories are implemented, each WBC client
can take a lock on a hash interval.

Q: Can WBC clients do massive file creation in one directory
efficiently?
A: An idea that may help: if we can guess that the file names created
by a client are lexicographically ordered, a special hash function may
reduce lock conflicts between clients holding locks on directory
extents (see the sketch below).
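A minimal sketch of such a hash, assuming created names are
lexicographically ordered; the function name and its use are purely
illustrative, not an existing directory hash:

    #include <linux/types.h>

    /* Illustrative only: an order-preserving "hash" -- the first eight
     * bytes of the name taken as a big-endian integer.  Clients that
     * create lexicographically ordered names then map into mostly
     * disjoint hash intervals, so PW extent locks on [start, end)
     * ranges of the directory rarely conflict. */
    static __u64 wbc_name_hash_ordered(const char *name, size_t len)
    {
            __u64 h = 0;
            size_t i;

            for (i = 0; i < 8; i++)
                    h = (h << 8) | (i < len ? (unsigned char)name[i] : 0);
            return h;
    }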
Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems


Hi Zam,

On Mar 23, 2009, at 14:58, Alexander Zarochentsev wrote:

> MD update: the part of an MD operation to be executed on one server;
> contains one or more MDS/RAW operations.

Why does the client need to be more granular than an update? It seems
MDS/RAW and update should be the same.

> MD batch: a collection of per-server MD updates.
>
> MDTR: MD translator: translates MD operations into MDS/RAW ones.

Isn't this essentially what the CMM is doing today? (Breaking down
distributed operations into per-node updates?) Are you expanding on
Alex's idea of creating a new generic MD server stack?

> *** WBC protocol
>
> A WBC request contains a set of MDS/RAW operations, tagged with one
> epoch number. Bulk transfers are used.

All the updates in a single operation must have the same epoch, but I
don't think we can guarantee that all the operations in a batch will be
in the same epoch, unless we stop exchanging messages with all the MD
servers. I don't see a need for them to be in the same epoch, either.
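A sketch of what that means on the receiving side: the server keys its
undo state by the epoch carried in each update rather than by the
batch, so mixed-epoch batches are harmless. All function names here are
hypothetical:

    /* Hypothetical server-side reintegration loop; wbc_batch_update(),
     * mds_undo_log() and mds_apply_update() are invented names. */
    int mds_process_batch(struct mds_device *mds, struct wbc_batch *batch)
    {
            int i, rc;

            for (i = 0; i < batch->b_nr_updates; i++) {
                    struct wbc_update *u = wbc_batch_update(batch, i);

                    /* undo records grouped per (client, epoch), not
                     * per batch */
                    rc = mds_apply_update(mds, u,
                                          mds_undo_log(mds, u->u_epoch));
                    if (rc)
                            return rc;      /* abort: reintegration of
                                             * the batch must stay
                                             * atomic */
            }
            return 0;
    }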
> *** Recovery
>
> The redo-log is preserved until it is no longer needed for recovery
> (i.e. the epoch gets stable).
>
> The client replays the log and re-executes all operations from it,
> repeating the MDTR processing (dispatching the operations between MD
> servers).

Since the MD servers all roll back before recovery, recovery will be
very similar to the original reintegration, with the exception of using
versions. So we should try to keep the recovery (replay) code as
similar to the normal code as possible, and move recovery higher into
the stack.

> c) The servers have all information needed to complete re-integration
> without the client.

You mean by keeping the original operation info in the undo logs?

cheers,
robert
>>>>> Alexander Zarochentsev (AZ) writes:

 AZ> MDS/RAW: MDS API extension to do "raw" fs operations: inserting a
 AZ> dir entry without creating an inode, and so on.

this seems to be a duplication of the OSD API's insert/delete/etc.

 AZ> MDTR: MD translator: translates MD operations into MDS/RAW ones.

and this one seems to duplicate MDD code.

why would we want to duplicate these things?

thanks, Alex
On 24 March 2009 02:17:33 Robert Read wrote:
> Hi Zam,
>
> > MD update: the part of an MD operation to be executed on one
> > server; contains one or more MDS/RAW operations.
>
> Why does the client need to be more granular than an update? It
> seems MDS/RAW and update should be the same.

Well, better to say that an update is an MDS op if the operation
touches only one MD server, and an MDS/RAW op in the case of a
distributed operation.

> > MD batch: a collection of per-server MD updates.
> >
> > MDTR: MD translator: translates MD operations into MDS/RAW ones.
>
> Isn't this essentially what the CMM is doing today? (Breaking down
> distributed operations into per-node updates?) Are you expanding on
> Alex's idea of creating a new generic MD server stack?

I just doubt that CMM code reuse is worth relayering the MD stack. Can
it be done as a subtask later?

> > *** WBC protocol
> >
> > A WBC request contains a set of MDS/RAW operations, tagged with one
> > epoch number. Bulk transfers are used.
>
> All the updates in a single operation must have the same epoch, but I
> don't think we can guarantee that all the operations in a batch will
> be in the same epoch, unless we stop exchanging messages with all the
> MD servers. I don't see a need for them to be in the same epoch,
> either.

You are right.

> Since the MD servers all roll back before recovery, recovery will be
> very similar to the original reintegration, with the exception of
> using versions. So we should try to keep the recovery (replay) code
> as similar to the normal code as possible, and move recovery higher
> into the stack.

OK.

> > c) The servers have all information needed to complete
> > re-integration without the client.
>
> You mean by keeping the original operation info in the undo logs?

I meant that the servers receive not updates but whole operations. If
the client failed and didn't send an update to some of the servers, the
operation can still be completed without the client. It is an
alternative to undoing partial updates.

Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
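To illustrate what "servers receive whole operations" could mean for
the protocol, a hypothetical reintegration record might carry the
complete operation plus routing context, so any server's copy is enough
to finish the operation without the client. All names below are
invented:

    /* Hypothetical record for choice (c): the full MD operation is
     * shipped to every involved server; op_target selects which
     * per-server update this copy was routed for. */
    struct wbc_op_record {
            __u64 op_epoch;
            __u32 op_opcode;        /* rename, link, mkdir, ... */
            __u32 op_target;        /* index of the update this server
                                     * should apply locally */
            __u32 op_nr_targets;    /* how many servers are involved */
            __u32 op_buflen;        /* packed fids/names/attrs follow:
                                     * enough to repeat the MDTR split
                                     * on the server side */
    };

The cost is that every involved server must re-run the MDTR split to
find its own part, which is the extra CPU processing Alex objects to
below.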
>>>>> Alexander Zarochentsev (AZ) writes:

 AZ> well, better to say that an update is an MDS op if the operation
 AZ> touches only one MD server, and an MDS/RAW op in the case of a
 AZ> distributed operation.

I think this just adds an unneeded entity to the system. Stating that
we either have updates or operations is simpler.

 AZ> I just doubt that CMM code reuse is worth relayering the MD stack.
 AZ> Can it be done as a subtask later?

I don't think CMM is the right thing, because it essentially breaks
layering: instead of sending an object creation request in terms of the
OSD API, or an index insert in terms of the OSD API, it introduces some
intermediate thing which is neither an operation nor an update.

 AZ> I meant that the servers receive not updates but whole operations.
 AZ> If the client failed and didn't send an update to some of the
 AZ> servers, the operation can still be completed without the client.
 AZ> It is an alternative to undoing partial updates.

the same can be done with updates if you send them through a single
server, and then you don't need to spend additional CPU processing to
parse an operation into updates.

thanks, Alex
On 25 March 2009 11:33:12 Alex Zhuravlev wrote:
>
> I don't think CMM is the right thing, because it essentially breaks
> layering: instead of sending an object creation request in terms of
> the OSD API, or an index insert in terms of the OSD API, it
> introduces some intermediate thing which is neither an operation nor
> an update.

The server MD stack has to support both WBC and non-WBC clients for the
same objects. That is why I think the MDT layer should handle MD ops as
well as MDS/RAW ops. Then the CMM only passes RAW operations to the MDD
layer, where raw ops are already supported.

Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
>>>>> Alexander Zarochentsev (AZ) writes:

 AZ> The server MD stack has to support both WBC and non-WBC clients
 AZ> for the same objects. That is why I think the MDT layer should
 AZ> handle MD ops as well as MDS/RAW ops. Then the CMM only passes
 AZ> RAW operations to the MDD layer, where raw ops are already
 AZ> supported.

then I don't understand what you mean by CMM. same about RAW
operations.

--
thanks, Alex
>>>>> Alexander Zarochentsev (AZ) writes:

 AZ> The server MD stack has to support both WBC and non-WBC clients
 AZ> for the same objects. That is why I think the MDT layer should
 AZ> handle MD ops as well as MDS/RAW ops. Then the CMM only passes
 AZ> RAW operations to the MDD layer, where raw ops are already
 AZ> supported.

btw, what's the problem with supporting WBC and non-WBC clients for the
same objects? any time you access some object via the short path
(MDT-OSD, for a WBC client) or the long path (MDT-MDD-OSD, for a
non-WBC client), it's initialized at all layers (MDT-MDD-OSD).

--
thanks, Alex
Zam,

Some notes on the WBC HLD outline:

1. The requirement is for 32K creates/second on one node, of small
files with a random size of up to 64K. It's basically HPCS IO
Scenario 4.

2. Reintegration must change the filesystem from one consistent state
to another consistent state _atomically_.

3. Not all the updates in a batch for one server need to have the same
epoch number - i.e. being forced to advance your epoch (e.g. because
you acquired a lock) doesn't force you to create a new batch. I think
this got mentioned in other emails.

4. Most readers won't know what "bulk transfers are used" for batches.

5. Is ensuring file data is delayed until file creation is reintegrated
sufficient for correct operation? Are we not effectively doing
create-on-write with a WBC? I'm sure there are more issues
(e.g. orphans).

Does including the OSTs in epoch recovery solve all the issues? If so,
what are the expected bounds on client redo and server undo storage?
Can we avoid needing server undo for data with some compromises? Can we
exploit the DMU at all?

6. The section on recovering from WBC client death seems imprecise. Is
(a) just describing V1-4 in Nikita's original post - similarly (b) for
V1-2, V3'-5'? Also, for (c) I think we may have discussed the
possibility of always sending updates as the full operation + context
to select which updates apply locally, so that an operation can always
be recovered from any of its updates.

Cheers,
Eric
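Eric's point 3, sketched on the client side (all function and field
names invented): advancing the epoch only changes the tag on
subsequently logged operations; the batch being accumulated stays open:

    /* Hypothetical client-side logging path. */
    void wbc_advance_epoch(struct wbc_client *cli)
    {
            cli->wc_epoch++;        /* note: no "close current batch"
                                     * step here */
    }

    void wbc_log_operation(struct wbc_client *cli, struct wbc_op *op)
    {
            op->op_epoch = cli->wc_epoch;   /* per-op epoch tag */
            list_add_tail(&op->op_list, &cli->wc_cur_batch->b_ops);
    }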
Hello Eric,

Thanks for the review.

On 1 April 2009 12:17:17 Eric Barton wrote:
> Zam,
>
> Some notes on the WBC HLD outline

[...]

> 5. Is ensuring file data is delayed until file creation is
> reintegrated sufficient for correct operation? Are we not
> effectively doing create-on-write with a WBC? I'm sure there are
> more issues (e.g. orphans).
>
> Does including the OSTs in epoch recovery solve all the issues?
> If so, what are the expected bounds on client redo and server undo
> storage? Can we avoid needing server undo for data with some
> compromises? Can we exploit the DMU at all?

I think we can't avoid tagging OST object creation with an epoch
counter. Would Lustre users complain if file writes are out-of-epochs?

So a write to an existing OST object may survive losing the context of
the MD operations in which the write was issued; object
creation/deletion may not.

The alternative is to implement undo logging for file data. It would
require support from the underlying server fs. It could be done for
ldiskfs; I am not sure about the DMU.

There is a security problem with out-of-epoch writes and setting file
attributes (especially permissions):

  chmod 400 foo; cat /etc/secret-file >> foo

Chmod/chown can be a special case which triggers a WBC flush.

> 6. The section on recovering from WBC client death seems imprecise.
> Is (a) just describing V1-4 in Nikita's original post - similarly
> (b) for V1-2, V3'-5'? Also, for (c) I think we may have discussed
> the possibility of always sending updates as the full operation +
> context to select which updates apply locally, so that an operation
> can always be recovered from any of its updates.

It is only a rough schema of client eviction, to list what support
might be needed in the WBC protocol, like sending the full MD op
instead of an update - what you just mentioned. BTW, I thought the
Epochs HLD would cover the detailed algorithm descriptions, no?

Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
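The special case Zam mentions could be as simple as the following hook.
This is a sketch under the assumption that a full WBC flush
(reintegration plus data write-out) is available as one call;
wbc_flush() and wbc_setattr_cached() are invented names:

    /* Hypothetical setattr path on a WBC client: permission/ownership
     * changes reintegrate and flush first, so no out-of-epoch data
     * write can land after the stricter attributes become visible. */
    int wbc_setattr(struct inode *inode, struct iattr *attr)
    {
            if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID)) {
                    int rc = wbc_flush(inode);  /* reintegrate MD +
                                                 * flush file data */
                    if (rc)
                            return rc;
            }
            return wbc_setattr_cached(inode, attr);
    }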
On Apr 06, 2009 13:23 +0300, Alexander Zarochentsev wrote:
> On 1 April 2009 12:17:17 Eric Barton wrote:
>
> I think we can't avoid tagging OST object creation with an epoch
> counter. Would Lustre users complain if file writes are
> out-of-epochs?
>
> There is a security problem with out-of-epoch writes and setting
> file attributes (especially permissions):
>
>   chmod 400 foo; cat /etc/secret-file >> foo
>
> Chmod/chown can be a special case which triggers a WBC flush.

While this example has been given many times as a security issue that
forces many strange actions on the part of Lustre, the example is
fundamentally broken, because POSIX allows "foo" to be opened before
the chmod and kept open until after the write, and the "secret-file"
content can then be read. The "foo" file needs to be created securely
in the first place to be safe.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
>>>>> Andreas Dilger (AD) writes:

 AD> While this example has been given many times as a security issue
 AD> that forces many strange actions on the part of Lustre, the
 AD> example is fundamentally broken, because POSIX allows "foo" to be
 AD> opened before the chmod and kept open until after the write, and
 AD> the "secret-file" content can then be read. The "foo" file needs
 AD> to be created securely in the first place to be safe.

yup, and there is no way in POSIX to even check whether a file is
opened.

my take on this and similar security-related issues is that we probably
should provide two modes:

1) strict, when no optimizations in the order of flush are done;

2) relaxed, when the order is not guaranteed and the user should use
some form of sync, but Lustre can improve performance.

--
thanks, Alex
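A sketch of how the two modes might surface as a per-client tunable;
the names are invented, not an existing Lustre option:

    /* Hypothetical flush-ordering modes. */
    enum wbc_flush_order {
            WBC_FLUSH_STRICT,       /* data written out before any
                                     * metadata update that would make
                                     * it visible or less protected */
            WBC_FLUSH_RELAXED,      /* no ordering guarantee; callers
                                     * use fsync()/sync() where the
                                     * ordering matters */
    };

    static inline bool wbc_ordered_flush(const struct wbc_client *cli)
    {
            return cli->wc_flush_order == WBC_FLUSH_STRICT;
    }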
2009/4/7 Alex Zhuravlev <bzzz at sun.com>:

Hello,

> AD> While this example has been given many times as a security issue
> AD> that forces many strange actions on the part of Lustre, the
> AD> example is fundamentally broken [...] The "foo" file needs to be
> AD> created securely in the first place to be safe.

the original "partial write-back" problem was demonstrated with the use
case:

  $ mkdir -m 0700 a         # nobody but me can access things under "a"
  $ umask 000
  $ mkdir -m 0777 -p a/b/c/d
  $ echo "secret data" > a/b/c/d/file
  $ sync
  # time passes...
  $ echo > a/b/c/d/file     # truncate secret data
  $ chmod 777 a             # relax permissions

Note that here an ordering between data and meta-data updates on
_different_ objects is important.

> yup, and there is no way in POSIX to even check whether a file is
> opened.
>
> my take on this and similar security-related issues is that we
> probably should provide two modes:
> 1) strict, when no optimizations in the order of flush are done;
> 2) relaxed, when the order is not guaranteed and the user should use
> some form of sync, but Lustre can improve performance.

The old (and outdated) WBC HLD has a section "Partial write-out"
describing these issues.

Nikita.
Hello Nikita!

On 7 April 2009 11:50:29 Nikita Danilov wrote:

[...]

> $ echo > a/b/c/d/file     # truncate secret data
> $ chmod 777 a             # relax permissions
>
> Note that here an ordering between data and meta-data updates on
> _different_ objects is important.

If we only guarantee no reordering of MD updates, Lustre's behavior
would be like ext3 without data journalling. I think that is not
terrible.

--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
Hello!

On Apr 7, 2009, at 2:30 AM, Alex Zhuravlev wrote:

> yup, and there is no way in POSIX to even check whether a file is
> opened.

I do not know if file leases are POSIX or not (and cannot check right
now), but they do in fact allow you not only to ensure the file is not
opened in a certain mode, but also to get notified when somebody
attempts to open a file on which you have obtained such a lease.

Bye,
Oleg
2009/4/8 Alexander Zarochentsev <Alexander.Zarochentsev at sun.com>:

> Hello Nikita!
>
[...]
> If we only guarantee no reordering of MD updates, Lustre's behavior
> would be like ext3 without data journalling. I think that is not
> terrible.

It's not terrible, but it is non-intuitive, in my opinion. More
enlightened file systems, like ZFS, reiser4, and NTFS, provide stronger
consistency guarantees, ignoring the petty distinctions between data
and meta-data. :-)

But even limiting consistency to meta-data leaves some issues open. For
example, think about an MD proxy server acting as a WBC client for a
higher-tier server. To be efficient, such a proxy might need to cache a
very large amount of meta-data, and it most likely cannot afford to
keep a log of all operations. In this situation, when a lock on a
top-level directory gets a blocking AST, the proxy would have to write
back all cached dirty meta-data under this directory before the lock
can be cancelled (to guarantee the ordering of visible meta-data
updates), which might result in unacceptable latency.

Nikita.
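Nikita's latency concern, restated as a sketch of the blocking-AST path
on a WBC client or proxy (all function names invented): the write-back
done here is proportional to the amount of dirty meta-data cached under
the contended subtree, which for a large proxy cache is effectively
unbounded:

    /* Hypothetical blocking-AST handler on a WBC client/proxy. */
    int wbc_blocking_ast(struct wbc_client *cli, struct wbc_lock *lock)
    {
            struct wbc_subtree *st = wbc_lock_subtree(lock);
            int rc;

            rc = wbc_reintegrate(cli, st);  /* may submit many batches:
                                             * this is the unbounded
                                             * latency */
            if (rc)
                    return rc;

            wbc_cache_drop(cli, st);        /* discard now-clean cache */
            return wbc_lock_cancel(cli, lock);
    }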