Hello,

here is a WBC HLD outline. Please take a look.

==============================================
WBC HLD OUTLINE

* Definitions

WBC (MD WBC): (Meta Data) Write Back Cache.

MD operation: a whole MD operation over an object:
rename/create/open/close/setattr/getattr/link/unlink/mkdir/rmdir +
readdir.

Reintegration: the process of applying accumulated MD operations to the
MD servers.

MDS/RAW: MDS API extension to do "raw" fs operations: inserting a dir
entry without creating an inode, and so on.

MD update: the part of an MD operation to be executed on one server;
contains one or more MDS/RAW operations.

MD batch: a collection of per-server MD updates.

MDTR: MD translator: translates MD operations into MDS/RAW ones.

* Requirements

A client application is able to create 64k files/second.

Reintegration moves the fs from one consistent state to another
consistent state.

Non-WBC client support without visible overhead.

Avoid MDS code rewrite if possible.

* Design outline

** Overall picture

    [Application]
          |
      = syscalls =
          |
          V
        [VFS]
          |
      = vfs hooks =
          |
          V
     [LLITE/MDC]
          |
      = MD (non-WBC) proto =
          |
          V
    [MD CACHE MANAGER] ---> [LDLM]
          |
          V
        [MDTR]
          +------------+-----------+
          |            |           |
      ========= WBC proto =========
          |            |           |
          V            V           V
     [MDS1/RAW]   [MDS2/RAW]  [MDS3/RAW]

** WBC

A WBC client has an MDTR running on the client side; it can also be a
proxy server, acting as a server for non-WBC clients and as a client
for the MD servers.

*** WBC vs non-WBC

When processing an MD operation request (lock enqueue + op intent, per
Alex's suggestion), the MD server may decide to execute it by itself,
or grant only a lock (a subtree one) and allow the client to continue
in WBC mode.

*** Locks

Needed LDLM locks are taken before an operation starts and held until
the corresponding batch is re-integrated.

*** Local cache management

The WBC client executes operations locally, modifying local in-memory
objects. The WBC client keeps a (redo-)log of all operations.

The cache manager controls the process of MD cache re-integration.

*** MDS/RAW operations

Managing directory entries and inodes, without maintaining fs
consistency automatically.

create/update/delete methods for directory entries and inodes.

*** MDTR

The MDTR is responsible for converting MD operations into a set of
per-server MDS/RAW operations.

*** Client re-integration

Periodically, or because of (sub-)lock releasing, dirty memory flushing
or so, the WBC client submits batches to all MD servers involved in the
operations.

The process of re-integration is protected by LDLM locks. The MD
servers are updated using the WBC protocol.

*** WBC protocol

A WBC request contains a set of MDS/RAW operations, tagged with one
epoch number. Bulk transfers are used.
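To make the update/batch terms concrete, here is a rough sketch of what
a reintegration batch might look like on the wire. All structure and
field names are invented for illustration (this is not an existing
Lustre API), and the epoch tag is shown per update, anticipating the
point made later in this thread that a batch need not be single-epoch:

    /* Illustrative wire layout only -- not an existing Lustre API.
     * __u32/__u64 as in <linux/types.h>. */

    struct wbc_raw_op {             /* one MDS/RAW operation */
            __u32 ro_opcode;        /* e.g. insert dir entry, create
                                     * inode, delete dir entry, ... */
            __u32 ro_buflen;        /* length of packed arguments:
                                     * fids, name, attributes follow */
    };

    struct wbc_update {             /* per-server part of one MD op */
            __u64 u_epoch;          /* epoch the update belongs to */
            __u32 u_nr_raw_ops;     /* wbc_raw_op records that follow */
            __u32 u_padding;
    };

    struct wbc_batch {              /* all updates for one MD server;
                                     * moved by a bulk transfer */
            __u64 b_batch_id;
            __u32 b_nr_updates;     /* wbc_update records that follow */
            __u32 b_padding;
    };

Packing variable-length payloads after fixed headers keeps the batch a
single contiguous buffer, which is what makes the bulk transfer cheap.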
*** File data

Flushing file data to the OST servers is delayed until the file
creation is re-integrated.

*** Recovery

The redo-log is preserved until it is no longer needed for recovery
(i.e. the epoch gets stable).

The client replays the log and re-executes all operations from it,
repeating the MDTR processing (dispatching the operations between MD
servers).

**** WBC client eviction, uncompleted updates

If the client dies before re-integration is completed, there are three
choices:

a) Cluster-wide rollback: all servers roll back to the last globally
stable epoch, then clients replay their redo-logs.

This scenario should be avoided because a single client failure may
stop the whole cluster for recovery.

b) All servers participating in re-integration coordinate to undo
uncompleted updates.

c) The servers have all information needed to complete re-integration
without the client.

The recovery strategy is a subject of the CMD Recovery Design document,
but the possibility of (c) needs support in the WBC protocol.

** non-WBC

*** MD protocol

The MD (non-WBC) protocol remains the same as now.

** Use cases

*** WBC / non-WBC decision

1. Check whether the server and client can operate in WBC-mode through
connect flags.

2. If they can, a lock enqueue request may contain a request for
WBC-mode; the server may respond by granting WBC-mode and an STL or PW
lock on the directory. The MD server accepts or rejects the WBC-mode
request depending on server rules and per-object access statistics.

*** File creation

The client gets a PW lock on the directory.

The client fetches the directory content.

The client does the file creation locally, in cache; the operation
record is added to the client redo-log.

Another client wants to read the directory; the lock conflict triggers
re-integration.

The MD cache manager processes the redo-log, prepares batches with
MDS/RAW operations and submits them to the MD servers.

The MD servers integrate the batches.

The MD cache manager frees the local cache content and cancels the
directory lock.

** Questions

Q: Can several WBC clients work in one directory simultaneously?
A: If extent locks for directories are implemented, each WBC client
can take a lock on a hash interval.

Q: Can WBC clients do massive file creation in one directory
efficiently?
A: An idea that may help: if we can guess that the file names created
by a client are lexicographically ordered, a special hash function may
reduce lock conflicts between clients holding locks on directory
extents (see the sketch below).
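A minimal sketch of such a hash, assuming created names are
lexicographically ordered; the function name and its use are purely
illustrative, not an existing directory hash:

    #include <linux/types.h>

    /* Illustrative only: an order-preserving "hash" -- the first eight
     * bytes of the name taken as a big-endian integer.  Clients that
     * create lexicographically ordered names then map into mostly
     * disjoint hash intervals, so PW extent locks on [start, end)
     * ranges of the directory rarely conflict. */
    static __u64 wbc_name_hash_ordered(const char *name, size_t len)
    {
            __u64 h = 0;
            size_t i;

            for (i = 0; i < 8; i++)
                    h = (h << 8) | (i < len ? (unsigned char)name[i] : 0);
            return h;
    }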
Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems


Hi Zam,

On Mar 23, 2009, at 14:58, Alexander Zarochentsev wrote:

> MD update: the part of an MD operation to be executed on one server;
> contains one or more MDS/RAW operations.

Why does the client need to be more granular than an update? It seems
MDS/RAW and update should be the same.

> MD batch: a collection of per-server MD updates.
>
> MDTR: MD translator: translates MD operations into MDS/RAW ones.

Isn't this essentially what the CMM is doing today? (Breaking down
distributed operations into per-node updates?) Are you expanding on
Alex's idea of creating a new generic MD server stack?

> *** WBC protocol
>
> A WBC request contains a set of MDS/RAW operations, tagged with one
> epoch number. Bulk transfers are used.

All the updates in a single operation must have the same epoch, but I
don't think we can guarantee that all the operations in a batch will be
in the same epoch, unless we stop exchanging messages with all the MD
servers. I don't see a need for them to be in the same epoch, either.
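A sketch of what that means on the receiving side: the server keys its
undo state by the epoch carried in each update rather than by the
batch, so mixed-epoch batches are harmless. All function names here are
hypothetical:

    /* Hypothetical server-side reintegration loop; wbc_batch_update(),
     * mds_undo_log() and mds_apply_update() are invented names. */
    int mds_process_batch(struct mds_device *mds, struct wbc_batch *batch)
    {
            int i, rc;

            for (i = 0; i < batch->b_nr_updates; i++) {
                    struct wbc_update *u = wbc_batch_update(batch, i);

                    /* undo records grouped per (client, epoch), not
                     * per batch */
                    rc = mds_apply_update(mds, u,
                                          mds_undo_log(mds, u->u_epoch));
                    if (rc)
                            return rc;      /* abort: reintegration of
                                             * the batch must stay
                                             * atomic */
            }
            return 0;
    }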
> *** Recovery
>
> The redo-log is preserved until it is no longer needed for recovery
> (i.e. the epoch gets stable).
>
> The client replays the log and re-executes all operations from it,
> repeating the MDTR processing (dispatching the operations between MD
> servers).

Since the MD servers all roll back before recovery, recovery will be
very similar to the original reintegration, with the exception of using
versions. So we should try to keep the recovery (replay) code as
similar to the normal code as possible, and move recovery higher into
the stack.

> c) The servers have all information needed to complete re-integration
> without the client.

You mean by keeping the original operation info in the undo logs?

cheers,
robert
>>>>> Alexander Zarochentsev (AZ) writes:

 AZ> MDS/RAW: MDS API extension to do "raw" fs operations: inserting a
 AZ> dir entry without creating an inode, and so on.

this seems to be a duplication of the OSD API's insert/delete/etc.

 AZ> MDTR: MD translator: translates MD operations into MDS/RAW ones.

and this one seems to duplicate MDD code.

why would we want to duplicate these things?

thanks, Alex
On 24 March 2009 02:17:33 Robert Read wrote:
> Hi Zam,
>
> > MD update: the part of an MD operation to be executed on one
> > server; contains one or more MDS/RAW operations.
>
> Why does the client need to be more granular than an update? It
> seems MDS/RAW and update should be the same.

Well, better to say that an update is an MDS op if the operation
touches only one MD server, and an MDS/RAW op in the case of a
distributed operation.

> > MD batch: a collection of per-server MD updates.
> >
> > MDTR: MD translator: translates MD operations into MDS/RAW ones.
>
> Isn't this essentially what the CMM is doing today? (Breaking down
> distributed operations into per-node updates?) Are you expanding on
> Alex's idea of creating a new generic MD server stack?

I just doubt that CMM code reuse is worth relayering the MD stack. Can
it be done as a subtask later?

> > *** WBC protocol
> >
> > A WBC request contains a set of MDS/RAW operations, tagged with one
> > epoch number. Bulk transfers are used.
>
> All the updates in a single operation must have the same epoch, but I
> don't think we can guarantee that all the operations in a batch will
> be in the same epoch, unless we stop exchanging messages with all the
> MD servers. I don't see a need for them to be in the same epoch,
> either.

You are right.

> Since the MD servers all roll back before recovery, recovery will be
> very similar to the original reintegration, with the exception of
> using versions. So we should try to keep the recovery (replay) code
> as similar to the normal code as possible, and move recovery higher
> into the stack.

OK.

> > c) The servers have all information needed to complete
> > re-integration without the client.
>
> You mean by keeping the original operation info in the undo logs?

I meant that the servers receive not updates but whole operations. If
the client failed and didn't send an update to some of the servers, the
operation can still be completed without the client. It is an
alternative to undoing partial updates.

Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
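To illustrate what "servers receive whole operations" could mean for
the protocol, a hypothetical reintegration record might carry the
complete operation plus routing context, so any server's copy is enough
to finish the operation without the client. All names below are
invented:

    /* Hypothetical record for choice (c): the full MD operation is
     * shipped to every involved server; op_target selects which
     * per-server update this copy was routed for. */
    struct wbc_op_record {
            __u64 op_epoch;
            __u32 op_opcode;        /* rename, link, mkdir, ... */
            __u32 op_target;        /* index of the update this server
                                     * should apply locally */
            __u32 op_nr_targets;    /* how many servers are involved */
            __u32 op_buflen;        /* packed fids/names/attrs follow:
                                     * enough to repeat the MDTR split
                                     * on the server side */
    };

The cost is that every involved server must re-run the MDTR split to
find its own part, which is the extra CPU processing Alex objects to
below.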
>>>>> Alexander Zarochentsev (AZ) writes:

 AZ> well, better to say that an update is an MDS op if the operation
 AZ> touches only one MD server, and an MDS/RAW op in the case of a
 AZ> distributed operation.

I think this just adds an unneeded entity to the system. Stating that
we either have updates or operations is simpler.

 AZ> I just doubt that CMM code reuse is worth relayering the MD stack.
 AZ> Can it be done as a subtask later?

I don't think CMM is the right thing, because it essentially breaks
layering: instead of sending an object creation request in terms of the
OSD API, or an index insert in terms of the OSD API, it introduces some
intermediate thing which is neither an operation nor an update.

 AZ> I meant that the servers receive not updates but whole operations.
 AZ> If the client failed and didn't send an update to some of the
 AZ> servers, the operation can still be completed without the client.
 AZ> It is an alternative to undoing partial updates.

the same can be done with updates if you send them through a single
server, and then you don't need to spend additional CPU processing to
parse an operation into updates.

thanks, Alex
On 25 March 2009 11:33:12 Alex Zhuravlev wrote:
>
> I don't think CMM is the right thing, because it essentially breaks
> layering: instead of sending an object creation request in terms of
> the OSD API, or an index insert in terms of the OSD API, it
> introduces some intermediate thing which is neither an operation nor
> an update.

The server MD stack has to support both WBC and non-WBC clients for the
same objects. That is why I think the MDT layer should handle MD ops as
well as MDS/RAW ops. Then the CMM only passes RAW operations to the MDD
layer, where raw ops are already supported.

Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
>>>>> Alexander Zarochentsev (AZ) writes:

 AZ> The server MD stack has to support both WBC and non-WBC clients
 AZ> for the same objects. That is why I think the MDT layer should
 AZ> handle MD ops as well as MDS/RAW ops. Then the CMM only passes
 AZ> RAW operations to the MDD layer, where raw ops are already
 AZ> supported.

then I don't understand what you mean by CMM. same about RAW
operations.

--
thanks, Alex
>>>>> Alexander Zarochentsev (AZ) writes:

 AZ> The server MD stack has to support both WBC and non-WBC clients
 AZ> for the same objects. That is why I think the MDT layer should
 AZ> handle MD ops as well as MDS/RAW ops. Then the CMM only passes
 AZ> RAW operations to the MDD layer, where raw ops are already
 AZ> supported.

btw, what's the problem with supporting WBC and non-WBC clients for the
same objects? any time you access some object via the short path
(MDT-OSD, for a WBC client) or the long path (MDT-MDD-OSD, for a
non-WBC client), it's initialized at all layers (MDT-MDD-OSD).

--
thanks, Alex
Zam,

Some notes on the WBC HLD outline:

1. The requirement is for 32K creates/second on one node, of small
files with a random size of up to 64K. It's basically HPCS IO
Scenario 4.

2. Reintegration must change the filesystem from one consistent state
to another consistent state _atomically_.

3. Not all the updates in a batch for one server need to have the same
epoch number - i.e. being forced to advance your epoch (e.g. because
you acquired a lock) doesn't force you to create a new batch. I think
this got mentioned in other emails.

4. Most readers won't know what "bulk transfers are used" for batches.

5. Is ensuring file data is delayed until file creation is reintegrated
sufficient for correct operation? Are we not effectively doing
create-on-write with a WBC? I'm sure there are more issues
(e.g. orphans).

Does including the OSTs in epoch recovery solve all the issues? If so,
what are the expected bounds on client redo and server undo storage?
Can we avoid needing server undo for data with some compromises? Can we
exploit the DMU at all?

6. The section on recovering from WBC client death seems imprecise. Is
(a) just describing V1-4 in Nikita's original post - similarly (b) for
V1-2, V3'-5'? Also, for (c) I think we may have discussed the
possibility of always sending updates as the full operation + context
to select which updates apply locally, so that an operation can always
be recovered from any of its updates.

Cheers,
Eric
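Eric's point 3, sketched on the client side (all function and field
names invented): advancing the epoch only changes the tag on
subsequently logged operations; the batch being accumulated stays open:

    /* Hypothetical client-side logging path. */
    void wbc_advance_epoch(struct wbc_client *cli)
    {
            cli->wc_epoch++;        /* note: no "close current batch"
                                     * step here */
    }

    void wbc_log_operation(struct wbc_client *cli, struct wbc_op *op)
    {
            op->op_epoch = cli->wc_epoch;   /* per-op epoch tag */
            list_add_tail(&op->op_list, &cli->wc_cur_batch->b_ops);
    }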
Hello Eric,

Thanks for the review.

On 1 April 2009 12:17:17 Eric Barton wrote:
> Zam,
>
> Some notes on the WBC HLD outline

[...]

> 5. Is ensuring file data is delayed until file creation is
> reintegrated sufficient for correct operation? Are we not
> effectively doing create-on-write with a WBC? I'm sure there are
> more issues (e.g. orphans).
>
> Does including the OSTs in epoch recovery solve all the issues?
> If so, what are the expected bounds on client redo and server undo
> storage? Can we avoid needing server undo for data with some
> compromises? Can we exploit the DMU at all?

I think we can't avoid tagging OST object creation with an epoch
counter. Would Lustre users complain if file writes are out-of-epochs?

So a write to an existing OST object may survive losing the context of
the MD operations in which the write was issued; object
creation/deletion may not.

The alternative is to implement undo logging for file data. It would
require support from the underlying server fs. It could be done for
ldiskfs; I am not sure about the DMU.

There is a security problem with out-of-epoch writes and setting file
attributes (especially permissions):

  chmod 400 foo; cat /etc/secret-file >> foo

Chmod/chown can be a special case which triggers a WBC flush.

> 6. The section on recovering from WBC client death seems imprecise.
> Is (a) just describing V1-4 in Nikita's original post - similarly
> (b) for V1-2, V3'-5'? Also, for (c) I think we may have discussed
> the possibility of always sending updates as the full operation +
> context to select which updates apply locally, so that an operation
> can always be recovered from any of its updates.

It is only a rough schema of client eviction, to list what support
might be needed in the WBC protocol, like sending the full MD op
instead of an update - what you just mentioned. BTW, I thought the
Epochs HLD would cover the detailed algorithm descriptions, no?

Thanks,
--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
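The special case Zam mentions could be as simple as the following hook.
This is a sketch under the assumption that a full WBC flush
(reintegration plus data write-out) is available as one call;
wbc_flush() and wbc_setattr_cached() are invented names:

    /* Hypothetical setattr path on a WBC client: permission/ownership
     * changes reintegrate and flush first, so no out-of-epoch data
     * write can land after the stricter attributes become visible. */
    int wbc_setattr(struct inode *inode, struct iattr *attr)
    {
            if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID)) {
                    int rc = wbc_flush(inode);  /* reintegrate MD +
                                                 * flush file data */
                    if (rc)
                            return rc;
            }
            return wbc_setattr_cached(inode, attr);
    }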
On Apr 06, 2009 13:23 +0300, Alexander Zarochentsev wrote:
> On 1 April 2009 12:17:17 Eric Barton wrote:
>
> I think we can't avoid tagging OST object creation with an epoch
> counter. Would Lustre users complain if file writes are
> out-of-epochs?
>
> There is a security problem with out-of-epoch writes and setting
> file attributes (especially permissions):
>
>   chmod 400 foo; cat /etc/secret-file >> foo
>
> Chmod/chown can be a special case which triggers a WBC flush.

While this example has been given many times as a security issue that
forces many strange actions on the part of Lustre, the example is
fundamentally broken, because POSIX allows "foo" to be opened before
the chmod and kept open until after the write, and the "secret-file"
content can then be read. The "foo" file needs to be created securely
in the first place to be safe.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
>>>>> Andreas Dilger (AD) writes:

 AD> While this example has been given many times as a security issue
 AD> that forces many strange actions on the part of Lustre, the
 AD> example is fundamentally broken, because POSIX allows "foo" to be
 AD> opened before the chmod and kept open until after the write, and
 AD> the "secret-file" content can then be read. The "foo" file needs
 AD> to be created securely in the first place to be safe.

yup, and there is no way in POSIX to even check whether a file is
opened.

my take on this and similar security-related issues is that we probably
should provide two modes:

1) strict, when no optimizations in the order of flush are done;

2) relaxed, when the order is not guaranteed and the user should use
some form of sync, but Lustre can improve performance.

--
thanks, Alex
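A sketch of how the two modes might surface as a per-client tunable;
the names are invented, not an existing Lustre option:

    /* Hypothetical flush-ordering modes. */
    enum wbc_flush_order {
            WBC_FLUSH_STRICT,       /* data written out before any
                                     * metadata update that would make
                                     * it visible or less protected */
            WBC_FLUSH_RELAXED,      /* no ordering guarantee; callers
                                     * use fsync()/sync() where the
                                     * ordering matters */
    };

    static inline bool wbc_ordered_flush(const struct wbc_client *cli)
    {
            return cli->wc_flush_order == WBC_FLUSH_STRICT;
    }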
2009/4/7 Alex Zhuravlev <bzzz at sun.com>:

Hello,

> AD> While this example has been given many times as a security issue
> AD> that forces many strange actions on the part of Lustre, the
> AD> example is fundamentally broken [...] The "foo" file needs to be
> AD> created securely in the first place to be safe.

the original "partial write-back" problem was demonstrated with the use
case:

  $ mkdir -m 0700 a         # nobody but me can access things under "a"
  $ umask 000
  $ mkdir -m 0777 -p a/b/c/d
  $ echo "secret data" > a/b/c/d/file
  $ sync
  # time passes...
  $ echo > a/b/c/d/file     # truncate secret data
  $ chmod 777 a             # relax permissions

Note that here an ordering between data and meta-data updates on
_different_ objects is important.

> yup, and there is no way in POSIX to even check whether a file is
> opened.
>
> my take on this and similar security-related issues is that we
> probably should provide two modes:
> 1) strict, when no optimizations in the order of flush are done;
> 2) relaxed, when the order is not guaranteed and the user should use
> some form of sync, but Lustre can improve performance.

The old (and outdated) WBC HLD has a section "Partial write-out"
describing these issues.

Nikita.
Hello Nikita!

On 7 April 2009 11:50:29 Nikita Danilov wrote:

[...]

> $ echo > a/b/c/d/file     # truncate secret data
> $ chmod 777 a             # relax permissions
>
> Note that here an ordering between data and meta-data updates on
> _different_ objects is important.

If we only guarantee no reordering of MD updates, Lustre's behavior
would be like ext3 without data journalling. I think that is not
terrible.

--
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems
Hello!

On Apr 7, 2009, at 2:30 AM, Alex Zhuravlev wrote:

> yup, and there is no way in POSIX to even check whether a file is
> opened.

I do not know if file leases are POSIX or not (and cannot check right
now), but they do in fact allow you not only to ensure the file is not
opened in a certain mode, but also to get notified when somebody
attempts to open a file on which you have obtained such a lease.

Bye,
Oleg
2009/4/8 Alexander Zarochentsev <Alexander.Zarochentsev at sun.com>:

> Hello Nikita!
>
[...]
> If we only guarantee no reordering of MD updates, Lustre's behavior
> would be like ext3 without data journalling. I think that is not
> terrible.

It's not terrible, but it is non-intuitive, in my opinion. More
enlightened file systems, like ZFS, reiser4, and NTFS, provide stronger
consistency guarantees, ignoring the petty distinctions between data
and meta-data. :-)

But even limiting consistency to meta-data leaves some issues open. For
example, think about an MD proxy server acting as a WBC client for a
higher-tier server. To be efficient, such a proxy might need to cache a
very large amount of meta-data, and it most likely cannot afford to
keep a log of all operations. In this situation, when a lock on a
top-level directory gets a blocking AST, the proxy would have to write
back all cached dirty meta-data under this directory before the lock
can be cancelled (to guarantee the ordering of visible meta-data
updates), which might result in unacceptable latency.

Nikita.
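Nikita's latency concern, restated as a sketch of the blocking-AST path
on a WBC client or proxy (all function names invented): the write-back
done here is proportional to the amount of dirty meta-data cached under
the contended subtree, which for a large proxy cache is effectively
unbounded:

    /* Hypothetical blocking-AST handler on a WBC client/proxy. */
    int wbc_blocking_ast(struct wbc_client *cli, struct wbc_lock *lock)
    {
            struct wbc_subtree *st = wbc_lock_subtree(lock);
            int rc;

            rc = wbc_reintegrate(cli, st);  /* may submit many batches:
                                             * this is the unbounded
                                             * latency */
            if (rc)
                    return rc;

            wbc_cache_drop(cli, st);        /* discard now-clean cache */
            return wbc_lock_cancel(cli, lock);
    }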