Nikita,

Do you agree that a buggy or malicious MDWBC could disrupt the
namespace (e.g. links to missing files, orphaned files) if it splits
operations spanning multiple MDTs into sub-operations for the
individual targets? I think it will be a security issue if we simply
trust the MDWBC to perform such operations correctly, so I'm wondering
how we can fix this.

Using a master MDT to coordinate the operation across itself and the
remaining MDTs seems to be part of, but not all of, the solution. We
have to process batches in bulk to retain a significant performance
advantage, so I wonder whether that requires us to trust that these
batches have been created correctly.

If so, we're stuck with the MDWBC being something we can only do in a
single trust domain - i.e. not across a WAN. That seems unfortunate,
since WAN performance should be a major beneficiary of the MDWBC.
Maybe in this case we can still send batches over the WAN, but to a
single target which proxies for the remote client and can be trusted
to split multi-target ops over batches correctly.

Thoughts?

Cheers,
Eric
We discussed this in Moscow recently. It seems possible to avoid much
misbehavior by building relationships that have to be confirmed before
a commit can happen. For example, a directory entry creation must be
accompanied by an object creation or a link-count change. I think it
is possible for an MDS or MDS cluster to know in which cases such
relationships need to be present for operations to transition the
namespace to a new state (and clients can indicate which operations
are correlated).

Peter

On 10/5/08 8:53 PM, "Eric Barton" <eeb at sun.com> wrote:

> Nikita,
>
> Do you agree that a buggy or malicious MDWBC could disrupt the
> namespace (e.g. links to missing files, orphaned files) if
> it splits up operations across multiple MDTs into sub-operations
> for the individual targets? [...]
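The "confirmed relationships" idea above can be sketched as a simple batch check. This is only an illustration of the rule Peter states (a directory entry creation must be paired with an object creation or link-count change in the same epoch); the record names and the flat `(kind, fid)` encoding are invented, not actual Lustre structures.

```python
def can_commit(ops):
    """ops: list of (kind, fid) records batched in one epoch.

    A 'create_dirent' record may only commit if the same epoch also
    carries a companion 'create_object' or 'nlink_change' record for
    the target FID (the rule Peter describes).
    """
    created = {fid for kind, fid in ops if kind == "create_object"}
    relinked = {fid for kind, fid in ops if kind == "nlink_change"}
    dirents = [fid for kind, fid in ops if kind == "create_dirent"]
    # every new directory entry needs a companion record for its target
    return all(fid in created or fid in relinked for fid in dirents)


# A paired batch passes; a dangling directory entry does not.
ok = can_commit([("create_dirent", "fid1"), ("create_object", "fid1")])
bad = can_commit([("create_dirent", "fid2")])
```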
Eric Barton writes:
> Nikita,

Hello,

> Do you agree that a buggy or malicious MDWBC could disrupt the
> namespace (e.g. links to missing files, orphaned files) if
> it splits up operations across multiple MDTs into sub-operations
> for the individual targets? [...]

as Peter mentioned, we discussed this topic during the Moscow
meeting. If I am not mistaken, we converged on the idea that before
committing an epoch, every MDT composes some kind of a `summary',
containing enough information to verify global consistency, and this
summary is passed through every server as a ticket, with each server
`approving' some bits of the summary accumulated so far and adding new
ones.

For example, one server adds (SETATTR: FID: fid1, UPDATE: nlink += 2)
to the summary; then another server, having (LINK: PARENT_FID: fid2,
NAME: "foo", CHILD_FID: fid1) in its local epoch, replaces the UPDATE
part of the SETATTR record above with nlink += 1; and yet another
server with (LINK: PARENT_FID: fid3, NAME: "bar", CHILD_FID: fid1) can
cancel the SETATTR completely. Note that a LINK might cancel an UNLINK
or RENAME as well as a SETATTR. Global consistency is verified when
all summary records are similarly cancelled.

All this is still very vague to me:

- it is not clear how to start the summary exchange (round robin
  perhaps, based on the epoch number)?

- what state should be kept in a summary?

- is it always possible to prove consistency in one cycle?

> Using a master MDT to coordinate the operation across itself and
> the remaining MDTs seems part of, but not all of the solution. [...]

Nikita.
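The cancellation example in the message above (nlink += 2 for fid1, cancelled by two LINK records on other servers) can be sketched as follows. This is a minimal illustration of the idea, not a proposed implementation; the record shapes and function names are invented.

```python
from collections import defaultdict


def check_epoch_consistent(setattr_records, link_records):
    """Each SETATTR record carries an expected nlink delta for a FID;
    each LINK record from some server cancels one unit of that delta.
    The epoch is globally consistent when every delta reaches zero."""
    pending = defaultdict(int)
    for fid, nlink_delta in setattr_records:
        pending[fid] += nlink_delta
    for parent_fid, name, child_fid in link_records:
        pending[child_fid] -= 1  # one LINK accounts for one nlink increment
    return all(delta == 0 for delta in pending.values())


# The example from the mail: one server records nlink += 2 for fid1, and
# two LINK records ("foo" under fid2, "bar" under fid3) cancel it completely.
ok = check_epoch_consistent(
    setattr_records=[("fid1", 2)],
    link_records=[("fid2", "foo", "fid1"), ("fid3", "bar", "fid1")],
)
```

With only one of the two LINK records present, the SETATTR is not fully cancelled and the check fails, which is exactly the dangling state a buggy MDWBC could produce.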
Nikita Danilov writes:
> Eric Barton writes:
> > Nikita,
>
> Hello,

[...]

> as Peter mentioned, we discussed this topic during the Moscow
> meeting. If I am not mistaken, we converged to the idea that before
> committing an epoch, every mdt composes some kind of a `summary',
> containing enough information for verification of a global consistency,
> and this summary is passed though every server as a ticket, with every

This can be simplified. Suppose the total amount of `data', describing
all updates within a given epoch, is D, and there are N MD servers in
a CMD cluster. Then the total network traffic incurred by this
algorithm is

    D   /* updates from client to all servers */ +
    D*N /* cycle summary through all servers */

that is, (N + 1)*D bytes, transferred in 2*N messages. So we won't
increase network traffic by broadcasting _all_ epoch updates to
_every_ server (so that each server gets the complete set of all
updates within the epoch). In this latter case, servers can prove that
the epoch is consistent by

- checking global consistency locally,

- calculating an md5 signature of all epoch updates, and

- exchanging these signatures, to check that the client sent the same
  set of updates to everybody.

This results in

    D*N /* broadcast epoch updates to all servers */ +
    e*N /* exchange signatures */

that is, N*(D + e) bytes for some small e, transferred in 2*N
messages. Having the complete set of updates on every server would
probably help in other places too.

> server `approving' some bits in the summary accumulated so far, and

[...]

Nikita.
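The two cost formulas and the signature step above can be written down directly. This is just arithmetic plus a canonical md5 over the update set, illustrating the comparison; the string encoding of updates is an invented placeholder.

```python
import hashlib


def cycle_cost(D, N):
    # updates from client to all servers (D bytes total) +
    # summary cycled through all N servers (D*N)
    return (N + 1) * D


def broadcast_cost(D, N, e):
    # full epoch broadcast to every server (D*N) +
    # signature exchange (e*N), for some small signature size e
    return N * (D + e)


def epoch_signature(updates):
    """md5 over a canonical ordering of the epoch's updates, so every
    server can check that the client sent everyone the same set."""
    h = hashlib.md5()
    for u in sorted(updates):
        h.update(u.encode())
    return h.hexdigest()
```

Broadcasting wins whenever N*e < D, i.e. whenever the signatures are small relative to the per-server share of the epoch's update data, which matches the "for some small e" qualification in the message.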
You'll need to limit this to the requests that have dependencies. With
the algorithm below, every server starts looking at every request -
that probably kills the scaling you want to achieve.

Peter

On 10/7/08 3:13 AM, "Nikita Danilov" <Nikita.Danilov at Sun.COM> wrote:

> This can be simplified. Suppose total amount of `data', describing all
> updates within given epoch is D, and there are N md servers in a cmd
> cluster. [...]
Peter Braam writes:
> You'll need to limit this to the requests that have dependencies. With the
> algorithm below every server starts looking at every request - that probably
> kills the scaling you want to achieve.

I agree that the total amount of data can be reduced significantly,
but won't it sometimes be useful to have the complete epoch state on
all servers? E.g., we could do server-to-server replay forward instead
of a roll back. After all, the additional requests are only `looked
at' rather than actually processed. Moreover, the global consistency
check can be done by one server only (selected round-robin for each
epoch), after which this server sends the md5 signature of the total
epoch state to the other servers to verify.

Nikita.

_______________________________________________
Lustre-devel mailing list
Lustre-devel at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-devel
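The round-robin variant Nikita proposes at the end of the thread can be sketched as below: one coordinator per epoch digests the total epoch state, and the other servers compare that digest against their own copies. Server names, the `states` map, and the digest encoding are all invented for illustration.

```python
import hashlib


def coordinator_for(epoch, servers):
    # round-robin coordinator selection, based on the epoch number
    return servers[epoch % len(servers)]


def verify_epoch(epoch, servers, states):
    """states maps each server to its local copy of the epoch's updates.

    The coordinator computes an md5 digest over a canonical ordering of
    its copy; every other server recomputes the digest over its own copy
    and compares, detecting a client that sent different update sets to
    different servers."""
    def digest(updates):
        return hashlib.md5(repr(sorted(updates)).encode()).hexdigest()

    coord = coordinator_for(epoch, servers)
    ref = digest(states[coord])
    return all(digest(states[s]) == ref for s in servers if s != coord)
```

A mismatch on any server (e.g. a dropped or extra update) makes the check fail for that epoch, which is the signal that would trigger whatever recovery (roll back, or the replay-forward suggested above) the cluster implements.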