Nikita,

Do you agree that a buggy or malicious MDWBC could disrupt the
namespace (e.g. links to missing files, orphaned files) if it splits
operations spanning multiple MDTs into sub-operations for the
individual targets? I think it will be a security issue if we simply
trust the MDWBC to perform such operations correctly, so I'm wondering
how we can fix this.

Using a master MDT to coordinate the operation across itself and the
remaining MDTs seems to be part of, but not all of, the solution. We
have to process batches in bulk to retain a significant performance
advantage, so I wonder whether that requires us to trust that these
batches have been created correctly.

If so, we're stuck with the MDWBC being something we can only do in a
single trust domain - i.e. not across a WAN. That seems unfortunate,
since WAN performance should be a major beneficiary of the MDWBC.
Maybe in this case we can still send batches over the WAN, but to a
single target which proxies for the remote client and can be trusted
to split multi-target ops over batches correctly.

Thoughts?

Cheers,
Eric
We discussed this in Moscow recently. It seems possible to avoid much
misbehavior by building relationships that have to be confirmed before
a commit can happen. For example, a directory entry creation must be
accompanied by an object creation or a link-count change. I think it
is possible for an MDS or MDS cluster to know in which cases such
relationships need to be present for operations to transition the
namespace to a new state (and clients can indicate which operations
are correlated).

Peter

On 10/5/08 8:53 PM, "Eric Barton" <eeb at sun.com> wrote:

> Nikita,
>
> Do you agree that a buggy or malicious MDWBC could disrupt the
> namespace (e.g. links to missing files, orphaned files) if
> it splits up operations across multiple MDTs into sub-operations
> for the individual targets? [...]
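The "confirmed relationships" idea above can be sketched as a simple batch check. This is only an illustration of the rule Peter states (a directory entry creation must be paired with an object creation or link-count change in the same epoch); the record names and the flat `(kind, fid)` encoding are invented, not actual Lustre structures.

```python
def can_commit(ops):
    """ops: list of (kind, fid) records batched in one epoch.

    A 'create_dirent' record may only commit if the same epoch also
    carries a companion 'create_object' or 'nlink_change' record for
    the target FID (the rule Peter describes).
    """
    created = {fid for kind, fid in ops if kind == "create_object"}
    relinked = {fid for kind, fid in ops if kind == "nlink_change"}
    dirents = [fid for kind, fid in ops if kind == "create_dirent"]
    # every new directory entry needs a companion record for its target
    return all(fid in created or fid in relinked for fid in dirents)


# A paired batch passes; a dangling directory entry does not.
ok = can_commit([("create_dirent", "fid1"), ("create_object", "fid1")])
bad = can_commit([("create_dirent", "fid2")])
```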
Eric Barton writes:
> Nikita,

Hello,

> Do you agree that a buggy or malicious MDWBC could disrupt the
> namespace (e.g. links to missing files, orphaned files) if
> it splits up operations across multiple MDTs into sub-operations
> for the individual targets? [...]

as Peter mentioned, we discussed this topic during the Moscow
meeting. If I am not mistaken, we converged on the idea that before
committing an epoch, every MDT composes some kind of a `summary',
containing enough information to verify global consistency, and this
summary is passed through every server as a ticket, with each server
`approving' some bits of the summary accumulated so far and adding new
ones.

For example, one server adds (SETATTR: FID: fid1, UPDATE: nlink += 2)
to the summary; then another server, having (LINK: PARENT_FID: fid2,
NAME: "foo", CHILD_FID: fid1) in its local epoch, replaces the UPDATE
part of the SETATTR record above with nlink += 1; and yet another
server with (LINK: PARENT_FID: fid3, NAME: "bar", CHILD_FID: fid1) can
cancel the SETATTR completely. Note that a LINK might cancel an UNLINK
or RENAME as well as a SETATTR. Global consistency is verified when
all summary records are similarly cancelled.

All this is still very vague to me:

- it is not clear how to start the summary exchange (round robin
  perhaps, based on the epoch number)?

- what state should be kept in a summary?

- is it always possible to prove consistency in one cycle?

> Using a master MDT to coordinate the operation across itself and
> the remaining MDTs seems part of, but not all of the solution. [...]

Nikita.
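The cancellation example in the message above (nlink += 2 for fid1, cancelled by two LINK records on other servers) can be sketched as follows. This is a minimal illustration of the idea, not a proposed implementation; the record shapes and function names are invented.

```python
from collections import defaultdict


def check_epoch_consistent(setattr_records, link_records):
    """Each SETATTR record carries an expected nlink delta for a FID;
    each LINK record from some server cancels one unit of that delta.
    The epoch is globally consistent when every delta reaches zero."""
    pending = defaultdict(int)
    for fid, nlink_delta in setattr_records:
        pending[fid] += nlink_delta
    for parent_fid, name, child_fid in link_records:
        pending[child_fid] -= 1  # one LINK accounts for one nlink increment
    return all(delta == 0 for delta in pending.values())


# The example from the mail: one server records nlink += 2 for fid1, and
# two LINK records ("foo" under fid2, "bar" under fid3) cancel it completely.
ok = check_epoch_consistent(
    setattr_records=[("fid1", 2)],
    link_records=[("fid2", "foo", "fid1"), ("fid3", "bar", "fid1")],
)
```

With only one of the two LINK records present, the SETATTR is not fully cancelled and the check fails, which is exactly the dangling state a buggy MDWBC could produce.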
Nikita Danilov writes:
> Eric Barton writes:
> > Nikita,
>
> Hello,

[...]

> as Peter mentioned, we discussed this topic during the Moscow
> meeting. If I am not mistaken, we converged to the idea that before
> committing an epoch, every mdt composes some kind of a `summary',
> containing enough information for verification of a global consistency,
> and this summary is passed though every server as a ticket, with every

This can be simplified. Suppose the total amount of `data', describing
all updates within a given epoch, is D, and there are N MD servers in
a CMD cluster. Then the total network traffic incurred by this
algorithm is

    D   /* updates from client to all servers */ +
    D*N /* cycle summary through all servers */

that is, (N + 1)*D bytes, transferred in 2*N messages. So we won't
increase network traffic by broadcasting _all_ epoch updates to
_every_ server (so that each server gets the complete set of all
updates within the epoch). In this latter case, servers can prove that
the epoch is consistent by

- checking global consistency locally,

- calculating an md5 signature of all epoch updates, and

- exchanging these signatures, to check that the client sent the same
  set of updates to everybody.

This results in

    D*N /* broadcast epoch updates to all servers */ +
    e*N /* exchange signatures */

that is, N*(D + e) bytes for some small e, transferred in 2*N
messages. Having the complete set of updates on every server would
probably help in other places too.

> server `approving' some bits in the summary accumulated so far, and

[...]

Nikita.
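The two cost formulas and the signature step above can be written down directly. This is just arithmetic plus a canonical md5 over the update set, illustrating the comparison; the string encoding of updates is an invented placeholder.

```python
import hashlib


def cycle_cost(D, N):
    # updates from client to all servers (D bytes total) +
    # summary cycled through all N servers (D*N)
    return (N + 1) * D


def broadcast_cost(D, N, e):
    # full epoch broadcast to every server (D*N) +
    # signature exchange (e*N), for some small signature size e
    return N * (D + e)


def epoch_signature(updates):
    """md5 over a canonical ordering of the epoch's updates, so every
    server can check that the client sent everyone the same set."""
    h = hashlib.md5()
    for u in sorted(updates):
        h.update(u.encode())
    return h.hexdigest()
```

Broadcasting wins whenever N*e < D, i.e. whenever the signatures are small relative to the per-server share of the epoch's update data, which matches the "for some small e" qualification in the message.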
You'll need to limit this to the requests that have dependencies. With
the algorithm below, every server starts looking at every request -
that probably kills the scaling you want to achieve.

Peter

On 10/7/08 3:13 AM, "Nikita Danilov" <Nikita.Danilov at Sun.COM> wrote:

> This can be simplified. Suppose total amount of `data', describing all
> updates within given epoch is D, and there are N md servers in a cmd
> cluster. [...]
Peter Braam writes:
> You'll need to limit this to the requests that have dependencies. With the
> algorithm below every server starts looking at every request - that probably
> kills the scaling you want to achieve.

I agree that the total amount of data can be reduced significantly,
but won't it sometimes be useful to have the complete epoch state on
all servers? E.g., we could do server-to-server replay forward instead
of a roll back. After all, the additional requests are only `looked
at' rather than actually processed. Moreover, the global consistency
check can be done by one server only (selected round-robin for each
epoch), after which this server sends the md5 signature of the total
epoch state to the other servers to verify.

Nikita.

_______________________________________________
Lustre-devel mailing list
Lustre-devel at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-devel
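The round-robin variant Nikita proposes at the end of the thread can be sketched as below: one coordinator per epoch digests the total epoch state, and the other servers compare that digest against their own copies. Server names, the `states` map, and the digest encoding are all invented for illustration.

```python
import hashlib


def coordinator_for(epoch, servers):
    # round-robin coordinator selection, based on the epoch number
    return servers[epoch % len(servers)]


def verify_epoch(epoch, servers, states):
    """states maps each server to its local copy of the epoch's updates.

    The coordinator computes an md5 digest over a canonical ordering of
    its copy; every other server recomputes the digest over its own copy
    and compares, detecting a client that sent different update sets to
    different servers."""
    def digest(updates):
        return hashlib.md5(repr(sorted(updates)).encode()).hexdigest()

    coord = coordinator_for(epoch, servers)
    ref = digest(states[coord])
    return all(digest(states[s]) == ref for s in servers if s != coord)
```

A mismatch on any server (e.g. a dropped or extra update) makes the check fail for that epoch, which is the signal that would trigger whatever recovery (roll back, or the replay-forward suggested above) the cluster implements.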