FYI...

<eeb> any further thoughts on the subject of epochs?
<snip>
<bzzz_z> i'd appreciate very much if you could describe your concerns about the overhead of dependency-based recovery
<eeb> yes indeed
<bzzz_z> preferably in email so that i can look and think again and again
<eeb> sure - it's all about how communications combine
<eeb> I don't like nikita's idea either (reduction of a vector O(#clients) in size)
<bzzz_z> my understanding is that nikita's idea is very simple to understand and implement
<eeb> yes - I agree with almost all of it apart from having to propagate per-client info
<eeb> but as we discussed - that is only required if clients don't participate in a fixed-topology reduction
<eeb> when they do, the info to reduce is O(constant)
<eeb> which means you can afford to reduce it much more frequently
<eeb> if the reduction tree hasn't got too high a branching ratio
<bzzz_z> i still think that having a single coordinator is a bad thing, and fsync is a problem as well
<eeb> no single coordinator required
<bzzz_z> ?
<eeb> unless you want to call the root of the reduction tree "the coordinator" - but really, it's just the server that knows the global reduction result first and broadcasts it back to the rest of the cluster via the other servers in the tree
<bzzz_z> well, this is a coordinator - a single node deciding which epochs are stable at a given moment
<eeb> no - it doesn't decide it
<eeb> the reduction is done in the tree
<eeb> the root of the tree is only combining information from its children, like any other non-leaf in the tree
<bzzz_z> one part of the tree knows nothing about a different part of the tree
<bzzz_z> yes, but you can't make epoch N stable until you know another subtree has no unfinished epoch N-10
<eeb> indeed
<eeb> every client tells its favourite server (and only its favourite server) its oldest volatile epoch
<eeb> each server computes the oldest volatile epoch over itself and its clients
<eeb> and tells its parent
<eeb> the parent computes the oldest volatile epoch over its children and tells its parent
<eeb> etc
<eeb> until the root knows the global oldest volatile epoch
<eeb> the root tells its children the global oldest volatile epoch
<eeb> the children tell their children
<bzzz_z> the current epoch can come from a different server, not from the favorite server.
<eeb> and so on
<eeb> I don't understand your last statement
<bzzz_z> ok, describing ..
<bzzz_z> say, there are mds1 and mds2
<bzzz_z> client1 has mds1 as a favorite, and client2 has mds2
<bzzz_z> client1 has its volatile epoch 20 and reports this to mds1
<eeb> c1 has oldest volatile epoch 20 and reports to mds1
<bzzz_z> at the same time client2 has its volatile epoch 10 and reports this to mds2
<eeb> c2 has _oldest_ volatile epoch 10 and reports to mds2
<eeb> mds2 reports 10 to mds1
<eeb> mds1 now knows the oldest volatile epoch is 10
<bzzz_z> mds1 can't make 20 stable till 10 is stable as well
<eeb> it tells mds2 that 10 is the oldest volatile epoch
<eeb> mds1 tells c1 and mds2 tells c2
<eeb> now everyone knows 10 is the oldest volatile epoch
<eeb> what's the issue
<eeb> ?
<bzzz_z> the issue is that you need all servers to be involved
<eeb> yes - they inevitably all are when you have a large enough cluster and volume of distributed operations
<bzzz_z> that's exactly the point
<eeb> so you need the # of messages and # of bytes in these messages that any individual server sees to be limited
<eeb> otherwise you can't scale
<eeb> if you cannot combine messages, then you are doing an exchange, not a reduction
<eeb> reduction is far more scalable than exchange
<bzzz_z> no, you're talking about a cluster doing a *single* job. i'm talking about a cluster doing many jobs. in the latter case you want to localize operations to some servers
<eeb> I'm neutral about whether the cluster is doing a single job or multiple unrelated jobs
<eeb> both use cases must scale
<bzzz_z> requiring all servers to be in non-stop exchange makes a big cluster very vulnerable to failures
<eeb> how is that different for dependencies?
<bzzz_z> and the bigger the cluster, the more frequent the failures
<bzzz_z> because with dependencies all exchange can be limited to the servers involved in the operations. if /home/eeb lives on (mds1; mds2) and /home/bzzz lives on (mds3; mds4) then a failure of mds5 doesn't impact me or you
<bzzz_z> even w/o failures, requiring all servers to interact all the time is not very good - servers can be distributed over the globe
<bzzz_z> especially given most operations aren't really distributed at all
<eeb> disagree - we discussed yesterday that WAN clients would have to use proxy servers
<bzzz_z> because if they are, then performance will be bad
<bzzz_z> proxy changes nothing, imho
<eeb> think again - a proxy does the global epoch calculation on behalf of the WAN clients
<eeb> you can expect every mds to be involved in a distributed operation with every other mds after enough operations have been performed
<bzzz_z> expect doesn't mean "a lot of distributed operations all the time"
<snip>
<bzzz_z> the proxy does the global epoch calculation, but if the link between the proxy and the remote part of the cluster is broken, you can't make any progress with undo cancel - because they share an epoch namespace
<snip>
<eeb> the proxy server is just a low-latency client
<eeb> which bounds who needs to be involved in the global last volatile epoch calculation
<bzzz_z> my feeling is that we have a very different sense of "scale" here: yours is about zillions of distributed operations over the whole cluster all the time, mine is rather zillions of local domains where the working set belongs
<eeb> if you make each MDS the proxy for the lustre clients (like we do now with having the master MDS do the RPCs to the slave MDSes) then you've limited the global oldest volatile epoch calculation to just the servers
<eeb> yes
<eeb> I agree with your last comment
<eeb> If you can convince me that we can achieve good load balance with a scheme that can exploit locality - i.e. so you have mathematical bounds on the volume of non-local operations as a proportion of the whole - then I will start to believe more in dependencies :)
<bzzz_z> dependency-based recovery would work with any ordinary setup, as usual, not requiring any special proxy. and i think it'd scale very well with "working sets"
<eeb> ok - I think we both have stuff to think about now
<eeb> ttyl...
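[To make the message flow in the chat above a little more concrete, here is a minimal sketch in plain C - not Lustre code - of the two passes eeb describes: a min-reduction of oldest volatile epochs up a fixed tree, followed by a broadcast of the global result back down. The node layout, field names, and the servers' own local epochs (30) are invented for illustration.]

/*
 * Minimal sketch (not Lustre code) of the fixed-topology reduction
 * described in the chat: each client reports its oldest volatile epoch
 * to its favourite server only, each server min-reduces over itself and
 * its children, and the root broadcasts the global minimum back down.
 * All names and values here are hypothetical.
 */
#include <stdio.h>

#define MAX_CHILDREN 8

struct node {
        const char   *name;
        unsigned long local_oldest;           /* oldest volatile epoch known locally  */
        struct node  *children[MAX_CHILDREN]; /* servers below us, plus our clients   */
        int           nr_children;
        unsigned long global_oldest;          /* filled in by the downward broadcast  */
};

/* Upward pass: min-reduce oldest volatile epochs toward the root. */
static unsigned long reduce_up(struct node *n)
{
        unsigned long oldest = n->local_oldest;

        for (int i = 0; i < n->nr_children; i++) {
                unsigned long child = reduce_up(n->children[i]);
                if (child < oldest)
                        oldest = child;
        }
        return oldest;
}

/* Downward pass: the root tells its children, they tell theirs, and so on. */
static void broadcast_down(struct node *n, unsigned long global_oldest)
{
        n->global_oldest = global_oldest;
        for (int i = 0; i < n->nr_children; i++)
                broadcast_down(n->children[i], global_oldest);
}

int main(void)
{
        /* The example from the chat: c1 reports 20 to mds1, c2 reports 10 to mds2. */
        struct node c1   = { .name = "client1", .local_oldest = 20 };
        struct node c2   = { .name = "client2", .local_oldest = 10 };
        struct node mds2 = { .name = "mds2", .local_oldest = 30,
                             .children = { &c2 }, .nr_children = 1 };
        struct node mds1 = { .name = "mds1", .local_oldest = 30,   /* root of the tree */
                             .children = { &c1, &mds2 }, .nr_children = 2 };

        broadcast_down(&mds1, reduce_up(&mds1));

        printf("global oldest volatile epoch = %lu\n", mds1.global_oldest);
        printf("mds2 sees %lu, client1 sees %lu\n", mds2.global_oldest, c1.global_oldest);
        return 0;
}

[The point of the reduction, as eeb argues, is that each parent only ever receives one already-combined value per child, so per-server traffic stays proportional to the branching ratio regardless of how many clients hang off the tree - unlike an all-to-all exchange.]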
On Dec 31, 2008 15:20 +0000, Eric Barton wrote:
> <eeb> what's the issue
> <eeb> ?
> <bzzz_z> the issue is that you need all servers to be involved
> <eeb> yes - they inevitably all are when you have a large enough cluster and volume of distributed operations
> <bzzz_z> that's exactly the point
> <eeb> so you need the # of messages and # of bytes in these messages that any individual server sees to be limited
> <eeb> otherwise you can't scale
> <eeb> if you cannot combine messages, then you are doing an exchange, not a reduction
> <eeb> reduction is far more scalable than exchange
> <bzzz_z> no, you're talking about a cluster doing a *single* job. i'm talking about a cluster doing many jobs. in the latter case you want to localize operations to some servers
> <eeb> I'm neutral about whether the cluster is doing a single job or multiple unrelated jobs
> <eeb> both use cases must scale
> <bzzz_z> requiring all servers to be in non-stop exchange makes a big cluster very vulnerable to failures
> <eeb> how is that different for dependencies?
> <bzzz_z> and the bigger the cluster, the more frequent the failures
> <bzzz_z> because with dependencies all exchange can be limited to the servers involved in the operations. if /home/eeb lives on (mds1; mds2) and /home/bzzz lives on (mds3; mds4) then a failure of mds5 doesn't impact me or you

I tend to agree with Alex here - even in a "local" cluster there may be administrative or technical reasons to bind subsets of the namespace to a subset of the MDTs (e.g. MDT pools), and having those be autonomous would be highly desirable both from a fault-tolerance point of view and a load-balancing POV.

> <bzzz_z> even w/o failures, requiring all servers to interact all the time is not very good - servers can be distributed over the globe
> <bzzz_z> especially given most operations aren't really distributed at all
> <eeb> disagree - we discussed yesterday that WAN clients would have to use proxy servers
> <bzzz_z> because if they are, then performance will be bad
> <bzzz_z> proxy changes nothing, imho
> <eeb> think again - a proxy does the global epoch calculation on behalf of the WAN clients
> <eeb> you can expect every mds to be involved in a distributed operation with every other mds after enough operations have been performed
> <bzzz_z> expect doesn't mean "a lot of distributed operations all the time"
> <snip>
> <bzzz_z> the proxy does the global epoch calculation, but if the link between the proxy and the remote part of the cluster is broken, you can't make any progress with undo cancel - because they share an epoch namespace
> <snip>
> <eeb> the proxy server is just a low-latency client
> <eeb> which bounds who needs to be involved in the global last volatile epoch calculation
> <bzzz_z> my feeling is that we have a very different sense of "scale" here: yours is about zillions of distributed operations over the whole cluster all the time, mine is rather zillions of local domains where the working set belongs
> <eeb> if you make each MDS the proxy for the lustre clients (like we do now with having the master MDS do the RPCs to the slave MDSes) then you've limited the global oldest volatile epoch calculation to just the servers

This is exactly the kind of implementation that I hope we will end up with - we DON'T have to have every client involved in the epochs, only the servers and clients that are doing WBC (e.g. login nodes, proxy clients for a WAN, etc).
This set would hopefully be flexible, so that nodes like login nodes could temporarily start doing WBC operations under load, but flush their state and return to being "dumb" clients when idle. Ideally, with a single MDT and all dumb clients (i.e. today's Lustre) it would collapse into a much simpler setup like we have today, with just a single transno controlled by the MDT.

> <eeb> yes
> <eeb> I agree with your last comment
> <eeb> If you can convince me that we can achieve good load balance with a scheme that can exploit locality - i.e. so you have mathematical bounds on the volume of non-local operations as a proportion of the whole - then I will start to believe more in dependencies :)
> <bzzz_z> dependency-based recovery would work with any ordinary setup, as usual, not requiring any special proxy. and i think it'd scale very well with "working sets"
> <eeb> ok - I think we both have stuff to think about now
> <eeb> ttyl...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Tue, 06 Jan 2009 12:59:58 -0700 Andreas Dilger <adilger at sun.com> wrote:
> > <eeb> if you make each MDS the proxy for the lustre clients (like we do now with having the master MDS do the RPCs to the slave MDSes) then you've limited the global oldest volatile epoch calculation to just the servers
>
> This is exactly the kind of implementation that I hope we will end up with - we DON'T have to have every client involved in the epochs, only the servers and clients that are doing WBC (e.g. login nodes, proxy clients for a WAN, etc). This set would hopefully be flexible, so that nodes like login nodes could temporarily start doing WBC operations under load, but flush their state and return to being "dumb" clients when idle.

If I understood you right, you prefer to send all updates through one server? While this has obvious benefits (like sanity checks, the ability to employ persistent redo, etc.), I believe the generic mechanism should be able to send updates to the target server directly - probably the best example here would be SNS. Depending on the nature and source of the updates, we can choose either way. Let me remind you that with the dependency-based schema clients are "dumb" all the time; the only thing they have to do is to tag updates with a unique id.

thanks, Alex
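[To make Alex's last point a bit more concrete, here is a hypothetical sketch in plain C - not Lustre code - of what "tagging updates with a unique id" could look like. The id layout, the per-update dependency list, and the replay check are all assumptions made for illustration; the chat does not specify them. Note that only the servers actually involved in an operation would ever need to see these records, which is the locality argument made earlier in the thread.]

/*
 * Hypothetical sketch of client-tagged updates for dependency-based
 * recovery.  All structures and names are invented for illustration.
 */
#include <stdio.h>
#include <stdint.h>

/* Globally unique update tag: which client issued it, and its local sequence. */
struct update_id {
        uint64_t client_id;     /* unique per client, e.g. assigned at connect */
        uint64_t sequence;      /* monotonically increasing on that client     */
};

#define MAX_DEPS 4

/* One logged update as a server involved in the operation might record it. */
struct update_record {
        struct update_id id;                /* tag supplied by the "dumb" client        */
        struct update_id deps[MAX_DEPS];    /* updates this one depends on, if any      */
        int              nr_deps;
        int              committed;         /* set once the update is on stable storage */
};

/* An update is replayable only when every update it depends on is committed. */
static int deps_committed(const struct update_record *u,
                          const struct update_record *log, int log_len)
{
        for (int d = 0; d < u->nr_deps; d++) {
                int found = 0;
                for (int i = 0; i < log_len; i++) {
                        if (log[i].id.client_id == u->deps[d].client_id &&
                            log[i].id.sequence == u->deps[d].sequence) {
                                found = log[i].committed;
                                break;
                        }
                }
                if (!found)
                        return 0;
        }
        return 1;
}

int main(void)
{
        /* Update 1.2 from client 1 depends on its earlier update 1.1. */
        struct update_record log[] = {
                { .id = { 1, 1 }, .nr_deps = 0, .committed = 1 },
                { .id = { 1, 2 }, .deps = { { 1, 1 } }, .nr_deps = 1 },
        };

        printf("update 1.2 replayable: %d\n", deps_committed(&log[1], log, 2));
        return 0;
}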