Shobhit Dayal
2007-Apr-20 17:01 UTC
[Lustre-devel] Is it possible to introduce multiple mds servers in lustre
Hi,

We're a group of students at CMU and we're building a project around Lustre. A main part of the work involves introducing multiple MDS servers in Lustre. We have a design for managing metadata from multiple MDSes, but we were wondering how much work it is, besides changing MDS metadata management, to introduce a new active MDS server. Our impression so far is that neither the client nor the OSTs will easily work with a new active MDS entity in the cluster, in terms of managing connections from multiple MDSes, and that they will have to be changed. Is this correct?

For instance, as an experiment we created a client-->mds-->ost chain and created some files, 'foo' and 'bar', through it. Then we replicated the filesystem of the MDS that stores all the metadata onto another MDS, mds2. We then introduced a second client and tried to set up the connections client2-->mds2-->ost.

This setup does not work when foo and bar are written from both clients: changes cannot be seen from both clients, and as soon as the second MDS connects, client1 and mds1 seem to lose their connection with the OST.

Can someone point us to the right way to bring up two MDSes in the Lustre environment, even though it may lead to data/metadata corruption? Some guidance would be helpful.

Thanks in advance,
-Shobhit Dayal
Andreas Dilger
2007-Apr-23 17:24 UTC
[Lustre-devel] Is it possible to introduce multiple mds servers in lustre
On Apr 20, 2007 19:00 -0400, Shobhit Dayal wrote:
> We're a group of students at CMU and we're building a project around
> Lustre. A main part of the work involves introducing multiple MDS servers
> in Lustre.

I'm sad to inform you that work on introducing multiple MDTs for a single filesystem has been going on for several years already, and is mostly done (targeted for release some time at the end of this year). This is what we call "clustered metadata" (CMD). I'm not sure what our policy would be for releasing an alpha version of this code.

> We have a design for managing metadata from multiple MDSes, but we were
> wondering how much work it is, besides changing MDS metadata management,
> to introduce a new active MDS server. Our impression so far is that
> neither the client nor the OSTs will easily work with a new active MDS
> entity in the cluster, in terms of managing connections from multiple
> MDSes, and that they will have to be changed. Is this correct?

For CMD, there is a new "logical metadata volume" (LMV) that handles the connections from the filesystem to the multiple MDTs. This is somewhat analogous to the LOV, in that it spreads MDT access and operations over the multiple MDTs. Each MDT is still mostly independent, in that each exports a single ext3 filesystem (like multiple OSTs on a single OSS), rather than sharing access to the same block device.

> For instance, as an experiment we created a client-->mds-->ost chain and
> created some files, 'foo' and 'bar', through it. Then we replicated the
> filesystem of the MDS that stores all the metadata onto another MDS, mds2.
> We then introduced a second client and tried to set up the connections
> client2-->mds2-->ost.

Ah, this is somewhat different from CMD, where each MDT is a (mostly) independent subset of the filesystem. The CMD code has no replication between MDTs. That would definitely be an interesting and worthwhile project. It would be implemented in a very similar manner, with a replicating layer between llite and the MDC, each MDC connecting to a separate MDT.

> This setup does not work when foo and bar are written from both clients:
> changes cannot be seen from both clients, and as soon as the second MDS
> connects, client1 and mds1 seem to lose their connection with the OST.
>
> Can someone point us to the right way to bring up two MDSes in the Lustre
> environment, even though it may lead to data/metadata corruption?

You need a layer, like the LOV is for OSCs, to handle multiple independent connections. That layer should then replicate requests to each of the MDTs for modifying events (in MDT order), and could e.g. round-robin read-only events (e.g. getattr) to help spread the load.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
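To make the LOV/LMV analogy above concrete, here is a minimal user-space sketch, invented for illustration and not Lustre code, of how an LMV-like client layer might decide which MDT serves a given name, the same way the LOV decides which OSTs hold a file's objects. The hash function and the MDT count are assumptions:

#include <stdio.h>

#define NUM_MDTS 4                     /* assumed MDT count, for illustration */

/* djb2-style string hash; a real implementation would need a hash that is
 * stable across every client in the cluster. */
static unsigned long name_hash(const char *name)
{
        unsigned long h = 5381;

        while (*name)
                h = h * 33 + (unsigned char)*name++;
        return h;
}

/* The LMV-like decision: which MDT serves this name? */
static int mdt_for_name(const char *name)
{
        return (int)(name_hash(name) % NUM_MDTS);
}

int main(void)
{
        const char *names[] = { "foo", "bar", "home", "scratch" };

        for (int i = 0; i < 4; i++)
                printf("%-8s -> MDT%d\n", names[i], mdt_for_name(names[i]));
        return 0;
}

Any real dispatch function would have to give the same answer on every client, or two clients could disagree about which MDT owns a name.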
Shobhit Dayal
2007-Apr-23 21:38 UTC
[Lustre-devel] Is it possible to introduce multiple mds servers in lustre
Hi Andreas,

Thanks for your reply, it is really helpful. To give you some context on what we are doing: we're trying to build a clustered MDS service in Lustre based on a paper from CMU on dynamic redistribution:
http://www.pdl.cs.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-06-105_abs.html

We aren't really looking to replicate the MDS servers in this design; that was just a hack to get us started on getting two MDSes up that shared a namespace, by copying the ext3 of one MDS to another.

Dynamic redistribution proposes an easier way of decentralising the MDS service than implementing distributed transactions for cross-server operations such as rename. It proposes that a single server perform the rename-like operations by temporarily becoming the owner of both objects until the operation completes. So for instance, if there are multiple MDSes, each serving a part of a global namespace, and the client issues a rename that moves a file from mds2 to mds1, the following approach can be used in the context of Lustre (see the sketch after this message):

1. Mount the ext3 filesystem of mds2 from mds1.
2. Delete the original file in the ext3 of mds2.
3. Create a new file in the appropriate path on the ext3 of mds1.
4. Unmount the ext3 of mds2 from mds1.

All the above operations can be transactioned locally on mds1 for atomicity. Other operations on mds1 and mds2 on the relevant directory paths will be disabled until the rename succeeds.

We'll have to deal with the problem that deleting the file from mds2 and recreating it on mds1 will change its inode number and generation count, since these values are used directly at the OST as an object reference. So we are implementing something that will allow us to remember the old inode numbers and generation counts on mds1.

But we're stuck on the problem of even bringing up two MDSes in the Lustre environment and getting an OST with one LOV to share that LOV between both MDSes. Lustre doesn't allow us to configure MDSes/OSTs this way; OSTs don't listen to two MDSes at the same time. Is there an easy way to bring up two MDSes such that an OST with a single LOV will allow both to connect to it, and to pass around object references to objects that lie in this single volume?

Thanks,
Shobhit
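Below is a toy user-space simulation of the ownership-transfer rename described in the message above. Everything here is hypothetical and invented for illustration (real Lustre metadata lives in ext3 inodes, not in-memory tables): mds1 briefly takes ownership of both ends of the rename, moves the name, and records the old (inode, generation) pair so that OST object references remain resolvable.

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 8

struct entry {                  /* one metadata entry: name -> (ino, gen) */
        char name[32];
        unsigned long ino;      /* would be the ext3 inode number */
        unsigned int gen;       /* would be the ext3 generation count */
        int used;
};

struct mds {                    /* a trivially small per-MDS namespace */
        struct entry tab[TABLE_SIZE];
};

static struct entry *lookup(struct mds *m, const char *name)
{
        for (int i = 0; i < TABLE_SIZE; i++)
                if (m->tab[i].used && strcmp(m->tab[i].name, name) == 0)
                        return &m->tab[i];
        return NULL;
}

static struct entry *insert(struct mds *m, const char *name,
                            unsigned long ino, unsigned int gen)
{
        for (int i = 0; i < TABLE_SIZE; i++)
                if (!m->tab[i].used) {
                        snprintf(m->tab[i].name, sizeof(m->tab[i].name),
                                 "%s", name);
                        m->tab[i].ino = ino;
                        m->tab[i].gen = gen;
                        m->tab[i].used = 1;
                        return &m->tab[i];
                }
        return NULL;
}

/* Cross-MDS rename: mds1 takes temporary ownership of both entries,
 * deletes the source on mds2, and recreates it locally while keeping the
 * old (ino, gen), so the object reference held by the OST stays valid.
 * A real version would wrap this in one local transaction on mds1 and
 * block other operations on the affected directories until it commits. */
static int cross_rename(struct mds *mds1, struct mds *mds2,
                        const char *src, const char *dst)
{
        struct entry *e = lookup(mds2, src);

        if (e == NULL || insert(mds1, dst, e->ino, e->gen) == NULL)
                return -1;
        e->used = 0;            /* delete from mds2 */
        return 0;
}

int main(void)
{
        static struct mds mds1, mds2;   /* statics are zero-initialized */
        struct entry *e;

        insert(&mds2, "foo", 1042, 7);
        cross_rename(&mds1, &mds2, "foo", "bar");
        e = lookup(&mds1, "bar");
        if (e != NULL)
                printf("bar on mds1: ino=%lu gen=%u\n", e->ino, e->gen);
        return 0;
}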
Andreas Dilger
2007-Apr-25 02:48 UTC
[Lustre-devel] Is it possible to introduce multiple mds servers in lustre
On Apr 23, 2007 23:38 -0400, Shobhit Dayal wrote:
> To give you some context on what we are doing: we're trying to build a
> clustered MDS service in Lustre based on a paper from CMU on dynamic
> redistribution:
> http://www.pdl.cs.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-06-105_abs.html
>
> We aren't really looking to replicate the MDS servers in this design;
> that was just a hack to get us started on getting two MDSes up that
> shared a namespace, by copying the ext3 of one MDS to another.

If you aren't looking at replication, then you are in fact implementing exactly what the CMD project at CFS has been working to complete.

> So for instance, if there are multiple MDSes, each serving a part of a
> global namespace, and the client issues a rename that moves a file from
> mds2 to mds1, the following approach can be used in the context of Lustre:
>
> 1. Mount the ext3 filesystem of mds2 from mds1.
> 2. Delete the original file in the ext3 of mds2.
> 3. Create a new file in the appropriate path on the ext3 of mds1.
> 4. Unmount the ext3 of mds2 from mds1.
>
> All the above operations can be transactioned locally on mds1 for
> atomicity.

Since ext3 is itself not a shared filesystem and can only be mounted on a single MDS at one time, it would be FAR easier and faster to just have mds1 do a synchronous operation to mds2 instead of trying to coordinate unmounting and remounting the filesystem across nodes.

> We'll have to deal with the problem that deleting the file from mds2 and
> recreating it on mds1 will change its inode number and generation count,
> since these values are used directly at the OST as an object reference.
> So we are implementing something that will allow us to remember the old
> inode numbers and generation counts on mds1.

CMD implemented a new abstraction layer of file identifiers ("FIDs") that keeps the ext3 inode numbers internal to the filesystem and exposes only abstracted numbers for the inodes to the clients.

> But we're stuck on the problem of even bringing up two MDSes in the
> Lustre environment and getting an OST with one LOV to share that LOV
> between both MDSes. Lustre doesn't allow us to configure MDSes/OSTs this
> way; OSTs don't listen to two MDSes at the same time.

The LOV is really for client->many-OST communication, and you would need the equivalent LMV layer for client->many-MDT communication. Each inode would get an lmv striping EA that tells the client which MDT the inode resides on, just like the lov EA tells which OST the object lives on.

> Is there an easy way to bring up two MDSes such that an OST with a single
> LOV will allow both to connect to it, and to pass around object references
> to objects that lie in this single volume?

You need to add a Logical Metadata Volume (LMV) layer to have the single llite->MDC connection be multiplexed to multiple MDTs.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
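A rough sketch of the FID idea described above. The field names are loosely modeled on Lustre FIDs, but the mapping table and lookup here are invented for illustration: clients address files only by the abstract FID, and each MDT privately maps FIDs to its backend ext3 inodes, so a backend inode can be deleted and recreated (as in the cross-MDS rename) without the client-visible identity changing.

#include <stdio.h>

/* Abstract file identifier, the only handle clients ever see.  Loosely
 * modeled on Lustre FIDs; the exact layout here is illustrative. */
struct fid {
        unsigned long long seq; /* sequence number, allocated per MDT */
        unsigned int oid;       /* object id within the sequence */
        unsigned int ver;       /* version */
};

/* Private per-MDT mapping from FID to backend ext3 inode.  Because the
 * client never sees ino/gen, an inode can be deleted and recreated
 * without invalidating the client-visible identity. */
struct oi_entry {
        struct fid fid;
        unsigned long ino;      /* backend ext3 inode number */
        unsigned int gen;       /* backend generation count */
};

static const struct oi_entry oi_table[] = {
        { { 0x200000001ULL, 1, 0 }, 1042, 7 },
        { { 0x200000001ULL, 2, 0 }, 1043, 1 },
};

static const struct oi_entry *oi_lookup(const struct fid *f)
{
        for (unsigned int i = 0; i < sizeof(oi_table) / sizeof(oi_table[0]); i++)
                if (oi_table[i].fid.seq == f->seq &&
                    oi_table[i].fid.oid == f->oid)
                        return &oi_table[i];
        return NULL;
}

int main(void)
{
        struct fid f = { 0x200000001ULL, 1, 0 };
        const struct oi_entry *e = oi_lookup(&f);

        if (e != NULL)
                printf("FID [0x%llx:%u] -> ino %lu gen %u (server-internal)\n",
                       f.seq, f.oid, e->ino, e->gen);
        return 0;
}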
Shobhit Dayal
2007-Apr-25 03:11 UTC
[Lustre-devel] Is it possible to introduce multiple mds servers in lustre
Thanks, I think I see the direction we need to head in. I guess there are no short-term hacks to get multiple MDSes up in our environment. What we were really interested in was demonstrating the dynamic redistribution concept with as little change to Lustre as possible, but I guess what you are saying is that this may not be possible.

It's also helpful to know the FIDs approach you took for managing inode changes.

Thanks for taking so much time out and writing such detailed mails. When we manage to bring up something, I'll let you know :)

-Shobhit
Brian J. Murrell
2007-Apr-25 09:13 UTC
[Lustre-devel] Is it possible to introduce multiple mds servers in lustre
On Wed, 2007-25-04 at 05:09 -0400, Shobhit Dayal wrote:
> Thanks for taking so much time out and writing such detailed mails.
> When we manage to bring up something, I'll let you know :)

Shobhit,

Did you see the announcement on this list this morning by Peter Braam about the Lustre 2.0 code branch snapshot?

That work contains the CMD (clustered metadata) code to which Andreas has referred.

Perhaps that is a good place for you to start your work.

Cheers,
b.
Andreas Dilger
2007-Apr-25 09:48 UTC
[Lustre-devel] Is it possible to introduce multiple mds servers in lustre
On Apr 25, 2007 11:13 -0400, Brian J. Murrell wrote:
> Did you see the announcement on this list this morning by Peter Braam
> about the Lustre 2.0 code branch snapshot?
>
> That work contains the CMD (clustered metadata) code to which Andreas
> has referred.
>
> Perhaps that is a good place for you to start your work.

I was just going to say the same thing. Starting with the CMD code you can likely tune the MDT distribution algorithms (which AFAIK is your main goal), and we'd be happy to see any results you produce on different algorithms for this.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
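For a miniature picture of what "tuning the MDT distribution algorithms" could mean, here is a speculative sketch (not the CMD code) that treats inode placement as a pluggable policy, so round-robin, hashed, and load-weighted placement can be swapped and compared:

#include <stdio.h>

#define NUM_MDTS 4                      /* assumed, for illustration */

/* A placement policy maps a new name (plus optional per-MDT load
 * figures) to the index of the MDT that should hold its inode. */
typedef int (*mdt_policy_t)(const char *name, const unsigned int *load);

static int policy_round_robin(const char *name, const unsigned int *load)
{
        static int next;

        (void)name; (void)load;
        return next++ % NUM_MDTS;
}

static int policy_hash(const char *name, const unsigned int *load)
{
        unsigned long h = 5381;

        (void)load;
        while (*name)
                h = h * 33 + (unsigned char)*name++;
        return (int)(h % NUM_MDTS);
}

/* Place new inodes on the least-loaded MDT: one way an experiment in
 * dynamic redistribution might bias placement. */
static int policy_least_loaded(const char *name, const unsigned int *load)
{
        int best = 0;

        (void)name;
        for (int i = 1; i < NUM_MDTS; i++)
                if (load[i] < load[best])
                        best = i;
        return best;
}

int main(void)
{
        unsigned int load[NUM_MDTS] = { 90, 10, 40, 70 };
        mdt_policy_t policies[] = { policy_round_robin, policy_hash,
                                    policy_least_loaded };
        const char *pnames[] = { "round-robin", "hash", "least-loaded" };

        for (int p = 0; p < 3; p++)
                printf("%-12s places \"foo\" on MDT%d\n",
                       pnames[p], policies[p]("foo", load));
        return 0;
}

A hashed policy gives every client the same answer with no shared state; a load-weighted policy needs load information distributed to clients but can adapt, which is the trade-off at the heart of dynamic redistribution.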
Peter J. Braam
2007-Apr-28 12:38 UTC
[Lustre-devel] Is it possible to introduce multiple mds servers in lustre
The FID subsystem was designed to do that - FIDs in the version of Lustre we posted are unique cluster-wide and have a location database.

- Peter -

On 4/25/07, Shobhit Dayal <shobhit.dayal@gmail.com> wrote:
> It's also helpful to know the FIDs approach you took for managing inode
> changes.
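One way to picture the location database Peter mentions, as an illustrative sketch only (the details of Lustre's actual FID location service differ): FID sequences are handed out to servers in ranges, and a small table maps each range to the MDT that owns it, so any node can resolve a cluster-wide FID to its home MDT.

#include <stdio.h>

/* One range of FID sequence numbers and the MDT that serves it. */
struct fld_entry {
        unsigned long long seq_start;
        unsigned long long seq_end;     /* exclusive */
        int mdt_index;
};

/* The "location database": contiguous sequence ranges handed out to
 * each MDT.  Values are invented for the example. */
static const struct fld_entry fld[] = {
        { 0x200000000ULL, 0x240000000ULL, 0 },
        { 0x240000000ULL, 0x280000000ULL, 1 },
};

/* Resolve a cluster-wide FID sequence to the MDT that owns it. */
static int fld_lookup(unsigned long long seq)
{
        for (unsigned int i = 0; i < sizeof(fld) / sizeof(fld[0]); i++)
                if (seq >= fld[i].seq_start && seq < fld[i].seq_end)
                        return fld[i].mdt_index;
        return -1;                      /* unknown sequence */
}

int main(void)
{
        unsigned long long seq = 0x250000000ULL;

        printf("FID seq 0x%llx lives on MDT%d\n", seq, fld_lookup(seq));
        return 0;
}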