I just got myself into a bit of a mess this morning by doing things that sure seemed reasonable at the time, but turned out not to be.

I've been experimenting with a small ram-based (!) lustre fs, essentially feeding some diskless nodes from a lustre fs whose OSTs are ramdisk partitions on a separate batch of servers. To test all this, I formatted up the RAM OSTs and MDT, pointed them all at my existing (disk-based) MGS, just because it was already there, and tried it out. Everything worked perfectly.

But it turned out that I hadn't made my RAM MDT big enough, so I tried to shut down the RAM pieces, reformat them, and start them up again with bigger sizes. Nope. I couldn't mount the MDT; mount.lustre complained that my service index was already in use.

I realized that the MGS remembered the previous MDT, and was booting me out because it thought the new one was a name collision. So I figured I should rebuild the MGS to make it forget, and let everybody register. Nope. Zapping the MGS after other components such as OSTs have registered with it causes them to fail to reregister, because they claim they already have.

The error message in the log of the MGS machine suggested that I could straighten it out by running tunefs.lustre to reset the OSTs' idea of who the MGS was. Nope. tunefs.lustre told me it couldn't help me because I had extended fs properties, and that the problem should be fixed in a newer release, which I should download.

The net effect of all this was that I started over and rebuilt my disk-based lustre fs from scratch. When I get all the pieces of the test system online, I'll go back and start over on the RAM-based one. I'll definitely use a completely separate MGS this time, to avoid polluting the main one.

I think the lesson here is that if you propose to use separate lustre filesystems sharing an MGS, you need to be careful about it. It might have worked if I'd done cleaner shutdowns of my RAM-based pieces, but what if you lose a machine? I don't think it's safe to assume that one fs will always be shut down in the way you expect. It also might have worked if I'd had the right version of tunefs.lustre; I didn't have time to go through the process of upgrading it all to see whether a newer version would have fixed that particular issue.

I believe that in the general case, the issue is around sharing a single MGS between a filesystem that's expected to live for a while and one that's expected to get reconstituted frequently. Given that the MGS remembers stuff about all the filesystems that are or have been attached to it, that seems like a pretty dodgy proposition.

Perhaps there's also a corollary issue, around what happens if you lose, say, the disk on your MGS. What's the procedure meant to be to build up a fresh one and get everybody talking to each other again?

I'd be interested in hearing from anybody else who's using an MGS for more than one fs concurrently, and about what the experts at CFS say is the right way to do that.
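P.S. For reference, the RAM-side setup was roughly the following. This is a sketch from memory; the node name, ramdisk devices, mount points, and fsname below are placeholders, and the exact mkfs.lustre flags are worth double-checking against the man page for your release:

  # MDT server: ramdisk-backed MDT, registering with the existing disk-based MGS
  mkfs.lustre --fsname=ramfs --mdt --mgsnode=mgs01@tcp0 /dev/ram0
  mount -t lustre /dev/ram0 /mnt/ramfs-mdt

  # each OST server: ramdisk-backed OST, same MGS
  mkfs.lustre --fsname=ramfs --ost --mgsnode=mgs01@tcp0 /dev/ram0
  mount -t lustre /dev/ram0 /mnt/ramfs-ost

  # diskless clients
  mount -t lustre mgs01@tcp0:/ramfs /mnt/ramfs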
jrd@jrd.org wrote:

> I realized that the MGS remembered the previous MDT, and was booting me out
> because it thought the new one was a name collision.

Right.

> So I figured I should rebuild the MGS to make it forget, and let everybody
> register.

Rebuilding the MGS will forget all logs for all registered FSs, which is not what you want. This is the correct place to use the "--writeconf" flag to tunefs. Reformatting an already-registered MDT gives an error when you try to remount:

  cfs21:~# mount -t lustre -o loop /tmp/testmdt /mnt/test
  mount.lustre: mount /dev/loop3 at /mnt/test failed: Address already in use
  The target service's index is already in use. (/dev/loop3)

To override this check, use the "--writeconf" flag on the MDT to force destruction of all config files for this FS:

  tunefs.lustre --writeconf /tmp/testmdt

Then the mount will succeed, and you will see in dmesg:

  Lustre: MGS: Logs for fs test were removed by user request.
  All servers must be restarted in order to regenerate the logs.

> Nope.  Zapping the MGS after other components such as OSTs have registered
> with it causes them to fail to reregister, because they claim they already
> have.

Again, you can force re-registration by running tunefs.lustre --writeconf on each server.

> The error message in the log of the MGS machine suggested that I could
> straighten it out by running tunefs.lustre to reset the OSTs' idea of who
> the MGS was.  Nope.  tunefs.lustre told me it couldn't help me because I had
> extended fs properties, and that it should be fixed in a newer release,
> which I should download.

The tunefs.lustre error message should have included:

  Use e2fsprogs-1.38-cfs1 or later, available from
  ftp://ftp.lustre.org/pub/lustre/other/e2fsprogs/

which you should do, because older e2fsprogs can't understand the extended attributes used on the OSTs, and therefore can't modify them.

> It also might have worked if I'd had the right version of tunefs.lustre; I
> didn't have time to go through the process of upgrading it all to see if a
> newer version would have fixed that particular issue.

A tunefs.lustre --writeconf should have. There is more information about writeconf here: https://mail.clusterfs.com/wikis/lustre/MountConf

> I believe that in the general case, the issue is around sharing a single MGS
> between a filesystem that's expected to live for a while, and one that's
> expected to get reconstituted frequently.
>
> Perhaps there's also a corollary issue, around what happens if you lose,
> say, the disk on your MGS.  What's the procedure meant to be to build up a
> fresh one and get everybody talking to each other again?

Reformat the MGS, and again, you guessed it, writeconf on each of the servers.
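In other words, the general recovery recipe looks something like this (the device paths are placeholders for your actual MGS/MDT/OST devices and mount points):

  # 1. unmount clients, then OSTs, then the MDT (and the MGS if you reformatted it)

  # 2. regenerate the config logs on every target
  tunefs.lustre --writeconf /dev/mdt_device
  tunefs.lustre --writeconf /dev/ost_device      # repeat for each OST

  # 3. remount in order: MGS first, then MDT, then OSTs, then clients
  mount -t lustre /dev/mgs_device /mnt/mgs
  mount -t lustre /dev/mdt_device /mnt/mdt
  mount -t lustre /dev/ost_device /mnt/ost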
From: Nathaniel Rutman <nathan@clusterfs.com>
Date: Thu, 12 Oct 2006 08:59:15 -0700

    Rebuilding the MGS will forget all logs for all registered FSs, which is
    not what you want.

Yeah, I kind of figured that out :-}

    The tunefs.lustre error message should have included:
      Use e2fsprogs-1.38-cfs1 or later, available from
      ftp://ftp.lustre.org/pub/lustre/other/e2fsprogs/
    which you should do,

I believe it did. I was in a low-level panic because what was planned to be a short downtime of my test system was turning into a much bigger problem, so at that point I just decided to fall back and use a recipe I knew would work.

I'm still running 1.6b4; if I update to b5, does that stuff come with it, or do I still need to treat it as a separate thing?

Thanks!
John R. Dunning wrote:

> I believe it did.  I was in a low-level panic because what was planned to be
> a short downtime of my test system was turning into a much bigger problem,
> so at that point I just decided to fall back and use a recipe I knew would
> work.
>
> I'm still running 1.6b4; if I update to b5, does that stuff come with it, or
> do I still need to treat it as a separate thing?

The e2fsprogs are separate. Eventually these features will migrate into the standard distros, but they aren't there yet.

b5 has more of the --writeconf stuff worked out, iirc, so it's probably good to migrate. But a word of warning: b4 disks won't run under b5 due to a config file format change. That's the last change we foresee.
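If you aren't sure which e2fsprogs a node has, something like the following should tell you (this assumes an RPM-based system; the package query will differ on other distros):

  rpm -q e2fsprogs
  # you want e2fsprogs-1.38-cfs1 or newer, i.e. the CFS-patched build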
From: Nathaniel Rutman <nathan@clusterfs.com>
Date: Thu, 12 Oct 2006 10:22:27 -0700

    The e2fsprogs are separate.  Eventually these features will migrate into
    the standard distros, but they aren't there yet.

Ok.

    b5 has more of the --writeconf stuff worked out, iirc, so it's probably
    good to migrate.  But a word of warning: b4 disks won't run under b5 due
    to a config file format change.  That's the last change we foresee.

That's fine. As long as I have some time to plan the migration/upgrade on my test systems, it should be no problem. Good to know about the incompatibility, though.