Hi,

first of all, I have a 140 TB Lustre installation running over InfiniBand without any problems. It's working great!

But I have one small question about the MGS service. It's not clear to me which information/configuration is stored by the MGS.

I have two MDTs, created on the MDS with these commands:

1. with MGS
   mkfs.lustre --mdt --mgs --fsname=aeifs1... /dev/drbd0
   MGS and MDT are sharing one partition

2. without MGS
   mkfs.lustre --mdt --fsname=aeifs2... /dev/drbd1
   MDT on a separate partition

What could happen if I reformat with
   mkfs.lustre --mdt --mgs --fsname=aeifs1... /dev/drbd0?

Does reformatting the first MDT with the --mgs option have any influence on the second MDT?

I am using Lustre 1.6.4.3.

"la lykken gro, som gresset bak do"

Nico Budewitz
HPC-Cluster Administrator
Max-Planck-Institute for Gravitational Physics / Albert-Einstein-Institute
Am Muehlenberg 1, 14476 Golm
Tel.: +49 (0)331 567 7364
Fax: +49 (0)331 567 7297
http://supercomputers.aei.mpg.de

On Mon, 2008-09-29 at 01:42 +0200, Nico Budewitz wrote:
> It's not clear to me which information/configuration is stored by
> the MGS.

All of the configuration information related to all filesystems that use that MGS.

> I have two MDTs, created on the MDS with these commands:
>
> 1. with MGS
>    mkfs.lustre --mdt --mgs --fsname=aeifs1... /dev/drbd0
>    MGS and MDT are sharing one partition

So this is a co-located MDT and MGS.

> 2. without MGS
>    mkfs.lustre --mdt --fsname=aeifs2... /dev/drbd1
>    MDT on a separate partition

You must have used an --mgsnode= to point this MDT to the MGS you created above, yes?

> What could happen if I reformat with
>    mkfs.lustre --mdt --mgs --fsname=aeifs1... /dev/drbd0?

You will lose the configuration for both filesystems! This is _exactly_ the reason that we recommend a separate MGS for installations that have even a notion of wanting more than one filesystem.

> Does reformatting the first MDT with the --mgs option have any
> influence on the second MDT?

Indeed, as above.

b.
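
For illustration, the separate-MGS layout recommended above might look roughly like this for the two filesystems in this thread, on a fresh installation. This is only a sketch under stated assumptions: the MGS device /dev/drbd2 and the NID mds1@o2ib0 are hypothetical, and the options elided as "..." in the original commands are not filled in. The script only prints the commands unless DRY_RUN=0 is set, since mkfs.lustre destroys whatever is on the target device.

```shell
#!/bin/sh
# Sketch: a standalone MGS plus two MDTs that register with it.
# ASSUMPTIONS (not from the thread): /dev/drbd2 as the MGS device and
# mds1@o2ib0 as the MGS NID. Real commands would also carry the options
# the original post elided with "...".
MGS_DEV=${MGS_DEV:-/dev/drbd2}
MGS_NID=${MGS_NID:-mds1@o2ib0}
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode; execute it otherwise.
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

plan_layout() {
    # The MGS gets its own small device, independent of any MDT.
    run mkfs.lustre --mgs "$MGS_DEV"
    # Each MDT points at the MGS instead of hosting it.
    run mkfs.lustre --mdt --fsname=aeifs1 --mgsnode="$MGS_NID" /dev/drbd0
    run mkfs.lustre --mdt --fsname=aeifs2 --mgsnode="$MGS_NID" /dev/drbd1
}

plan_layout
```

With this layout, reformatting /dev/drbd0 would only affect aeifs1, because the configuration for both filesystems lives on the MGS device and not on either MDT.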

Brian J. Murrell wrote:
> This is _exactly_ the reason that we recommend a separate MGS for
> installations that have even a notion of wanting more than one
> filesystem.

I just checked, and the current Lustre operations manual (v1.14, updated September 19) does not seem to reflect that belief; even the "more complex configurations" offer mkfs.lustre commands that give a combined MGS/MDT.

In any case, my experience has been that customers are often reluctant to carve a whole LUN out of their disk arrays for so little data, and partitioning can have undesirable performance impacts. Has any consideration been given to ways of storing the MGS information within a file on an existing filesystem instead of requiring a separate block device? I'm tempted to suggest doing this via the loopback driver, since the MGS data-change rate is so low anyway.
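
A rough sketch of what such a loopback-backed MGS could look like is given here. Everything specific in it is an assumption for illustration: the image path, the 128 MB size, /dev/loop0 and the mount point are hypothetical, and whether this works cleanly on a given Lustre release has not been verified. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: MGS kept in a file on an existing (non-Lustre) filesystem.
# ASSUMPTIONS: image path, size, loop device and mount point are all
# hypothetical values chosen for illustration.
MGS_IMG=${MGS_IMG:-/var/lib/lustre/mgs.img}
MGS_SIZE_MB=${MGS_SIZE_MB:-128}
LOOP_DEV=${LOOP_DEV:-/dev/loop0}
MGS_MNT=${MGS_MNT:-/mnt/mgs}
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode; execute it otherwise.
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_loop_mgs() {
    # Allocate the backing file on the existing filesystem.
    run dd if=/dev/zero of="$MGS_IMG" bs=1M count="$MGS_SIZE_MB"
    # Expose the file as a block device.
    run losetup "$LOOP_DEV" "$MGS_IMG"
    # Format the loop device as a standalone MGS and start it.
    run mkfs.lustre --mgs "$LOOP_DEV"
    run mkdir -p "$MGS_MNT"
    run mount -t lustre "$LOOP_DEV" "$MGS_MNT"
}

create_loop_mgs
```

The MDTs would then be formatted with --mgsnode= pointing at the node that hosts this loop device, so no MDT reformat touches the MGS data.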

On Mon, 2008-09-29 at 10:34 -0400, Jeff Darcy wrote:
> I just checked, and the current Lustre operations manual (v1.14, updated
> September 19) does not seem to reflect that belief; even the "more
> complex configurations" offer mkfs.lustre commands that give a combined
> MGS/MDT.

But you will notice that in the specific section 4.2.2.6 "Running Multiple Lustre Filesystems" they do demonstrate setting up a separate MGS:

An installation with two filesystems could look like this:

    mgsnode# mkfs.lustre --mgs /dev/sda
    mdtfoonode# mkfs.lustre --fsname=foo --mdt --mgsnode=mgsnode@tcp0 /dev/sda
    ossfoonode# mkfs.lustre --fsname=foo --ost --mgsnode=mgsnode@tcp0 /dev/sda
    ossfoonode# mkfs.lustre --fsname=foo --ost --mgsnode=mgsnode@tcp0 /dev/sdb
    mdtbarnode# mkfs.lustre --fsname=bar --mdt --mgsnode=mgsnode@tcp0 /dev/sda
    ossbarnode# mkfs.lustre --fsname=bar --ost --mgsnode=mgsnode@tcp0 /dev/sda
    ossbarnode# mkfs.lustre --fsname=bar --ost --mgsnode=mgsnode@tcp0 /dev/sdb

Certainly it's not an absolute requirement, but for the corner case being discussed here (the need to reformat the filesystem that is co-located with the MGS), there is a problem.

> In any case, my experience has been that customers are often
> reluctant to carve a whole LUN out of their disk arrays for so little
> data,

Agreed.

> and partitioning can have undesirable performance impacts.

Also agreed.

> Has
> any consideration been given to ways of storing the MGS information
> within a file on an existing filesystem instead of requiring a
> separate block device?

How would that be any better than the co-located MGS/MDT situation where you want to reformat the filesystem that has the configuration information stored on it in a file?

> I'm tempted to suggest doing this via the
> loopback driver, since the MGS data-change rate is so low anyway.

But you still have the problem of formatting that particular filesystem. I suppose you could umount the loopback device, copy the file to a different filesystem and remount the loopback device. That just seems cumbersome.

I think a better proposal would be to enhance mkfs.lustre to save and restore the configuration data of the filesystems not being reformatted when a combined MGS/MDT is being formatted. Feel free to file a bugzilla ticket requesting that.

b.
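
The umount, copy and remount procedure described above could be sketched along these lines. The paths, the loop device and the mount point are hypothetical, and the sketch assumes the MGS is already running from a loop device backed by the old image file. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: move a loopback-backed MGS image to another filesystem before
# the filesystem that currently holds it is reformatted.
# ASSUMPTIONS: all paths and device names below are hypothetical.
OLD_IMG=${OLD_IMG:-/var/lib/lustre/mgs.img}
NEW_IMG=${NEW_IMG:-/srv/backup/mgs.img}
LOOP_DEV=${LOOP_DEV:-/dev/loop0}
MGS_MNT=${MGS_MNT:-/mnt/mgs}
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode; execute it otherwise.
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

move_loop_mgs() {
    # Stop the MGS and release the loop device.
    run umount "$MGS_MNT"
    run losetup -d "$LOOP_DEV"
    # Copy the backing file to a filesystem that will not be reformatted.
    run cp "$OLD_IMG" "$NEW_IMG"
    # Re-attach the loop device to the copy and restart the MGS.
    run losetup "$LOOP_DEV" "$NEW_IMG"
    run mount -t lustre "$LOOP_DEV" "$MGS_MNT"
}

move_loop_mgs
```

The MGS is unavailable between the umount and the final mount, so a move like this would presumably be done in a maintenance window.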

Brian J. Murrell wrote:
> But you will notice that in the specific section 4.2.2.6 "Running
> Multiple Lustre Filesystems" they do demonstrate setting up a separate
> MGS:

Actually they demonstrate setting up a separate MDT to use an existing MGS. There's even a note that says "specify --mdt --mgs on one, and --mdt --mgsnode=/<mgsnodenid>/ on the others", which would still run into the same problem should one ever need to reformat that first MGS/MDT. That won't direct somebody who has "even a notion of wanting more than one filesystem" toward creating a separate MGS. If it's something you guys feel should be recommended for all or nearly all cases, it needs to be presented that way in even the early examples.

>> Has
>> any consideration been given to ways of storing the MGS information
>> within a file on an existing filesystem instead of requiring a
>> separate block device?
>
> How would that be any better than the co-located MGS/MDT situation where
> you want to reformat the filesystem that has the configuration
> information stored on it in a file?

That's why I said an *existing* filesystem, i.e. one existing before and therefore separate from any MDTs you create. Sure, if you reformat the filesystem containing the MGS data you'll still have to do a (trivial) save and restore, but that will always be the case no matter where the MGS data goes. At least customers wouldn't find themselves facing the problem that started the thread, where reformatting the MDT will rather unexpectedly mean reformatting their MGS as well.

> But you still have the problem of formatting that particular filesystem.
> I suppose you could umount the loopback device, copy the file to a
> different filesystem and remount the loopback device. That just seems
> cumbersome.

Yes, it is, but no more so than the MGS-on-private-storage case currently. More importantly, it avoids the cumbersome requirement to devote a whole separate block device to this role. Customers' notions of which burden to avoid might not match ours, and I've found that many customers really loathe allocating tiny LUNs for special purposes (gateway LUNs on EMC/Clariion devices are another example). This is something a customer could do today without code changes to work around that particular burden.

On Mon, 2008-09-29 at 11:28 -0400, Jeff Darcy wrote:
> Actually they demonstrate setting up a separate MDT to use an existing
> MGS.

Are we talking about the same thing here? In the commands:

    mgsnode# mkfs.lustre --mgs /dev/sda
    mdtfoonode# mkfs.lustre --fsname=foo --mdt --mgsnode=mgsnode@tcp0 /dev/sda
    ossfoonode# mkfs.lustre --fsname=foo --ost --mgsnode=mgsnode@tcp0 /dev/sda
    ossfoonode# mkfs.lustre --fsname=foo --ost --mgsnode=mgsnode@tcp0 /dev/sdb
    mdtbarnode# mkfs.lustre --fsname=bar --mdt --mgsnode=mgsnode@tcp0 /dev/sda
    ossbarnode# mkfs.lustre --fsname=bar --ost --mgsnode=mgsnode@tcp0 /dev/sda
    ossbarnode# mkfs.lustre --fsname=bar --ost --mgsnode=mgsnode@tcp0 /dev/sdb

The MGS is completely separate from any MDT. It is in fact even on a separate node.

> There's even a note that says "specify --mdt --mgs on one, and
> --mdt --mgsnode=/<mgsnodenid>/ on the others"

Indeed.

> which would still run into
> the same problem should one ever need to reformat that first MGS/MDT.

Right.

> That won't direct somebody who has "even a notion of wanting more than
> one filesystem" toward creating a separate MGS. If it's something you
> guys feel should be recommended for all or nearly all cases, it needs to
> be presented that way in even the early examples.

Fair enough. Can you file a documentation ticket on this?

> That's why I said an *existing* filesystem, i.e. one existing before
> and therefore separate from any MDTs you create.

So this would be a non-Lustre filesystem? If you want failover MGS capability, this filesystem will need to be accessible from multiple (i.e. MGS) nodes.

> Sure, if you reformat
> the filesystem containing the MGS data you'll still have to do a
> (trivial) save and restore, but that will always be the case no matter
> where the MGS data goes. At least customers wouldn't find themselves
> facing the problem that started the thread, where reformatting the MDT
> will rather unexpectedly mean reformatting their MGS as well.

I agree that this is a problem. I'm just trying to propose a(n unnecessarily complicated) solution.

> Yes, it is, but no more so than the MGS-on-private-storage case
> currently.

If the MGS is on separate storage, there is no action required (as opposed to the idea of moving it around depending on what you want to format), so, yes, it does seem to me to be less cumbersome.

> More importantly, it avoids the cumbersome requirement to
> devote a whole separate block device to this role.

Hrm. That doesn't seem "cumbersome". Maybe wasteful, yes.

Anyway, I've proposed how one can resolve this problem with current implementations of Lustre. Anyone is of course free to request any further features or enhancements one wishes via our bugzilla system.

b.