Hello,

the docs state that for a failover setup the MDS and MDT should be on separate nodes. We would therefore like to know what a "good" scenario could look like.

Does it make sense, or is it even possible, to set up an MDS with failover capabilities, i.e. something like this (mirrored with DRBD here, for example)?

On MDS1LUSTRE: mkfs.lustre --fsname=foo --mgs --failnode=MDS2LUSTRE /dev/drbd0
On MDT1LUSTRE: mkfs.lustre --fsname=foo --mdt --failnode=MGS2LUSTRE /dev/drbd0

That adds up to 4 machines. Would that be a feasible setup, or is it just overkill, costs aside?

What would be the proper setup scenario if the above is not possible? (Avoiding a single point of failure for the MDS.)

Thanks and Regards
Heiko
On Thu, 2008-06-19 at 16:00 +0200, Heiko Schroeter wrote:
> What would be the proper setup scenario if the above is not possible ?
> (Avoiding a single point of failure for the MDS)

Two MDSes, each with access to the same storage (the MDT) device, so that either (but never both at the same time!) can mount the MDT.

DRBD has the potential to simulate this shared storage by mirroring it, but I have no first-hand experience doing so.

b.
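For illustration, a minimal sketch of what formatting such a failover pair could look like, here with a combined MGS/MDT on a DRBD device (the NID, device path and mount point below are made-up examples, not from this thread):

    # format once, from whichever node currently holds the device
    mkfs.lustre --fsname=foo --mgs --mdt --failnode=192.168.0.22@tcp0 /dev/drbd0
    # only the currently active node mounts it; on failover the peer takes over this mount
    mount -t lustre /dev/drbd0 /mnt/mdt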
On Friday 20 June 2008 15:54:20 Brian J. Murrell wrote:
> On Thu, 2008-06-19 at 16:00 +0200, Heiko Schroeter wrote:
> > What would be the proper setup scenario if the above is not possible ?
> > (Avoiding a single point of failure for the MDS)
>
> Two MDSes each with access to the same storage (the MDT) device so that
> either (but never both at the same time!) can mount the MDT.
>
> DRBD has the potential to simulate this shared storage by mirroring it,
> but I have no first-hand experience doing so.

We do it for several Lustre installations and it works fine.

Cheers,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
On Fri, 2008-06-20 at 16:01 +0200, Bernd Schubert wrote:
> We do it for several lustre installations and it works fine.

Have you done any "intensive" failover testing of it? I'm thinking something along the lines of our Hendrix/CMD test 11 or 17. In those tests we had to survive a constant stream of failovers at something like 3 or 5 minute intervals for 24 hours. So yes, a hundred or two failovers in a row and no application (i.e. userspace) errors.

Seeing as we know Lustre can do it (we completed that contract), I'd of course be more interested in seeing DRBD survive that kind of torture. I'm not sure how long DRBD takes to come back into fully mirrored status when one node is powered off, though. If it's a long time, that in itself is an exposure to failure that shared storage doesn't suffer.

b.
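The shape of such a torture test could be sketched roughly like this (hostnames, credentials and the power-cycle mechanism are hypothetical; the actual tests were driven by a test harness, not a loop like this):

    # crudely alternate hard failures between the two MDS nodes for ~24 hours,
    # while clients keep an application workload running on the filesystem
    for i in $(seq 1 288); do
        node=$([ $((i % 2)) -eq 0 ] && echo mds1 || echo mds2)
        ipmitool -H ${node}-ipmi -U admin -P secret chassis power cycle   # hard power-cycle
        sleep 300   # let heartbeat fail over and DRBD resync before the next hit
    done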
On Friday 20 June 2008 16:08:23 Brian J. Murrell wrote:
> On Fri, 2008-06-20 at 16:01 +0200, Bernd Schubert wrote:
> > We do it for several lustre installations and it works fine.
>
> Have you done any "intensive" failover testing of it? I'm thinking
> something along the lines of our Hendrix/CMD test 11 or 17. In those
> tests we had to survive a constant stream of failovers at something like
> 3 or 5 minute intervals for 24 hours. So yes, a hundred or two
> failovers in a row and no application (i.e. userspace) errors.

I don't think we have done these tests yet, but I could put it onto my TODO list if you think it is important. So far DRBD has always done its job perfectly and has never been an issue here (in contrast to the many, many hardware problems we often have).

> Seeing as we know Lustre can do it (we completed that contract) I'd be
> more interested of course in seeing DRBD survive that kind of torture.
> I'm not sure how long DRBD takes to come back into fully mirrored
> status when one node is powered off though. If it's a long time, that
> in itself is an exposure to failure that shared storage doesn't suffer.

Since drbd-0.7 a journaled/bitmapped RAID1 is used, so DRBD only ever needs to resync the extents that have been modified. A resync therefore usually takes just a few seconds. If for some reason a full resync is required, that can of course take much longer, but this is mostly only done for the initial sync or if for some reason a split brain happened (which shouldn't be an issue if heartbeat + stonith is used).

Cheers,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
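As a point of reference, a minimal drbd.conf sketch for mirroring an MDT between two servers could look like this (hostnames, addresses, devices and the sync rate are made-up values; syntax as in the 0.7/8.x series):

    resource mdt {
        protocol C;                 # synchronous replication: a write completes on both nodes
        syncer {
            rate 30M;
            al-extents 257;         # activity log; bounds how much must be resynced after a crash
        }
        on mds1 {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   192.168.0.21:7788;
            meta-disk internal;
        }
        on mds2 {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   192.168.0.22:7788;
            meta-disk internal;
        }
    }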
>>> On Fri, 20 Jun 2008 09:54:20 -0400, "Brian J. Murrell"
>>> <Brian.Murrell at Sun.COM> said:

>> What would be the proper setup scenario if the above is not
>> possible ? (Avoiding a single point of failure for the MDS)

That's a somewhat naive idea of "avoiding a single point of failure"; for example, if the MDT drives are all of the same type, they will have the same firmware and design weaknesses, and that will be a single point of failure.

> Two MDSes each with access to the same storage (the MDT)
> device so that either (but never both at the same time!) can
> mount the MDT.

Well, I'd like to see an argument why a single storage backend shared by two storage-to-network frontends (often identical ones running the same software) helps that much with availability. I'd rather have fully mirrored storage backends (ideally of very different types):

> DRBD has the potential to simulate this shared storage by
> mirroring it, but I have no first-hand experience doing so.

Well, DRBD does not merely "simulate this shared storage"; it implements fully mirrored dual storage. Perhaps it does then logically present the two copies as a single image, but the important detail is that the storage is duplicated across two units that can be very different and even in different locations.

Also, between duplicating the backend storage and duplicating the frontend MDS, I'd rather do the former, as in the MDT case data safety is rather critical. It is potentially very easy to put in a new MDS, but recovering and restoring the MDTs is a rather bigger risk...
Am Freitag, 20. Juni 2008 16:20:23 schrieb Bernd Schubert:
> On Friday 20 June 2008 16:08:23 Brian J. Murrell wrote:
> > On Fri, 2008-06-20 at 16:01 +0200, Bernd Schubert wrote:
> > > We do it for several lustre installations and it works fine.
> >
> > Have you done any "intensive" failover testing of it? I'm thinking
> > something along the lines of our Hendrix/CMD test 11 or 17. In those
> > tests we had to survive a constant stream of failovers at something like
> > 3 or 5 minute intervals for 24 hours. So yes, a hundred or two
> > failovers in a row and no application (i.e. userspace) errors.
>
> I don't think we did these tests yet, but I could put it onto my TODO list,
> if you think it is important. So far drbd always perfectly did its job and
> never was an issue here (in contrast to the many, many hardware problems we
> often have).

The failover takes about 3-4 minutes in our setup with a shared MDS and MDT running on a mirrored DRBD device. As far as we can see, this time is taken by the fsck on the DRBD device when heartbeat takes over. The MDS/MDT partition used in this test scenario is 20 GB in size, running on a 1.8 GHz AMD machine.

Just one more question about the partition sizes. As the docs point out, one determines the size for the MDS partition by the number of inodes. How can one determine the size for the MDT partition, or is that the same as the MDS device? (As far as I can see the MDT takes the DIR info etc., so it should be larger than the MDS.)

Thanks
Heiko
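For illustration, the heartbeat v1 resource definition for this kind of takeover could look like the following sketch, i.e. a single line in /etc/ha.d/haresources that promotes the DRBD resource and then mounts the MDT (node name, DRBD resource name and mount point are hypothetical):

    mds1 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/mdt::lustre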
On Wed, 2008-06-25 at 07:36 +0200, Heiko Schroeter wrote:
> How can one determine the size for the MDT partition or is that the same as
> the MDS device ?
> (As far as I can see the MDT takes the DIR info etc., so it should be larger
> than the MDS.)

An MDT is the device (i.e. the disk) that Lustre, on an MDS (the server), uses to manage the metadata. Maybe that clears it up?

b.
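To make the distinction concrete, a short sketch (device path and mount point are hypothetical):

    mkfs.lustre --fsname=foo --mgs --mdt /dev/sdb1   # /dev/sdb1 is the MDT (the target, i.e. the disk)
    mount -t lustre /dev/sdb1 /mnt/mdt               # the node that mounts it is the MDS (the server)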
Am Mittwoch, 25. Juni 2008 14:19:11 schrieb Brian J. Murrell:
> On Wed, 2008-06-25 at 07:36 +0200, Heiko Schroeter wrote:
> > How can one determine the size for the MDT partition or is that the same
> > as the MDS device ?
> > (As far as I can see the MDT takes the DIR info etc., so it should be
> > larger than the MDS.)
>
> An MDT is the device (i.e. the disk) that Lustre, on an MDS (the server),
> uses to manage the metadata. Maybe that clears it up?

Well, yes, ok, but what about the sizes of the partitions?

The docs present an example calculating the inode space needed on an 'MDS' (3.2.2 Calculating MDS Size). That is what actually confuses me a bit.

So when the MDS partition holds the inodes of the lustre system, what will be the partition size of the MDT device? Or should it read 'MDT' partition size in the docs, and the MDS partition size doesn't matter at all?

Thanks and Regards
Heiko
On Wed, 2008-06-25 at 14:29 +0200, Heiko Schroeter wrote:
> Well yes ok, but what about the sizes of the partitions ?
>
> The docs present an example calculating the inode space needed on an 'MDS'.
> (3.2.2 Calculating MDS Size)

That should technically be "MDT Size", not MDS size.

> So when the MDS partitions

AKA MDTs.

> holds the inodes of the lustre system what will be
> the partition size of the MDT device ?

You are talking about two things that are in fact one and the same. The "MDS partition" is the MDT.

> Or should it read 'MDT' partition size in the docs and the MDS partition size
> doesn't matter at all ?

Indeed. Can you file a bugzilla ticket regarding that bad use of MDS? I will see that it gets to the documentation team.

Thanx,
b.
In the current Lustre manual (v. 1_12), section 3.2.2 is Lustre Tools. Section 21.3.2 is Calculating MDT Size, which includes an inode calculation example and does not refer to the MDS.

http://manual.lustre.org/manual/LustreManual16_HTML/LustreTuning.html#50446384_pgfId-1291157

Please clarify which Lustre manual version you are referring to.

Sheila

Brian J. Murrell wrote:
> On Wed, 2008-06-25 at 14:29 +0200, Heiko Schroeter wrote:
>> Well yes ok, but what about the sizes of the partitions ?
>>
>> The docs present an example calculating the inode space needed on an 'MDS'.
>> (3.2.2 Calculating MDS Size)
>
> That should technically be "MDT Size", not MDS size.
>
>> So when the MDS partitions
>
> AKA MDTs.
>
>> holds the inodes of the lustre system what will be
>> the partition size of the MDT device ?
>
> You are talking about two things that are in fact one and the same. The
> "MDS partition" is the MDT.
>
>> Or should it read 'MDT' partition size in the docs and the MDS partition size
>> doesn't matter at all ?
>
> Indeed. Can you file a bugzilla ticket regarding that bad use of MDS?
> I will see that it gets to the documentation team.
>
> Thanx,
> b.
Am Donnerstag, 26. Juni 2008 01:56:23 schrieb Sheila Barthel:
> In the current Lustre manual (v. 1_12), section 3.2.2 is Lustre Tools.
> Section 21.3.2 is Calculating MDT Size, which includes an inode
> calculation example and does not refer to the MDS.
>
> http://manual.lustre.org/manual/LustreManual16_HTML/LustreTuning.html#50446384_pgfId-1291157
>
> Please clarify which Lustre manual version you are referring to.

LustreManual_1.6_man_v19.pdf. Sorry for the noise if this has already been corrected.

But I am still missing some info about the MDS partition size. What is the minimum size needed for the MDS partition? Is it the same as the MDS RAM size calculation in the Lustre manual 820-3681 (May 2008), section 3.4.1, i.e. partition_size = RAM size?

Heiko
Hello Heiko,

If I'm not mistaken, 'MDS' refers to the metadata _server_, while 'MDT' refers to the metadata _target_, i.e. the distinction is akin to that between 'OSS' and 'OST'. The MDS is a server node; the MDT is the volume where all the metadata for your filesystem is stored.

The handbook recommends an MDT size of about 1-2% of your total volume size, i.e. if your total CFS volume is 10 TB, the MDT would be about 200 GB. This is fairly conservative, so you may want to err on the side of growth by using a larger volume than that. If you can spare the disk, you're certainly not sacrificing anything by over-provisioning your MDS.

hope this helps,
Klaus

On 6/25/08 5:29 AM, "Heiko Schroeter" <schroete at iup.physik.uni-bremen.de> did etch on stone tablets:

> Am Mittwoch, 25. Juni 2008 14:19:11 schrieb Brian J. Murrell:
>> On Wed, 2008-06-25 at 07:36 +0200, Heiko Schroeter wrote:
>>> How can one determine the size for the MDT partition or is that the same
>>> as the MDS device ?
>>> (As far as I can see the MDT takes the DIR info etc., so it should be
>>> larger than the MDS.)
>>
>> An MDT is the device (i.e. the disk) that Lustre, on an MDS (the server),
>> uses to manage the metadata. Maybe that clears it up?
>
> Well, yes, ok, but what about the sizes of the partitions?
>
> The docs present an example calculating the inode space needed on an 'MDS'.
> (3.2.2 Calculating MDS Size)
>
> That is what actually confuses me a bit.
>
> So when the MDS partition holds the inodes of the lustre system, what will be
> the partition size of the MDT device?
> Or should it read 'MDT' partition size in the docs, and the MDS partition size
> doesn't matter at all?
>
> Thanks and Regards
> Heiko
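A quick back-of-the-envelope sketch of how that rule of thumb lines up with the inode-based calculation (the file count and the ~4 kB-per-inode figure below are illustrative assumptions, not from this thread):

    # assuming roughly 4 kB of MDT space per file:
    # 50 million files * 4 kB ~= 200 GB, i.e. about 2% of a 10 TB filesystem
    echo $(( 50000000 * 4 / 1024 / 1024 )) GB   # prints "190 GB" (binary-unit rounding)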