Hi all,

we have gotten new MDS hardware, and I've got two questions:

What are the recommendations for the RAID configuration and formatting options? I was following the recent discussion about these aspects for an OST - chunk size, strip size, stride-size, stripe-width etc. in the light of the 1MB chunks of Lustre. So what about the MDT? I will have a RAID 10 striped over 11 RAID-1 pairs, giving me roughly 3TB of space. What would be the correct value for <insert your favorite term>, i.e. the amount of data written to one disk before proceeding to the next disk?

Secondly, it is not yet decided whether we will use this hardware to set up a second Lustre cluster as well. The manual recommends having only one MGS per site, but doesn't elaborate: what would be the drawback of having two MGSes, i.e. two different network addresses the clients have to connect to in order to mount the Lustre filesystems? I know that it didn't work in Lustre 1.6.3 ;-) and there are no apparent issues when connecting a Lustre client to a test cluster now (version 1.8.4), but what about production?

Cheers,
Thomas
In our lab, we've never had a problem with simply having one MGS per filesystem. Mountpoints will be unique for all of them, but functionally it works just fine.

-Ben Evans

-----Original Message-----
From: Thomas Roth
Sent: Fri 1/21/2011 6:43 AM
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] MDT raid parameters, multiple MGSes

> [...] what would be the drawback of having two MGSes, two different network
> addresses the clients have to connect to to mount the Lustres? [...]
On 2011-01-21, at 06:55, Ben Evans wrote:
> In our lab, we've never had a problem with simply having 1 MGS per filesystem.
> Mountpoints will be unique for all of them, but functionally it works just fine.

While this "runs", it is definitely not correct. The problem is that the client will only connect to a single MGS for configuration updates (in particular, the MGS for the last filesystem that was mounted). If there is a configuration change (e.g. lctl conf_param, or adding a new OST) on one of the other filesystems, then the client will not be notified of this change because it is no longer connected to the MGS for that filesystem.

I agree that it would be desirable to allow the client to connect to multiple MGSes, but it doesn't work today. I'd be thrilled if some interested party were to fix that.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
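To make the failure mode concrete, this is the kind of change a client can miss when its single MGC is attached to a different filesystem's MGS. A minimal sketch, assuming a second filesystem named "fs2" (the fsname and parameter value are hypothetical):

    # run on the MGS serving fs2, e.g. to set a permanent parameter:
    lctl conf_param fs2.sys.timeout=100

    # a client whose MGC is currently connected to the other filesystem's MGS
    # is not notified of this change (or of a newly added fs2 OST) until it
    # reconnects to fs2's MGS.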
On Fri, Jan 21, 2011 at 3:43 AM, Thomas Roth <t.roth at gsi.de> wrote:
> What are the recommendations for the RAID configuration and formatting options?
> [...] So what about the MDT? I will have a RAID 10 striped over 11 RAID-1 pairs,
> giving me roughly 3TB of space. What would be the correct value for <insert your
> favorite term>, the amount of data written to one disk before proceeding to the
> next disk?

The MDS does very small random IO - inodes and directories. Afaik, the largest chunk of data read/written would be 4.5K, and you would see that only with large OST stripe counts. RAID 10 is fine. You will not be doing IO that spans more than one spindle, so I'm not sure if there's a real need to tune here.

Also, the size of the data on the MDS is determined by the number of files in the filesystem (~4k per file is good); unless you are buried in petabytes, 3TB is likely way oversized for an MDT.

cliffw
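Cliff's ~4k-per-file figure translates directly into a capacity estimate. A rough sketch, using only the 3TB and 4k numbers from the thread (the fsname and device below are hypothetical, and the -i value should be checked against your Lustre version's default):

    # files a 3 TB MDT could describe at ~4 KB of MDT space per file
    # (3 TB expressed in KB, divided by 4 KB per file):
    echo $(( 3 * 1024 * 1024 * 1024 / 4 ))     # ~805 million files

    # if you want to set the bytes-per-inode ratio explicitly at format time,
    # it is passed through to mke2fs via --mkfsoptions:
    mkfs.lustre --mgs --mdt --fsname=testfs --mkfsoptions="-i 4096" /dev/mdtdev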
On 01/21/2011 07:02 PM, Andreas Dilger wrote:
> While this "runs", it is definitely not correct. The problem is that the client will
> only connect to a single MGS for configuration updates (in particular, the MGS for
> the last filesystem that was mounted). [...]
> I agree that it would be desirable to allow the client to connect to multiple MGSes,
> but it doesn't work today. I'd be thrilled if some interested party were to fix that.

Ah, thanks Andreas, that would be a point we'd miss with test clusters. I had only seen this effect when deactivating an OST using the device number instead of the name, where a client with multiple MGSes could have a different numbering. Very well, we'll stick to one MGS, then.

Thomas

--
Thomas Roth
Department: Informationstechnologie
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt
www.gsi.de
O.k., so the point is that the MDS writes are so small, one could never stripe such a write over multiple disks anyhow. Very good, one point less to worry about.

Btw, files on the MDT - why does the apparent file size there sometimes reflect the size of the real file, and sometimes not? For example, on a ldiskfs-mounted copy of our MDT, I have a directory under ROOT/ with

    -rw-rw-r-- 1 935M 15. Jul 2009 09000075278027.140
    -rw-rw-r-- 1    0 15. Jul 2009 09000075278027.150

As they should, both entries are 0-sized, as seen by e.g. "du". On Lustre, both files exist and both have size 935M. So for some reason, one has a metadata entry that appears as a huge sparse file, the other does not. Is there a reason, or is this just an illness of our installation?

Cheers,
Thomas

On 01/21/2011 09:31 PM, Cliff White wrote:
> The MDS does very small random IO - inodes and directories. [...] RAID 10 is fine.
> You will not be doing IO that spans more than one spindle, so I'm not sure if
> there's a real need to tune here.
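For reference, the comparison Thomas describes can be reproduced roughly like this (device, mount point and directory name are hypothetical; mount the MDT copy read-only):

    # mount a copy of the MDT as plain ldiskfs and compare apparent size
    # with allocated blocks
    mount -t ldiskfs -o ro /dev/mdt_copy /mnt/mdt
    ls -lsh /mnt/mdt/ROOT/somedir/09000075278027.1*   # size column vs. blocks column
    du -k   /mnt/mdt/ROOT/somedir/09000075278027.1*   # both report 0 allocated KB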
Brian J. Murrell
2011-Jan-22 23:06 UTC
[Lustre-discuss] MDT raid parameters, multiple MGSes
On Sat, 2011-01-22 at 11:23 +0100, Thomas Roth wrote:
> Btw, files on the MDT - why does the apparent file size there sometimes reflect the
> size of the real file, and sometimes not?

I believe in the cases where you are seeing a size, it's SOM (size on MDS) in action.

> For example, on a ldiskfs-mounted copy of our MDT, I have a directory under ROOT/ with
>
>     -rw-rw-r-- 1 935M 15. Jul 2009 09000075278027.140
>     -rw-rw-r-- 1    0 15. Jul 2009 09000075278027.150

Indeed. It's pretty hazy at the moment (so please do check the archives), but I think there was a thread here not that long ago that explained that SOM was only activated for newly created files in the release it showed up in. Now it's interesting that the two examples above have the same mtime. Perhaps there are conditions for recording SOM for a pre-existing file, like reading it perhaps. Maybe one of those has been read since you installed a SOM release and the other has not.

> As they should, both entries are 0-sized, as seen by e.g. "du".

Or ls -ls.

> On Lustre, both files exist and both have size 935M. So for some reason, one has a
> metadata entry that appears as a huge sparse file, the other does not.

Right.

> Is there a reason, or is this just an illness of our installation?

As above, and per the archives.

Cheers,
b.
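One quick check of Brian's guess that one file was read more recently than the other is to compare timestamps on the Lustre side (the path is hypothetical, and atime is only meaningful if the filesystem is not mounted with noatime):

    # show access and modification times of the two files
    stat -c '%n  atime=%x  mtime=%y' /lustre/somedir/09000075278027.1*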
Jeremy Filizetti
2011-Jan-26 00:05 UTC
[Lustre-discuss] MDT raid parameters, multiple MGSes
On Fri, Jan 21, 2011 at 1:02 PM, Andreas Dilger <adilger at whamcloud.com> wrote:
> While this "runs", it is definitely not correct. The problem is that the client will
> only connect to a single MGS for configuration updates (in particular, the MGS for
> the last filesystem that was mounted). If there is a configuration change (e.g. lctl
> conf_param, or adding a new OST) on one of the other filesystems, then the client
> will not be notified of this change because it is no longer connected to the MGS for
> that filesystem.

We use Lustre in a WAN environment and each geographic location has its own Lustre file system with its own MGS. While I don't add storage frequently, I've never seen an issue with this.

Just to be sure, I mounted a test file system, followed by another file system, and added an OST to the test file system; the client was notified by the MGS. Looking at "lctl dl", the client shows a device for the MGC, and I see connections in the peers list. I didn't test any conf_param, but at least the connections look fine, including the output from "lctl dk". Is there something I'm missing here?

I know each OSS shares a single MGC between all the OBDs, so that a server can really only mount one file system at a time in Lustre. Is that what you are referring to?

Thanks,
Jeremy
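A hedged reconstruction of the test Jeremy describes (NIDs, fsnames, OST index and mount points are hypothetical):

    # on the client: mount the test filesystem first, then the second one
    mount -t lustre 10.0.1.1@tcp0:/testfs /mnt/testfs
    mount -t lustre 10.0.2.1@tcp0:/prodfs /mnt/prodfs

    # on an OSS of testfs: format and start a new OST
    mkfs.lustre --ost --fsname=testfs --index=2 --mgsnode=10.0.1.1@tcp0 /dev/sdX
    mount -t lustre /dev/sdX /mnt/testfs-ost0002

    # back on the client: list MGC devices and inspect the debug log
    lctl dl | grep -i mgc
    lctl dk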
On 2011-01-25, at 17:05, Jeremy Filizetti wrote:
> We use Lustre in a WAN environment and each geographic location has its own Lustre
> file system with its own MGS. While I don't add storage frequently, I've never seen
> an issue with this.
>
> Just to be sure, I mounted a test file system, followed by another file system, and
> added an OST to the test file system; the client was notified by the MGS. [...]
> Is there something I'm missing here?

Depending on how you ran the test, it is entirely possible that the client hadn't been evicted from the first MGS yet, and it accepted the message from that MGS even though it had been evicted. However, if you check the connection state on the client (e.g. "lctl get_param mgc.*.import"), it is only possible for the client to have a single MGC today, and that MGC can only have a connection to a single MGS at a time.

Granted, it is possible that someone fixed this when I wasn't paying attention.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
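For anyone wanting to run the check Andreas mentions on a client, a minimal sketch (output format varies between Lustre versions):

    # show the MGC device(s) known to the client
    lctl dl | grep -i mgc
    # show the import state of the MGC, including which MGS it is connected to
    lctl get_param mgc.*.import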
On Jan 27, 2011, at 3:15 AM, Andreas Dilger wrote:
> Depending on how you ran the test, it is entirely possible that the client hadn't
> been evicted from the first MGS yet [...]. However, if you check the connection
> state on the client (e.g. "lctl get_param mgc.*.import"), it is only possible for
> the client to have a single MGC today, and that MGC can only have a connection to
> a single MGS at a time.
>
> Granted, it is possible that someone fixed this when I wasn't paying attention.

I thought this sounded familiar - have a look at bz 20299. Multiple MGCs on a client are ok; multiple MGSes on a single server are not.

Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
On 2011-01-27, at 08:26, Jason Rappleye wrote:
> I thought this sounded familiar - have a look at bz 20299. Multiple MGCs on a
> client are ok; multiple MGSes on a single server are not.

Sigh, it was even me who filed the bug... Seems that bit of information was evicted from my memory. Thanks for setting me straight.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
Jeremy Filizetti
2011-Jan-27 23:23 UTC
[Lustre-discuss] MDT raid parameters, multiple MGSes
Thanks Jason, I haven't had any luck in reproducing it, although I have been trying. Next time I'll have to check bugzilla for closed bugs too.

Jeremy

On Thu, Jan 27, 2011 at 2:10 PM, Andreas Dilger <adilger at whamcloud.com> wrote:
> On 2011-01-27, at 08:26, Jason Rappleye wrote:
> > I thought this sounded familiar - have a look at bz 20299. Multiple MGCs on a
> > client are ok; multiple MGSes on a single server are not.
>
> Sigh, it was even me who filed the bug... Seems that bit of information was evicted
> from my memory. Thanks for setting me straight.