We're trying to architect a Lustre setup for our group, and want to leverage our available resources. In doing so, we've come to consider multi-purposing several hosts, so that they'll function simultaneously as MDS & OSS.

Background: Our OSTs are a NexSan SATAboy (10T), a NexSan SATAbeast (30T), a FalconStor NSS650 (32T), and a FalconStor NSS620 (32T) -- all have multiple iSCSI interfaces. Primary MDS would be a 72-cpu IBM x3950m2, which would also be an OSS. Secondary MDS would be a 2-cpu Penguin Computing Altus-1300, which would also be an OSS. A 2-cpu Dell PowerEdge 1425 would be our third OSS.

The IBM x3950m2 also functions as a heavily-used compute cluster (70 dedicated cpus, which could/would be reduced by the number of cpus to be dedicated to MDS and OSS needs). We have most of the infrastructure already in place for InfiniBand networking.

Are there basic conflicts-of-interest, and/or known/potential "gotchas" in utilizing hosts in such multi-purpose roles?

--
JONATHAN B. HOREN
Systems Administrator
UAF Life Science Informatics
Center for Research Services
(907) 474-2742
jbhoren at alaska.edu
http://biotech.inbre.alaska.edu
On Fri, 2010-09-17 at 10:42 -0800, Jonathan B. Horen wrote:

> Background: Our OSTs

OSSes. OSTs are the disks that an OSS provides object service with.

> Primary MDS would be a 72-cpu IBM x3950m2, which would
> also be an OSS.

MDS and OSS on the same node is an unsupported configuration, due to the fact that if it fails you will have a "double failure" and recovery cannot be performed.

> Secondary MDS would be a 2-cpu Penguin Computing Altus-1300,
> which would also be an OSS.

Ditto.

> Are there basic conflicts-of-interest, and/or known/potential "gotchas" in
> utilizing hosts in such multi-purpose roles?

OSSes and MDSes require a kernel patched for Lustre. So you'd need to be able to either replace the kernel on those existing machines or patch the source you built it from.

Generally speaking, you are of course only going to get as much performance as the Lustre services on those shared nodes are able to get from the resources they share. Our usual recommendation is to dedicate OSS and MDS nodes for this reason, but there is no hard rule that you must provide dedicated nodes, so long as everything else on the nodes can live with the patched Lustre kernel. If you are already patching those kernels for something else, you could run into conflicts trying to patch them for Lustre.

b.
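For reference, installing one of the pre-built Lustre server kernels on a RHEL5 node looks roughly like the sketch below. The package names follow the usual kernel/lustre-ldiskfs/lustre-modules/lustre naming of the 1.8.x download area, but the version strings are only examples and will differ for whatever release is actually downloaded:

  # on each future MDS/OSS node, as root
  rpm -ivh kernel-2.6.18-*_lustre.1.8.4.x86_64.rpm \
           lustre-ldiskfs-*.x86_64.rpm lustre-modules-*.x86_64.rpm lustre-1.8.4-*.x86_64.rpm
  # make the new kernel the default in /boot/grub/grub.conf, reboot, then verify:
  uname -r    # should now report the *_lustre kernel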
Thanks very much!

On Fri, Sep 17, 2010 at 10:52 AM, Brian J. Murrell <brian.murrell at oracle.com> wrote:

> On Fri, 2010-09-17 at 10:42 -0800, Jonathan B. Horen wrote:
>
>> Background: Our OSTs
>
> OSSes. OSTs are the disks that an OSS provides object service with.

Yes, but... how, then, am I to view SAN storage devices? Sure, the disks are the OSTs, but these aren't JBODs hooked up to a host's SCSI/SATA/SAS backplane... they're already in RAID-6 arrays, with PVs, VGs, and LVs, holding real user data, which are managed by the NexSan/FalconStor software (on top of a Linux OS). Am I correct in thinking that these SAN storage devices would be networked to one or more OSSes? Admittedly, I find it somewhat confusing.

>> Primary MDS would be a 72-cpu IBM x3950m2, which would
>> also be an OSS.
>
> MDS and OSS on the same node is an unsupported configuration, due to the
> fact that if it fails you will have a "double failure" and recovery
> cannot be performed.
>
>> Secondary MDS would be a 2-cpu Penguin Computing Altus-1300,
>> which would also be an OSS.
>
> Ditto.
>
>> Are there basic conflicts-of-interest, and/or known/potential "gotchas" in
>> utilizing hosts in such multi-purpose roles?
>
> OSSes and MDSes require a kernel patched for Lustre. So you'd need to
> be able to either replace the kernel on those existing machines or patch
> the source you built it from.

Did I misunderstand that RHEL5 sports Lustre support already in the kernel?
On Fri, 2010-09-17 at 11:10 -0800, Jonathan B. Horen wrote:

> Thanks very much!

NP.

> Yes, but... how, then, am I to view SAN storage devices?

I'm not sure. Whatever you can do to present the devices to Linux as block devices. Lustre OSTs and MDTs are Linux block devices.

> they're already in RAID-6 arrays, with PVs, VGs, and LVs,

An LV (LVM, if that's what you are referring to) is a block device and can be used as an OST or MDT.

> Am I correct in thinking that these SAN storage devices would be networked
> to one or more OSSes?

They need to make themselves available as block device(s).

> Did I misunderstand that RHEL5 sports Lustre support already in the kernel?

Yes, I'm afraid you did. You will need either one of our pre-built kernels (we currently build for RHEL5, OEL5, SLES10 and SLES11) or to patch your own kernel with the patches.

b.
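To make that concrete, presenting an iSCSI LUN from one of the arrays to an OSS and formatting it as an OST would look roughly like the following sketch. The portal address, target IQN, MGS NID, filesystem name and device paths are placeholders invented for the example, not taken from the actual setup:

  # on the OSS: discover and log in to the array's iSCSI target, which makes
  # the exported LUN appear as an ordinary block device (e.g. /dev/sdb)
  iscsiadm -m discovery -t sendtargets -p 192.168.10.50
  iscsiadm -m node -T iqn.1999-02.com.nexsan:satabeast0 -p 192.168.10.50 --login

  # format the block device (or an LV carved out of it) as an OST, then start serving it
  mkfs.lustre --fsname=lsifs --ost --mgsnode=10.0.0.1@o2ib /dev/sdb
  mkdir -p /mnt/lustre/ost0
  mount -t lustre /dev/sdb /mnt/lustre/ost0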
On 2010-09-17, at 12:42, Jonathan B. Horen wrote:

> We're trying to architect a Lustre setup for our group, and want to leverage our available resources. In doing so, we've come to consider multi-purposing several hosts, so that they'll function simultaneously as MDS & OSS.

You can't do this and expect recovery to work in a robust manner. The reason is that the MDS is a client of the OSS, and if they are both on the same node that crashes, the OSS will wait for the MDS "client" to reconnect and will time out recovery of the real clients.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
On 2010-09-17, at 13:19, Brian J. Murrell wrote:

> On Fri, 2010-09-17 at 11:10 -0800, Jonathan B. Horen wrote:
>> Yes, but... how, then, am I to view SAN storage devices?
>
> I'm not sure. Whatever you can do to present the devices to Linux as
> block devices. Lustre OSTs and MDTs are Linux block devices.

I think the more informative answer is that "OST" is the Lustre name for the ext4-like filesystem on a Linux block device (regardless of what the underlying storage is). "OSS" is the Lustre name for the node to which these block devices are attached.

>> Am I correct in thinking that these SAN storage devices would be networked to one or more OSSes?
>
> They need to make themselves available as block device(s).

Yes, the SAN devices need to be attached to at least one OSS, but preferably two OSSes to provide high availability. We recommend against connecting the SAN devices to all of the OSS/MDS nodes, because this increases the configuration complexity and the risk of an administrative error, and provides no benefit.

>> Did I misunderstand that RHEL5 sports Lustre support already in the kernel?
>
> Yes, I'm afraid you did. You will need either one of our pre-built kernels (we
> currently build for RHEL5, OEL5, SLES10 and SLES11) or to patch
> your own kernel with the patches.

The patches are needed on the Lustre server kernel, but are not needed on the client.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
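A rough sketch of the dual-attached case described above, with hypothetical OSS NIDs (10.0.0.11/10.0.0.12 on o2ib) and a multipath device name invented for the example:

  # the same LUN is visible on both oss1 and oss2; format it once, declaring the failover partner
  mkfs.lustre --fsname=lsifs --ost --mgsnode=10.0.0.1@o2ib --failnode=10.0.0.12@o2ib /dev/mapper/ost0
  # in normal operation only oss1 mounts (serves) the target; oss2 mounts it only if oss1
  # fails, usually under the control of HA software
  mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0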
On Friday, September 17, 2010, Andreas Dilger wrote:
> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>> We're trying to architect a Lustre setup for our group, and want to
>> leverage our available resources. In doing so, we've come to consider
>> multi-purposing several hosts, so that they'll function simultaneously
>> as MDS & OSS.
>
> You can't do this and expect recovery to work in a robust manner. The
> reason is that the MDS is a client of the OSS, and if they are both on the
> same node that crashes, the OSS will wait for the MDS "client" to
> reconnect and will time out recovery of the real clients.

Well, that is some kind of design problem. Even on separate nodes it can easily happen that both MDS and OSS fail, for example during a power outage of the storage rack. In my experience, situations like that happen frequently...

I think some kind of pre-connection would be required, where a client can tell a server that it was rebooted and that the server shall not wait any longer for it. Actually, it shouldn't be that difficult, as different connection flags already exist. So if the client contacts a server and asks for an initial connection, the server could check for that NID and then immediately abort recovery for that client.

Cheers, Bernd
--
Bernd Schubert
DataDirect Networks
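(For what it's worth, the manual approximation of this today is for the administrator to abort recovery by hand once it's clear a missing client isn't coming back. A sketch; the device index is just an example of what the device list might show:)

  # on the server that is stuck in recovery: list local Lustre devices and their indexes
  lctl dl
  # abort recovery on a given target so it stops waiting for absent clients
  lctl --device 9 abort_recovery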
Hi, Bernd.

On 09/17/2010 02:48 PM, Bernd Schubert wrote:
> On Friday, September 17, 2010, Andreas Dilger wrote:
>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>> We're trying to architect a Lustre setup for our group, and want to
>>> leverage our available resources. In doing so, we've come to consider
>>> multi-purposing several hosts, so that they'll function simultaneously
>>> as MDS & OSS.
>>
>> You can't do this and expect recovery to work in a robust manner. The
>> reason is that the MDS is a client of the OSS, and if they are both on the
>> same node that crashes, the OSS will wait for the MDS "client" to
>> reconnect and will time out recovery of the real clients.
>
> Well, that is some kind of design problem. Even on separate nodes it can
> easily happen that both MDS and OSS fail, for example during a power outage
> of the storage rack. In my experience, situations like that happen frequently...

I think that just argues that the MDS should be on a separate UPS.

> I think some kind of pre-connection would be required, where a client can tell
> a server that it was rebooted and that the server shall not wait any
> longer for it. Actually, it shouldn't be that difficult, as different
> connection flags already exist. So if the client contacts a server and asks
> for an initial connection, the server could check for that NID and then
> immediately abort recovery for that client.
>
> Cheers, Bernd
Hello Cory,

On 09/17/2010 11:31 PM, Cory Spitz wrote:
> Hi, Bernd.
>
> On 09/17/2010 02:48 PM, Bernd Schubert wrote:
>> On Friday, September 17, 2010, Andreas Dilger wrote:
>>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>>> We're trying to architect a Lustre setup for our group, and want to
>>>> leverage our available resources. In doing so, we've come to consider
>>>> multi-purposing several hosts, so that they'll function simultaneously
>>>> as MDS & OSS.
>>>
>>> You can't do this and expect recovery to work in a robust manner. The
>>> reason is that the MDS is a client of the OSS, and if they are both on the
>>> same node that crashes, the OSS will wait for the MDS "client" to
>>> reconnect and will time out recovery of the real clients.
>>
>> Well, that is some kind of design problem. Even on separate nodes it can
>> easily happen that both MDS and OSS fail, for example during a power outage
>> of the storage rack. In my experience, situations like that happen frequently...
>
> I think that just argues that the MDS should be on a separate UPS.

Well, there is not only a single reason. The next hardware issue is that maybe an IB switch fails. And then we have also seen cascading Lustre failures: it starts with an LBUG on the OSS, which triggers another problem on the MDS... Also, for us this will actually become a real problem, which cannot be easily solved. So this issue will become a DDN priority.

Cheers, Bernd
--
Bernd Schubert
DataDirect Networks
Hi,

On Sep 17, 2010, at 14:49, Bernd Schubert wrote:
> Hello Cory,
>
> On 09/17/2010 11:31 PM, Cory Spitz wrote:
>> Hi, Bernd.
>>
>> On 09/17/2010 02:48 PM, Bernd Schubert wrote:
>>> On Friday, September 17, 2010, Andreas Dilger wrote:
>>>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>>>> We're trying to architect a Lustre setup for our group, and want to
>>>>> leverage our available resources. In doing so, we've come to consider
>>>>> multi-purposing several hosts, so that they'll function simultaneously
>>>>> as MDS & OSS.
>>>>
>>>> You can't do this and expect recovery to work in a robust manner. The
>>>> reason is that the MDS is a client of the OSS, and if they are both on the
>>>> same node that crashes, the OSS will wait for the MDS "client" to
>>>> reconnect and will time out recovery of the real clients.
>>>
>>> Well, that is some kind of design problem. Even on separate nodes it can
>>> easily happen that both MDS and OSS fail, for example during a power outage
>>> of the storage rack. In my experience, situations like that happen frequently...
>>
>> I think that just argues that the MDS should be on a separate UPS.
>
> Well, there is not only a single reason. The next hardware issue is that
> maybe an IB switch fails. And then we have also seen cascading Lustre
> failures: it starts with an LBUG on the OSS, which triggers another
> problem on the MDS... Also, for us this will actually become a real problem,
> which cannot be easily solved. So this issue will become a DDN priority.

There is always a possibility that multiple failures will occur, and this possibility can be reduced depending on one's resources. The point here is simply that a configuration with an MDS and OSS on the same node will guarantee multiple failures and aborted OSS recovery when that node fails.

cheers,
robert
Hi,

On Sep 17, 2010, at 12:48, Bernd Schubert wrote:
> On Friday, September 17, 2010, Andreas Dilger wrote:
>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>> We're trying to architect a Lustre setup for our group, and want to
>>> leverage our available resources. In doing so, we've come to consider
>>> multi-purposing several hosts, so that they'll function simultaneously
>>> as MDS & OSS.
>>
>> You can't do this and expect recovery to work in a robust manner. The
>> reason is that the MDS is a client of the OSS, and if they are both on the
>> same node that crashes, the OSS will wait for the MDS "client" to
>> reconnect and will time out recovery of the real clients.
>
> Well, that is some kind of design problem. Even on separate nodes it can
> easily happen that both MDS and OSS fail, for example during a power outage
> of the storage rack. In my experience, situations like that happen frequently...
>
> I think some kind of pre-connection would be required, where a client can tell
> a server that it was rebooted and that the server shall not wait any
> longer for it. Actually, it shouldn't be that difficult, as different
> connection flags already exist. So if the client contacts a server and asks
> for an initial connection, the server could check for that NID and then
> immediately abort recovery for that client.

This is an interesting idea, but the NID is not ideal, as it wouldn't be compatible with multiple mounts on the same node. Not very useful in production, perhaps, but very useful for testing. Another option would be to hash the mount point pathname (and some other data, such as the NID) and use this as the client uuid. Then the client uuid would be persistent across reboots, and the server would rely on flags to detect whether this was a reconnect or a new connection after a reboot or remount.

robert
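To illustrate the kind of identifier being proposed (only a sketch of the idea, with a placeholder mount point; this is not actual Lustre code):

  # a uuid derived from the mount point plus a local NID stays stable across reboots
  # of the same node/mount, unlike a randomly generated per-mount client uuid
  echo -n "/mnt/lustre $(lctl list_nids | head -n 1)" | md5sum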
Bernd Schubert wrote:
> Hello Cory,
>
> On 09/17/2010 11:31 PM, Cory Spitz wrote:
>> Hi, Bernd.
>>
>> On 09/17/2010 02:48 PM, Bernd Schubert wrote:
>>> On Friday, September 17, 2010, Andreas Dilger wrote:
>>>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>>>> We're trying to architect a Lustre setup for our group, and want to
>>>>> leverage our available resources. In doing so, we've come to consider
>>>>> multi-purposing several hosts, so that they'll function simultaneously
>>>>> as MDS & OSS.
>>>>
>>>> You can't do this and expect recovery to work in a robust manner. The
>>>> reason is that the MDS is a client of the OSS, and if they are both on the
>>>> same node that crashes, the OSS will wait for the MDS "client" to
>>>> reconnect and will time out recovery of the real clients.
>>>
>>> Well, that is some kind of design problem. Even on separate nodes it can
>>> easily happen that both MDS and OSS fail, for example during a power outage
>>> of the storage rack. In my experience, situations like that happen frequently...
>>
>> I think that just argues that the MDS should be on a separate UPS.

Or dual-redundant UPS devices driving all "critical infrastructure". Redundant power supplies are the norm for server-class hardware, and they should be cabled to different circuits (each of which needs to be sized to sustain the maximum power).

> Well, there is not only a single reason. The next hardware issue is that
> maybe an IB switch fails.

Sure, but that's also easy to address (in theory): put OSS nodes on different leaf switches than MDS nodes, and put the failover pairs on different switches as well. In practice, IB switches probably do not fail often enough to worry about recovery glitches, especially if they have redundant power, but I certainly recommend that failover partners be on different switch chips so that in case of a failure it is still possible to get the system up.

I would also recommend using bonded network interfaces to avoid cable-failure issues (i.e., connect both OSS nodes to both of the leaf switches, rather than one to each), but there are some outstanding issues with Lustre on IB bonding (patches in bugzilla), and of course multipath to disk (loss of connectivity to disk was mentioned at LUG as one of the biggest causes of Lustre issues). In general it is easier to have redundant cables than to ensure your HA package properly monitors cable status and does a failover when required.

> And then we have also seen cascading Lustre
> failures: it starts with an LBUG on the OSS, which triggers another
> problem on the MDS...

Yes, that's why bugs are fixed. panic_on_lbug may help stop the problem before it spreads, depending on the issue.

> Also, for us this will actually become a real problem, which cannot be
> easily solved. So this issue will become a DDN priority.
>
> Cheers, Bernd
> --
> Bernd Schubert
> DataDirect Networks
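For reference, turning that behaviour on is a one-liner on the servers; the exact /proc path and parameter spelling can vary between Lustre releases, so treat this as a sketch rather than a definitive recipe:

  # make an LBUG panic the node immediately instead of letting the failure cascade
  echo 1 > /proc/sys/lnet/panic_on_lbug
  # or, on releases where it is exposed through lctl:
  lctl set_param panic_on_lbug=1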