I have 60 nodes to use as OSSes, and I have done an experiment: I used a disk (iSCSI) as an OST. If I do not define the failnodes in the mkfs.lustre command, I can mount this OST, and on the client node lfs df -h can see this OST; but when I umount it and mount the OST on another OSS, lfs df -h cannot see it any more. If I do define the failnodes in the mkfs.lustre command and repeat the same steps, the client can still see the OST with lfs df -h.

So my question is: if I want an OST to be able to fail over to any OSS (any one of the sixty nodes), do I need to define 60 failnodes when I format the disk? Or can I use Pacemaker to select an OSS and modify something to notify the clients that the disk is now on that OSS?
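[For reference, the second (working) experiment described above might look roughly like the sketch below; the NIDs, fsname, device and mount point are made up for illustration.]

  # format the iSCSI disk as an OST, declaring the second OSS as a failnode
  mkfs.lustre --ost --fsname=testfs --mgsnode=10.0.0.1@tcp \
      --failnode=10.0.0.12@tcp /dev/sdb

  # on OSS1 (10.0.0.11): mount the OST
  mount -t lustre /dev/sdb /mnt/ost0

  # on a client: the OST is listed
  lfs df -h

  # move the OST: umount on OSS1, then mount the same device on OSS2 (10.0.0.12)
  umount /mnt/ost0                       # on OSS1
  mount -t lustre /dev/sdb /mnt/ost0     # on OSS2
  lfs df -h                              # on the client: the OST is still visible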
On Mon, 2009-11-09 at 16:25 +0800, lelustre wrote:
> I have 60 nodes to use as OSSes, and I have done an experiment: I used
> a disk (iSCSI) as an OST. If I do not define the failnodes in the
> mkfs.lustre command, I can mount this OST, and on the client node
> lfs df -h can see this OST; but when I umount it and mount the OST on
> another OSS, lfs df -h cannot see it any more.

Right. That is because you have to tell the client which other nodes might make that OST available so that it can find the one actually serving it. If you don't give the client any alternate nodes, it doesn't know about any other nodes and won't try any but the one node the OST was configured on.

> If I do define the failnodes in the mkfs.lustre command and repeat the
> same steps, the client can still see the OST with lfs df -h.

Right.

> So my question is: if I want an OST to be able to fail over to any OSS
> (any one of the sixty nodes), do I need to define 60 failnodes when I
> format the disk?

Theoretically, yes. I discussed this briefly with another engineer a while ago and, IIRC, the conclusion was that there is nothing inherent in the configuration logic that prevents more than two ("primary" and "failover") OSSes from providing service to an OST. Two nodes per OST is how just about everyone who wants failover configures Lustre, though.

I'm not really sure that 60 failover nodes for every OST is practical. When an OSS does fail, the process of finding the OST on a failover node is serial and linear: the client cycles through the OST's failover list, trying each OSS in turn, until it finds the OST. The time given to each discovery attempt is not trivial (i.e. it is not just a few seconds), so hunting through 60 of them will take considerable time.

> Or can I use Pacemaker to select an OSS and modify something to notify
> the clients that the disk is now on that OSS?

No. There is currently no way to push a client towards a particular OSS for a given OST.

b.
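[To make the "theoretically" concrete: the 60-failnode idea would amount to something like the sketch below, assuming mkfs.lustre accepts a repeated --failnode option; the NIDs, fsname and device are made up. The caveat above about the serial search through the failover list still applies.]

  # build one --failnode option per candidate OSS (purely illustrative)
  FAILNODES=""
  for i in $(seq 11 70); do
      FAILNODES="$FAILNODES --failnode=10.0.0.$i@tcp"
  done
  mkfs.lustre --ost --fsname=testfs --mgsnode=10.0.0.1@tcp $FAILNODES /dev/sdb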
On Monday 09 November 2009, Brian J. Murrell wrote:
> Theoretically, yes. I discussed this briefly with another engineer a
> while ago and, IIRC, the conclusion was that there is nothing inherent
> in the configuration logic that prevents more than two ("primary" and
> "failover") OSSes from providing service to an OST. Two nodes per OST
> is how just about everyone who wants failover configures Lustre,
> though.

Not everyone ;) And in particular, it doesn't make sense to stick to a two-node failover scheme with Pacemaker:

https://bugzilla.lustre.org/show_bug.cgi?id=20964

--
Bernd Schubert
DataDirect Networks
On Monday, 9 November 2009 16:36:15, Bernd Schubert wrote:
> On Monday 09 November 2009, Brian J. Murrell wrote:
> > Theoretically, yes. [...] Two nodes per OST is how just about
> > everyone who wants failover configures Lustre, though.
>
> Not everyone ;) And in particular, it doesn't make sense to stick to a
> two-node failover scheme with Pacemaker:
>
> https://bugzilla.lustre.org/show_bug.cgi?id=20964

The problem is that Pacemaker does not understand the applications it clusters. Pacemaker is made to provide high availability for ANY service, not only for a cluster FS. So if you want to pin a resource (i.e. FS1) to a particular node, you have to add a location constraint. But this contradicts the logic of Pacemaker a little bit: why should a resource run on this node if all nodes are equal?

Basically I had the same problem with my Lustre cluster, and I used the following solution:

- add colocation constraints so that the filesystems prefer not to run on the same node (see the sketch after this message).

And theoretically, with OpenAIS as the cluster stack, the number of nodes is no longer limited to 16 as it was with Heartbeat. You can build larger clusters.

Greetings,

--
Dr. Michael Schwartzkopff
MultiNET Services GmbH
mail: misch at multinet.de
web: www.multinet.de
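[A minimal sketch of such a colocation constraint in the Pacemaker crm shell, assuming two OST mounts managed with the ocf:heartbeat:Filesystem agent; the resource names, device labels, mount points and the -100 score are all made up for illustration. These lines would be entered via "crm configure".]

  # two OST mounts managed as Filesystem resources
  primitive ost0 ocf:heartbeat:Filesystem \
      params device="/dev/disk/by-label/testfs-OST0000" directory="/mnt/ost0" fstype="lustre" \
      op monitor interval="120s" timeout="60s"
  primitive ost1 ocf:heartbeat:Filesystem \
      params device="/dev/disk/by-label/testfs-OST0001" directory="/mnt/ost1" fstype="lustre" \
      op monitor interval="120s" timeout="60s"

  # negative colocation: the two OSTs prefer not to run on the same OSS
  colocation ost0-apart-from-ost1 -100: ost0 ost1

[A negative score tells Pacemaker to keep the two resources apart when possible, without hard-pinning either one to a specific node.]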
On 2009-11-09, at 08:31, Brian J. Murrell wrote:
> On Mon, 2009-11-09 at 16:25 +0800, lelustre wrote:
>> So my question is: if I want an OST to be able to fail over to any
>> OSS (any one of the sixty nodes), do I need to define 60 failnodes
>> when I format the disk?
>
> I'm not really sure that 60 failover nodes for every OST is practical.
> When an OSS does fail, the process of finding the OST on a failover
> node is serial and linear: the client cycles through the OST's
> failover list, trying each OSS in turn, until it finds the OST. The
> time given to each discovery attempt is not trivial, so hunting
> through 60 of them will take considerable time.
>
>> Or can I use Pacemaker to select an OSS and modify something to
>> notify the clients that the disk is now on that OSS?
>
> No. There is currently no way to push a client towards a particular
> OSS for a given OST.

That is what the "Imperative Recovery" feature is for: having the failover server notify the client that it has taken over an OST/MDT filesystem, rather than waiting for the client to time out its RPC and poke around trying to find which of the failover servers is controlling the OST/MDT.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Hi,

First, thanks everyone. I have thought of an idea: I use Pacemaker for HA, and the iSCSI method to find the SAN disk. The HA service works like this: when an OSS fails, Pacemaker selects another OSS, and the resource agent script on the selected OSS discovers the OST disk and mounts it on a directory. Then, from the script, I run "tunefs.lustre --writeconf <mount point>" on the MDT via pdsh (Lustre manual: changing a server NID), so the client can know where the OST is.

But I really do not know whether writeconf damages the data or the filesystem? I have not tested the idea yet; it is only an idea.
On Tue, 2009-11-10 at 14:13 +0800, lelustre wrote:
> Hi,

Hi,

> The HA service works like this: when an OSS fails, Pacemaker selects
> another OSS, and the resource agent script on the selected OSS
> discovers the OST disk and mounts it on a directory.

Only one of the OSSes that has been configured as a failover server for the OST should mount the OST, or the clients won't be able to find it.

> Then, from the script, I run "tunefs.lustre --writeconf <mount point>"
> on the MDT via pdsh (Lustre manual: changing a server NID), so the
> client can know where the OST is.

No. DO NOT do this. Please don't try to re-invent how Lustre failover works.

> But I really do not know whether writeconf damages the data or the
> filesystem?

You should not use writeconf in this manner. I believe the instructions you are referring to (changing a server NID) explicitly say that you must shut down the entire filesystem before you do any writeconfs, and then you must bring all of the servers back up before you bring any clients up. That is a lot more traumatic for the users than simply configuring failover the way it's supposed to work.

b.
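[For contrast, a minimal sketch of failover "the way it's supposed to work", assuming the OST was formatted with the partner OSS's NID given as a failnode; the device label and mount point are hypothetical.]

  # on the surviving OSS (a NID given via --failnode when the OST was formatted):
  mount -t lustre /dev/disk/by-label/testfs-OST0000 /mnt/ost0

  # on a client, once recovery completes, the OST simply reappears:
  lfs df -h

[No writeconf is needed: the clients already know the failover NID from the target's configuration and reconnect on their own.]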