Hi,

I have a question about the --failnode directive of mkfs.lustre. I hope
someone can help me.

When a shared volume is formatted and mounted as below:

  [root@ossnode1 ~]# mkfs.lustre --ost --failnode=ossnode2 \
  > --mgsnode=mgsnode@tcp /dev/sda1
  [root@ossnode1 ~]# mount -t lustre /dev/sda1 /mnt/ost1

ossnode1 knows that sda1 can be accessed by ossnode2.

Then, if a failover occurs and ossnode2 mounts sda1, I think there is no
way for ossnode2 to know that sda1 can also be accessed by ossnode1.
Does this become a problem?

Best regards,
Kazuki Ohara
On Thu, 2007-10-25 at 20:58 +0900, Kazuki Ohara wrote:
> Then, if a failover occurs and ossnode2 mounts sda1,

ossnode2 should in fact, before it does the mount, make as sure as it
possibly can that ossnode1 does not have it mounted. The surefire way to
do that is to kill the power to ossnode1 (assuming there is a power
controller between the mains and ossnode1 that ossnode2 can operate).

In the failover game this is called STONITH, an acronym for "Shoot The
Other Node In The Head". All of this is usually coordinated with
something like Heartbeat (a minimal configuration sketch follows below).

The reason for this STONITH action is that in an HA scenario, ossnode2
only knows that it cannot reach ossnode1. It does not know why. It could
be because its power failed, it panicked, or any number of other
reasons. Not all of those reasons imply that ossnode1 cannot (and does
not) still have the disk mounted. Only by killing ossnode1 itself can
ossnode2 be absolutely sure that ossnode1 does not have the disk
mounted.

More than one node mounting an ext{2,3,4} or ldiskfs (which is ext4,
basically) filesystem is disastrous for that filesystem, so all measures
necessary to prevent that need to be taken.

> I think there is no way for ossnode2 to know that sda1 can also be
> accessed by ossnode1. Does this become a problem?

It does, hence the steps above.

b.
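For readers setting this up, here is a minimal sketch of a Heartbeat v1
configuration for such a pair. The hostnames, interface, power-switch
plugin, and its parameters are all assumptions for illustration, not
from this thread; consult the Heartbeat documentation for your actual
STONITH device:

  # /etc/ha.d/ha.cf -- hypothetical two-node OSS failover pair
  node ossnode1
  node ossnode2
  keepalive 2
  deadtime 30
  bcast eth1
  # Fence the peer through an external power switch before takeover;
  # the plugin name and its parameters here are purely illustrative.
  stonith_host * baytech 10.0.0.5 mylogin mysecretpassword

With a stonith_host line in place, Heartbeat will power off the dead
node before it fails any resources over to the survivor.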
Hi Brian,

Thank you for your answer. I understand the need for STONITH.

By the way, I have doubts about the need for the --failnode directive.
I walked through the Lustre source code, but I could not find any use of
the --failnode information except for building log messages. Are there
any more important uses of the --failnode directive?

Best regards,
Kazuki Ohara

--
Kazuki Ohara
Sony Computer Entertainment Inc.
Computer Development Div., Distributed OS Development Dept., Japan
On Fri, 2007-10-26 at 16:03 +0900, Kazuki Ohara wrote:
> Thank you for your answer.

NP.

> By the way, I have doubts about the need for the --failnode directive.

It's needed. That is how the MGS, and thus all other nodes, learn of an
OSS's failover partner. This information is communicated via the
mkfs.lustre command to the MGS.

b.
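As a concrete illustration (the NIDs here are assumptions), the failover
partner can be given as an explicit NID at format time, and the
parameters recorded on the target can be inspected afterwards; they are
registered with the MGS when the target first mounts:

  # Format the OST with an explicit failover NID (addresses assumed)
  [root@ossnode1 ~]# mkfs.lustre --ost --failnode=192.168.0.2@tcp \
      --mgsnode=192.168.0.10@tcp /dev/sda1

  # Print the parameters stored on the target; the failover NID
  # appears among them
  [root@ossnode1 ~]# tunefs.lustre --print /dev/sda1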
Brian J. Murrell wrote:
> It's needed. That is how the MGS, and thus all other nodes, learn of an
> OSS's failover partner. This information is communicated via the
> mkfs.lustre command to the MGS.

Uh... thank you for your answer, but I can't figure out the reason why
the MGS and OSSes need to learn of the failover partner. With that
information, does the MGS or an OSS ask the partner not to access the
shared volume, or make some other special request?

Excuse my persistent question.

Best regards,
Kazuki Ohara
On 10/29/07, Kazuki Ohara <ohara@rd.scei.sony.co.jp> wrote:
> I can't figure out the reason why the MGS and OSSes need to learn of
> the failover partner. With that information, does the MGS or an OSS
> ask the partner not to access the shared volume, or make some other
> special request?

I am sure someone who understands Lustre internals can tackle this
question better; however, from my understanding: the MGS keeps track of
which OSS is responsible for each OST. By creating a pair of OSS
systems, one is effectively delegating responsibility for the back-end
storage to a pair of OSS machines and ensuring a seamless failover by
redirecting client requests transparently.

The second part of your question is "how do you tell the standby OSS
not to access the volume?". The MGS is not going to direct client
requests for the OST in question to the standby node, hence the shared
or replicated OST need not be mounted or actively available on the
failover partner. In case a failover is required, the shared/replicated
storage device is mounted on the standby OSS (a sketch of the manual
steps follows below).

> Excuse my persistent question.

Hope the explanation helped -- I am sure CFS/Sun can clarify further.

-mustafa.
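A minimal sketch of the manual takeover sequence described above.
Hostnames, device, and the fencing parameters are assumptions carried
over from the earlier sketch; in practice Heartbeat automates both
steps:

  # On ossnode2, after ossnode1 stops responding:

  # 1. Fence the peer first -- never mount while ossnode1 might still
  #    be alive. "stonith" is the command-line tool shipped with the
  #    Heartbeat package; plugin and parameters are illustrative only.
  stonith -t baytech -p "10.0.0.5 mylogin mysecretpassword" ossnode1

  # 2. Only then take over the OST; clients will reconnect during
  #    recovery.
  mount -t lustre /dev/sda1 /mnt/ost1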
On Mon, 2007-10-29 at 16:58 +0900, Kazuki Ohara wrote:
> I can't figure out the reason why the MGS and OSSes need to learn of
> the failover partner.

Because the MGS is the centre (i.e. the configuration broker, if you
will) of how a cluster is configured. All nodes go to the MGS to get the
cluster configuration.

Strictly speaking, a given OST does not need to know who its failover
partner is. The --failnode in the mkfs.lustre on the OST is not there to
inform the OSS it's running on; rather, that information is sent (by
mkfs.lustre) to the MGS so that the rest of the cluster (who do need to
know) can learn it.

> With that information, does the MGS or an OSS ask the partner not to
> access the shared volume, or make some other special request?

No. You must keep in mind that exclusive access to the shared media is
absolutely required. I think you understand this, but it does bear
repeating.

You must never have both OSSes mount the same volume at the same time --
you will corrupt it. Lustre itself does not take care of this mounting
and unmounting, however. This is left to the operating environment that
Lustre is running in -- Linux -- and can be achieved in Linux using
Heartbeat (see the resource sketch below). Please see the manual for
more information on how Heartbeat achieves this.

b.
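To make that last point concrete, a hypothetical Heartbeat v1 resource
definition (device, mount point, and hostname are assumptions) that lets
Heartbeat, rather than Lustre, own the mount:

  # /etc/ha.d/haresources -- ossnode1 is the preferred owner of the
  # OST; Heartbeat mounts it there and moves it to ossnode2 only after
  # a successful STONITH of ossnode1.
  ossnode1 Filesystem::/dev/sda1::/mnt/ost1::lustre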
Hi, Brian

Thank you for answering so many times.

Brian J. Murrell wrote:
> Strictly speaking, a given OST does not need to know who its failover
> partner is. The --failnode in the mkfs.lustre on the OST is not there
> to inform the OSS it's running on; rather, that information is sent
> (by mkfs.lustre) to the MGS so that the rest of the cluster (who do
> need to know) can learn it.

If the OSSes themselves don't need to know the failover partner, who
does need to get and use the partner information held by the MGS?

> You must never have both OSSes mount the same volume at the same time
> -- you will corrupt it.

I'm sorry for making you worry. From your previous answer I understood
that Lustre does nothing to enforce exclusive access and that it must be
ensured by Heartbeat and the STONITH mechanism. I only wanted to give an
example of the kind of "special request" I had in mind. I am not good at
English, so I hope I have expressed what I wanted to say.

Best regards,
Kazuki Ohara
On Tue, 2007-10-30 at 15:14 +0900, Kazuki Ohara wrote:
> Hi, Brian

Hi,

> If the OSSes themselves don't need to know the failover partner,

Right.

> who does need to get and use the partner information held by the MGS?

The clients, of course. A client needs to know that for a given OST
"service", alternate paths (i.e. OSSes) can be used to reach it.

> From your previous answer I understood that Lustre does nothing to
> enforce exclusive access and that it must be ensured by Heartbeat and
> the STONITH mechanism.

OK. Good.

b.
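This is also visible on the client side. As an illustration (the
filesystem name and NIDs are assumptions), colon-separated NIDs in the
mount specification name failover partners for the MGS itself, and the
client tries them in turn:

  [root@client ~]# mount -t lustre \
      mgsnode@tcp:mgsnode2@tcp:/testfs /mnt/testfs

Failover NIDs for OSTs are learned from the MGS configuration rather
than given on the client command line.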
>> who does need to get and use the partner information held by the MGS?
>
> The clients, of course. A client needs to know that for a given OST
> "service", alternate paths (i.e. OSSes) can be used to reach it.

How are clients informed (i.e. what's the mechanism) when a given OSS or
MDS node stops responding ... or is that something a STONITH framework
would have to take care of?

Klaus
Klaus Steden wrote:
> How are clients informed (i.e. what's the mechanism) when a given OSS
> or MDS node stops responding ... or is that something a STONITH
> framework would have to take care of?

The client request will time out, and the client will retry on the
alternate path. There is no notification process.

cliffw
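In other words, failover is driven entirely by client-side timeouts. As
a rough sketch (the exact /proc path and default vary between Lustre
versions, so treat this as an assumption), the governing RPC timeout can
be inspected on a client:

  # Seconds a client waits before declaring a server dead and trying
  # the failover NID (path and value illustrative)
  [root@client ~]# cat /proc/sys/lustre/timeout
  100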
Brian J. Murrell wrote:
> The clients, of course. A client needs to know that for a given OST
> "service", alternate paths (i.e. OSSes) can be used to reach it.

Oh... exactly! I should have noticed that. My question is completely
cleared up. Thank you so much for the long discussion!

Best regards,
Kazuki Ohara