Hi,

I have a question about the --failnode directive of mkfs.lustre. I hope
someone can help me.

When a shared volume is formatted and mounted as below:

  [root@ossnode1 ~]# mkfs.lustre --ost --failnode=ossnode2 \
  > --mgsnode=mgsnode@tcp /dev/sda1
  [root@ossnode1 ~]# mount -t lustre /dev/sda1 /mnt/ost1

ossnode1 knows that sda1 can be accessed by ossnode2.

Then, if a failover occurs and ossnode2 mounts sda1, I think there is no
way for ossnode2 to know that sda1 can also be accessed by ossnode1.
Does this become a problem?

Best regards,
Kazuki Ohara
On Thu, 2007-10-25 at 20:58 +0900, Kazuki Ohara wrote:
> Then, if a failover occurs and ossnode2 mounts sda1,

ossnode2 should in fact, before it does the mount, make as sure as it
possibly can that ossnode1 does not have it mounted. The surefire way to
do that is to kill the power to ossnode1 (assuming there is a power
controller between the mains and ossnode1 that ossnode2 can operate).

In the failover game this is called STONITH, an acronym for "Shoot The
Other Node In The Head". All of this is usually coordinated with
something like Heartbeat (a minimal configuration sketch follows below).

The reason for this STONITH action is that in an HA scenario, ossnode2
only knows that it cannot reach ossnode1. It does not know why. It could
be because its power failed, it panicked, or any number of other
reasons. Not all of those reasons imply that ossnode1 cannot (and does
not) still have the disk mounted. Only by killing ossnode1 itself can
ossnode2 be absolutely sure that ossnode1 does not have the disk
mounted.

More than one node mounting an ext{2,3,4} or ldiskfs (which is ext4,
basically) filesystem is disastrous for that filesystem, so all measures
necessary to prevent that need to be taken.

> I think there is no way for ossnode2 to know that sda1 can also be
> accessed by ossnode1. Does this become a problem?

It does, hence the steps above.

b.
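For readers setting this up, here is a minimal sketch of a Heartbeat v1
configuration for such a pair. The hostnames, interface, power-switch
plugin, and its parameters are all assumptions for illustration, not
from this thread; consult the Heartbeat documentation for your actual
STONITH device:

  # /etc/ha.d/ha.cf -- hypothetical two-node OSS failover pair
  node ossnode1
  node ossnode2
  keepalive 2
  deadtime 30
  bcast eth1
  # Fence the peer through an external power switch before takeover;
  # the plugin name and its parameters here are purely illustrative.
  stonith_host * baytech 10.0.0.5 mylogin mysecretpassword

With a stonith_host line in place, Heartbeat will power off the dead
node before it fails any resources over to the survivor.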
Hi Brian,

Thank you for your answer. I understand the need for STONITH.

By the way, I have doubts about the need for the --failnode directive.
I walked through the Lustre source code, but I could not find any use of
the --failnode information except for building log messages. Are there
any more important uses of the --failnode directive?

Best regards,
Kazuki Ohara

--
Kazuki Ohara
Sony Computer Entertainment Inc.
Computer Development Div., Distributed OS Development Dept., Japan
On Fri, 2007-10-26 at 16:03 +0900, Kazuki Ohara wrote:
> Thank you for your answer.

NP.

> By the way, I have doubts about the need for the --failnode directive.

It's needed. That is how the MGS, and thus all other nodes, learn of an
OSS's failover partner. This information is communicated via the
mkfs.lustre command to the MGS.

b.
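As a concrete illustration (the NIDs here are assumptions), the failover
partner can be given as an explicit NID at format time, and the
parameters recorded on the target can be inspected afterwards; they are
registered with the MGS when the target first mounts:

  # Format the OST with an explicit failover NID (addresses assumed)
  [root@ossnode1 ~]# mkfs.lustre --ost --failnode=192.168.0.2@tcp \
      --mgsnode=192.168.0.10@tcp /dev/sda1

  # Print the parameters stored on the target; the failover NID
  # appears among them
  [root@ossnode1 ~]# tunefs.lustre --print /dev/sda1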
Brian J. Murrell wrote:
> It's needed. That is how the MGS, and thus all other nodes, learn of an
> OSS's failover partner. This information is communicated via the
> mkfs.lustre command to the MGS.

Uh... thank you for your answer, but I can't figure out the reason why
the MGS and OSSes need to learn of the failover partner. With that
information, does the MGS or an OSS ask the partner not to access the
shared volume, or make some other special request?

Excuse my persistent question.

Best regards,
Kazuki Ohara
On 10/29/07, Kazuki Ohara <ohara@rd.scei.sony.co.jp> wrote:
> I can't figure out the reason why the MGS and OSSes need to learn of
> the failover partner. With that information, does the MGS or an OSS
> ask the partner not to access the shared volume, or make some other
> special request?

I am sure someone who understands Lustre internals can tackle this
question better; however, from my understanding: the MGS keeps track of
which OSS is responsible for each OST. By creating a pair of OSS
systems, one is effectively delegating responsibility for the back-end
storage to a pair of OSS machines and ensuring a seamless failover by
redirecting client requests transparently.

The second part of your question is "how do you tell the standby OSS
not to access the volume?". The MGS is not going to direct client
requests for the OST in question to the standby node, hence the shared
or replicated OST need not be mounted or actively available on the
failover partner. In case a failover is required, the shared/replicated
storage device is mounted on the standby OSS (a sketch of the manual
steps follows below).

> Excuse my persistent question.

Hope the explanation helped -- I am sure CFS/Sun can clarify further.

-mustafa.
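A minimal sketch of the manual takeover sequence described above.
Hostnames, device, and the fencing parameters are assumptions carried
over from the earlier sketch; in practice Heartbeat automates both
steps:

  # On ossnode2, after ossnode1 stops responding:

  # 1. Fence the peer first -- never mount while ossnode1 might still
  #    be alive. "stonith" is the command-line tool shipped with the
  #    Heartbeat package; plugin and parameters are illustrative only.
  stonith -t baytech -p "10.0.0.5 mylogin mysecretpassword" ossnode1

  # 2. Only then take over the OST; clients will reconnect during
  #    recovery.
  mount -t lustre /dev/sda1 /mnt/ost1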
On Mon, 2007-10-29 at 16:58 +0900, Kazuki Ohara wrote:
> I can't figure out the reason why the MGS and OSSes need to learn of
> the failover partner.

Because the MGS is the centre (i.e. the configuration broker, if you
will) of how a cluster is configured. All nodes go to the MGS to get the
cluster configuration.

Strictly speaking, a given OST does not need to know who its failover
partner is. The --failnode in the mkfs.lustre on the OST is not there to
inform the OSS it's running on; rather, that information is sent (by
mkfs.lustre) to the MGS so that the rest of the cluster (who do need to
know) can learn it.

> With that information, does the MGS or an OSS ask the partner not to
> access the shared volume, or make some other special request?

No. You must keep in mind that exclusive access to the shared media is
absolutely required. I think you understand this, but it does bear
repeating.

You must never have both OSSes mount the same volume at the same time --
you will corrupt it. Lustre itself does not take care of this mounting
and unmounting, however. This is left to the operating environment that
Lustre is running in -- Linux -- and can be achieved in Linux using
Heartbeat (see the resource sketch below). Please see the manual for
more information on how Heartbeat achieves this.

b.
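To make that last point concrete, a hypothetical Heartbeat v1 resource
definition (device, mount point, and hostname are assumptions) that lets
Heartbeat, rather than Lustre, own the mount:

  # /etc/ha.d/haresources -- ossnode1 is the preferred owner of the
  # OST; Heartbeat mounts it there and moves it to ossnode2 only after
  # a successful STONITH of ossnode1.
  ossnode1 Filesystem::/dev/sda1::/mnt/ost1::lustre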
Hi, Brian

Thank you for answering so many times.

Brian J. Murrell wrote:
> Strictly speaking, a given OST does not need to know who its failover
> partner is. The --failnode in the mkfs.lustre on the OST is not there
> to inform the OSS it's running on; rather, that information is sent
> (by mkfs.lustre) to the MGS so that the rest of the cluster (who do
> need to know) can learn it.

If the OSSes themselves don't need to know the failover partner, who
does need to get and use the partner information held by the MGS?

> You must never have both OSSes mount the same volume at the same time
> -- you will corrupt it.

I'm sorry for making you worry. From your previous answer I understood
that Lustre does nothing to enforce exclusive access and that it must be
ensured by Heartbeat and the STONITH mechanism. I only wanted to give an
example of the kind of "special request" I had in mind. I am not good at
English, so I hope I have expressed what I wanted to say.

Best regards,
Kazuki Ohara
On Tue, 2007-10-30 at 15:14 +0900, Kazuki Ohara wrote:
> Hi, Brian

Hi,

> If the OSSes themselves don't need to know the failover partner,

Right.

> who does need to get and use the partner information held by the MGS?

The clients, of course. A client needs to know that for a given OST
"service", alternate paths (i.e. OSSes) can be used to reach it.

> From your previous answer I understood that Lustre does nothing to
> enforce exclusive access and that it must be ensured by Heartbeat and
> the STONITH mechanism.

OK. Good.

b.
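This is also visible on the client side. As an illustration (the
filesystem name and NIDs are assumptions), colon-separated NIDs in the
mount specification name failover partners for the MGS itself, and the
client tries them in turn:

  [root@client ~]# mount -t lustre \
      mgsnode@tcp:mgsnode2@tcp:/testfs /mnt/testfs

Failover NIDs for OSTs are learned from the MGS configuration rather
than given on the client command line.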
>> who does need to get and use the partner information held by the MGS?
>
> The clients, of course. A client needs to know that for a given OST
> "service", alternate paths (i.e. OSSes) can be used to reach it.

How are clients informed (i.e. what's the mechanism) when a given OSS or
MDS node stops responding ... or is that something a STONITH framework
would have to take care of?

Klaus
Klaus Steden wrote:
> How are clients informed (i.e. what's the mechanism) when a given OSS
> or MDS node stops responding ... or is that something a STONITH
> framework would have to take care of?

The client request will time out, and the client will retry on the
alternate path. There is no notification process.

cliffw
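In other words, failover is driven entirely by client-side timeouts. As
a rough sketch (the exact /proc path and default vary between Lustre
versions, so treat this as an assumption), the governing RPC timeout can
be inspected on a client:

  # Seconds a client waits before declaring a server dead and trying
  # the failover NID (path and value illustrative)
  [root@client ~]# cat /proc/sys/lustre/timeout
  100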
Brian J. Murrell wrote:
> The clients, of course. A client needs to know that for a given OST
> "service", alternate paths (i.e. OSSes) can be used to reach it.

Oh... exactly! I should have noticed that. My question is completely
cleared up. Thank you so much for the long discussion!

Best regards,
Kazuki Ohara