I have two machines I am setting up as my first MDS failover pair. The two
Sun X4100s are connected to an FC disk array, and I have set up Heartbeat
with IPMI for STONITH.

The problem is that when I run a test on the host that currently has the
MDS/MGS mounted ('killall -9 heartbeat'), I see the IPMI shutdown, and
when the second X4100 tries to mount the filesystem it hits a kernel panic.

Has anyone else seen this behavior? Is there something I am running into?
If I do an 'hb_takeover' or shut down Heartbeat cleanly, all is well; only
if I simulate Heartbeat failing does this happen. Note I have not tried
yanking power yet, but I wanted to simulate an MDS in a semi-dead state
and ran into this.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
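For context, the test being described boils down to roughly the following
commands; which node they run on is not spelled out in the thread, so the
comments here are only the usual convention:

   # clean failover -- both of these behave fine in the testing described above
   hb_takeover                  # typically run on the node that should acquire the resources
   service heartbeat stop       # clean shutdown on the node currently serving the MDS/MGS

   # simulated heartbeat failure on the serving node -- this is the case that
   # ends with a kernel panic on the peer when it tries to mount the MDT/MGS
   killall -9 heartbeat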
On Thu, 2008-07-31 at 16:57 -0400, Brock Palen wrote:
> The problem is that when I run a test on the host that currently has the
> MDS/MGS mounted ('killall -9 heartbeat'), I see the IPMI shutdown, and
> when the second X4100 tries to mount the filesystem it hits a kernel panic.

We'd need to see the *full* panic info to do any amount of diagnostics.

b.
Hi Brock,

I've been using Sun X2200s with Lustre in a similar configuration (IPMI,
STONITH, Linux-HA, FC storage) and haven't had any issues like this
(although I would typically panic the primary node during testing using
sysrq)... is the behaviour consistent?

Klaus

On 7/31/08 1:57 PM, "Brock Palen" <brockp at umich.edu> did etch on stone
tablets:

> I have two machines I am setting up as my first MDS failover pair. The two
> Sun X4100s are connected to an FC disk array, and I have set up Heartbeat
> with IPMI for STONITH.
>
> The problem is that when I run a test on the host that currently has the
> MDS/MGS mounted ('killall -9 heartbeat'), I see the IPMI shutdown, and
> when the second X4100 tries to mount the filesystem it hits a kernel panic.
>
> Has anyone else seen this behavior? Is there something I am running into?
> If I do an 'hb_takeover' or shut down Heartbeat cleanly, all is well; only
> if I simulate Heartbeat failing does this happen. Note I have not tried
> yanking power yet, but I wanted to simulate an MDS in a semi-dead state
> and ran into this.
What's a good tool to grab this? It's more than one page long, and the
machine does not have serial ports. Links are ok.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On Jul 31, 2008, at 5:14 PM, Brian J. Murrell wrote:

> We'd need to see the *full* panic info to do any amount of diagnostics.
On Thursday 31 July 2008 17:22:28 Brock Palen wrote:
> What's a good tool to grab this? It's more than one page long, and the
> machine does not have serial ports.

If your servers do IPMI, you can probably configure Serial-over-LAN to get
a console and capture the logs.

But a way more convenient solution is netdump. As long as the network
connection is working on the panicking machine, you should be able to
transmit the kernel panic info, as well as a stack trace, to a netdump
server, which will store it in a file.

See http://www.redhat.com/support/wpapers/redhat/netdump/

Cheers,
--
Kilian
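For the archives, roughly what both options look like on RHEL-style systems.
The hostnames, addresses and credentials below are placeholders, and the
netdump variable and service names are from memory of the RHEL-era packages,
so treat this as a sketch and check the Red Hat doc above for the details:

   # Serial-over-LAN via IPMI (assumes SOL is enabled in the BMC;
   # BMC address and user are placeholders)
   ipmitool -I lanplus -H mds1-bmc -U admin sol activate

   # netdump, client (crashing) side: point it at the collecting server
   echo 'NETDUMPADDR=10.0.0.50' >> /etc/sysconfig/netdump
   service netdump propagate      # pushes ssh keys to the netdump server
   service netdump start

   # netdump, server side: panic output and cores land under /var/crash/
   service netdump-server start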
netdump is indeed good for this, but you may have to take two or three
cracks at it... it doesn't always dump the complete core image, and you
can't really do a whole lot with the incomplete version.

Klaus

On 7/31/08 5:50 PM, "Kilian CAVALOTTI" <kilian at stanford.edu> did etch on
stone tablets:

> But a way more convenient solution is netdump. As long as the network
> connection is working on the panicking machine, you should be able to
> transmit the kernel panic info, as well as a stack trace, to a netdump
> server, which will store it in a file.
Yes, it is consistent. I looked up how to induce a panic using sysrq:

   echo c > /proc/sysrq-trigger

That works: the machine cycles, the second node takes over, and all is
well. If instead of crashing the node I run 'killall -9 heartbeat', I get
the panic every time. I even edited the external/ipmi script from
'power reset' to 'power cycle'; it didn't help.

It's kind of unstable: if heartbeat dies, the whole MDS/MGS server setup
locks up, but if the server panics I will be ok. I don't like this spot.

I am looking at grabbing a crash dump. I think it's a race: heartbeat is
mounting the filesystems before the first node is totally dead.

Does it hurt to run MMP on the MGS filesystem also?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On Jul 31, 2008, at 5:28 PM, Klaus Steden wrote:

> I've been using Sun X2200s with Lustre in a similar configuration (IPMI,
> STONITH, Linux-HA, FC storage) and haven't had any issues like this
> (although I would typically panic the primary node during testing using
> sysrq)... is the behaviour consistent?
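One note for anyone repeating this test: the sysrq trigger only does
anything if sysrq is enabled, so the crash-the-active-node test is
typically a two-step sequence along these lines:

   # make sure sysrq is enabled (or set kernel.sysrq = 1 in /etc/sysctl.conf)
   echo 1 > /proc/sys/kernel/sysrq

   # immediately panic the node, exercising the "dead MDS" failover path
   echo c > /proc/sysrq-trigger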
On Fri, 2008-08-01 at 11:39 -0400, Brock Palen wrote:
> I am looking at grabbing a crash dump. I think it's a race: heartbeat is
> mounting the filesystems before the first node is totally dead.

Just to be clear, heartbeat should _always_ STONITH the peer node before
doing any mount of a Lustre device.

b.
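For reference, a minimal sketch of a v1-style Heartbeat configuration that
enforces that ordering. Node names, BMC addresses, credentials and device
paths are placeholders, and the parameter order for the external/ipmi
plugin is from memory ('stonith -t external/ipmi -h' prints the
authoritative list), so treat this as a starting point, not a drop-in
config:

   # /etc/ha.d/ha.cf
   node          mds1 mds2
   deadtime      30
   auto_failback off

   # fence the peer through its BMC; heartbeat should not take over resources
   # from a dead peer until the STONITH operation reports success
   stonith_host mds1 external/ipmi mds2 192.168.100.12 admin ipmipass lan
   stonith_host mds2 external/ipmi mds1 192.168.100.11 admin ipmipass lan

   # /etc/ha.d/haresources -- mds1 is the preferred holder of the MGS/MDT mount
   mds1 Filesystem::/dev/mapper/mgs_mdt::/mnt/mgs_mdt::lustre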
On Aug 01, 2008 11:39 -0400, Brock Palen wrote:
> I am looking at grabbing a crash dump. I think it's a race: heartbeat is
> mounting the filesystems before the first node is totally dead.
>
> Does it hurt to run MMP on the MGS filesystem also?

It should be fine to run MMP on any ldiskfs filesystem, even if not in
failover mode. The only drawback is a minor delay (20s or something) in
mounting and in e2fsck startup.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
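For anyone finding this thread later: enabling and checking MMP on an
existing, unmounted ldiskfs target looks roughly like the following. The
device path is a placeholder, and this assumes the Lustre-patched
e2fsprogs, which is where the mmp feature lives:

   # enable multiple-mount protection on an unmounted MDT/MGS device
   tune2fs -O mmp /dev/mapper/mgs_mdt

   # confirm the feature flag and the MMP update interval
   dumpe2fs -h /dev/mapper/mgs_mdt | grep -i mmp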