Hello,

We have a 4-node Lustre cluster that provides a parallel file system for our
192-node compute cluster. The Lustre nodes run CentOS 4.5, x86_64 (Intel
4000-series CPUs), on HP DL360 G5 hardware. The compute cluster that uses it
is ROCKS 4.2.1, on the same hardware. Our network is Gigabit Ethernet, using
the bnx2 driver. The Lustre layout is:

storage-0-0: mgs+mdt, ost0, ost1 (backup)
storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
storage-0-2: ost2, ost3 (backup)
storage-0-3: ost2 (backup), ost3

We're using heartbeat 2.0.8 from the pre-built CentOS RPMs. Every backup
resource is configured so that it never runs at the same time as its primary.
Note that we have flock and quota enabled on Lustre.

The problem we have right now is that some of the nodes panic at random,
roughly once every week or two. We tolerate this crudely by setting
kernel.panic=60 and hoping that the backup node does not fail within that
window. So far this has worked well enough that, judging by user feedback,
users do not even notice the file system failing: the backup node takes over
the OST, runs recovery for about 250 seconds, and then everything is back to
normal.

In any case, we are trying to nail down why the file system panics. I realize
the information above will not suffice to track down the cause. Could someone
suggest a way to debug this, or to dump useful information that I can send to
the list for later analysis? Also, is the "RECOVERING" phase enough to leave
the file system in a stable state, or do we need to shut down the whole
system and run e2fsck + lfsck?

Finally, after every panic, the quota that was enabled ends up disabled
("lfs quota <user> /fs" yields "No such process"), and I have to run quotaoff
and then quotaon again. It seems that quota is not turned back on when the
OST comes up. Is there a way to keep it always on?

Thank you very much in advance

--
Somsak Sriprayoonsakul
Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
somsak_sr at thaigrid.or.th
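(For reference, the workaround described above boils down to roughly the
following; a minimal sketch, assuming the file system is mounted at /fs as in
the message, that both user and group quotas are in use, and with "someuser"
as an illustrative name:)

    # reboot automatically 60 seconds after a panic, so the HA partner
    # is only alone for a bounded window
    sysctl -w kernel.panic=60
    echo 'kernel.panic = 60' >> /etc/sysctl.conf   # persist across reboots

    # after a failover, quota comes back disabled; cycle it by hand
    lfs quotaoff -ug /fs
    lfs quotaon -ug /fs
    lfs quota -u someuser /fs   # should no longer return "No such process"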
Somsak,

Did you build your own bnx2 driver? I was getting kernel panics when hitting
a certain load with Dell 1950s, which also use the bnx2 driver. My solution
was to grab the bnx2 source code and build it under the Lustre kernel. If you
search the mailing list you'll find the threads dealing with this.

If you see bnx2 mentioned in your kernel panic output, then it's probably the
cause.

Thanks,

Matt

On 26/11/2007, Somsak Sriprayoonsakul <somsak_sr at thaigrid.or.th> wrote:
> [...]
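(For anyone wanting to try the same fix: the rebuild Matt describes looks
roughly like the following. This is a sketch, not his exact steps; the
tarball name, version, and Makefile variables are illustrative and depend on
the driver source you download from Broadcom:)

    # on the OSS, booted into the kernel-lustre kernel
    tar xzf bnx2-1.6.7c.tar.gz         # version is illustrative
    cd bnx2-1.6.7c/src
    make KVER=$(uname -r)              # build against the running Lustre kernel
    make KVER=$(uname -r) install      # install bnx2.ko under /lib/modules
    depmod -a
    # reload from the physical console, not over the NIC being replaced
    modprobe -r bnx2 && modprobe bnx2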
No. I use the stock bnx2 driver from the latest pre-built
kernel-lustre-1.6.3.

I forgot to mention the oops: it's something about Lustre
(lustre_blah_blah_blah something).

All the other nodes also use bnx2, and there's no problem with them at all.

Matt wrote:
> Did you build your own bnx2 driver? [...]
Hi,

How many clients (compute nodes) do you have in your cluster? What is
crashing randomly: clients, OSS, MDS, or maybe all of them? Do you have a
screenshot of the kernel panic, or a crashdump log?

cheers,

Wojciech Turek

On 26 Nov 2007, at 13:25, Somsak Sriprayoonsakul wrote:
> No. I use the stock bnx2 driver from the latest pre-built
> kernel-lustre-1.6.3. [...]

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517
We have about 177 client nodes.

I think the crashes happened only on the OSS.

I do not have a screenshot yet. How can I get the crashdump log?

Wojciech Turek wrote:
> How many clients (compute nodes) do you have in your cluster? [...]
On Mon, Nov 26, 2007 at 08:25:53PM +0700, Somsak Sriprayoonsakul wrote:
> I forgot to mention the oops: it's something about Lustre
> (lustre_blah_blah_blah something).

Could you provide the stack trace?
(we need to know more about the blah_blah_blah)

Johann
Hi,

On 26 Nov 2007, at 13:36, Somsak Sriprayoonsakul wrote:
> We have about 177 client nodes.
>
> I think the crashes happened only on the OSS.

We have had a similar problem. We have 600 clients, and crashes happened
every 2 days. There is a bug filed for it:
https://bugzilla.lustre.org/show_bug.cgi?id=14293
If your kernel panic looks similar, you can be almost certain it is the same
issue.

> I do not have a screenshot yet. How can I get the crashdump log?

You can try netdump:
http://www.redhat.com/support/wpapers/redhat/netdump/
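(Setup is roughly as follows, per the Red Hat paper linked above; a sketch
only, with CentOS 4 package names and an illustrative server IP:)

    # on a machine that stays up (the netdump server):
    yum install netdump-server
    passwd netdump                      # clients authenticate as this user
    chkconfig netdump-server on
    service netdump-server start
    # crash dumps arrive under /var/crash/<client-ip>-<date>/

    # on each OSS (the netdump client):
    yum install netdump
    echo 'NETDUMPADDR=10.0.0.100' >> /etc/sysconfig/netdump  # server IP
    service netdump propagate           # register the client with the server
    chkconfig netdump on
    service netdump start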
Could you tell me how to dump the whole crash log to a file? It does not
appear in /var/log/messages. I have only seen it once, actually; that's why I
don't know the function name :) But the whole screen was definitely
Lustre-related.

Note that the dump log is longer than a screen, so taking a photo wouldn't
help (I think).

I still have some time: the crashes are seldom. The last one happened today,
so I think it will not crash again within the next few days.

Johann Lombardi wrote:
> Could you provide the stack trace?
> (we need to know more about the blah_blah_blah)
On Mon, Nov 26, 2007 at 08:49:56PM +0700, Somsak Sriprayoonsakul wrote:
> Could you tell me how to dump the whole crash log to a file? It does not
> appear in /var/log/messages. I have only seen it once, actually; that's
> why I don't know the function name :) But the whole screen was definitely
> Lustre-related.

You should set up serial consoles (or netconsole). A crash dump utility
(netdump, LKCD, ...) is also very useful.

> Note that the dump log is longer than a screen, so taking a photo
> wouldn't help (I think).

If /proc/sys/kernel/panic_on_oops is set to 1 on the OSS, you could try
setting it to 0 and logging onto the node to get the stack trace via dmesg
before rebooting it.

Johann
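(Concretely, the two suggestions look something like this; a sketch, with
illustrative addresses, interface name, and MAC:)

    # keep the OSS alive after an oops so the trace can be pulled via dmesg
    sysctl -w kernel.panic_on_oops=0

    # or stream all console output to a log host over UDP with netconsole:
    #   netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
    modprobe netconsole \
        netconsole=6665@10.0.0.50/eth0,6666@10.0.0.100/00:19:B9:00:00:01
    # on the log host: nc -u -l -p 6666 | tee oss-console.log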
I am using netdump for this purpose, but I find that I don't always get
complete core images on a crash; tweaking the wait time before reboot doesn't
seem to have the desired effect of allowing the complete core image to
transfer, so YMMV.

I don't think there are any debugging symbols in the pre-built kernels,
either, so you'd have to compile a debugging kernel in order to dissect the
crashes.

hth,
Klaus

Johann Lombardi wrote:
> You should set up serial consoles (or netconsole). A crash dump utility
> (netdump, LKCD, ...) is also very useful. [...]
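(Once a vmcore does arrive intact, and assuming a kernel with debug symbols
is available, per Klaus's caveat, it can be opened with the crash utility;
the vmlinux and vmcore paths below are illustrative:)

    yum install crash
    crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
          /var/crash/10.0.0.50-2007-11-26-14:33/vmcore
    crash> bt     # stack trace of the task that panicked
    crash> log    # kernel ring buffer as it was at crash time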
Thanks for all the comments. I'll try netdump and see how it goes. It'll take
a while, but I'll be back.

In the meantime, could someone answer my second and third questions?

- Is RECOVERING enough, or should we run e2fsck + lfsck every time Lustre
fails?
- Quota is turned off whenever *any* OSS node fails. Is there any way to have
it "always on"?

BTW, when I turn quota back on, sometimes the quota settings go wrong: some
OSS nodes have a limit of only 1 byte while the others have the proper value.
We work around this by resetting the quota back to all zeroes and then
setting it again (see the sketch below). Is this normal?

Johann Lombardi wrote:
> You should set up serial consoles (or netconsole). [...]
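(The reset-and-reapply workaround would look roughly like this; a sketch,
where "someuser" and the limit values are illustrative, using the positional
lfs setquota syntax of the 1.6 series:)

    # lfs setquota -u <user> <bsoft> <bhard> <isoft> <ihard> <mountpoint>
    # zero out the broken limits, then re-apply the intended ones
    lfs setquota -u someuser 0 0 0 0 /fs
    lfs setquota -u someuser 9000000 10000000 90000 100000 /fs
    lfs quota -u someuser /fs     # verify the limits look sane again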
On Tue, Nov 27, 2007 at 10:33:08AM +0700, Somsak Sriprayoonsakul wrote:
> - Is RECOVERING enough, or should we run e2fsck + lfsck every time Lustre
> fails?

Yes, restarting the OSS is enough after an oops.

> - Quota is turned off whenever *any* OSS node fails. Is there any way to
> have it "always on"?

We are working on this problem (see bugzilla ticket #13359):
https://bugzilla.lustre.org/show_bug.cgi?id=13359

> BTW, when I turn quota back on, sometimes the quota settings go wrong:
> some OSS nodes have a limit of only 1 byte while the others have the
> proper value. We work around this by resetting the quota back to all
> zeroes and then setting it again. Is this normal?

What is wrong, the quota limit or the disk usage?

Johann