Hello,

We have a 4-node Lustre cluster that provides a parallel file system for our
192-node compute cluster. The Lustre nodes run CentOS 4.5, x86_64 (Intel
4000-series CPUs), on HP DL360 G5 hardware. The compute cluster that uses it
is ROCKS 4.2.1, on the same hardware. Our network is Gigabit Ethernet, using
the bnx2 driver. The Lustre layout is:

storage-0-0: mgs+mdt, ost0, ost1 (backup)
storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
storage-0-2: ost2, ost3 (backup)
storage-0-3: ost2 (backup), ost3

We're using heartbeat 2.0.8 from the pre-built CentOS RPMs. Every backup
resource is configured so that it never runs at the same time as its primary.
Note that we have flock and quota enabled on Lustre.

The problem we have right now is that some of the nodes panic at random,
roughly once every week or two. We tolerate this crudely by setting
kernel.panic=60 and hoping that the backup node does not fail within that
window. So far this has worked well enough that, judging by user feedback,
users do not even notice the file system failing: the backup node takes over
the OST, runs recovery for about 250 seconds, and then everything is back to
normal.

In any case, we are trying to nail down why the file system panics. I realize
the information above will not suffice to track down the cause. Could someone
suggest a way to debug this, or to dump useful information that I can send to
the list for later analysis? Also, is the "RECOVERING" phase enough to leave
the file system in a stable state, or do we need to shut down the whole
system and run e2fsck + lfsck?

Finally, after every panic, the quota that was enabled ends up disabled
("lfs quota <user> /fs" yields "No such process"), and I have to run quotaoff
and then quotaon again. It seems that quota is not turned back on when the
OST comes up. Is there a way to keep it always on?

Thank you very much in advance

--
Somsak Sriprayoonsakul
Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
somsak_sr at thaigrid.or.th
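(For reference, the workaround described above boils down to roughly the
following; a minimal sketch, assuming the file system is mounted at /fs as in
the message, that both user and group quotas are in use, and with "someuser"
as an illustrative name:)

    # reboot automatically 60 seconds after a panic, so the HA partner
    # is only alone for a bounded window
    sysctl -w kernel.panic=60
    echo 'kernel.panic = 60' >> /etc/sysctl.conf   # persist across reboots

    # after a failover, quota comes back disabled; cycle it by hand
    lfs quotaoff -ug /fs
    lfs quotaon -ug /fs
    lfs quota -u someuser /fs   # should no longer return "No such process"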
Somsak,

Did you build your own bnx2 driver? I was getting kernel panics when hitting
a certain load with Dell 1950s, which also use the bnx2 driver. My solution
was to grab the bnx2 source code and build it under the Lustre kernel. If you
search the mailing list you'll find the threads dealing with this.

If you see bnx2 mentioned in your kernel panic output, then it's probably the
cause.

Thanks,

Matt

On 26/11/2007, Somsak Sriprayoonsakul <somsak_sr at thaigrid.or.th> wrote:
> [...]
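(For anyone wanting to try the same fix: the rebuild Matt describes looks
roughly like the following. This is a sketch, not his exact steps; the
tarball name, version, and Makefile variables are illustrative and depend on
the driver source you download from Broadcom:)

    # on the OSS, booted into the kernel-lustre kernel
    tar xzf bnx2-1.6.7c.tar.gz         # version is illustrative
    cd bnx2-1.6.7c/src
    make KVER=$(uname -r)              # build against the running Lustre kernel
    make KVER=$(uname -r) install      # install bnx2.ko under /lib/modules
    depmod -a
    # reload from the physical console, not over the NIC being replaced
    modprobe -r bnx2 && modprobe bnx2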
No. I use the stock bnx2 driver from the latest pre-built
kernel-lustre-1.6.3.

I forgot to mention the oops: it's something about Lustre
(lustre_blah_blah_blah something).

All the other nodes also use bnx2, and there's no problem with them at all.

Matt wrote:
> Did you build your own bnx2 driver? [...]
Hi,

How many clients (compute nodes) do you have in your cluster? What is
crashing randomly: clients, OSS, MDS, or maybe all of them? Do you have a
screenshot of the kernel panic, or a crashdump log?

cheers,

Wojciech Turek

On 26 Nov 2007, at 13:25, Somsak Sriprayoonsakul wrote:
> No. I use the stock bnx2 driver from the latest pre-built
> kernel-lustre-1.6.3. [...]

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517
We have about 177 client nodes.

I think the crashes happened only on the OSS.

I do not have a screenshot yet. How can I get the crashdump log?

Wojciech Turek wrote:
> How many clients (compute nodes) do you have in your cluster? [...]
On Mon, Nov 26, 2007 at 08:25:53PM +0700, Somsak Sriprayoonsakul wrote:
> I forgot to mention the oops: it's something about Lustre
> (lustre_blah_blah_blah something).

Could you provide the stack trace?
(we need to know more about the blah_blah_blah)

Johann
Hi,

On 26 Nov 2007, at 13:36, Somsak Sriprayoonsakul wrote:
> We have about 177 client nodes.
>
> I think the crashes happened only on the OSS.

We have had a similar problem. We have 600 clients, and crashes happened
every 2 days. There is a bug filed for it:
https://bugzilla.lustre.org/show_bug.cgi?id=14293
If your kernel panic looks similar, you can be almost certain it is the same
issue.

> I do not have a screenshot yet. How can I get the crashdump log?

You can try netdump:
http://www.redhat.com/support/wpapers/redhat/netdump/
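(Setup is roughly as follows, per the Red Hat paper linked above; a sketch
only, with CentOS 4 package names and an illustrative server IP:)

    # on a machine that stays up (the netdump server):
    yum install netdump-server
    passwd netdump                      # clients authenticate as this user
    chkconfig netdump-server on
    service netdump-server start
    # crash dumps arrive under /var/crash/<client-ip>-<date>/

    # on each OSS (the netdump client):
    yum install netdump
    echo 'NETDUMPADDR=10.0.0.100' >> /etc/sysconfig/netdump  # server IP
    service netdump propagate           # register the client with the server
    chkconfig netdump on
    service netdump start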
Could you tell me how to dump the whole crash log to a file? It does not
appear in /var/log/messages. I have only seen it once, actually; that's why I
don't know the function name :) But the whole screen was definitely
Lustre-related.

Note that the dump log is longer than a screen, so taking a photo wouldn't
help (I think).

I still have some time: the crashes are seldom. The last one happened today,
so I think it will not crash again within the next few days.

Johann Lombardi wrote:
> Could you provide the stack trace?
> (we need to know more about the blah_blah_blah)
On Mon, Nov 26, 2007 at 08:49:56PM +0700, Somsak Sriprayoonsakul wrote:
> Could you tell me how to dump the whole crash log to a file? It does not
> appear in /var/log/messages. I have only seen it once, actually; that's
> why I don't know the function name :) But the whole screen was definitely
> Lustre-related.

You should set up serial consoles (or netconsole). A crash dump utility
(netdump, LKCD, ...) is also very useful.

> Note that the dump log is longer than a screen, so taking a photo
> wouldn't help (I think).

If /proc/sys/kernel/panic_on_oops is set to 1 on the OSS, you could try
setting it to 0 and logging onto the node to get the stack trace via dmesg
before rebooting it.

Johann
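(Concretely, the two suggestions look something like this; a sketch, with
illustrative addresses, interface name, and MAC:)

    # keep the OSS alive after an oops so the trace can be pulled via dmesg
    sysctl -w kernel.panic_on_oops=0

    # or stream all console output to a log host over UDP with netconsole:
    #   netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
    modprobe netconsole \
        netconsole=6665@10.0.0.50/eth0,6666@10.0.0.100/00:19:B9:00:00:01
    # on the log host: nc -u -l -p 6666 | tee oss-console.log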
I am using netdump for this purpose, but I find that I don't always get
complete core images on a crash; tweaking the wait time before reboot doesn't
seem to have the desired effect of allowing the complete core image to
transfer, so YMMV.

I don't think there are any debugging symbols in the pre-built kernels,
either, so you'd have to compile a debugging kernel in order to dissect the
crashes.

hth,
Klaus

Johann Lombardi wrote:
> You should set up serial consoles (or netconsole). A crash dump utility
> (netdump, LKCD, ...) is also very useful. [...]
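(Once a vmcore does arrive intact, and assuming a kernel with debug symbols
is available, per Klaus's caveat, it can be opened with the crash utility;
the vmlinux and vmcore paths below are illustrative:)

    yum install crash
    crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
          /var/crash/10.0.0.50-2007-11-26-14:33/vmcore
    crash> bt     # stack trace of the task that panicked
    crash> log    # kernel ring buffer as it was at crash time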
Thanks for all the comments. I'll try netdump and see how it goes. It'll take
a while, but I'll be back.

In the meantime, could someone answer my second and third questions?

- Is RECOVERING enough, or should we run e2fsck + lfsck every time Lustre
fails?
- Quota is turned off whenever *any* OSS node fails. Is there any way to have
it "always on"?

BTW, when I turn quota back on, sometimes the quota settings go wrong: some
OSS nodes have a limit of only 1 byte while the others have the proper value.
We work around this by resetting the quota back to all zeroes and then
setting it again (see the sketch below). Is this normal?

Johann Lombardi wrote:
> You should set up serial consoles (or netconsole). [...]
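(The reset-and-reapply workaround would look roughly like this; a sketch,
where "someuser" and the limit values are illustrative, using the positional
lfs setquota syntax of the 1.6 series:)

    # lfs setquota -u <user> <bsoft> <bhard> <isoft> <ihard> <mountpoint>
    # zero out the broken limits, then re-apply the intended ones
    lfs setquota -u someuser 0 0 0 0 /fs
    lfs setquota -u someuser 9000000 10000000 90000 100000 /fs
    lfs quota -u someuser /fs     # verify the limits look sane again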
On Tue, Nov 27, 2007 at 10:33:08AM +0700, Somsak Sriprayoonsakul wrote:
> - Is RECOVERING enough, or should we run e2fsck + lfsck every time Lustre
> fails?

Yes, restarting the OSS is enough after an oops.

> - Quota is turned off whenever *any* OSS node fails. Is there any way to
> have it "always on"?

We are working on this problem (see bugzilla ticket #13359):
https://bugzilla.lustre.org/show_bug.cgi?id=13359

> BTW, when I turn quota back on, sometimes the quota settings go wrong:
> some OSS nodes have a limit of only 1 byte while the others have the
> proper value. We work around this by resetting the quota back to all
> zeroes and then setting it again. Is this normal?

What is wrong, the quota limit or the disk usage?

Johann