Sunil Mushran
2006-Aug-09 18:57 UTC
[Ocfs-devel] Re: URGENT: OCFS2 hang - 32 node cluster POC
Run:
# top
# vmstat 1
# iostat -x /dev/emcpowerb 1

The latter two you can save to a file. For top, just monitor cpu usage
and see if any process is hogging all of it.

Colin Laird wrote:
> and the fstab settings:
>
> # This file is edited by fstab-sync - see 'man fstab-sync' for details
> /dev/VolGroup00/LogVol01 /              ext3    defaults        1 1
> LABEL=/boot              /boot          ext3    defaults        1 2
> none                     /dev/pts       devpts  gid=5,mode=620  0 0
> none                     /dev/shm       tmpfs   defaults        0 0
> /dev/VolGroup00/LogVol02 /home          ext3    defaults        1 2
> none                     /proc          proc    defaults        0 0
> none                     /sys           sysfs   defaults        0 0
> /dev/VolGroup00/LogVol00 swap           swap    defaults        0 0
> /dev/emcpowerb           /ocfs2         ocfs2   _netdev         0 0
> /dev/hda                 /media/cdrom   auto    pamconsole,exec,noauto,managed 0 0
> /dev/fd0                 /media/floppy  auto    pamconsole,exec,noauto,managed 0 0
>
> We are not storing the voting disk and cluster reg for RAC in here.
>
> Thanks
>
>
> Colin Laird wrote:
>> Hi,
>>
>> We are in the middle of a very large bid (Centrelink, Australia) with
>> time at a premium. So PLEASE HELP. we have been experiencing
>> machine hangs whenever we do large copies (5-18G) into OCFS2. Either
>> from ftp or local disk. The whole machine just freezes and we need
>> to run off and on. we now cannot get the data available for the POC
>> across the nodes!
>>
>> The setup is:
>>
>> 32 clustered Dell 6850 nodes running RHEL4 U3 - Linux
>> c2.au.oracle.com 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>> We have the following ocfs2 packages installed:
>> ocfs2-2.6.9-34.ELsmp-1.2.3-1
>> ocfs2-2.6.9-34.EL-1.2.3-1
>> ocfs2-tools-debuginfo-1.2.1-1
>> ocfs2-2.6.9-34.ELlargesmp-1.2.3-1
>> ocfs2console-1.2.1-1
>> ocfs2-tools-1.2.1-1
>>
>> We have *elevator=deadline* set as per instructions too.
>>
>> We are currently looking for a log to see if we can find anything.
>> The system and ftp logs show nothing.
>>
>> Can anyone provide any pointers? Have we missed applying anything?
>>
>> Thanks,
>>
>> --
>> Colin Laird
>> Principal Solutions Consultant
>>
>> Oracle New Zealand Ltd
>> Level 10
>> Todd Building
>> 93-97 Customhouse Quay
>> Wellington
>> New Zealand
>>
>> main: +64 4 978 5400
>> ddi: +64 4 978 5423
>> mob: +64 21 617 025
>> fax: +64 4 978 5401
>
> --
> Colin Laird
> Principal Solutions Consultant
>
> Oracle New Zealand Ltd
> Level 10
> Todd Building
> 93-97 Customhouse Quay
> Wellington
> New Zealand
>
> main: +64 4 978 5400
> ddi: +64 4 978 5423
> mob: +64 21 617 025
> fax: +64 4 978 5401
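
One way the vmstat and iostat captures suggested above might be saved to files is sketched below. This is only an illustration: the helper name start_capture and the log paths under /var/tmp are hypothetical and do not come from the thread. Since the node freezes completely, it is probably sensible to keep the logs on a local filesystem rather than on the OCFS2 volume.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): run a command in the
# background, append its stdout and stderr to a log file, and print the
# PID of the background job so it can be stopped later with kill.
start_capture() {
    logfile=$1
    shift
    "$@" >>"$logfile" 2>&1 &
    echo $!
}

# Possible usage on an affected node before starting the large copy
# (paths are assumptions; iostat needs the sysstat package):
#   vmstat_pid=$(start_capture /var/tmp/vmstat.log vmstat 1)
#   iostat_pid=$(start_capture /var/tmp/iostat.log iostat -x /dev/emcpowerb 1)
#   ... reproduce the hang, or stop the captures afterwards with:
#   kill "$vmstat_pid" "$iostat_pid"
```

After a hang and reboot, the last lines written to each log show what the system was doing just before it froze, although the final second or two may not have reached the disk.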
Wim Coekaerts
2006-Aug-09 19:24 UTC
[Ocfs-devel] Re: URGENT: OCFS2 hang - 32 node cluster POC
alt-sysrq-t should still work w/ netdump configured

On Thu, Aug 10, 2006 at 12:22:39PM +1000, Colin Laird wrote:
> The problem is during the hang you can't get on to the box, its
> completely dead.
>
> Something we have found is that the heartbeat is set to 7, on the test
> cluster which has worked fine it is at 61. We are setting this value to
> 61 across the cluster.
>
> Sunil Mushran wrote:
> >Run:
> ># top
> ># vmstat 1
> ># iostat -x /dev/emcpowerb 1
> >
> >The latter two you can save to a file. For top, just monitor cpu usage
> >and see if any process is hogging all of it.
> >
> >Colin Laird wrote:
> >>and the fstab settings:
> >>
> >># This file is edited by fstab-sync - see 'man fstab-sync' for details
> >>/dev/VolGroup00/LogVol01 /              ext3    defaults        1 1
> >>LABEL=/boot              /boot          ext3    defaults        1 2
> >>none                     /dev/pts       devpts  gid=5,mode=620  0 0
> >>none                     /dev/shm       tmpfs   defaults        0 0
> >>/dev/VolGroup00/LogVol02 /home          ext3    defaults        1 2
> >>none                     /proc          proc    defaults        0 0
> >>none                     /sys           sysfs   defaults        0 0
> >>/dev/VolGroup00/LogVol00 swap           swap    defaults        0 0
> >>/dev/emcpowerb           /ocfs2         ocfs2   _netdev         0 0
> >>/dev/hda                 /media/cdrom   auto    pamconsole,exec,noauto,managed 0 0
> >>/dev/fd0                 /media/floppy  auto    pamconsole,exec,noauto,managed 0 0
> >>
> >>We are not storing the voting disk and cluster reg for RAC in here.
> >>
> >>Thanks
> >>
> >>
> >>Colin Laird wrote:
> >>>Hi,
> >>>
> >>>We are in the middle of a very large bid (Centrelink, Australia)
> >>>with time at a premium. So PLEASE HELP. we have been experiencing
> >>>machine hangs whenever we do large copies (5-18G) into OCFS2.
> >>>Either from ftp or local disk. The whole machine just freezes and
> >>>we need to run off and on. we now cannot get the data available for
> >>>the POC across the nodes!
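
For context on the heartbeat values 7 and 61 mentioned above: assuming they refer to the O2CB heartbeat threshold (the O2CB_HEARTBEAT_THRESHOLD setting, which the thread does not name explicitly), the OCFS2 documentation of that era describes the relation as threshold = (timeout in seconds / 2) + 1. Under that assumption, the arithmetic can be sketched as follows; the function name hb_timeout_secs is hypothetical.

```shell
#!/bin/sh
# Convert a heartbeat threshold into the approximate disk heartbeat
# timeout in seconds, assuming: timeout = (threshold - 1) * 2.
hb_timeout_secs() {
    echo $(( ($1 - 1) * 2 ))
}

hb_timeout_secs 7     # prints 12
hb_timeout_secs 61    # prints 120
```

If the assumption holds, a threshold of 7 gives a node about 12 seconds of stalled disk I/O before it is considered dead, while 61 allows about 120 seconds, which could matter during large copies to multipathed storage.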