Deaderick, David (EDS)
2006-Oct-25 14:40 UTC
[Ocfs2-users] OCFS2 Fencing and Locking MSA500 Array: Help
I have a RedHat Enterprise Linux 4.0 two-node cluster on HP ProLiant ML350 servers connected to an HP MSA500 with HP 532 SCSI adapters (cciss driver).

The following list includes critical component versions:

ocfs2console-1.2.1-1                       Mon 28 Aug 2006 05:39:20 PM EDT
ocfs2-2.6.9-42.0.2.ELsmp-1.2.3-1           Mon 28 Aug 2006 05:39:19 PM EDT
ocfs2-2.6.9-42.0.2.ELhugemem-1.2.3-1       Mon 28 Aug 2006 05:39:18 PM EDT
ocfs2-2.6.9-42.0.2.EL-1.2.3-1              Mon 28 Aug 2006 05:39:17 PM EDT
ocfs2-tools-1.2.1-1                        Mon 28 Aug 2006 05:39:15 PM EDT
oracleasmlib-2.0.2-1                       Mon 28 Aug 2006 05:37:51 PM EDT
oracleasm-2.6.9-42.0.2.ELhugemem-2.0.3-1   Mon 28 Aug 2006 05:37:49 PM EDT
oracleasm-2.6.9-42.0.2.EL-2.0.3-1          Mon 28 Aug 2006 05:37:47 PM EDT
oracleasm-2.6.9-42.0.2.ELsmp-2.0.3-1       Mon 28 Aug 2006 05:37:45 PM EDT
oracleasm-support-2.0.3-1                  Mon 28 Aug 2006 05:37:44 PM EDT
kernel-hugemem-2.6.9-42.0.2.EL             Mon 28 Aug 2006 05:25:32 PM EDT
kernel-doc-2.6.9-42.0.2.EL                 Mon 28 Aug 2006 05:25:29 PM EDT
kernel-hugemem-devel-2.6.9-42.0.2.EL       Mon 28 Aug 2006 05:25:07 PM EDT
kernel-smp-devel-2.6.9-42.0.2.EL           Mon 28 Aug 2006 05:21:45 PM EDT
kernel-smp-2.6.9-42.0.2.EL                 Mon 28 Aug 2006 05:20:51 PM EDT
kernel-utils-2.4-13.1.83                   Mon 28 Aug 2006 05:20:48 PM EDT
kernel-devel-2.6.9-42.0.2.EL               Mon 28 Aug 2006 04:42:48 PM EDT
kernel-2.6.9-42.0.2.EL                     Mon 28 Aug 2006 04:42:37 PM EDT

Whenever there is a heavy load on the I/O system (e.g. full database backups using RMAN), the servers fence, reboot, and cannot reconnect with the MSA500. We must power off the servers and the MSA500 and restart them.

Where can I start troubleshooting this?

/var/log/messages (Node 2):

Oct 11 05:16:56 vhaispora02 kernel: o2net: connection to node vhaispora01 (num 0) at 192.168.1.1:7777 has been idle for 10 seconds, shutting it down.
Oct 11 05:16:56 vhaispora02 kernel: (0,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1160558206.560358 now 1160558216.558300 dr 1160558206.560323 adv 1160558206.560375:1160558206.560379 func (0d6da305:504) 1160552001.561116:1160552001.561125)
Oct 11 05:16:56 vhaispora02 kernel: o2net: no longer connected to node vhaispora01 (num 0) at 192.168.1.1:7777
Oct 11 05:16:59 vhaispora02 kernel: cciss0: unsolicited abort f7010e90
Oct 11 05:16:59 vhaispora02 kernel: cciss0: retrying f7010e90
.
.
.
Oct 11 05:17:18 vhaispora02 kernel: cciss0: f7010550 retried too many times
Oct 11 05:17:18 vhaispora02 kernel: cciss0: unsolicited abort f70107a0
Oct 11 05:17:18 vhaispora02 kernel: cciss0: f70107a0 retried too many times
Oct 11 05:17:18 vhaispora02 kernel: cciss0: unsolicited abort f70109f0
Oct 11 10:35:57 vhaispora02 syslogd 1.4.1: restart.
Oct 11 10:35:57 vhaispora02 syslog: syslogd startup succeeded
Oct 11 10:35:57 vhaispora02 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 11 10:35:57 vhaispora02 kernel: Linux version 2.6.9-42.0.2.ELsmp (bhcompile@ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 18:00:32 EDT 2006

/var/log/messages (Node 1):

Oct 11 05:10:01 vhaispora01 crond(pam_unix)[14577]: session closed for user root
Oct 11 05:14:25 vhaispora01 ntpd[3243]: synchronized to 10.4.31.254, stratum 2
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000250
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000250
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f70004a0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f70004a0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f70006f0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f70006f0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000940
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000940
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000b90
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000b90
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000de0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000de0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7001030
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7001030
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7001280
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7001280
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f70014d0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f70014d0
.
.
.
Oct 11 05:16:46 vhaispora01 kernel: cciss0: unsolicited abort f7012ca0
Oct 11 05:16:46 vhaispora01 kernel: cciss0: f7012ca0 retried too many times
Oct 11 05:16:47 vhaispora01 kernel: cciss0: unsolicited abort f7012ef0
Oct 11 05:16:47 vhaispora01 kernel: cciss0: f7012ef0 retried too many times
Oct 11 10:35:50 vhaispora01 syslogd 1.4.1: restart.
Oct 11 10:35:50 vhaispora01 syslog: syslogd startup succeeded
Oct 11 10:35:50 vhaispora01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 11 10:35:50 vhaispora01 kernel: Linux version 2.6.9-42.0.2.ELsmp (bhcompile@ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 18:00:32 EDT 2006
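
One possible starting point is to merge the excerpts from both nodes and see which kind of event appears first on each. The following is a minimal Python sketch of that idea, not a tool from the OCFS2 project. The function name `first_events`, the marker table, and the year 2006 (syslog lines carry no year) are assumptions made for illustration; the sample lines are copied from the excerpts above.

```python
import re
from datetime import datetime

# Syslog timestamps have no year; 2006 is assumed from the date of this post.
ASSUMED_YEAR = 2006

# Matches "<Mon DD HH:MM:SS> <host> kernel: <message>".
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) kernel: (.*)$")

# Message prefix -> label. The prefixes are taken from the posted excerpts.
MARKERS = {
    "cciss0: unsolicited abort": "storage abort",
    "o2net: connection to node": "o2net idle timeout",
}


def first_events(lines):
    """Return sorted (time, host, label) tuples: the first hit of each marker per host."""
    seen = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a kernel message in the expected layout
        stamp, host, msg = match.groups()
        when = datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")
        for prefix, label in MARKERS.items():
            if msg.startswith(prefix) and (host, label) not in seen:
                seen[(host, label)] = when
    return sorted((when, host, label) for (host, label), when in seen.items())


sample = [
    "Oct 11 05:16:56 vhaispora02 kernel: o2net: connection to node vhaispora01 "
    "(num 0) at 192.168.1.1:7777 has been idle for 10 seconds, shutting it down.",
    "Oct 11 05:16:59 vhaispora02 kernel: cciss0: unsolicited abort f7010e90",
    "Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000250",
    "Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f70004a0",
]

for when, host, label in first_events(sample):
    print(when.strftime("%H:%M:%S"), host, label)
```

On these four lines the earliest event is the storage abort on vhaispora01 at 05:15:28, ahead of the o2net idle timeout on vhaispora02 at 05:16:56. This only reflects the posted excerpts, which contain elided sections, so the full logs would need the same treatment before drawing conclusions.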
Sunil Mushran
2006-Oct-25 14:59 UTC
[Ocfs2-users] OCFS2 Fencing and Locking MSA500 Array: Help
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000250
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000250

That's where the problem begins. The cciss driver is unable to complete the I/Os, possibly because of a bus reset. Ping HP, or whoever your support contact is for the MSA500. You may get more information if you set up a netconsole server to catch the stack dumps.
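
For the netconsole suggestion, the receiving side can be as simple as a UDP listener on a third machine that appends whatever the nodes send to a file. The following is a minimal Python sketch under stated assumptions, not an official tool: it assumes the nodes are configured separately to send netconsole output to this host on UDP port 6666 (the kernel's documented netconsole default target port), and the names `format_record`, `serve`, and `netconsole.log` are hypothetical.

```python
import socket
from datetime import datetime, timezone

# Assumption: senders use the netconsole default target port.
LISTEN_PORT = 6666


def format_record(payload: bytes, sender_ip: str, received_at: datetime) -> str:
    """Render one datagram as '<receive time> <sender IP> <kernel message>'."""
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n")
    return f"{received_at.isoformat()} {sender_ip} {text}"


def serve(log_path: str, host: str = "0.0.0.0", port: int = LISTEN_PORT) -> None:
    """Append every received datagram to log_path; runs until interrupted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(log_path, "a", encoding="utf-8") as log:
        sock.bind((host, port))
        while True:
            payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
            now = datetime.now(timezone.utc)
            log.write(format_record(payload, sender_ip, now) + "\n")
            log.flush()  # keep the file current in case the listener is killed
```

Calling `serve("netconsole.log")` on the logging host starts the capture. Because the messages leave the node over the network as they are printed, output such as a stack dump can survive even when the node fences before syslog writes it to local disk.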