Peter Santos
2007-Mar-16 12:58 UTC
[Ocfs2-users] re: o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1" another node is heartbeating in our slot!
Folks,

I'm trying to wrap my head around something that happened in our environment. Basically, we noticed the following error in /var/log/messages, with no other errors:

"Mar 16 13:38:02 dbo3 kernel: (3712,3):o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1": another node is heartbeating in our slot!"

Usually there are a number of other errors, but this one was it.

Our RAC cluster is made up of 3 nodes (dbo1, dbo2, dbo3). They use ocfs2 for the OCR and voting files, but the datafiles are located in ASM. This is suse9 kernel 282.

A while back one of our SAs was trying to install ocfs2 on a couple of Red Hat machines, and didn't properly configure ocfs2 to add the nodes. I believe he just copied directories and the /etc/ocfs2/cluster.conf file. Anyway, when he turned the machines on today, they were still misconfigured, and I believe that is the cause of the "another node is heartbeating in our slot" message. Would you agree?

As I mentioned, there are only 3 nodes in our cluster, but the /etc/ocfs2/cluster.conf file shows 6, and so does the following:

oracle@dbo1:/etc/ocfs2> ls /config/cluster/ocfs2/node/
dbo1 dbo2 dbo3 dbo4 dbt3 dbt4

So my question is: how do I permanently remove dbt3, dbt4 and dbo4? I checked the ocfs2 guide, but it only has information on adding a node to an online or offline cluster.

More important is how the Oracle clusterware behaved. After this happened, my ASM and RDBMS instances stayed up. None of the machines rebooted. But the CRS daemon appears to be having issues.

When I run "crsctl check crs" on all 3 nodes, I get the error "Cannot communicate with CRS" on all 3 nodes. The cssd log directory has a core file, yet I can log into all 3 database instances as if nothing happened.

I suspect this is a bug?

The CRSD log files reveal some sort of problem writing to the OCR file, which is on ocfs2. But if there really was a problem, wouldn't ocfs2 have rebooted the machine? And when RAC has a problem accessing the ocfs2 volume, there are usually a large number of I/O errors in the system log.

Any insight is greatly appreciated.

-peter

alertdbo3.log
============
2007-03-16 13:38:25.471
[crsd(4994)]CRS-1006:The OCR location /ocfs2/oracrs/ocr.crs is inaccessible. Details in /data/app/crs/oracle/product/10.2.0/crs/log/dbo3/crsd/crsd.log.

2007-03-16 13:38:43.377
[client(13125)]CRS-1006:The OCR location /ocfs2/oracrs/ocr.crs is inaccessible. Details in /data/app/crs/oracle/product/10.2.0/crs/log/dbo3/client/css.log.

crsd.log
============
2007-03-16 13:38:11.708: [ OCRCLI][1407371616]proac_set_value: Response message returned with failure keyname [CRS.CUR.ora!ORACTAH!ORACTAH3!inst.REASON], retcode = 26
2007-03-16 13:38:11.710: [ OCRCLI][1417865568]proac_set_value: Response message returned with failure keyname [CRS.CUR.ora!dbo3!LISTENER_DBO3!lsnr.REASON], retcode = 26
2007-03-16 13:38:24.159: [ OCRMSG][1407371616]prom_rpc: CLSC recv failure..ret code 7
2007-03-16 13:38:24.159: [ OCRMSG][1407371616]prom_rpc: possible OCR retry scenario
2007-03-16 13:38:24.159: [ COMMCRS][1417865568]clscsendx: (0xc80100) Physical connection (0xc7fa30) not active
2007-03-16 13:38:24.159: [ OCRMSG][1417865568]prom_rpc: CLSC send failure..ret code 11
2007-03-16 13:38:24.159: [ OCRMSG][1417865568]prom_rpc: possible OCR retry scenario
2007-03-16 13:38:25.036: [ OCRMAS][1182845280]th_master:13: I AM THE NEW OCR MASTER at incar 3. Node Number = 3
2007-03-16 13:38:25.046: [ OCRRAW][1182845280]proprioo: for disk 0 (/ocfs2/oracrs/ocr.crs), id match (1), my id set (1201294405,1028247821) total id sets (1), 1st set (1201294405,1028247821), 2nd set (0,0) my votes (2), total votes (2)
2007-03-16 13:38:25.102: [ OCRRAW][1182845280]rrecover:3: recovery required
2007-03-16 13:38:25.471: [ OCRRAW][1182845280]rtnode:3: invalid tnode 1085
2007-03-16 13:38:25.471: [ OCRRAW][1182845280]propropen:0: could not read tnode addrd=0
2007-03-16 13:38:25.471: [ OCRRAW][1182845280]proprseterror: Error in accessing physical storage [26] Marking context invalid.
2007-03-16 13:38:25.471: [ OCRUTL][1182845280]u_freem: INVALID PROU_BEGIN_MEMTAG for memory [99351708] Begin tag [99351170] Expected begin tag [5072426d]
[ OCRMAS][1182845280]th_calc_av:8.1': Error reading key [SYSTEM.version.node_numbers.node3]
2007-03-16 13:38:25.471: [ OCRMAS][1182845280]th_master:9: Shutdown CacheMaster. prev AV [169869824] new calc av [169869824] my sv [169869824]
2007-03-16 13:38:39.932: [ CRSOCR][1438853472]0OCR api procr_open_key failed for key CRS.CUR. OCR error code = 3 OCR error msg:
2007-03-16 13:38:39.932: [ CRSOCR][1438853472][PANIC]0Failed to open key: CRS.CUR(File: caaocr.cpp, line: 472)

* The cssd directory has a core file, but nothing in the ocssd.log file.
Alexei_Roudnev
2007-Mar-16 13:35 UTC
[Ocfs2-users] re: o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1" another node is heartbeating in our slot!
Btw, upgrade the kernel to #283; 282 had a serious bug in OCFS2 (related to simultaneous appends to a file).

Another point: try to keep the CRS and CSS files off OCFS2. The reason is that by keeping the CRS files on OCFS2 you de facto make one cluster (CRS) depend on another (OCFS2), which can influence CRS decisions in a fault situation. (It is usually simple to create two more partitions or LUNs for the OCR file and the CSS file, 102MB and 22MB respectively.)

As for your case: these experiments could really have broken the heartbeat (did you allow access to the same disks from these new experimental servers?).
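The copied-cluster.conf scenario discussed above can be checked mechanically. The following is a minimal sketch, not an official OCFS2 tool: it assumes the usual O2CB stanza layout of cluster.conf (`node:` / `cluster:` headers followed by `key = value` lines) and flags node names outside the expected set as well as node numbers claimed by more than one stanza. The sample text, its IP addresses and the clashing node number are invented for illustration; they are not taken from the poster's actual file.

```python
from collections import Counter

# Hypothetical cluster.conf contents, invented for illustration only.
SAMPLE_CONF = """
node:
    ip_port = 7777
    ip_address = 192.0.2.11
    number = 0
    name = dbo1
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.0.2.12
    number = 1
    name = dbo2
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.0.2.13
    number = 2
    name = dbo3
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.0.2.23
    number = 2
    name = dbt3
    cluster = ocfs2

cluster:
    node_count = 4
    name = ocfs2
"""


def parse_cluster_conf(text):
    """Return a list of (stanza_type, attributes) pairs from cluster.conf text."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith(":") and "=" not in line:
            # A stanza header such as "node:" or "cluster:".
            current = (line[:-1], {})
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[1][key.strip()] = value.strip()
    return stanzas


def audit_nodes(text, expected_names):
    """Return (unexpected node names, node numbers used more than once)."""
    nodes = [attrs for kind, attrs in parse_cluster_conf(text) if kind == "node"]
    names = {n["name"] for n in nodes if "name" in n}
    numbers = Counter(n["number"] for n in nodes if "number" in n)
    unexpected = sorted(names - set(expected_names))
    duplicates = sorted(num for num, count in numbers.items() if count > 1)
    return unexpected, duplicates


unexpected, duplicates = audit_nodes(SAMPLE_CONF, {"dbo1", "dbo2", "dbo3"})
print("unexpected nodes:", unexpected)
print("duplicate node numbers:", duplicates)
```

To audit a real node, read the text from /etc/ocfs2/cluster.conf instead of the sample string and run the same check on every machine that can see the shared disk, since the files may differ between hosts.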
Sunil Mushran
2007-Mar-16 13:50 UTC
[Ocfs2-users] re: o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1" another node is heartbeating in our slot!
Peter Santos wrote:
> "Mar 16 13:38:02 dbo3 kernel: (3712,3):o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1": another node is
> heartbeating in our slot!"
> Usually there are a number of other errors, but this one was it.

If this was one isolated error message, it could just be that the previous heartbeat write failed for some reason. That is, the real error may not be as severe as the message suggests.

> Our RAC cluster is made up of 3 nodes (dbo1,dbo2,dbo3) and they use ocfs2 for the ocr /voting file, but
> ASM is where the datafiles are located. This is suse9 kernel 282.
>
> A while back one of our SA's was trying to install ocfs2 on a couple of red-hat machines, and didn't properly
> configure ocfs2 to add the nodes. I believe he just copied directories and the /etc/ocfs2/cluster.conf file.
> Anyway, when he turned the machines on today, they were still mis configured and I believe that is the
> cause of the error message "another node is heartbeating in our slot" message? would you agree ?

If it was just one message, then that is unlikely. But check the config file to see whether it is correct.

> As I mentioned there are only 3 nodes in our cluster, but the /etc/cluster.conf file shows 6 and so does the
> following:
> oracle@dbo1:/etc/ocfs2> ls /config/cluster/ocfs2/node/
> dbo1 dbo2 dbo3 dbo4 dbt3 dbt4
>
> So my question, is how do I permanently remove dbt3, dbt4 and dbo4 ? I checked out the ocfs2 guide, but it only
> has information on adding a node to both an online/offline cluster.

Deletion would require a cluster shutdown. But why do you have to remove them right now? Why can't you schedule a cluster.conf cleanup during your next cluster shutdown window?

> More importantly is how the oracle clusterware behaved. After this happened, my ASM and RDBMS instances stayed
> up. None of the machines rebooted. But the CRS deamon appears to be having issues.
>
> When I run "crsctl check crs" on all 3 nodes, I get the error "Cannot communicate with CRS" on all 3 nodes.
> The cssd log directory has a core file .. yet I can log into all 3 database instances as if nothing happened.
>
> I suspect this is a bug?
>
> The CRSD log files reveal some sort of issue relating to problems writing to the ocr file ..which is on ocfs2. But
> if there really was a problem, wouldn't ocfs2 have rebooted the machine? And when RAC has a problem accessing the ocfs2
> volume, there are usually a large number of io errors in the system log

File an SR with Oracle and let the RAC folks look at the issue. The existence of a core file may mean that some process needs to be restarted.
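Whether the heartbeat error was isolated or recurring, which is the distinction drawn above, can be read off the system log. The sketch below is illustrative only: it counts, per device, the lines matching the message text quoted in this thread. The regular expression is modelled on that single quoted line and may need adjusting for other kernel versions; the second sample line is a made-up non-matching entry.

```python
import re
from collections import Counter

# Pattern modelled on the one kernel message quoted in this thread.
HB_SLOT_ERROR = re.compile(
    r'o2hb_do_disk_heartbeat:\d+ ERROR: Device "(?P<dev>[^"]+)":? '
    r'another node is heartbeating in our slot'
)


def count_slot_errors(lines):
    """Count 'heartbeating in our slot' errors per device name."""
    counts = Counter()
    for line in lines:
        match = HB_SLOT_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


sample_lines = [
    'Mar 16 13:38:02 dbo3 kernel: (3712,3):o2hb_do_disk_heartbeat:963 '
    'ERROR: Device "sdb1": another node is heartbeating in our slot!',
    # Made-up unrelated line, included only to show it is ignored.
    'Mar 16 13:38:03 dbo3 kernel: unrelated message',
]
print(dict(count_slot_errors(sample_lines)))
```

On a live system the same function can be fed an open file handle for /var/log/messages. A count of one per device is consistent with the isolated-error reading, while a steadily growing count points more toward another host writing to the same heartbeat slot.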