Robert Milkowski
2010-Feb-04 17:10 UTC
[zfs-discuss] [ha-clusters-discuss] data corruption
Adding storage-discuss@ and zfs-discuss@ as well.

On 04/02/2010 16:33, Robert Milkowski wrote:
> Hi,
>
> S10, SC3.2 + patches, Generic_142900-03, 2x T5220 with QLE2462 connected to 6540s.
>
> We started to observe the messages below yesterday on both nodes at the same time, after several weeks of running:
>
> <pre>
> XXX cl_runtime: [ID 856360 kern.warning] WARNING: QUORUM_GENERIC: quorum_read_keys error: Reading the registration keys failed on quorum device /dev/did/rdsk/d7s2 with error 22.
> XXX cl_runtime: [ID 868277 kern.warning] WARNING: CMM: Erstwhile online quorum device /dev/did/rdsk/d7s2 (qid 1) is inaccessible now.
>
> d7 is a quorum device and the cluster marked it as offline:
>
> # clq status
>
> === Cluster Quorum ===
>
> --- Quorum Votes Summary from latest node reconfiguration ---
>
>             Needed   Present   Possible
>             ------   -------   --------
>             2        3         3
>
>
> --- Quorum Votes by Node (current status) ---
>
> Node Name          Present   Possible   Status
> ---------          -------   --------   ------
> XXXXXXXXXXXXXXX    1         1          Online
> YYYYYYYYYYYYYYY    1         1          Online
>
>
> --- Quorum Votes by Device (current status) ---
>
> Device Name   Present   Possible   Status
> -----------   -------   --------   ------
> d7            0         1          Offline
>
>
>
> By looking at the source code I found that the above message is printed from within quorum_device_generic_impl::quorum_read_keys(), and it only happens if quorum_pgre_key_read() returns 22 (actually any value other than 0 or EACCES, but we already know from the syslog message that the rc is 22).
>
> Now quorum_pgre_key_read() calls quorum_scsi_sector_read() and passes that function's return code on as its own.
> quorum_scsi_sector_read() can return an error if quorum_ioctl_with_retries() returns an error or if there is a checksum mismatch.
>
> This is the relevant source code:
>
> int
> quorum_scsi_sector_read(
> [...]
>     error = quorum_ioctl_with_retries(vnode_ptr, USCSICMD, (intptr_t)&ucmd,
>         &retval);
>     if (error != 0) {
>         CMM_TRACE(("quorum_scsi_sector_read: ioctl USCSICMD "
>             "returned error (%d).\n", error));
>         kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
>         return (error);
>     }
>
>     //
>     // Calculate and compare the checksum if check_data is true.
>     // Also, validate the pgres_id string at the beg of the sector.
>     //
>     if (check_data) {
>         PGRE_CALCCHKSUM(chksum, sector, iptr);
>
>         // Compare the checksum.
>         if (PGRE_GETCHKSUM(sector) != chksum) {
>             CMM_TRACE(("quorum_scsi_sector_read: "
>                 "checksum mismatch.\n"));
>             kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
>             return (EINVAL);
>         }
>
>         //
>         // Validate the PGRE string at the beg of the sector.
>         // It should contain PGRE_ID_LEAD_STRING[1|2].
>         //
>         if ((os::strncmp((char *)sector->pgres_id, PGRE_ID_LEAD_STRING1,
>             strlen(PGRE_ID_LEAD_STRING1)) != 0) &&
>             (os::strncmp((char *)sector->pgres_id, PGRE_ID_LEAD_STRING2,
>             strlen(PGRE_ID_LEAD_STRING2)) != 0)) {
>             CMM_TRACE(("quorum_scsi_sector_read: pgre id "
>                 "mismatch. The sector id is %s.\n",
>                 sector->pgres_id));
>             kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
>             return (EINVAL);
>         }
>
>     }
>     kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
>
>     return (error);
> }
>
>
>
> 56 -> __1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_ 6308555744942019 enter
> 56 -> __1cZquorum_ioctl_with_retries6FpnFvnode_ilpi_i_ 6308555744957176 enter
> 56 <- __1cZquorum_ioctl_with_retries6FpnFvnode_ilpi_i_ 6308555745089857 rc: 0
> 56 -> __1cNdbg_print_bufIdbprintf6MpcE_v_ 6308555745108310 enter
> 56 -> __1cNdbg_print_bufLdbprintf_va6Mbpcrpv_v_ 6308555745120941 enter
> 56 -> __1cCosHsprintf6FpcpkcE_v_ 6308555745134231 enter
> 56 <- __1cCosHsprintf6FpcpkcE_v_ 6308555745148729 rc: 2890607504684
> 56 <- __1cNdbg_print_bufLdbprintf_va6Mbpcrpv_v_ 6308555745162898 rc: 1886718112
> 56 <- __1cNdbg_print_bufIdbprintf6MpcE_v_ 6308555745175529 rc: 1886718112
> 56 <- __1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_ 6308555745188599 rc: 22
>
> From the above output we know that quorum_ioctl_with_retries() returns 0, so it must be a checksum mismatch!
> As CMM_TRACE() is being called above and there are two of them after the ioctl call, let's check which one it is:
>
> 21 -> __1cNdbg_print_bufIdbprintf6MpcE_v_ 6309628794339298 CMM_DEBUG: quorum_scsi_sector_read: checksum mismatch.
>
>
> So this is where it fails:
>
>     if (check_data) {
>         PGRE_CALCCHKSUM(chksum, sector, iptr);
>
>         // Compare the checksum.
>         if (PGRE_GETCHKSUM(sector) != chksum) {
>             CMM_TRACE(("quorum_scsi_sector_read: "
>                 "checksum mismatch.\n"));
>             kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
>             return (EINVAL);
>         }
>
>
>
> By adding another quorum device, then removing d7 and adding it again (and removing the extra one), everything came back to normal. However, I wonder how we ended up there. HBA? Firmware? The 6540's firmware? An SC bug?
>
> # fcinfo hba-port -l
> HBA Port WWN: 2100001b3291014c
>         OS Device Name: /dev/cfg/c2
>         Manufacturer: QLogic Corp.
>         Model: 375-3356-02
>         Firmware Version: 05.01.00
>         FCode/BIOS Version: BIOS: 2.10; fcode: 2.4; EFI: 2.4;
>         Serial Number: 0402R00-0927731201
>         Driver Name: qlc
>         Driver Version: 20090519-2.31
>         Type: N-port
>         State: online
>         Supported Speeds: 1Gb 2Gb 4Gb
>         Current Speed: 4Gb
>         Node WWN: 2000001b3291014c
>         Link Error Statistics:
>                 Link Failure Count: 0
>                 Loss of Sync Count: 0
>                 Loss of Signal Count: 0
>                 Primitive Seq Protocol Error Count: 0
>                 Invalid Tx Word Count: 0
>                 Invalid CRC Count: 0
> HBA Port WWN: 2101001b32b1014c
>         OS Device Name: /dev/cfg/c3
>         Manufacturer: QLogic Corp.
>         Model: 375-3356-02
>         Firmware Version: 05.01.00
>         FCode/BIOS Version: BIOS: 2.10; fcode: 2.4; EFI: 2.4;
>         Serial Number: 0402R00-0927731201
>         Driver Name: qlc
>         Driver Version: 20090519-2.31
>         Type: N-port
>         State: online
>         Supported Speeds: 1Gb 2Gb 4Gb
>         Current Speed: 4Gb
>         Node WWN: 2001001b32b1014c
>         Link Error Statistics:
>                 Link Failure Count: 0
>                 Loss of Sync Count: 0
>                 Loss of Signal Count: 0
>                 Primitive Seq Protocol Error Count: 0
>                 Invalid Tx Word Count: 0
>                 Invalid CRC Count: 0
>
>
> 142084-02 is applied, and at a quick glance I can't see anything related to the above that might be addressed by 142084-03.
>
> Each 6540 presents one 2TB LUN and we are using ZFS to mirror between them. One of the LUNs is used as the quorum device as well.
> Since it looks like the quorum data was corrupted, the pool itself might be affected as well, so I ran a scrub. After a couple of hours this is what I have so far:
>
> # zpool status -v XXXX
>   pool: XXXX
>  state: DEGRADED
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: scrub in progress for 2h29m, 56.94% done, 1h52m to go
> config:
>
>         NAME                                       STATE     READ WRITE CKSUM
>         XXXX                                       DEGRADED     0     0    14
>           mirror                                   DEGRADED     0     0    28
>             c4t600A0B800029AF0000006CD4486B3B05d0  DEGRADED     0     0    28  too many errors
>             c4t600A0B800029B74600004255486B6A4Fd0  DEGRADED     0     0    28  too many errors
>
> errors: Permanent errors have been detected in the following files:
>
>         /XXXX/XXXX/XXXXXXXX/YYYYYY.dbf
>
>
> I can't see any other errors in the system, in the logs, or from FMA. The HBA firmware seems to be the latest version as well.
>
> Because of the corruption within the ZFS pool, I think that while the issue first manifested itself as a problem with the quorum device, it probably has nothing to do with SC itself and data is being corrupted somewhere. The other interesting thing is that so far all the corrupted blocks detected by ZFS were corrupted on both sides of the mirror. Since each side is a separate disk array, I think the corruption most likely originated on the server itself rather than on the SAN or the disk arrays. Now, the HBA is a dual-ported card and both paths are used (MPxIO). The issue is also unlikely to be caused by ZFS itself, as ZFS should not have affected the SC keys on the quorum device.
>
>
> Any ideas?
> </pre>
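
The argument about both sides of the mirror being corrupted can be illustrated with a toy model. The sketch below is not ZFS code and makes several assumptions: the names `Fault`, `ScrubResult`, `write_and_scrub` and `likely_origin` are hypothetical, FNV-1a stands in for whatever block checksum the pool actually uses, and a fault is a single bit flip injected either before the write fans out to the two arrays or on one array only. It shows why a checksum failure on both copies points at a component shared by both writes, while a failure on one copy points at that side alone.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy model only: FNV-1a is a stand-in for the real block checksum.
static std::uint64_t checksum(const std::vector<std::uint8_t>& data) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (std::uint8_t b : data) {
        h ^= b;
        h *= 1099511628211ULL;                  // FNV-1a 64-bit prime
    }
    return h;
}

// Hypothetical: where a single bit flip is injected on the write path.
enum class Fault { None, HostBeforeFanOut, ArrayA, ArrayB };

struct ScrubResult {
    bool a_ok;  // copy on array A matches the recorded checksum
    bool b_ok;  // copy on array B matches the recorded checksum
};

static void flip_bit(std::vector<std::uint8_t>& data) {
    if (!data.empty()) data[0] ^= 0x01;
}

// Write one block to a two-way mirror, then verify both copies against the
// checksum that was computed before the data left memory.
static ScrubResult write_and_scrub(std::vector<std::uint8_t> block, Fault fault) {
    const std::uint64_t recorded = checksum(block);

    // A fault on the shared part of the path damages the data once,
    // so both mirror copies inherit the same damage.
    if (fault == Fault::HostBeforeFanOut) flip_bit(block);

    std::vector<std::uint8_t> copy_a = block;  // sent to the first array
    std::vector<std::uint8_t> copy_b = block;  // sent to the second array

    // A fault inside one array (or its private path) damages one copy only.
    if (fault == Fault::ArrayA) flip_bit(copy_a);
    if (fault == Fault::ArrayB) flip_bit(copy_b);

    return {checksum(copy_a) == recorded, checksum(copy_b) == recorded};
}

// Map a scrub outcome back to the most likely location of the fault.
static std::string likely_origin(const ScrubResult& r) {
    if (r.a_ok && r.b_ok) return "no corruption";
    if (!r.a_ok && !r.b_ok) return "shared path (host side)";
    return "one side only (array or its path)";
}
```

Calling `likely_origin(write_and_scrub(block, Fault::HostBeforeFanOut))` yields the shared-path verdict, which matches the pattern in the `zpool status` output above, where both LUNs report the same CKSUM count. The model ignores the less likely case of two independent faults hitting the same block on both arrays.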