Hairul Ikmal Mohamad Fuzi
2007-Feb-24 20:07 UTC
[CentOS] Storage/SCSI Error on our CentOS server
Hi,

We are currently running CentOS 4.x on a 2-way Opteron machine. The machine is connected through a SCSI host adapter (Adaptec) to a 2TB storage unit (an external RAID-5 disk array).

Everything ran smoothly until a recent unplanned power trip. Since then we have had trouble accessing the storage: intermittent filesystem errors, a partition that could not be mounted read-write, unacceptably slow writes, and so on. The problems show up especially when we start writing to the storage.

After a few checks, we suspect one of the following:

1) the storage unit is failing (although the storage control panel did not report any disk/raidset failure), or
2) the SCSI host adapter is failing, or
3) the filesystem itself is corrupted (we ran 'fsck.ext3 -v -f', but it did not find any errors)

...but we are not sure which one. We also received messages in 'dmesg' that we had never seen before (please see below).

Based on the above and the 'dmesg' output below, we would appreciate it if somebody could help us identify what actually went wrong and how we could fix it (if possible).

TIA.
-Ikmal

Dmesg output :
==========================================
scsi4:0:0:0: Attempting to abort cmd 0000010082297a80: 0x28 0x0 0x0 0x0 0x1 0x3f 0x0 0x0 0x8 0x0
scsi4: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi4: Dumping Card State at program address 0xc Mode 0x33
Card was paused
HS_MAILBOX[0x0] INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11]
DFFSTAT[0x33] SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0]
LASTPHASE[0x1] SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x0]
SEQINTCTL[0x0] SEQ_FLAGS[0xc0] SEQ_FLAGS2[0x0] SSTAT0[0x0]
SSTAT1[0x0] SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0xc0]
SIMODE1[0xa4] LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0]
LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0x0]
SCB Count = 4 CMDS_PENDING = 1 LASTSCB 0xffff CURRSCB 0x2 NEXTSCB 0x0
qinstart = 59 qinfifonext = 59
QINFIFO:
WAITING_TID_QUEUES:
Pending list:
2 FIFO_USE[0x0] SCB_CONTROL[0x64] SCB_SCSIID[0x7]
Total 1
Kernel Free SCB list: 3 1 0
Sequencer Complete DMA-inprog list:
Sequencer Complete list:
Sequencer DMA-Up and Complete list:
scsi4: FIFO0 Free, LONGJMP == 0x80ff, SCB 0x0
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0] SOFFCNT[0x0]
MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
scsi4: FIFO1 Free, LONGJMP == 0x81d8, SCB 0x3
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0] SOFFCNT[0x0]
MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
LQIN: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
scsi4: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE 0x52
scsi4: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x0
SIMODE0[0xc] CCSCBCTL[0x0]
scsi4: REG0 == 0xffff, SINDEX = 0x1e0, DINDEX = 0xe1
scsi4: SCBPTR == 0x3, SCB_NEXT == 0x2, SCB_NEXT2 =0x2
CDB 28 0 0 80 19 ac
STACK: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
DevQ(0:0:0): 0 waiting
(scsi4:A:0:0): Device is disconnected, re-queuing SCB
Recovery code sleeping
Recovery SCB completes
Recovery code awake
scsi4: Transmission error detected
LQISTAT1[0x0] LASTPHASE[0x1] SCSISIGI[0x0] PERRDIAG[0x1]
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi4: Dumping Card State at program address 0x26 Mode 0x11
Card was paused
HS_MAILBOX[0x0] INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11]
DFFSTAT[0x33] SCSISIGI[0x1a] SCSIPHASE[0x1] SCSIBUS[0xff]
LASTPHASE[0x1] SCSISEQ0[0x40] SCSISEQ1[0x12] SEQCTL0[0x0]
SEQINTCTL[0x0] SEQ_FLAGS[0xc0] SEQ_FLAGS2[0x0] SSTAT0[0x10]
SSTAT1[0x11] SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0]
SIMODE1[0xac] LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0]
LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0x0]
SCB Count = 4 CMDS_PENDING = 1 LASTSCB 0xffff CURRSCB 0x2 NEXTSCB 0x0
qinstart = 61 qinfifonext = 61
QINFIFO:
WAITING_TID_QUEUES:
0 ( 0x2 )
Pending list:
2 FIFO_USE[0x0] SCB_CONTROL[0x50] SCB_SCSIID[0x7]
Total 1
Kernel Free SCB list: 3 1 0
Sequencer Complete DMA-inprog list:
Sequencer Complete list:
Sequencer DMA-Up and Complete list:
scsi4: FIFO0 Free, LONGJMP == 0x80ff, SCB 0x0
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0] SOFFCNT[0x1]
MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
scsi4: FIFO1 Free, LONGJMP == 0x81d8, SCB 0x3
SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]
SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0] SOFFCNT[0x1]
MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0
HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]
LQIN: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
scsi4: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE 0x52
scsi4: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x0
SIMODE0[0xc] CCSCBCTL[0x4]
scsi4: REG0 == 0x3, SINDEX = 0x11d, DINDEX = 0xe1
scsi4: SCBPTR == 0x3, SCB_NEXT == 0x2, SCB_NEXT2 =0x2
CDB 28 0 0 80 19 ac
STACK: 0x13 0x0 0x0 0x0 0x0 0x0 0x0 0x0
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
DevQ(0:0:0): 0 waiting
(scsi4:A:0): 80.000MB/s transfers (40.000MHz DT, 16bit)
===========================================
Hairul Ikmal Mohamad Fuzi wrote:
> Hi,
>
> Currently we are running CentOS 4.x on a 2-way Opteron machine.
> This machine, through a SCSI host adapter (Adaptec), is connected to a
> 2TB storage unit (an external RAID-5 disk array)
>
> Until our recent unintentional power trip, everything was fine and
> smooth.
> We have been experiencing complication accessing the storage ( it
> could be either intermittent filesystem error, partition could not be
> mounted in read-write mode, unacceptable writing speed, etc ),
> especially when we start to 'write' on the storage.
>
> After a few check, we are suspecting either :
>
> 1) the storage unit (but the storage control panel did not report any
> disk/raidset failure) is failing or,
> 2) the SCSI host adapter is failing, or
> 3) the filesystem itself is corrupted (we did 'fsck.ext3 -v -f' but it
> turned out it did not find any errors)

or 4) SCSI cabling. I see some SCSI transmission errors in there.

About the only way I know to diagnose something like this is to swap parts. I'd swap the controller card and see if the problems go away, then try the cable, then try the storage controller. If one of these fixes the problem, back the other changes out (i.e. put the original card back, etc.).
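
To make the swap test a bit more objective, one option is to save the kernel log after each change and count the error messages in it. What follows is only a minimal Python 3 sketch, not a tested tool. The marker strings are copied from the dmesg output earlier in this thread, and the names MARKERS and count_scsi_events are made up for illustration. Other driver versions may word these messages differently.

```python
import sys

# Marker strings copied from the dmesg output in this thread (assumption:
# the driver on your system prints the same wording).
MARKERS = {
    "abort_attempts": "Attempting to abort cmd",
    "transmission_errors": "Transmission error detected",
    "card_state_dumps": "Dump Card State Begins",
}


def count_scsi_events(log_text):
    """Return a dict mapping each label in MARKERS to its occurrence count."""
    return {label: log_text.count(marker) for label, marker in MARKERS.items()}


if __name__ == "__main__":
    # Each argument is a file holding saved dmesg output, e.g. one per swap.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            counts = count_scsi_events(handle.read())
        print(path, counts)
```

For example, save the log with 'dmesg > before.txt', swap one part, put some write load on the array, save 'dmesg > after.txt', and run the script on both files (the file names here are placeholders). Because dmesg shows a ring buffer that still contains old messages, the comparison is only meaningful if each capture starts from a fresh boot or a cleared buffer.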