Curtis H. Wilbar Jr.
2008-Apr-21 21:34 UTC
[CentOS] rejecting I/O to offline device (PERC woes)
Haven't gotten any tips on a solution to the problem below.
It happened again this weekend.

My next test steps (order not determined):

1. Downgrade to CentOS 4
2. Swap out the PERC controller with a spare

I have never had a problem with the PERC4/DC controllers on our
other machines (RHEL3/4, CentOS 4). Then again, no other machine
of ours has five 300G Fujitsu SCSI drives either.

Any suggestions on the below, or on which order to try the above?

Thanks,

-- Curt

-------------------------------

I have a 6650 with a PERC4/DC running CentOS 5. After 1 to 3 weeks
of operation (running VMware Server) it "dies" (the RAID array gets
taken offline) and you get "rejecting I/O to offline device".

When this system was set up late last year, the 6650 was given all
the latest firmware, as was the PERC4/DC.

Using linttylog, I pulled the last entries from when the system must
have "checked out" last night; the data is attached below.

Some time back I thought I had cured this problem by adding noapic
to the kernel boot parameters in grub.conf. The problem went away
for a long time... but is now back.

According to linttylog, the controller firmware is:

T0: Firmware version 352D build on Mar 19 2007 at 17:43:23
T0: MegaRAID Series 518 firmware version 352D

Using:

strings tty.log | grep 'MedErr on pd' | cut -c17- | sort | uniq -c | sort -n

I see:

163 REC:log MedErr on pd[1] #retries=0
165 REC:log MedErr on pd[4] #retries=0
168 REC:log MedErr on pd[2] #retries=0

If I am to believe this, Patrol Read is finding media errors on
physical drives 1, 2, and 4!? These drives are not even a year old,
and an almost even distribution of errors across three drives seems
far fetched (unless Patrol Read is reading past the end of the drive?
But then it would be doing that with all five drives).

Is the PERC busted? A driver issue?

I'm running CentOS 5 with kernel 2.6.18-53.1.13.el5PAE.

From dmesg, the megaraid-related driver versions are:

megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)

Anyone seen this behavior before? Anyone have a solution?

We have several Dells in a hosting environment with PERC4/DC
controllers running RHEL3, RHEL4.x, and CentOS 4.x. We have not had
this issue on any of them (though none of them has five 300G Fujitsu
SCSI drives in a RAID 5 config, as this one does).

Hoping someone can shed some light on this... so far I keep coming
up short on finding a solution.
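One further check worth running on that tty log: if most of the MedErr
entries point at a handful of LBAs rather than being scattered across
the platters, genuine media failure on three drives at once becomes even
less plausible. A small variant of the pipeline above can tally this
(a sketch; it assumes the BadLba= fields have the same shape as in the
excerpt that follows):

  # count how many distinct LBAs the controller is actually complaining about
  strings tty.log | grep -o 'BadLba=[0-9a-f]*' | sort | uniq -c | sort -rn

A single LBA dominating that output would point at the controller or
firmware rather than at three coincidentally bad patches of media.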
Here is the full content of the last lines recorded in the PERC as
pulled by linttylog:

03/24 21:43:41: Next PR scheduled to start at 03/25 0:00:00
03/25  3:47:41: REC:log MedErr on pd[4] #retries=0
03/25  3:47:41: LogSense: pd=04, cdb=2f 00 14 30 03 12 00 ff ff 00
03/25  3:47:41: sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
03/25  3:47:41: REC: MedErr on LD[4] BadLba=143014e6
03/25  3:47:41: prCallback: Medium Error on pd=04, StartLba=14300312, ErrLba=143014e6
03/25  3:47:42: prRecQueue: starting pd=04 recovery - blocking host commands
03/25  3:47:42: prRecGo: Ready to attempt recovery errLBA=143014e6 on pd=04
03/25  3:47:42: prGetLDInfo: MediaErr in ld=0, span=0, arm=4
03/25  3:47:42: prRecGo: dataErr found on ld 0 span 0 arm 4
03/25  3:47:42: prRecGo: data NOT in cache; cacheLn=ffffffff, row=143014, stripe=50c052, refBlk=e6, type=1
03/25  3:47:42: prRecGo: R5-get cacheLn=133, c_ptr=d0dc1045 mem=d0daf9a4 & setup cInx=1db c=d0d26ee6
03/25  3:47:42: RtnFrmPrcRcyRd for arm=1 mem=d0db23f1 stripe=143014 type=2 i=0 status=ffff
03/25  3:47:42: RtnFrmPrcRcyRd for arm=2 mem=d0db2402 stripe=50c050 type=1 i=1 status=ffff
03/25  3:47:42: RtnFrmPrcRcyRd for arm=3 mem=d0db2413 stripe=50c051 type=1 i=2 status=ffff
03/25  3:47:42: RtnFrmPrcRcyRd for arm=0 mem=d0db23e0 stripe=50c053 type=1 i=4 status=ffff
03/25  3:47:43: REC:log MedErr on pd[1] #retries=0
03/25  3:47:43: LogSense: pd=01, cdb=28 00 14 30 14 e6 00 00 01 00
03/25  3:47:43: sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
03/25  3:47:43: REC: MedErr on LD[1] BadLba=143014e6
03/25  3:47:43: REC:log MedErr on pd[2] #retries=0
03/25  3:47:43: LogSense: pd=02, cdb=28 00 14 30 14 e6 00 00 01 00
03/25  3:47:43: sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
03/25  3:47:43: REC: MedErr on LD[2] BadLba=143014e6
03/25  3:47:43: org resgnBlk=e6
03/25  3:47:43: org resgnBlk=ffff
03/25  3:47:43: PatrolReadStart: Log MedErr pd=04 errLBA=143014e6 in NVRAM_EVENT_LOG
03/25  3:48:06: REC:log MedErr on pd[2] #retries=0
03/25  3:48:06: LogSense: pd=02, cdb=2f 00 14 2f a7 92 00 ff ff 00
03/25  3:48:06: sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
03/25  3:48:06: REC: MedErr on LD[2] BadLba=143014e6
03/25  3:48:06: prCallback: Medium Error on pd=02, StartLba=142fa792, ErrLba=143014e6
03/25  3:48:06: prRecQueue: starting pd=02 recovery - blocking host commands
03/25  3:48:06: prRecGo: Ready to attempt recovery errLBA=143014e6 on pd=02
03/25  3:48:06: prGetLDInfo: MediaErr in ld=0, span=0, arm=2
03/25  3:48:06: prRecGo: dataErr found on ld 0 span 0 arm 2
03/25  3:48:06: prRecGo: data NOT in cache; cacheLn=24a, row=143014, stripe=50c050, refBlk=e6, type=1
03/25  3:48:06: prRecGo: R5-get cacheLn=24a, c_ptr=d0dcc646 mem=d0d945e2 & setup cInx=10e c=d0d1f9fc
03/25  3:48:06: RtnFrmPrcRcyRd for arm=1 mem=d0d98e11 stripe=143014 type=2 i=0 status=ffff
03/25  3:48:06: RtnFrmPrcRcyRd for arm=3 mem=d0d98e33 stripe=50c051 type=1 i=2 status=ffff
03/25  3:48:06: RtnFrmPrcRcyRd for arm=4 mem=d0d98e44 stripe=50c052 type=1 i=3 status=ffff
03/25  3:48:06: RtnFrmPrcRcyRd for arm=0 mem=d0d98e00 stripe=50c053 type=1 i=4 status=ffff
03/25  3:48:07: REC:log MedErr on pd[1] #retries=0
03/25  3:48:07: LogSense: pd=01, cdb=28 00 14 30 14 e6 00 00 01 00
03/25  3:48:07: sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
03/25  3:48:07: REC: MedErr on LD[1] BadLba=143014e6
03/25  3:48:07: REC:log MedErr on pd[4] #retries=0
03/25  3:48:07: LogSense: pd=04, cdb=28 00 14 30 14 e6 00 00 01 00
03/25  3:48:07: sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
03/25  3:48:07: REC: MedErr on LD[4] BadLba=143014e6
03/25  3:48:07: org resgnBlk=e6
03/25  3:48:07: org resgnBlk=ffff
03/25  3:48:07: PatrolReadStart: Log MedErr pd=02 errLBA=143014e6 in NVRAM_EVENT_LOG
03/25  3:48:09: REC:log MedErr on pd[1] #retries=0
03/25  3:48:09: LogSense: pd=01, cdb=2f 00 14 30 00 d2 00 ff ff 00
03/25  3:48:09: sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
03/25  3:48:09: REC: MedErr on LD[1] BadLba=143014e6
03/25  3:48:09: prCallback: Medium Error on pd=01, StartLba=143000d2, ErrLba=143014e6
03/25  3:48:10: prRecQueue: starting pd=01 recovery - blocking host commands
03/25  3:48:10: prRecGo: Ready to attempt recovery errLBA=143014e6 on pd=01
03/25  3:48:10: prGetLDInfo: MediaErr in ld=0, span=0, arm=1
03/25  3:48:10: prRecGo: dataErr found on ld 0 span 0 arm 1
03/25  3:48:10: prRecGo: data NOT in cache; cacheLn=99, row=143014, stripe=143014, refBlk=e6, type=2
03/25  3:48:10: prRecGo: R5-get cacheLn=99, c_ptr=d0dbabcf mem=d0d92c51 & setup cInx=102 c=d0d1f324
03/25  3:48:10: RtnFrmPrcRcyRd for arm=2 mem=d0d9d002 stripe=50c050 type=1 i=1 status=ffff
03/25  3:48:10: RtnFrmPrcRcyRd for arm=3 mem=d0d9d013 stripe=50c051 type=1 i=2 status=ffff
03/25  3:48:10: RtnFrmPrcRcyRd for arm=4 mem=d0d9d024 stripe=50c052 type=1 i=3 status=ffff
03/25  3:48:10: RtnFrmPrcRcyRd for arm=0 mem=d0d9cfe0 stripe=50c053 type=1 i=4 status=ffff
03/25  3:48:11: REC:log MedErr on pd[2] #retries=0
03/25  3:48:11: LogSense: pd=02, cdb=28 00 14 30 14 e6 00 00 01 00
03/25  3:48:11: sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
03/25  3:48:11: REC: MedErr on LD[2] BadLba=143014e6
03/25  3:48:11: REC:log MedErr on pd[4] #retries=0
03/25  3:48:11: LogSense: pd=04, cdb=28 00 14 30 14 e6 00 00 01 00
03/25  3:48:11: sense=f0 00 03 14 30 14 e6 28 00 00 00 00 11 01 00 00 00 3f
03/25  3:48:11: REC: MedErr on LD[4] BadLba=143014e6
03/25  3:48:11: org resgnBlk=e6
03/25  3:48:11: org resgnBlk=ffff
03/25  3:48:11: PatrolReadStart: Log MedErr pd=01 errLBA=143014e6 in NVRAM_EVENT_LOG

Any help is greatly appreciated,

-- Curt
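P.S. Two things stand out in the log above. First, every recovery
attempt reports the identical ErrLba=143014e6 on three different
physical drives (pd 01, 02, and 04). Second, that address is nowhere
near the end of a 300G drive, which argues against the "reading past
the end" theory. A quick conversion shows it (a sketch; it assumes
512-byte sectors, the norm for SCSI drives of this class):

  printf '%d\n' 0x143014e6              # -> 338695398 (the ErrLba in decimal)
  echo '338695398 * 512 / 10^9' | bc    # -> 173, i.e. ~173 GB into a 300 GB drive

One identical logical address reported by three separate spindles looks
far more consistent with a controller, firmware, or cache problem than
with three drives failing at the same block.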
On Mon, 2008-04-21 at 17:34 -0400, Curtis H. Wilbar Jr. wrote:

> Haven't gotten any tips on a solution to the problem below.
> It happened again this weekend.
>
> My next test steps (order not determined):
>
> 1. Downgrade to CentOS 4
> 2. Swap out PERC controller with a spare
>
> I have never had a problem with the PERC4/DC controllers on our
> other machines (RHEL3/4, CentOS 4). Although, I've no other
> machine that has 5 300G Fujitsu SCSI drives either.
>
> Any suggestions on the below, or which order on the above to
> try ?
>
> Thanks,
>
> -- Curt

I see you have found the Right List: "linux-poweredge at dell.com"

--
~/john
OpenPGP Sig:BA91F079