I've been having no end of issues with a 3ware 9650SE-24M8 in a server that's coming on a year old. I've got 24 WDC WD5001ABYS drives (500GB) hooked to it, running as a single RAID6 w/ a hot spare. These issues boil down to the card periodically throwing errors like the following:

sd 1:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.

Usually when this happens, it's followed by:

3w-9xxx: scsi1: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.

On the less pleasant occasions, it's followed by:

scsi1: ERROR: (0x06:0x0036): Response queue (large) empty failed during reset sequence.
3w-9xxx: scsi1: ERROR: (0x06:0x002B): Controller reset failed during scsi host reset.
sd 1:0:0:0: scsi: Device offlined - not ready after error recovery

This of course leads to a several hour downtime as the system has to be powered down (not just rebooted) and then the volume needs to be fscked. I've been back and forth with both the vendor and (via the vendor) 3ware with this. The card has been replaced, as well as the whole system. I'm running the latest firmware and drivers from 3ware.

Have other folks had good luck with this card? What sorts of configs are you running? I'm in the position of needing more storage, and I'm a bit gun shy on 3ware at the moment...

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
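To see how often a timeout ends in a clean recovery versus a failed reset, the messages quoted above can be tallied from a saved kernel log. This is a minimal sketch: the patterns are copied from the error codes in this post, while the category names and the `classify`/`summarize` helpers are hypothetical and not part of any 3ware tool.

```python
import re

# Error codes are taken from the messages quoted in the post.
# The category names on the left are illustrative assumptions.
PATTERNS = [
    ("timeout", re.compile(r"\(0x06:0x002C\)")),
    ("cache_sync", re.compile(r"\(0x04:0x005E\)")),
    ("reset_failed", re.compile(r"\(0x06:0x(?:0036|002B)\)")),
    ("offlined", re.compile(r"Device offlined")),
]


def classify(line):
    """Return the category of a log line, or None if it matches nothing."""
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {}
    for line in lines:
        kind = classify(line)
        if kind is not None:
            counts[kind] = counts.get(kind, 0) + 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < saved_kernel_log.txt
    for kind, count in sorted(summarize(sys.stdin).items()):
        print(f"{kind}: {count}")
```

Comparing the "timeout" count with the "reset_failed" count gives a rough idea of how many timeouts the card recovers from on its own.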
Joshua Baker-LePain wrote:
> I've been having no end of issues with a 3ware 9650SE-24M8 in a server
> that's coming on a year old. I've got 24 WDC WD5001ABYS drives
> (500GB) hooked to it, running as a single RAID6 w/ a hot spare. These
> issues boil down to the card periodically throwing errors like the
> following:
> ....
> Have other folks had good luck with this card? What sorts of configs
> are you running? I'm in the position of needing more storage, and I'm
> a bit gun shy on 3ware at the moment...
>

I have no experience with that RAID card; most of our larger systems use external SAN storage. But I will say that that is, IMHO, a very large RAID6. We usually don't make single RAID sets much larger than 7-8 drives, and for a very large storage system we will stripe multiple RAID5/6 sets rather than have one huge one.
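The capacity cost of striping several smaller RAID6 sets instead of one huge set is simple arithmetic. A rough sketch follows, assuming the 500GB drives from the original post and two drives' worth of parity per RAID6 set. The 3x7 and 3x8 layouts are illustrative assumptions, not configurations anyone in the thread reported running, and the function names are hypothetical.

```python
DRIVE_GB = 500  # WD5001ABYS size given in the original post


def raid6_usable_gb(drives, drive_gb=DRIVE_GB):
    """Usable capacity of one RAID6 set: two drives' worth goes to parity."""
    if drives < 4:
        raise ValueError("RAID6 needs at least four drives")
    return (drives - 2) * drive_gb


def striped_usable_gb(sets, drives_per_set, drive_gb=DRIVE_GB):
    """Usable capacity of several equal RAID6 sets striped together."""
    return sets * raid6_usable_gb(drives_per_set, drive_gb)


if __name__ == "__main__":
    # Current layout: 24 drives minus one hot spare in a single RAID6.
    print("1 x 23-drive RAID6:", raid6_usable_gb(23), "GB")       # 10500 GB
    # Hypothetical alternatives on the same 24 bays.
    print("3 x 7-drive RAID6: ", striped_usable_gb(3, 7), "GB")   # 7500 GB, 3 bays left for spares
    print("3 x 8-drive RAID6: ", striped_usable_gb(3, 8), "GB")   # 9000 GB, no hot spare
```

The smaller sets give up usable space, but each set tolerates two failed drives on its own and a rebuild only touches the drives in that set.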
Joshua Baker-LePain wrote:
> periodically throwing errors like the following:
>
> sd 1:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting
> card.

Wondering if you have scheduled automatic media scans of all of the disks in the array? Perhaps you have a disk that is going bad and causing the issue.

Something else that could be related: I was told by someone who had an Isilon storage system (fancy NAS box) that his WD disk drives would hang on him on occasion. When this occurred he had to physically remove the disk from the system and plug it back in. It was a firmware issue. I don't recall which WD drives he had, but he eventually got fixed firmware. This was about a year ago.

I have media scans run once a week for about 7 hours on my 2-disk 3ware systems (8006-2 controllers). For a 24-disk system you'll probably need to run it longer (unless the newer controllers scan in parallel; the 8000 series seems to be serial).

I ran a couple of 9650 series cards not too long ago. I think they were just two-disk systems running RAID 1 (the cards take up to 8 disks, but I only used 2).

I've been using 3ware cards for about 8 years now and have not run into the types of errors you describe. I've probably run about 350 cards over the years, most of them in the 8000 series.

nate
On Sat, Jun 21, 2008 at 11:04 PM, Joshua Baker-LePain <jlb17 at duke.edu> wrote:
> I've been having no end of issues with a 3ware 9650SE-24M8 in a server
> that's coming on a year old. I've got 24 WDC WD5001ABYS drives (500GB)
> hooked to it, running as a single RAID6 w/ a hot spare. These issues boil
> down to the card periodically throwing errors like the following:
>
> sd 1:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting
> card.
>
> Usually when this happens, it's followed by:
>
> 3w-9xxx: scsi1: AEN: INFO (0x04:0x005E): Cache synchronization
> completed:unit=0.
>
> On the less pleasant occasions, it's followed by:
>
> scsi1: ERROR: (0x06:0x0036): Response queue (large) empty failed during
> reset sequence.
> 3w-9xxx: scsi1: ERROR: (0x06:0x002B): Controller reset failed during scsi
> host reset.
> sd 1:0:0:0: scsi: Device offlined - not ready after error recovery
>
> This of course leads to a several hour downtime as the system has to be
> powered down (not just rebooted) and then the volume needs to be fscked.
> I've been back and forth with both the vendor and (via the vendor) 3ware
> with this. The card has been replaced, as well as the whole system. I'm
> running the latest firmware and drivers from 3ware.
>
> Have other folks had good luck with this card? What sorts of configs are
> you running? I'm in the position of needing more storage, and I'm a bit gun
> shy on 3ware at the moment...

This may be completely irrelevant, but we have a 9550 card running RAID 5 with a 'prominent non-Linux' operating system that suffers from the same symptoms (and 4 others that have never done it). We've heard from our vendor (and 3ware) that there are some upcoming firmware releases (looks like August) that might help. A 3ware tech told me that the controller reset happens when communication between the driver and the firmware times out, which appears to be exactly what is in your error message.

Meanwhile, we just cross our fingers and thank our lucky stars that the server in question is in our local office and not one of our non-tech-staffed remote offices.

There are unsupported pre-release firmware downloads available if you like to gamble. I have not had the courage to install the beta firmware on our servers.

I have not used 3ware with CentOS, but I don't think this is a CentOS issue.

--
Jeff
>Have other folks had good luck with this card? What sorts of configs are you
>running? I'm in the position of needing more storage, and I'm a bit gun shy on
>3ware at the moment...

Does that drive have a jumper to slow it down to the 1.5Gb/s transfer rate? Cheap controllers and drives just can't do it. I had no end of issues even with *all* my LSI controllers until I jumpered all my SATA drives down. As far as performance goes, it made no impact on my systems.

jlc
on 6-21-2008 9:04 PM Joshua Baker-LePain spake the following:
> I've been having no end of issues with a 3ware 9650SE-24M8 in a server
> that's coming on a year old. I've got 24 WDC WD5001ABYS drives (500GB)
> hooked to it, running as a single RAID6 w/ a hot spare. These issues
> boil down to the card periodically throwing errors like the following:
>
> sd 1:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting
> card.
>
> Usually when this happens, it's followed by:
>
> 3w-9xxx: scsi1: AEN: INFO (0x04:0x005E): Cache synchronization
> completed:unit=0.
>
> On the less pleasant occasions, it's followed by:
>
> scsi1: ERROR: (0x06:0x0036): Response queue (large) empty failed during
> reset sequence.
> 3w-9xxx: scsi1: ERROR: (0x06:0x002B): Controller reset failed during
> scsi host reset.
> sd 1:0:0:0: scsi: Device offlined - not ready after error recovery
>
> This of course leads to a several hour downtime as the system has to be
> powered down (not just rebooted) and then the volume needs to be fscked.
> I've been back and forth with both the vendor and (via the vendor) 3ware
> with this. The card has been replaced, as well as the whole system.
> I'm running the latest firmware and drivers from 3ware.
>
> Have other folks had good luck with this card? What sorts of configs
> are you running? I'm in the position of needing more storage, and I'm a
> bit gun shy on 3ware at the moment...
>

That looks like either drive, cabling, or power problems.

--
MailScanner is like deodorant...
You hope everybody uses it, and
you notice quickly if they don't!!!!
On Sunday 22 June 2008 12:04:47 am Joshua Baker-LePain wrote:
> I've been having no end of issues with a 3ware 9650SE-24M8 in a server
> that's coming on a year old. I've got 24 WDC WD5001ABYS drives (500GB)
> hooked to it, running as a single RAID6 w/ a hot spare.

What size power supply do you have in your server?

Peter.
On Sun, 22 Jun 2008 at 10:23am, Scott Silva wrote
> on 6-21-2008 9:04 PM Joshua Baker-LePain spake the following:
>>
>> This of course leads to a several hour downtime as the system has to be
>> powered down (not just rebooted) and then the volume needs to be fscked.
>> I've been back and forth with both the vendor and (via the vendor) 3ware
>> with this. The card has been replaced, as well as the whole system. I'm
>> running the latest firmware and drivers from 3ware.
>>
> That looks like either drive, cabling, or power problems.

I'd agree, except for a) all the hardware has been swapped out and b) 1500W should be plenty. It's starting to sound like this may be a somewhat known issue with a *long* overdue fix coming from 3ware. *sigh*

Thanks all.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
Joshua Baker-LePain wrote:
> I've been having no end of issues with a 3ware 9650SE-24M8 in a server
> that's coming on a year old. I've got 24 WDC WD5001ABYS drives (500GB)
> hooked to it, running as a single RAID6 w/ a hot spare. These issues
> boil down to the card periodically throwing errors like the following:
>
> sd 1:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting
> card.

I have the 8-port 9650SE in a couple of servers running 64-bit CentOS 5. They are pretty heavily used database servers with lots of bursts of disk activity. No problems so far. I'm using the binary driver provided by 3ware.

I'm now testing the 16-port version for newer servers. No problems there either (but not much usage yet, of course).

--
Florin Andrei
http://florin.myip.org/