Hello,

Having a problem with software RAID that is driving me crazy. Here are the details:

1. CentOS 6.2 x86_64 installed from the minimal ISO (via PXE boot).
2. Reasonably good PC hardware (i.e. not budget, but not server grade either)
   with a pair of 1TB Western Digital SATA3 drives.
3. Drives are plugged into the SATA3 ports on the mainboard (both the drives
   and the cables say they can do 6Gb/s).
4. During the install I set up software RAID1 for the two drives with two RAID
   partitions:
     md0 - 500M for /boot
     md1 - "the rest" for a physical volume
5. Set up LVM on md1 in the standard slash, swap, home layout.

The install goes fine (actually really fast) and I reboot into CentOS 6.2. Next
I run yum update, add a few minor packages and perform some basic configuration.

Now I start to get I/O errors printed on the console. I run 'mdadm -D /dev/md1'
and see the array is degraded and /dev/sdb2 has been marked as faulty.

Okay, fair enough, I've got at least one bad drive. I boot the system from a
live USB and run the short and long SMART tests on both drives. No problems are
reported, but I know that can be misleading, so I'm going to have to gather
some evidence before I try to return these drives. I run badblocks in
destructive mode on both drives as follows:

  badblocks -w -b 4096 -c 98304 -s /dev/sda
  badblocks -w -b 4096 -c 98304 -s /dev/sdb

I come back the next day and see that no errors are reported. Er, that's odd. I
check the SMART data in case the badblocks activity has triggered something.
Nope. Maybe I screwed up the install somehow?

So I start again and repeat the install process very carefully. This time I
check the RAID arrays straight after boot:

  mdadm -D /dev/md0 - all is fine.
  mdadm -D /dev/md1 - the two drives are resyncing.

Okay, that is odd. The RAID1 array was created at the start of the install
process, before any software was installed. Surely it should be in sync
already? I googled a bit and found a post where someone else had seen the same
thing happen. The advice was to just wait until the drives sync so the 'blocks
match exactly', but I'm not really happy with that explanation. At this rate
it's going to take a whole day to do a single minimal install, and I'm sure I
would have heard others complaining about the process.

Anyway, I leave the system to sync for the rest of the day. When I get back to
it I see the same (similar) I/O errors on the console, and mdadm shows the RAID
array is degraded and /dev/sdb2 has been marked as faulty. This time I notice
that the I/O errors all refer to /dev/sda. I have to reboot because the
filesystem is now read-only. When the system comes back up, it's trying to
resync the drives again. Eh?

Any ideas what is going on here? If it's bad drives, I really need some
confirmation independent of the software RAID failing. I thought SMART or
badblocks would give me that. Perhaps it has nothing to do with the drives.
Could a problem with the mainboard or the memory cause this issue? Is it a
SATA3 issue? Should I try it on the 3Gb/s channels, since there's probably
little speed difference with non-SSDs?

Cheers,

Kal
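P.S. For reference, this is roughly how I've been checking things after each
failure (a sketch only; device names are from my box, and it assumes
smartmontools and mdadm are installed):

  # overall array state at a glance
  cat /proc/mdstat
  mdadm -D /dev/md1

  # SMART health and raw attributes for each drive
  smartctl -H /dev/sda
  smartctl -A /dev/sda        # watch Reallocated_Sector_Ct and Current_Pending_Sector
  smartctl -t long /dev/sda   # then later: smartctl -l selftest /dev/sda

  # kernel-side I/O and libata errors
  dmesg | grep -iE 'ata|error'
  grep -iE 'ata|sd[ab]' /var/log/messages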
On 2012-02-29, Kahlil Hodgson <kahlil.hodgson at dealmax.com.au> wrote:

> 2. Reasonably good PC hardware (i.e. not budget, but not server grade either)
>    with a pair of 1TB Western Digital SATA3 drives.

One thing you can try is to download WD's drive tester and throw it at your
drives. It seems unlikely to find anything, but you never know. The tester is
available on the UBCD bootable CD image (which has lots of other handy tools).

Which model drives do you have? I've found a lot of variability between the
WDxxEARS and their RE drives.

> Okay, that is odd. The RAID1 array was created at the start of the install
> process, before any software was installed. Surely it should be in sync
> already? I googled a bit and found a post where someone else had seen the
> same thing happen. The advice was to just wait until the drives sync so the
> 'blocks match exactly', but I'm not really happy with that explanation.

Supposedly, at least with RAID[456], the array is completely usable when it's
resyncing after an initial creation. In practice, I found that writing
significant amounts of data to that array killed resync performance, so I just
let the resync finish before doing any heavy lifting on the array.

> Anyway, I leave the system to sync for the rest of the day. When I get back
> to it I see the same (similar) I/O errors on the console, and mdadm shows the
> RAID array is degraded and /dev/sdb2 has been marked as faulty. This time I
> notice that the I/O errors all refer to /dev/sda. I have to reboot because
> the filesystem is now read-only. When the system comes back up, it's trying
> to resync the drives again. Eh?

This sounds a little odd. You're having I/O errors on sda, but sdb2 has been
kicked out of the RAID? Do you have any other errors in /var/log/messages that
relate to sdb, and/or the errors right around when the md devices failed?

--keith

--
kkeller-usenet at wombat.san-francisco.ca.us
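P.S. Before booting the UBCD, you can pull the model/firmware strings and any
kernel-side ATA errors from the running system with something like this (a
sketch; assumes the smartmontools package is installed, which it is not in a
minimal CentOS 6 install):

  # model, serial and firmware for each drive
  smartctl -i /dev/sda
  smartctl -i /dev/sdb

  # any libata resets, timeouts or media errors the kernel has logged
  dmesg | grep -i 'ata[0-9]'
  grep -iE 'ata|sd[ab]' /var/log/messages | tail -n 100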
on 2/28/2012 4:27 PM Kahlil Hodgson spake the following:

> Having a problem with software RAID that is driving me crazy.
<snip>

First thing... are they Green drives? Green drives power down randomly and can
cause these types of errors... Also, maybe the 6Gb/s SATA isn't fully supported
by Linux and that board... Try the 3Gb/s channels.
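P.S. If they do turn out to be Greens, the aggressive head parking usually shows
up as a rapidly climbing Load_Cycle_Count. A quick check (a sketch; assumes
smartmontools and hdparm are available, and not every drive honours APM):

  # attribute 193 climbing by hundreds per day suggests aggressive head parking
  smartctl -A /dev/sda | grep -i load_cycle
  smartctl -A /dev/sdb | grep -i load_cycle

  # check the drive's APM setting; 255 disables APM where the drive supports it
  hdparm -B /dev/sda
  hdparm -B 255 /dev/sda

  # and confirm what link speed the drives actually negotiated
  dmesg | grep -i 'SATA link up'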
On Wed, Feb 29, 2012 at 11:27:53AM +1100, Kahlil Hodgson wrote:

> Now I start to get I/O errors printed on the console. I run 'mdadm -D
> /dev/md1' and see the array is degraded and /dev/sdb2 has been marked as
> faulty.

What I/O errors?

> So I start again and repeat the install process very carefully. This time I
> check the RAID arrays straight after boot:
>
>   mdadm -D /dev/md0 - all is fine.
>   mdadm -D /dev/md1 - the two drives are resyncing.
>
> Okay, that is odd. The RAID1 array was created at the start of the install
> process, before any software was installed. Surely it should be in sync
> already?
<snip>

Yeah, it's normal for a RAID1 to 'sync' when you first create it. The odd part
is the I/O errors.

> Any ideas what is going on here? If it's bad drives, I really need some
> confirmation independent of the software RAID failing. I thought SMART or
> badblocks would give me that. Perhaps it has nothing to do with the drives.
> Could a problem with the mainboard or the memory cause this issue? Is it a
> SATA3 issue? Should I try it on the 3Gb/s channels, since there's probably
> little speed difference with non-SSDs?

Look up the drive errors.

Oh, and my experience? Both WD and Seagate won't complain if you err on the
side of 'when in doubt, return the drive' - that's what I do. But yeah, usually
SMART will report something... at least a high reallocated sector count or
something.
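P.S. The attributes worth checking for a genuinely dying drive, roughly
(assumes smartmontools; the names are the standard ones smartctl prints):

  # non-zero or growing values here are reasonable grounds for an RMA
  smartctl -A /dev/sda | grep -iE 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

  # self-test history, in case the long test logged a failing LBA
  smartctl -l selftest /dev/sda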
On Tue, Feb 28, 2012 at 5:27 PM, Kahlil Hodgson
<kahlil.hodgson at dealmax.com.au> wrote:

> Now I start to get I/O errors printed on the console. I run 'mdadm -D
> /dev/md1' and see the array is degraded and /dev/sdb2 has been marked as
> faulty.

I had a problem like this once. In a heterogeneous array of 80GB PATA drives
(it was a while ago), the one WD drive kept dropping out like this. WD's
diagnostic tool showed a problem, so I RMA'ed the drive... only to discover
that the replacement did the same thing on that system, but checked out just
fine on a different system.

It turned out to be a combination of a power supply with less-than-stellar
regulation (go Enermax...) and a WD drive that was particularly sensitive to
it; nothing else in the system seemed to be affected. Replacing the power
supply finally eliminated the issue.

--ln
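P.S. If the power supply is a suspect, lm_sensors can at least show whether the
12V and 5V rails sag under load (a rough check only; assumes the motherboard's
sensor chip is supported by the kernel):

  yum install lm_sensors
  sensors-detect    # answer the prompts, then load the suggested modules
  sensors           # watch the voltage lines while a resync is running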
On 02/28/2012 04:27 PM, Kahlil Hodgson wrote:

> Having a problem with software RAID that is driving me crazy.
<snip>
> Any ideas what is going on here? If it's bad drives, I really need some
> confirmation independent of the software RAID failing.

I just had a very similar problem with a RAID 10 array with four new 1TB
drives. It turned out to be the SATA cable.
I first tried a new drive and even replaced the five-disk hot-plug carrier. It
was always the same logical drive (/dev/sdb). I then tried using an additional
SATA adapter card. That cinched it, as the only thing common to all of the
above was the SATA cable. All has been well for a week now.

I should have tried replacing the cable first :-)

Emmett
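P.S. A bad or marginal cable usually leaves a fingerprint in SMART attribute
199, which counts transfer (CRC) errors between the drive and the controller
rather than problems on the platters (a sketch; assumes smartmontools):

  # a non-zero value that keeps climbing points at the cable, connector or port
  smartctl -A /dev/sdb | grep -i UDMA_CRC_Error_Count

  # the kernel tends to log the same events as ICRC/ABRT errors
  dmesg | grep -i icrc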
A few months ago I had an enormous amount of grief trying to understand why a
RAID array in a new server kept getting corrupted and suddenly changing
configuration. After a lot of despair and head scratching it turned out to be
the SATA cables. This was a rack server from Asus with a SATA backplane. The
cables, made by Foxconn, came pre-installed.

After I replaced the SATA cables with new ones, all problems were gone and the
array is now rock solid.

Many SATA cables on the market are pieces of junk, either incapable of coping
with the high frequencies involved in SATA 3Gb/s or 6Gb/s, or with connectors
made of bad-quality plastics unable to keep the necessary pressure on the
contacts. I had already found this problem with desktop machines; I simply
wouldn't believe that such a class of hardware would exhibit it too.

So, I would advise you to replace the SATA cables with good quality ones.

As additional information, I quote from the Caviar Black range datasheet:

"Desktop / Consumer RAID Environments - WD Caviar Black Hard Drives are tested
and recommended for use in consumer-type RAID applications (RAID-0 / RAID-1).

Business Critical RAID Environments - WD Caviar Black Hard Drives are not
recommended for and are not warranted for use in RAID environments utilizing
Enterprise HBAs and/or expanders and in multi-bay chassis, as they are not
designed for, nor tested in, these specific types of RAID applications. For all
Business Critical RAID applications, please consider WD's Enterprise Hard
Drives that are specifically designed with RAID-specific, time-limited error
recovery (TLER), are tested extensively in 24x7 RAID applications, and include
features like enhanced RAFF technology and thermal extended burn-in testing."
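P.S. The TLER behaviour the datasheet mentions can be queried (and on some
drives adjusted) through the SCT Error Recovery Control interface; desktop
models often simply refuse the command (a sketch; assumes a reasonably recent
smartmontools):

  # show the current error-recovery timeouts, if the drive supports SCT ERC
  smartctl -l scterc /dev/sda

  # set 7-second read/write timeouts (values are in tenths of a second) so md
  # kicks a slow drive out instead of stalling; many desktop drives reject this,
  # and on many that accept it the setting does not survive a power cycle
  smartctl -l scterc,70,70 /dev/sda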
On 03/01/2012 09:00 AM, Mark Roth wrote:

> Miguel Medalha wrote:
>>
>> A few months ago I had an enormous amount of grief trying to understand
>> why a RAID array in a new server kept getting corrupted and suddenly
>> changing configuration. After a lot of despair and head scratching it
>> turned out to be the SATA cables.
>> <snip>
>> After I replaced the SATA cables with new ones, all problems were gone
>> and the array is now rock solid.
>
> Thanks for this info, Miguel.
> <snip>
>> As additional information, I quote from the Caviar Black range datasheet:
>> <snip>
>
> Wonderful... NOT. We've got a number of Caviar Greens, so I looked up their
> datasheet... and it says the same.
>
> That rebuild of my system at home? I think I'll look at commercial-grade
> drives....
>
> mark

Interesting thread...

I have had problems with SATA cables in the past, and prefer those with the
little metal latches. The problem is that you can't easily tell by looking at
the connectors whether or not they're flaky.

I've had positive experience with Caviar Black and Scorpio Black drives. The WD
Green and Blue drives are built more cheaply than the Blacks (which have close
to enterprise-grade construction). The dealer I buy drives from has told me
that the Blacks have far lower return/defect rates. Of the approximately 30 2TB
Blacks I have in RAID-6 service, I've only experienced two failures, which were
handled quickly by the WD warranty program. It's interesting to note that while
all the drive manufacturers are going back to 1- or 2-year warranties, the WD
Black series remains at 5 years.

A friend of mine has had a couple of strange problems with the RE (RAID) series
of Caviars, which use the same mechanics as the non-RE Blacks. For software
RAID, I would recommend that you stick with the non-RE versions because of
differences in the firmware.

It has come down to me buying *only* WD Black-series drives and nothing else.
If I could afford them, I'd consider enterprise-grade drives. Having said that,
I have a pair of 1TB Green drives in RAID-1 for the TimeMachine backups on my
Mac, and they've been spinning 24x7 non-stop for 3 years without failure. I'm
almost afraid to switch them off.

Now, if WD can just get their post-flood production back in gear so prices can
drop.

My 2c, FWIW :-)

Chuck