I'm curious to know what other people are running for HDs in white box systems? I'm currently looking at Seagate Barracudas and Hitachi Deskstars. I'm looking at the 1TB models. These will be attached to an LSI expander in an SC847E2 chassis driven by an LSI 9211-8i HBA. This system will be used as a large storage array for backups and archiving.

Thanks,
Jordan
Jordan McQuown wrote:
> I'm curious to know what other people are running for HDs in white box
> systems? I'm currently looking at Seagate Barracudas and Hitachi
> Deskstars. I'm looking at the 1TB models. These will be attached to an
> LSI expander in an SC847E2 chassis driven by an LSI 9211-8i HBA. This
> system will be used as a large storage array for backups and archiving.

I wouldn't recommend using desktop drives in a server RAID. They can't handle the vibration levels present in a server. I'd recommend at least the Seagate Constellation or the Hitachi Ultrastar, though I haven't tested the Deskstar myself.

--Arne
Arne Jansen wrote:
> Jordan McQuown wrote:
>> I'm curious to know what other people are running for HDs in white
>> box systems? I'm currently looking at Seagate Barracudas and Hitachi
>> Deskstars. I'm looking at the 1TB models. These will be attached to
>> an LSI expander in an SC847E2 chassis driven by an LSI 9211-8i HBA.
>> This system will be used as a large storage array for backups and
>> archiving.
>
> I wouldn't recommend using desktop drives in a server RAID. They can't
> handle the vibration levels present in a server. I'd recommend
> at least the Seagate Constellation or the Hitachi Ultrastar, though I
> haven't tested the Deskstar myself.

I've been using a couple of 1TB Hitachi Ultrastars for about a year with no problem. I don't think mine are still available, but I expect they have something equivalent.

The pool is scrubbed 3 times a week, which takes nearly 19 hours now and hammers the heads quite hard. I keep meaning to reduce the scrub frequency now it's getting to take so long, but haven't got around to it. What I really want is pause/resume scrub, and the ability to trigger the pause/resume from the screensaver (or something similar).

-- 
Andrew Gabriel
On Fri, Jul 16 at 18:32, Jordan McQuown wrote:
> I'm curious to know what other people are running for HDs in white box
> systems? I'm currently looking at Seagate Barracudas and Hitachi
> Deskstars. I'm looking at the 1TB models. These will be attached to an LSI
> expander in an SC847E2 chassis driven by an LSI 9211-8i HBA. This system
> will be used as a large storage array for backups and archiving.

Dell shipped us WD RE3 drives in the server we bought from them. They've been working great and come in a 1TB size. Not sure about the expander, but they talk just fine to the 9211 HBAs.

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
On Fri, Jul 16, 2010 at 11:32 AM, Jordan McQuown <jcm at larsondesigngroup.com> wrote:
> I'm curious to know what other people are running for HDs in white box
> systems? I'm currently looking at Seagate Barracudas and Hitachi Deskstars.
> I'm looking at the 1TB models. These will be attached to an LSI expander in
> an SC847E2 chassis driven by an LSI 9211-8i HBA. This system will be used as
> a large storage array for backups and archiving.

Some of the Deskstars are qualified to run in a raid configuration, but not all. The E7K1000 is, but the 7K1000, 7K1000.B, 7K1000.C and 7K2000 are not. Curiously, many of the drives are recommended for "video editing arrays", and the 7K2000 and A7K2000 share the same specifications, including vibration tolerance. The only difference appears to be the warranty and error rate.

I would not suggest using consumer drives from WD or Seagate for a large array. Recent versions no longer support enabling TLER or ERC. To the best of my knowledge, Samsung and Hitachi drives all support CCTL, which is yet another name for the same thing.

-B

-- 
Brandon High : bhigh at freaks.com
>>>>> "bh" == Brandon High <bhigh at freaks.com> writes:

    bh> Recent versions no longer support enabling TLER or ERC. To
    bh> the best of my knowledge, Samsung and Hitachi drives all
    bh> support CCTL, which is yet another name for the same thing.

Once again, I have to ask: has anyone actually found these features to make a verified positive difference with ZFS? Some of those things you cannot even set on Solaris, because the channel to the drive with an LSI controller isn't sufficiently transparent to support smartctl, and the settings don't survive reboots. Brandon, have you actually set it yourself, or are you just aggregating forum discussion?

The experience so far that I've read here has been:

 * if a drive goes bad completely:

   + ZFS will mark the drive unavailable after a delay that depends on
     the controller you're using, but with lengths like 60 seconds, 180
     seconds, 2 hours, or forever. The delay is not sane or reasonable
     with all controllers, and even if redundancy is available ZFS will
     patiently wait for the controller. The delay depends on the
     controller driver; it's part of the Solaris code. Best case, the
     zpool will freeze until the delay is up, but there are application
     timeouts and iSCSI initiator-target timeouts, too---getting the
     equivalent of an NFS hard mount is hard these days (even with NFS,
     in some people's experience).

   + the delay is different if the system's running when the drive
     fails, or if it's trying to boot up. For example, iSCSI will
     ``patiently wait'' forever for a drive to appear while booting up,
     but will notice after 180 seconds while running.

   + because the disk is completely bad, TLER, ERC, CCTL, whatever you
     call it, doesn't apply. The drive might not answer commands ever,
     at all. The timer is not in the drive: the drive is bad starting
     now, continuing forever.
 * if a drive goes partially bad (large and increasing numbers of
   latent sector errors, which for me happens more often than
   bad-completely):

   + the zpool becomes unusably slow

   + it stays unusably slow until you use 'iostat' or 'fmdump' to find
     the marginal drive and offline it

   + TLER, ERC, CCTL makes the slowness factor 7ms : 7000ms vs
     7ms : 30000ms. In other words, it's unusably slow with or without
     the feature.

AFAICT the feature is useful as a workaround for buggy RAID card firmware and nothing else. It's a cost differentiator, and you're swallowing it hook, line and sinker.

If you know otherwise please reinform me, but the discussion here so far doesn't match what I've learned about ZFS and Solaris exception handling.

That said, to reword Don Marti, ``uninformed Western Digital bashing is better than no Western Digital bashing at all.''
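The ``look at iostat for service times'' hunt can be done mechanically. A toy sketch, with the caveat that the column layout is assumed from Solaris `iostat -xn` (asvc_t as the eighth column) and the numbers are invented; it just flags any device whose average service time is an order of magnitude above the pool median:

```python
import statistics

# Hypothetical 'iostat -xn'-style snapshot. Columns assumed:
# r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
SAMPLE = """\
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  120.0   40.0 1500.0  600.0  0.0  0.5    0.1    6.8   0   8 c0t0d0
  118.0   41.0 1480.0  610.0  0.0  0.4    0.1    7.2   0   9 c0t1d0
    3.0    2.0   40.0   20.0  0.9  4.0  250.0 7100.0  90  99 c0t2d0
"""

def find_slow_disks(iostat_text, factor=10.0):
    """Flag devices whose asvc_t is `factor`x above the pool median."""
    rows = []
    for line in iostat_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        rows.append((fields[-1], float(fields[7])))  # (device, asvc_t)
    median = statistics.median(t for _, t in rows)
    return [dev for dev, t in rows if t > factor * median]

print(find_slow_disks(SAMPLE))  # the marginal drive stands out
```

In practice you would watch several consecutive intervals, since a single busy sample can spike asvc_t on a healthy disk.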
On Thu, Jul 22, 2010 at 11:14 AM, Miles Nordin <carton at ivy.net> wrote:
> reboots. Brandon, have you actually set it yourself, or are you just
> aggregating forum discussion?

I'm using an older revision of WD10EADS drives that allow TLER to be enabled via WDTLER.EXE. I have not had a drive fail in this environment, so I can't speak from personal experience. I'm basing my statement on what I've read in the product specs from the manufacturer and what I've heard about newer revisions of the drives.

> AFAICT the feature is useful as a workaround for buggy RAID card
> firmware and nothing else. It's a cost differentiator, and you're
> swallowing it hook, line and sinker.

ERC is part of the ATA-8 spec. WD and Seagate fail to recognize the command on their desktop drives. Hitachi and Samsung implement it.

> If you know otherwise please reinform me, but the discussion here so
> far doesn't match what I've learned about ZFS and Solaris exception
> handling.

The idea of ERC is to return an error prior to the timeout. With 60-second timeouts and 5 retries, it could conceivably take 5 minutes for a bad read to fail past the scsi driver. For those 5 minutes, you'll see horrible performance. If the drive returns an error within 7-10 seconds, it would only take 35-50 seconds to fail.

ERC allows you to fast-fail with the assumption that you'll correct the error at a higher level. This is true of HW raid cards that offline a disk that is slow to respond, as well as ZFS and other software raid mechanisms. The difference is that a fast fail with ZFS relies on ZFS to fix the problem rather than degrading the array.

-B

-- 
Brandon High : bhigh at freaks.com
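The arithmetic here is just per-attempt timeout times retry count, worst case being every retry grinding to the full timeout. A trivial sketch (the 60 s driver timeout and 5 retries are the figures from the paragraph above, not values read from any real sd.conf):

```python
def worst_case_fail_seconds(per_try_timeout_s, retries):
    """Worst-case time for one bad read to fail past the driver:
    every attempt runs to the full timeout before the next is issued."""
    return per_try_timeout_s * retries

# Without ERC: each attempt grinds to the 60 s driver timeout, 5 tries.
print(worst_case_fail_seconds(60, 5))   # 300 s, i.e. ~5 minutes
# With ERC capping each attempt at 7-10 s:
print(worst_case_fail_seconds(7, 5))    # 35 s
print(worst_case_fail_seconds(10, 5))   # 50 s
```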
>>>>> "bh" == Brandon High <bhigh at freaks.com> writes:

    bh> For those 5 minutes, you'll see horrible performance. If the
    bh> drive returns an error within 7-10 seconds, it would only take
    bh> 35-50 seconds to fail.

For those 1-5 minutes, AIUI you see NO performance, not bad performance. And pools other than the one containing the failing drive may be frozen as well, e.g. for NFS client mounts.

But if it were just the difference between a 5-minute freeze when a drive fails and a 1-minute freeze when a drive fails, I don't see that anyone would care---both are bad enough to invoke upper-layer application timeouts of iSCSI connections and load balancers, but not disastrous.

But it's not. ZFS doesn't immediately offline the drive after 1 read error. Some people find it doesn't offline the drive at all, until they notice which drive is taking multiple seconds to complete commands and offline it manually. So you have 1-5 minute freezes several times a day, every time the slowly-failing drive hits a latent sector error.

I'm saying the works:notworks comparison is not between TLER-broken and non-TLER-broken. I think the TLER fans are taking advantage of people's binary debating bias to imply that TLER is the ``works OK'' case and non-TLER is ``broken: dont u see it's 5x slower.'' There are three cases to compare for any given failure mode: TLER-failed, non-TLER-failed, and working. The proper comparison is therefore between a successful read (7ms) and an unsuccessful read (7000ms * <n> cargo-cult retries put into various parts of the stack to work around some scar someone has on their knee from some weird thing an FC switch once did in 1999).

The unsuccessful read is thousands of times slower than normal performance. It doesn't make your array seem 5x slower during the fail like the false TLER vs non-TLER comparison makes it seem. It makes your array seem entirely frozen. The actual speed doesn't matter: it's FROZEN.
Having TLER does not make FROZEN any faster than FROZEN.

The story here sounds great, so I can see why it spreads so well: ``during drive failures, the array drags performance a little, maybe 5x, until you locate the drive and replace it. However, if you have used +1 MAGICAL DRIVES OF RECKONING, the dragging is much reduced! Unfortunately +1 magical drives are only appropriate for ENTERPRISE use, while at home we use non-magic drives, but you get what you pay for.'' That all sounds fair, reasonable, and like good fun gameplay. Unfortunately ZFS isn't a video game: it just fucking freezes.

    bh> The difference is that a fast fail with ZFS relies on ZFS to
    bh> fix the problem rather than degrading the array.

OK, but the decision of ``degrading the array'' means ``not sending commands to the slowly-failing drive any more'', which is actually the correct decision, the wrong course being to continue sending commands there and ``patiently waiting'' for them to fail instead of re-issuing them to redundant drives, even when waiting thousands of standard deviations outside the mean request time. TLER or not, a failing drive will poison the array by making reads thousands of times slower.

And ZFS or HW, fail or degrade, the problem is still fixed for the upper layers. You make it sound like ``degrading the array'' means the upper layers got an error from the HW controller and got good data from ZFS. Not so. If anything, the thread above ZFS gave up waiting on read() for ``fixed'' data to come back and got killed by a request timeout, or the user pressed ^Z^Z^C^C^C^C^C^\^\ and ran pkill -9 vi.

If the disk manufacturers could find a way to make all errors return in 7 seconds (to reduce the number of HW RAID ``degraded'' marks leading to warranty returns), but still charge people double for drives that have some silly feature they think they need, I bet they'd do it.
The only real problem we've got is the one we always had: the Solaris storage stack and vdev layer don't handle slowly-failing drives with any reasonable grace, and this is how most drives fail.

Now suppose they built a drive with a ``streaming'' mode:

 * with the ``streaming'' jumper in place, the drive starts spun down.

 * the drive must be sent a magical ENABLE command, otherwise it
   returns failure to everything. Once the magical command is sent,
   the rules below apply.

 * the first read must be LBA 0. If so, it spins up the drive.

 * the drive's head now ignores you, and reads from one end of the
   disk to the other, dumping data into the on-disk cache.

 * if you issue a read that's a higher LBA than the head's current
   position, then your read WAITS for the head to pass that position.
   This is the only time any command waits on mechanics.

 * if you issue a read that's in the cache, it returns data from the
   cache and ``success''.

 * if you issue a read of a lower LBA than the head, and the data is
   not in the cache, then the read immediately returns ``failure''.

 * no writes.

I might pay extra for that feature, but it's more a desktop-grade feature for recovering data. If the drive's in an array, may as well send it back the first time it reports an error and let them deal with it while I resilver. The only question is FINDING the bad drive, which doesn't seem easier with or without TLER: either way you wait until you find your pool freezing now and then, and then you look at 'iostat' for service times that are a thousand times too high.
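The rules of this hypothetical streaming mode can be sketched as a toy state machine (the drive is imaginary, so everything here is made up; the on-disk cache is modeled as a simple bounded FIFO of recently-passed LBAs, and since the mode allows no writes, none are modeled):

```python
class StreamingDrive:
    """Toy model of the hypothetical one-pass ``streaming'' recovery
    mode: the head sweeps LBA 0..n once, and each read either waits
    for the sweep, hits the bounded cache, or fails immediately."""

    def __init__(self, nblocks, cache_blocks):
        self.nblocks = nblocks
        self.cache_blocks = cache_blocks
        self.enabled = False
        self.head = -1      # head position; -1 = still spun down
        self.cache = []     # LBAs currently held in the on-disk cache

    def enable(self):
        self.enabled = True  # the magical ENABLE command

    def _sweep_to(self, lba):
        # The head only moves forward, filling the cache as it passes.
        while self.head < lba:
            self.head += 1
            self.cache.append(self.head)
            if len(self.cache) > self.cache_blocks:
                self.cache.pop(0)  # oldest data falls out of the cache

    def read(self, lba):
        if not self.enabled or lba >= self.nblocks:
            return "failure"   # everything fails before ENABLE
        if self.head < 0 and lba != 0:
            return "failure"   # the first read must be LBA 0
        if lba > self.head:
            self._sweep_to(lba)  # WAIT for the head to pass this LBA
            return "success"
        # Behind the head: cache hit or immediate failure, no seeking.
        return "success" if lba in self.cache else "failure"
```

A pass with `ddrescue`-style sequential imaging would read LBAs in ascending order and so never hit the immediate-failure case; random access behind the head is what the mode deliberately refuses to serve.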
> But if it were just the difference between a 5-minute freeze when a
> drive fails and a 1-minute freeze when a drive fails, I don't see
> that anyone would care---both are bad enough to invoke upper-layer
> application timeouts of iSCSI connections and load balancers, but not
> disastrous.
>
> But it's not. ZFS doesn't immediately offline the drive after 1 read
> error. Some people find it doesn't offline the drive at all, until
> they notice which drive is taking multiple seconds to complete
> commands and offline it manually. So you have 1-5 minute freezes
> several times a day, every time the slowly-failing drive hits a
> latent sector error.
>
> I'm saying the works:notworks comparison is not between TLER-broken
> and non-TLER-broken. I think the TLER fans are taking advantage of
> people's binary debating bias to imply that TLER is the ``works OK''
> case and non-TLER is ``broken: dont u see it's 5x slower.'' There are
> three cases to compare for any given failure mode: TLER-failed,
> non-TLER-failed, and working. The proper comparison is therefore
> between a successful read (7ms) and an unsuccessful read (7000ms *
> <n> cargo-cult retries put into various parts of the stack to work
> around some scar someone has on their knee from some weird thing an
> FC switch once did in 1999).

If you give a drive enough retries on a sector giving a read error, sometimes it can get the data back. I once had a project with an 80GB Maxtor IDE drive that I needed to get all the files off of. One file (a ZIP archive) was sitting over a sector with a read error. I found that I could get what appeared to be partial data from the sector using Ontrack EasyRecovery, but the data read back from the 512-byte sector was slightly different each time. I manually repeated this a few times and got it down to about a few bytes out of the 512 that were different on each re-read attempt.
Looking at those further, I figured it was actually only a few bits of each of those bytes that were different each time, and I could narrow that down as well by looking at the frequency of the results of each read. I knew the ZIP file had a CRC32 code that would match the correct byte sequence, and figured I could write up a brute-force recovery for the remaining bytes.

I didn't end up writing the code to do that, because I found something else: GNU ddrescue. That can image a drive with as many automatic retries as you like, including infinite. I didn't need the drive right away, so I started up ddrescue and let it go after the drive over a whole weekend. There was only one sector on the whole drive that ddrescue was working to recover...the one with the file on it. About two days later it finished reading, and when I mounted the drive image, I was able to open up the ZIP file. The CRC passed, and I had confirmation that the drive had finally, after days of reread attempts, gotten that last sector.

It was really slow, but I had nothing to lose, and just wanted to see what would happen. I've tried it since on other bad sectors with varying results. Sometimes a couple hundred or thousand retries will get a lucky break and recover the sector. Sometimes not.

> The unsuccessful read is thousands of times slower than normal
> performance. It doesn't make your array seem 5x slower during the
> fail like the false TLER vs non-TLER comparison makes it seem. It
> makes your array seem entirely frozen. The actual speed doesn't
> matter: it's FROZEN. Having TLER does not make FROZEN any faster than
> FROZEN.

I agree.

> The story here sounds great, so I can see why it spreads so well:
> ``during drive failures, the array drags performance a little, maybe
> 5x, until you locate the drive and replace it. However, if you have
> used +1 MAGICAL DRIVES OF RECKONING, the dragging is much reduced!
> Unfortunately +1 magical drives are only appropriate for ENTERPRISE
> use, while at home we use non-magic drives, but you get what you pay
> for.'' That all sounds fair, reasonable, and like good fun gameplay.
> Unfortunately ZFS isn't a video game: it just fucking freezes.
>
> bh> The difference is that a fast fail with ZFS relies on ZFS to
> bh> fix the problem rather than degrading the array.
>
> OK, but the decision of ``degrading the array'' means ``not sending
> commands to the slowly-failing drive any more'',
>
> which is actually the correct decision, the wrong course being to
> continue sending commands there and ``patiently waiting'' for them to
> fail instead of re-issuing them to redundant drives, even when
> waiting thousands of standard deviations outside the mean request
> time. TLER or not, a failing drive will poison the array by making
> reads thousands of times slower.

I agree. This is the behavior all RAID-type devices should have, whether hardware or Linux RAID or ZFS. If a drive is slow to respond, stop sending it read commands if there is enough redundancy remaining to compute the data. ZFS should have no problem with this, even though I understand that it needs to read across multiple devices to verify checksums. If you have some number of devices and N levels of redundancy, and your number of still-working devices is equal to or greater than the minimum needed for data integrity, there is no reason to slow down reads (other than the time to compute the data).

The main reasons I can think of to slow down reads, even though there is enough redundancy remaining to be fast, are:

1.) ease of implementation (one less design case)
2.) squeaky-wheel policy (array is slow...figure out why and then fix
    it, rather than limping along and failing completely later)

As for writes, that's more complex, as you have a device that is still halfway alive. Maybe writes just get cached longer until the slow drive gets them onto media.
(But don't block the rest of the system in the meantime.)

> And ZFS or HW, fail or degrade, the problem is still fixed for the
> upper layers. You make it sound like ``degrading the array'' means
> the upper layers got an error from the HW controller and got good
> data from ZFS. Not so. If anything, the thread above ZFS gave up
> waiting on read() for ``fixed'' data to come back and got killed by a
> request timeout, or the user pressed ^Z^Z^C^C^C^C^C^\^\ and ran
> pkill -9 vi.

The filesystem should elegantly tolerate slow drives when they're part of a redundant array.
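The brute-force idea described earlier in the thread--trying every combination of the few still-uncertain bytes until the ZIP member's CRC32 matches--is small enough to sketch. Everything here is a toy: the buffer and corrupted offsets are invented, and a real recovery would first have to parse the expected CRC out of the ZIP member's local file header. It also only stays tractable when repeated marginal reads have narrowed each uncertain byte to a short candidate list, since the search is combinatorial in those lists:

```python
import zlib
from itertools import product

def brute_force_sector(data, uncertain, candidates, expected_crc):
    """Try every combination of candidate values at the uncertain byte
    offsets until the CRC32 of the whole buffer matches; return the
    repaired bytes, or None if no combination checks out."""
    buf = bytearray(data)
    for combo in product(*(candidates[i] for i in uncertain)):
        for offset, value in zip(uncertain, combo):
            buf[offset] = value
        if zlib.crc32(bytes(buf)) & 0xFFFFFFFF == expected_crc:
            return bytes(buf)
    return None

# Toy demonstration: corrupt two bytes of a known buffer, then recover
# them knowing only the original CRC and a short candidate list per
# byte (as if gathered from the frequency of repeated marginal reads).
original = bytes(range(64))
crc = zlib.crc32(original) & 0xFFFFFFFF
damaged = bytearray(original)
damaged[10] ^= 0x04   # a flipped bit, as seen on the marginal sector
damaged[40] ^= 0x81
cands = {10: [damaged[10], original[10]],
         40: [damaged[40], original[40], 0x00]}
recovered = brute_force_sector(bytes(damaged), [10, 40], cands, crc)
assert recovered == original
```

With only a few bytes uncertain and a handful of candidates each, this finishes instantly; with many unconstrained bytes, it would never finish, which is exactly why narrowing by read frequency matters first.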