Roy Sigurd Karlsbakk
2011-Aug-29 21:07 UTC
[zfs-discuss] BAD WD drives - defective by design?
Hi all

It seems recent WD drives that aren't "RAID Edition" can cause rather a lot of problems on RAID systems. We have a few machines with LSI controllers (6801/6081/9201) and we're seeing massive errors occurring. The usual pattern is a drive failing, or even a resilver/scrub starting, and then, suddenly, most drives on the whole backplane report errors. These are usually "No Device" errors (as reported by iostat -En), but the result is that we may see data corruption in the end. We also have a system set up with Hitachi Deskstars, which has been running for almost a year without issues. One system with a mixture of WD Blacks and Greens showed the same errors as described, but has been working well since the WD drives were replaced by Deskstars.

Now, it seems WD has changed their firmware to inhibit people from using the drives for anything other than toys (read: PCs etc). Since we've seen this issue on different controllers and different drives, and can't reproduce it with Hitachi Deskstars, I would guess the firmware "upgrade" from WD is the issue.

Would it be possible to fix this in ZFS somehow? The drives seem to work well except for those "No Device" errors....

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy, it is essential that the curriculum is presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
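The "No Device" counts mentioned above come from the per-device error summary that iostat -En prints. A minimal Python sketch, assuming the usual "Name: value" layout of that output, for picking out the "No Device" counter of each device. The sample text and the function name no_device_counts are illustrative only and are not taken from the affected machines:

```python
import re

# Illustrative sample only, shaped like `iostat -En` output; not real data.
SAMPLE = """\
c4t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: EXAMPLE DISK A   Revision: 0001 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c4t1d0           Soft Errors: 0 Hard Errors: 12 Transport Errors: 3
Vendor: ATA      Product: EXAMPLE DISK B   Revision: 0001 Serial No:
Size: 2000.40GB <2000398934016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 12 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
"""

# A device record starts with the device name followed by "Soft Errors:".
DEVICE_RE = re.compile(r"^(\S+)\s+Soft Errors:", re.MULTILINE)
NO_DEVICE_RE = re.compile(r"No Device:\s*(\d+)")


def no_device_counts(text):
    """Map device name -> "No Device" counter parsed from iostat -En text."""
    counts = {}
    starts = list(DEVICE_RE.finditer(text))
    for i, match in enumerate(starts):
        # A record runs from this device line up to the next device line.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        record = text[match.start():end]
        found = NO_DEVICE_RE.search(record)
        if found:
            counts[match.group(1)] = int(found.group(1))
    return counts


if __name__ == "__main__":
    for device, count in no_device_counts(SAMPLE).items():
        if count:
            print(f"{device}: No Device = {count}")
```

On a live system the text could instead come from subprocess.run(["iostat", "-En"], capture_output=True, text=True).stdout. Comparing the counters before and after a scrub would show whether the errors spread across the backplane as described.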
Rich
2011-Aug-29 21:10 UTC
[zfs-discuss] [OpenIndiana-discuss] BAD WD drives - defective by design?
The drives are attached to a backplane?

Try using 4k sector sizes and see if that improves it - I've seen and been part of a number of discussions which involved this - and you, I think, actually.

- Rich
Roy Sigurd Karlsbakk
2011-Aug-29 21:15 UTC
[zfs-discuss] [OpenIndiana-discuss] BAD WD drives - defective by design?
All the drives have 512-byte sectors. Both the WD FASS (Blacks) and the WD EADS (Greens) use plain old 512-byte sectors.

Best regards

roy
Roy Sigurd Karlsbakk
2011-Aug-29 21:17 UTC
[zfs-discuss] [OpenIndiana-discuss] BAD WD drives - defective by design?
And, yes, they're connected to an LSI SAS expander from Super Micro. Works well with Seagate and Hitachi, but not with WD....

Best regards

roy
On Aug 29, 2011, at 2:07 PM, Roy Sigurd Karlsbakk wrote:
> It seems recent WD drives that aren't "RAID Edition" can cause rather a lot of problems on RAID systems. We have a few machines with LSI controllers (6801/6081/9201) and we're seeing massive errors occurring. The usual pattern is a drive failing, or even a resilver/scrub starting, and then, suddenly, most drives on the whole backplane report errors. These are usually "No Device" errors (as reported by iostat -En), but the result is that we may see data corruption in the end. We also have a system set up with Hitachi Deskstars, which has been running for almost a year without issues. One system with a mixture of WD Blacks and Greens showed the same errors as described, but has been working well since the WD drives were replaced by Deskstars.

Sounds familiar.

> Now, it seems WD has changed their firmware to inhibit people from using the drives for anything other than toys (read: PCs etc). Since we've seen this issue on different controllers and different drives, and can't reproduce it with Hitachi Deskstars, I would guess the firmware "upgrade" from WD is the issue.

Likely.

> Would it be possible to fix this in ZFS somehow?

No, the error is 1-2 layers below ZFS.

> The drives seem to work well except for those "No Device" errors....

'nuff said.

 -- richard
http://wdc.custhelp.com/app/answers/detail/a_id/1397/~/difference-between-desktop-edition-and-raid-%28enterprise%29-edition-drives
Roy Sigurd Karlsbakk
2011-Sep-07 08:36 UTC
[zfs-discuss] BAD WD drives - defective by design?
> http://wdc.custhelp.com/app/answers/detail/a_id/1397/~/difference-between-desktop-edition-and-raid-%28enterprise%29-edition-drives

"When an error is found on a desktop edition hard drive, the drive will enter into a deep recovery cycle to attempt to repair the error, recover the data from the problematic area, and then reallocate a dedicated area to replace the problematic area. This process can take up to 2 minutes depending on the severity of the issue...."

Or in other words: "When an error occurs on a desktop drive, the drive will refuse to realize the sector is bad, and retry forever and ever without even increasing SMART counters, so that even if Western Digital Data LifeGuard Diagnostics needs to spend 36 hours to test a drive (as opposed to the normal 5-ish hours for a 2 TB 7200 rpm drive), WD will refuse return of the drive because IT WORKS".

Or in yet other words: "Desktop drives aren't meant to be used for anything productive or important."

Best regards

roy
On 07 September, 2011 - Roy Sigurd Karlsbakk sent me these 2,0K bytes:

> > http://wdc.custhelp.com/app/answers/detail/a_id/1397/~/difference-between-desktop-edition-and-raid-%28enterprise%29-edition-drives
>
> "When an error is found on a desktop edition hard drive, the drive will enter into a deep recovery cycle to attempt to repair the error, recover the data from the problematic area, and then reallocate a dedicated area to replace the problematic area. This process can take up to 2 minutes depending on the severity of the issue...."
>
> Or in other words: "When an error occurs on a desktop drive, the drive will refuse to realize the sector is bad, and retry forever ...."

The common use for desktop drives is having a single disk without redundancy. If a sector is feeling bad, it's better if the drive tries a bit harder to recover it than just say "blah, there was a bit of dirt in the corner.. I don't feel like looking at it, so I'll just say your data is screwed instead". In a RAID setup, that data is sitting safe(?) on some other disk as well, so the drive might as well give up early.

So don't use desktop drives in RAID and don't use RAID disks in a desktop setup. Of course, this is just a config setting - but it's still reality.

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Roy Sigurd Karlsbakk
2011-Sep-07 09:05 UTC
[zfs-discuss] BAD WD drives - defective by design?
> The common use for desktop drives is having a single disk without
> redundancy. If a sector is feeling bad, it's better if the drive tries a bit
> harder to recover it than just say "blah, there was a bit of dirt in
> the corner.. I don't feel like looking at it, so I'll just say your data
> is screwed instead". In a RAID setup, that data is sitting safe(?) on
> some other disk as well, so the drive might as well give up early.

Still, there's a wee difference between shaving and cutting your head off. A drive retrying a single sector for two whole minutes is nonsense, even on a desktop or laptop, at least when it does so without logging the error to SMART or summing up the issues so as to flag the disk unusable. And, believe it or not, a drive spending 2 minutes trying to fetch 512 bytes from a dead sector is quite unusable when the number of bad sectors starts climbing.

Best regards

roy
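To put rough numbers on that last point, here is a back-of-the-envelope sketch in Python. It assumes the worst case from WD's article (every read of a bad sector stalls for the full 2 minutes) and no extra retries by the host; the function name and the sector counts are illustrative:

```python
RECOVERY_SECONDS = 120  # "up to 2 minutes" per WD's article; worst case assumed


def stall_hours(bad_sectors, seconds_per_sector=RECOVERY_SECONDS):
    """Hours a full-disk read spends stalled on bad sectors (worst case)."""
    return bad_sectors * seconds_per_sector / 3600


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:5d} bad sectors -> {stall_hours(n):6.2f} hours stalled")
```

Under these assumptions, roughly 930 such sectors would account for the gap between a 36-hour and a 5-hour diagnostic run: 31 hours x 3600 / 120 = 930.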
On Sep 7, 2011, at 2:05 AM, Roy Sigurd Karlsbakk wrote:
>> The common use for desktop drives is having a single disk without
>> redundancy. If a sector is feeling bad, it's better if the drive tries a bit
>> harder to recover it than just say "blah, there was a bit of dirt in
>> the corner.. I don't feel like looking at it, so I'll just say your data
>> is screwed instead". In a RAID setup, that data is sitting safe(?) on
>> some other disk as well, so the drive might as well give up early.
>
> Still, there's a wee difference between shaving and cutting your head off.

Today, it is in the best interest of the suppliers to do this. They can show concrete product differentiation to support increased margins. Business 101.

> A drive retrying a single sector for two whole minutes is nonsense, even on a desktop
> or laptop, at least when it does so without logging the error to SMART or summing up
> the issues so as to flag the disk unusable. And, believe it or not, a drive spending 2 minutes
> trying to fetch 512 bytes from a dead sector is quite unusable when the number of
> bad sectors starts climbing.

Yes, but that is the current state of the market, and this change has become more pronounced in the past few generations. Experienced systems architects know this and design accordingly. The disk vendors provide product roadmaps so you can plan for the future (Seagate's is quite good reading :-)

 -- richard
On 09/ 9/11 06:40 AM, Richard Elling wrote:
> On Sep 7, 2011, at 2:05 AM, Roy Sigurd Karlsbakk wrote:
>> A drive retrying a single sector for two whole minutes is nonsense, even on a desktop
>> or laptop, at least when it does so without logging the error to SMART or summing up
>> the issues so as to flag the disk unusable. And, believe it or not, a drive spending 2 minutes
>> trying to fetch 512 bytes from a dead sector is quite unusable when the number of
>> bad sectors starts climbing.
> Yes, but that is the current state of the market, and this change has become more pronounced
> in the past few generations. Experienced systems architects know this and design accordingly.

I think I've fallen victim to this one!

I had a pair of older WD Black drives in a test system in a hostile environment (my garage!) for 18 months without any issues, so I recently replaced the remaining 8 drives in the pool with the current model. I now see these warnings in my logs:

    scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci15d9,400@0 (mpt_sas0):
    mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120303

Now whether that's due to a change of controller, or one of the new drives wandering off into la-la land, I'm not sure.

> The disk vendors provide product roadmaps so you can plan for the future (Seagate's is quite
> good reading :-)

Indeed. I'll have to update my reading habits!

--
Ian.
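The two hex values in that warning can be split into fields. A minimal Python sketch, assuming the bit layout used by the Linux LSI MPT drivers (bus type in the top 4 bits, then a 4-bit originator, an 8-bit code and a 16-bit sub-code) also applies to the value the mpt_sas driver logs here. The function names are illustrative, and the meaning of the specific code and sub-code is not decoded:

```python
import re

LOG_LINE = "mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31120303"

# Assumed from the Linux MPT drivers: names for the originator field.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}
# Assumed from the MPI headers: IOCStatus flag meaning a log-info word is attached.
LOG_INFO_AVAILABLE = 0x8000

PAIR_RE = re.compile(r"IOCStatus=(0x[0-9a-fA-F]+), IOCLogInfo=(0x[0-9a-fA-F]+)")


def parse_line(line):
    """Extract IOCStatus and IOCLogInfo as integers from a log line."""
    match = PAIR_RE.search(line)
    if match is None:
        raise ValueError("no IOCStatus/IOCLogInfo pair found")
    return int(match.group(1), 16), int(match.group(2), 16)


def split_loginfo(loginfo):
    """Split a 32-bit log-info word into its assumed bit fields."""
    return {
        "bus_type": (loginfo >> 28) & 0xF,  # 3 means SAS in those drivers
        "originator": ORIGINATORS.get((loginfo >> 24) & 0xF, "unknown"),
        "code": (loginfo >> 16) & 0xFF,
        "subcode": loginfo & 0xFFFF,
    }


if __name__ == "__main__":
    status, loginfo = parse_line(LOG_LINE)
    print("log info attached:", bool(status & LOG_INFO_AVAILABLE))
    for name, value in split_loginfo(loginfo).items():
        shown = value if isinstance(value, str) else hex(value)
        print(f"{name}: {shown}")
```

For the line above this gives bus type 3, originator PL, code 0x12 and sub-code 0x303. That does not by itself say whether a drive or the controller is at fault.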