I understand that ZFS provides fault data, via the zpool status command, that would (probably?) allow a knowledgeable[0] ZFS admin to determine manually when a ZFS hardware element (today that's usually called a "disk drive") might need maintenance or might need to be replaced. And there are sufficient features/facilities in the current (b36+) release of ZFS to allow that "faulty" element to be replaced.

But what I'm curious about is the ZFS team's plan to support automatic sparing and automatic substitution/replacement of faulty ZFS elements with pre-allocated ZFS "spares".

Many hardware RAID controllers support the concept of 'spares'. Some allow the spare to be powered down - so that the spare does not fail just as the drive it is intended to replace also fails! Whoops - the wonders of MTBF and the statistical likelihood that identical drives, with identical runtimes and MTBF specs, operated in an identical runtime environment, are likely to fail within a short timeframe of one another.

[0] Unfortunately, the "supply" of knowledgeable and talented admins in various technical disciplines seems (to me) to be deteriorating, rather than increasing, over time. Why is that? Some possibilities:

- Perhaps due to the trend to outsource various computer admin functions to less experienced, lower-cost labor pools.

- Perhaps because technical management does not understand or appreciate the required skillsets and is simply not prepared to pay for them.

- Perhaps because of the theory that modern hardware is so reliable that human involvement is simply unnecessary, and that the supplier of said hardware has already "engineered out" the expected failure modes.

- Perhaps because the current trend is that most companies are simply not willing to pay the ongoing costs to keep their computer personnel trained and equipped with the technical skillsets that would afford the company a technical advantage.

- Perhaps because management has been conned into spending their entire IT/computer budget "unwisely" on technology which promised to solve all their reliability/uptime issues. Yep - they blew the budget on the "silver bullet", as skillfully presented by adept "name-brand" company sales droids. Can you spell "SAP"?!

Summary: What is the ZFS sparing policy vision, and how can it best be implemented? Should it be part of ZFS, or should it be an independent software suite that is ZFS-aware but customizable by the individual ZFS user? If it's customizable - how much customization can one allow before it becomes totally ineffective through mis-configuration by less-than-stellar ZFS admins?

Technical curiosity is a disease - and I'm afflicted with it! :)

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
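For context, the manual fault-and-replace workflow described above looks roughly like the following. This is a sketch only; the pool name "tank" and the device names are hypothetical.

```shell
# Report overall pool health; -x limits output to pools with problems.
zpool status -x

# Detailed status for a hypothetical pool "tank": shows per-device state
# (ONLINE/DEGRADED/FAULTED) and read/write/checksum error counters that
# an admin would inspect to decide whether a drive needs replacing.
zpool status -v tank

# Manually replace the suspect device with a new one; ZFS then
# resilvers the data onto the replacement automatically.
# c1t2d0 = failing device, c1t3d0 = replacement (hypothetical names).
zpool replace tank c1t2d0 c1t3d0
```

The open question in the post is precisely how much of this loop (diagnose, decide, replace) should happen without the admin in the middle.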
Eric Schrock
2006-Mar-30 04:43 UTC
[zfs-discuss] disk drive sparing - questions not answers
I'd wait about 24 hours. I've finished up the prototype and the team is reviewing the proposed interfaces. I'll forward the proposal to a larger audience once it's passed some internal scrutiny.

- Eric

On Wed, Mar 29, 2006 at 08:20:27PM -0600, Al Hopper wrote:
> [...]

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
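For reference, the hot-spare support being prototyped here eventually surfaced in ZFS along these lines. A hedged sketch, not the exact proposal under review in this thread; pool and device names are hypothetical.

```shell
# Create a pool with a designated hot spare; if a device faults, the
# fault-management machinery can activate the spare automatically.
zpool create tank mirror c1t0d0 c1t1d0 spare c2t0d0

# Add another spare to an existing pool.
zpool add tank spare c2t1d0

# Optional: with autoreplace=on, a new disk inserted in the same
# physical location as a failed one is brought into service
# automatically, without an explicit "zpool replace".
zpool set autoreplace=on tank
```

This lands on the "part of ZFS" side of the question posed above: the sparing policy is built into the pool itself, with only coarse knobs (the spare list and pool properties) exposed to the admin.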