MC
2007-Aug-28 15:01 UTC
[zfs-discuss] Best way to incorporate disk size tolerance into raidz arrays?
The situation: a raidz array of three 500 GB disks. One disk breaks and you
replace it with a new one. But the new 500 GB disk is slightly smaller than
the smallest disk in the array. I presume the disk would not be accepted into
the array because the zpool replace entry on the zpool man page says "The size
of new_device must be greater than or equal to the minimum size of all the
devices in a mirror or raidz configuration."[1]

I had expected (hoped) that a raidz array with sufficient free space would
downsize itself to accommodate the smaller replacement disk. But I've never
seen that function mentioned anywhere :o)

So I figure the only way to build smaller-than-max-disk-size functionality
into a raidz array is to make a slice on each disk that is slightly smaller
than the max disk size, and then build the array out of those slices. Am I
correct here? If so, is there a downside to using slice(s) instead of whole
disks? The zpool manual says "ZFS can use individual slices or partitions,
though the recommended mode of operation is to use whole disks." ["Virtual
Devices (vdevs)", 1]

[1] http://docs.sun.com/app/docs/doc/819-2240/zpool-1m?a=view
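For concreteness, a minimal sketch of the slice-based approach being asked
about. The pool name, device names, and sizes are hypothetical, and the exact
format(1M) partitioning dialog varies by disk, label type, and release:

    # Label each disk with a slice 0 somewhat smaller than the raw capacity
    # (e.g. leave a GB or two of headroom), using format's partition menu:
    format c1t0d0      # then: partition -> 0 -> enter a smaller size -> label
    format c1t1d0      # repeat with the same slice 0 size
    format c1t2d0

    # Build the raidz out of the undersized slices instead of whole disks:
    zpool create tank raidz c1t0d0s0 c1t1d0s0 c1t2d0s0

    # A later replacement drive only has to hold a slice of that size,
    # not the full capacity of the original disks:
    zpool replace tank c1t2d0s0 c1t3d0s0

The trade-offs of handing ZFS slices rather than whole disks are discussed in
the replies below.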
Marion Hakanson
2007-Aug-28 23:55 UTC
[zfs-discuss] Best way to incorporate disk size tolerance into raidz arrays?
rac at eastlink.ca said:
> The situation: a raidz array of three 500 GB disks. One disk breaks and you
> replace it with a new one. But the new 500 GB disk is slightly smaller than
> the smallest disk in the array.
> . . .
> So I figure the only way to build smaller-than-max-disk-size functionality
> into a raidz array is to make a slice on each disk that is slightly smaller
> than the max disk size, and then build the array out of those slices. Am I
> correct here?

Actually, you can manually adjust the "whole disk" label so it takes up less
than the whole disk. ZFS doesn't seem to notice. One way of doing this is to
create a temporary whole-disk pool on an unlabelled disk, allowing ZFS to set
up its standard EFI label. Then destroy that temporary pool, and use "format"
to adjust the size of slice 0 down to whatever smaller block count you want.
Later "zpool create", "add", or "attach" operations seem to just follow the
existing label, rather than adjust it upwards to the maximum block count that
will fit on the disk.

I'm just reporting what I've observed (Solaris 10 U3); naturally this could
change in future releases, although the current behavior seems like a pretty
safe one.

> If so, is there a downside to using slice(s) instead of whole disks? The
> zpool manual says "ZFS can use individual slices or partitions, though the
> recommended mode of operation is to use whole disks." ["Virtual Devices
> (vdevs)", 1]

The only downside I know of is a potential one: you could get competing uses
of the same spindle if you have more than one slice in use on the same
physical drive at the same time. That can definitely slow things down a lot,
depending on what's going on; ZFS tries to use all the available performance
of the drives it has been configured to use. Note that slicing up a boot
drive, with boot filesystems on part of the disk and a ZFS data pool on the
rest, works just fine, likely because you don't typically see a lot of I/O on
the OS/boot filesystems unless you're short on RAM (in which case things go
slow for other reasons).

Regards,

Marion
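A rough sketch of the label-trimming sequence Marion describes, with a
hypothetical disk c2t4d0 and a throwaway pool name. The behavior relied on
here is only what he reports observing on Solaris 10 U3, not a documented
guarantee:

    # 1. Create (and immediately destroy) a whole-disk pool so ZFS writes
    #    its standard EFI label onto the unlabelled disk:
    zpool create scratch c2t4d0
    zpool destroy scratch

    # 2. In format's partition menu, shrink slice 0 to the desired smaller
    #    block count and write the label back out:
    format c2t4d0
    #    partition> 0       (re-enter slice 0 with a smaller size)
    #    partition> label

    # 3. Per Marion's observation, later whole-disk operations follow the
    #    existing (smaller) label instead of expanding it again (assuming
    #    the other disks were trimmed the same way):
    zpool create tank raidz c2t4d0 c2t5d0 c2t6d0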
Richard Elling
2007-Aug-29 01:02 UTC
[zfs-discuss] Best way to incorporate disk size tolerance into raidz arrays?
MC wrote:
> The situation: a raidz array of three 500 GB disks. One disk breaks and you
> replace it with a new one. But the new 500 GB disk is slightly smaller
> than the smallest disk in the array.

This is quite a problem for RAID arrays, too. It is why vendors use custom
labels for disks. When you have multiple disk vendors, or the disk vendors
change designs, you can end up with slightly different sized disks. So you
tend to use a least common denominator for your custom label.

> I presume the disk would not be accepted into the array because the zpool
> replace entry on the zpool man page says "The size of new_device must be
> greater than or equal to the minimum size of all the devices in a mirror or
> raidz configuration."[1]

Yes.

> I had expected (hoped) that a raidz array with sufficient free space would
> downsize itself to accommodate the smaller replacement disk. But I've never
> seen that function mentioned anywhere :o)

This is the infamous "shrink vdev" RFE.

> So I figure the only way to build smaller-than-max-disk-size functionality
> into a raidz array is to make a slice on each disk that is slightly smaller
> than the max disk size, and then build the array out of those slices. Am I
> correct here?

This is the technique vendors use for RAID arrays.

> If so, is there a downside to using slice(s) instead of whole disks? The
> zpool manual says "ZFS can use individual slices or partitions, though the
> recommended mode of operation is to use whole disks." ["Virtual Devices
> (vdevs)", 1]

The recommended use of whole disks is for drives with volatile write caches
where ZFS will enable the cache if it owns the whole disk. There may be an
RFE lurking here, but it might be tricky to correctly implement to protect
against future data corruptions by non-ZFS use.

-- richard
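For illustration, this is roughly how the size mismatch shows up in practice.
The device names are made up and the exact error text may differ between
releases, so treat the transcript as an assumption rather than verbatim
output:

    # Attempt to swap in a replacement drive that is a few sectors smaller
    # than the failed raidz member:
    zpool replace tank c1t2d0 c1t5d0
    # zpool typically refuses with an error along the lines of:
    #   cannot replace c1t2d0 with c1t5d0: device is too small

    # Comparing raw capacities beforehand (e.g. with iostat -En, which
    # prints each drive's size in bytes) shows the mismatch:
    iostat -En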
Thanks for the comprehensive replies! I'll need some baby speak on this one,
though:

> The recommended use of whole disks is for drives with volatile write caches
> where ZFS will enable the cache if it owns the whole disk. There may be an
> RFE lurking here, but it might be tricky to correctly implement to protect
> against future data corruptions by non-ZFS use.

I don't know what you mean by "drives with volatile write caches", but I'm
dealing with commodity SATA2 drives from WD/Seagate/Hitachi/Samsung.

This disk replacement thing is a pretty common use case, so I think it would
be smart to sort it out while someone cares, and then stick the authoritative
answer into the ZFS wiki. This is what I can contribute without knowing the
answer:

The best way to incorporate abnormal disk size variance tolerance into a
raidz array is BLANK, and it has these BLANK side effects.

Now you guys fill in the BLANKs :P
Richard Elling
2007-Aug-29 15:44 UTC
[zfs-discuss] Best way to incorporate disk size tolerance into raidz arrays?
MC wrote:
> Thanks for the comprehensive replies!
>
> I'll need some baby speak on this one, though:
>
>> The recommended use of whole disks is for drives with volatile write
>> caches where ZFS will enable the cache if it owns the whole disk. There
>> may be an RFE lurking here, but it might be tricky to correctly implement
>> to protect against future data corruptions by non-ZFS use.
>
> I don't know what you mean by "drives with volatile write caches", but I'm
> dealing with commodity SATA2 drives from WD/Seagate/Hitachi/Samsung.

You may see it in the data sheet as "buffer" or "cache buffer" for such
drives. It is usually 8-16 MBytes, with 32 MBytes on newer drives.

> This disk replacement thing is a pretty common use case, so I think it would
> be smart to sort it out while someone cares, and then stick the authoritative
> answer into the ZFS wiki. This is what I can contribute without knowing the
> answer:

The authoritative answer is in the man page for zpool:

    System Administration Commands                         zpool(1M)

    The size of new_device must be greater than or equal to the minimum
    size of all the devices in a mirror or raidz configuration.

> The best way to incorporate abnormal disk size variance tolerance into a
> raidz array is BLANK, and it has these BLANK side effects.

This is a problem for replacement, not creation. For creation the problem
becomes more generic, but it can make use of automation. I've got some
algorithms to do that, but I am not quite ready with a generic solution that
is administrator friendly. In other words, the science isn't difficult; the
automation is.

-- richard
> This is a problem for replacement, not creation.

You're talking about solving the problem in the future? I'm talking about
working around the problem today. :) This isn't a fluffy dream problem: I ran
into it last month when an RMA'd drive wouldn't fit back into a RAID5 array.
RAIDZ is subject to the exact same problem, so I want to find the solution
before making a RAIDZ array.

> The authoritative answer is in the man page for zpool.

You quoted the exact same line that I quoted in my original post. That isn't
a solution. That is the constraint which causes the problem and which the
solution must work around.

The two solutions listed here are slicing, and hacking the "whole disk" label
to be smaller than the whole disk. There is no consensus here on which
solution, if any, should be used. I would like there to be, so I'll leave the
original question open.
Richard Elling
2007-Aug-29 23:34 UTC
[zfs-discuss] Best way to incorporate disk size tolerance into raidz arrays?
MC wrote:
>> This is a problem for replacement, not creation.
>
> You're talking about solving the problem in the future? I'm talking about
> working around the problem today. :) This isn't a fluffy dream problem: I
> ran into it last month when an RMA'd drive wouldn't fit back into a RAID5
> array. RAIDZ is subject to the exact same problem, so I want to find the
> solution before making a RAIDZ array.
>
>> The authoritative answer is in the man page for zpool.
>
> You quoted the exact same line that I quoted in my original post. That
> isn't a solution. That is the constraint which causes the problem and which
> the solution must work around.
>
> The two solutions listed here are slicing, and hacking the "whole disk"
> label to be smaller than the whole disk. There is no consensus here on
> which solution, if any, should be used. I would like there to be, so I'll
> leave the original question open.

Slicing seems simple enough.

-- richard
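Assuming the replacement drive has already been given an undersized slice (as
in the earlier suggestions), the replacement itself is then just a matter of
naming the slice; the pool and device names here are hypothetical:

    # Replace the failed member with slice 0 of the new drive rather than
    # the whole disk:
    zpool replace tank c1t2d0 c1t5d0s0

One side effect, which the follow-up messages pick up, is that ZFS will not
take charge of the new drive's write cache when it is handed only a slice.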
To expand on this:

> The recommended use of whole disks is for drives with volatile write caches
> where ZFS will enable the cache if it owns the whole disk.

Does ZFS really never use the disk cache when working with a disk slice? Is
there any way to force it to use the disk cache?
Richard Elling
2007-Sep-10 19:57 UTC
[zfs-discuss] Best way to incorporate disk size tolerance into raidz arrays?
MC wrote:
> To expand on this:
>
>> The recommended use of whole disks is for drives with volatile
>> write caches where ZFS will enable the cache if it owns the whole disk.
>
> Does ZFS really never use the disk cache when working with a disk slice?

This question doesn't quite make sense: ZFS doesn't know anything about the
disk's cache. But if ZFS has full control over the disk, it will attempt to
enable the disk's volatile write cache.

> Is there any way to force it to use the disk cache?

Again, ZFS doesn't know anything about the disk's cache, but it will issue
cache-flush commands as needed. To preempt the next question: some disks
allow you to turn the volatile write cache off, and some don't. Some disks
allow you to enable or disable the write cache via the format(1M) command in
expert mode, and some don't. AFAIK nobody has a list of these, so you might
just have to try it.

Caveat: do not enable the volatile write cache for UFS.

-- richard
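For reference, a sketch of the format(1M) expert-mode dialog Richard alludes
to. Whether the cache menu appears at all depends on the drive and the
driver, so treat this as illustrative only (device name hypothetical):

    # Expert mode exposes a cache menu on drives/drivers that support it:
    format -e c1t5d0
    # format> cache
    # cache> write_cache
    # write_cache> display       (show whether the write cache is enabled)
    # write_cache> enable        (or: disable)
    # write_cache> quit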