As we all know, it is important to ensure that modern disks are set up
properly with the correct block size. Everything is good if all the disks
and the pool are "ashift=9" (512 byte blocks). But as soon as one new drive
requires 4K blocks, performance of the entire pool drops through the floor.

In order to upgrade there appear to be two separate things that must be
done for a ZFS pool:

1. Create partitions on 4K boundaries. This is simple with the "-a 4k"
   option in gpart, and it isn't hard to remove disks one at a time from a
   pool, reformat them on the right boundaries and put them back. Hopefully
   you've left a few spare bytes on the disk to ensure that your partition
   doesn't get smaller when you reinsert it into the pool.

2. Create a brand new pool which has ashift=12 and zfs send|receive all
   the data over.

I guess I don't understand enough about zpool to know why the pool itself
has a block size, since I understood ZFS to have variable stripe widths.

The problem with step 2 is that you need to have enough spare hard disks to
create a whole new pool and throw away the old disks. Plus a disk
controller with lots of spare ports. Plus the ability to take the system
offline for hours or days while the migration happens.

One way to reduce this slightly is to create a new pool with reduced
redundancy. For example, create a RAIDZ2 with two fake disks, then offline
those disks.

So, given how much this problem sucks (it is extremely easy to add a 4K
disk by mistake as a replacement for a failed disk), and how painful the
workaround is... will ZFS ever gain the ability to change the block size
for the pool? Or is this so deep in the internals of ZFS that it is as
likely as being able to dynamically add disks to an existing vdev, in the
"never going to happen" basket?

And secondly, is it also bad to have ashift=9 disks inside an ashift=12
pool? That is, do we need to replace all our disks in one go and forever
keep big sticky labels on each disk so we never mix them?

Thanks for any advice

Ari Maniatis

--
-------------------------->
Aristedes Maniatis
ish
http://www.ish.com.au
Level 1, 30 Wilson Street, Newtown 2042, Australia
phone +61 2 9550 5001   fax +61 2 9550 4001
GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A
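To make the two steps above concrete, here is a rough sketch of what they
might look like on FreeBSD. The pool names (tank, newtank), the GPT labels
and the device names are placeholders, not a recipe:

    # Step 1: repartition one replaced disk on 4K boundaries and resilver
    gpart destroy -F ada1                       # wipe the old layout
    gpart create -s gpt ada1
    gpart add -t freebsd-zfs -a 4k -l disk1 ada1
    zpool replace tank olddisk1 gpt/disk1       # "olddisk1" as shown by zpool status

    # Step 2: build a new pool created with ashift=12 (see the
    # min_auto_ashift sysctl mentioned below) and copy everything across
    zpool create newtank raidz2 gpt/new0 gpt/new1 gpt/new2 gpt/new3
    zfs snapshot -r tank@migrate
    zfs send -R tank@migrate | zfs receive -Fu newtank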
On Wednesday, September 10, 2014 04:46:28 PM Aristedes Maniatis wrote:
> As we all know, it is important to ensure that modern disks are set up
> properly with the correct block size. Everything is good if all the disks
> and the pool are "ashift=9" (512 byte blocks). But as soon as one new drive
> requires 4K blocks, performance of the entire pool drops through the floor.
[..]
> And secondly, is it also bad to have ashift=9 disks inside an ashift=12
> pool? That is, do we need to replace all our disks in one go and forever
> keep big sticky labels on each disk so we never mix them?

For what it's worth, in the freebsd.org cluster we automatically align
everything to a minimum of 4k, no matter what the actual drive is. We set:

    sysctl vfs.zfs.min_auto_ashift=12

(this saves a lot of messing around with gnop etc) and ensure all the gpt
slices are 4k or better aligned.

--
Peter Wemm - peter at wemm.org; peter at FreeBSD.org; peter at yahoo-inc.com; KI6FJV
UTF-8: for when a ' or ... just won’t do…
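To keep that setting across reboots and to check what a pool and its
partitions actually ended up with, something along these lines should work
on FreeBSD (the pool name "tank" and device "ada0" are placeholders):

    # make the minimum ashift persistent
    echo 'vfs.zfs.min_auto_ashift=12' >> /etc/sysctl.conf

    # show the ashift recorded for the vdevs of an existing pool
    zdb -C tank | grep ashift

    # list partition start sectors; with 512-byte logical sectors a start
    # that is a multiple of 8 is 4K aligned
    gpart show ada0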
On 10.09.2014 at 08:46, Aristedes Maniatis wrote:
> As we all know, it is important to ensure that modern disks are set
> up properly with the correct block size. Everything is good if all
> the disks and the pool are "ashift=9" (512 byte blocks). But as soon
> as one new drive requires 4K blocks, performance of the entire pool
> drops through the floor.
>
> In order to upgrade there appear to be two separate things that must
> be done for a ZFS pool.
>
> 1. Create partitions on 4K boundaries. This is simple with the
> "-a 4k" option in gpart, and it isn't hard to remove disks one at a
> time from a pool, reformat them on the right boundaries and put them
> back. Hopefully you've left a few spare bytes on the disk to ensure
> that your partition doesn't get smaller when you reinsert it into the
> pool.
>
> 2. Create a brand new pool which has ashift=12 and zfs send|receive
> all the data over.
>
> I guess I don't understand enough about zpool to know why the pool
> itself has a block size, since I understood ZFS to have variable
> stripe widths.

I'm not a ZFS internals expert, just a long time user, but I'll try to
answer your questions.

ZFS is based on a copy-on-write paradigm, which ensures that no data is
ever overwritten in place. All writes go to new blank blocks, and only
after the last reference to an "old" block is lost (when no TXG or
snapshot references it) is the old block freed and put back on the free
block map.

ZFS uses variable block sizes by breaking large blocks down into smaller
fragments as suitable for the data to be stored. The largest block to be
used is configurable (128 KByte by default) and the smallest fragment is
the sector size (i.e. 512 or 4096 bytes), as configured by "ashift".

The problem with 4K sector disks that report 512 byte sectors is that ZFS
still assumes that no data is overwritten in place, while the disk drive
does exactly that behind the curtains. ZFS thinks it can atomically write
512 bytes, but the drive reads 4K, places the 512 bytes of data within
that 4K physical sector in the drive's cache, and then writes back the 4K
of data in one go.

The cost is not only the latency of this read-modify-write sequence, but
also that an elementary ZFS assumption is violated: data in the other
(logical) 512 byte sectors of the physical 4 KByte sector can be lost if
that write operation fails, resulting in loss of data in files that
happen to share the physical sector with the one that received the write.
This may never hit you, but ZFS is built on the assumption that it cannot
happen at all, which is no longer true with 4 KB drives used with
ashift=9.

> The problem with step 2 is that you need to have enough hard disks
> spare to create a whole new pool and throw away the old disks. Plus
> a disk controller with lots of spare ports. Plus the ability to take
> the system offline for hours or days while the migration happens.
>
> One way to reduce this slightly is to create a new pool with reduced
> redundancy. For example, create a RAIDZ2 with two fake disks, then
> offline those disks.

Both methods are dangerous! Studies have found that the risk of another
disk failure during resilvering is substantial; that was the reason for
the higher-redundancy RAIDZ groups (raidz2, raidz3).

With 1) you have to copy the data multiple times, and the load could lead
to the loss of one of the source drives (and since you are in the process
of overwriting the drive that provided redundancy, you lose your pool
that way).
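As an aside, a quick way to spot the 512-byte emulation described above
before a drive ever goes near a pool is to look at what the drive reports
("ada0" is a placeholder and the annotated output is only illustrative):

    # geom-level view: sectorsize is the logical size, stripesize usually
    # reveals the 4K physical sector behind 512-byte emulation
    diskinfo -v ada0
    #        512     # sectorsize
    #        4096    # stripesize

    # ATA identify data shows logical vs. physical sector size directly
    camcontrol identify ada0 | grep 'sector size'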
The copying to a degraded pool that you describe in 2) is a possibility
(and I've done it, once). You should make sure that all source data is
still available until a successful resilvering of the "new" pool with the
fake disks replaced. You could do this by moving the redundant disks from
the old pool to the new pool (i.e. degrading the old pool, after all data
has been copied, to use the redundant drives to complete the new pool).
But this assumes that the technologies of the drives match - I'll soon go
from 4*2TB to 3*4TB (raidz1 in both cases), since I had 2 of the 2TB
drives fail over the course of last year (replaced under warranty).

> So, given how much this problem sucks (it is extremely easy to add
> a 4K disk by mistake as a replacement for a failed disk), and how
> painful the workaround is... will ZFS ever gain the ability to change
> the block size for the pool? Or is this so deep in the internals of ZFS
> that it is as likely as being able to dynamically add disks to an
> existing vdev, in the "never going to happen" basket?

You can add a 4 KB physical drive that emulates 512 byte sectors (nearly
all drives do) to an ashift=9 ZFS pool, but performance will suffer and
you'll be violating a ZFS assumption, as explained above.

> And secondly, is it also bad to have ashift=9 disks inside an ashift=12
> pool? That is, do we need to replace all our disks in one go and
> forever keep big sticky labels on each disk so we never mix them?

The ashift parameter is per pool, not per disk. You can have a drive with
emulated 512 byte sectors in an ashift=9 pool, but you cannot change the
ashift value of a pool after creation.

Regards, STefan
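For anyone attempting the degraded-pool trick discussed above, one way it
might be done on FreeBSD is sketched below; the sizes, file paths, device
names and pool names are placeholders, and the fake disks are offlined
before any data is written to them:

    # sparse files the same size as the real disks stand in for the two
    # missing drives (2T is just an example size)
    truncate -s 2T /var/tmp/fake0 /var/tmp/fake1
    mdconfig -a -t vnode -f /var/tmp/fake0 -u 10
    mdconfig -a -t vnode -f /var/tmp/fake1 -u 11

    # create the new raidz2 pool with two real disks "missing"
    zpool create newtank raidz2 gpt/new0 gpt/new1 gpt/new2 gpt/new3 md10 md11

    # offline the fakes immediately so nothing beyond the labels is written
    zpool offline newtank md10
    zpool offline newtank md11

    # ...copy the data, then pull the redundant disks from the old pool,
    # repartition them, and bring the new pool up to full redundancy:
    zpool replace newtank md10 gpt/old0
    zpool replace newtank md11 gpt/old1

Whatever stands in for the fake disks, the key point from this thread
stands: keep the old pool (or a backup) intact until the new pool has
resilvered completely.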
> On Sep 10, 2014, at 12:46 AM, Aristedes Maniatis <ari at ish.com.au> wrote:
>
> As we all know, it is important to ensure that modern disks are set up properly with the correct block size. Everything is good if all the disks and the pool are "ashift=9" (512 byte blocks). But as soon as one new drive requires 4K blocks, performance of the entire pool drops through the floor.
>
> In order to upgrade there appear to be two separate things that must be done for a ZFS pool.
>
> 1. Create partitions on 4K boundaries. This is simple with the "-a 4k" option in gpart, and it isn't hard to remove disks one at a time from a pool, reformat them on the right boundaries and put them back. Hopefully you've left a few spare bytes on the disk to ensure that your partition doesn't get smaller when you reinsert it into the pool.
>
> 2. Create a brand new pool which has ashift=12 and zfs send|receive all the data over.
>
> I guess I don't understand enough about zpool to know why the pool itself has a block size, since I understood ZFS to have variable stripe widths.
>
> The problem with step 2 is that you need to have enough spare hard disks to create a whole new pool and throw away the old disks. Plus a disk controller with lots of spare ports. Plus the ability to take the system offline for hours or days while the migration happens.
>
> One way to reduce this slightly is to create a new pool with reduced redundancy. For example, create a RAIDZ2 with two fake disks, then offline those disks.

Lots of good info in other responses, I just wanted to address this part
of your message.

It should be a given that good backups are a requirement before you start
any of this. _Especially_ if you have to destroy the old pool in order to
provide redundancy for the new pool.

I have done this ashift conversion and it was a bit of a nail-biting
experience, as you've anticipated. The one suggestion I have for improving
on the above is to use snapshots to minimize the downtime. Get an initial
clone of the pool during off-peak hours (if any); then you only need to
take the system down to send a "final" differential snapshot.

JN
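A minimal sketch of that snapshot-based approach, assuming a source pool
"tank" and a new ashift=12 pool "newtank" (both names are placeholders):

    # initial full copy while the system is still in service
    zfs snapshot -r tank@initial
    zfs send -R tank@initial | zfs receive -Fu newtank

    # downtime window: stop anything writing to the pool, then send only
    # the changes made since the initial snapshot
    zfs snapshot -r tank@final
    zfs send -R -i tank@initial tank@final | zfs receive -Fu newtank

Depending on how long the initial copy takes, several intermediate
incremental sends can be done before the final one, keeping the last
outage to minutes rather than hours or days.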