Yeah:)

I'd like to work on this. Here are my first observations:
- We need to call the vdev_op_asize method with an additional 'offset' argument,
- We need to move data to the new disk starting from the very beginning, so we
  can't reuse the scrub/resilver code, which does a tree-walk through the data.

Below you can see how I imagine extending RAID-Z. Here is the legend:

<< >>  - block boundaries
D<x>   - data block
P<x>   - parity block
N<x>   - new parity block
U      - unused
*      - if the offset in an I/O request is less than this marker we use four
         disks only; if greater, we use five disks

After adding 'NewDisk' to the RAID-Z vdev, we have something like this:

Disk0   Disk1   Disk2   Disk3   NewDisk

<<P00   D00     D01     D02     U
P01     D03     D04     D05     U
P02     D06>>   <<P03   D07>>   U
<<P04   D08>>   <<P05   D09     U
P06     D10     D11     D12>>   U
<<P07   D13     D14     D15>>   U

Then we start moving data, but we need to begin from the start:

Disk0   Disk1   Disk2   Disk3   NewDisk

<<N00   D00     D01     D02     D03
N01     D04     D05     D06>>   <<P03
D07>> * U       U       U       U
<<P04   D08>>   <<P05   D09     U
P06     D10     D11     D12>>   U
<<P07   D13     D14     D15>>   U

At the end we have something like this (free space at the end):

Disk0   Disk1   Disk2   Disk3   NewDisk

<<N00   D00     D01     D02     D03
N01     D04     D05     D06>>   <<P03
D07>>   <<P04   D08>>   <<N03   D09
D10     D11     D12>>   <<N04   D13
D14     D15>>   U       U       U
U       U       U       U       U

The biggest problem for me is a method to traverse allocated blocks sorted by
offset. Any hints how to do it?

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
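A minimal sketch of the idea behind the extra 'offset' argument and the '*'
marker, assuming some per-vdev reflow boundary is tracked; the function names,
the reflow_offset parameter and the hard-coded single parity are illustrative
assumptions, not the existing ZFS interface (the rounding mirrors what
vdev_raidz_asize() does today, as far as I can tell):

#include <stdint.h>

/*
 * Pick the column count for a given I/O offset: below the reflow
 * boundary ('*' in the diagrams) the old, narrower stripe width is
 * still in effect; at or above it the newly attached disk is used too.
 */
static uint64_t
raidz_cols_for_offset(uint64_t offset, uint64_t reflow_offset,
    uint64_t old_cols, uint64_t new_cols)
{
	return (offset < reflow_offset ? old_cols : new_cols);
}

/*
 * Offset-aware allocated-size calculation for single-parity RAID-Z:
 * data sectors, plus one parity sector per stripe, rounded up so no
 * single-sector holes are left behind.
 */
static uint64_t
raidz_asize_at(uint64_t psize, uint64_t offset, uint64_t reflow_offset,
    uint64_t old_cols, uint64_t new_cols, uint64_t ashift)
{
	uint64_t nparity = 1;
	uint64_t cols = raidz_cols_for_offset(offset, reflow_offset,
	    old_cols, new_cols);
	uint64_t asize = ((psize - 1) >> ashift) + 1;

	asize += nparity * ((asize + cols - nparity - 1) / (cols - nparity));
	asize = ((asize + nparity) / (nparity + 1)) * (nparity + 1);

	return (asize << ashift);
}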
Well, I read this email having just written a mammoth one in the other thread;
my thoughts:

The main difficulty in this, as far as I see it, is that you're intentionally
moving data on a checksummed copy-on-write filesystem ;). At the very least
this creates a lot of work before we even start to address the problem (and
given that the ZFS guys are undoubtedly working on device removal, that effort
would be wasted). I think this is probably more difficult than it's worth --
re-writing data should be a separate, non-RAID-Z-specific feature (once you're
changing the block pointers, you need to update the checksums, maintain
consistency, preserve snapshots, etc.). Surely it would be much easier to
leave the data as is and version the array's disk layout?

James

On 8/7/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> [proposal and layout diagrams snipped]
On Tue, Aug 07, 2007 at 11:28:31PM +0100, James Blackburn wrote:
> The main difficulty in this, as far as I see it, is that you're
> intentionally moving data on a checksummed copy-on-write filesystem ;).
> At the very least this creates a lot of work before we even start to
> address the problem (and given that the ZFS guys are undoubtedly working
> on device removal, that effort would be wasted). I think this is probably
> more difficult than it's worth -- re-writing data should be a separate,
> non-RAID-Z-specific feature (once you're changing the block pointers, you
> need to update the checksums, maintain consistency, preserve snapshots,
> etc.). Surely it would be much easier to leave the data as is and version
> the array's disk layout?

I've had some time to experiment with my idea. What I did was:

1. Hardcode vdev_raidz_map_alloc() to always use 3 as vdev_children; this
   helps me use a hacked-up 'zpool attach' with RAID-Z.
2. Turn on logging of all writes into the RAID-Z vdev (offset+size).
3. zpool create tank raidz disk0 disk1 disk2
4. zpool attach tank disk0 disk3
5. zpool export tank
6. Back out 1.
7. Use a special tool that will read all blocks written earlier. I use
   only three disks for reading and the logged offset+size pairs.
8. Use the same tool to write the data back, but now using four disks.
9. Try to: zpool import tank

Yeah, 9 fails. It shows that the pool metadata is corrupted.

I was really surprised. This means that layers above the vdev know details
about vdev internals, like the number of disks, I think. What I basically
did was add one disk. ZFS can ask the raidz vdev for a block using exactly
the same offset+size as before. This should be enough, but isn't. Is the
checksum stored with the block pointer in a leaf vdev? If so, why?

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
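For reference, a sketch of the kind of record such write logging (step 2)
could produce and later feed to the userland tool; the struct and function
are illustrative assumptions, not the actual hack used here:

#include <stdint.h>
#include <stdio.h>

/*
 * One record per write that reaches the RAID-Z vdev: just the
 * vdev-level offset and the logical size.  The log is replayed by the
 * userland re-layout tool in steps 7 and 8.
 */
struct raidz_write_rec {
	uint64_t offset;
	uint64_t size;
};

static void
log_raidz_write(FILE *log, uint64_t offset, uint64_t size)
{
	struct raidz_write_rec rec = { offset, size };

	fwrite(&rec, sizeof (rec), 1, log);
}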
On 9/12/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> I've had some time to experiment with my idea. What I did was:
>
> [experiment steps snipped]
>
> Yeah, 9 fails. It shows that the pool metadata is corrupted.
>
> I was really surprised. This means that layers above the vdev know details
> about vdev internals, like the number of disks, I think. What I basically
> did was add one disk. ZFS can ask the raidz vdev for a block using exactly
> the same offset+size as before. This should be enough, but isn't. Is the
> checksum stored with the block pointer in a leaf vdev? If so, why?

All that's needed to resolve a block pointer currently is the vdev + offset.
Of course the checksum needs to be correct. So, assuming that you have added
the extra disk, moved the blocks around, and updated the offsets correctly,
the likely problem is checksums. As every block pointer checksums its child,
if you change a block's location and update the block pointer's offset, the
block pointer's parent's checksum will be wrong. If you're re-writing/moving
data, you'll need to re-write the checksums as well, or switch them off.

James
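To make that ripple effect concrete, a toy illustration (the struct and the
checksum() stub are assumptions, nothing like the real blkptr_t): updating a
child's location rewrites bytes inside the parent block, so the checksum the
grandparent stores for that parent is now stale, and the same argument repeats
all the way up to the uberblock.

#include <stdint.h>
#include <stddef.h>

/* Toy block pointer: where the child lives and what it should hash to. */
struct toy_bp {
	uint64_t offset;
	uint64_t cksum;
};

/* Stand-in for a real checksum function (e.g. fletcher or SHA-256). */
extern uint64_t checksum(const void *buf, size_t len);

/*
 * 'bp_in_parent' is assumed to live inside 'parent_block'.  Moving the
 * child changes the parent block's contents, so the grandparent's
 * stored checksum of that parent must be recomputed -- which in turn
 * dirties the grandparent, and so on up the tree.
 */
static void
relocate_child(struct toy_bp *bp_in_parent, uint64_t new_child_offset,
    struct toy_bp *bp_in_grandparent, const void *parent_block, size_t len)
{
	bp_in_parent->offset = new_child_offset;
	bp_in_grandparent->cksum = checksum(parent_block, len);
}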
Pawel Jakub Dawidek wrote:
> I've had some time to experiment with my idea. What I did was:
>
> 1. Hardcode vdev_raidz_map_alloc() to always use 3 as vdev_children; this
>    helps me use a hacked-up 'zpool attach' with RAID-Z.
> 2. Turn on logging of all writes into the RAID-Z vdev (offset+size).
> 3. zpool create tank raidz disk0 disk1 disk2
> 4. zpool attach tank disk0 disk3
> 5. zpool export tank
> 6. Back out 1.
> 7. Use a special tool that will read all blocks written earlier. I use
>    only three disks for reading and the logged offset+size pairs.
> 8. Use the same tool to write the data back, but now using four disks.
> 9. Try to: zpool import tank
>
> Yeah, 9 fails. It shows that the pool metadata is corrupted.
>
> I was really surprised. This means that layers above the vdev know details
> about vdev internals, like the number of disks, I think. What I basically
> did was add one disk. ZFS can ask the raidz vdev for a block using exactly
> the same offset+size as before.

Really? I don't see how that could be possible using the current raidz
on-disk layout.

How did you rearrange the data? I.e., what do steps 7 and 8 do to the data on
disk? If you change the number of disks in a raidz group, then at a minimum
the number of allocated sectors for each block will change, since this count
includes the raidz parity sectors. This will change the block pointers, so
all block-pointer-containing metadata will have to be rewritten (ie, all
indirect blocks and dnodes).

--matt
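A worked example of that effect, assuming single parity, 512-byte sectors
(ashift=9) and a 1536-byte (3-sector) block, with rounding as I understand the
current asize calculation to do it (the numbers are illustrative):

3-wide raidz1: 3 data sectors + ceil(3 / (3 - 1)) = 2 parity sectors = 5,
               rounded up to a multiple of (nparity + 1) = 2       -> 6 sectors
4-wide raidz1: 3 data sectors + ceil(3 / (4 - 1)) = 1 parity sector = 4,
               already a multiple of 2                              -> 4 sectors

So the very same 1536-byte block consumes 6 allocated sectors on the 3-disk
layout but only 4 on the 4-disk one, and that allocated size is recorded in
the block pointer, which is why the block-pointer-containing metadata would
have to be rewritten.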
On Wed, Sep 19, 2007 at 03:06:20PM -0700, Matthew Ahrens wrote:
> Really? I don't see how that could be possible using the current raidz
> on-disk layout.
>
> How did you rearrange the data? I.e., what do steps 7 and 8 do to the data
> on disk? If you change the number of disks in a raidz group, then at a
> minimum the number of allocated sectors for each block will change, since
> this count includes the raidz parity sectors. This will change the block
> pointers, so all block-pointer-containing metadata will have to be
> rewritten (ie, all indirect blocks and dnodes).

I created a userland tool based on vdev_raidz.c. I logged all write requests
to the raidz vdev (offset+size). This userland tool first reads all the data
(based on the offset+size pairs) into memory - it passes the old number of
components to vdev_raidz_map_alloc() as the dcols argument. When all the data
is in memory, it writes the data back, but now giving dcols+1. This way I
don't change the offset which is passed by the upper layers to vdev_raidz.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
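A rough sketch of what the re-layout loop in such a tool could look like,
replaying the logged records sketched earlier; raidz_map_read() and
raidz_map_write() stand in for code wrapped around a userland build of
vdev_raidz_map_alloc() and are assumptions, not real interfaces:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct raidz_write_rec {
	uint64_t offset;
	uint64_t size;
};

/* Hypothetical wrappers around a userland build of vdev_raidz.c. */
extern void raidz_map_read(int *disk_fds, int cols,
    uint64_t offset, uint64_t size, void *buf);
extern void raidz_map_write(int *disk_fds, int cols,
    uint64_t offset, uint64_t size, const void *buf);

/*
 * Replay every logged write: reconstruct the logical data using the
 * old column count, then lay it out again -- at the same vdev-level
 * offset -- across one more column.
 */
static void
relayout(FILE *log, int *disk_fds, int old_cols)
{
	struct raidz_write_rec rec;

	while (fread(&rec, sizeof (rec), 1, log) == 1) {
		void *buf = malloc(rec.size);

		if (buf == NULL)
			abort();
		raidz_map_read(disk_fds, old_cols, rec.offset, rec.size, buf);
		raidz_map_write(disk_fds, old_cols + 1, rec.offset, rec.size,
		    buf);
		free(buf);
	}
}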