Yeah:)

I'd like to work on this. Here are my first observations:
- We need to call the vdev_op_asize method with an additional 'offset' argument,
- We need to move data to the new disk starting from the very beginning, so we
  can't reuse the scrub/resilver code, which does a tree-walk through the data.

Below you can see how I imagine extending RAID-Z. Here is the legend:

<< >>  - block boundaries
D<x>   - data block
P<x>   - parity block
N<x>   - new parity block
U      - unused
*      - if the offset in an I/O request is less than this marker we use four
         disks only; if greater, we use five disks

After adding 'NewDisk' to the RAID-Z vdev, we have something like this:

Disk0   Disk1   Disk2   Disk3   NewDisk

<<P00   D00     D01     D02     U
P01     D03     D04     D05     U
P02     D06>>   <<P03   D07>>   U
<<P04   D08>>   <<P05   D09     U
P06     D10     D11     D12>>   U
<<P07   D13     D14     D15>>   U

Then we start moving data, but we need to begin from the start:

Disk0   Disk1   Disk2   Disk3   NewDisk

<<N00   D00     D01     D02     D03
N01     D04     D05     D06>>   <<P03
D07>> * U       U       U       U
<<P04   D08>>   <<P05   D09     U
P06     D10     D11     D12>>   U
<<P07   D13     D14     D15>>   U

At the end we have something like this (free space at the end):

Disk0   Disk1   Disk2   Disk3   NewDisk

<<N00   D00     D01     D02     D03
N01     D04     D05     D06>>   <<P03
D07>>   <<P04   D08>>   <<N03   D09
D10     D11     D12>>   <<N04   D13
D14     D15>>   U       U       U
U       U       U       U       U

The biggest problem for me is a method to traverse allocated blocks sorted by
offset. Any hints how to do it?

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
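A minimal sketch of the idea behind the extra 'offset' argument and the '*'
marker, assuming some per-vdev reflow boundary is tracked; the function names,
the reflow_offset parameter and the hard-coded single parity are illustrative
assumptions, not the existing ZFS interface (the rounding mirrors what
vdev_raidz_asize() does today, as far as I can tell):

#include <stdint.h>

/*
 * Pick the column count for a given I/O offset: below the reflow
 * boundary ('*' in the diagrams) the old, narrower stripe width is
 * still in effect; at or above it the newly attached disk is used too.
 */
static uint64_t
raidz_cols_for_offset(uint64_t offset, uint64_t reflow_offset,
    uint64_t old_cols, uint64_t new_cols)
{
	return (offset < reflow_offset ? old_cols : new_cols);
}

/*
 * Offset-aware allocated-size calculation for single-parity RAID-Z:
 * data sectors, plus one parity sector per stripe, rounded up so no
 * single-sector holes are left behind.
 */
static uint64_t
raidz_asize_at(uint64_t psize, uint64_t offset, uint64_t reflow_offset,
    uint64_t old_cols, uint64_t new_cols, uint64_t ashift)
{
	uint64_t nparity = 1;
	uint64_t cols = raidz_cols_for_offset(offset, reflow_offset,
	    old_cols, new_cols);
	uint64_t asize = ((psize - 1) >> ashift) + 1;

	asize += nparity * ((asize + cols - nparity - 1) / (cols - nparity));
	asize = ((asize + nparity) / (nparity + 1)) * (nparity + 1);

	return (asize << ashift);
}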
Well, I read this email having just written a mammoth one in the other thread;
my thoughts:

The main difficulty in this, as far as I see it, is that you're intentionally
moving data on a checksummed copy-on-write filesystem ;). At the very least
this creates a lot of work before we even start to address the problem (and
given that the ZFS guys are undoubtedly working on device removal, that effort
would be wasted). I think this is probably more difficult than it's worth --
re-writing data should be a separate, non-RAID-Z-specific feature (once you're
changing the block pointers, you need to update the checksums, maintain
consistency, preserve snapshots, etc.). Surely it would be much easier to
leave the data as is and version the array's disk layout?

James

On 8/7/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> [proposal and layout diagrams snipped]
On Tue, Aug 07, 2007 at 11:28:31PM +0100, James Blackburn wrote:
> The main difficulty in this, as far as I see it, is that you're
> intentionally moving data on a checksummed copy-on-write filesystem ;).
> At the very least this creates a lot of work before we even start to
> address the problem (and given that the ZFS guys are undoubtedly working
> on device removal, that effort would be wasted). I think this is probably
> more difficult than it's worth -- re-writing data should be a separate,
> non-RAID-Z-specific feature (once you're changing the block pointers, you
> need to update the checksums, maintain consistency, preserve snapshots,
> etc.). Surely it would be much easier to leave the data as is and version
> the array's disk layout?

I've had some time to experiment with my idea. What I did was:

1. Hardcode vdev_raidz_map_alloc() to always use 3 as vdev_children; this
   helps me use a hacked-up 'zpool attach' with RAID-Z.
2. Turn on logging of all writes into the RAID-Z vdev (offset+size).
3. zpool create tank raidz disk0 disk1 disk2
4. zpool attach tank disk0 disk3
5. zpool export tank
6. Back out 1.
7. Use a special tool that will read all blocks written earlier. I use
   only three disks for reading and the logged offset+size pairs.
8. Use the same tool to write the data back, but now using four disks.
9. Try to: zpool import tank

Yeah, 9 fails. It shows that the pool metadata is corrupted.

I was really surprised. This means that layers above the vdev know details
about vdev internals, like the number of disks, I think. What I basically
did was add one disk. ZFS can ask the raidz vdev for a block using exactly
the same offset+size as before. This should be enough, but isn't. Is the
checksum stored with the block pointer in a leaf vdev? If so, why?

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
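For reference, a sketch of the kind of record such write logging (step 2)
could produce and later feed to the userland tool; the struct and function
are illustrative assumptions, not the actual hack used here:

#include <stdint.h>
#include <stdio.h>

/*
 * One record per write that reaches the RAID-Z vdev: just the
 * vdev-level offset and the logical size.  The log is replayed by the
 * userland re-layout tool in steps 7 and 8.
 */
struct raidz_write_rec {
	uint64_t offset;
	uint64_t size;
};

static void
log_raidz_write(FILE *log, uint64_t offset, uint64_t size)
{
	struct raidz_write_rec rec = { offset, size };

	fwrite(&rec, sizeof (rec), 1, log);
}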
On 9/12/07, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> I've had some time to experiment with my idea. What I did was:
>
> [experiment steps snipped]
>
> Yeah, 9 fails. It shows that the pool metadata is corrupted.
>
> I was really surprised. This means that layers above the vdev know details
> about vdev internals, like the number of disks, I think. What I basically
> did was add one disk. ZFS can ask the raidz vdev for a block using exactly
> the same offset+size as before. This should be enough, but isn't. Is the
> checksum stored with the block pointer in a leaf vdev? If so, why?

All that's needed to resolve a block pointer currently is the vdev + offset.
Of course the checksum needs to be correct. So, assuming that you have added
the extra disk, moved the blocks around, and updated the offsets correctly,
the likely problem is checksums. As every block pointer checksums its child,
if you change a block's location and update the block pointer's offset, the
block pointer's parent's checksum will be wrong. If you're re-writing/moving
data, you'll need to re-write the checksums as well, or switch them off.

James
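To make that ripple effect concrete, a toy illustration (the struct and the
checksum() stub are assumptions, nothing like the real blkptr_t): updating a
child's location rewrites bytes inside the parent block, so the checksum the
grandparent stores for that parent is now stale, and the same argument repeats
all the way up to the uberblock.

#include <stdint.h>
#include <stddef.h>

/* Toy block pointer: where the child lives and what it should hash to. */
struct toy_bp {
	uint64_t offset;
	uint64_t cksum;
};

/* Stand-in for a real checksum function (e.g. fletcher or SHA-256). */
extern uint64_t checksum(const void *buf, size_t len);

/*
 * 'bp_in_parent' is assumed to live inside 'parent_block'.  Moving the
 * child changes the parent block's contents, so the grandparent's
 * stored checksum of that parent must be recomputed -- which in turn
 * dirties the grandparent, and so on up the tree.
 */
static void
relocate_child(struct toy_bp *bp_in_parent, uint64_t new_child_offset,
    struct toy_bp *bp_in_grandparent, const void *parent_block, size_t len)
{
	bp_in_parent->offset = new_child_offset;
	bp_in_grandparent->cksum = checksum(parent_block, len);
}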
Pawel Jakub Dawidek wrote:
> I've had some time to experiment with my idea. What I did was:
>
> 1. Hardcode vdev_raidz_map_alloc() to always use 3 as vdev_children; this
>    helps me use a hacked-up 'zpool attach' with RAID-Z.
> 2. Turn on logging of all writes into the RAID-Z vdev (offset+size).
> 3. zpool create tank raidz disk0 disk1 disk2
> 4. zpool attach tank disk0 disk3
> 5. zpool export tank
> 6. Back out 1.
> 7. Use a special tool that will read all blocks written earlier. I use
>    only three disks for reading and the logged offset+size pairs.
> 8. Use the same tool to write the data back, but now using four disks.
> 9. Try to: zpool import tank
>
> Yeah, 9 fails. It shows that the pool metadata is corrupted.
>
> I was really surprised. This means that layers above the vdev know details
> about vdev internals, like the number of disks, I think. What I basically
> did was add one disk. ZFS can ask the raidz vdev for a block using exactly
> the same offset+size as before.

Really? I don't see how that could be possible using the current raidz
on-disk layout.

How did you rearrange the data? I.e., what do steps 7 and 8 do to the data on
disk? If you change the number of disks in a raidz group, then at a minimum
the number of allocated sectors for each block will change, since this count
includes the raidz parity sectors. This will change the block pointers, so
all block-pointer-containing metadata will have to be rewritten (ie, all
indirect blocks and dnodes).

--matt
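A worked example of that effect, assuming single parity, 512-byte sectors
(ashift=9) and a 1536-byte (3-sector) block, with rounding as I understand the
current asize calculation to do it (the numbers are illustrative):

3-wide raidz1: 3 data sectors + ceil(3 / (3 - 1)) = 2 parity sectors = 5,
               rounded up to a multiple of (nparity + 1) = 2       -> 6 sectors
4-wide raidz1: 3 data sectors + ceil(3 / (4 - 1)) = 1 parity sector = 4,
               already a multiple of 2                              -> 4 sectors

So the very same 1536-byte block consumes 6 allocated sectors on the 3-disk
layout but only 4 on the 4-disk one, and that allocated size is recorded in
the block pointer, which is why the block-pointer-containing metadata would
have to be rewritten.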
On Wed, Sep 19, 2007 at 03:06:20PM -0700, Matthew Ahrens wrote:
> Really? I don't see how that could be possible using the current raidz
> on-disk layout.
>
> How did you rearrange the data? I.e., what do steps 7 and 8 do to the data
> on disk? If you change the number of disks in a raidz group, then at a
> minimum the number of allocated sectors for each block will change, since
> this count includes the raidz parity sectors. This will change the block
> pointers, so all block-pointer-containing metadata will have to be
> rewritten (ie, all indirect blocks and dnodes).

I created a userland tool based on vdev_raidz.c. I logged all write requests
to the raidz vdev (offset+size). This userland tool first reads all the data
(based on the offset+size pairs) into memory - it passes the old number of
components to vdev_raidz_map_alloc() as the dcols argument. When all the data
is in memory, it writes the data back, but now giving dcols+1. This way I
don't change the offset which is passed by the upper layers to vdev_raidz.

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
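A rough sketch of what the re-layout loop in such a tool could look like,
replaying the logged records sketched earlier; raidz_map_read() and
raidz_map_write() stand in for code wrapped around a userland build of
vdev_raidz_map_alloc() and are assumptions, not real interfaces:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct raidz_write_rec {
	uint64_t offset;
	uint64_t size;
};

/* Hypothetical wrappers around a userland build of vdev_raidz.c. */
extern void raidz_map_read(int *disk_fds, int cols,
    uint64_t offset, uint64_t size, void *buf);
extern void raidz_map_write(int *disk_fds, int cols,
    uint64_t offset, uint64_t size, const void *buf);

/*
 * Replay every logged write: reconstruct the logical data using the
 * old column count, then lay it out again -- at the same vdev-level
 * offset -- across one more column.
 */
static void
relayout(FILE *log, int *disk_fds, int old_cols)
{
	struct raidz_write_rec rec;

	while (fread(&rec, sizeof (rec), 1, log) == 1) {
		void *buf = malloc(rec.size);

		if (buf == NULL)
			abort();
		raidz_map_read(disk_fds, old_cols, rec.offset, rec.size, buf);
		raidz_map_write(disk_fds, old_cols + 1, rec.offset, rec.size,
		    buf);
		free(buf);
	}
}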