We would like to delete and recreate our existing ZFS pool without losing any data. The way we thought we could do this was to attach a few HDDs and create a new temporary pool, migrate our existing ZFS volumes to the new pool, delete and recreate the old pool, and then migrate the volumes back. The big problem is that we need to do all of this live, without any downtime. We have 2 volumes taking up around 11TB, and they are shared out to a couple of Windows servers with COMSTAR. Anyone have any good ideas?
Hi Wolf,

Which Solaris release is this? If it is an OpenSolaris system running a recent build, you might consider the zpool split feature, which splits a mirrored pool into two separate pools while the original pool is online. If possible, attach the spare disks to create the mirrored pool as a first step. See the example below.

Thanks,

Cindy

You can attach the spare disks to the existing pool to create the mirrored pool:

# zpool attach tank disk-1 spare-disk-1
# zpool attach tank disk-2 spare-disk-2

Which gives you a pool like this:

# zpool status tank
  pool: tank
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Tue Apr 27 14:36:28 2010
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            c2t9d0   ONLINE       0     0     0
            c2t5d0   ONLINE       0     0     0
          mirror-1   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c2t6d0   ONLINE       0     0     0  56.5K resilvered

errors: No known data errors

Then, split the mirrored pool, like this:

# zpool split tank tank2
# zpool import tank2
# zpool status tank tank2
  pool: tank
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Tue Apr 27 14:36:28 2010
config:

        NAME       STATE     READ WRITE CKSUM
        tank       ONLINE       0     0     0
          c2t9d0   ONLINE       0     0     0
          c2t10d0  ONLINE       0     0     0

errors: No known data errors

  pool: tank2
 state: ONLINE
 scrub: none requested
config:

        NAME       STATE     READ WRITE CKSUM
        tank2      ONLINE       0     0     0
          c2t5d0   ONLINE       0     0     0
          c2t6d0   ONLINE       0     0     0
It is unclear what you want to do. What's the goal for this exercise? If you want to replace the pool with larger disks and the pool is a mirror or raidz, you just replace one disk at a time and allow the pool to rebuild itself. Once all the disks have been replaced, it will automatically recognize the size increase and expand the pool.
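A minimal sketch of that disk-at-a-time replacement, assuming a pool named tank and hypothetical device names (c1t1d0 being swapped for a larger c2t1d0, and so on):

# zpool set autoexpand=on tank
# zpool replace tank c1t1d0 c2t1d0
# zpool status tank      (wait for the resilver to complete before touching the next disk)
# zpool replace tank c1t2d0 c2t2d0
(repeat for the remaining disks in the vdev)
# zpool list tank        (the extra capacity appears once the last disk has been replaced)

The autoexpand property is available on recent builds; on older releases the extra space typically shows up only after an export/import of the pool.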
The original drive pool was configured with 144 1TB drives and a hardware RAID-0 stripe across every 4 drives to create 4TB LUNs. These LUNs were then combined into 6 raidz2 vdevs and added to the ZFS pool. I would like to delete the original hardware RAID-0 stripes and add the 144 drives directly to the ZFS pool. This should improve performance considerably, since we would no longer be doing RAID on top of RAID, and it would fix the whole stripe-size issue. Since this pool will be deleted and recreated, I will need to move the data off to something else.
We are running the latest dev release.

I was hoping to just mirror the ZFS volumes and not the whole pool. The original pool is around 100TB in size. The spare disks I have come up with will total around 40TB. We only have 11TB of space in use on the original ZFS pool.
On Apr 28, 2010, at 6:40 AM, Wolfraider wrote:
> We are running the latest dev release.
>
> I was hoping to just mirror the ZFS volumes and not the whole pool. The original pool is around 100TB in size. The spare disks I have come up with will total around 40TB. We only have 11TB of space in use on the original ZFS pool.

Mirrors are made with vdevs (LUs or disks), not pools. However, the vdev attached to a mirror must be the same size (or nearly so) as the original. If the original vdevs are 4TB, then a migration to a pool made with 1TB vdevs cannot be done by replacing vdevs (the mirror method).
-- richard
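As a quick illustration of that size constraint (hypothetical device names, and the exact wording of the error may differ between builds), attaching a smaller device to an existing vdev is simply refused:

# zpool attach tank 4tb-lun-0 1tb-disk-0
cannot attach 1tb-disk-0 to 4tb-lun-0: device is too small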
On Apr 28, 2010, at 6:37 AM, Wolfraider wrote:
> The original drive pool was configured with 144 1TB drives and a hardware RAID-0 stripe across every 4 drives to create 4TB LUNs.

For the archives, this is not a good idea...

> These LUNs were then combined into 6 raidz2 vdevs and added to the ZFS pool.

... even when data protection is achieved at a higher level. See also RAID-0+1 vs. RAID-1+0.
-- richard
> On Apr 28, 2010, at 6:37 AM, Wolfraider wrote:
>> The original drive pool was configured with 144 1TB drives and a hardware RAID-0 stripe across every 4 drives to create 4TB LUNs.
>
> For the archives, this is not a good idea...

Exactly. This is the reason I want to blow all of the old configuration away. :)
> Mirrors are made with vdevs (LUs or disks), not pools. However, the vdev attached to a mirror must be the same size (or nearly so) as the original. If the original vdevs are 4TB, then a migration to a pool made with 1TB vdevs cannot be done by replacing vdevs (the mirror method).
> -- richard

Both LUNs that we are sharing out with COMSTAR are ZFS volumes. It sounds like we can create the new temporary pool, create a couple of new LUs the same size as the old ones, and then create mirrors between the two. Wait until they are synced and break the mirror. This is what we were thinking we could do; we just wanted to make sure.
On Apr 28, 2010, at 8:39 AM, Wolfraider wrote:
>> Mirrors are made with vdevs (LUs or disks), not pools. However, the vdev attached to a mirror must be the same size (or nearly so) as the original. If the original vdevs are 4TB, then a migration to a pool made with 1TB vdevs cannot be done by replacing vdevs (the mirror method).
>> -- richard
>
> Both LUNs that we are sharing out with COMSTAR are ZFS volumes. It sounds like we can create the new temporary pool, create a couple of new LUs the same size as the old ones, and then create mirrors between the two. Wait until they are synced and break the mirror. This is what we were thinking we could do; we just wanted to make sure.

This can work, and you can make the temporary iSCSI targets compressed, sparse volumes. If your 100TB of data will squeeze into 40TB, then it is just a matter of time to copy.
-- richard
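A rough sketch of creating such a temporary target, assuming a temporary pool named tank2 and a hypothetical 8TB LU: the -s flag makes the volume sparse so only the blocks actually written consume space, and compression helps the data squeeze into the smaller pool. The COMSTAR steps are shown with a placeholder GUID:

# zfs create -s -V 8T -o compression=on tank2/templun0
# sbdadm create-lu /dev/zvol/rdsk/tank2/templun0
# stmfadm add-view <GUID-printed-by-sbdadm>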
For this type of migration, downtime is required. However, it can be reduced to a few hours or even a few minutes, depending on how much change needs to be synced. I have done this many times on a NetApp filer, but it can be applied to ZFS as well. The first thing to consider is doing the migration only once, so you don't need two downtimes. Let's talk about the migration first.

1. You will need a recent enough zfs to support zfs send and receive.
2. Create your destination pool (there are things you can do here to avoid migrating back).
3. Create your destination volume.
4. Create snapshot snap1 of the source volume (zfs snapshot).
5. Use zfs send <volume>@snap1 | zfs receive <dstvol> (this will sync most of the 11TB and may take days).
6. Create snapshot snap2 of the source volume.
7. Incrementally sync the snapshots with zfs send -i <volume>@snap1 <volume>@snap2 | zfs receive <dstvol> (this should be faster). Repeat steps 6 and 7 as needed to get the sync time down to about the allowed downtime.
8. ** DOWNTIME ** Turn off the Windows servers.
9. zfs unmount the source volume to ensure no more changes to the volume.
10. Create snapshot final of the source volume.
11. Incrementally sync the final snapshot.
12. Rename the source volume to a backup name (you can rename pools via export/import).
13. Rename the dstvol to production.
14. Mount the production dstvol (reconfigure what you need for COMSTAR).
15. Turn on the Windows servers.
16. You need some way of verifying the migration, and of backing out if needed. Once verified, enable the Windows services. ** END OF DOWNTIME **
17. You should have a backup of the old volume before destroying the old pool.
18. Destroy the old pool.
19. Add the now-spare disks to the new pool.

Doing this with no downtime is not possible, because you need to switch pools and ZFS doesn't currently support features like pvmove, vgsplit, vgmerge, and vgreduce in LVM.
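To make steps 4 through 7 concrete, here is a minimal sketch using assumed names (source pool tank, volume lun0, temporary pool tank2); the -F on the incremental receive simply rolls the destination back to the last snapshot in case anything touched it between syncs:

# zfs snapshot tank/lun0@snap1
# zfs send tank/lun0@snap1 | zfs receive tank2/lun0
# zfs snapshot tank/lun0@snap2
# zfs send -i tank/lun0@snap1 tank/lun0@snap2 | zfs receive -F tank2/lun0

Note that zvols shared over COMSTAR are never mounted, so the equivalent of step 9 is simply making sure the Windows side has stopped writing before the final snapshot is taken.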
So, on the point of not needing a migration back: even at 144 disks, they won't all be in the same raid group. So figure out the best raid-group size for you, since ZFS doesn't support changing the number of disks in a raidz yet. I usually use the number of slots per shelf, or a good number is 7-10 disks for raidz1 and 10-20 for raidz2 or raidz3. Create the dstvol with that optimized number of disks per group (or two groups) and do the migration. Then, once the old volume is destroyed, just add its disks to the new pool. Remember you can expand a pool one raid group at a time; you just can't change the raid-group size after that.
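A sketch of that one-raid-group-at-a-time growth, with hypothetical device names and a 10-disk raidz2 group size:

# zpool create newpool raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c3t8d0 c3t9d0
(migrate the data, destroy the old pool, then reuse its disks)
# zpool add newpool raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0 c4t8d0 c4t9d0

Each zpool add appends another raidz2 group of the same width; what cannot be changed later is the number of disks inside an existing group.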
On Wed, 28 Apr 2010, Jim Horng wrote:
> So, on the point of not needing a migration back: even at 144 disks, they won't all be in the same raid group. So figure out the best raid-group size for you, since ZFS doesn't support changing the number of disks in a raidz yet. I usually use the number of slots per shelf, or a good number is 7-10 disks for raidz1 and 10-20 for raidz2 or raidz3.

Good luck with the 20 disks in raidz2 or raidz3. If you are going to base the number of disks per vdev on the shelf/rack layout, then it makes more sense to base it on the number of disk shelves/controllers than on the number of slots per shelf. You would want to distribute the disks in your raidz2 vdevs across the shelves rather than devote a shelf to one raidz2 vdev. At least, that is what you would do if you care about performance and reliability. If a shelf dies, your pool should remain alive. Likewise, there is likely more I/O bandwidth available if the vdevs are spread across controllers.

Bob
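A hedged illustration of that layout, assuming the shelves show up on controllers c3, c4, and c5: each raidz2 vdev takes two disks from every shelf, so a whole-shelf failure costs any one vdev only two disks and the pool stays up, degraded rather than dead:

# zpool create tank \
    raidz2 c3t0d0 c3t1d0 c4t0d0 c4t1d0 c5t0d0 c5t1d0 \
    raidz2 c3t2d0 c3t3d0 c4t2d0 c4t3d0 c5t2d0 c5t3d0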
Sorry, I need to correct myself. Mirroring the LUNs on the Windows side in order to switch the storage pool underneath them is a great idea, and I think you can do that without downtime.
3 shelves with 2 controllers each, 48 drives per shelf. These are Fibre Channel attached. We would like all 144 drives added to the same large pool.
I understand your point. However, in most production systems the shelves are added incrementally, so it makes sense to relate the group size to the number of slots per shelf. And in most cases, withstanding a shelf failure is too much overhead on the storage anyway. For example, in his case he would have to configure RAID 1+0 with only two shelves, i.e. 50% overhead.
On Wed, 28 Apr 2010, Jim Horng wrote:
> I understand your point. However, in most production systems the shelves are added incrementally, so it makes sense to relate the group size to the number of slots per shelf. And in most cases, withstanding a shelf failure is too much overhead on the storage anyway. For example, in his case he would have to configure RAID 1+0 with only two shelves, i.e. 50% overhead.

Yes, I can see that with 48 drives per shelf, the opportunities for creative natural fault isolation are less available. It is also true that hardware is often added incrementally. A strong argument can be made that smaller, less capable, simplex-routed shelves may be a more cost-effective and reliable solution when used carefully with zfs. For example, mini-shelves which support 8 drives each.

Bob
> 3 shelves with 2 controllers each, 48 drives per shelf. These are Fibre Channel attached. We would like all 144 drives added to the same large pool.

I would do either a 12- or 16-disk raidz3 vdev and spread the disks across controllers within each vdev. You may also want to leave at least 1 spare disk per shelf (i.e. some vdevs with one less disk). Just my 2 cents.
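A hedged sketch of one such vdev, with hypothetical controller and target names: a 12-disk raidz3 group drawing two disks from each of the six controllers, plus one hot spare per shelf. Further raidz3 groups would be added the same way until all 144 drives are placed:

# zpool create tank \
    raidz3 c3t0d0 c3t1d0 c4t0d0 c4t1d0 c5t0d0 c5t1d0 \
           c6t0d0 c6t1d0 c7t0d0 c7t1d0 c8t0d0 c8t1d0 \
    spare c3t47d0 c5t47d0 c7t47d0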
On Apr 28, 2010, at 9:48 PM, Jim Horng wrote:
>> 3 shelves with 2 controllers each, 48 drives per shelf. These are Fibre Channel attached. We would like all 144 drives added to the same large pool.
>
> I would do either a 12- or 16-disk raidz3 vdev and spread the disks across controllers within each vdev. You may also want to leave at least 1 spare disk per shelf (i.e. some vdevs with one less disk).

Why would you recommend a spare for raidz2 or raidz3?
-- richard
> Why would you recommend a spare for raidz2 or raidz3?
> -- richard

A spare is there to minimize the reconstruction time. Remember that a vdev cannot start resilvering until a replacement disk is available, and with disks as big as they are today, resilvering takes many hours. I would rather have the disk finish resilvering before I even have a chance to replace the bad one than risk more disks failing before it has a chance to resilver. This is especially important if the file system is not at a location with 24-hour staff.
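A short sketch of what that looks like in practice, with hypothetical device names; once spares are in the pool, the ZFS fault-management agent can pull one in as soon as a disk faults, without waiting for anyone to be on site:

# zpool add tank spare c3t47d0 c5t47d0 c7t47d0
# zpool status tank
(the spares appear in their own section as AVAIL, and switch to INUSE while covering a faulted disk)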
On Wed, 28 Apr 2010, Jim Horng wrote:
>> Why would you recommend a spare for raidz2 or raidz3?
>
> A spare is there to minimize the reconstruction time. Remember that a vdev cannot start resilvering until a replacement disk is available, and with disks as big as they are today, resilvering takes many hours. I would rather have the disk finish resilvering before I even have a chance to replace the bad one than risk more disks failing before it has a chance to resilver.

Would your opinion change if the disks you used took 7 days to resilver?

Bob
> Would your opinion change if the disks you used took
> 7 days to resilver?
>
> Bob

That would only make a stronger case that a hot spare is absolutely needed. It would also make a strong case for choosing raidz3 over raidz2, as well as for vdevs with a smaller number of disks.