Jesus Cea
2012-Jan-06 05:32 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
Sorry if this list is inappropriate. Pointers welcome.

Using Solaris 10 Update 10, x86-64.

I have been a heavy ZFS user since it became available, and I love the system. My servers are usually "small" (two disks) and usually hosted in a datacenter, so I normally create a single zpool used both for system and data. That is, the entire system lives in one two-disk zpool.

This has worked nicely so far.

But my new servers have SSDs too. Using them for L2ARC is easy enough, but I cannot use them for the ZIL because no separate ZIL device can be used in root zpools. Ugh, that hurts!

So I am thinking about splitting my full two-disk zpool into two zpools, one for system and one for data, each mirrored across both disks. That means two slices per disk.

The system is in production in a datacenter I cannot physically access, but I have remote KVM access. The servers are in production; I can't reinstall, but I could be allowed short (minutes-long) downtimes for a while.

My plan is this:

1. Do a scrub to be sure the data is OK on both disks.

2. Break the mirror. Disk A keeps working; disk B is idle.

3. Partition disk B with two slices instead of the current full-disk slice.

4. Create a "system" zpool on B.

5. Snapshot "zpool/ROOT" on A and "zfs send" it to "system" on B. Repeat several times until we have a recent enough copy. This stream will contain the OS and the zone root datasets. I have zones.

6. Change GRUB to boot from "system" instead of "zpool". Cross fingers and reboot. Do I have to touch the "bootfs" property?

Now ideally I would have "system" as the root zpool. The zones would be mounted from the old datasets.

7. If everything is OK, I would "zfs send" the data from the old zpool to the new one. After doing this a few times to get a recent copy, I would stop the zones and do a final copy, to be sure I have all the data with no changes in progress.

8. I would change the zone manifests to mount the data from the new zpool.

9. I would restart the zones and check that everything seems OK.

10. I would restart the computer to be sure everything works.

So far, if this doesn't work I could go back to the old situation simply by pointing the GRUB boot entry back at the old zpool.

11. If everything works, I would destroy the original "zpool" on A, partition that disk and recreate the mirroring, with B as the source.

12. Reboot to be sure everything is OK.

So, my questions:

a) Is this workflow reasonable, and would it work? Is the procedure documented anywhere? Suggestions? Pitfalls?

b) *MUST* the SWAP and DUMP zvols reside in the root zpool, or can they live in a non-system zpool (always plugged in and available)? I would like a fairly small "system" zpool (say 30 GB; I use Live Upgrade and quite a few zones), but my swap is huge (32 GB, and yes, I use it), so I would prefer to keep SWAP and DUMP in the data zpool, if that is possible and supported.

c) Currently Solaris decides to activate write caching on the SATA disks, which is nice. What would happen if I still use the complete disks BUT with two slices instead of one? Would write cache still be enabled? And yes, I have checked that cache flush works as expected, because I can "only" do around one hundred write+sync operations per second.

Advice?

--
Jesus Cea Avion, jcea at jcea.es - http://www.jcea.es/
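For concreteness, a rough sketch of what steps 4-6 might look like as commands, assuming the disks are c0t0d0/c0t1d0, slice 0 of disk B is the new system slice, and "yourBE" stands in for the actual boot-environment dataset name (none of these names come from the original post):

  zpool create system c0t1d0s0
  zfs snapshot -r zpool/ROOT@mig1
  zfs send -R zpool/ROOT@mig1 | zfs receive -d system    # -d drops the source pool name, giving system/ROOT/...
  # repeat with incremental streams until the copy is recent enough
  zfs snapshot -r zpool/ROOT@mig2
  zfs send -R -i @mig1 zpool/ROOT@mig2 | zfs receive -dF system
  # make the new pool bootable; GRUB's menu.lst on the new pool must point at it too
  zpool set bootfs=system/ROOT/yourBE system
  installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0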
Fajar A. Nugraha
2012-Jan-06 05:54 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
On Fri, Jan 6, 2012 at 12:32 PM, Jesus Cea <jcea at jcea.es> wrote:
> So, my questions:
>
> a) Is this workflow reasonable, and would it work? Is the procedure
> documented anywhere? Suggestions? Pitfalls?

Try http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Complete_Solaris_ZFS_Root_Pool_Recovery

> b) *MUST* the SWAP and DUMP zvols reside in the root zpool, or can they
> live in a non-system zpool (always plugged in and available)? I would
> like a fairly small "system" zpool (say 30 GB; I use Live Upgrade and
> quite a few zones), but my swap is huge (32 GB, and yes, I use it), so I
> would prefer to keep SWAP and DUMP in the data zpool, if that is
> possible and supported.

Try it? :D Last time I played around with S11, you could even go without swap and dump (with some manual setup).

> c) Currently Solaris decides to activate write caching on the SATA
> disks, which is nice. What would happen if I still use the complete
> disks BUT with two slices instead of one? Would write cache still be
> enabled? And yes, I have checked that cache flush works as expected,
> because I can "only" do around one hundred write+sync operations per
> second.

You can enable the disk cache manually using "format".

--
Fajar
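For what it is worth, moving swap and dump to a non-root pool is normally just a matter of pointing swap(1M) and dumpadm(1M) at new zvols. A rough sketch with hypothetical names ("data" for the data pool, "rpool" for the existing root pool) and the sizes from the original post; whether Live Upgrade stays happy with this layout is a separate question:

  zfs create -V 32G data/swap
  zfs create -V 4G data/dump
  swap -a /dev/zvol/dsk/data/swap
  dumpadm -d /dev/zvol/dsk/data/dump
  # once the new devices are active, retire the old ones and fix /etc/vfstab
  swap -d /dev/zvol/dsk/rpool/swap
  zfs destroy rpool/swap
  zfs destroy rpool/dump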
Edward Ned Harvey
2012-Jan-06 13:59 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jesus Cea
>
> Sorry if this list is inappropriate. Pointers welcome.

Not at all. This is the perfect forum for your question.

> So I am thinking about splitting my full two-disk zpool into two zpools,
> one for system and one for data, each mirrored across both disks. That
> means two slices per disk.

Please see the procedure below, which I wrote as notes for myself for disaster-recovery backup/restore of rpool. It is not DIRECTLY applicable to you, but it includes all the necessary ingredients to make your transition successful. So please read, and modify as necessary for your purposes.

Many good notes available: ZFS Troubleshooting Guide
http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#ZFS_Root_Pool_Recovery

Before you begin

Because you will restore from a boot CD, there are only a few compression options available to you: 7z, bzip2, and gzip. The clear winner in general is 7z with compression level 1. It is about as fast as gzip, with compression approximately 2x stronger than bzip2.

Because you will restore from a boot CD, the media needs to be located somewhere that can be accessed from a CD boot environment, which does not include ssh/scp. The obvious choice is NFS. Be aware that the Solaris NFS client is often not very compatible with Linux NFS servers. I am assuming there is a Solaris NFS server available because it makes my job easy while I'm writing this. ;-)

Note: You could just as easily store the backup on a removable disk in something like a zfs pool. Just make sure that however you store it, it is accessible from the CD boot environment, which might not support a later version of zpool etc.

Create NFS exports on some other Solaris machine:

  share -F nfs -o rw=machine1:machine2,root=machine1:machine2 /backupdir

Also edit the hosts file to match, because forward/reverse DNS must match for the client.

Create a backup suitable for system recovery:

  mount someserver:/backupdir /mnt

Create snapshots:

  zfs snapshot rpool@uniquebackupstring
  zfs snapshot rpool/ROOT@uniquebackupstring
  zfs snapshot rpool/ROOT/machinename_slash@uniquebackupstring

Send snapshots:

Notice: due to bugs, don't do this recursively. Do it separately, as outlined here.
Notice: in some version of zpool/zfs these bugs were fixed, so you can safely do it recursively. I don't know what rev is needed.

  zfs send rpool@uniquebackupstring | 7z a -mx=1 -si /mnt/rpool.zfssend.7z
  zfs send rpool/ROOT@uniquebackupstring | 7z a -mx=1 -si /mnt/rpool_ROOT.zfssend.7z
  zfs send rpool/ROOT/machinename_slash@uniquebackupstring | 7z a -mx=1 -si /mnt/rpool_ROOT_machinename_slash.zfssend.7z

It is also wise to capture a list of the "pristine" zpool and zfs properties:

  echo "" > /mnt/zpool-properties.txt
  for pool in `zpool list | grep -v '^NAME ' | sed 's/ .*//'` ; do
    echo "-------------------------------------" | tee -a /mnt/zpool-properties.txt
    echo "zpool get all $pool" | tee -a /mnt/zpool-properties.txt
    zpool get all $pool | tee -a /mnt/zpool-properties.txt
  done

  echo "" > /mnt/zfs-properties.txt
  for fs in `zfs list | grep -v @ | grep -v '^NAME ' | sed 's/ .*//'` ; do
    echo "-------------------------------------" | tee -a /mnt/zfs-properties.txt
    echo "zfs get all $fs" | tee -a /mnt/zfs-properties.txt
    zfs get all $fs | tee -a /mnt/zfs-properties.txt
  done

Notice: the above will also capture info about dump & swap, which might be important, so you know what sizes & blocksizes they are.

Now suppose a disaster has happened. You need to restore.

Boot from the CD. Choose "Solaris", then choose "6. Single User Shell".

To bring up the network:

  ifconfig -a plumb
  ifconfig -a          (note the name of the network adapter; in my case it's e1000g0)
  ifconfig e1000g0 192.168.1.100/24 up
  mount 192.168.1.105:/backupdir /mnt

Verify that you have access to the backup images. Now prepare your boot disk as follows:

  format -e            (select the appropriate disk)
  fdisk                (no fdisk table exists; yes, create the default)
  partition            (choose to "modify" a table based on "hog")

(In my example, I'm using c1t0d0s0 for rpool.)

  zpool create -f -o failmode=continue -R /a -m legacy rpool c1t0d0s0
  7z x /mnt/rpool.zfssend.7z -so | zfs receive -F rpool

(Notice: the first one requires -F because the dataset already exists. The others don't need it.)

  7z x /mnt/rpool_ROOT.zfssend.7z -so | zfs receive rpool/ROOT
  7z x /mnt/rpool_ROOT_machinename_slash.zfssend.7z -so | zfs receive rpool/ROOT/machinename_slash

  zfs set mountpoint=/rpool rpool
  zfs set mountpoint=legacy rpool/ROOT
  zfs set mountpoint=/ rpool/ROOT/machinename_slash
  zpool set bootfs=rpool/ROOT/machinename_slash rpool

You did save zpool-properties.txt and zfs-properties.txt, didn't you? ;-) If not, you'll just have to guess about sizes and blocksizes. The following 2G dump, 1G swap, and 4k-blocksize swap are pretty standard for an x86 system with 2G of RAM.

  zfs create -V 2G rpool/dump
  zfs create -V 1G -b 4k rpool/swap
  installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t0d0s0

And finally:

  init 6

After the system comes up "natural" once, it is probably a good idea to capture a new zpool-properties.txt and zfs-properties.txt and compare them against the "pristine" ones, to see what (if anything) is different. Likely suspects are auto-snapshot properties and stuff like that, which you probably forgot you ever created years ago when you (or someone else) built the server and didn't save any documentation about the process.
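As a small follow-up to the last paragraph: the comparison can be as simple as regenerating the two property dumps with the same loops after the first clean boot and diffing them (the "-after" file names are just an arbitrary choice, not part of the procedure above):

  diff /mnt/zpool-properties.txt /mnt/zpool-properties-after.txt
  diff /mnt/zfs-properties.txt /mnt/zfs-properties-after.txt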
Edward Ned Harvey
2012-Jan-06 14:02 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Fajar A. Nugraha
>
> > c) Currently Solaris decides to activate write caching on the SATA
> > disks, which is nice. What would happen if I still use the complete
> > disks BUT with two slices instead of one? Would write cache still be
> > enabled? And yes, I have checked that cache flush works as expected,
> > because I can "only" do around one hundred write+sync operations per
> > second.
>
> You can enable the disk cache manually using "format".

I'm not aware of any automatic way to make this work correctly. I wrote some scripts to run in cron, if you're interested.
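The scripts mentioned above are not shown in the thread; for reference, the manual path goes through the expert mode of format. A sketch of the interactive session (the menu entries are from memory and can differ per format version and disk/driver type, so treat this as an assumption to verify on your own box):

  # format -e -d c1t0d0
  format> cache
  cache> write_cache
  write_cache> display
  write_cache> enable
  write_cache> quit
  cache> quit
  format> quit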
"Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
2012-Jan-06 20:34 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
Maybe one can do the following (assume c0t0d0 and c0t1d0):

1)  Split the rpool mirror: zpool split rpool newpool c0t1d0s0
1b) zpool destroy newpool
2)  Partition the 2nd HDD (c0t1d0) into two slices (s0 and s1)
3)  zpool create rpool2 c0t1d0s1
4)  Use lucreate -c c0t0d0s0 -n new-zfsbe -p c0t1d0s0
5)  lustatus
      c0t0d0s0
      new-zfsbe
6)  luactivate new-zfsbe
7)  init 6

Now you have two BEs, old and new. You can create dpool on slice 1, add L2ARC and ZIL devices, and repartition c0t0d0. If you want, you can then create rpool on c0t0d0s0 and a new BE, so everything is named rpool for the root pool.

SWAP and DUMP can be in a different pool.

Good luck.

On 1/6/2012 12:32 AM, Jesus Cea wrote:
> [...]

--
Hung-Sheng Tsao Ph.D.
Founder & Principal, HopBit GridComputing LLC
http://laotsao.wordpress.com/
http://laotsao.blogspot.com/
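A rough sketch of that "create dpool on slice 1, add L2ARC and ZIL" step, with every device name hypothetical (two SSDs assumed as c0t2d0/c0t3d0, already sliced):

  zpool create dpool mirror c0t0d0s1 c0t1d0s1
  zpool add dpool log mirror c0t2d0s0 c0t3d0s0
  zpool add dpool cache c0t2d0s1 c0t3d0s1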
"Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
2012-Jan-06 20:36 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
Correction:

On 1/6/2012 3:34 PM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." wrote:
> Maybe one can do the following (assume c0t0d0 and c0t1d0):
>
> 1)  Split the rpool mirror: zpool split rpool newpool c0t1d0s0
> 1b) zpool destroy newpool
> 2)  Partition the 2nd HDD (c0t1d0) into two slices (s0 and s1)
> 3)  zpool create rpool2 c0t1d0s1                          <=== should be c0t1d0s0
> 4)  Use lucreate -c c0t0d0s0 -n new-zfsbe -p c0t1d0s0     <=== should be -p rpool2
> 5)  lustatus
>       c0t0d0s0
>       new-zfsbe
> 6)  luactivate new-zfsbe
> 7)  init 6
>
> [...]

--
Hung-Sheng Tsao Ph.D.
Founder & Principal, HopBit GridComputing LLC
http://laotsao.wordpress.com/
http://laotsao.blogspot.com/
Jim Klimov
2012-Jan-07 12:39 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
Hello, Jesus,

I have transitioned a number of systems roughly by the same procedure as you've outlined. Sadly, my notes are not in English, so they wouldn't be of much help directly; but I can report that I had success with similar "in-place" manual transitions from mirrored SVM (pre-Solaris 10u4) to new ZFS root pools, as well as various transitions of ZFS root pools from one layout to another, on systems with limited numbers of disk drives (2-4 overall). As I've recently reported on the list, I've also done such a "migration" for my faulty single-disk rpool at home via the data pool and back, changing the "copies" setting en route.

Overall, your plan seems okay and has more failsafes than ours had - because longer downtimes were affordable ;) However, when doing such low-level stuff, you should make sure that you have remote access to your systems (ILOM, KVM, etc.; remotely-controlled PDUs for externally enforced poweroff-poweron are welcome), and that you can boot the systems over ILOM/rKVM from a LiveUSB/LiveCD image in case of bigger trouble.

In steps 6-7, where you reboot the system to test that the new rpool works, you might want to keep the zones down, e.g. by disabling the zones service in the old BE just before reboot and zfs-sending this update to the new small rpool. Also, it is likely that in the new BE (small rpool) your old "data" from the big rpool won't get imported by itself, and the zones (or their services) wouldn't start correctly anyway before steps 7-8.

---

Below I'll outline our experience from my notes, as it successfully applied to an even more complicated situation than yours:

On many Sol10/SXCE systems with ZFS roots we've also created a hierarchical layout (separate /var, /usr, /opt with compression enabled), but this procedure HAS FAILED for newer OpenIndiana systems. So for OI we have to use the default single-root layout and only separate some of the /var/* subdirs (adm, log, mail, crash, cores, ...) in order to set quotas and higher compression on them. Such datasets are also kept separate from OS upgrades and are used in all boot environments without cloning.

To simplify things, most of the transitions were done in off-hours, so it was okay to shut down all the zones and other services. In some cases for Sol10/SXCE the procedure involved booting in "Failsafe Boot" mode; on any system this can be done with the boot CD.

For usual Solaris 10 and OpenSolaris SXCE maintenance we did use LiveUpgrade, but at that time its ZFS support was immature, so we circumvented LU and transitioned manually. In those cases we used LU to update systems to the base level supporting ZFS roots (Sol10u4+) while still running from SVM mirrors (one mirror for the main root, another mirror for the LU root holding the new/old OS image). After the transition to a ZFS rpool, we cleared out the LU settings (/etc/lu/, /etc/lutab) by using the defaults from the most recent SUNWlu* packages, and when booted from ZFS we created the "current" LU BE based on the current ZFS rpool.

When the OS was capable of booting from ZFS (Sol10u4+, approx. snv_100), we broke the SVM mirrors, repartitioned the second disk to our liking (about 4-10 GB for rpool, the rest for data), created the new rpool and the dataset hierarchy we needed, and had it mounted under "/zfsroot". Note that in our case we used a "minimized" install of Solaris which fit under 1-2 GB per BE; we did not use a separate dump device, and the swap volume was located in the ZFS data pool (mirror or raidz for 4-disk systems).

Zone roots were also kept separate from the system rpool and were stored in the data pool. This DID cause problems for LiveUpgrade, so zones were detached before LU and reattached-with-upgrade after the OS upgrade and disk migrations.

Then we copied the root FS data like this:

  # cd /zfsroot && ( ufsdump 0f - / | ufsrestore -rf - )

If source (SVM) paths like /var, /usr or /boot are separate UFS filesystems, repeat likewise, changing the paths in the command above.

For non-UFS systems, such as a migration from VxFS or even ZFS (if you need a different layout, compression, etc., so ZFS send/recv is not applicable), you can use Sun cpio (it should carry over extended attributes and ACLs). For example, if you're booted from the LiveCD, the old UFS root is mounted at "/ufsroot" and the new ZFS rpool hierarchy is at "/zfsroot", you'd do this:

  # cd /ufsroot && ( find . -xdev -depth -print | cpio -pvdm /zfsroot )

The example above also copies only the data from the current FS, so you need to repeat it for each UFS sub-FS like /var, etc.

Another problem we've encountered while cpio'ing live systems (when not running from failsafe/LiveCD) is that "find" skips the mountpoints of sub-filesystems. While your new ZFS hierarchy would provide usr, var and opt under /zfsroot, you might need to manually create some others - see the list in your current "df" output. Example:

  # cd /zfsroot
  # mkdir -p tmp proc devices var/run system/contract system/object etc/svc/volatile
  # touch etc/mnttab etc/dfs/sharetab

Also, some system libraries might be overlay-mounted on top of "filenames", for example a hardware-optimised version of libc, or the base "/dev" (used during boot) or its components might be overlaid by the dynamic devfs:

  # df -k | grep '.so'
  /usr/lib/libc/libc_hwcap2.so.1 2728172 522776 2205396 20% /lib/libc.so.1
  # mount | egrep ^/dev
  /devices on /devices read/write/setuid/devices/dev=5340000 on Thu Dec  1 06:14:26 2011
  /dev on /dev read/write/setuid/devices/dev=5380000 on Thu Dec  1 06:14:26 2011
  /dev/fd on fd read/write/setuid/devices/dev=5640001 on Thu Dec  1 06:14:38 2011

In order to copy the original libs and dev paths from a live system, you would need FS-aware tools like ufsdump or zfs send, or you might try loopback mounts:

  # mount -F lofs -o nosub / /mnt
  # (cd /mnt; tar cvf - devices dev lib/libc.so.1 ) | (cd /zfsroot; tar xvf -)
  # umount /mnt

There might be some more pitfalls regarding use of /etc/vfstab along with canmount=noauto, mountpoint=legacy and such, but these are mostly relevant to our split ZFS-root hierarchy on some OS releases.

Also, if you use LiveUpgrade, you might want to apply Jens Elkner's patches and/or consider his troubleshooting tips:

http://iws.cs.uni-magdeburg.de/~elkner/luc/lutrouble.html

One of these patches in particular allowed LU to ignore paths which you know are irrelevant to the OS upgrade, e.g. "/export/home/*" - this can speed up LU by many minutes per run. Note that these patches were applicable back when I updated our systems (Sol10u4-Sol10u8, SXCE~100), so these tricks may or may not be included in current LU software.

Hope this helps; don't hesitate to ask for more details (my notes became a "textbook" for our admins, approaching 100 pages, so not everything is in this email).

//Jim Klimov
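A bare-bones sketch of the detach/reattach dance described above, with a hypothetical zone name ("attach -u" brings the zone's packages and patches up to the level of the upgraded global zone):

  zoneadm -z myzone halt
  zoneadm -z myzone detach
  # ... lucreate / luupgrade / luactivate, reboot into the new BE ...
  zoneadm -z myzone attach -u
  zoneadm -z myzone boot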
Jesus Cea
2012-Jan-10 03:23 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
On 07/01/12 13:39, Jim Klimov wrote:
> I have transitioned a number of systems roughly by the same
> procedure as you've outlined. Sadly, my notes are not in English,
> so they wouldn't be of much help directly;

Yes, my Russian is rusty :-).

I have bitten the bullet and spent 3-4 days doing the migration. I wrote up the details here:

http://www.jcea.es/artic/solaris_zfs_split.htm

The page is written in Spanish, but the terminal transcriptions should be useful for everybody.

In the process, maybe somebody finds this interesting too:

http://www.jcea.es/artic/zfs_flash01.htm

Sorry, Spanish only as well.

> Overall, your plan seems okay and has more failsafes than ours had
> - because longer downtimes were affordable ;) However, when doing
> such low-level stuff, you should make sure that you have remote
> access to your systems (ILOM, KVM, etc.; remotely-controlled PDUs
> for externally enforced poweroff-poweron are welcome)

Yes, the migration I did had plenty of safety points (you can go back if something doesn't work) and, most of the time, the system was in a state able to survive an accidental reboot. Downtime was minimal, less than an hour in total (several reboots to validate configurations before proceeding).

I am quite pleased with the uneventful migration, but I planned it quite carefully. I was worried about hitting bugs in Solaris/ZFS, though. But it was very smooth.

The machine is hosted remotely but yes, I have remote KVM. I can't boot from remote media, but I have an OpenIndiana release on the SSD, with VirtualBox installed and the Solaris 10 Update 10 release ISO, just in case :-).

The only "suspicious" thing is that I keep "swap" (32 GB) and "dump" (4 GB) in the "data" zpool instead of in "system". It seems to work OK. Crossing my fingers for the next Live Upgrade :-).

I read your message after I had migrated, but it was very interesting. Thanks for taking the time to write it!

Have a nice 2012.

--
Jesus Cea Avion, jcea at jcea.es - http://www.jcea.es/
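For anyone repeating this, a quick way to confirm where swap and dump actually ended up after such a move (plain status commands, nothing specific to this setup):

  swap -l
  dumpadm
  zfs list -t volume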
Richard Elling
2012-Jan-10 20:32 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
On Jan 9, 2012, at 7:23 PM, Jesus Cea wrote:
> On 07/01/12 13:39, Jim Klimov wrote:
>> I have transitioned a number of systems roughly by the same
>> procedure as you've outlined. Sadly, my notes are not in English,
>> so they wouldn't be of much help directly;
>
> Yes, my Russian is rusty :-).
>
> I have bitten the bullet and spent 3-4 days doing the migration. I
> wrote up the details here:
>
> http://www.jcea.es/artic/solaris_zfs_split.htm
>
> The page is written in Spanish, but the terminal transcriptions should
> be useful for everybody.
>
> In the process, maybe somebody finds this interesting too:
>
> http://www.jcea.es/artic/zfs_flash01.htm

Google Translate works well for this :-) Thanks for posting!
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Jesus Cea
2012-Jan-11 03:38 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
On 10/01/12 21:32, Richard Elling wrote:
> On Jan 9, 2012, at 7:23 PM, Jesus Cea wrote:
[...]
>> The page is written in Spanish, but the terminal transcriptions
>> should be useful for everybody.
>>
>> In the process, maybe somebody finds this interesting too:
>>
>> http://www.jcea.es/artic/zfs_flash01.htm
>
> Google Translate works well for this :-) Thanks for posting! --
> richard

Talking about this, there is something that bugs me. For some reason, sync writes are written to the ZIL only if they are "small". Big writes are far slower, apparently bypassing the ZIL. Maybe it is a concern about disk bandwidth (because we would be writing the data twice), but that is only speculation. This happens even when the ZIL is on an SSD. I think ZFS should write sync writes to the SSD even if they are quite big (megabytes).

In the "zil.c" code I see things like:

"""
/*
 * Define a limited set of intent log block sizes.
 * These must be a multiple of 4KB. Note only the amount used (again
 * aligned to 4KB) actually gets written. However, we can't always just
 * allocate SPA_MAXBLOCKSIZE as the slog space could be exhausted.
 */
uint64_t zil_block_buckets[] = {
    4096,               /* non TX_WRITE */
    8192+4096,          /* data base */
    32*1024 + 4096,     /* NFS writes */
    UINT64_MAX
};

/*
 * Use the slog as long as the logbias is 'latency' and the current commit size
 * is less than the limit or the total list size is less than 2X the limit.
 * Limit checking is disabled by setting zil_slog_limit to UINT64_MAX.
 */
uint64_t zil_slog_limit = 1024 * 1024;
#define USE_SLOG(zilog) (((zilog)->zl_logbias == ZFS_LOGBIAS_LATENCY) && \
        (((zilog)->zl_cur_used < zil_slog_limit) || \
        ((zilog)->zl_itx_list_sz < (zil_slog_limit << 1))))
"""

I have 2 GB of ZIL on a mirrored SSD. I can randomly write to it at 240 MB/s, so I guess the sync-write restriction could be reexamined when ZFS is using a separate ZIL device with plenty of space to burn :-). Am I missing anything?

Could I safely change the value of "zil_slog_limit" in the kernel (via mdb) when using a separate ZIL device? Would it do what I expect?

My usual database block size is 64 KB... :-(. A write-ahead log write can easily be bigger than 128 KB (before and after data, plus some changes in the parent nodes). It seems faster to do several writes with several SYNCs than one big write with a final SYNC. That is quite counterintuitive. Am I hitting something else, like the "write throttle"?

PS: I am talking about Solaris 10 U10. My ZFS "logbias" attribute is "latency".

--
Jesus Cea Avion, jcea at jcea.es - http://www.jcea.es/
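For reference, the two usual ways to experiment with that tunable would look roughly like this (128 MB used as an arbitrary example value). Whether raising it is safe, and whether it alone makes large sync writes land on the slog (logbias and the write throttle also play a part), is exactly the open question above, so treat this as an untested sketch:

  # /etc/system entry, takes effect at the next boot (value in bytes)
  set zfs:zil_slog_limit = 0x8000000

  # or live, writing the 64-bit variable with mdb (reverts at reboot)
  echo 'zil_slog_limit/Z 0x8000000' | mdb -kw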