Jesus Cea
2012-Jan-06 05:32 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
Sorry if this list is inappropriate. Pointers welcome.

Using Solaris 10 Update 10, x86-64.

I have been a heavy ZFS user since it became available, and I love the system. My servers are usually "small" (two disks) and usually hosted in a datacenter, so I normally create a single zpool used both for system and data. That is, the entire system lives in one two-disk zpool.

This has worked nicely so far.

But my new servers have SSDs too. Using them for L2ARC is easy enough, but I cannot use them for the ZIL because no separate ZIL device can be used in root zpools. Ugh, that hurts!

So I am thinking about splitting my full two-disk zpool into two zpools, one for system and one for data, each mirrored across both disks. That means two slices per disk.

The system is in production in a datacenter I cannot physically access, but I have remote KVM access. The servers are in production; I can't reinstall, but I could be allowed short (minutes-long) downtimes for a while.

My plan is this:

1. Do a scrub to be sure the data is OK on both disks.

2. Break the mirror. Disk A keeps working; disk B is idle.

3. Partition disk B with two slices instead of the current full-disk slice.

4. Create a "system" zpool on B.

5. Snapshot "zpool/ROOT" on A and "zfs send" it to "system" on B. Repeat several times until we have a recent enough copy. This stream will contain the OS and the zone root datasets. I have zones.

6. Change GRUB to boot from "system" instead of "zpool". Cross fingers and reboot. Do I have to touch the "bootfs" property?

Now ideally I would have "system" as the root zpool. The zones would be mounted from the old datasets.

7. If everything is OK, I would "zfs send" the data from the old zpool to the new one. After doing this a few times to get a recent copy, I would stop the zones and do a final copy, to be sure I have all the data with no changes in progress.

8. I would change the zone manifests to mount the data from the new zpool.

9. I would restart the zones and check that everything seems OK.

10. I would restart the computer to be sure everything works.

So far, if this doesn't work I could go back to the old situation simply by pointing the GRUB boot entry back at the old zpool.

11. If everything works, I would destroy the original "zpool" on A, partition that disk and recreate the mirroring, with B as the source.

12. Reboot to be sure everything is OK.

So, my questions:

a) Is this workflow reasonable, and would it work? Is the procedure documented anywhere? Suggestions? Pitfalls?

b) *MUST* the SWAP and DUMP zvols reside in the root zpool, or can they live in a non-system zpool (always plugged in and available)? I would like a fairly small "system" zpool (say 30 GB; I use Live Upgrade and quite a few zones), but my swap is huge (32 GB, and yes, I use it), so I would prefer to keep SWAP and DUMP in the data zpool, if that is possible and supported.

c) Currently Solaris decides to activate write caching on the SATA disks, which is nice. What would happen if I still use the complete disks BUT with two slices instead of one? Would write cache still be enabled? And yes, I have checked that cache flush works as expected, because I can "only" do around one hundred write+sync operations per second.

Advice?

--
Jesus Cea Avion, jcea at jcea.es - http://www.jcea.es/
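For concreteness, a rough sketch of what steps 4-6 might look like as commands, assuming the disks are c0t0d0/c0t1d0, slice 0 of disk B is the new system slice, and "yourBE" stands in for the actual boot-environment dataset name (none of these names come from the original post):

  zpool create system c0t1d0s0
  zfs snapshot -r zpool/ROOT@mig1
  zfs send -R zpool/ROOT@mig1 | zfs receive -d system    # -d drops the source pool name, giving system/ROOT/...
  # repeat with incremental streams until the copy is recent enough
  zfs snapshot -r zpool/ROOT@mig2
  zfs send -R -i @mig1 zpool/ROOT@mig2 | zfs receive -dF system
  # make the new pool bootable; GRUB's menu.lst on the new pool must point at it too
  zpool set bootfs=system/ROOT/yourBE system
  installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0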
Fajar A. Nugraha
2012-Jan-06 05:54 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
On Fri, Jan 6, 2012 at 12:32 PM, Jesus Cea <jcea at jcea.es> wrote:
> So, my questions:
>
> a) Is this workflow reasonable, and would it work? Is the procedure
> documented anywhere? Suggestions? Pitfalls?

Try http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Complete_Solaris_ZFS_Root_Pool_Recovery

> b) *MUST* the SWAP and DUMP zvols reside in the root zpool, or can they
> live in a non-system zpool (always plugged in and available)? I would
> like a fairly small "system" zpool (say 30 GB; I use Live Upgrade and
> quite a few zones), but my swap is huge (32 GB, and yes, I use it), so I
> would prefer to keep SWAP and DUMP in the data zpool, if that is
> possible and supported.

Try it? :D Last time I played around with S11, you could even go without swap and dump (with some manual setup).

> c) Currently Solaris decides to activate write caching on the SATA
> disks, which is nice. What would happen if I still use the complete
> disks BUT with two slices instead of one? Would write cache still be
> enabled? And yes, I have checked that cache flush works as expected,
> because I can "only" do around one hundred write+sync operations per
> second.

You can enable the disk cache manually using "format".

--
Fajar
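For what it is worth, moving swap and dump to a non-root pool is normally just a matter of pointing swap(1M) and dumpadm(1M) at new zvols. A rough sketch with hypothetical names ("data" for the data pool, "rpool" for the existing root pool) and the sizes from the original post; whether Live Upgrade stays happy with this layout is a separate question:

  zfs create -V 32G data/swap
  zfs create -V 4G data/dump
  swap -a /dev/zvol/dsk/data/swap
  dumpadm -d /dev/zvol/dsk/data/dump
  # once the new devices are active, retire the old ones and fix /etc/vfstab
  swap -d /dev/zvol/dsk/rpool/swap
  zfs destroy rpool/swap
  zfs destroy rpool/dump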
Edward Ned Harvey
2012-Jan-06 13:59 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jesus Cea
>
> Sorry if this list is inappropriate. Pointers welcome.

Not at all. This is the perfect forum for your question.

> So I am thinking about splitting my full two-disk zpool into two zpools,
> one for system and one for data, each mirrored across both disks. That
> means two slices per disk.

Please see the procedure below, which I wrote as notes for myself for disaster-recovery backup/restore of rpool. It is not DIRECTLY applicable to you, but it includes all the necessary ingredients to make your transition successful. So please read, and modify as necessary for your purposes.

Many good notes available: ZFS Troubleshooting Guide
http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#ZFS_Root_Pool_Recovery

Before you begin

Because you will restore from a boot CD, there are only a few compression options available to you: 7z, bzip2, and gzip. The clear winner in general is 7z with compression level 1. It is about as fast as gzip, with compression approximately 2x stronger than bzip2.

Because you will restore from a boot CD, the media needs to be located somewhere that can be accessed from a CD boot environment, which does not include ssh/scp. The obvious choice is NFS. Be aware that the Solaris NFS client is often not very compatible with Linux NFS servers. I am assuming there is a Solaris NFS server available because it makes my job easy while I'm writing this. ;-)

Note: You could just as easily store the backup on a removable disk in something like a zfs pool. Just make sure that however you store it, it is accessible from the CD boot environment, which might not support a later version of zpool etc.

Create NFS exports on some other Solaris machine:

  share -F nfs -o rw=machine1:machine2,root=machine1:machine2 /backupdir

Also edit the hosts file to match, because forward/reverse DNS must match for the client.

Create a backup suitable for system recovery:

  mount someserver:/backupdir /mnt

Create snapshots:

  zfs snapshot rpool@uniquebackupstring
  zfs snapshot rpool/ROOT@uniquebackupstring
  zfs snapshot rpool/ROOT/machinename_slash@uniquebackupstring

Send snapshots:

Notice: due to bugs, don't do this recursively. Do it separately, as outlined here.
Notice: in some version of zpool/zfs these bugs were fixed, so you can safely do it recursively. I don't know what rev is needed.

  zfs send rpool@uniquebackupstring | 7z a -mx=1 -si /mnt/rpool.zfssend.7z
  zfs send rpool/ROOT@uniquebackupstring | 7z a -mx=1 -si /mnt/rpool_ROOT.zfssend.7z
  zfs send rpool/ROOT/machinename_slash@uniquebackupstring | 7z a -mx=1 -si /mnt/rpool_ROOT_machinename_slash.zfssend.7z

It is also wise to capture a list of the "pristine" zpool and zfs properties:

  echo "" > /mnt/zpool-properties.txt
  for pool in `zpool list | grep -v '^NAME ' | sed 's/ .*//'` ; do
    echo "-------------------------------------" | tee -a /mnt/zpool-properties.txt
    echo "zpool get all $pool" | tee -a /mnt/zpool-properties.txt
    zpool get all $pool | tee -a /mnt/zpool-properties.txt
  done

  echo "" > /mnt/zfs-properties.txt
  for fs in `zfs list | grep -v @ | grep -v '^NAME ' | sed 's/ .*//'` ; do
    echo "-------------------------------------" | tee -a /mnt/zfs-properties.txt
    echo "zfs get all $fs" | tee -a /mnt/zfs-properties.txt
    zfs get all $fs | tee -a /mnt/zfs-properties.txt
  done

Notice: the above will also capture info about dump & swap, which might be important, so you know what sizes & blocksizes they are.

Now suppose a disaster has happened. You need to restore.

Boot from the CD. Choose "Solaris", then choose "6. Single User Shell".

To bring up the network:

  ifconfig -a plumb
  ifconfig -a          (note the name of the network adapter; in my case it's e1000g0)
  ifconfig e1000g0 192.168.1.100/24 up
  mount 192.168.1.105:/backupdir /mnt

Verify that you have access to the backup images. Now prepare your boot disk as follows:

  format -e            (select the appropriate disk)
  fdisk                (no fdisk table exists; yes, create the default)
  partition            (choose to "modify" a table based on "hog")

(In my example, I'm using c1t0d0s0 for rpool.)

  zpool create -f -o failmode=continue -R /a -m legacy rpool c1t0d0s0
  7z x /mnt/rpool.zfssend.7z -so | zfs receive -F rpool

(Notice: the first one requires -F because the dataset already exists. The others don't need it.)

  7z x /mnt/rpool_ROOT.zfssend.7z -so | zfs receive rpool/ROOT
  7z x /mnt/rpool_ROOT_machinename_slash.zfssend.7z -so | zfs receive rpool/ROOT/machinename_slash

  zfs set mountpoint=/rpool rpool
  zfs set mountpoint=legacy rpool/ROOT
  zfs set mountpoint=/ rpool/ROOT/machinename_slash
  zpool set bootfs=rpool/ROOT/machinename_slash rpool

You did save zpool-properties.txt and zfs-properties.txt, didn't you? ;-) If not, you'll just have to guess about sizes and blocksizes. The following 2G dump, 1G swap, and 4k-blocksize swap are pretty standard for an x86 system with 2G of RAM.

  zfs create -V 2G rpool/dump
  zfs create -V 1G -b 4k rpool/swap
  installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t0d0s0

And finally:

  init 6

After the system comes up "natural" once, it is probably a good idea to capture a new zpool-properties.txt and zfs-properties.txt and compare them against the "pristine" ones, to see what (if anything) is different. Likely suspects are auto-snapshot properties and stuff like that, which you probably forgot you ever created years ago when you (or someone else) built the server and didn't save any documentation about the process.
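As a small follow-up to the last paragraph: the comparison can be as simple as regenerating the two property dumps with the same loops after the first clean boot and diffing them (the "-after" file names are just an arbitrary choice, not part of the procedure above):

  diff /mnt/zpool-properties.txt /mnt/zpool-properties-after.txt
  diff /mnt/zfs-properties.txt /mnt/zfs-properties-after.txt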
Edward Ned Harvey
2012-Jan-06 14:02 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Fajar A. Nugraha
>
> > c) Currently Solaris decides to activate write caching on the SATA
> > disks, which is nice. What would happen if I still use the complete
> > disks BUT with two slices instead of one? Would write cache still be
> > enabled? And yes, I have checked that cache flush works as expected,
> > because I can "only" do around one hundred write+sync operations per
> > second.
>
> You can enable the disk cache manually using "format".

I'm not aware of any automatic way to make this work correctly. I wrote some scripts to run in cron, if you're interested.
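The scripts mentioned above are not shown in the thread; for reference, the manual path goes through the expert mode of format. A sketch of the interactive session (the menu entries are from memory and can differ per format version and disk/driver type, so treat this as an assumption to verify on your own box):

  # format -e -d c1t0d0
  format> cache
  cache> write_cache
  write_cache> display
  write_cache> enable
  write_cache> quit
  cache> quit
  format> quit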
"Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
2012-Jan-06 20:34 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
Maybe one can do the following (assume c0t0d0 and c0t1d0):

1)  Split the rpool mirror: zpool split rpool newpool c0t1d0s0
1b) zpool destroy newpool
2)  Partition the 2nd HDD (c0t1d0) into two slices (s0 and s1)
3)  zpool create rpool2 c0t1d0s1
4)  Use lucreate -c c0t0d0s0 -n new-zfsbe -p c0t1d0s0
5)  lustatus
      c0t0d0s0
      new-zfsbe
6)  luactivate new-zfsbe
7)  init 6

Now you have two BEs, old and new. You can create dpool on slice 1, add L2ARC and ZIL devices, and repartition c0t0d0. If you want, you can then create rpool on c0t0d0s0 and a new BE, so everything is named rpool for the root pool.

SWAP and DUMP can be in a different pool.

Good luck.

On 1/6/2012 12:32 AM, Jesus Cea wrote:
> [...]

--
Hung-Sheng Tsao Ph.D.
Founder & Principal, HopBit GridComputing LLC
http://laotsao.wordpress.com/
http://laotsao.blogspot.com/
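A rough sketch of that "create dpool on slice 1, add L2ARC and ZIL" step, with every device name hypothetical (two SSDs assumed as c0t2d0/c0t3d0, already sliced):

  zpool create dpool mirror c0t0d0s1 c0t1d0s1
  zpool add dpool log mirror c0t2d0s0 c0t3d0s0
  zpool add dpool cache c0t2d0s1 c0t3d0s1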
"Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
2012-Jan-06 20:36 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
Correction:

On 1/6/2012 3:34 PM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D." wrote:
> Maybe one can do the following (assume c0t0d0 and c0t1d0):
>
> 1)  Split the rpool mirror: zpool split rpool newpool c0t1d0s0
> 1b) zpool destroy newpool
> 2)  Partition the 2nd HDD (c0t1d0) into two slices (s0 and s1)
> 3)  zpool create rpool2 c0t1d0s1                          <=== should be c0t1d0s0
> 4)  Use lucreate -c c0t0d0s0 -n new-zfsbe -p c0t1d0s0     <=== should be -p rpool2
> 5)  lustatus
>       c0t0d0s0
>       new-zfsbe
> 6)  luactivate new-zfsbe
> 7)  init 6
>
> [...]

--
Hung-Sheng Tsao Ph.D.
Founder & Principal, HopBit GridComputing LLC
http://laotsao.wordpress.com/
http://laotsao.blogspot.com/
Jim Klimov
2012-Jan-07 12:39 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
Hello, Jesus,

I have transitioned a number of systems roughly by the same procedure as you've outlined. Sadly, my notes are not in English, so they wouldn't be of much help directly; but I can report that I had success with similar "in-place" manual transitions from mirrored SVM (pre-Solaris 10u4) to new ZFS root pools, as well as various transitions of ZFS root pools from one layout to another, on systems with limited numbers of disk drives (2-4 overall). As I've recently reported on the list, I've also done such a "migration" for my faulty single-disk rpool at home via the data pool and back, changing the "copies" setting en route.

Overall, your plan seems okay and has more failsafes than ours had - because longer downtimes were affordable ;) However, when doing such low-level stuff, you should make sure that you have remote access to your systems (ILOM, KVM, etc.; remotely-controlled PDUs for externally enforced poweroff-poweron are welcome), and that you can boot the systems over ILOM/rKVM from a LiveUSB/LiveCD image in case of bigger trouble.

In steps 6-7, where you reboot the system to test that the new rpool works, you might want to keep the zones down, e.g. by disabling the zones service in the old BE just before reboot and zfs-sending this update to the new small rpool. Also, it is likely that in the new BE (small rpool) your old "data" from the big rpool won't get imported by itself, and the zones (or their services) wouldn't start correctly anyway before steps 7-8.

---

Below I'll outline our experience from my notes, as it successfully applied to an even more complicated situation than yours:

On many Sol10/SXCE systems with ZFS roots we've also created a hierarchical layout (separate /var, /usr, /opt with compression enabled), but this procedure HAS FAILED for newer OpenIndiana systems. So for OI we have to use the default single-root layout and only separate some of the /var/* subdirs (adm, log, mail, crash, cores, ...) in order to set quotas and higher compression on them. Such datasets are also kept separate from OS upgrades and are used in all boot environments without cloning.

To simplify things, most of the transitions were done in off-hours, so it was okay to shut down all the zones and other services. In some cases for Sol10/SXCE the procedure involved booting in "Failsafe Boot" mode; on any system this can be done with the boot CD.

For usual Solaris 10 and OpenSolaris SXCE maintenance we did use LiveUpgrade, but at that time its ZFS support was immature, so we circumvented LU and transitioned manually. In those cases we used LU to update systems to the base level supporting ZFS roots (Sol10u4+) while still running from SVM mirrors (one mirror for the main root, another mirror for the LU root holding the new/old OS image). After the transition to a ZFS rpool, we cleared out the LU settings (/etc/lu/, /etc/lutab) by using the defaults from the most recent SUNWlu* packages, and when booted from ZFS we created the "current" LU BE based on the current ZFS rpool.

When the OS was capable of booting from ZFS (Sol10u4+, approx. snv_100), we broke the SVM mirrors, repartitioned the second disk to our liking (about 4-10 GB for rpool, the rest for data), created the new rpool and the dataset hierarchy we needed, and had it mounted under "/zfsroot". Note that in our case we used a "minimized" install of Solaris which fit under 1-2 GB per BE; we did not use a separate dump device, and the swap volume was located in the ZFS data pool (mirror or raidz for 4-disk systems).

Zone roots were also kept separate from the system rpool and were stored in the data pool. This DID cause problems for LiveUpgrade, so zones were detached before LU and reattached-with-upgrade after the OS upgrade and disk migrations.

Then we copied the root FS data like this:

  # cd /zfsroot && ( ufsdump 0f - / | ufsrestore -rf - )

If source (SVM) paths like /var, /usr or /boot are separate UFS filesystems, repeat likewise, changing the paths in the command above.

For non-UFS systems, such as a migration from VxFS or even ZFS (if you need a different layout, compression, etc., so ZFS send/recv is not applicable), you can use Sun cpio (it should carry over extended attributes and ACLs). For example, if you're booted from the LiveCD, the old UFS root is mounted at "/ufsroot" and the new ZFS rpool hierarchy is at "/zfsroot", you'd do this:

  # cd /ufsroot && ( find . -xdev -depth -print | cpio -pvdm /zfsroot )

The example above also copies only the data from the current FS, so you need to repeat it for each UFS sub-FS like /var, etc.

Another problem we've encountered while cpio'ing live systems (when not running from failsafe/LiveCD) is that "find" skips the mountpoints of sub-filesystems. While your new ZFS hierarchy would provide usr, var and opt under /zfsroot, you might need to manually create some others - see the list in your current "df" output. Example:

  # cd /zfsroot
  # mkdir -p tmp proc devices var/run system/contract system/object etc/svc/volatile
  # touch etc/mnttab etc/dfs/sharetab

Also, some system libraries might be overlay-mounted on top of "filenames", for example a hardware-optimised version of libc, or the base "/dev" (used during boot) or its components might be overlaid by the dynamic devfs:

  # df -k | grep '.so'
  /usr/lib/libc/libc_hwcap2.so.1 2728172 522776 2205396 20% /lib/libc.so.1
  # mount | egrep ^/dev
  /devices on /devices read/write/setuid/devices/dev=5340000 on Thu Dec  1 06:14:26 2011
  /dev on /dev read/write/setuid/devices/dev=5380000 on Thu Dec  1 06:14:26 2011
  /dev/fd on fd read/write/setuid/devices/dev=5640001 on Thu Dec  1 06:14:38 2011

In order to copy the original libs and dev paths from a live system, you would need FS-aware tools like ufsdump or zfs send, or you might try loopback mounts:

  # mount -F lofs -o nosub / /mnt
  # (cd /mnt; tar cvf - devices dev lib/libc.so.1 ) | (cd /zfsroot; tar xvf -)
  # umount /mnt

There might be some more pitfalls regarding use of /etc/vfstab along with canmount=noauto, mountpoint=legacy and such, but these are mostly relevant to our split ZFS-root hierarchy on some OS releases.

Also, if you use LiveUpgrade, you might want to apply Jens Elkner's patches and/or consider his troubleshooting tips:

http://iws.cs.uni-magdeburg.de/~elkner/luc/lutrouble.html

One of these patches in particular allowed LU to ignore paths which you know are irrelevant to the OS upgrade, e.g. "/export/home/*" - this can speed up LU by many minutes per run. Note that these patches were applicable back when I updated our systems (Sol10u4-Sol10u8, SXCE~100), so these tricks may or may not be included in current LU software.

Hope this helps; don't hesitate to ask for more details (my notes became a "textbook" for our admins, approaching 100 pages, so not everything is in this email).

//Jim Klimov
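A bare-bones sketch of the detach/reattach dance described above, with a hypothetical zone name ("attach -u" brings the zone's packages and patches up to the level of the upgraded global zone):

  zoneadm -z myzone halt
  zoneadm -z myzone detach
  # ... lucreate / luupgrade / luactivate, reboot into the new BE ...
  zoneadm -z myzone attach -u
  zoneadm -z myzone boot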
Jesus Cea
2012-Jan-10 03:23 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
On 07/01/12 13:39, Jim Klimov wrote:
> I have transitioned a number of systems roughly by the same
> procedure as you've outlined. Sadly, my notes are not in English,
> so they wouldn't be of much help directly;

Yes, my Russian is rusty :-).

I have bitten the bullet and spent 3-4 days doing the migration. I wrote up the details here:

http://www.jcea.es/artic/solaris_zfs_split.htm

The page is written in Spanish, but the terminal transcriptions should be useful for everybody.

In the process, maybe somebody finds this interesting too:

http://www.jcea.es/artic/zfs_flash01.htm

Sorry, Spanish only as well.

> Overall, your plan seems okay and has more failsafes than ours had
> - because longer downtimes were affordable ;) However, when doing
> such low-level stuff, you should make sure that you have remote
> access to your systems (ILOM, KVM, etc.; remotely-controlled PDUs
> for externally enforced poweroff-poweron are welcome)

Yes, the migration I did had plenty of safety points (you can go back if something doesn't work) and, most of the time, the system was in a state able to survive an accidental reboot. Downtime was minimal, less than an hour in total (several reboots to validate configurations before proceeding).

I am quite pleased with the uneventful migration, but I planned it quite carefully. I was worried about hitting bugs in Solaris/ZFS, though. But it was very smooth.

The machine is hosted remotely but yes, I have remote KVM. I can't boot from remote media, but I have an OpenIndiana release on the SSD, with VirtualBox installed and the Solaris 10 Update 10 release ISO, just in case :-).

The only "suspicious" thing is that I keep "swap" (32 GB) and "dump" (4 GB) in the "data" zpool instead of in "system". It seems to work OK. Crossing my fingers for the next Live Upgrade :-).

I read your message after I had migrated, but it was very interesting. Thanks for taking the time to write it!

Have a nice 2012.

--
Jesus Cea Avion, jcea at jcea.es - http://www.jcea.es/
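For anyone repeating this, a quick way to confirm where swap and dump actually ended up after such a move (plain status commands, nothing specific to this setup):

  swap -l
  dumpadm
  zfs list -t volume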
Richard Elling
2012-Jan-10 20:32 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
On Jan 9, 2012, at 7:23 PM, Jesus Cea wrote:
> On 07/01/12 13:39, Jim Klimov wrote:
>> I have transitioned a number of systems roughly by the same
>> procedure as you've outlined. Sadly, my notes are not in English,
>> so they wouldn't be of much help directly;
>
> Yes, my Russian is rusty :-).
>
> I have bitten the bullet and spent 3-4 days doing the migration. I
> wrote up the details here:
>
> http://www.jcea.es/artic/solaris_zfs_split.htm
>
> The page is written in Spanish, but the terminal transcriptions should
> be useful for everybody.
>
> In the process, maybe somebody finds this interesting too:
>
> http://www.jcea.es/artic/zfs_flash01.htm

Google Translate works well for this :-) Thanks for posting!
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Jesus Cea
2012-Jan-11 03:38 UTC
[zfs-discuss] Thinking about splitting a zpool in "system" and "data"
On 10/01/12 21:32, Richard Elling wrote:
> On Jan 9, 2012, at 7:23 PM, Jesus Cea wrote:
[...]
>> The page is written in Spanish, but the terminal transcriptions
>> should be useful for everybody.
>>
>> In the process, maybe somebody finds this interesting too:
>>
>> http://www.jcea.es/artic/zfs_flash01.htm
>
> Google Translate works well for this :-) Thanks for posting! --
> richard

Talking about this, there is something that bugs me. For some reason, sync writes are written to the ZIL only if they are "small". Big writes are far slower, apparently bypassing the ZIL. Maybe it is a concern about disk bandwidth (because we would be writing the data twice), but that is only speculation. This happens even when the ZIL is on an SSD. I think ZFS should write sync writes to the SSD even if they are quite big (megabytes).

In the "zil.c" code I see things like:

"""
/*
 * Define a limited set of intent log block sizes.
 * These must be a multiple of 4KB. Note only the amount used (again
 * aligned to 4KB) actually gets written. However, we can't always just
 * allocate SPA_MAXBLOCKSIZE as the slog space could be exhausted.
 */
uint64_t zil_block_buckets[] = {
    4096,               /* non TX_WRITE */
    8192+4096,          /* data base */
    32*1024 + 4096,     /* NFS writes */
    UINT64_MAX
};

/*
 * Use the slog as long as the logbias is 'latency' and the current commit size
 * is less than the limit or the total list size is less than 2X the limit.
 * Limit checking is disabled by setting zil_slog_limit to UINT64_MAX.
 */
uint64_t zil_slog_limit = 1024 * 1024;
#define USE_SLOG(zilog) (((zilog)->zl_logbias == ZFS_LOGBIAS_LATENCY) && \
        (((zilog)->zl_cur_used < zil_slog_limit) || \
        ((zilog)->zl_itx_list_sz < (zil_slog_limit << 1))))
"""

I have 2 GB of ZIL on a mirrored SSD. I can randomly write to it at 240 MB/s, so I guess the sync-write restriction could be reexamined when ZFS is using a separate ZIL device with plenty of space to burn :-). Am I missing anything?

Could I safely change the value of "zil_slog_limit" in the kernel (via mdb) when using a separate ZIL device? Would it do what I expect?

My usual database block size is 64 KB... :-(. A write-ahead log write can easily be bigger than 128 KB (before and after data, plus some changes in the parent nodes). It seems faster to do several writes with several SYNCs than one big write with a final SYNC. That is quite counterintuitive. Am I hitting something else, like the "write throttle"?

PS: I am talking about Solaris 10 U10. My ZFS "logbias" attribute is "latency".

--
Jesus Cea Avion, jcea at jcea.es - http://www.jcea.es/
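For reference, the two usual ways to experiment with that tunable would look roughly like this (128 MB used as an arbitrary example value). Whether raising it is safe, and whether it alone makes large sync writes land on the slog (logbias and the write throttle also play a part), is exactly the open question above, so treat this as an untested sketch:

  # /etc/system entry, takes effect at the next boot (value in bytes)
  set zfs:zil_slog_limit = 0x8000000

  # or live, writing the 64-bit variable with mdb (reverts at reboot)
  echo 'zil_slog_limit/Z 0x8000000' | mdb -kw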