Lutz Schumann
2009-Dec-22 20:12 UTC
[zfs-discuss] Mirror of SAN Boxes with ZFS ? (split site mirror)
Hello, I'm thinking about a setup that looks like this:

- 2 headnodes with FC connectivity (OpenSolaris)
- 2 backend FC storages (disk shelves with RAID controllers, each presenting a huge 15 TB RAID5)
- 2 data centers (1 km apart, connected by dark fibre)
- one headnode and one storage in each data center

(Sorry for this ASCII art :)

      (Data Center 1)   <-- 1 km -->   (Data Center 2)
         (primary)                        (backup)

  [Disk Array 1 with RAID ctrl]    [Disk Array 2 with RAID ctrl]
  [ ------ LUN1 16 TB ------ ]     [ ------ LUN2 16 TB ------ ]
         |         \            /         |
         |           \        /           |
         |             \    /             |
     [ FABRIC 1 ]            [ FABRIC 2 ]
         |             /    \             |
         |           /        \           |
         |         /            \         |
  [ Osol HeadNode 1 ]              [ Osol HeadNode 2 ]
                                   [ ----- active ----- ]

Zpool "mytest" on HeadNode2:

    mytest
      mirror
        LUN1
        LUN2

Both headnodes can see both storages. The storages are connected to the hosts via SAN switches and two fabrics (a redundant multipath configuration). This is meant to be an active/passive setup with manual failover (a pool import in case of a site failure).

While thinking about this setup, some questions popped into my mind. Most of them concern resilvering.

SAS analogy: with OpenSolaris on a simple server with a SAS backplane and SAS disks, if I pull a disk, the failure is detected and the pool continues in degraded mode. If I plug the SAS disk back in, it is resilvered automatically, and only the deltas are resilvered.

In the FC example, however, there are several interesting corner cases of outage, and I'm not sure how ZFS would react (unfortunately I do not have the boxes here to test).

Failure scenarios:

a) Temporary storage failure (e.g. Disk Array 1 rebooting). I expect the pool to continue in degraded mode. When the storage comes back up, I'm not sure whether the disks are automatically hot-added to the OS, so I don't know whether automatic resilvering takes place.

b) Permanent storage failure (e.g. Disk Array 1 burning down, or a double disk failure in the RAID5). I expect the pool to continue in degraded mode. When a replacement storage is put in, no automatic resilvering takes place (no vdev label is found); the LUN has to be replaced manually.

c) Split brain, pool not imported (e.g. the connection between the sites fails and the administrator does not issue a pool import on HeadNode2). This case is similar to a).

d) Short failure of Data Center 1 (e.g. a brief power failure in data center 1, with no manual failover to data center 2 by the administrator). Actually, I have no idea what happens :)

e) Power outage in Data Center 1 (e.g. a long power failure in data center 1; the administrator performs a pool import on HeadNode2). Actually, I have no idea what happens here either :)

f) Split brain, pool imported on both nodes (e.g. the connection between the sites fails and the administrator issues a pool import on HeadNode2). This is the critical case: the pool is active on two nodes, with HeadNode1 using LUN1 and HeadNode2 using LUN2. If automatic resilvering takes place, in which direction will it resilver? Will the nodes overwrite each other's data in the backend? No idea.

My questions: Has anyone set up something like this and can give some insight into how ZFS behaves in the cases above? Is this a safe setup (is ZFS data integrity guaranteed)? How does resilvering determine the direction in which it should happen?

I would appreciate any input on this. Regards,
--
This message posted from opensolaris.org
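For concreteness, here is a minimal sketch of what the pool creation and the manual failover could look like. The device names below are placeholders, not taken from a real configuration:

    # On the active headnode: build the pool as a mirror across the two
    # SAN LUNs (c2tLUN1d0 / c3tLUN2d0 stand in for the real multipathed
    # device names)
    zpool create mytest mirror c2tLUN1d0 c3tLUN2d0

    # Manual failover after a site failure, run on the surviving headnode;
    # -f is needed because the pool was never cleanly exported
    zpool import -f mytest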
Lutz Schumann
2010-Jan-20 11:34 UTC
[zfs-discuss] Mirror of SAN Boxes with ZFS ? (split site mirror)
Actually I found some time (and a reason) to test this.

Environment:
- 1 OpenSolaris server
- 1 SLES10 iSCSI target
- two LUNs exported via iSCSI to the OpenSolaris server

I did some resilver tests to see how ZFS resilvers devices.

Prep:

osol: create a pool (myiscsi) with one mirror pair made from the two iSCSI backend disks of the SLES10 target

Test:

osol: both disks ok
osol: txg in uberblock of pool = 86
sles10: remove one disk (lun=1)
osol: disk is detected as failed, pool degraded
osol: write with oflag=direct; sync multiple times to the pool
osol: create fs myiscsi/test
osol: txg in uberblock = 107

osol: power off (hard)
sles10: add lun 1 again (the one with txg 86)
sles10: remove lun 0 (the one with txg 107)
osol: power on

osol: txg in uberblock = 92
osol: filesystem myiscsi/test does not exist
osol: create fs myiscsi/mytest_old
osol: txg in uberblock = 96

osol: power off (hard)
sles10: add lun 0 again (the one with txg 107)
sles10: both luns are there
osol: power on

osol: resilvering happens automatically

osol: txg in uberblock = 112
osol: filesystem myiscsi/test exists

... then the same thing the other way around, to see whether the resilver direction is persistent ...

osol: both disks ok
osol: txg in uberblock = 120
sles10: remove one disk (lun=0)
osol: write with oflag=sync; sync multiple times
osol: create fs myiscsi/test
osol: txg in uberblock = 142

osol: power off (hard)
sles10: add lun 0 again (the one with txg 120)
sles10: remove lun 1 (the one with txg 142)
osol: boot

osol: txg in uberblock = 127
osol: filesystem myiscsi/test does not exist
osol: create fs myiscsi/mytest_old
osol: txg in uberblock = 133
osol: power off

sles10: add lun 1 again (the one with txg 142)
sles10: both luns are there

osol: boot
osol: resilvering happens automatically

osol: txg in uberblock = 148
osol: filesystem myiscsi/test exists

From these tests it seems that the latest txg always wins. Practically, this means the JBOD with the most changes (in terms of transactions) will always sync over the one with the fewest modifications.

Could someone confirm this assumption?

Could someone explain how the resilvering direction is selected?

Regards,
Robert

p.s. I did not test split brain, but that is next. (The planned setup is clustered and uses SAS rather than iSCSI, so split brain is more academic in this case.)
--
This message posted from opensolaris.org
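The txg observations above can be reproduced without an iSCSI target at all. A sketch using file-backed vdevs as stand-ins for the two LUNs (the paths are arbitrary):

    # Two file vdevs stand in for the iSCSI LUNs
    mkfile 256m /var/tmp/lun0 /var/tmp/lun1
    zpool create myiscsi mirror /var/tmp/lun0 /var/tmp/lun1

    # Show the pool's current uberblock, including its txg
    zdb -u myiscsi

    # Show the txg recorded in each half's vdev labels; after one side
    # has missed some writes, the two values diverge
    zdb -l /var/tmp/lun0 | grep txg
    zdb -l /var/tmp/lun1 | grep txg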
Richard Elling
2010-Jan-20 22:04 UTC
[zfs-discuss] Mirror of SAN Boxes with ZFS ? (split site mirror)
Comment below. Perhaps someone from Sun's ZFS team can fill in the blanks, too.

On Jan 20, 2010, at 3:34 AM, Lutz Schumann wrote:

> Actually I found some time (and a reason) to test this.
>
> Environment:
> - 1 OpenSolaris server
> - 1 SLES10 iSCSI target
> - two LUNs exported via iSCSI to the OpenSolaris server
>
> I did some resilver tests to see how ZFS resilvers devices.
>
> [... detailed test log snipped; see the full message above ...]
>
> From these tests it seems that the latest txg always wins.

Yes, this is by design. txg IDs are expected to be monotonically increasing. The largest txg ID corresponds to the latest txg.

> Practically, this means the JBOD with the most changes (in terms of
> transactions) will always sync over the one with the fewest
> modifications.
>
> Could someone confirm this assumption?
>
> Could someone explain how the resilvering direction is selected?
>
> Regards,
> Robert
>
> p.s. I did not test split brain, but that is next. (The planned setup
> is clustered and uses SAS rather than iSCSI, so split brain is more
> academic in this case.)

Split brain is a cluster feature, and the most common methods of handling split brain do not operate at the file system level. Could you share your test plan?
 -- richard
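For what it's worth, the label txg of each mirror half can be inspected from either headnode before forcing an import. A sketch (the device paths are placeholders):

    # List pools that are visible to this node but not imported
    zpool import

    # Compare the txg stored in the vdev labels of each LUN; the half
    # with the larger txg carries the newer pool state
    zdb -l /dev/dsk/c2tLUN1d0s0 | grep txg
    zdb -l /dev/dsk/c3tLUN2d0s0 | grep txg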