Hi all,

I'm just wondering (I figure you can do this, but I don't know what hardware and stuff I would need) whether I can set up a mirror of a raidz zpool across a network.

Basically, the setup is this: a large volume of hi-def video is being streamed from a camera onto an editing timeline, and written to a network share. Because of the large amounts of data, ZFS is a really good option for us. But we need a backup. We need to do it on generic hardware (I was thinking AMD64 boxes with arrays of large 7200 rpm hard drives), so I'm planning to have one box mirroring the other. They will be connected by gigabit Ethernet. So my question is: how do I mirror one raidz array across the network to the other?

Thanks for all your help,
Mark
Mark wrote:
> I'm just wondering (I figure you can do this, but I don't know what hardware and stuff I would need) whether I can set up a mirror of a raidz zpool across a network.
> [...]
> So my question is: how do I mirror one raidz array across the network to the other?

rsync? zfs send/recv? AVS? iSCSI targets on the two boxes? Lots of ways to do it. Depends what your definition of backup is. Time based? Extra redundancy?
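[Of the options above, zfs send/recv is the simplest to sketch. A minimal illustration of the initial full copy, with made-up pool, filesystem and host names ("tank/video", "backuphost", "backup/video"):

    # take a snapshot and send a full copy of it to the backup box;
    # backup/video must not already exist on the target pool
    zfs snapshot tank/video@base
    zfs send tank/video@base | ssh backuphost zfs receive backup/video

After that, only incremental sends of the changes since the last snapshot are needed.]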
On Sun, Aug 19, 2007 at 05:45:18PM -0700, Mark wrote:
> Basically, the setup is this: a large volume of hi-def video is being streamed from a camera onto an editing timeline, and written to a network share. Because of the large amounts of data, ZFS is a really good option for us. But we need a backup. We need to do it on generic hardware (I was thinking AMD64 boxes with arrays of large 7200 rpm hard drives), so I'm planning to have one box mirroring the other. They will be connected by gigabit Ethernet. So my question is: how do I mirror one raidz array across the network to the other?

One big decision you need to make in this scenario is whether you want true synchronous replication, or whether asynchronous replication, possibly with some time bound, is acceptable. With the former, each byte must traverse the network before the write is acknowledged to the client; with the latter, data is written locally and transmitted shortly afterwards. Synchronous replication obviously imposes a much larger performance hit, but asynchronous replication means you may lose data from some recent period (though the data will always be consistent).

Adam

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
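[As an illustration of the time-bound asynchronous approach Adam describes, a periodic incremental zfs send/recv loop might look roughly like this. The pool, filesystem and host names are placeholders, and this is a sketch rather than a hardened script; it assumes an initial full send/receive of tank/video@base has already been done:

    #!/bin/sh
    # Minimal incremental replication loop (illustration only).
    PREV=base
    while true; do
        NOW=snap.`date +%Y%m%d%H%M%S`
        zfs snapshot tank/video@$NOW
        # send only the changes since the previous snapshot;
        # -F rolls back any stray changes on the backup side
        zfs send -i tank/video@$PREV tank/video@$NOW | \
            ssh backuphost "zfs receive -F backup/video"
        zfs destroy tank/video@$PREV
        PREV=$NOW
        sleep 300    # the backup box lags by at most ~5 minutes plus transfer time
    done
]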
Torrey McMahon wrote:
> AVS?

Jim Dunham will probably shoot me, or worse, but I recommend thinking twice about using AVS for ZFS replication. Basically, you only have a few options:

1) Using a battery-buffered hardware RAID controller, which leads to bad ZFS performance in many cases,
2) Building three-way mirrors to avoid complete data loss in several disaster scenarios due to missing ZFS recovery mechanisms like `fsck`, which makes AVS/ZFS-based solutions quite expensive,
3) Additionally using another form of backup, e.g. tapes.

For instance, one scenario which made me think: imagine you have an X4500. 48 internal disks, 500 GB each. This would lead to a ZFS pool on 40 disks (you need 1 for the system, plus 3x RAID 10 for the bitmap volumes, otherwise your performance will be very bad, plus 2x HSP). Using 40 disks leads to a total of 40 separate replications. Now imagine the following scenarios:

a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB will be rebuilt. These 500 GB are synced over a single 1 GBit/s crossover cable. This takes a bit of time and is 100% unnecessary - and it will become much worse in the future, because disk capacities rocket up into the sky while performance isn't improved as much. During this time, your service misses redundancy. And we're not talking about some minutes. Well, and now try to imagine what will happen if another disk fails during this rebuild, this time in the secondary ...

b) A disk in the secondary fails. What happens now? No HSP will jump in on the secondary, because the zpool isn't imported and ZFS doesn't know about the failure. Instead, you'll end up with 39 active replications instead of 40; the one which replicates to the failed drive becomes inactive. But ... oh damn, the zpool isn't mounted on the secondary host, so ZFS doesn't report the drive failure to our server monitoring.

That can be funny. The only way to become aware of the problem that I found after a minute of thinking was asking sndradm about the health status - which would lead to a false alarm on Host A, because the failed disc is in Host B, and operators are usually not bright enough to change the disc in Host B when the alarm they get is about Host A. But even if everything works, what will happen if the primary fails before an administrator has fixed the problem, the missing replication is running again and the replacement disc has been completely synced? "Hello, kernel panic", and "Goodbye, 12 TB of data".

c) You *must* force every single `zpool import <zpool>` on the secondary host. Always. Because you usually need your secondary host after your primary crashed; you won't have the chance to export your zpool on the primary first - and if you do, you don't need AVS at all. Bring some Kleenex to get rid of the sweat on your forehead when you have to switch to your secondary host, because a single mistake (like forgetting to put the secondary host into logging mode manually before you try to import the zpool) will lead to a complete data loss. I bet you won't even trust your own failover scripts.

Use AVS and ZFS together. I use it myself. But I made sure that I know what I'm doing. Most people probably don't.

Btw: I have to admit that I haven't tried the newest Nevada builds during the tests. It's possible that AVS and ZFS work better together than they did under Solaris 10 11/06 and AVS 4.0. But there's a reason I haven't tried.
It's because Sun Cluster 3.2 instantly crashes on Thumpers with SATA-related kernel panics, and the OpenHA Cluster isn't available yet.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
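[For reference, the manual failover sequence Ralf warns about - put the secondary into logging mode first, then force the import - would look roughly like this on the secondary host. The group and pool names are placeholders, and the exact sndradm invocation depends on how the sets are configured (see sndradm(1M)); this is a sketch of the procedure, not a tested failover script:

    # on the secondary host, after the primary has died
    sndradm -g zpool-group -n -l     # drop the whole replication group into logging mode
    zpool import -f tank             # the pool was never exported, so the import must be forced
]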
Ralf,

> Torrey McMahon wrote:
>> AVS?
>>
> Jim Dunham will probably shoot me, or worse, but I recommend thinking twice about using AVS for ZFS replication.

That's why they call this a discussion group, as it encourages differing opinions.

> Basically, you only have a few options:
>
> 1) Using a battery-buffered hardware RAID controller, which leads to bad ZFS performance in many cases,
> 2) Building three-way mirrors to avoid complete data loss in several disaster scenarios due to missing ZFS recovery mechanisms like `fsck`, which makes AVS/ZFS-based solutions quite expensive,
> 3) Additionally using another form of backup, e.g. tapes.
>
> For instance, one scenario which made me think: imagine you have an X4500. 48 internal disks, 500 GB each. This would lead to a ZFS pool on 40 disks (you need 1 for the system, plus 3x RAID 10 for the bitmap volumes, otherwise your performance will be very bad, plus 2x HSP). Using 40 disks leads to a total of 40 separate replications. Now imagine the following scenarios:

This is just one scenario for deploying the 48 disks of an X4500. The blog listed below offers another option: by mirroring the bitmaps across all available disks, it brings the total disk count back up to 46 (or 44, with 2x HSP), leaving the other two for a mirrored root disk.

http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless

Yes, provisioning one slice for bitmaps and another slice for ZFS's vdevs on the same internal disk may introduce out-of-band head seeks between bitmap I/O and ZFS I/O, and giving ZFS only a slice of a disk turns off ZFS's ability to enable the disk's write cache. All things considered, this is the cost of host-based replication.

> a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB will be rebuilt. These 500 GB are synced over a single 1 GBit/s crossover cable. This takes a bit of time and is 100% unnecessary

But it is necessary! As soon as the HSP disk kicks in, not only is the disk being rebuilt by ZFS, but newly allocated ZFS data will also be written to this HSP disk. So although it may appear that there is wasted replication cost (which there is), the instant that ZFS writes new data to this HSP disk, the old replicated disk is instantly inconsistent, and there is no means to fix that.

For all that is good (or bad) about AVS, the fact that it works by simply interposing itself on the Solaris I/O data path is great, as it works with any Solaris block storage. Of course this also means that it has no filesystem, database or hot-spare knowledge, which means that at times AVS will be inefficient at what it does.

> - and it will become much worse in the future, because disk capacities rocket up into the sky while performance isn't improved as much.

Larger disk capacities are no worse in this scenario than they are with controller-based replication, ZFS send / receive, etc. Actually it is quite efficient. If the disk that failed was only 5% full, then when the HSP disk is switched in and rebuilt, only 5% of the entire disk has to be replicated. And if, at the time ZFS and AVS were deployed on this server, the HSP disks (containing uninitialized data) were also configured as equal with "sndradm -E ...", then there would be no initial replication cost, and when a spare is swapped into use, only the cost of replicating the ZFS data actually in use.

> During this time, your service misses redundancy.

Absolutely not.
If all of the in-use ZFS disks and the ZFS HSP disks are configured under AVS, there is never a time of lost redundancy.

> And we're not talking about some minutes. Well, and now try to imagine what will happen if another disk fails during this rebuild, this time in the secondary ...

If I were truly counting on AVS, I would be glad this happened! Getting replication configured right, be it AVS or some other option, means that when disks, systems, networks, etc. fail, there is always a period of degraded system performance, but that is better than no system performance.

> b) A disk in the secondary fails. What happens now? No HSP will jump in on the secondary, because the zpool isn't imported and ZFS doesn't know about the failure. Instead, you'll end up with 39 active replications instead of 40; the one which replicates to the failed drive becomes inactive. But ... oh damn, the zpool isn't mounted on the secondary host, so ZFS doesn't report the drive failure to our server monitoring.

But if a disaster happened on the primary node, and a decision was made to import the ZFS storage pool on the secondary, ZFS would detect the inconsistency, mark the drive as failed, and swap in the secondary's HSP disk. Later, when the primary site comes back and a reverse synchronization is done to restore the writes that happened on the secondary, the primary ZFS file system will become aware that a HSP swap occurred and continue on right where the secondary node left off.

> That can be funny. The only way to become aware of the problem that I found after a minute of thinking was asking sndradm about the health status - which would lead to a false alarm on Host A, because the failed disc is in Host B, and operators are usually not bright enough to change the disc in Host B when the alarm they get is about Host A. But even if everything works, what will happen if the primary fails before an administrator has fixed the problem, the missing replication is running again and the replacement disc has been completely synced? "Hello, kernel panic", and "Goodbye, 12 TB of data".

See above, but yes, there is a need for a system administrator to monitor SNDR replication.

> c) You *must* force every single `zpool import <zpool>` on the secondary host. Always.

Correct, but this is the case even without AVS! If one configured ZFS on SAN-based storage and the primary node crashed, one would need to force every single `zpool import <zpool>`. This is not an AVS issue, but a ZFS protection.

> Because you usually need your secondary host after your primary crashed; you won't have the chance to export your zpool on the primary first - and if you do, you don't need AVS at all. Bring some Kleenex to get rid of the sweat on your forehead when you have to switch to your secondary host, because a single mistake (like forgetting to put the secondary host into logging mode manually before you try to import the zpool) will lead to a complete data loss.

Correct, but this is the case even without AVS! Take the same SAN-based storage scenario above, go to a secondary system on your SAN, and force every single `zpool import <zpool>`.
In the case of a SAN, where the same physical disk would be written to by both hosts, you would likely get complete data loss; but with AVS, where ZFS is actually on two physical disks and AVS is tracking writes, even if they are inconsistent writes, AVS can and will recover if an update sync is done.

> I bet you won't even trust your own failover scripts.
>
> Use AVS and ZFS together. I use it myself. But I made sure that I know what I'm doing. Most people probably don't.

You are quite correct in that although ZFS is intuitively easy to use, AVS is painfully complex. Of course the mindsets of AVS and ZFS are as distant apart as they are in the alphabet. :-O

> Btw: I have to admit that I haven't tried the newest Nevada builds during the tests. It's possible that AVS and ZFS work better together than they did under Solaris 10 11/06 and AVS 4.0. But there's a reason I haven't tried. It's because Sun Cluster 3.2 instantly crashes on Thumpers with SATA-related kernel panics, and the OpenHA Cluster isn't available yet.

With AVS in Nevada, there is now an opportunity for leveraging the ease of use of ZFS with AVS. Being also the iSCSI Target project lead, I see a lot of value in the ZFS option "set shareiscsi=on" for getting end users into using iSCSI. I would like to see "set replication=AVS:<secondary host>", configuring a locally named ZFS storage pool to the same-named pair on some remote host. Starting down this path would afford things like ZFS replication monitoring, similar to what ZFS does with each of its own vdevs.

Jim
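[For anyone unfamiliar with the shareiscsi option Jim mentions, exporting a ZFS volume as an iSCSI target is roughly this simple; the pool and volume names are just examples:

    # create a 100 GB zvol and export it as an iSCSI target
    zfs create -V 100g tank/iscsivol
    zfs set shareiscsi=on tank/iscsivol
    iscsitadm list target            # verify the target was created
]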
Jim Dunham wrote:
> This is just one scenario for deploying the 48 disks of an X4500. The blog listed below offers another option: by mirroring the bitmaps across all available disks, it brings the total disk count back up to 46 (or 44, with 2x HSP), leaving the other two for a mirrored root disk.
> http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless

I know your blog entry, Jim. And I still admire your skills in calculations within shell scripts (I just gave each soft partition 100 megabytes of space, finished ;-) ). But after some thinking, I decided against using a slice on the same disk for bitmaps. Not just because of performance issues - that's not a valid reason. Again, the disaster scenarios make me think; in this case, the complexity of administration.

You know, the x64 Solaris boxes are basically competing against Linux boxes all day. The X4500 is a very attractive replacement for the typical Linux file server, consisting of a server, a hardware RAID controller and several cheap and stupid fibre-channeled SATA JBODs for less than $5,000 each. Double this to have a cluster. In our case, the X4500 is competing against more than 60 of those clusters with a total of 360 JBODs. The X4500's main advantage isn't the price per gigabyte (the price is exactly the same!), as most members of the sales department may expect; the real advantage is the gigabytes per rack unit. But there are several disadvantages, for instance: not being able to access the hard drives from the front, needing a ladder and a screwdriver instead, or - most important for the typical data center - the *operator* not being able to replace a disk the way he's used to: pull the old disc out, put the new disc in, resync starts, finished. You'll always have to wait until the next morning, until a Solaris administrator is available again (which may impact your high-availability concepts), or keep a Solaris administrator in the company 24/7 (which raises the TCO of the Solaris boxes).

Well, and what I want to say: if you place the bitmap volume on the same disk, this situation gets even worse. The problem is the involvement of SVM. Having to build the soft partition again makes the handling even more complex and the case harder for operators to handle. It's the best way to make sure that the disk will be replaced, but not added to the zpool, during the night - and replacing it during regular working hours isn't an option either, because syncing 500 GB over a 1 GBit/s interface during daytime just isn't possible without putting the guaranteed service times at risk. Having to take care of soft partitions just isn't idiot-proof enough. And *poof*, there's a good chance the TCO of an X4500 is considered too high.

>> a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB will be rebuilt. These 500 GB are synced over a single 1 GBit/s crossover cable. This takes a bit of time and is 100% unnecessary
>
> But it is necessary! As soon as the HSP disk kicks in, not only is the disk being rebuilt by ZFS, but newly allocated ZFS data will also be written to this HSP disk. So although it may appear that there is wasted replication cost (which there is), the instant that ZFS writes new data to this HSP disk, the old replicated disk is instantly inconsistent, and there is no means to fix that.

It's necessary from your point of view, Jim. But not in the minds of the customers. Even worse, it could be considered a design flaw - not in AVS, but in ZFS.
Just have a look how the usual Linux dude works. He doesn't use AVS, he uses a kernel module called DRBD. It does basically the same thing: it replicates one raw device to another over a network interface, like AVS does. But the Linux dude has one advantage: he doesn't have ZFS. Yes, as impossible as it may sound, it is an advantage. Why? Because he never has to mirror 40 or 46 devices, because his lame file systems depend on a hardware RAID controller! Same goes for UFS, of course. There's only ONE replicated device, no matter how many discs are involved. And so it's definitely NOT necessary to sync a disc when a HSP kicks in, because the disc failure will never be reported to the host; it's handled by the RAID controller. As a result, no replication takes place, because AVS simply isn't involved. We even tried to deploy ZFS upon SVM RAID 5 stripes to get rid of this problem, just to learn how much the RAID 5 performance of SVM sucks ... a cluster of six USB sticks was faster than the Thumpers.

I consider this a big design flaw of ZFS. I'm not very familiar with the code, but I still have hope that there'll be a parameter which allows me to get rid of the cache flushes. ZFS, and the X4500, are typical examples of different departments not really working together, e.g. they have a wonderful file system, but there is no storage which supports it. Or a great X4500, an 11-24 TB file server for $40,000, but no options to make it highly available like the $1,000 boxes. AVS is, in my opinion, clearly one of the components which suffers from this. The Sun marketing department and Jonathan still have a long way to go. But, on the other hand, difficult customers like me and my company are always happy to point out some difficulties and to help resolve them :-)

> For all that is good (or bad) about AVS, the fact that it works by simply interposing itself on the Solaris I/O data path is great, as it works with any Solaris block storage. Of course this also means that it has no filesystem, database or hot-spare knowledge, which means that at times AVS will be inefficient at what it does.

I don't think that there's a problem with AVS and its concepts. In my opinion, ZFS has to do the homework. At least it should be aware of the fact that AVS is involved - or has been, when it comes to recovering data from a zpool. Simply saying "the discs belong exclusively to the local ZFS, and no other mechanism can write onto the discs, so let's panic and lose all the terabytes of important data" just isn't valid. It may be easy and comfortable for the ZFS development department, but it doesn't reflect the real world - and not even Sun's software portfolio. The AVS integration into Nevada makes this even worse, and I hope there'll be something like fsck in the future, something which allows me to recover the files with correct checksums from a zpool, instead of simply hearing the sales droids repeat "There can't be any errors, NEVER!" over and over again :-)

>> - and it will become much worse in the future, because disk capacities rocket up into the sky while performance isn't improved as much.
>
> Larger disk capacities are no worse in this scenario than they are with controller-based replication, ZFS send / receive, etc. Actually it is quite efficient. If the disk that failed was only 5% full, then when the HSP disk is switched in and rebuilt, only 5% of the entire disk has to be replicated.
> And if, at the time ZFS and AVS were deployed on this server, the HSP disks (containing uninitialized data) were also configured as equal with "sndradm -E ...", then there would be no initial replication cost, and when a spare is swapped into use, only the cost of replicating the ZFS data actually in use.

That's interesting. Because, together with your "data and bitmap volume on the same disk" scenario, the bitmap volume would be lost too. A full sync of the disc would then be necessary, even if only 5% is in use. Am I correct?

>> During this time, your service misses redundancy.
>
> Absolutely not. If all of the in-use ZFS disks and the ZFS HSP disks are configured under AVS, there is never a time of lost redundancy.

I'm sure there is, as soon as a disc crashes in the secondary and the primary disc is in logging mode for several hours. I bet you'll lose your HA as soon as the primary crashes before the secondary is in sync again, because the global ZFS metadata wasn't logged, but updated. I think to avoid this, the primary would have to send the entire replication group into logging mode - but then it would get even worse, because you'd lose your redundancy for days until the secondary is 100% in sync again and the regular replicating state becomes active (a full sync of an X4500 takes at least 5 days, and only when you don't have Sun Cluster with exclusive interconnect interfaces up and running).

Linux/DRBD: some data will be missing and you'll have fun fsck'ing for two hours. ZFS: the secondary is not consistent, the zpool is FAULTED, all data is lost, you have a downtime while recovering from backup tapes, plus a week with reduced redundancy because of the time needed for resyncing the restored data. You want three cluster nodes in most deployment scenarios, not just two, believe me ;-) It doesn't matter much if you only host a few easy-to-restore videos. But I'm talking about file servers which host several billion inodes, like the file servers which host the mail headers, bodies and attachments for a million Yahoo users, a terabyte of moving data each day which cannot be backed up to tape.

>> And we're not talking about some minutes. Well, and now try to imagine what will happen if another disk fails during this rebuild, this time in the secondary ...
>
> If I were truly counting on AVS, I would be glad this happened! Getting replication configured right, be it AVS or some other option, means that when disks, systems, networks, etc. fail, there is always a period of degraded system performance, but that is better than no system performance.

That's correct. But don't forget that it's always a very small step from "degraded" to "faulted". In particular when it comes to high-availability scenarios in data centers, because in such scenarios you'll always have to rely on other people with less know-how and motivation. It's easy to accept a degraded state as long as you're in your office.
But with an X4500, your degraded state may potentially last longer than a weekend, and when you're directly responsible for the mail of millions of users and you know that any non-availability will place your name on Slashdot (or the name of your CEO, which equals placing your head on a scaffold), I'm sure you'll think twice about using ZFS with AVS or letting the Linux dudes continue to play with their inefficient boxes :-)

> But if a disaster happened on the primary node, and a decision was made to import the ZFS storage pool on the secondary, ZFS would detect the inconsistency, mark the drive as failed, and swap in the secondary's HSP disk. Later, when the primary site comes back and a reverse synchronization is done to restore the writes that happened on the secondary, the primary ZFS file system will become aware that a HSP swap occurred and continue on right where the secondary node left off.

I'll try that as soon as I have a chance again (which means: as soon as Sun gets Sun Cluster working on an X4500).

>> c) You *must* force every single `zpool import <zpool>` on the secondary host. Always.
>
> Correct, but this is the case even without AVS! If one configured ZFS on SAN-based storage and the primary node crashed, one would need to force every single `zpool import <zpool>`. This is not an AVS issue, but a ZFS protection.

Right. Too bad ZFS reacts this way. I have to admit that you made me nervous once, when you wrote that forcing zpool imports would be a bad idea ...

[X] Zfsck now! Let's organize a petition. :-)

> Correct, but this is the case even without AVS! Take the same SAN-based storage scenario above, go to a secondary system on your SAN, and force every single `zpool import <zpool>`.

Yes, but on a SAN I don't have to worry about zpool inconsistency, because the zpool always resides on the same devices.

> In the case of a SAN, where the same physical disk would be written to by both hosts, you would likely get complete data loss; but with AVS, where ZFS is actually on two physical disks and AVS is tracking writes, even if they are inconsistent writes, AVS can and will recover if an update sync is done.

My problem is that there's no ZFS mechanism which allows me to verify the zpool consistency before I actually try to import it. Like I said before: AVS does it right, it's just ZFS that doesn't (and otherwise it wouldn't make sense to discuss it on this mailing list anyway :-) ).

It could really help me with AVS if there were something like "zpool check <zpool>", something for checking a zpool before an import. I could have a cron job which puts the secondary host into logging mode, runs a "zpool check" and continues with the replication a few hours afterwards. That would let me sleep better, and I wouldn't have to pray to the IT gods before an import. You know, I saw literally *hundreds* of kernel panics during my tests; that made me nervous. I have scripts which do the job now, but I saw the risks and the things which can go wrong if someone else without my experience does it (like the infamous "forgetting to manually place the secondary in logging mode before trying to import a zpool").

> You are quite correct in that although ZFS is intuitively easy to use, AVS is painfully complex. Of course the mindsets of AVS and ZFS are as distant apart as they are in the alphabet. :-O

AVS was easy to learn and isn't very difficult to work with. All you need is 1 or 2 months of testing experience.
Very easy with UFS.

> With AVS in Nevada, there is now an opportunity for leveraging the ease of use of ZFS with AVS. Being also the iSCSI Target project lead, I see a lot of value in the ZFS option "set shareiscsi=on" for getting end users into using iSCSI.

Too bad the X4500 has too few PCI slots to consider buying iSCSI cards. The two existing slots are already needed for the Sun Cluster interconnect. I think iSCSI won't be a real option unless the servers are shipped with it onboard, as has been done in the past with SCSI or Ethernet interfaces.

> I would like to see "set replication=AVS:<secondary host>", configuring a locally named ZFS storage pool to the same-named pair on some remote host. Starting down this path would afford things like ZFS replication monitoring, similar to what ZFS does with each of its own vdevs.

Yes! Jim, I think we'll become friends :-) Who do I have to send the bribe money to?

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
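[The periodic verification Ralf wishes for can be approximated today with `zpool scrub` standing in for the hypothetical "zpool check". A rough, untested sketch of such a cron job on the secondary host; group, pool and the exact sndradm invocations are placeholders and depend on the local set configuration (see sndradm(1M)):

    # run on the secondary host during a quiet period
    sndradm -g zpool-group -n -l       # drop the replica into logging mode
    zpool import -f tank               # forced import of the replicated pool
    zpool scrub tank                   # verify checksums across the whole pool
    # ... wait for the scrub to finish, then inspect `zpool status -x tank` ...
    zpool export tank
    sndradm -g zpool-group -n -u       # resume replication with an update sync
]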
Wow, I just opened a whole can of worms there that went flying over my head. Thanks for all the information! I'll see if I can plough through it all :)

I'm guessing that I might be able to do asynchronous, but the problem is that the video is going to be streaming from a camera in real time, and it's only going to the file server. Also, this isn't like security footage or something. It's for a feature documentary, so we can't afford to lose any of this footage, and we really only get one try at it. Hence mirroring, so at least we have two copies of it. I'm guessing that AVS is some kind of low-level data replication software; correct me if I'm wrong.

I suppose the other thing is that we need sustained transfer speeds of between 50 MB/s and about 300 MB/s, depending on what video format we choose. So I'm guessing that with those speeds only fibre is going to cut it, really. Is this correct?

Thanks again for all your help.

Cheers,
Mark
Ralf,

> Well, and what I want to say: if you place the bitmap volume on the same disk, this situation gets even worse. The problem is the involvement of SVM. Having to build the soft partition again makes the handling even more complex and the case harder for operators to handle. It's the best way to make sure that the disk will be replaced, but not added to the zpool, during the night - and replacing it during regular working hours isn't an option either, because syncing 500 GB over a 1 GBit/s interface during daytime just isn't possible without putting the guaranteed service times at risk. Having to take care of soft partitions just isn't idiot-proof enough. And *poof*, there's a good chance the TCO of an X4500 is considered too high.

You are quite correct in that increasing the number of data path technologies (ZFS + AVS + SVM) increases the TCO, as the skills required by everyone involved must increase proportionately.

For the record, using ZFS zvols for bitmap volumes does not scale, as the overhead of bit flipping is way too many I/Os for raidz or raidz2 storage pools, and even for a mirrored storage pool it is high, as the COW semantics of ZFS make the I/O cost too high.

>>> a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB will be rebuilt. These 500 GB are synced over a single 1 GBit/s crossover cable. This takes a bit of time and is 100% unnecessary
>>
>> But it is necessary! As soon as the HSP disk kicks in, not only is the disk being rebuilt by ZFS, but newly allocated ZFS data will also be written to this HSP disk. So although it may appear that there is wasted replication cost (which there is), the instant that ZFS writes new data to this HSP disk, the old replicated disk is instantly inconsistent, and there is no means to fix that.
>
> It's necessary from your point of view, Jim. But not in the minds of the customers. Even worse, it could be considered a design flaw - not in AVS, but in ZFS.

I wouldn't go so far as to say it is a design flaw. The fact that AVS works with ZFS, and vice versa, without either having knowledge of the other's presence says a lot for the I/O architecture of Solaris. If there is a compelling advantage to interoperate, the OpenSolaris community as a whole is free to propose a project, gather community interest, and go from there. The potential of OpenSolaris is huge, especially when it is riding a technology wave like the one created by the X4500 and ZFS.

> Just have a look how the usual Linux dude works. He doesn't use AVS, he uses a kernel module called DRBD. It does basically the same thing: it replicates one raw device to another over a network interface, like AVS does. But the Linux dude has one advantage: he doesn't have ZFS. Yes, as impossible as it may sound, it is an advantage. Why? Because he never has to mirror 40 or 46 devices, because his lame file systems depend on a hardware RAID controller! Same goes for UFS, of course. There's only ONE replicated device, no matter how many discs are involved. And so it's definitely NOT necessary to sync a disc when a HSP kicks in, because the disc failure will never be reported to the host; it's handled by the RAID controller. As a result, no replication takes place, because AVS simply isn't involved. We even tried to deploy ZFS upon SVM RAID 5 stripes to get rid of this problem, just to learn how much the RAID 5 performance of SVM sucks ...
> a cluster of six USB sticks was faster than the Thumpers.

Instead of using SVM for RAID 5, to keep the volume count low, consider concatenating 8 devices (RAID 0) into each of 5 separate SVM volumes, then configuring both a ZFS raidz storage pool and AVS on these 5 volumes (see the sketch at the end of this message). This prevents SVM from performing software RAID 5 - RAID 0 is a low-overhead pass-through for SVM - plus, prior to giving the entire SVM volume to ZFS, one can also take the AVS bitmaps from this pool.

> I consider this a big design flaw of ZFS. I'm not very familiar with the code, but I still have hope that there'll be a parameter which allows me to get rid of the cache flushes. ZFS, and the X4500, are typical examples of different departments not really working together, e.g. they have a wonderful file system, but there is no storage which supports it. Or a great X4500, an 11-24 TB file server for $40,000, but no options to make it highly available like the $1,000 boxes. AVS is, in my opinion, clearly one of the components which suffers from this. The Sun marketing department and Jonathan still have a long way to go. But, on the other hand, difficult customers like me and my company are always happy to point out some difficulties and to help resolve them :-)

Sun does recognize the potential of both the X4500 and ZFS, and also the difficulties (and problems) of combining them. It would be great if there were pre-existing technology (hardware, software, or both) that just made this high-availability issue go away, without adding any complexity.

> I don't think that there's a problem with AVS and its concepts. In my opinion, ZFS has to do the homework. At least it should be aware of the fact that AVS is involved - or has been, when it comes to recovering data from a zpool. Simply saying "the discs belong exclusively to the local ZFS, and no other mechanism can write onto the discs, so let's panic and lose all the terabytes of important data" just isn't valid. It may be easy and comfortable for the ZFS development department, but it doesn't reflect the real world - and not even Sun's software portfolio. The AVS integration into Nevada makes this even worse, and I hope there'll be something like fsck in the future, something which allows me to recover the files with correct checksums from a zpool, instead of simply hearing the sales droids repeat "There can't be any errors, NEVER!" over and over again :-)

I don't think there is any single technology to blame here, unless of course that technology is, as you put it, "Sun's software portfolio". The "ZFS development department" has done an excellent job in meeting, and exceeding, what they set out to accomplish, and then some. They even offer remote file replication via send / recv. What was not taken into consideration - and it is unclear where this falls - is that any Solaris filesystem can be replicated by either host-based or controller-based data services, and the need for assuring data consistency of that replicated filesystem.
Concerned as you are about system panics, ZFS is doing the correct thing in validating checksums, and panicking Solaris under circumstances ZFS considers to be data corruption. Do the same types of operations with other filesystems, and these undetected writes are essentially silent data corruption. The fact that ZFS validates data on reads is powerful.

>>> - and it will become much worse in the future, because disk capacities rocket up into the sky while performance isn't improved as much.
>>
>> Larger disk capacities are no worse in this scenario than they are with controller-based replication, ZFS send / receive, etc. Actually it is quite efficient. If the disk that failed was only 5% full, then when the HSP disk is switched in and rebuilt, only 5% of the entire disk has to be replicated. And if, at the time ZFS and AVS were deployed on this server, the HSP disks (containing uninitialized data) were also configured as equal with "sndradm -E ...", then there would be no initial replication cost, and when a spare is swapped into use, only the cost of replicating the ZFS data actually in use.
>
> That's interesting. Because, together with your "data and bitmap volume on the same disk" scenario, the bitmap volume would be lost too. A full sync of the disc would then be necessary, even if only 5% is in use. Am I correct?

My scenario used SVM-mirrored bitmaps for AVS, and ZFS is protected by its raidz or mirrored storage pool. When one loses a disk, SVM continues to use the other side of the mirror for AVS bitmaps, and ZFS uses the redundancy of its storage pool. When the failed disk is replaced, SVM needs to resilver and ZFS needs to rebuild, either on demand or via zpool scrub. All is good.

>>> During this time, your service misses redundancy.
>>
>> Absolutely not. If all of the in-use ZFS disks and the ZFS HSP disks are configured under AVS, there is never a time of lost redundancy.
>
> I'm sure there is, as soon as a disc crashes in the secondary and the primary disc is in logging mode for several hours. I bet you'll lose your HA as soon as the primary crashes before the secondary is in sync again, because the global ZFS metadata wasn't logged, but updated.

Redundancy, based on my understanding, is recovery from a single failure. What you allude to above is two (or more) failures, something not covered by simple redundancy. The need to be able to recover from multiple failures is clearly a known concept, hence the creation of raidz2, knowing that losing two disks in raidz is bad news.

Using AVS to replicate a ZFS storage pool offers something AVS has never had: the ability for ZFS to validate that AVS's replication was indeed perfect. Drop the replica into logging mode, zpool import, zpool scrub, zpool export, resume replication.

> I think to avoid this, the primary would have to send the entire replication group into logging mode - but then it would get even worse, because you'd lose your redundancy for days until the secondary is 100% in sync again and the regular replicating state becomes active (a full sync of an X4500 takes at least 5 days, and only when you don't have Sun Cluster with exclusive interconnect interfaces up and running).
>
> Linux/DRBD: some data will be missing and you'll have fun fsck'ing for two hours.
> ZFS: the secondary is not consistent, the zpool is FAULTED, all data is lost, you have a downtime while recovering from backup tapes, plus a week with reduced redundancy because of the time needed for resyncing the restored data. You want three cluster nodes in most deployment scenarios, not just two, believe me ;-) It doesn't matter much if you only host a few easy-to-restore videos. But I'm talking about file servers which host several billion inodes, like the file servers which host the mail headers, bodies and attachments for a million Yahoo users, a terabyte of moving data each day which cannot be backed up to tape.
>
>>> And we're not talking about some minutes. Well, and now try to imagine what will happen if another disk fails during this rebuild, this time in the secondary ...
>>
>> If I were truly counting on AVS, I would be glad this happened! Getting replication configured right, be it AVS or some other option, means that when disks, systems, networks, etc. fail, there is always a period of degraded system performance, but that is better than no system performance.
>
> That's correct. But don't forget that it's always a very small step from "degraded" to "faulted". In particular when it comes to high-availability scenarios in data centers, because in such scenarios you'll always have to rely on other people with less know-how and motivation. It's easy to accept a degraded state as long as you're in your office. But with an X4500, your degraded state may potentially last longer than a weekend, and when you're directly responsible for the mail of millions of users and you know that any non-availability will place your name on Slashdot (or the name of your CEO, which equals placing your head on a scaffold), I'm sure you'll think twice about using ZFS with AVS or letting the Linux dudes continue to play with their inefficient boxes :-)

All very valid points, and having reassurance that choices made today will prove themselves valuable if and when degraded or faulted states arise is key. I am a strong proponent of disaster-recovery testing long before your company or CxOs sign off on a solution put into production. You are right to question, and to arrive at your own informed conclusions about the technologies you choose before deployment.

>> But if a disaster happened on the primary node, and a decision was made to import the ZFS storage pool on the secondary, ZFS would detect the inconsistency, mark the drive as failed, and swap in the secondary's HSP disk. Later, when the primary site comes back and a reverse synchronization is done to restore the writes that happened on the secondary, the primary ZFS file system will become aware that a HSP swap occurred and continue on right where the secondary node left off.
>
> I'll try that as soon as I have a chance again (which means: as soon as Sun gets Sun Cluster working on an X4500).
>
>>> c) You *must* force every single `zpool import <zpool>` on the secondary host. Always.
>>
>> Correct, but this is the case even without AVS! If one configured ZFS on SAN-based storage and the primary node crashed, one would need to force every single `zpool import <zpool>`. This is not an AVS issue, but a ZFS protection.
>
> Right. Too bad ZFS reacts this way.
> I have to admit that you made me nervous once, when you wrote that forcing zpool imports would be a bad idea ...

I think there was some context to my prior statement, as in checking the current state of replication before doing so. ;-)

> [X] Zfsck now! Let's organize a petition. :-)
>
>> Correct, but this is the case even without AVS! Take the same SAN-based storage scenario above, go to a secondary system on your SAN, and force every single `zpool import <zpool>`.
>
> Yes, but on a SAN I don't have to worry about zpool inconsistency, because the zpool always resides on the same devices.

Point well taken.

>> In the case of a SAN, where the same physical disk would be written to by both hosts, you would likely get complete data loss; but with AVS, where ZFS is actually on two physical disks and AVS is tracking writes, even if they are inconsistent writes, AVS can and will recover if an update sync is done.
>
> My problem is that there's no ZFS mechanism which allows me to verify the zpool consistency before I actually try to import it. Like I said before: AVS does it right, it's just ZFS that doesn't (and otherwise it wouldn't make sense to discuss it on this mailing list anyway :-) ).
>
> It could really help me with AVS if there were something like "zpool check <zpool>", something for checking a zpool before an import. I could have a cron job which puts the secondary host into logging mode, runs a "zpool check" and continues with the replication a few hours afterwards. That would let me sleep better, and I wouldn't have to pray to the IT gods before an import. You know, I saw literally *hundreds* of kernel panics during my tests; that made me nervous. I have scripts which do the job now, but I saw the risks and the things which can go wrong if someone else without my experience does it (like the infamous "forgetting to manually place the secondary in logging mode before trying to import a zpool").

The issue is not the need for a "zpool check", or to improve on "zpool import", or ZFS itself, as each could validate the storage pool as being 100% perfect if, at the moment they run, ZFS on the primary node is not writing data - data which may be actively replicating to the secondary node. The problem (or lack of a feature) is that ZFS does not support shared access to a single storage pool. ZFS on one node, in seeing ZFS writes issued by another node (be it via a dual-ported disk, a SAN disk, AVS replication, or controller-based replication), views these ZFS writes and their checksum data as a form of data corruption, and rightfully ZFS panics Solaris.

I know that the shared QFS filesystem supports careful, ordered writes, which allows a shared QFS reader client to read (only) from an active replica, be it AVS- or controller-based replication. As with QFS, given time, ZFS will evolve.

>> You are quite correct in that although ZFS is intuitively easy to use, AVS is painfully complex. Of course the mindsets of AVS and ZFS are as distant apart as they are in the alphabet. :-O
>
> AVS was easy to learn and isn't very difficult to work with. All you need is 1 or 2 months of testing experience. Very easy with UFS.
>
>> With AVS in Nevada, there is now an opportunity for leveraging the ease of use of ZFS with AVS. Being also the iSCSI Target project lead, I see a lot of value in the ZFS option "set shareiscsi=on" for getting end users into using iSCSI.
> Too bad the X4500 has too few PCI slots to consider buying iSCSI cards.

HBA manufacturers have in the past created multi-port and multi-function HBAs. I would expect there to be something out there, or out there soon, which will address the need of limited PCI slots.

> The two existing slots are already needed for the Sun Cluster interconnect. I think iSCSI won't be a real option unless the servers are shipped with it onboard, as has been done in the past with SCSI or Ethernet interfaces.
>
>> I would like to see "set replication=AVS:<secondary host>", configuring a locally named ZFS storage pool to the same-named pair on some remote host. Starting down this path would afford things like ZFS replication monitoring, similar to what ZFS does with each of its own vdevs.
>
> Yes! Jim, I think we'll become friends :-) Who do I have to send the bribe money to?

Sun Microsystems, Inc., as in buying Sun servers, software, storage and services. Non-monetary offerings, in the form of being an active OpenSolaris community member, are also highly valued.

Jim Dunham
Solaris, Storage Software Group
Sun Microsystems, Inc.
1617 Southwood Drive
Nashua, NH 03063
Email: James.Dunham at Sun.COM
http://blogs.sun.com/avs
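[To make Jim's SVM suggestion above concrete, a rough sketch of building 5 RAID 0 volumes of 8 disks each and layering a raidz pool on top might look like this. Device and volume names are invented, it assumes SVM state database replicas (metadb) already exist, and the AVS bitmap slices are left out for brevity:

    # one 8-disk RAID 0 stripe per SVM volume, five volumes in total
    metainit d10 1 8 c0t0d0s0 c0t1d0s0 c0t2d0s0 c0t3d0s0 c0t4d0s0 c0t5d0s0 c0t6d0s0 c0t7d0s0
    metainit d11 1 8 c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0 c1t4d0s0 c1t5d0s0 c1t6d0s0 c1t7d0s0
    # ... d12, d13 and d14 built the same way from the remaining disks ...

    # a single raidz vdev across the five SVM volumes; AVS then replicates d10-d14
    zpool create tank raidz /dev/md/dsk/d10 /dev/md/dsk/d11 /dev/md/dsk/d12 \
        /dev/md/dsk/d13 /dev/md/dsk/d14
]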
Ralf Ramge wrote:
> I consider this a big design flaw of ZFS.

Are you saying that it's a design flaw of ZFS that we haven't yet implemented remote replication? I would consider that a missing feature, not a design flaw. There's nothing in the design of ZFS to prevent such a feature (and in fact, several aspects of the design would work very well with such a feature, e.g. as used by "zfs send").

> I'm not very familiar with the code, but I still have hope that there'll be a parameter which allows me to get rid of the cache flushes.

You mean zfs_nocacheflush? Admittedly, this is a hack. We're working on making this simply do the right thing, based on the capabilities of the underlying storage device.

> ZFS, and the X4500, are typical examples of different departments not really working together, e.g. they have a wonderful file system, but there is no storage which supports it.

I'm not sure what you mean. ZFS supports any storage, and works great on the X4500.

--matt
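[For reference, the zfs_nocacheflush tunable Matt mentions is typically set in /etc/system, or poked on a live system with mdb. It is only sensible on storage with a non-volatile write cache, so treat this as an illustration rather than a recommendation:

    # /etc/system - takes effect on the next reboot
    set zfs:zfs_nocacheflush = 1

    # or on a running system, without a reboot:
    echo "zfs_nocacheflush/W0t1" | mdb -kw
]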
Hello Ralf,

Wednesday, August 22, 2007, 8:55:35 AM, you wrote:

RR> instead, or - most important for the typical data center - the
RR> *operator* not being able to replace a disk the way he's used to:
RR> pull the old disc out, put the new disc in, resync starts, finished.
RR> You'll always have to wait until the next morning, until a Solaris
RR> administrator is available again (which may impact your
RR> high-availability concepts), or keep a Solaris administrator in the
RR> company 24/7 (which raises the TCO of the Solaris boxes).

See the putback from a few weeks ago (it went in, right?), which should deliver exactly what you want - just pull out the bad disk, put in the new disk, and ZFS resyncs automatically. No admin intervention at all.

--
Best regards,
Robert Milkowski                 mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
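[Robert is presumably referring to the autoreplace pool property - an assumption on my part, as the putback isn't named. If so, enabling the hands-off behaviour he describes looks like this; "tank" is a placeholder pool name:

    # allow a new disk inserted into the same physical slot to be used
    # automatically, without a manual "zpool replace"
    zpool set autoreplace=on tank
    zpool get autoreplace tank       # verify the property
]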
Ok, I had a bit of a look around. What about this setup?

Two boxes with all the hard drives in them, and all drives set up as iSCSI targets. A third box puts all of the drives into a mirrored raidz setup (one box mirroring the other, each as a raidz zfs zpool). This setup will then be shared out via Samba.

Does anybody see a problem with this? Also, I know this isn't ZFS-related, but is there any upper limit on file size with Samba?

Thanks for all your help.

Mark
Mark wrote:
> Two boxes with all the hard drives in them, and all drives set up as iSCSI targets. A third box puts all of the drives into a mirrored raidz setup (one box mirroring the other, each as a raidz zfs zpool). This setup will then be shared out via Samba.
>
> Does anybody see a problem with this?

Seems reasonable to me. However, you haven't said anything about how the "third box" is networked to the "first box" and "second box".

With iSCSI I HIGHLY recommend at least using IPsec AH, so that you get integrity protection of the packets - the TCP checksum is not enough. If you care enough to use the sha256 checksum with ZFS, you should care enough to ensure the data on the wire is strongly checksummed too.

Also consider that if this were direct-attach storage you would probably be using two separate HBAs, so you may want to consider using different physical NICs and/or IPMP or other network failover technologies (depending on what hardware you have network-wise).

I did a similar setup recently where I had a zpool on one machine and created two iSCSI targets (using ZFS), and then created a mirror using those two LUNs on another machine. In the end I removed the ZFS pool on the target side and shared out the raw disks with iSCSI, and built the pool on the initiator machine that way. Why? Because I couldn't rationalise to myself what value ZFS was giving me in this particular case, since I was sharing the whole disk array. In cases where you aren't sharing the whole array to a single initiator, I can see value in having the iSCSI targets be zvols.

--
Darren J Moffat
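[A rough illustration of the raw-disk approach Darren describes, with invented addresses and device names. Note that ZFS does not support mirroring one raidz vdev against another, so the closest equivalent of "one box mirroring the other" is a pool of mirror pairs, each pairing one LUN from box A with one from box B:

    # on the third box (the initiator), discover the targets on the two storage boxes
    iscsiadm add discovery-address 192.168.10.1:3260
    iscsiadm add discovery-address 192.168.10.2:3260
    iscsiadm modify discovery --sendtargets enable
    devfsadm -i iscsi              # create device nodes for the discovered LUNs

    # build the pool from cross-box mirror pairs (device names are placeholders)
    zpool create video \
        mirror c2t1d0 c3t1d0 \
        mirror c2t2d0 c3t2d0 \
        mirror c2t3d0 c3t3d0
]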
Hi,

A few questions. I seem to remember that in WAN environments IPsec can have a reasonably large performance impact. How large is this performance impact, and is there some way to mitigate it? The problem is we could need to use all of a gigabit link's bandwidth (possibly more). Is IPsec AH slightly different from the cryptography algorithms that keep VPNs secure?

Also, I had a look at IPMP; it sounds really good. I was wondering yesterday about the possibility of linking a few gigabit links together, as FC is very expensive and 10GbE is almost the same. I read in the Wikipedia article that by using IPMP the bandwidth is increased due to sharing across all the network cards. Is this true?

Thanks again for all your help.

Cheers,
Mark
What hardware you have, and what size the data chunks are, will determine what impact IPsec has. WAN vs LAN isn't the issue.

As for mitigating the impact of the crypto in IPsec, it depends on the data size. If the size of the packets is > 512 bytes, then the crypto framework will offload that to hardware. However, that really only matters for symmetric ciphers such as AES and 3DES, which, if you are doing IPsec AH only rather than ESP+auth, you aren't using. If you do want to encrypt and have that offloaded to hardware, there are two choices: the Sun CA-6000 card or an UltraSPARC T2 processor (Niagara 2) [the CPU in the recently announced new machines].

Some VPNs are IPsec-based and some are SSL or SSH. Those that are IPsec-based do so with ESP+auth. IPsec AH doesn't protect the data from viewing on the wire, it just integrity-protects it - just like ZFS today (integrity-protected but not encrypted); a VPN needs to be more than that!

--
Darren J Moffat
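[A minimal sketch of what an AH-only policy between the file server and the backup box might look like on Solaris. The addresses are placeholders, and the exact policy syntax and key setup (IKE, or manual keys via ipseckey(1M)) should be checked against ipsecconf(1M) before use:

    # /etc/inet/ipsecinit.conf on each host - require AH for traffic between the two boxes
    {laddr 192.168.10.1 raddr 192.168.10.2} ipsec {auth_algs sha1}

    # load the policy
    ipsecconf -a /etc/inet/ipsecinit.conf
]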