If we get two X4500s and look at AVS, would it be possible to:

1) Set up AVS to replicate the ZFS filesystems and the zvols (UFS) from 01 -> 02? Is this supported by Sol 10 5/08?

Assuming 1, we would set up a home-made IP fail-over so that, should 01 go down, all clients are redirected to 02.

2) Fail-back: are there methods in AVS to handle fail-back? Since 02 has been used, it will have newer/modified files and will need to replicate backwards until synchronised, before fail-back can occur.

We did ask our vendor, but we were just told that AVS does not support the X4500.

Lund

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
lundman at gmo.jp said:
> We did ask our vendor, but we were just told that AVS does not support
> x4500.

You might have to use the open-source version of AVS, but it's not clear if that requires OpenSolaris or if it will run on Solaris 10. Here's a description of how to set it up between two X4500s:

http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless

Regards,
Marion
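For readers unfamiliar with the AVS/SNDR command set, the blog above walks through an enable roughly like the following sketch. Host names, devices, bitmap slices and the group name are placeholders, not values from this thread:

  # define one SNDR set per replicated slice, on both hosts
  sndradm -e host01 /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t2d0s1 \
             host02 /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t2d0s1 \
             ip async g zfspool

  sndradm -n -g zfspool -m     # start the initial full synchronisation
  sndradm -g zfspool -P        # check replication status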
Jorgen Lundman wrote:
> We did ask our vendor, but we were just told that AVS does not support
> x4500.

The officially supported AVS has worked on the X4500 since the X4500 came out. But, although Jim Dunham and others will tell you otherwise, I absolutely can *not* recommend using it on this hardware with ZFS, especially with the larger disk sizes. At least not for important, or even business critical, data - in such a case, using X41x0 servers with J4500 JBODs and an HAStoragePlus cluster instead of AVS may be a much better and more reliable option, for basically the same price.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484
Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
On Thu, Sep 4, 2008 at 12:19 AM, Ralf Ramge <ralf.ramge at webde.de> wrote:
> Jorgen Lundman wrote:
>
> > We did ask our vendor, but we were just told that AVS does not support
> > x4500.
>
> The officially supported AVS works on the X4500 since the X4500 came out. But, although Jim Dunham and others will tell you otherwise, I absolutely can *not* recommend using it on this hardware with ZFS, especially with the larger disk sizes. At least not for important, or even business critical data - in such a case, using X41x0 servers with J4500 JBODs and a HAStoragePlus Cluster instead of AVS may be a much better and more reliable option, for basically the same price.
>
> --
> Ralf Ramge
> Senior Solaris Administrator, SCNA, SCSA

I did some Googling, but I saw some limitations sharing your ZFS pool via NFS while using the HAStorage cluster product as well. Do similar limitations exist for sharing via the built-in CIFS in OpenSolaris as well?

Here: http://docs.sun.com/app/docs/doc/820-2565/z4000275997776?a=view

"Zettabyte File System (ZFS) Restrictions

If you are using the zettabyte file system (ZFS) as the exported file system, you must set the sharenfs property to off. To set the sharenfs property to off, run the following command.

$ zfs set sharenfs=off file_system/volume

To verify if the sharenfs property is set to off, run the following command.

$ zfs get sharenfs file_system/volume"

--
Brent Jones
brent at servuhome.net
Brent Jones wrote:
> I did some Googling, but I saw some limitations sharing your ZFS pool
> via NFS while using HAStorage Cluster product as well.
[...]
> If you are using the zettabyte file system (ZFS) as the exported file
> system, you must set the sharenfs property to off.

That's not a limitation, it just looks like one. The cluster's resource type called "SUNW.nfs" decides if a file system is shared or not, and it does this with the usual "share" and "unshare" commands in a separate dfstab file. The ZFS sharenfs flag is set to "off" to avoid conflicts.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
ralf.ramge at webde.de - http://web.de/
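For reference, with HA-NFS the share itself typically lives in a dfstab file under the resource group's Pathprefix directory rather than in the sharenfs property. A rough sketch, with made-up paths, dataset and resource names:

  # ZFS manages the pool, but the SUNW.nfs resource manages sharing
  zfs set sharenfs=off tank/export

  # <Pathprefix>/SUNW.nfs/dfstab.<nfs-resource>, e.g.
  # /tank/export/admin/SUNW.nfs/dfstab.nfs-res contains ordinary share(1M) lines:
  share -F nfs -o rw /tank/export/data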
zfs-discuss-bounces at opensolaris.org wrote on 09/04/2008 02:19:23 AM:

> Jorgen Lundman wrote:
>
> > We did ask our vendor, but we were just told that AVS does not support
> > x4500.
>
> The officially supported AVS works on the X4500 since the X4500 came out. But, although Jim Dunham and others will tell you otherwise, I absolutely can *not* recommend using it on this hardware with ZFS, especially with the larger disk sizes. At least not for important, or even business critical data - in such a case, using X41x0 servers with J4500 JBODs and a HAStoragePlus Cluster instead of AVS may be a much better and more reliable option, for basically the same price.

Ralf,

War wounds? Could you please expand on the why a bit more?

-Wade
On Thu, Sep 4, 2008 at 10:09 AM, <Wade.Stuart at fallon.com> wrote:
> zfs-discuss-bounces at opensolaris.org wrote on 09/04/2008 02:19:23 AM:
>
>> Jorgen Lundman wrote:
>>
>> > We did ask our vendor, but we were just told that AVS does not support
>> > x4500.
>>
>> The officially supported AVS works on the X4500 since the X4500 came out. But, although Jim Dunham and others will tell you otherwise, I absolutely can *not* recommend using it on this hardware with ZFS, especially with the larger disk sizes. At least not for important, or even business critical data - in such a case, using X41x0 servers with J4500 JBODs and a HAStoragePlus Cluster instead of AVS may be a much better and more reliable option, for basically the same price.
>
> Ralf,
>
> War wounds? Could you please expand on the why a bit more?

+1  I'd also be interested in more details.

Thanks,

--
Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
On Thu, Sep 4, 2008 at 7:38 PM, Al Hopper <al at logical-approach.com> wrote:
> On Thu, Sep 4, 2008 at 10:09 AM, <Wade.Stuart at fallon.com> wrote:
>> zfs-discuss-bounces at opensolaris.org wrote on 09/04/2008 02:19:23 AM:
>>
>>> Jorgen Lundman wrote:
>>>
>>> > We did ask our vendor, but we were just told that AVS does not support
>>> > x4500.
>>>
>>> The officially supported AVS works on the X4500 since the X4500 came out. But, although Jim Dunham and others will tell you otherwise, I absolutely can *not* recommend using it on this hardware with ZFS, especially with the larger disk sizes. At least not for important, or even business critical data - in such a case, using X41x0 servers with J4500 JBODs and a HAStoragePlus Cluster instead of AVS may be a much better and more reliable option, for basically the same price.
>>
>> Ralf,
>>
>> War wounds? Could you please expand on the why a bit more?
>
> +1  I'd also be interested in more details.
>
> Thanks,
>
> --
> Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com

Story time!

--
Brent Jones
brent at servuhome.net
Wade.Stuart at fallon.com wrote:
> War wounds? Could you please expand on the why a bit more?

- ZFS is not aware of AVS. On the secondary node, you'll always have to force the `zpool import` due to the unnoticed changes of metadata (zpool in use). No mechanism to prevent data loss exists, e.g. zpools can be imported when the replicator is *not* in logging mode.

- AVS is not ZFS aware. For instance, if ZFS resilvers a mirrored disk, e.g. after replacing a drive, the complete disk is sent over the network to the secondary node, even though the replicated data on the secondary is intact. That's a lot of fun with today's disk sizes of 750 GB and 1 TB drives, resulting in usually 10+ hours without real redundancy (customers who use Thumpers to store important data usually don't have the budget to connect their data centers with 10 Gbit/s, so expect 10+ hours *per disk*).

- ZFS & AVS & X4500 leads to bad error handling. The zpool may not be imported on the secondary node during the replication. The X4500 does not have a RAID controller which signals (and handles) drive faults. Drive failures on the secondary node may happen unnoticed until the primary node goes down and you want to import the zpool on the secondary node with the broken drive. Since ZFS doesn't offer a recovery mechanism like fsck, data loss of up to 20 TB may occur. If you use AVS with ZFS, make sure that you have storage which handles drive failures without OS interaction.

- 5 hours for scrubbing a 1 TB drive. If you're lucky. Up to 48 drives in total.

- An X4500 has no battery-buffered write cache. ZFS uses the server's RAM as a cache, 15 GB+. I don't want to find out how much time a resilver over the network after a power outage may take (a full reverse replication would take up to 2 weeks and is no valid option in a serious production environment). But the underlying question I asked myself is why should I want to replicate data in such an expensive way, when I think the 48 TB of data itself is not important enough to be protected by a battery?

- I gave AVS a set of 6 drives just for the bitmaps (using SVM soft partitions). They weren't enough; the replication was still very slow, probably because of an insane amount of head movements, and it scales badly. Putting the bitmap of a drive on the drive itself (if I remember correctly, this is recommended in one of the most referenced howto blog articles) is a bad idea. Always use ZFS on whole disks if performance and caching matter to you.

- AVS seems to require an additional shared storage when building failover clusters with 48 TB of internal storage. That may be hard to explain to the customer. But I'm not 100% sure about this, because I just didn't find a way; I didn't ask on a mailing list for help.

If you want a fail-over solution for important data, use the external JBODs. Use AVS only to mirror complete clusters, don't use it to replicate single boxes with local drives. And, in case OpenSolaris is not an option for you due to your company policies or support contracts, building a real cluster is also A LOT cheaper.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
ralf.ramge at webde.de - http://web.de/
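For context, the takeover sequence the first point alludes to looks roughly like this on the secondary node; the group and pool names are placeholders, and -f is needed because the pool was never exported by the failed primary:

  # on host02, after confirming host01 is really down
  sndradm -n -g zfspool -l     # drop the SNDR sets into logging mode
  zpool import -f tank         # pool still looks "in use" by host01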
[jumping ahead and quoting myself] AVS is not a mirroring technology, it is a remote replication technology. So, yes, I agree 100% that people should not expect AVS to be a mirror.

Ralf Ramge wrote:
> Wade.Stuart at fallon.com wrote:
>> War wounds? Could you please expand on the why a bit more?
>
> - ZFS is not aware of AVS. On the secondary node, you'll always have to force the `zpool import` due to the unnoticed changes of metadata (zpool in use). No mechanism to prevent data loss exists, e.g. zpools can be imported when the replicator is *not* in logging mode.

ZFS isn't special in this regard; AFAIK all file systems, databases and other data stores suffer from the same issue with remote replication.

> - AVS is not ZFS aware. For instance, if ZFS resilvers a mirrored disk, e.g. after replacing a drive, the complete disk is sent over the network to the secondary node, even though the replicated data on the secondary is intact. That's a lot of fun with today's disk sizes of 750 GB and 1 TB drives, resulting in usually 10+ hours without real redundancy (customers who use Thumpers to store important data usually don't have the budget to connect their data centers with 10 Gbit/s, so expect 10+ hours *per disk*).

ZFS only resilvers data. Other LVMs, like SVM, will resilver the entire disk, though.

> - ZFS & AVS & X4500 leads to bad error handling. The zpool may not be imported on the secondary node during the replication. The X4500 does not have a RAID controller which signals (and handles) drive faults. Drive failures on the secondary node may happen unnoticed until the primary node goes down and you want to import the zpool on the secondary node with the broken drive. Since ZFS doesn't offer a recovery mechanism like fsck, data loss of up to 20 TB may occur. If you use AVS with ZFS, make sure that you have storage which handles drive failures without OS interaction.

If this is the case, then array-based replication would also be similarly affected by this architectural problem. In other words, if you say that a software RAID system cannot be replicated by a software replicator, then TrueCopy, SRDF, and other RAID array-based (also software) replicators also do not work. I think there is enough empirical evidence that they do work. I can see where there might be a best practice here, but I see no fundamental issue.

fsck does not recover data, it only recovers metadata.

> - 5 hours for scrubbing a 1 TB drive. If you're lucky. Up to 48 drives in total.

ZFS only scrubs data. But it is not unusual for a lot of data scrubbing to take a long time. ZFS only performs read scrubs, so there is no replication required during a ZFS scrub, unless data is repaired.

> - An X4500 has no battery-buffered write cache. ZFS uses the server's RAM as a cache, 15 GB+. I don't want to find out how much time a resilver over the network after a power outage may take (a full reverse replication would take up to 2 weeks and is no valid option in a serious production environment). But the underlying question I asked myself is why should I want to replicate data in such an expensive way, when I think the 48 TB of data itself is not important enough to be protected by a battery?

ZFS will not be storing 15 GBytes of unflushed data on any system I can imagine today.
While we can all agree that 48 TBytes will be painful to replicate, that is not caused by ZFS -- though it is enabled by ZFS, because some other file systems (UFS) cannot be as large as 48 TBytes.

> - I gave AVS a set of 6 drives just for the bitmaps (using SVM soft partitions). They weren't enough; the replication was still very slow, probably because of an insane amount of head movements, and it scales badly. Putting the bitmap of a drive on the drive itself (if I remember correctly, this is recommended in one of the most referenced howto blog articles) is a bad idea. Always use ZFS on whole disks if performance and caching matter to you.

I think there are opportunities for performance improvement, but I don't know who is currently actively working on this. Actually, the cases where ZFS on whole disks is a big win are small. And, of course, you can enable disk write caches by hand.

> - AVS seems to require an additional shared storage when building failover clusters with 48 TB of internal storage. That may be hard to explain to the customer. But I'm not 100% sure about this, because I just didn't find a way; I didn't ask on a mailing list for help.
>
> If you want a fail-over solution for important data, use the external JBODs. Use AVS only to mirror complete clusters, don't use it to replicate single boxes with local drives. And, in case OpenSolaris is not an option for you due to your company policies or support contracts, building a real cluster is also A LOT cheaper.

AVS is not a mirroring technology, it is a remote replication technology. So, yes, I agree 100% that people should not expect AVS to be a mirror.

An earlier discussion on this forum dealt with the details of when the write ordering must be preserved for ongoing operation. But when a full resync is required, the write ordering is not preserved. The theory is that this might affect ZFS more so than other file systems, or perhaps ZFS might notice it more than other file systems. But again, this affects other remote replication technologies, also.

-- richard
Jorgen,

> If we get two x4500s, and look at AVS, would it be possible to:
>
> 1) Setup AVS to replicate zfs, and zvol (ufs) from 01 -> 02? Supported by Sol 10 5/08?

For Solaris 10, one will need to purchase AVS. It was not until OpenSolaris that AVS became bundled. Also, the OpenSolaris version will not run on Solaris 10.

> Assuming 1, if we setup a home-made IP fail-over so that; should 01 go down, all clients are redirected to 02.
>
> 2) Fail-back, are there methods in AVS to handle fail-back?

Yes, it's called SNDR reverse synchronization, and it is a key feature of SNDR and its ability to create a DR site.

> Since 02 has been used, it will have newer/modified files, and will need to replicate backwards until synchronised, before fail-back can occur.

SNDR supports on-demand pull, which means that once reverse synchronization has been started, the SNDR primary volumes can be accessed. In addition to the background resilvering of differences, those blocks requested "on demand" will be included in the reverse synchronization.

> We did ask our vendor, but we were just told that AVS does not support x4500.

AVS works with any Solaris block storage device, independent of platform. Period.

> Lund
>
> --
> Jorgen Lundman       | <lundman at lundman.net>
> Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
> Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
> Japan                | +81 (0)3 -3375-1767 (home)

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
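To put the reverse synchronization Jim describes into concrete terms, a fail-back after running on the secondary might look roughly like the following sketch; the group name is a placeholder, and the exact procedure (where the pools are imported, when the application is quiesced) depends on the deployment:

  # quiesce the application on host02 first, then on host01 (original primary):
  sndradm -n -g zfspool -l       # make sure the sets are in logging mode
  sndradm -n -g zfspool -u -r    # reverse update sync: copy changes host02 -> host01
  sndradm -n -g zfspool -w       # wait for the resync to complete
  # then import the pool on host01 again and resume normal replication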
Ralf,

> Wade.Stuart at fallon.com wrote:
>> War wounds? Could you please expand on the why a bit more?
>
> - ZFS is not aware of AVS. On the secondary node, you'll always have to force the `zpool import` due to the unnoticed changes of metadata (zpool in use).

This is not true. If the primary node invokes "zpool export" while replication is still active, then a forced "zpool import" is not required. This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

> No mechanism to prevent data loss exists, e.g. zpools can be imported when the replicator is *not* in logging mode.

This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

> - AVS is not ZFS aware.

AVS is not UFS, QFS, Oracle or Sybase aware either. This makes AVS, and other host-based and controller-based replication services, multi-functional. If you desire ZFS-aware functionality, use ZFS send and recv.

> For instance, if ZFS resilvers a mirrored disk, e.g. after replacing a drive, the complete disk is sent over the network to the secondary node, even though the replicated data on the secondary is intact.

The complete disk IS NOT sent over the network to the secondary node, only those disk blocks that are re-written by ZFS. This has to be this way, since ZFS does not differentiate between writes caused by resilvering and writes caused by new ZFS filesystem operations. Furthermore, only those portions of the ZFS storage pool are replicated in this scenario, not every block in the entire storage pool.

> That's a lot of fun with today's disk sizes of 750 GB and 1 TB drives, resulting in usually 10+ hours without real redundancy (customers who use Thumpers to store important data usually don't have the budget to connect their data centers with 10 Gbit/s, so expect 10+ hours *per disk*).

If one creates a ZFS storage pool whose size is 1 TB, then enables AVS after the fact, AVS cannot differentiate between blocks that are in use by ZFS and those that are not, therefore AVS needs to replicate the entire TB of storage. If one enables AVS first, before the volumes are placed in a ZFS storage pool, then the "sndradm -E ..." option can be used. Then, when the ZFS storage pool is created, only those I/Os needed to initialize the pool need be replicated.

If one has a ZFS storage pool that is quite large, but in actuality little of the storage pool is in use, then by enabling SNDR first on a replacement volume and invoking "zpool replace ..." on multiple vdevs in the storage pool, an optimal replication of the ZFS storage pool can be done.

> - ZFS & AVS & X4500 leads to bad error handling. The zpool may not be imported on the secondary node during the replication.

This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

> The X4500 does not have a RAID controller which signals (and handles) drive faults. Drive failures on the secondary node may happen unnoticed until the primary node goes down and you want to import the zpool on the secondary node with the broken drive. Since ZFS doesn't offer a recovery mechanism like fsck, data loss of up to 20 TB may occur. If you use AVS with ZFS, make sure that you have storage which handles drive failures without OS interaction.
>
> - 5 hours for scrubbing a 1 TB drive. If you're lucky. Up to 48 drives in total.
>
> - An X4500 has no battery-buffered write cache. ZFS uses the server's RAM as a cache, 15 GB+.
> I don't want to find out how much time a resilver over the network after a power outage may take (a full reverse replication would take up to 2 weeks and is no valid option in a serious production environment). But the underlying question I asked myself is why should I want to replicate data in such an expensive way, when I think the 48 TB of data itself is not important enough to be protected by a battery?

I don't understand the relevance to AVS in the prior three paragraphs.

> - I gave AVS a set of 6 drives just for the bitmaps (using SVM soft partitions). They weren't enough; the replication was still very slow, probably because of an insane amount of head movements, and it scales badly. Putting the bitmap of a drive on the drive itself (if I remember correctly, this is recommended in one of the most referenced howto blog articles) is a bad idea. Always use ZFS on whole disks if performance and caching matter to you.

When you have the time, can you replace the "probably because of ..." with some real performance numbers?

> - AVS seems to require an additional shared storage when building failover clusters with 48 TB of internal storage. That may be hard to explain to the customer. But I'm not 100% sure about this, because I just didn't find a way; I didn't ask on a mailing list for help.

When you have the time, can you replace the "AVS seems to ..." with some specific references to what you are referring to?

> If you want a fail-over solution for important data, use the external JBODs. Use AVS only to mirror complete clusters, don't use it to replicate single boxes with local drives. And, in case OpenSolaris is not an option for you due to your company policies or support contracts, building a real cluster is also A LOT cheaper.

You are offering up these position statements based on what?

> --
> Ralf Ramge
> Senior Solaris Administrator, SCNA, SCSA

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
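A rough sketch of the enable ordering Jim describes, with placeholder host, device and pool names; the point is that -E (equal enable) is only safe while both sides still hold "don't care" data:

  # 1. enable SNDR with -E before the volumes carry any ZFS data
  sndradm -E host01 /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t2d0s1 \
             host02 /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t2d0s1 \
             ip async g zfspool

  # 2. only then create the pool; just the pool-initialization I/O is replicated
  zpool create tank mirror c1t2d0s0 c1t3d0s0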
Jim Dunham wrote:
[...]

Jim, at first: I never said that AVS is a bad product. And I never will. I wonder why you act as if you were attacked personally. To be honest, if I were a customer with the original question, such a reaction wouldn't make me feel safer.

>> - ZFS is not aware of AVS. On the secondary node, you'll always have to force the `zpool import` due to the unnoticed changes of metadata (zpool in use).
>
> This is not true. If the primary node invokes "zpool export" while replication is still active, then a forced "zpool import" is not required. This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

Jim. A graceful shutdown of the primary node may be a valid disaster scenario in the laboratory, but it never will be in real life.

>> No mechanism to prevent data loss exists, e.g. zpools can be imported when the replicator is *not* in logging mode.
>
> This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

And what makes you think that I said that AVS is the problem here?

And by the way, the customer doesn't care *why* there's a problem. He only wants to know *if* there's a problem.

>> - AVS is not ZFS aware.
>
> AVS is not UFS, QFS, Oracle or Sybase aware either. This makes AVS, and other host-based and controller-based replication services, multi-functional. If you desire ZFS-aware functionality, use ZFS send and recv.

Yes, exactly. And that's the problem, since `zfs send` and `zfs receive` are no working solution in a fail-safe two-node environment. Again: the customer doesn't care *why* there's a problem. He only wants to know *if* there's a problem.

>> For instance, if ZFS resilvers a mirrored disk, e.g. after replacing a drive, the complete disk is sent over the network to the secondary node, even though the replicated data on the secondary is intact.
>
> The complete disk IS NOT sent over the network to the secondary node, only those disk blocks that are re-written by ZFS.

Yes, you're right. But sadly, in the mentioned scenario of having replaced an entire drive, the entire disk is rewritten by ZFS. Again: and what makes you think that I said that AVS is the problem here?

>> - ZFS & AVS & X4500 leads to bad error handling. The zpool may not be imported on the secondary node during the replication.
>
> This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.

Again: and what makes you think that I said that AVS is the problem here? We are not on avs-discuss, Jim.

> I don't understand the relevance to AVS in the prior three paragraphs?

We are not on avs-discuss, Jim. The customer wanted to know what drawbacks exist in his *scenario*. Not AVS.

>> - I gave AVS a set of 6 drives just for the bitmaps (using SVM soft partitions). They weren't enough; the replication was still very slow, probably because of an insane amount of head movements, and it scales badly. Putting the bitmap of a drive on the drive itself (if I remember correctly, this is recommended in one of the most referenced howto blog articles) is a bad idea. Always use ZFS on whole disks if performance and caching matter to you.
>
> When you have the time, can you replace the "probably because of ..." with some real performance numbers?

No problem.
If you please organize a Try&Buy of two X4500 servers being sent to my address, thank you.

>> - AVS seems to require an additional shared storage when building failover clusters with 48 TB of internal storage. That may be hard to explain to the customer. But I'm not 100% sure about this, because I just didn't find a way; I didn't ask on a mailing list for help.
>
> When you have the time, can you replace the "AVS seems to ..." with some specific references to what you are referring to?

The installation and configuration process, and the location where AVS wants to store the shared database. I can tell you details about it the next time I give it a try. Until then, please read the last sentence you quoted once more, thank you.

>> If you want a fail-over solution for important data, use the external JBODs. Use AVS only to mirror complete clusters, don't use it to replicate single boxes with local drives. And, in case OpenSolaris is not an option for you due to your company policies or support contracts, building a real cluster is also A LOT cheaper.
>
> You are offering up these position statements based on what?

My outline agreements, my support contracts, the partner web desk, and finally my experience with projects in high-availability scenarios with tens of thousands of servers.

Jim, it's okay. I know that you're a project leader at Sun Microsystems and that AVS is your main concern. But if there's one thing I cannot stand, it's getting stroppy replies from someone who should know better and should have realized that he's acting publicly and in front of the people who finance his income, instead of trying to start a flame war. From now on, I leave the rest to you, because I earn my living with products of Sun Microsystems, too, and I don't want to damage either Sun or this mailing list.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
ralf.ramge at webde.de - http://web.de/
Richard Elling wrote:
>> Yes, you're right. But sadly, in the mentioned scenario of having replaced an entire drive, the entire disk is rewritten by ZFS.
>
> No, this is not true. ZFS only resilvers data.

Okay, I see we have a communication problem here. Probably my fault; I should have written "the entire data and metadata". I made the assumption that a 1 TB drive in an X4500 may have up to 1 TB of data on it, simply because nobody buys the 1 TB X4500 just to use 10% of the disk space - he would have bought the 250 GB, 500 GB or 750 GB model then. In any case, and with any disk size, that's something you don't want to have on your network if there's a chance to avoid it.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
ralf.ramge at webde.de - http://web.de/
Ralf Ramge wrote:
> Richard Elling wrote:
>>> Yes, you're right. But sadly, in the mentioned scenario of having replaced an entire drive, the entire disk is rewritten by ZFS.
>>
>> No, this is not true. ZFS only resilvers data.
>
> Okay, I see we have a communication problem here. Probably my fault; I should have written "the entire data and metadata". I made the assumption that a 1 TB drive in an X4500 may have up to 1 TB of data on it, simply because nobody buys the 1 TB X4500 just to use 10% of the disk space - he would have bought the 250 GB, 500 GB or 750 GB model then.

Actually, they do :-) Some storage vendors insist on it, to keep performance up -- short-stroking.

I've done several large-scale surveys of this and the average usage is 50%. This is still a large difference in resilver times between ZFS and SVM.

> In any case, and with any disk size, that's something you don't want to have on your network if there's a chance to avoid it.

Agree 100%.
-- richard
On 09.09.08 19:32, Richard Elling wrote:
> Ralf Ramge wrote:
>> Richard Elling wrote:
>>>> Yes, you're right. But sadly, in the mentioned scenario of having replaced an entire drive, the entire disk is rewritten by ZFS.
>>> No, this is not true. ZFS only resilvers data.
>> Okay, I see we have a communication problem here. Probably my fault; I should have written "the entire data and metadata". I made the assumption that a 1 TB drive in an X4500 may have up to 1 TB of data on it, simply because nobody buys the 1 TB X4500 just to use 10% of the disk space - he would have bought the 250 GB, 500 GB or 750 GB model then.
>
> Actually, they do :-) Some storage vendors insist on it, to keep performance up -- short-stroking.
>
> I've done several large-scale surveys of this and the average usage is 50%. This is still a large difference in resilver times between ZFS and SVM.

There is RFE 6722786 "resilver on mirror could reduce window of vulnerability" which is aimed at reducing this difference for mirrors. See here:

http://bugs.opensolaris.org/view_bug.do?bug_id=6722786

Wbr,
Victor
Just to clarify a few items... consider a setup where we desire to use AVS to replicate the ZFS pool on a 4-drive server to like hardware. The 4 drives are set up as RAID-Z.

If we lose a drive (say #2) in the primary server, RAID-Z will take over and our data will still be "available", but the array is in a degraded state.

But what happens to the secondary server? Specifically to its bit-for-bit copy of drive #2... presumably it is still good, but ZFS will offline that disk on the primary server, replicate the metadata, and when/if I "promote" the secondary server, it will also be running in a degraded state (ie: 3 out of 4 drives). Correct?

In this scenario, my replication hasn't really bought me any increased availability... or am I missing something?

Also, if I do choose to fail over to the secondary, can I just scrub the "broken" drive (which isn't really broken, but the zpool would be inconsistent at some level with the other "online" drives) and get back to "full speed" quickly? Or will I always have to wait until one of the servers resilvers itself (from scratch?) and re-replicates itself?

Thanks in advance.

-Matt
--
This message posted from opensolaris.org
Matt Beebe wrote:
> But what happens to the secondary server? Specifically to its bit-for-bit copy of drive #2... presumably it is still good, but ZFS will offline that disk on the primary server, replicate the metadata, and when/if I "promote" the secondary server, it will also be running in a degraded state (ie: 3 out of 4 drives). Correct?

Correct.

> In this scenario, my replication hasn't really bought me any increased availability... or am I missing something?

No. You have an increase of availability when the entire primary node goes down, but you're not particularly safer when it comes to degraded zpools.

> Also, if I do choose to fail over to the secondary, can I just scrub the "broken" drive (which isn't really broken, but the zpool would be inconsistent at some level with the other "online" drives) and get back to "full speed" quickly? Or will I always have to wait until one of the servers resilvers itself (from scratch?) and re-replicates itself?

I have not tested this scenario, so I can't say anything about this.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
ralf.ramge at webde.de - http://web.de/
Ralf,

> Jim, at first: I never said that AVS is a bad product. And I never will. I wonder why you act as if you were attacked personally. To be honest, if I were a customer with the original question, such a reaction wouldn't make me feel safer.

I am sorry that my response came across that way; it was not intentional.

>>> - ZFS is not aware of AVS. On the secondary node, you'll always have to force the `zpool import` due to the unnoticed changes of metadata (zpool in use).
>> This is not true. If the primary node invokes "zpool export" while replication is still active, then a forced "zpool import" is not required. This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.
>
> Jim. A graceful shutdown of the primary node may be a valid disaster scenario in the laboratory, but it never will be in real life.

I agree with your assessment that in real life a 'zpool export' will never be done in a real disaster, but unconditionally doing a forced 'zpool import' is problematic. Prior to performing the forced import, one needs to assure that the primary node is actually down and is not in the process of booting up, or that replication is stopped and will not automatically resume. Failure to make these checks prior to a forced 'zpool import' could lead to scenarios where two or more instances of ZFS are accessing the same ZFS storage pool, each attempting to write their own metadata, and thus their own CRCs. In time this will result in CRC checksum failures on reads, followed by a ZFS-induced panic.

>>> No mechanism to prevent data loss exists, e.g. zpools can be imported when the replicator is *not* in logging mode.
>> This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.
>
> And what makes you think that I said that AVS is the problem here?
>
> And by the way, the customer doesn't care *why* there's a problem. He only wants to know *if* there's a problem.

There is a mechanism to prevent data loss here: it's AVS! This is the reasoning behind questioning the association made above of replication being part of the problem, where in fact, as replication is implemented with AVS, it is actually part of the solution. If one does not follow the guidance suggested above before invoking a forced 'zpool import', the action will likely result in on-disk CRC checksum inconsistencies within the ZFS storage pool, resulting in secondary node data loss, the initial point above. Since AVS replication is unidirectional, there is no data loss on the primary node, and when replication is resumed, AVS will undo the faulty secondary node writes, correcting the actual data loss and in time restoring 100% synchronization of the ZFS storage pool between the primary and secondary nodes.

>>> - AVS is not ZFS aware.
>> AVS is not UFS, QFS, Oracle or Sybase aware either. This makes AVS, and other host-based and controller-based replication services, multi-functional. If you desire ZFS-aware functionality, use ZFS send and recv.
>
> Yes, exactly. And that's the problem, since `zfs send` and `zfs receive` are no working solution in a fail-safe two-node environment. Again: the customer doesn't care *why* there's a problem. He only wants to know *if* there's a problem.

My takeaway from this is that both AVS and ZFS are data path services, but collectively they are not on their own a complete disaster recovery solution.
Since AVS is not aware of ZFS, and vice versa, additional software in the form of Solaris Cluster, GeoCluster or other developed software needs to provide the awareness, so that viable disaster recovery solutions can be possible, and supportable.

>>> For instance, if ZFS resilvers a mirrored disk, e.g. after replacing a drive, the complete disk is sent over the network to the secondary node, even though the replicated data on the secondary is intact.

The problem with this statement is that one cannot guarantee that the replicated data on the secondary is intact, specifically that the data is 100% identical to the non-failing side of the mirror on the primary node. Of course, if this guarantee could be assured, then an "sndradm -E ..." (equal enable) could be done, and the full disk copy could be avoided. But all is not lost...

A failure in writing to a mirrored volume almost assures that the data will be different, by at least one I/O: the one that triggered the initial failure of the mirror. The momentary upside is that AVS is interposed above the failing volume, so that the I/O will get replicated even if it failed to make it to the disk. The downside is that with ZFS (or any other mirroring software), once a failure is detected by the mirroring software, it will stop writing to the side of the mirror containing the failed disk (and thus the configured AVS replica), but will still continue to write to the non-failing side of the mirror. This assures that the good side of the mirror and the replica will be out of sync.

>> The complete disk IS NOT sent over the network to the secondary node, only those disk blocks that are re-written by ZFS.
>
> Yes, you're right. But sadly, in the mentioned scenario of having replaced an entire drive, the entire disk is rewritten by ZFS.

I have to believe that the issue being referred to is an order-of-enabling issue. One needs to enable AVS before ZFS. Let me explain.

If I have a replacement volume that has yet to be given to ZFS, it contains unknown data. Likewise, its replacement volume on the secondary node also contains unknown data (even if this volume is the one above, as it is known not to be 100% intact). If one was to enable these two volumes with "sndradm -E ...", where 'E' means equal enable, this means to the replication software that unknown data = unknown data, therefore no replication is needed to bring the two volumes into synchronization.

Now when one gives the primary node volume to ZFS as a replacement, ZFS, and thus AVS, only needs to rewrite those metadata and data blocks that are in use by ZFS on the remaining good side of the mirror. This means a full copy is avoided, unless of course the volume is full.

Conversely, if one gives the replacement volume to ZFS prior to enabling the volume in AVS for replication, then "sndradm -E ..." cannot be used, as the volumes are not starting out equal and AVS was not running to scoreboard the differences. Therefore "sndradm -e ..." must be used, and in this case the entire disk will be replicated.

> Again: and what makes you think that I said that AVS is the problem here?
>
>>> - ZFS & AVS & X4500 leads to bad error handling. The zpool may not be imported on the secondary node during the replication.
>> This behavior is the same as with a zpool on dual-ported or SAN storage, and is NOT specific to AVS.
>
> Again: and what makes you think that I said that AVS is the problem here?
> We are not on avs-discuss, Jim.

Your association of "ZFS & AVS & X4500" - this is purely a ZFS issue. The problem at hand is that a ZFS storage pool cannot be concurrently accessed by two or more instances of ZFS. This is true for both shared storage and replicated storage. This remains true even if one instance of ZFS will be operating in a read-only mode.

>> I don't understand the relevance to AVS in the prior three paragraphs?
>
> We are not on avs-discuss, Jim. The customer wanted to know what drawbacks exist in his *scenario*. Not AVS.
>
>>> - I gave AVS a set of 6 drives just for the bitmaps (using SVM soft partitions). They weren't enough; the replication was still very slow, probably because of an insane amount of head movements, and it scales badly. Putting the bitmap of a drive on the drive itself (if I remember correctly, this is recommended in one of the most referenced howto blog articles) is a bad idea. Always use ZFS on whole disks if performance and caching matter to you.
>> When you have the time, can you replace the "probably because of ..." with some real performance numbers?
>
> No problem. If you please organize a Try&Buy of two X4500 servers being sent to my address, thank you.

Done:
http://blogs.sun.com/AVS/entry/sun_storagetek_availability_suite_4
http://www.sun.com/tryandbuy/specialoffers.jsp

>>> - AVS seems to require an additional shared storage when building failover clusters with 48 TB of internal storage. That may be hard to explain to the customer. But I'm not 100% sure about this, because I just didn't find a way; I didn't ask on a mailing list for help.
>> When you have the time, can you replace the "AVS seems to ..." with some specific references to what you are referring to?
>
> The installation and configuration process, and the location where AVS wants to store the shared database. I can tell you details about it the next time I give it a try. Until then, please read the last sentence you quoted once more, thank you.

The design of AVS in a failover Sun Cluster requires shared access to AVS's cluster-wide configuration data. This data is fixed at ~16.5 MB, and must be contained on a single volume that can be concurrently accessed by all nodes in a Sun Cluster. At the time AVS was enhanced to support Sun Cluster, various options were taken under consideration, and this was the design selected, such as it may be.

FWIW: Across all of Solaris, there are various methods of maintaining persistent configuration data. Sun Cluster uses its CCR database, SVM uses its metadb database, Solaris is starting to use its SCF database (part of SMF), the list goes on and on. The AVS developers approached the Sun Cluster developers asking to use their CCR database mechanism, but at the time the answer was no. At this time it would be hard to reconsider this position.

>>> If you want a fail-over solution for important data, use the external JBODs. Use AVS only to mirror complete clusters, don't use it to replicate single boxes with local drives. And, in case OpenSolaris is not an option for you due to your company policies or support contracts, building a real cluster is also A LOT cheaper.
>> You are offering up these position statements based on what?
>
> My outline agreements, my support contracts, the partner web desk, and finally my experience with projects in high-availability scenarios with tens of thousands of servers.
>
> Jim, it's okay.
> I know that you're a project leader at Sun Microsystems and that AVS is your main concern. But if there's one thing I cannot stand, it's getting stroppy replies from someone who should know better and should have realized that he's acting publicly and in front of the people who finance his income, instead of trying to start a flame war. From now on, I leave the rest to you, because I earn my living with products of Sun Microsystems, too, and I don't want to damage either Sun or this mailing list.

My reasoning for posting not only the original but also a subsequent reply is that AVS is constantly bombarded by "war wounds", where in fact many of these stories exist due in part to the fact that developing and deploying disaster recovery or high availability solutions is not easy. ZFS is the new "battlefront", allowing for opportunities to learn about ZFS, AVS and other replication technologies. In their day, similar "war wounds" and successful "battles" have been had regarding AVS in use with UFS, QFS, VxFS, SVM, VxVM, Oracle, Sybase and others.

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.

> --
> Ralf Ramge
> Senior Solaris Administrator, SCNA, SCSA
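A rough sketch of the replacement-disk procedure Jim outlines above, with placeholder host, pool and device names; the -E enable is done on the new, still-unused volume pair before ZFS gets the disk:

  # new disk present on both nodes, not yet part of the pool: enable as "already equal"
  sndradm -E host01 /dev/rdsk/c5t4d0s0 /dev/rdsk/c5t4d0s1 \
             host02 /dev/rdsk/c5t4d0s0 /dev/rdsk/c5t4d0s1 \
             ip async g zfspool

  # only now hand the volume to ZFS; just the resilvered blocks cross the wire
  zpool replace tank c5t3d0s0 c5t4d0s0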
Matt,

> Just to clarify a few items... consider a setup where we desire to use AVS to replicate the ZFS pool on a 4-drive server to like hardware. The 4 drives are set up as RAID-Z.
>
> If we lose a drive (say #2) in the primary server, RAID-Z will take over and our data will still be "available", but the array is in a degraded state.
>
> But what happens to the secondary server? Specifically to its bit-for-bit copy of drive #2... presumably it is still good, but ZFS will offline that disk on the primary server, replicate the metadata, and when/if I "promote" the secondary server, it will also be running in a degraded state (ie: 3 out of 4 drives). Correct?

The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.

> In this scenario, my replication hasn't really bought me any increased availability... or am I missing something?

In testing with ZFS in this scenario, first of all the secondary node's zpool is not in the imported state. So if one stops replication, or there is a primary node failure, a zpool import operation will need to be done on the secondary node. In all my testing to date, ZFS does the correct thing, realizing that one disk had failed out of the RAID set on the primary, and thus not to use it on the secondary. In short, ZFS knows that the RAID set is degraded, was being maintained in a degraded state, and this fact was replicated to the secondary node, correctly.

> Also, if I do choose to fail over to the secondary, can I just scrub the "broken" drive (which isn't really broken, but the zpool would be inconsistent at some level with the other "online" drives) and get back to "full speed" quickly? Or will I always have to wait until one of the servers resilvers itself (from scratch?) and re-replicates itself?
>
> Thanks in advance.
>
> -Matt

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:
> The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.

Does raidz fall into that category? Since the parity is maintained only on written blocks rather than all disk blocks on all columns, it seems to be resistant to this issue.

--
Darren
On Sep 11, 2008, at 11:19 AM, A Darren Dunham wrote:
> On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:
>> The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.
>
> Does raidz fall into that category?

Yes. The key reason is that as soon as ZFS (or other mirroring software) detects a disk failure in a RAID >1 set, it will stop writing to the failed disk, which also means it will stop writing to the replica of the failed disk. From the point of view of the remote node, the replica of the failed disk is no longer being updated.

Now if replication was stopped, or the primary node powered off or panicked, then during the import of the ZFS storage pool on the secondary node the replica of the failed disk must not be part of the ZFS storage pool, as its data is stale. This happens automatically, since the ZFS metadata on the remaining disks has already given up on this member of the RAID set.

> Since the parity is maintained only on written blocks rather than all disk blocks on all columns, it seems to be resistant to this issue.
>
> --
> Darren

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
On Thu, Sep 11, 2008 at 04:28:03PM -0400, Jim Dunham wrote:
> On Sep 11, 2008, at 11:19 AM, A Darren Dunham wrote:
>> On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:
>>> The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.
>>
>> Does raidz fall into that category?
>
> Yes. The key reason is that as soon as ZFS (or other mirroring software) detects a disk failure in a RAID >1 set, it will stop writing to the failed disk, which also means it will stop writing to the replica of the failed disk. From the point of view of the remote node, the replica of the failed disk is no longer being updated.
>
> Now if replication was stopped, or the primary node powered off or panicked, then during the import of the ZFS storage pool on the secondary node the replica of the failed disk must not be part of the ZFS storage pool, as its data is stale. This happens automatically, since the ZFS metadata on the remaining disks has already given up on this member of the RAID set.

Then I misunderstood what you were talking about. Why the restriction on RAID >1 for your statement? Even for a mirror, the data is stale and it's removed from the active set. I thought you were talking about block parity run across columns...

--
Darren
On Sep 11, 2008, at 5:16 PM, A Darren Dunham wrote:
> On Thu, Sep 11, 2008 at 04:28:03PM -0400, Jim Dunham wrote:
>> On Sep 11, 2008, at 11:19 AM, A Darren Dunham wrote:
>>> On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:
>>>> The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.
>>>
>>> Does raidz fall into that category?
>>
>> Yes. The key reason is that as soon as ZFS (or other mirroring software) detects a disk failure in a RAID >1 set, it will stop writing to the failed disk, which also means it will stop writing to the replica of the failed disk. From the point of view of the remote node, the replica of the failed disk is no longer being updated.
>>
>> Now if replication was stopped, or the primary node powered off or panicked, then during the import of the ZFS storage pool on the secondary node the replica of the failed disk must not be part of the ZFS storage pool, as its data is stale. This happens automatically, since the ZFS metadata on the remaining disks has already given up on this member of the RAID set.
>
> Then I misunderstood what you were talking about. Why the restriction on RAID >1 for your statement?

No restriction. I meant to say, RAID 1 or greater.

> Even for a mirror, the data is stale and it's removed from the active set. I thought you were talking about block parity run across columns...
>
> --
> Darren

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
work: 781-442-4042
cell: 603.724.2972
Sorry, I popped up to Hokkaido for a holiday. I want to thank you all for the replies.

I mentioned AVS as I thought it to be the only product close to enabling us to do a (makeshift) fail-over setup. We have 5-6 ZFS filesystems, and 5-6 zvols with UFS (for quotas). To do "zfs send" snapshots every minute might perhaps be possible (just not very attractive), but if the script dies at any time, you need to resend the full volumes; this currently takes 5 days (even using "nc").

Since we are forced by the vendor to run Sol 10, it sounds like AVS is not an option for us.

If we were interested in finding a method to replicate data to a 2nd X4500, what other options are there for us? We do not need instant updates, just someplace to fail over to when the X4500 panics, or an HDD dies (which equals a panic). It currently takes 2 hours to fsck the UFS volumes after a panic (and yes, they are logging; it is actually just the one UFS volume that always needs fsck).

The vendor has mentioned "Veritas Volume Replicator", but I was under the impression that Veritas is a whole different set to zfs/zpool.

Lund

Jim Dunham wrote:
> On Sep 11, 2008, at 5:16 PM, A Darren Dunham wrote:
>> On Thu, Sep 11, 2008 at 04:28:03PM -0400, Jim Dunham wrote:
>>> On Sep 11, 2008, at 11:19 AM, A Darren Dunham wrote:
>>>> On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:
>>>>> The issue with any form of RAID >1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.
>>>> Does raidz fall into that category?
>>> Yes. The key reason is that as soon as ZFS (or other mirroring software) detects a disk failure in a RAID >1 set, it will stop writing to the failed disk, which also means it will stop writing to the replica of the failed disk. From the point of view of the remote node, the replica of the failed disk is no longer being updated.
>>>
>>> Now if replication was stopped, or the primary node powered off or panicked, then during the import of the ZFS storage pool on the secondary node the replica of the failed disk must not be part of the ZFS storage pool, as its data is stale. This happens automatically, since the ZFS metadata on the remaining disks has already given up on this member of the RAID set.
>> Then I misunderstood what you were talking about. Why the restriction on RAID >1 for your statement?
>
> No restriction. I meant to say, RAID 1 or greater.
>
>> Even for a mirror, the data is stale and it's removed from the active set. I thought you were talking about block parity run across columns...
>
> Jim Dunham
> Engineering Manager
> Storage Platform Software Group
> Sun Microsystems, Inc.
> work: 781-442-4042
> cell: 603.724.2972

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
Jorgen Lundman wrote:> If we were interested in finding a method to replicate data to a 2nd > x4500, what other options are there for us?If you already have an X4500, I think the best option for you is a cron job with incremental 'zfs send'. Or rsync. -- Ralf Ramge Senior Solaris Administrator, SCNA, SCSA Tel. +49-721-91374-3963 ralf.ramge at webde.de - http://web.de/ 1&1 Internet AG Brauerstraße 48 76135 Karlsruhe Amtsgericht Montabaur HRB 6484 Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss Aufsichtsratsvorsitzender: Michael Scheeren
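A minimal sketch of such a cron job, assuming a filesystem "tank/home", a standby host "x4500-02" reachable with key-based ssh, and that an initial full send/receive has already seeded the standby (all names are placeholders; error handling and snapshot cleanup are omitted):

#!/bin/sh
# incremental replication: snapshot, send only the delta, remember the snapshot
FS=tank/home
HOST=x4500-02
STATE=/var/run/last-repl-snap
NOW=repl-`date +%Y%m%d%H%M`
PREV=`cat $STATE`
zfs snapshot $FS@$NOW
# -F rolls the receiving side back to the last common snapshot before applying the delta
zfs send -i $FS@$PREV $FS@$NOW | ssh $HOST zfs receive -F $FS && echo $NOW > $STATE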
On Tue, Sep 16, 2008 at 11:51 PM, Ralf Ramge <ralf.ramge at webde.de> wrote:> Jorgen Lundman wrote: > >> If we were interested in finding a method to replicate data to a 2nd >> x4500, what other options are there for us? > > If you already have an X4500, I think the best option for you is a cron > job with incremental 'zfs send'. Or rsync. > > -- > > Ralf Ramge > Senior Solaris Administrator, SCNA, SCSA >We had some Sun reps come out the other day to talk to us about storage options, and part of the discussion was AVS replication with ZFS. I brought up the question of replicating the resilvering process, and the reps said it does not replicate. They may be mistaken, but I'm hopeful they are correct. Could this behavior have been changed recently on AVS to make replication 'smarter' with ZFS as the underlying filesystem? -- Brent Jones brent at servuhome.net
Brent,> On Tue, Sep 16, 2008 at 11:51 PM, Ralf Ramge <ralf.ramge at webde.de> > wrote: >> Jorgen Lundman wrote: >> >>> If we were interested in finding a method to replicate data to a 2nd >>> x4500, what other options are there for us? >> >> If you already have an X4500, I think the best option for you is a >> cron >> job with incremental 'zfs send'. Or rsync. >> >> -- >> >> Ralf Ramge >> Senior Solaris Administrator, SCNA, SCSA >> > > We had some Sun reps come out the other day to talk to us about > storage options, and part of the discussion was AVS replication with > ZFS. > I brought up the question of replicating the resilvering process, and > the reps said it does not replicate. They may be mistaken, but I'm > hopeful they are correct.The resilvering process is replicated, as AVS cannot differentiate between ZFS resilvering writes and ZFS filesystem writes.> Could this behavior have been changed recently on AVS to make > replication 'smarter' with ZFS as the underlying filesystem?No 'smarter' changes have been made to AVS. The issue at hand is that as soon as ZFS makes the decision to not write to one of its configured vdevs, that vdev and its replica will now contain stale data. When ZFS is told to use the vdev again (via a zpool replace), ZFS starts resilvering all in-use data, plus any new ZFS filesystem writes on the local vdev, both of which will be replicated by AVS. It is the mixture of both resilvering writes and new ZFS filesystem writes that makes it impossible for AVS to make replication 'smarter'.> -- > Brent Jones > brent at servuhome.net > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discussJim Dunham Engineering Manager Storage Platform Software Group Sun Microsystems, Inc.
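For example (pool and device names made up), the replace that triggers such a resilver is simply:

# replace the failed disk; the resilver writes that follow look like ordinary
# writes to AVS and are therefore replicated to the remote node as well
zpool replace tank c1t4d0 c1t5d0
zpool status tank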
Jim Dunham wrote:> It is the mixture of both resilvering writes and new ZFS filesystem > writes that makes it impossible for AVS to make replication 'smarter'.Jim is right here. I just want to add that I don't see an obvious way to make AVS as "smart" as Brent may wish it to be. Sometimes I describe AVS as a low-level service with some proxy functionalities. That's not really correct, but good enough for a single PowerPoint sheet. AVS receives the writes from the file system, and replicates them. It does not care about the contents of the transactions, just as IP doesn't take care of the responsibilities of higher-layer protocols like TCP or even layer-7 data (bad comparison, I know, but it may help to understand what I mean). What AVS does is copy the contents of devices. A file system writes some data to a sector on a hard disk -> AVS is aware of this transaction -> AVS replicates the sector to the second host -> on the secondary host, AVS makes sure that *exactly* the same data is written to *exactly* the same position on the secondary host's storage device. Your secondary storage is a 100% copy. And if you write a bazillion 0-byte sectors to the disk with `dd`, AVS will make sure that the secondary does it, too. And it does this in near real time (if you ignore the network bottlenecks). The downside of it: it's easy to do something wrong, and you may run into network bottlenecks due to a higher amount of traffic. What AVS can't offer: file-based replication. In many cases, you don't have to care about having an exact copy of a device. For example, if you want a standby solution for your NFS file server, you want to keep the contents of the files and directories in sync. You don't care if a newly written file uses the same inode number. You only care that the file is copied to your backup host while the file system of the backup host is *mounted*. The best-known service for this functionality is `rsync`. And if you know rsync, you know the downside of these services, too: don't even think about replicating your data in real time and/or to multiple servers. The challenge is to find out which kind of replication suits your concept better. For instance, if you want to replicate html pages, graphics or other documents, perhaps even with a "copy button" on an intranet page, file-based replication is your friend. If you need real-time copying or device replication, for instance on a database server with its own file system, or for keeping configuration files in sync across a cluster, then AVS is your best bet. But let's face it: everybody wants the best of both worlds, and so people ask if AVS could not just get smarter. The answer: no, not really. It can't check if the file system's write operations "make sense" or if the data "really needs to be replicated". AVS is a truck which guarantees fast and accurate delivery of whatever you throw into it. Taking care of the content itself is the job of the person who prepares the freight. And, in our case, this person is called UFS. Or ZFS. And ZFS could do a much better job here. Sun's marketing sells ZFS as offering data integrity at *all times* (http://www.sun.com/2004-0914/feature/). Well, that's true, at least as long as there is no problem on lower layers. And I often wondered if ZFS doesn't offer something fsck-like for faulted pools because it's technically impossible, or because the marketing guys forbade it.
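A file-based counterpart in its simplest form, with made-up host and path names, would be something like:

#!/bin/sh
# copy the live file tree to the standby host; the standby's file system stays
# mounted and usable, but this is neither real-time nor block-exact
rsync -aH --delete /tank/home/ x4500-02:/tank/home/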
I also wondered why people are enthusiastic about gimmicks like ditto blocks, but don't want data protection in case an X4540 suffers a power outage and lots of gigabytes of ZFS cache go down the drain. Proposal: ZFS should offer some kind of "IsReplicated" flag in the zpool metadata. During a `zpool import`, this flag should be checked and, if it is set, a corresponding error message should be printed on stdout. Or the ability to set dummy zpool parameters, something like a "zpool set storage:cluster:avs=true tank". This would only be some kind of first aid, but that's better than nothing. This is not specific to AVS; it also applies to other replication services. It would allow us to write simple wrapper scripts to switch the replication mechanism into logging mode, thus allowing us to safely force the import of the zpool in case of a disaster. Of course, it would be even better to integrate AVS into ZFS itself. "zfs set replication=<hostname1>[,<hostname2>...<hostnameN>]" would be the coolest thing on earth, because it would combine the benefits of AVS and rsync-like replication into a perfect product. And it would allow the marketing people to use the "high availability" and "full data redundancy" buzzwords in their flyers. But until then, I'll have to continue using cron jobs on the secondary node which try to log in to the primary with ssh and do a "zfs get storage:cluster:avs <filesystem>" on all mounted file systems and save it locally for my "zpool import wrapper" script. This is a cheap workaround, but honestly: you can use something like this for your own datacenter, but I bet nobody wants to sell it to a customer as a supported solution ;-) -- Ralf Ramge Senior Solaris Administrator, SCNA, SCSA Tel. +49-721-91374-3963 ralf.ramge at webde.de - http://web.de/ 1&1 Internet AG Brauerstraße 48 76135 Karlsruhe Amtsgericht Montabaur HRB 6484 Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss Aufsichtsratsvorsitzender: Michael Scheeren
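A rough sketch of that workaround (host names and the user property are only examples; ZFS user properties just need to contain a colon):

#!/bin/sh
# cron job on the secondary node: cache the primary's replication flags locally
# so a "zpool import" wrapper can check them before forcing an import
# (on the primary, file systems were tagged once with something like:
#   zfs set storage:cluster:avs=true tank/home)
PRIMARY=x4500-01
ssh $PRIMARY zfs get -H -o name,value storage:cluster:avs > /var/tmp/avs-flags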