hi,

are there any RFEs or plans to create a 'continuous' replication mode for ZFS?
i envisage it working something like this: a 'zfs send' on the sending host
monitors the pool/filesystem for changes, and immediately sends them to the
receiving host, which applies the change to the remote pool.

currently i crontab zfs send | zfs recv for this, but even a small delay (10
minutes) makes it less useful than continuous replication, as most changed or
newly created files in our workload are accessed shortly after creation.

(i searched sunsolve with a few keywords, but didn't find anything like this.)

	- river.
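(as a rough sketch of the cron-driven approach described above: something
along these lines, where the dataset, host and state-file names are
placeholders and an initial full send is assumed to have been done already.
Solaris cron has no */10 shorthand, hence the explicit minute list.)

    # crontab entry: run the replication script every 10 minutes
    0,10,20,30,40,50 * * * * /usr/local/bin/zfs-replicate.sh

    #!/bin/sh
    # zfs-replicate.sh: snapshot, then ship the delta since the last sent snapshot
    FS=tank/data
    PREV=`cat /var/run/zfs-repl.last`       # snapshot shipped on the previous run
    NEW=repl-`date +%Y%m%d%H%M`
    zfs snapshot $FS@$NEW
    # only advance the bookmark if the receive succeeded
    zfs send -i $FS@$PREV $FS@$NEW | ssh backuphost zfs recv -F tank/data \
        && echo $NEW > /var/run/zfs-repl.last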
On Thu 13/11/08 12:04 , River Tarnell river at loreley.flyingparchment.org.uk sent:
>
> are there any RFEs or plans to create a 'continuous' replication mode for
> ZFS? i envisage it working something like this: a 'zfs send' on the sending
> host monitors the pool/filesystem for changes, and immediately sends them to
> the receiving host, which applies the change to the remote pool.
>
> currently i crontab zfs send | zfs recv for this, but even a small delay
> (10 minutes) makes it less useful than continuous replication, as most
> changed or newly created files in our workload are accessed shortly after
> creation.

Would it be a practical solution? I doubt zfs receive would be able to keep
pace with any non-trivial update rate.

Mirroring iSCSI or a dedicated HA tool would be a better solution.

-- 
Ian.
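(a hedged sketch of what 'mirroring iSCSI' could look like here: export a zvol
from the backup host and attach it as the second half of a mirror on the
primary. the host address, volume size and device names below are invented,
and the device path the initiator actually sees will differ.)

    # on the backup host: carve out a zvol and export it as an iSCSI target
    zfs create -V 500g backup/primary-mirror
    zfs set shareiscsi=on backup/primary-mirror

    # on the primary host: discover the target and attach it to the pool
    iscsiadm add discovery-address 192.168.1.20
    iscsiadm modify discovery --sendtargets enable
    devfsadm -i iscsi
    # the LUN now appears as an ordinary disk; attach it as a mirror of the
    # existing device so every write is replicated at block level
    zpool attach tank c0t1d0 c2t01000003BA0E0CB3d0

note that this replicates blocks synchronously, but the pool can still only be
imported on one host at a time, which is the limitation river raises in his
reply.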
Ian Collins:
> I doubt zfs receive would be able to keep pace with any non-trivial update
> rate.

one could consider this a bug in zfs receive :)

> Mirroring iSCSI or a dedicated HA tool would be a better solution.

i'm not sure how to apply iSCSI here; the pool needs to be mounted at least
read-only on both hosts at the same time. (also suggested was AVS, which
doesn't allow keeping the pool mounted on the slave.) at least Solaris
Cluster, from what i've seen, doesn't allow this either; failover is handled
by importing the pool on the surviving node.

	- river.
On Wed, Nov 12, 2008 at 3:40 PM, River Tarnell
<river at loreley.flyingparchment.org.uk> wrote:
>
> Ian Collins:
>> I doubt zfs receive would be able to keep pace with any non-trivial update
>> rate.
>
> one could consider this a bug in zfs receive :)
>
>> Mirroring iSCSI or a dedicated HA tool would be a better solution.
>
> i'm not sure how to apply iSCSI here; the pool needs to be mounted at least
> read-only on both hosts at the same time. (also suggested was AVS, which
> doesn't allow keeping the pool mounted on the slave.) at least Solaris
> Cluster, from what i've seen, doesn't allow this either; failover is handled
> by importing the pool on the surviving node.
>
> - river.

It sounds like you need either a true clustering file system, or to scale back
your plans to see changes read-only instantly on the secondary node.

What kind of link do you plan between these nodes? Would the link keep up with
non-trivial updates?

-- 
Brent Jones
brent at servuhome.net
As an aside, replication has been implemented as part of the new Storage 7000
family. Here's a link to a blog discussing using the 7000 Simulator running in
two separate VMs and replicating with each other:

http://blogs.sun.com/pgdh/entry/fun_with_replicating_the_sun

I'm not sure of the specifics of how, but it might provide ideas of how it can
be accomplished.

Regards.

-------- Original Message --------
Subject: Re: [zfs-discuss] continuous replication
From: Brent Jones <brent at servuhome.net>
To: Ian Collins <ian at ianshome.com>, zfs-discuss at opensolaris.org
Date: Wed Nov 12 16:46:37 2008

> It sounds like you need either a true clustering file system, or to scale
> back your plans to see changes read-only instantly on the secondary node.
>
> What kind of link do you plan between these nodes? Would the link keep up
> with non-trivial updates?
Brent Jones:
> It sounds like you need either a true clustering file system, or to scale
> back your plans to see changes read-only instantly on the secondary node.

well, the idea is to have two separate copies of the data, for backup / DR.
being able to serve content from the backup host is only a benefit, not a
requirement.

> What kind of link do you plan between these nodes? Would the link keep up
> with non-trivial updates?

the change rate is fairly low, perhaps 50GB/day, and all the changes are done
over NFS, which is not especially fast to begin with. currently the connection
is GE, but it would be easy enough to do link aggregation if needed.

	- river.
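(for the link-aggregation step mentioned above, roughly, on Solaris: this
assumes two spare e1000g interfaces and a switch configured for 802.3ad; the
interface names and address are placeholders.)

    # bundle two GE interfaces into aggregation key 1
    dladm create-aggr -d e1000g1 -d e1000g2 1
    # enable LACP negotiation with the switch
    dladm modify-aggr -l active 1
    # plumb the aggregation like any other interface
    ifconfig aggr1 plumb 192.168.1.10 netmask 255.255.255.0 up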
Daryl Doami:
> As an aside, replication has been implemented as part of the new Storage
> 7000 family. Here's a link to a blog discussing using the 7000
> Simulator running in two separate VMs and replicating with each other:

that's interesting, although 'less than a minute later' makes me suspect they
might just be using snapshots and send/recv?

presumably, if fishworks is based on (Open)Solaris, any new ZFS features they
created will make it back into Solaris proper eventually...

	- river.
On Wed, Nov 12, 2008 at 5:58 PM, River Tarnell
<river at loreley.flyingparchment.org.uk> wrote:
>
> that's interesting, although 'less than a minute later' makes me suspect they
> might just be using snapshots and send/recv?
>
> presumably, if fishworks is based on (Open)Solaris, any new ZFS features they
> created will make it back into Solaris proper eventually...
>
> - river.

Yah, from what I can tell, it's using an already-there-but-easier-to-look-at
approach. Not belittling the accomplishment: rolling all the system tools into
a coherent package is great, and the analytics are just awesome.

I am doing a similar project, and weighed several options for replication. AVS
was coveted for its near-real-time replication and the ability to "switch
directions" and replicate back to the primary after a fail-over. But some AVS
limitations [1] are probably going to make us use zfs send/receive, and it
should keep up (delta per day is ~100GB).

We will be testing both methods here in the next few weeks, and will keep the
list posted on our findings.

[1] sending drive rebuilds over the link sucks

-- 
Brent Jones
brent at servuhome.net
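(one thing worth testing alongside plain ssh is buffering the stream, so zfs
recv isn't starved by bursts in the send stream. a sketch using mbuffer, which
is a third-party tool rather than part of Solaris; the port and buffer sizes
are guesses, not tuned values.)

    # on the receiving host: listen on a port, buffer, and feed zfs recv
    mbuffer -s 128k -m 1G -I 9090 | zfs recv -F backup/data

    # on the sending host: stream the incremental into the remote buffer
    zfs snapshot tank/data@repl-new
    zfs send -i tank/data@repl-prev tank/data@repl-new | \
        mbuffer -s 128k -m 1G -O backuphost:9090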
Ian Collins wrote:
> On Thu 13/11/08 12:04 , River Tarnell river at loreley.flyingparchment.org.uk sent:
>> are there any RFEs or plans to create a 'continuous' replication mode for
>> ZFS? i envisage it working something like this: a 'zfs send' on the sending
>> host monitors the pool/filesystem for changes, and immediately sends them to
>> the receiving host, which applies the change to the remote pool.
>>
>> currently i crontab zfs send | zfs recv for this, but even a small delay
>> (10 minutes) makes it less useful than continuous replication, as most
>> changed or newly created files in our workload are accessed shortly after
>> creation.
>
> Would it be a practical solution? I doubt zfs receive would be able to keep
> pace with any non-trivial update rate.
>
> Mirroring iSCSI or a dedicated HA tool would be a better solution.

Any block-level replication will happily replicate block-level corruption
(especially that caused by file system bugs). This can be a Very Bad Thing.
Also, without some cluster file system you can't run backups off the replica
(which is frequently desirable).

NetApp has a "best effort" real-time snapmirror that does what the OP wants.
ZFS could do the same thing, and it would be very nice to have. A real
synchronous option would also be nice (yes, it would probably slow down writes
on the master side, but sometimes that's an OK tradeoff for zero loss of
"committed" data when your main data center goes buh-bye).

-- 
Carson
>>>>> "rt" == River Tarnell <river at loreley.flyingparchment.org.uk> writes:rt> currently i crontab zfs send | zfs recv for this doesn''t it also fall over if the stream falls behind? I mean, what if it takes longer than ten minutes? What if the backup node goes away and then comes back? What if the master node panics half way through a restore? this is all in addition to the performance and stability problems in manual zfs sending. My point is that I don''t think the 10min delay is the most significant difference between AVS/snapmirror and a ''zfs send'' cronjob. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081113/351777e8/attachment.bin>
Miles Nordin:
>     rt> currently i crontab zfs send | zfs recv for this
>
> My point is that I don't think the 10min delay is the most significant
> difference between AVS/snapmirror and a 'zfs send' cronjob.

i didn't intend to suggest there was any similarity between send|recv and AVS.
i don't think i actually mentioned AVS, except in passing to note i'd rejected
it as a solution (since it doesn't allow mounting the filesystem on both
systems simultaneously).

having said that, the homemade script i use now has been fairly resilient to
errors (including starting another copy when the previous one hasn't finished,
and the receiving system going away during a send). when it encounters
something unexpected, it fails gracefully and a human can fix the problem.
it's not AVS, but for our environment it works fine. the only thing it's
missing is the ability to 'push' changes continuously, rather than polling
every 10 minutes.

	- river.
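(river's script isn't posted in the thread; purely as an illustration of the
behaviour he describes, a wrapper along these lines covers the two failure
cases mentioned, overlapping runs and a receiver that disappears mid-send. all
paths and names are invented.)

    #!/bin/sh
    # hypothetical replication wrapper, not the actual script from the thread
    LOCK=/var/run/zfs-repl.lock
    STATE=/var/run/zfs-repl.last
    FS=tank/data
    REMOTE=backuphost

    # refuse to start if the previous run is still going
    if ! mkdir $LOCK 2>/dev/null; then
        echo "previous replication still running, skipping" >&2
        exit 0
    fi
    trap 'rmdir $LOCK' 0

    PREV=`cat $STATE`
    NEW=repl-`date +%Y%m%d%H%M%S`
    zfs snapshot $FS@$NEW

    # if the receiver is unreachable or the receive fails, leave the state
    # file alone so the next run retries the same delta, and tell a human
    if zfs send -i $FS@$PREV $FS@$NEW | ssh $REMOTE zfs recv -F $FS; then
        echo $NEW > $STATE
    else
        echo "replication of $FS@$NEW failed" >&2
        exit 1
    fi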
River Tarnell wrote:
> Daryl Doami:
>> As an aside, replication has been implemented as part of the new Storage
>> 7000 family. Here's a link to a blog discussing using the 7000
>> Simulator running in two separate VMs and replicating with each other:
>
> that's interesting, although 'less than a minute later' makes me suspect they
> might just be using snapshots and send/recv?

That's correct. The question is: why isn't that sufficient? What are you
really after? If you want synchronous replication (i.e., writes are blocked
until they're replicated to another host), that's a very different problem.
But your original post suggested:

> a 'zfs send' on the sending host
> monitors the pool/filesystem for changes, and immediately sends them to the
> receiving host, which applies the change to the remote pool.

This is asynchronous, and isn't really different from running zfs send/recv in
a loop. Whether the loop is in userland or in the kernel, either way you're
continuously pushing changes across the wire.

> presumably, if fishworks is based on (Open)Solaris, any new ZFS features they
> created will make it back into Solaris proper eventually...

Replication in the 7000 series is mostly built _on top of_ the existing ZFS
infrastructure.

-- Dave

-- 
David Pacheco, Sun Microsystems Fishworks. http://blogs.sun.com/dap/
*snip*
> This is asynchronous, and isn't really different from running zfs send/recv
> in a loop. Whether the loop is in userland or in the kernel, either way
> you're continuously pushing changes across the wire.
>
> Replication in the 7000 series is mostly built _on top of_ the existing ZFS
> infrastructure.

Sun advertises Active/Active replication on the 7000; how is this possible?
Can send/receive operate bi-directionally, so that changes on either side are
reflected on both?

I always visualized send/receive as only being useful in Active/Passive
situations, where you must only perform operations on the primary and, should
fail-over occur, switch to the secondary.

-- 
Brent Jones
brent at servuhome.net
Brent Jones wrote:
> Sun advertises Active/Active replication on the 7000; how is this possible?
> Can send/receive operate bi-directionally, so that changes on either side are
> reflected on both?
>
> I always visualized send/receive as only being useful in Active/Passive
> situations, where you must only perform operations on the primary and, should
> fail-over occur, switch to the secondary.

I think you're confusing our clustering feature with the remote replication
feature. With active-active clustering, you have two closely linked head nodes
serving files from different zpools using JBODs connected to both head nodes.
When one fails, the other imports the failed node's pool and can then serve
those files. With remote replication, one appliance sends filesystems and
volumes across the network to an otherwise separate appliance. Neither of
these is performing synchronous data replication, though.

For more on clustering, I'll refer you to Keith's blog:
http://blogs.sun.com/wesolows/entry/low_availability_clusters

-- Dave

-- 
David Pacheco, Sun Microsystems Fishworks. http://blogs.sun.com/dap/
> I think you're confusing our clustering feature with the remote
> replication feature. With active-active clustering, you have two closely
> linked head nodes serving files from different zpools using JBODs
> connected to both head nodes. When one fails, the other imports the
> failed node's pool and can then serve those files. With remote
> replication, one appliance sends filesystems and volumes across the
> network to an otherwise separate appliance. Neither of these is
> performing synchronous data replication, though.

That is _not_ active-active, that is active-passive.

If you have an active-active system I can access the same data via both
controllers at the same time. I can't if it works like you just described. You
can't call it active-active just because different volumes are controlled by
different controllers. Most active-passive RAID controllers can do that.

The data sheet talks about active-active clusters; how does that work?
On Fri, Nov 14, 2008 at 10:48:25PM +0100, Mattias Pantzare wrote:
> That is _not_ active-active, that is active-passive.
>
> If you have an active-active system I can access the same data via both
> controllers at the same time. I can't if it works like you just described.
> You can't call it active-active just because different volumes are controlled
> by different controllers. Most active-passive RAID controllers can do that.
>
> The data sheet talks about active-active clusters; how does that work?

What the Sun Storage 7000 Series does would more accurately be described as
dual active-passive.

Adam

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Adam Leventhal wrote:
> On Fri, Nov 14, 2008 at 10:48:25PM +0100, Mattias Pantzare wrote:
>
>> That is _not_ active-active, that is active-passive.
>>
>> If you have an active-active system I can access the same data via both
>> controllers at the same time. I can't if it works like you just described.
>> You can't call it active-active just because different volumes are
>> controlled by different controllers. Most active-passive RAID controllers
>> can do that.
>>
>> The data sheet talks about active-active clusters; how does that work?
>
> What the Sun Storage 7000 Series does would more accurately be described as
> dual active-passive.

This is ambiguous in the cluster market. It is common to describe HA clusters
where each node can be offering services concurrently as active/active, even
though the services themselves are active/passive. This is to appease folks
who feel that idle secondary servers are a bad thing.

For services which you might think of as "active on all nodes providing the
exact same view of the data and service" we usually use the terms "scalable
service," "parallel database," or "cluster file system."

-- richard
On Sat, Nov 15, 2008 at 00:46, Richard Elling <Richard.Elling at sun.com> wrote:
> Adam Leventhal wrote:
>> What the Sun Storage 7000 Series does would more accurately be described
>> as dual active-passive.
>
> This is ambiguous in the cluster market. It is common to describe HA clusters
> where each node can be offering services concurrently as active/active, even
> though the services themselves are active/passive. This is to appease folks
> who feel that idle secondary servers are a bad thing.

But this product is not in the cluster market; it is in the storage market.

By your definition virtually all dual-controller RAID boxes are active/active.
You should talk to Veritas so that they can change all their documentation...

Active/active and active/passive have real technical meanings; don't let
marketing destroy that!
Hi All;

Accessing the same data (RAID group) from different controllers does slow down
the system considerably. All modern controllers will demand that the
administrator choose a primary controller for each RAID group. Two controllers
accessing the same data will require drive interface switching between ports,
the controllers will not be able to optimize head movement, caching will
suffer due to duplicate records on both controllers, and there will be a lot
of data transfer between the controllers. Only very few disk systems support
multi-controller access to the same data, and when you read their best
practice documents you will notice that it is not recommended.

Best regards
Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at sun.com

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org
[mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Mattias Pantzare
Sent: Friday, November 14, 2008 11:48 PM
To: David Pacheco
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] continuous replication

That is _not_ active-active, that is active-passive.

If you have an active-active system I can access the same data via both
controllers at the same time. I can't if it works like you just described. You
can't call it active-active just because different volumes are controlled by
different controllers. Most active-passive RAID controllers can do that.

The data sheet talks about active-active clusters; how does that work?
Hello Mattias,

Saturday, November 15, 2008, 12:24:05 AM, you wrote:

MP> But this product is not in the cluster market; it is in the storage market.

MP> By your definition virtually all dual-controller RAID boxes are
MP> active/active. You should talk to Veritas so that they can change all
MP> their documentation...

MP> Active/active and active/passive have real technical meanings; don't let
MP> marketing destroy that!

I thought that when you can access the same LUN via different controllers you
have a symmetric disk array, and when you can't, an asymmetric one. It has
nothing to do with active-active or active-standby. Most of the disk arrays on
the market are active-active and asymmetric.

-- 
Best regards,
 Robert Milkowski                          mailto:milek at task.gda.pl
                                           http://milek.blogspot.com