Hi all,

I'm just wondering (I figure you can do this, but I don't know what hardware and stuff I would need) whether I can set up a mirror of a raidz zpool across a network.

Basically, the setup is this: a large volume of hi-def video is being streamed from a camera onto an editing timeline, and written to a network share. Because of the large amounts of data, ZFS is a really good option for us. But we need a backup. We need to do it on generic hardware (I was thinking AMD64 boxes with arrays of large 7200 rpm hard drives), so I'm planning to have one box mirroring the other. They will be connected by gigabit Ethernet. So my question is: how do I mirror one raidz array across the network to the other?

Thanks for all your help,
Mark
Mark wrote:
> I'm just wondering (I figure you can do this, but I don't know what hardware and stuff I would need) whether I can set up a mirror of a raidz zpool across a network.
> [...]
> So my question is: how do I mirror one raidz array across the network to the other?

rsync? zfs send/recv? AVS? iSCSI targets on the two boxes? Lots of ways to do it. Depends what your definition of backup is. Time based? Extra redundancy?
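[Of the options above, zfs send/recv is the simplest to sketch. A minimal illustration of the initial full copy, with made-up pool, filesystem and host names ("tank/video", "backuphost", "backup/video"):

    # take a snapshot and send a full copy of it to the backup box;
    # backup/video must not already exist on the target pool
    zfs snapshot tank/video@base
    zfs send tank/video@base | ssh backuphost zfs receive backup/video

After that, only incremental sends of the changes since the last snapshot are needed.]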
On Sun, Aug 19, 2007 at 05:45:18PM -0700, Mark wrote:
> Basically, the setup is this: a large volume of hi-def video is being streamed from a camera onto an editing timeline, and written to a network share. Because of the large amounts of data, ZFS is a really good option for us. But we need a backup. We need to do it on generic hardware (I was thinking AMD64 boxes with arrays of large 7200 rpm hard drives), so I'm planning to have one box mirroring the other. They will be connected by gigabit Ethernet. So my question is: how do I mirror one raidz array across the network to the other?

One big decision you need to make in this scenario is whether you want true synchronous replication, or whether asynchronous replication, possibly with some time bound, is acceptable. With the former, each byte must traverse the network before the write is acknowledged to the client; with the latter, data is written locally and transmitted shortly afterwards. Synchronous replication obviously imposes a much larger performance hit, but asynchronous replication means you may lose data from some recent period (though the data will always be consistent).

Adam

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
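[As an illustration of the time-bound asynchronous approach Adam describes, a periodic incremental zfs send/recv loop might look roughly like this. The pool, filesystem and host names are placeholders, and this is a sketch rather than a hardened script; it assumes an initial full send/receive of tank/video@base has already been done:

    #!/bin/sh
    # Minimal incremental replication loop (illustration only).
    PREV=base
    while true; do
        NOW=snap.`date +%Y%m%d%H%M%S`
        zfs snapshot tank/video@$NOW
        # send only the changes since the previous snapshot;
        # -F rolls back any stray changes on the backup side
        zfs send -i tank/video@$PREV tank/video@$NOW | \
            ssh backuphost "zfs receive -F backup/video"
        zfs destroy tank/video@$PREV
        PREV=$NOW
        sleep 300    # the backup box lags by at most ~5 minutes plus transfer time
    done
]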
Torrey McMahon wrote:
> AVS?

Jim Dunham will probably shoot me, or worse, but I recommend thinking twice about using AVS for ZFS replication. Basically, you only have a few options:

1) Using a battery-buffered hardware RAID controller, which leads to bad ZFS performance in many cases,
2) Building three-way mirrors to avoid complete data loss in several disaster scenarios due to missing ZFS recovery mechanisms like `fsck`, which makes AVS/ZFS-based solutions quite expensive,
3) Additionally using another form of backup, e.g. tapes.

For instance, one scenario which made me think: imagine you have an X4500. 48 internal disks, 500 GB each. This would lead to a ZFS pool on 40 disks (you need 1 for the system, plus 3x RAID 10 for the bitmap volumes, otherwise your performance will be very bad, plus 2x HSP). Using 40 disks leads to a total of 40 separate replications. Now imagine the following scenarios:

a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB will be rebuilt. These 500 GB are synced over a single 1 GBit/s crossover cable. This takes a bit of time and is 100% unnecessary - and it will become much worse in the future, because disk capacities rocket up into the sky while performance isn't improved as much. During this time, your service misses redundancy. And we're not talking about some minutes. Well, and now try to imagine what will happen if another disk fails during this rebuild, this time in the secondary ...

b) A disk in the secondary fails. What happens now? No HSP will jump in on the secondary, because the zpool isn't imported and ZFS doesn't know about the failure. Instead, you'll end up with 39 active replications instead of 40; the one which replicates to the failed drive becomes inactive. But ... oh damn, the zpool isn't mounted on the secondary host, so ZFS doesn't report the drive failure to our server monitoring.

That can be funny. The only way to become aware of the problem that I found after a minute of thinking was asking sndradm about the health status - which would lead to a false alarm on Host A, because the failed disc is in Host B, and operators are usually not bright enough to change the disc in Host B when the alarm they get is about Host A. But even if everything works, what will happen if the primary fails before an administrator has fixed the problem, the missing replication is running again and the replacement disc has been completely synced? "Hello, kernel panic", and "Goodbye, 12 TB of data".

c) You *must* force every single `zpool import <zpool>` on the secondary host. Always. Because you usually need your secondary host after your primary crashed; you won't have the chance to export your zpool on the primary first - and if you do, you don't need AVS at all. Bring some Kleenex to get rid of the sweat on your forehead when you have to switch to your secondary host, because a single mistake (like forgetting to put the secondary host into logging mode manually before you try to import the zpool) will lead to a complete data loss. I bet you won't even trust your own failover scripts.

Use AVS and ZFS together. I use it myself. But I made sure that I know what I'm doing. Most people probably don't.

Btw: I have to admit that I haven't tried the newest Nevada builds during the tests. It's possible that AVS and ZFS work better together than they did under Solaris 10 11/06 and AVS 4.0. But there's a reason I haven't tried.
It's because Sun Cluster 3.2 instantly crashes on Thumpers with SATA-related kernel panics, and the OpenHA Cluster isn't available yet.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
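[For reference, the manual failover sequence Ralf warns about - put the secondary into logging mode first, then force the import - would look roughly like this on the secondary host. The group and pool names are placeholders, and the exact sndradm invocation depends on how the sets are configured (see sndradm(1M)); this is a sketch of the procedure, not a tested failover script:

    # on the secondary host, after the primary has died
    sndradm -g zpool-group -n -l     # drop the whole replication group into logging mode
    zpool import -f tank             # the pool was never exported, so the import must be forced
]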
Ralf,

> Torrey McMahon wrote:
>> AVS?
>>
> Jim Dunham will probably shoot me, or worse, but I recommend thinking twice about using AVS for ZFS replication.

That's why they call this a discussion group, as it encourages differing opinions.

> Basically, you only have a few options:
>
> 1) Using a battery-buffered hardware RAID controller, which leads to bad ZFS performance in many cases,
> 2) Building three-way mirrors to avoid complete data loss in several disaster scenarios due to missing ZFS recovery mechanisms like `fsck`, which makes AVS/ZFS-based solutions quite expensive,
> 3) Additionally using another form of backup, e.g. tapes.
>
> For instance, one scenario which made me think: imagine you have an X4500. 48 internal disks, 500 GB each. This would lead to a ZFS pool on 40 disks (you need 1 for the system, plus 3x RAID 10 for the bitmap volumes, otherwise your performance will be very bad, plus 2x HSP). Using 40 disks leads to a total of 40 separate replications. Now imagine the following scenarios:

This is just one scenario for deploying the 48 disks of an X4500. The blog listed below offers another option: by mirroring the bitmaps across all available disks, it brings the total disk count back up to 46 (or 44, with 2x HSP), leaving the other two for a mirrored root disk.

http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless

Yes, provisioning one slice for bitmaps and another slice for ZFS's vdevs on the same internal disk may introduce out-of-band head seeks between bitmap I/O and ZFS I/O, and giving ZFS only a slice of a disk turns off ZFS's ability to enable the disk's write cache. All things considered, this is the cost of host-based replication.

> a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB will be rebuilt. These 500 GB are synced over a single 1 GBit/s crossover cable. This takes a bit of time and is 100% unnecessary

But it is necessary! As soon as the HSP disk kicks in, not only is the disk being rebuilt by ZFS, but newly allocated ZFS data will also be written to this HSP disk. So although it may appear that there is wasted replication cost (which there is), the instant that ZFS writes new data to this HSP disk, the old replicated disk is instantly inconsistent, and there is no means to fix that.

For all that is good (or bad) about AVS, the fact that it works by simply interposing itself on the Solaris I/O data path is great, as it works with any Solaris block storage. Of course this also means that it has no filesystem, database or hot-spare knowledge, which means that at times AVS will be inefficient at what it does.

> - and it will become much worse in the future, because disk capacities rocket up into the sky while performance isn't improved as much.

Larger disk capacities are no worse in this scenario than they are with controller-based replication, ZFS send / receive, etc. Actually it is quite efficient. If the disk that failed was only 5% full, then when the HSP disk is switched in and rebuilt, only 5% of the entire disk has to be replicated. And if, at the time ZFS and AVS were deployed on this server, the HSP disks (containing uninitialized data) were also configured as equal with "sndradm -E ...", then there would be no initial replication cost, and when a spare is swapped into use, only the cost of replicating the ZFS data actually in use.

> During this time, your service misses redundancy.

Absolutely not.
If all of the in-use ZFS disks and the ZFS HSP disks are configured under AVS, there is never a time of lost redundancy.

> And we're not talking about some minutes. Well, and now try to imagine what will happen if another disk fails during this rebuild, this time in the secondary ...

If I were truly counting on AVS, I would be glad this happened! Getting replication configured right, be it AVS or some other option, means that when disks, systems, networks, etc. fail, there is always a period of degraded system performance, but that is better than no system performance.

> b) A disk in the secondary fails. What happens now? No HSP will jump in on the secondary, because the zpool isn't imported and ZFS doesn't know about the failure. Instead, you'll end up with 39 active replications instead of 40; the one which replicates to the failed drive becomes inactive. But ... oh damn, the zpool isn't mounted on the secondary host, so ZFS doesn't report the drive failure to our server monitoring.

But if a disaster happened on the primary node, and a decision was made to import the ZFS storage pool on the secondary, ZFS would detect the inconsistency, mark the drive as failed, and swap in the secondary's HSP disk. Later, when the primary site comes back and a reverse synchronization is done to restore the writes that happened on the secondary, the primary ZFS file system will become aware that a HSP swap occurred and continue on right where the secondary node left off.

> That can be funny. The only way to become aware of the problem that I found after a minute of thinking was asking sndradm about the health status - which would lead to a false alarm on Host A, because the failed disc is in Host B, and operators are usually not bright enough to change the disc in Host B when the alarm they get is about Host A. But even if everything works, what will happen if the primary fails before an administrator has fixed the problem, the missing replication is running again and the replacement disc has been completely synced? "Hello, kernel panic", and "Goodbye, 12 TB of data".

See above, but yes, there is a need for a system administrator to monitor SNDR replication.

> c) You *must* force every single `zpool import <zpool>` on the secondary host. Always.

Correct, but this is the case even without AVS! If one configured ZFS on SAN-based storage and the primary node crashed, one would need to force every single `zpool import <zpool>`. This is not an AVS issue, but a ZFS protection.

> Because you usually need your secondary host after your primary crashed; you won't have the chance to export your zpool on the primary first - and if you do, you don't need AVS at all. Bring some Kleenex to get rid of the sweat on your forehead when you have to switch to your secondary host, because a single mistake (like forgetting to put the secondary host into logging mode manually before you try to import the zpool) will lead to a complete data loss.

Correct, but this is the case even without AVS! Take the same SAN-based storage scenario above, go to a secondary system on your SAN, and force every single `zpool import <zpool>`.
In the case of a SAN, where the same physical disk would be written to by both hosts, you would likely get complete data loss; but with AVS, where ZFS is actually on two physical disks and AVS is tracking writes, even if they are inconsistent writes, AVS can and will recover if an update sync is done.

> I bet you won't even trust your own failover scripts.
>
> Use AVS and ZFS together. I use it myself. But I made sure that I know what I'm doing. Most people probably don't.

You are quite correct in that although ZFS is intuitively easy to use, AVS is painfully complex. Of course the mindsets of AVS and ZFS are as distant apart as they are in the alphabet. :-O

> Btw: I have to admit that I haven't tried the newest Nevada builds during the tests. It's possible that AVS and ZFS work better together than they did under Solaris 10 11/06 and AVS 4.0. But there's a reason I haven't tried. It's because Sun Cluster 3.2 instantly crashes on Thumpers with SATA-related kernel panics, and the OpenHA Cluster isn't available yet.

With AVS in Nevada, there is now an opportunity for leveraging the ease of use of ZFS with AVS. Being also the iSCSI Target project lead, I see a lot of value in the ZFS option "set shareiscsi=on" for getting end users into using iSCSI. I would like to see "set replication=AVS:<secondary host>", configuring a locally named ZFS storage pool to the same-named pair on some remote host. Starting down this path would afford things like ZFS replication monitoring, similar to what ZFS does with each of its own vdevs.

Jim
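[For anyone unfamiliar with the shareiscsi option Jim mentions, exporting a ZFS volume as an iSCSI target is roughly this simple; the pool and volume names are just examples:

    # create a 100 GB zvol and export it as an iSCSI target
    zfs create -V 100g tank/iscsivol
    zfs set shareiscsi=on tank/iscsivol
    iscsitadm list target            # verify the target was created
]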
Jim Dunham wrote:
> This is just one scenario for deploying the 48 disks of an X4500. The blog listed below offers another option: by mirroring the bitmaps across all available disks, it brings the total disk count back up to 46 (or 44, with 2x HSP), leaving the other two for a mirrored root disk.
> http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless

I know your blog entry, Jim. And I still admire your skills in calculations within shell scripts (I just gave each soft partition 100 megabytes of space, finished ;-) ). But after some thinking, I decided against using a slice on the same disk for bitmaps. Not just because of performance issues - that's not a valid reason. Again, the disaster scenarios make me think; in this case, the complexity of administration.

You know, the x64 Solaris boxes are basically competing against Linux boxes all day. The X4500 is a very attractive replacement for the typical Linux file server, consisting of a server, a hardware RAID controller and several cheap and stupid fibre-channeled SATA JBODs for less than $5,000 each. Double this to have a cluster. In our case, the X4500 is competing against more than 60 of those clusters with a total of 360 JBODs. The X4500's main advantage isn't the price per gigabyte (the price is exactly the same!), as most members of the sales department may expect; the real advantage is the gigabytes per rack unit. But there are several disadvantages, for instance: not being able to access the hard drives from the front, needing a ladder and a screwdriver instead, or - most important for the typical data center - the *operator* not being able to replace a disk the way he's used to: pull the old disc out, put the new disc in, resync starts, finished. You'll always have to wait until the next morning, until a Solaris administrator is available again (which may impact your high-availability concepts), or keep a Solaris administrator in the company 24/7 (which raises the TCO of the Solaris boxes).

Well, and what I want to say: if you place the bitmap volume on the same disk, this situation gets even worse. The problem is the involvement of SVM. Having to build the soft partition again makes the handling even more complex and the case harder for operators to handle. It's the best way to make sure that the disk will be replaced, but not added to the zpool, during the night - and replacing it during regular working hours isn't an option either, because syncing 500 GB over a 1 GBit/s interface during daytime just isn't possible without putting the guaranteed service times at risk. Having to take care of soft partitions just isn't idiot-proof enough. And *poof*, there's a good chance the TCO of an X4500 is considered too high.

>> a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB will be rebuilt. These 500 GB are synced over a single 1 GBit/s crossover cable. This takes a bit of time and is 100% unnecessary
>
> But it is necessary! As soon as the HSP disk kicks in, not only is the disk being rebuilt by ZFS, but newly allocated ZFS data will also be written to this HSP disk. So although it may appear that there is wasted replication cost (which there is), the instant that ZFS writes new data to this HSP disk, the old replicated disk is instantly inconsistent, and there is no means to fix that.

It's necessary from your point of view, Jim. But not in the minds of the customers. Even worse, it could be considered a design flaw - not in AVS, but in ZFS.
Just have a look how the usual Linux dude works. He doesn't use AVS, he uses a kernel module called DRBD. It does basically the same thing: it replicates one raw device to another over a network interface, like AVS does. But the Linux dude has one advantage: he doesn't have ZFS. Yes, as impossible as it may sound, it is an advantage. Why? Because he never has to mirror 40 or 46 devices, because his lame file systems depend on a hardware RAID controller! Same goes for UFS, of course. There's only ONE replicated device, no matter how many discs are involved. And so it's definitely NOT necessary to sync a disc when a HSP kicks in, because the disc failure will never be reported to the host; it's handled by the RAID controller. As a result, no replication takes place, because AVS simply isn't involved. We even tried to deploy ZFS upon SVM RAID 5 stripes to get rid of this problem, just to learn how much the RAID 5 performance of SVM sucks ... a cluster of six USB sticks was faster than the Thumpers.

I consider this a big design flaw of ZFS. I'm not very familiar with the code, but I still have hope that there'll be a parameter which allows me to get rid of the cache flushes. ZFS, and the X4500, are typical examples of different departments not really working together, e.g. they have a wonderful file system, but there is no storage which supports it. Or a great X4500, an 11-24 TB file server for $40,000, but no options to make it highly available like the $1,000 boxes. AVS is, in my opinion, clearly one of the components which suffers from this. The Sun marketing department and Jonathan still have a long way to go. But, on the other hand, difficult customers like me and my company are always happy to point out some difficulties and to help resolve them :-)

> For all that is good (or bad) about AVS, the fact that it works by simply interposing itself on the Solaris I/O data path is great, as it works with any Solaris block storage. Of course this also means that it has no filesystem, database or hot-spare knowledge, which means that at times AVS will be inefficient at what it does.

I don't think that there's a problem with AVS and its concepts. In my opinion, ZFS has to do the homework. At least it should be aware of the fact that AVS is involved - or has been, when it comes to recovering data from a zpool. Simply saying "the discs belong exclusively to the local ZFS, and no other mechanism can write onto the discs, so let's panic and lose all the terabytes of important data" just isn't valid. It may be easy and comfortable for the ZFS development department, but it doesn't reflect the real world - and not even Sun's software portfolio. The AVS integration into Nevada makes this even worse, and I hope there'll be something like fsck in the future, something which allows me to recover the files with correct checksums from a zpool, instead of simply hearing the sales droids repeat "There can't be any errors, NEVER!" over and over again :-)

>> - and it will become much worse in the future, because disk capacities rocket up into the sky while performance isn't improved as much.
>
> Larger disk capacities are no worse in this scenario than they are with controller-based replication, ZFS send / receive, etc. Actually it is quite efficient. If the disk that failed was only 5% full, then when the HSP disk is switched in and rebuilt, only 5% of the entire disk has to be replicated.
> And if, at the time ZFS and AVS were deployed on this server, the HSP disks (containing uninitialized data) were also configured as equal with "sndradm -E ...", then there would be no initial replication cost, and when a spare is swapped into use, only the cost of replicating the ZFS data actually in use.

That's interesting. Because, together with your "data and bitmap volume on the same disk" scenario, the bitmap volume would be lost too. A full sync of the disc would then be necessary, even if only 5% is in use. Am I correct?

>> During this time, your service misses redundancy.
>
> Absolutely not. If all of the in-use ZFS disks and the ZFS HSP disks are configured under AVS, there is never a time of lost redundancy.

I'm sure there is, as soon as a disc crashes in the secondary and the primary disc is in logging mode for several hours. I bet you'll lose your HA as soon as the primary crashes before the secondary is in sync again, because the global ZFS metadata wasn't logged, but updated. I think to avoid this, the primary would have to send the entire replication group into logging mode - but then it would get even worse, because you'd lose your redundancy for days until the secondary is 100% in sync again and the regular replicating state becomes active (a full sync of an X4500 takes at least 5 days, and only when you don't have Sun Cluster with exclusive interconnect interfaces up and running).

Linux/DRBD: some data will be missing and you'll have fun fsck'ing for two hours. ZFS: the secondary is not consistent, the zpool is FAULTED, all data is lost, you have a downtime while recovering from backup tapes, plus a week with reduced redundancy because of the time needed for resyncing the restored data. You want three cluster nodes in most deployment scenarios, not just two, believe me ;-) It doesn't matter much if you only host a few easy-to-restore videos. But I'm talking about file servers which host several billion inodes, like the file servers which host the mail headers, bodies and attachments for a million Yahoo users, a terabyte of moving data each day which cannot be backed up to tape.

>> And we're not talking about some minutes. Well, and now try to imagine what will happen if another disk fails during this rebuild, this time in the secondary ...
>
> If I were truly counting on AVS, I would be glad this happened! Getting replication configured right, be it AVS or some other option, means that when disks, systems, networks, etc. fail, there is always a period of degraded system performance, but that is better than no system performance.

That's correct. But don't forget that it's always a very small step from "degraded" to "faulted". In particular when it comes to high-availability scenarios in data centers, because in such scenarios you'll always have to rely on other people with less know-how and motivation. It's easy to accept a degraded state as long as you're in your office.
But with an X4500, your degraded state may potentially last longer than a weekend, and when you're directly responsible for the mail of millions of users and you know that any non-availability will place your name on Slashdot (or the name of your CEO, which equals placing your head on a scaffold), I'm sure you'll think twice about using ZFS with AVS or letting the Linux dudes continue to play with their inefficient boxes :-)

> But if a disaster happened on the primary node, and a decision was made to import the ZFS storage pool on the secondary, ZFS would detect the inconsistency, mark the drive as failed, and swap in the secondary's HSP disk. Later, when the primary site comes back and a reverse synchronization is done to restore the writes that happened on the secondary, the primary ZFS file system will become aware that a HSP swap occurred and continue on right where the secondary node left off.

I'll try that as soon as I have a chance again (which means: as soon as Sun gets Sun Cluster working on an X4500).

>> c) You *must* force every single `zpool import <zpool>` on the secondary host. Always.
>
> Correct, but this is the case even without AVS! If one configured ZFS on SAN-based storage and the primary node crashed, one would need to force every single `zpool import <zpool>`. This is not an AVS issue, but a ZFS protection.

Right. Too bad ZFS reacts this way. I have to admit that you made me nervous once, when you wrote that forcing zpool imports would be a bad idea ...

[X] Zfsck now! Let's organize a petition. :-)

> Correct, but this is the case even without AVS! Take the same SAN-based storage scenario above, go to a secondary system on your SAN, and force every single `zpool import <zpool>`.

Yes, but on a SAN I don't have to worry about zpool inconsistency, because the zpool always resides on the same devices.

> In the case of a SAN, where the same physical disk would be written to by both hosts, you would likely get complete data loss; but with AVS, where ZFS is actually on two physical disks and AVS is tracking writes, even if they are inconsistent writes, AVS can and will recover if an update sync is done.

My problem is that there's no ZFS mechanism which allows me to verify the zpool consistency before I actually try to import it. Like I said before: AVS does it right, it's just ZFS that doesn't (and otherwise it wouldn't make sense to discuss it on this mailing list anyway :-) ).

It could really help me with AVS if there were something like "zpool check <zpool>", something for checking a zpool before an import. I could have a cron job which puts the secondary host into logging mode, runs a "zpool check" and continues with the replication a few hours afterwards. That would let me sleep better, and I wouldn't have to pray to the IT gods before an import. You know, I saw literally *hundreds* of kernel panics during my tests; that made me nervous. I have scripts which do the job now, but I saw the risks and the things which can go wrong if someone else without my experience does it (like the infamous "forgetting to manually place the secondary in logging mode before trying to import a zpool").

> You are quite correct in that although ZFS is intuitively easy to use, AVS is painfully complex. Of course the mindsets of AVS and ZFS are as distant apart as they are in the alphabet. :-O

AVS was easy to learn and isn't very difficult to work with. All you need is 1 or 2 months of testing experience.
Very easy with UFS.

> With AVS in Nevada, there is now an opportunity for leveraging the ease of use of ZFS with AVS. Being also the iSCSI Target project lead, I see a lot of value in the ZFS option "set shareiscsi=on" for getting end users into using iSCSI.

Too bad the X4500 has too few PCI slots to consider buying iSCSI cards. The two existing slots are already needed for the Sun Cluster interconnect. I think iSCSI won't be a real option unless the servers are shipped with it onboard, as has been done in the past with SCSI or Ethernet interfaces.

> I would like to see "set replication=AVS:<secondary host>", configuring a locally named ZFS storage pool to the same-named pair on some remote host. Starting down this path would afford things like ZFS replication monitoring, similar to what ZFS does with each of its own vdevs.

Yes! Jim, I think we'll become friends :-) Who do I have to send the bribe money to?

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
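[The periodic verification Ralf wishes for can be approximated today with `zpool scrub` standing in for the hypothetical "zpool check". A rough, untested sketch of such a cron job on the secondary host; group, pool and the exact sndradm invocations are placeholders and depend on the local set configuration (see sndradm(1M)):

    # run on the secondary host during a quiet period
    sndradm -g zpool-group -n -l       # drop the replica into logging mode
    zpool import -f tank               # forced import of the replicated pool
    zpool scrub tank                   # verify checksums across the whole pool
    # ... wait for the scrub to finish, then inspect `zpool status -x tank` ...
    zpool export tank
    sndradm -g zpool-group -n -u       # resume replication with an update sync
]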
Wow, I just opened a whole can of worms there that went flying over my head. Thanks for all the information! I'll see if I can plough through it all :)

I'm guessing that I might be able to do asynchronous, but the problem is that the video is going to be streaming from a camera in real time, and it's only going to the file server. Also, this isn't like security footage or something. It's for a feature documentary, so we can't afford to lose any of this footage, and we really only get one try at it. Hence mirroring, so at least we have two copies of it. I'm guessing that AVS is some kind of low-level data replication software; correct me if I'm wrong.

I suppose the other thing is that we need sustained transfer speeds of between 50 MB/s and about 300 MB/s, depending on what video format we choose. So I'm guessing that with those speeds only fibre is going to cut it, really. Is this correct?

Thanks again for all your help.

Cheers,
Mark
Ralf,

> Well, and what I want to say: if you place the bitmap volume on the same disk, this situation gets even worse. The problem is the involvement of SVM. Having to build the soft partition again makes the handling even more complex and the case harder for operators to handle. It's the best way to make sure that the disk will be replaced, but not added to the zpool, during the night - and replacing it during regular working hours isn't an option either, because syncing 500 GB over a 1 GBit/s interface during daytime just isn't possible without putting the guaranteed service times at risk. Having to take care of soft partitions just isn't idiot-proof enough. And *poof*, there's a good chance the TCO of an X4500 is considered too high.

You are quite correct in that increasing the number of data path technologies (ZFS + AVS + SVM) increases the TCO, as the skills required by everyone involved must increase proportionately.

For the record, using ZFS zvols for bitmap volumes does not scale, as the overhead of bit flipping is way too many I/Os for raidz or raidz2 storage pools, and even for a mirrored storage pool it is high, as the COW semantics of ZFS make the I/O cost too high.

>>> a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB will be rebuilt. These 500 GB are synced over a single 1 GBit/s crossover cable. This takes a bit of time and is 100% unnecessary
>>
>> But it is necessary! As soon as the HSP disk kicks in, not only is the disk being rebuilt by ZFS, but newly allocated ZFS data will also be written to this HSP disk. So although it may appear that there is wasted replication cost (which there is), the instant that ZFS writes new data to this HSP disk, the old replicated disk is instantly inconsistent, and there is no means to fix that.
>
> It's necessary from your point of view, Jim. But not in the minds of the customers. Even worse, it could be considered a design flaw - not in AVS, but in ZFS.

I wouldn't go so far as to say it is a design flaw. The fact that AVS works with ZFS, and vice versa, without either having knowledge of the other's presence says a lot for the I/O architecture of Solaris. If there is a compelling advantage to interoperate, the OpenSolaris community as a whole is free to propose a project, gather community interest, and go from there. The potential of OpenSolaris is huge, especially when it is riding a technology wave like the one created by the X4500 and ZFS.

> Just have a look how the usual Linux dude works. He doesn't use AVS, he uses a kernel module called DRBD. It does basically the same thing: it replicates one raw device to another over a network interface, like AVS does. But the Linux dude has one advantage: he doesn't have ZFS. Yes, as impossible as it may sound, it is an advantage. Why? Because he never has to mirror 40 or 46 devices, because his lame file systems depend on a hardware RAID controller! Same goes for UFS, of course. There's only ONE replicated device, no matter how many discs are involved. And so it's definitely NOT necessary to sync a disc when a HSP kicks in, because the disc failure will never be reported to the host; it's handled by the RAID controller. As a result, no replication takes place, because AVS simply isn't involved. We even tried to deploy ZFS upon SVM RAID 5 stripes to get rid of this problem, just to learn how much the RAID 5 performance of SVM sucks ...
> a cluster of six USB sticks was faster than the Thumpers.

Instead of using SVM for RAID 5, to keep the volume count low, consider concatenating 8 devices (RAID 0) into each of 5 separate SVM volumes, then configuring both a ZFS raidz storage pool and AVS on these 5 volumes (see the sketch at the end of this message). This prevents SVM from performing software RAID 5 - RAID 0 is a low-overhead pass-through for SVM - plus, prior to giving the entire SVM volume to ZFS, one can also take the AVS bitmaps from this pool.

> I consider this a big design flaw of ZFS. I'm not very familiar with the code, but I still have hope that there'll be a parameter which allows me to get rid of the cache flushes. ZFS, and the X4500, are typical examples of different departments not really working together, e.g. they have a wonderful file system, but there is no storage which supports it. Or a great X4500, an 11-24 TB file server for $40,000, but no options to make it highly available like the $1,000 boxes. AVS is, in my opinion, clearly one of the components which suffers from this. The Sun marketing department and Jonathan still have a long way to go. But, on the other hand, difficult customers like me and my company are always happy to point out some difficulties and to help resolve them :-)

Sun does recognize the potential of both the X4500 and ZFS, and also the difficulties (and problems) of combining them. It would be great if there were pre-existing technology (hardware, software, or both) that just made this high-availability issue go away, without adding any complexity.

> I don't think that there's a problem with AVS and its concepts. In my opinion, ZFS has to do the homework. At least it should be aware of the fact that AVS is involved - or has been, when it comes to recovering data from a zpool. Simply saying "the discs belong exclusively to the local ZFS, and no other mechanism can write onto the discs, so let's panic and lose all the terabytes of important data" just isn't valid. It may be easy and comfortable for the ZFS development department, but it doesn't reflect the real world - and not even Sun's software portfolio. The AVS integration into Nevada makes this even worse, and I hope there'll be something like fsck in the future, something which allows me to recover the files with correct checksums from a zpool, instead of simply hearing the sales droids repeat "There can't be any errors, NEVER!" over and over again :-)

I don't think there is any single technology to blame here, unless of course that technology is, as you put it, "Sun's software portfolio". The "ZFS development department" has done an excellent job in meeting, and exceeding, what they set out to accomplish, and then some. They even offer remote file replication via send / recv. What was not taken into consideration - and it is unclear where this falls - is that any Solaris filesystem can be replicated by either host-based or controller-based data services, and the need for assuring data consistency of that replicated filesystem.
Concerned as you are about system panics, ZFS is doing the correct thing in validating checksums, and panicking Solaris under circumstances ZFS considers to be data corruption. Do the same types of operations with other filesystems, and these undetected writes are essentially silent data corruption. The fact that ZFS validates data on reads is powerful.

>>> - and it will become much worse in the future, because disk capacities rocket up into the sky while performance isn't improved as much.
>>
>> Larger disk capacities are no worse in this scenario than they are with controller-based replication, ZFS send / receive, etc. Actually it is quite efficient. If the disk that failed was only 5% full, then when the HSP disk is switched in and rebuilt, only 5% of the entire disk has to be replicated. And if, at the time ZFS and AVS were deployed on this server, the HSP disks (containing uninitialized data) were also configured as equal with "sndradm -E ...", then there would be no initial replication cost, and when a spare is swapped into use, only the cost of replicating the ZFS data actually in use.
>
> That's interesting. Because, together with your "data and bitmap volume on the same disk" scenario, the bitmap volume would be lost too. A full sync of the disc would then be necessary, even if only 5% is in use. Am I correct?

My scenario used SVM-mirrored bitmaps for AVS, and ZFS is protected by its raidz or mirrored storage pool. When one loses a disk, SVM continues to use the other side of the mirror for AVS bitmaps, and ZFS uses the redundancy of its storage pool. When the failed disk is replaced, SVM needs to resilver and ZFS needs to rebuild, either on demand or via zpool scrub. All is good.

>>> During this time, your service misses redundancy.
>>
>> Absolutely not. If all of the in-use ZFS disks and the ZFS HSP disks are configured under AVS, there is never a time of lost redundancy.
>
> I'm sure there is, as soon as a disc crashes in the secondary and the primary disc is in logging mode for several hours. I bet you'll lose your HA as soon as the primary crashes before the secondary is in sync again, because the global ZFS metadata wasn't logged, but updated.

Redundancy, based on my understanding, is recovery from a single failure. What you allude to above is two (or more) failures, something not covered by simple redundancy. The need to be able to recover from multiple failures is clearly a known concept, hence the creation of raidz2, knowing that losing two disks in raidz is bad news.

Using AVS to replicate a ZFS storage pool offers something AVS has never had: the ability for ZFS to validate that AVS's replication was indeed perfect. Drop the replica into logging mode, zpool import, zpool scrub, zpool export, resume replication.

> I think to avoid this, the primary would have to send the entire replication group into logging mode - but then it would get even worse, because you'd lose your redundancy for days until the secondary is 100% in sync again and the regular replicating state becomes active (a full sync of an X4500 takes at least 5 days, and only when you don't have Sun Cluster with exclusive interconnect interfaces up and running).
>
> Linux/DRBD: some data will be missing and you'll have fun fsck'ing for two hours.
> ZFS: the secondary is not consistent, the zpool is FAULTED, all data is lost, you have a downtime while recovering from backup tapes, plus a week with reduced redundancy because of the time needed for resyncing the restored data. You want three cluster nodes in most deployment scenarios, not just two, believe me ;-) It doesn't matter much if you only host a few easy-to-restore videos. But I'm talking about file servers which host several billion inodes, like the file servers which host the mail headers, bodies and attachments for a million Yahoo users, a terabyte of moving data each day which cannot be backed up to tape.
>
>>> And we're not talking about some minutes. Well, and now try to imagine what will happen if another disk fails during this rebuild, this time in the secondary ...
>>
>> If I were truly counting on AVS, I would be glad this happened! Getting replication configured right, be it AVS or some other option, means that when disks, systems, networks, etc. fail, there is always a period of degraded system performance, but that is better than no system performance.
>
> That's correct. But don't forget that it's always a very small step from "degraded" to "faulted". In particular when it comes to high-availability scenarios in data centers, because in such scenarios you'll always have to rely on other people with less know-how and motivation. It's easy to accept a degraded state as long as you're in your office. But with an X4500, your degraded state may potentially last longer than a weekend, and when you're directly responsible for the mail of millions of users and you know that any non-availability will place your name on Slashdot (or the name of your CEO, which equals placing your head on a scaffold), I'm sure you'll think twice about using ZFS with AVS or letting the Linux dudes continue to play with their inefficient boxes :-)

All very valid points, and having reassurance that choices made today will prove themselves valuable if and when degraded or faulted states arise is key. I am a strong proponent of disaster-recovery testing long before your company or CxOs sign off on a solution put into production. You are right to question, and to arrive at your own informed conclusions about the technologies you choose before deployment.

>> But if a disaster happened on the primary node, and a decision was made to import the ZFS storage pool on the secondary, ZFS would detect the inconsistency, mark the drive as failed, and swap in the secondary's HSP disk. Later, when the primary site comes back and a reverse synchronization is done to restore the writes that happened on the secondary, the primary ZFS file system will become aware that a HSP swap occurred and continue on right where the secondary node left off.
>
> I'll try that as soon as I have a chance again (which means: as soon as Sun gets Sun Cluster working on an X4500).
>
>>> c) You *must* force every single `zpool import <zpool>` on the secondary host. Always.
>>
>> Correct, but this is the case even without AVS! If one configured ZFS on SAN-based storage and the primary node crashed, one would need to force every single `zpool import <zpool>`. This is not an AVS issue, but a ZFS protection.
>
> Right. Too bad ZFS reacts this way.
> I have to admit that you made me nervous once, when you wrote that forcing zpool imports would be a bad idea ...

I think there was some context to my prior statement, as in checking the current state of replication before doing so. ;-)

> [X] Zfsck now! Let's organize a petition. :-)
>
>> Correct, but this is the case even without AVS! Take the same SAN-based storage scenario above, go to a secondary system on your SAN, and force every single `zpool import <zpool>`.
>
> Yes, but on a SAN I don't have to worry about zpool inconsistency, because the zpool always resides on the same devices.

Point well taken.

>> In the case of a SAN, where the same physical disk would be written to by both hosts, you would likely get complete data loss; but with AVS, where ZFS is actually on two physical disks and AVS is tracking writes, even if they are inconsistent writes, AVS can and will recover if an update sync is done.
>
> My problem is that there's no ZFS mechanism which allows me to verify the zpool consistency before I actually try to import it. Like I said before: AVS does it right, it's just ZFS that doesn't (and otherwise it wouldn't make sense to discuss it on this mailing list anyway :-) ).
>
> It could really help me with AVS if there were something like "zpool check <zpool>", something for checking a zpool before an import. I could have a cron job which puts the secondary host into logging mode, runs a "zpool check" and continues with the replication a few hours afterwards. That would let me sleep better, and I wouldn't have to pray to the IT gods before an import. You know, I saw literally *hundreds* of kernel panics during my tests; that made me nervous. I have scripts which do the job now, but I saw the risks and the things which can go wrong if someone else without my experience does it (like the infamous "forgetting to manually place the secondary in logging mode before trying to import a zpool").

The issue is not the need for a "zpool check", or to improve on "zpool import", or ZFS itself, as each could validate the storage pool as being 100% perfect if, at the moment they run, ZFS on the primary node is not writing data - data which may be actively replicating to the secondary node. The problem (or lack of a feature) is that ZFS does not support shared access to a single storage pool. ZFS on one node, in seeing ZFS writes issued by another node (be it via a dual-ported disk, a SAN disk, AVS replication, or controller-based replication), views these ZFS writes and their checksum data as a form of data corruption, and rightfully ZFS panics Solaris.

I know that the shared QFS filesystem supports careful, ordered writes, which allows a shared QFS reader client to read (only) from an active replica, be it AVS- or controller-based replication. As with QFS, given time, ZFS will evolve.

>> You are quite correct in that although ZFS is intuitively easy to use, AVS is painfully complex. Of course the mindsets of AVS and ZFS are as distant apart as they are in the alphabet. :-O
>
> AVS was easy to learn and isn't very difficult to work with. All you need is 1 or 2 months of testing experience. Very easy with UFS.
>
>> With AVS in Nevada, there is now an opportunity for leveraging the ease of use of ZFS with AVS. Being also the iSCSI Target project lead, I see a lot of value in the ZFS option "set shareiscsi=on" for getting end users into using iSCSI.
> Too bad the X4500 has too few PCI slots to consider buying iSCSI cards.

HBA manufacturers have in the past created multi-port and multi-function HBAs. I would expect there to be something out there, or out there soon, which will address the need of limited PCI slots.

> The two existing slots are already needed for the Sun Cluster interconnect. I think iSCSI won't be a real option unless the servers are shipped with it onboard, as has been done in the past with SCSI or Ethernet interfaces.
>
>> I would like to see "set replication=AVS:<secondary host>", configuring a locally named ZFS storage pool to the same-named pair on some remote host. Starting down this path would afford things like ZFS replication monitoring, similar to what ZFS does with each of its own vdevs.
>
> Yes! Jim, I think we'll become friends :-) Who do I have to send the bribe money to?

Sun Microsystems, Inc., as in buying Sun servers, software, storage and services. Non-monetary offerings, in the form of being an active OpenSolaris community member, are also highly valued.

Jim Dunham
Solaris, Storage Software Group
Sun Microsystems, Inc.
1617 Southwood Drive
Nashua, NH 03063
Email: James.Dunham at Sun.COM
http://blogs.sun.com/avs
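[To make Jim's SVM suggestion above concrete, a rough sketch of building 5 RAID 0 volumes of 8 disks each and layering a raidz pool on top might look like this. Device and volume names are invented, it assumes SVM state database replicas (metadb) already exist, and the AVS bitmap slices are left out for brevity:

    # one 8-disk RAID 0 stripe per SVM volume, five volumes in total
    metainit d10 1 8 c0t0d0s0 c0t1d0s0 c0t2d0s0 c0t3d0s0 c0t4d0s0 c0t5d0s0 c0t6d0s0 c0t7d0s0
    metainit d11 1 8 c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0 c1t4d0s0 c1t5d0s0 c1t6d0s0 c1t7d0s0
    # ... d12, d13 and d14 built the same way from the remaining disks ...

    # a single raidz vdev across the five SVM volumes; AVS then replicates d10-d14
    zpool create tank raidz /dev/md/dsk/d10 /dev/md/dsk/d11 /dev/md/dsk/d12 \
        /dev/md/dsk/d13 /dev/md/dsk/d14
]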
Ralf Ramge wrote:
> I consider this a big design flaw of ZFS.

Are you saying that it's a design flaw of ZFS that we haven't yet implemented remote replication? I would consider that a missing feature, not a design flaw. There's nothing in the design of ZFS to prevent such a feature (and in fact, several aspects of the design would work very well with such a feature, e.g. as used by "zfs send").

> I'm not very familiar with the code, but I still have hope that there'll be a parameter which allows me to get rid of the cache flushes.

You mean zfs_nocacheflush? Admittedly, this is a hack. We're working on making this simply do the right thing, based on the capabilities of the underlying storage device.

> ZFS, and the X4500, are typical examples of different departments not really working together, e.g. they have a wonderful file system, but there is no storage which supports it.

I'm not sure what you mean. ZFS supports any storage, and works great on the X4500.

--matt
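[For reference, the zfs_nocacheflush tunable Matt mentions is typically set in /etc/system, or poked on a live system with mdb. It is only sensible on storage with a non-volatile write cache, so treat this as an illustration rather than a recommendation:

    # /etc/system - takes effect on the next reboot
    set zfs:zfs_nocacheflush = 1

    # or on a running system, without a reboot:
    echo "zfs_nocacheflush/W0t1" | mdb -kw
]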
Hello Ralf,

Wednesday, August 22, 2007, 8:55:35 AM, you wrote:

RR> instead, or - most important for the typical data center - the
RR> *operator* not being able to replace a disk the way he's used to:
RR> pull the old disc out, put the new disc in, resync starts, finished.
RR> You'll always have to wait until the next morning, until a Solaris
RR> administrator is available again (which may impact your
RR> high-availability concepts), or keep a Solaris administrator in the
RR> company 24/7 (which raises the TCO of the Solaris boxes).

See the putback from a few weeks ago (it went in, right?), which should deliver exactly what you want - just pull out the bad disk, put in the new disk, and ZFS resyncs automatically. No admin intervention at all.

--
Best regards,
Robert Milkowski                 mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
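[Robert is presumably referring to the autoreplace pool property - an assumption on my part, as the putback isn't named. If so, enabling the hands-off behaviour he describes looks like this; "tank" is a placeholder pool name:

    # allow a new disk inserted into the same physical slot to be used
    # automatically, without a manual "zpool replace"
    zpool set autoreplace=on tank
    zpool get autoreplace tank       # verify the property
]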
Ok, I had a bit of a look around. What about this setup?

Two boxes with all the hard drives in them, and all drives set up as iSCSI targets. A third box puts all of the drives into a mirrored raidz setup (one box mirroring the other, each as a raidz zfs zpool). This setup will then be shared out via Samba.

Does anybody see a problem with this? Also, I know this isn't ZFS-related, but is there any upper limit on file size with Samba?

Thanks for all your help.

Mark
Mark wrote:
> Two boxes with all the hard drives in them, and all drives set up as iSCSI targets. A third box puts all of the drives into a mirrored raidz setup (one box mirroring the other, each as a raidz zfs zpool). This setup will then be shared out via Samba.
>
> Does anybody see a problem with this?

Seems reasonable to me. However, you haven't said anything about how the "third box" is networked to the "first box" and "second box".

With iSCSI I HIGHLY recommend at least using IPsec AH, so that you get integrity protection of the packets - the TCP checksum is not enough. If you care enough to use the sha256 checksum with ZFS, you should care enough to ensure the data on the wire is strongly checksummed too.

Also consider that if this were direct-attach storage you would probably be using two separate HBAs, so you may want to consider using different physical NICs and/or IPMP or other network failover technologies (depending on what hardware you have network-wise).

I did a similar setup recently where I had a zpool on one machine and created two iSCSI targets (using ZFS), and then created a mirror using those two LUNs on another machine. In the end I removed the ZFS pool on the target side and shared out the raw disks with iSCSI, and built the pool on the initiator machine that way. Why? Because I couldn't rationalise to myself what value ZFS was giving me in this particular case, since I was sharing the whole disk array. In cases where you aren't sharing the whole array to a single initiator, I can see value in having the iSCSI targets be zvols.

--
Darren J Moffat
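[A rough illustration of the raw-disk approach Darren describes, with invented addresses and device names. Note that ZFS does not support mirroring one raidz vdev against another, so the closest equivalent of "one box mirroring the other" is a pool of mirror pairs, each pairing one LUN from box A with one from box B:

    # on the third box (the initiator), discover the targets on the two storage boxes
    iscsiadm add discovery-address 192.168.10.1:3260
    iscsiadm add discovery-address 192.168.10.2:3260
    iscsiadm modify discovery --sendtargets enable
    devfsadm -i iscsi              # create device nodes for the discovered LUNs

    # build the pool from cross-box mirror pairs (device names are placeholders)
    zpool create video \
        mirror c2t1d0 c3t1d0 \
        mirror c2t2d0 c3t2d0 \
        mirror c2t3d0 c3t3d0
]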
Hi,

A few questions. I seem to remember that in WAN environments IPsec can have a reasonably large performance impact. How large is this performance impact, and is there some way to mitigate it? The problem is we could need to use all of a gigabit link's bandwidth (possibly more). Is IPsec AH slightly different from the cryptography algorithms that keep VPNs secure?

Also, I had a look at IPMP; it sounds really good. I was wondering yesterday about the possibility of linking a few gigabit links together, as FC is very expensive and 10GbE is almost the same. I read in the Wikipedia article that by using IPMP the bandwidth is increased due to sharing across all the network cards. Is this true?

Thanks again for all your help.

Cheers,
Mark
What hardware you have, and what size the data chunks are, will determine what impact IPsec has. WAN vs LAN isn't the issue.

As for mitigating the impact of the crypto in IPsec, it depends on the data size. If the size of the packets is > 512 bytes, then the crypto framework will offload that to hardware. However, that really only matters for symmetric ciphers such as AES and 3DES, which, if you are doing IPsec AH only rather than ESP+auth, you aren't using. If you do want to encrypt and have that offloaded to hardware, there are two choices: the Sun CA-6000 card or an UltraSPARC T2 processor (Niagara 2) [the CPU in the recently announced new machines].

Some VPNs are IPsec-based and some are SSL or SSH. Those that are IPsec-based do so with ESP+auth. IPsec AH doesn't protect the data from viewing on the wire, it just integrity-protects it - just like ZFS today (integrity-protected but not encrypted); a VPN needs to be more than that!

--
Darren J Moffat
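[A minimal sketch of what an AH-only policy between the file server and the backup box might look like on Solaris. The addresses are placeholders, and the exact policy syntax and key setup (IKE, or manual keys via ipseckey(1M)) should be checked against ipsecconf(1M) before use:

    # /etc/inet/ipsecinit.conf on each host - require AH for traffic between the two boxes
    {laddr 192.168.10.1 raddr 192.168.10.2} ipsec {auth_algs sha1}

    # load the policy
    ipsecconf -a /etc/inet/ipsecinit.conf
]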