Someone suggested an idea, which the more I think about it the less insane it sounds. I thought I would ask the assembled masses to see if anyone had tried anything like this, and how successful they had been.

I'll start with the simplest variant of the solution, but there are potentially subtleties which could be applied.

Take 3 machines, for the sake of argument 2 x4500s and an x4100 as a head unit.

Export the storage from each of the x4500s by making it an iSCSI target. Import the storage onto the x4100 by making it an iSCSI initiator.

Using ZFS (and I assume this is preferable to Solaris Volume Manager), set up a mirror between the two sets of storage.

Assuming that works, one of the two servers can be moved to a different site, and you now have real-time, cross-site mirroring of data.

For added tweaks I believe that I can arrange to have two head units so that I can do something resembling failover of data, if not necessarily instantaneously.

The only issue I haven't yet been able to come up with a solution for in this thought experiment is how to recover quickly from one half of the mirror going down. As far as I can tell I need to resilver the entire half of the mirror, which could take some time. Am I missing some clever trick here?

I'm interested in any input as to why this does or doesn't work, and I'm especially interested to hear from anyone that has actually done something like this already.

Cheers,

Julian
-- 
Julian King
Computer Officer, University of Cambridge, Unix Support
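For concreteness, a minimal sketch of the commands such a setup would involve, assuming Solaris with the iSCSI target and initiator packages installed; the pool names, target names, addresses and device names below are made up for illustration:

    # On each x4500: create a zvol and export it as an iSCSI target
    zpool create bigpool raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0
    zfs create -V 500g bigpool/export
    zfs set shareiscsi=on bigpool/export

    # On the x4100 head: import both targets using static discovery
    iscsiadm modify discovery --static enable
    iscsiadm add static-config iqn.1986-03.com.sun:02:target-a,192.168.1.10:3260
    iscsiadm add static-config iqn.1986-03.com.sun:02:target-b,192.168.1.11:3260
    devfsadm -i iscsi

    # Mirror the two imported LUNs (use the device names reported by format(1M))
    zpool create tank mirror c2t01d0 c3t01d0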
I think I have heard something called dirty time logging being implemented in ZFS.

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at Sun.COM
J.P. King wrote:
> Take 3 machines, for the sake of argument 2 x4500s and an x4100 as a
> head unit.
>
> Export the storage from each of the x4500s by making it an iSCSI target.
> Import the storage onto the x4100 by making it an iSCSI initiator.
>
> Using ZFS (and I assume this is preferable to Solaris Volume Manager),
> set up a mirror between the two sets of storage.

Remember to also deploy IPsec to protect the iSCSI traffic. You want at least IPsec with AH to get integrity protection on the wire, and for cross-site you likely want ESP+Auth as well.

-- 
Darren J Moffat
> Remember to also deploy IPsec to protect the iSCSI traffic. You want at
> least IPsec with AH to get integrity protection on the wire, and for
> cross-site you likely want ESP+Auth as well.

How will this help given dark fibre between the sites? I'm not doing this over a public internet!

> Darren J Moffat

Julian
-- 
Julian King
Computer Officer, University of Cambridge, Unix Support
> I think I have heard something called dirty time logging being
> implemented in ZFS.

Thanks for the pointer. Certainly interesting, but according to the talks/emails I've found from a month or so ago, ZFS "will offer" this, so I am guessing it isn't there yet, and certainly not in a released version of Solaris.

Knowing that it is (probably) on the way is still useful.

> Mertol Ozyoney

Julian
-- 
Julian King
Computer Officer, University of Cambridge, Unix Support
J.P. King wrote:
>> I think I have heard something called dirty time logging being
>> implemented in ZFS.
>
> Thanks for the pointer. Certainly interesting, but according to the
> talks/emails I've found from a month or so ago, ZFS "will offer" this,
> so I am guessing it isn't there yet, and certainly not in a released
> version of Solaris.
>
> Knowing that it is (probably) on the way is still useful.

It is already there, see here

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/vdev_impl.h#130

and try a full-text search for dtl in usr/src/uts/common/fs/zfs/ as well.

hth,
victor
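A quick way to see whether a resilver really only covers the outage window (a rough sketch, assuming a simple two-disk mirror called tank; the disk name is a placeholder):

    zpool offline tank c1t1d0     # simulate losing one side of the mirror
    # ... write some data into the pool while it is degraded ...
    zpool online tank c1t1d0      # bring it back; only data written meanwhile should need resilvering
    zpool status tank             # watch the resilver - it should finish far faster than a full copy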
What is the procedure for enabling DTL?

PS: I am no unix guru

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at Sun.COM
J.P. King wrote:
>> Remember to also deploy IPsec to protect the iSCSI traffic. You want at
>> least IPsec with AH to get integrity protection on the wire, and for
>> cross-site you likely want ESP+Auth as well.
>
> How will this help given dark fibre between the sites? I'm not doing
> this over a public internet!

The IPsec AH is to ensure that you don't get corruption on the wire - this is especially important if the iSCSI targets are not ZVOLs, but even then I'd highly recommend it. If you are happy with the physical security of your cable then you don't need the ESP.

-- 
Darren J Moffat
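For illustration, a minimal AH-only policy on the head node might look roughly like this (a sketch using the ipsecconf(1M) policy file format with made-up addresses; key management via IKE or manual keys still has to be configured separately):

    # /etc/inet/ipsecinit.conf on the iSCSI initiator
    # AH only: integrity protection for all traffic to each iSCSI target
    {raddr 192.168.1.10} ipsec {auth_algs sha1}
    {raddr 192.168.1.11} ipsec {auth_algs sha1}
    # for the cross-site link, ESP with authentication instead:
    # {raddr 192.168.2.10} ipsec {encr_algs aes encr_auth_algs sha1}

    # activate the policy
    ipsecconf -a /etc/inet/ipsecinit.conf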
Heh, it might have been me who suggested that. I'm testing the idea out at the moment, but being new to Solaris it's taking some time.

So far I've confirmed that you can import iSCSI volumes to ZFS fine, but you need to use static discovery. If you use sendtargets, it breaks when devices go offline (hangs iSCSI and ZFS, and then Solaris won't boot). I've also got a basic cluster running with HA-ZFS mirroring a pair of iSCSI disks, with HA-NFS running on top of that. That appears to work fine too, and is pretty reliable.

In terms of recovery time after one half of the mirror going down, I thought ZFS already had that feature - it was one of the things I read that gave me this idea in the first place. Have a look at page 15 of this presentation, it specifically says "a 5 second outage takes 5 seconds to repair":
http://opensolaris.org/os/community/zfs/docs/zfs_last.pdf

I read that to mean that if the iSCSI server breaks but is repairable, you will only need to re-sync the data that has changed. Of course, if the whole thing dies you have rather a lot of data to shift around, but if you're running ZFS with dual-parity RAID on the x4500s, the chances are you'll only need to do that when hell freezes over :)

I'm doing my level best to kill our setup at the moment. I've been pulling the (virtual) power on the iSCSI servers, resilvering ZFS, and swopping ZFS between the two cluster nodes. So far I've had a few teething problems but it's always come back online and I've never lost any data. Even swopping active nodes in the cluster while iSCSI devices are offline isn't a problem, but I do have a lot more stress testing to do.

The latest trick is that I've now got 5 Solaris boxes running under VMware (2x iSCSI servers, 2x cluster, 1x client), and I'm about to test:

VMware -> Solaris -> ZFS pool -> iSCSI -> Solaris Cluster -> HA-ZFS -> HA-NFS -> VMware

Yes, VMware is quite happy accessing an NFS store hosted within itself, although I'm yet to test how it handles a cluster node failure. I'm going to test that, and then host an XP desktop on the NFS share and see how performance compares to a desktop on native storage. I figure that will give me a reasonable idea as to how much overhead this is adding :)

One of the main reasons I'm testing with VMware is that I plan to access the iSCSI storage on the Thumpers via a Solaris machine hosted under VMware. That way I can connect directly to it from other virtual servers and take advantage of the 64Gbps speed and low latency of the virtual network. It means mirroring the Thumpers shouldn't add any noticeable latency to the traffic.

That's about the extent of my progress so far. Would love to hear your feedback if you're testing this too.
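In case it helps anyone else, the switch from sendtargets to static discovery looks roughly like this (a sketch; the target name and address are placeholders):

    # drop dynamic (sendtargets) discovery of the target portal
    iscsiadm remove discovery-address 192.168.1.10:3260
    iscsiadm modify discovery --sendtargets disable

    # add the same target statically instead
    iscsiadm add static-config iqn.1986-03.com.sun:02:target-a,192.168.1.10:3260
    iscsiadm modify discovery --static enable
    devfsadm -i iscsi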
Found my first problems with this today. The ZFS mirror appears to work fine, but if you disconnect one of the iSCSI targets it hangs for 5 minutes or more. I'm also seeing very concerning behaviour when attempting to re-attach the missing disk.

My test scenario is:
- Two 35GB iSCSI targets are being shared using ZFS shareiscsi=on
- They are imported to a 3rd Solaris box and used to create a mirrored ZFS pool
- I use that to mount an NFS share, and connected to that with VMware ESX server

My first test was to clone a virtual machine onto the new volume. That appeared to work fine, so I decided to test the mirroring. I started another clone operation, then powered down one of the iSCSI targets. Well, the clone operation seemed to hang as soon as I did that, so I ran "zpool status" to see what was going on. The news wasn't good: that hung too.

Nothing happened in either window for a good 5 minutes, then ESX popped up with an error saying "the virtual disk is either corrupted or not a supported format", and at the exact same time the zpool status command completed, but showing that all the drives were still ONLINE. I immediately re-ran zpool status; now it reported that one iSCSI disk was offline and the pool was running in a degraded state.

So, for some reason it took 5 minutes for the iSCSI device to go offline, it locked up ZFS for that entire time, and ZFS reported the wrong status the first time around too. The only good news is that now that ZFS is in a degraded state I can start the clone operation again and it completes fine with just half of the mirror available.

Next, I powered on the missing server, checked "format < /dev/null" to ensure the drives had re-connected, and used "zpool online" to re-attach the missing disk. So far it's taken over an hour to attempt to resilver files from a 10 minute copy, and the progress report is up and down like a yo-yo. The progress reporting from ZFS so far has been:

- 2.25% done, 0h13m to go
- 7.20% done, 0h12m to go
- 6.14% done, 0h8m to go (odd, how does it go down?)
...
- 78.50% done, 0h2m to go
- 41.67% done, 0h8m to go (huh?)
...
- 72.45% done, 0h3m to go
- 42.42% done, 0h9m to go

Getting concerned now, I'm actually wondering if this is ever going to complete, and I have no idea if these problems are ZFS or iSCSI related.
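A few commands that may help tell the iSCSI layer and the ZFS layer apart during the hang (a sketch; tank and the device name are placeholders):

    iscsiadm list target -v       # connection state of each configured target
    iscsiadm list target -S       # LUNs and OS device names behind each target
    zpool status -x               # ZFS's own view of pool health
    zpool online tank c2t01d0     # re-attach the device once the target is reachable again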
Well, 5 minutes after posting that the resilver completed. However, despite it saying "resilver completed with 0 errors" ten minutes ago, the device still shows as unavailable, and my pool is still degraded.
Well, I got it working, but not in a tidy way. I'm running HA-ZFS here, so I moved the ZFS pool over to the other node in the cluster. That had exactly the same problem, however: the iSCSI disks were unavailable.

Then I found an article from November 2006 (http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html) saying that the iSCSI initiator won't reconnect until you reboot. I rebooted one node of the cluster, then swopped ZFS back over to there and voila! Fully working mirrored storage again.

So I guess it's an iSCSI initiator problem in that it doesn't reconnect properly to a rebooted target, but it's not a particularly stable solution at this stage.
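For reference, the failover itself is just a resource group switch (a sketch, assuming Sun Cluster with a hypothetical resource group nfs-rg and nodes node1/node2):

    # move the HA-ZFS/HA-NFS resource group to the surviving node
    scswitch -z -g nfs-rg -h node2

    # after rebooting node1 so its iSCSI initiator reconnects, switch back
    scswitch -z -g nfs-rg -h node1
    zpool status tank             # the mirror should come back ONLINE and resilver the delta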