Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Sep-26 17:54 UTC
[zfs-discuss] vm server storage mirror
Here's another one.

Two identical servers are sitting side by side. They could be connected to each other via anything (presently using a crossover ethernet cable), and obviously they both connect to the regular LAN. You want to serve VMs from at least one of them, and even if the VMs aren't fault tolerant, you want at least the storage to be live synced. The first obvious thing to do is simply cron a zfs send | zfs receive at a very frequent interval. But there are a lot of downsides to that - besides the fact that you have to settle for some granularity, you also have a script on one system that will clobber the other system. So in the event of a failure, you might promote the backup into production, and then you have to be careful not to let it get clobbered when the main server comes up again.

I much prefer the idea of using a ZFS mirror between the two systems, even if it comes with a performance penalty from bottlenecking the storage onto Ethernet. But there are several ways to possibly do that, and I'm wondering which will be best.

Option 1: Each system creates a big zpool of the local storage. Then create a zvol within the zpool, and export it over iscsi to the other system. Now both systems can see a local zvol and a remote zvol, which they can use to create a zpool mirror. The reason I don't like this idea is that it's a zpool within a zpool, including the double-checksumming and everything. But the double-checksumming isn't such a concern to me - I'm mostly afraid some horrible performance or reliability problem might result. Naturally, you would only zpool import the nested zpool on one system; the other system would basically just ignore it. But in the event of a primary failure, you could force import the nested zpool on the secondary system.

Option 2: At present, both systems are using local mirroring: 3 mirror pairs of 6 disks. I could break these mirrors and export one side over to the other system... and vice versa. So neither server will be doing local mirroring; they will both be mirroring across iscsi to targets on the other host. Once again, each zpool will only be imported on one host, but in the event of a failure, you could force import it on the other host.

Can anybody think of a reason why Option 2 would be stupid, or can you think of a better solution?
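For concreteness, Option 2 boils down to roughly the following COMSTAR steps on illumos/Solaris. This is a sketch only - the device names, pool name, and placeholders in angle brackets are hypothetical, the exact commands used in this thread may differ, and the same steps would be repeated symmetrically on the other host:

# On host B: expose one local disk as an iSCSI logical unit (COMSTAR)
sudo svcadm enable -r svc:/network/iscsi/target:default
sudo sbdadm create-lu /dev/rdsk/c2t4d0        # note the GUID it prints
sudo stmfadm add-view <GUID-from-previous-step>
sudo itadm create-target                      # note the target IQN

# On host A: attach the remote disk and mirror it against a local one
sudo svcadm enable svc:/network/iscsi/initiator:default
sudo iscsiadm add static-config <target-IQN>,<host-B-IP>
sudo iscsiadm modify discovery --static enable
sudo devfsadm -i iscsi                        # the new c#t...d0 device appears
sudo zpool create vmpool mirror c2t1d0 <new-iscsi-device>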
If you're willing to try FreeBSD, there's HAST (aka Highly Available Storage) for this very purpose. You use hast to create mirror pairs using 1 disk from each box, thus creating /dev/hast/* nodes. Then you use those to create the zpool on the 'primary' box. All writes to the pool on the primary box are mirrored over the network to the secondary box. When the primary box goes down, the secondary imports the pool and carries on. When the primary box comes online, it syncs the data back from the secondary, and then either takes over as primary or becomes the new secondary.

On Sep 26, 2012 10:54 AM, "Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)" <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> You want to serve VMs from at least one of them, and even if the VMs aren't
> fault tolerant, you want at least the storage to be live synced. [...]
> Can anybody think of a reason why Option 2 would be stupid, or can you
> think of a better solution?
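For anyone curious what that looks like in practice, a minimal HAST sketch on FreeBSD might be the following. Hostnames, addresses, and disk names here are hypothetical:

# /etc/hast.conf on both boxes - one resource per disk pair
resource disk0 {
        on nodeA {
                local /dev/ada1
                remote 10.0.0.2
        }
        on nodeB {
                local /dev/ada1
                remote 10.0.0.1
        }
}

# on both boxes:
hastctl create disk0
service hastd onestart

# on the primary only:
hastctl role primary disk0
zpool create tank /dev/hast/disk0   # each hast device is already a network mirror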
"head units" crash or do weird things, but disks persist. There are a couple of HA head-unit solutions out there but most of them have their own separate storage and they effectively just send transaction groups to each other. The other way is to connect 2 nodes to an external SAS/FC chassis. create desired ZPools. Assign some subset of pools to node A, the rest to node B. When failure occurs the other node imports the other''s pools and exports as NFS/iSCSI/whatever. You''ll have to have a clustering/quorum and resource migration subsystem obviously. Or if you want simple act/passive, a means to make sure both heads don''t try to import the same pools.
On Wed, Sep 26, 2012 at 12:54 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> I much prefer the idea of using a ZFS mirror between the two systems, even
> if it comes with a performance penalty from bottlenecking the storage onto
> Ethernet. [...]
> Can anybody think of a reason why Option 2 would be stupid, or can you
> think of a better solution?

I would suggest if you're doing a crossover between systems, you use infiniband rather than ethernet. You can eBay a 40Gb IB card for under $300. Quite frankly the performance issues should become almost a non-factor at that point.

--Tim
On Sep 26, 2012, at 10:54 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> Option 1: Each system creates a big zpool of the local storage. Then create
> a zvol within the zpool, and export it over iscsi to the other system. [...]
> But in the event of a primary failure, you could force import the nested
> zpool on the secondary system.

This was described by Thorsten a few years ago.
http://www.osdevcon.org/2009/slides/high_availability_with_minimal_cluster_torsten_frueauf.pdf

IMHO, the issues are operational: troubleshooting could be very challenging.

> Option 2: At present, both systems are using local mirroring: 3 mirror pairs
> of 6 disks. [...] Once again, each zpool will only be imported on one host,
> but in the event of a failure, you could force import it on the other host.
>
> Can anybody think of a reason why Option 2 would be stupid, or can you
> think of a better solution?

If they are close enough for "crossover cable" where the cable is UTP, then they are close enough for SAS.
 -- richard

--
illumos Day & ZFS Day, Oct 1-2, 2012, San Francisco  www.zfsday.com
Richard.Elling at RichardElling.com
+1-760-896-4422
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Sep-27 17:48 UTC
[zfs-discuss] vm server storage mirror
> From: Tim Cook [mailto:tim at cook.ms]
> Sent: Wednesday, September 26, 2012 3:45 PM
>
> I would suggest if you're doing a crossover between systems, you use
> infiniband rather than ethernet. You can eBay a 40Gb IB card for under
> $300. Quite frankly the performance issues should become almost a
> non-factor at that point.

I like that idea too - but I thought IB couldn't do crossover. I thought a switch is required?
On Thu, Sep 27, 2012 at 12:48 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> I like that idea too - but I thought IB couldn't do crossover. I thought
> a switch is required?

Crossover should be fine as long as you have a subnet manager on one of the hosts. Now you're going to ask me where you can get a subnet manager for illumos/solaris/whatever, and I'm going to have to plead the fifth because I haven't looked into it.

--Tim
2012-09-27 3:11, Richard Elling wrote:
>> Option 2: At present, both systems are using local mirroring: 3 mirror
>> pairs of 6 disks. [...] Once again, each zpool will only be imported on
>> one host, but in the event of a failure, you could force import it on
>> the other host.
>> Can anybody think of a reason why Option 2 would be stupid, or can you
>> think of a better solution?
>
> If they are close enough for "crossover cable" where the cable is UTP,
> then they are close enough for SAS.

Pardon my ignorance, can a system easily serve its local storage devices over SAS to a neighbor system (i.e. using a SAS HBA in place of an Ethernet NIC or an IB card in Ed's crossover scenario)? Would this be doable over today's COMSTAR, using a different storage path from the iSCSI stack most often used now?

//Jim
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-01 13:07 UTC
[zfs-discuss] vm server storage mirror
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> Pardon my ignorance, can a system easily serve its local storage
> devices over SAS to a neighbor system (i.e. using a SAS HBA in
> place of an Ethernet NIC or an IB card in Ed's crossover scenario)?
> Would this be doable over today's COMSTAR, using a different
> storage path from the iSCSI stack most often used now?

I was wondering the same thing - but it turns out to be irrelevant. Remember when I said this?

> Can anybody think of a reason why Option 2 would be stupid, or can you
> think of a better solution?

Well, now I know why it's stupid. Cuz it doesn't work right - it turns out, iscsi devices (and I presume SAS devices) are not removable storage. That means, if the device goes offline and comes back online again, it doesn't just gracefully resilver and move on without any problems; it's in a perpetual state of IO error, device unreadable. If there were simply cksum errors, or something like that, I could handle it. But it's bus error, device error, system can't operate, and I have to remove the device permanently.

The really odd thing is - it doesn't always show as faulted in zpool status. Even when it does show as faulted, I can zpool online, or zpool clear, to make the pool look healthy again. But when an app tries to use something in that zpool, the system grinds, I can see scsi errors spewing into /var/adm/messages, and sometimes the system will halt.

This is all caused because I disconnected / rebooted either the iscsi initiator or target.

Lesson learned: If you create an iscsi target, make *damn* sure it's an always-on system. And don't use just one. And don't do maintenance on them both anywhere near the same week.
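For what it's worth, a few generic places to look when chasing this kind of silent-but-grinding behavior on illumos/Solaris. These commands are not from the original report, just the usual suspects:

iscsiadm list target -v   # is the initiator session actually connected/logged in?
fmadm faulty              # anything FMA has already diagnosed?
fmdump -eV | tail -40     # raw error reports behind the scsi noise in /var/adm/messages
zpool status -v           # per-device read/write/cksum counters, if ZFS noticed at all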
2012-10-01 17:07, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> Well, now I know why it's stupid. Cuz it doesn't work right - it turns out,
> iscsi devices (and I presume SAS devices) are not removable storage. [...]
>
> Lesson learned: If you create an iscsi target, make *damn* sure it's an
> always-on system. And don't use just one. And don't do maintenance on them
> both anywhere near the same week.

And would some sort of clusterware help in this case? I.e. when the target goes down, it informs the initiator to "offline" the disk component gracefully (if that is possible). When the target comes back up, the automation would online the pool components, or replace them in-place, and *properly* resilver and clear the pool.

Wonder if that's possible and if that would help your case?

//Jim
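Done by hand, that graceful sequence would look something like the following sketch. The pool and device names are hypothetical:

# before taking the iscsi target host down for planned maintenance:
zpool offline vmpool c5t600144F0AAAABBBBd0

# once the target is back and the LU is visible to the initiator again:
zpool online vmpool c5t600144F0AAAABBBBd0   # resilvers only the writes it missed
zpool status -x                             # wait for the resilver to complete
zpool clear vmpool                          # then reset the error counters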
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-03 18:03 UTC
[zfs-discuss] vm server storage mirror
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> it doesn't work right - it turns out, iscsi devices (and I presume SAS
> devices) are not removable storage. That means, if the device goes offline
> and comes back online again, it doesn't just gracefully resilver and move
> on without any problems; it's in a perpetual state of IO error, device
> unreadable.

I am revisiting this issue today. I've tried everything I can think of to recreate this issue, and haven't been able to do it. I have certainly encountered some bad behaviors - which I'll expound upon momentarily - but they all seem to be addressable, fixable, logical problems, and none of them result in a supposedly good pool (as reported in zpool status) returning scsi IO errors or halting the system. The most likely explanation right now, for the bad behavior I saw before - perpetual IO error even after restoring the connection - is that I screwed something up in my iscsi config the first time.

Herein lie the new problems:

If I don't export the pool before rebooting, then either the iscsi target or initiator is shut down before the filesystems are unmounted. So the system spews all sorts of error messages while trying to go down, but it eventually succeeds. It's somewhat important to know whether it was the target or the initiator that went down first - if it was the target, then only the local disks became inaccessible, but if it was the initiator, then both the local and remote disks became inaccessible. I don't know yet.

Upon reboot, the pool fails to import, so the svc:/system/filesystem/local service fails and comes up in maintenance mode. The whole world is a mess; you have to log in at the physical text console to export the pool, and reboot. But it comes up cleanly the second time.

These sorts of problems seem like they should be solvable by introducing some service manifest dependencies... But there's no way to make it a generalization for the distribution as a whole (illumos/openindiana/oracle). It's just something that should be solvable on a case-by-case basis.

If you are going to be an initiator only, then it makes sense for svc:/network/iscsi/initiator to be required by svc:/system/filesystem/local.

If you are going to be a target only, then it makes sense for svc:/system/filesystem/local to be required by svc:/network/iscsi/target.

If you are going to be a target & initiator, then you could get yourself into a deadlock situation: make the filesystem depend on the initiator, make the initiator depend on the target, and make the target depend on the filesystem. Uh-oh. But we can break that cycle easily enough in a lot of situations. If, as in my case, the only targets are raw devices (not zvols), then it should be ok to make the filesystem depend on the initiator, which depends on the target, and the target doesn't depend on anything. If you're both a target and an initiator, but all of your targets are zvols that you export to other systems (you're not nesting a filesystem in a zvol of your own, are you?), then it's ok to let the target need the filesystem and the filesystem need the initiator, but the initiator doesn't need anything.

So in my case, I'm sharing raw disks, so I'm going to try to make the filesystem need the initiator, the initiator need the target, and the target need nothing. Haven't tried yet... Hopefully google will help accelerate me figuring out how to do that.
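For reference, the first of those dependencies could be declared interactively with svccfg along these lines. This is a sketch only - the property-group name is arbitrary, and as the follow-ups below note, adding a new service of your own may be preferable to editing a stock one:

sudo svccfg -s svc:/system/filesystem/local:default
svc:/system/filesystem/local:default> addpg iscsi-initiator dependency
svc:/system/filesystem/local:default> setprop iscsi-initiator/grouping = astring: "require_all"
svc:/system/filesystem/local:default> setprop iscsi-initiator/restart_on = astring: "none"
svc:/system/filesystem/local:default> setprop iscsi-initiator/type = astring: "service"
svc:/system/filesystem/local:default> setprop iscsi-initiator/entities = fmri: "svc:/network/iscsi/initiator:default"
svc:/system/filesystem/local:default> exit
sudo svcadm refresh svc:/system/filesystem/local:default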
2012-10-03 22:03, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> If you are going to be an initiator only, then it makes sense for svc:/network/iscsi/initiator to be required by svc:/system/filesystem/local.
> If you are going to be a target only, then it makes sense for svc:/system/filesystem/local to be required by svc:/network/iscsi/target.

Well, on my system that I complained a lot about last year, I've had a physical pool, a zvol in it, shared and imported over iscsi on loopback (or sometimes initiated from another box), and another pool inside that zvol ultimately. Since the overall construct including hardware lent itself to many problems and panics, as you may remember, I ultimately did not import the data pool nor the pool in the zvol via the common services and /etc/zfs/zpool.cache, but made new services for that.

If you want, I'll try to find the pieces and send them (off-list?), but the general idea was that I made two services - one for import (without a cachefile) of the physical pool and one for the virtual dcpool. *Maybe* I also made instances of the iscsi initiator and/or target services, and overall meshed it with proper dependencies to start in order: OS milestone svcs - pool - target - initiator - dcpool, and stop in the proper reverse order.

Ultimately, since the pool imports could occasionally crash that box, there were files to touch or remove in order to delay or cancel imports of pool or dcpool easily. Overall, this let the system boot into interactive mode, enable all of its standard services and mount the rpool filesystems way before attempting risky pool imports and iscsi. Of course, on a particular system you might reconfigure SMF services for zones or VMs to depend on accessibility of their particular storage pools to start up - and reversely for shutdowns.

> These sorts of problems seem like they should be solvable by introducing some service manifest dependencies... But there's no way to make it a generalization for the distribution as a whole (illumos/openindiana/oracle). It's just something that should be solvable on a case-by-case basis.

I think I got pretty close to a generalization, so after some code cleanup (things were hard-coded for this box) and even real-world testing on your setup, we might try to push this into OI or whoever picks it up. Now, I'll try to find these manifests and methods ;)

HTH,
//Jim Klimov
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 12:06 UTC
[zfs-discuss] vm server storage mirror
> From: Jim Klimov [mailto:jimklimov at cos.ru]
>
> Well, on my system that I complained a lot about last year,
> I've had a physical pool, a zvol in it, shared and imported
> over iscsi on loopback (or sometimes initiated from another
> box), and another pool inside that zvol ultimately.

Ick. And it worked?

> These sorts of problems seem like they should be solvable by introducing
> some service manifest dependencies... But there's no way to make it a
> generalization for the distribution as a whole (illumos/openindiana/oracle).
> It's just something that should be solvable on a case-by-case basis.

I started looking at that yesterday, and was surprised by how complex the dependency graph is. Also, I can't get graphviz to install or build, so I don't actually have a graph.

In any event, rather than changing the existing service dependencies, I decided to just make a new service, which would zpool import and zpool export the pools that are on iscsi, after the iscsi initiator comes up and before it goes down.

At present, the new service correctly mounts & dismounts the iscsi pool while I'm sitting there, but for some reason it fails during reboot. I ran out of time digging into it... I'll look some more tomorrow.
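When a hand-rolled service like that fails only at boot, the usual first stops are the SMF diagnostics and the method's own log. The service name below is a hypothetical placeholder, not the one actually created in this thread:

svcs -xv                                         # which services failed at boot, and why
svcs -d svc:/site/iscsi-pool:default             # were its dependencies actually online?
tail /var/svc/log/site-iscsi-pool:default.log    # the method script's own stdout/stderr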
2012-10-04 16:06, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Jim Klimov [mailto:jimklimov at cos.ru]
>>
>> Well, on my system that I complained a lot about last year,
>> I've had a physical pool, a zvol in it, shared and imported
>> over iscsi on loopback (or sometimes initiated from another
>> box), and another pool inside that zvol ultimately.
>
> Ick. And it worked?

Um, well. Kind of yes, but it ran into many rough corners - many of which I posted and asked about. The fatal one was my choice of smaller blocks in the zvol, so I learned that metadata (on 4k-sectored disks) could consume about as much as the userdata in that zvol/pool, so I ultimately migrated the data off that pool into the system's physical pool - not very easy given that there was little free space left (unexpectedly for me, until I understood the inner workings). Still, technically, there is little problem building such a setup - it just needs some more thorough understanding and planning. I did learn a lot, so it wasn't in vain, too.

>>> These sorts of problems seem like they should be solvable by introducing
>> some service manifest dependencies... But there's no way to make it a
>> generalization for the distribution as a whole (illumos/openindiana/oracle).
>> It's just something that should be solvable on a case-by-case basis.
>
> I started looking at that yesterday, and was surprised by how complex the
> dependency graph is. Also, I can't get graphviz to install or build, so I
> don't actually have a graph.

There are also loops ;)

# svcs -d filesystem/usr
STATE          STIME    FMRI
online         Aug_27   svc:/system/scheduler:default
...

# svcs -d scheduler
STATE          STIME    FMRI
online         Aug_27   svc:/system/filesystem/minimal:default
...

# svcs -d filesystem/minimal
STATE          STIME    FMRI
online         Aug_27   svc:/system/filesystem/usr:default
...

> In any event, rather than changing the existing service dependencies, I
> decided to just make a new service, which would zpool import and zpool
> export the pools that are on iscsi, after the iscsi initiator comes up
> and before it goes down.
>
> At present, the new service correctly mounts & dismounts the iscsi pool
> while I'm sitting there, but for some reason it fails during reboot. I ran
> out of time digging into it... I'll look some more tomorrow.

That's about what I did and described. I too avoid hacking into distro-provided services, so that upgrades don't break my customizations and vice versa. My code is not yet accessible to me, but I think my instance of the target/initiator services did a temp-enable/disable of the stock services as its start/stop methods, and the system iscsi services were kept disabled by default. This way I could start them at a needed moment in time without changing their service definitions.

Also note that if you do prefer to rely on the stock services, you can define reverse dependencies in your own new services (i.e. "I declare that iscsi/target depends on me. Yours, pool-import").

HTH,
//Jim
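In manifest terms, such a reverse dependency is a <dependent> element in the new service's own manifest. A sketch with hypothetical names - the stock target service is cited only as an example:

<!-- in the new pool-import service's manifest: declares that the stock
     iscsi target depends on this service, without editing its manifest -->
<dependent name='pool-import_iscsi-target' grouping='require_all' restart_on='none'>
        <service_fmri value='svc:/network/iscsi/target:default'/>
</dependent>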
This whole thread has been fascinating. I really wish we (OI) had the two following things that FreeBSD supports:

1. HAST - provides a block-level driver that mirrors a local disk to a network "disk", presenting the result as a block device using the GEOM API.

2. CARP.

I have a prototype with two FreeBSD VMs where I can fail over back and forth, and it works beautifully. Block-level replication using all open source software. There were some glitches involving boot and shutdown (dependencies that are not set up properly), but I think if there was enough love in the FreeBSD community that could be fixed.

I could be wrong, but it doesn't *seem* as if either HAST (or an equivalent) or CARP exists in the OI (or other *solaris derivatives) space. Shame if so...
Forgot to mention: my interest in doing this was so I could have my ESXi host point at a CARP-backed IP address for the datastore, and I would have no single point of failure at the storage level.
On Oct 4, 2012, at 8:35 AM, Dan Swartzendruber <dswartz at druber.com> wrote:
>
> This whole thread has been fascinating. I really wish we (OI) had the two
> following things that FreeBSD supports:
>
> 1. HAST - provides a block-level driver that mirrors a local disk to a
> network "disk", presenting the result as a block device using the GEOM API.

This is called AVS in the Solaris world.

In general, these systems suffer from a fatal design flaw: the authoritative view of the data is not also responsible for the replication. In other words, you can provide coherency but not consistency. Both are required to provide a single view of the data.

> 2. CARP.

This exists as part of the OHAC project.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
On 10/4/2012 11:48 AM, Richard Elling wrote:
> On Oct 4, 2012, at 8:35 AM, Dan Swartzendruber <dswartz at druber.com> wrote:
>>
>> 1. HAST - provides a block-level driver that mirrors a local disk to a
>> network "disk", presenting the result as a block device using the GEOM API.
>
> This is called AVS in the Solaris world.
>
> In general, these systems suffer from a fatal design flaw: the
> authoritative view of the data is not also responsible for the replication.
> In other words, you can provide coherency but not consistency. Both are
> required to provide a single view of the data.

Can you expand on this?

>> 2. CARP.
>
> This exists as part of the OHAC project.
> -- richard

These are both freely available?
On Oct 4, 2012, at 9:07 AM, Dan Swartzendruber <dswartz at druber.com> wrote:
> On 10/4/2012 11:48 AM, Richard Elling wrote:
>> In general, these systems suffer from a fatal design flaw: the
>> authoritative view of the data is not also responsible for the replication.
>> In other words, you can provide coherency but not consistency. Both are
>> required to provide a single view of the data.
>
> Can you expand on this?

I could, but I've already written a book on clustering. For a more general approach to understanding clustering, I can highly recommend Pfister's In Search of Clusters.
http://www.amazon.com/In-Search-Clusters-2nd-Edition/dp/0138997098

NB, clustered storage is the same problem as clustered compute wrt state.

>>> 2. CARP.
>>
>> This exists as part of the OHAC project.
>
> These are both freely available?

Yes.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
2012-10-04 19:48, Richard Elling wrote:
>> 2. CARP.
>
> This exists as part of the OHAC project.
> -- richard

Wikipedia says CARP is the open-source equivalent of VRRP. And we have that in OI, don't we? Would it suffice?

# pkg info -r vrrp
          Name: system/network/routing/vrrp
       Summary: Solaris VRRP protocol
   Description: Solaris VRRP protocol service
      Category: System/Administration and Configuration
         State: Not installed
     Publisher: openindiana.org
       Version: 0.5.11
 Build Release: 5.11
        Branch: 0.151.1.5
Packaging Date: Sat Jun 30 20:01:06 2012
          Size: 275.57 kB
          FMRI: pkg://openindiana.org/system/network/routing/vrrp at 0.5.11,5.11-0.151.1.5:20120630T200106Z

HTH,
//Jim Klimov
On 10/4/2012 12:19 PM, Richard Elling wrote:
> On Oct 4, 2012, at 9:07 AM, Dan Swartzendruber <dswartz at druber.com> wrote:
>> On 10/4/2012 11:48 AM, Richard Elling wrote:
>>> In general, these systems suffer from a fatal design flaw: the
>>> authoritative view of the data is not also responsible for the
>>> replication. In other words, you can provide coherency but not
>>> consistency. Both are required to provide a single view of the data.

Sorry to be dense here, but I'm not getting how this is a cluster setup, or what your point wrt authoritative vs replication meant. In the scenario I was looking at, one host is providing access to clients - on the backup host, no services are provided at all. The master node does mirrored writes to the local disk and the network disk. The mirrored write does not return until the backup host confirms the data is safely written to disk. If a failover event occurs, there should not be any writes the client has been told completed that were not completed to both sides. The master node stops responding to the virtual IP, and the backup starts responding to it. Any pending NFS writes will presumably be retried by the client, and the new master node has completely up-to-date data on disk to respond with.

Maybe I am focusing too narrowly here, but in the case I am looking at, there is only a single node which is active at any time, and it is responsible for both replication and access by clients, so I don't see the failure modes you allude to. Maybe I need to shell out for that book :)
2012-10-04 21:19, Dan Swartzendruber writes:
> Maybe I am focusing too narrowly here, but in the case I am looking at,
> there is only a single node which is active at any time, and it is
> responsible for both replication and access by clients, so I don't see
> the failure modes you allude to. Maybe I need to shell out for that book :)

What if the backup host is down (i.e. the ex-master after the failover)? Will your failed-over pool accept no writes until both storage machines are working?

What if the internetworking between these two heads has a glitch, and as a result both of them become masters of their private copies (mirror halves), and perhaps both even manage to accept writes from clients?

This is the clustering part, which involves "fencing" around the node which is considered dead, perhaps including a hardware reset request just to make sure it's dead, before taking over the resources it used to master (STONITH - Shoot The Other Node In The Head). In particular, clusters suggest that for heartbeats, to make sure both machines really are working, you use at least two separate wires (i.e. serial and LAN) without active hardware (switches) in between, separate from the data networking.

HTH,
//Jim
On 10/4/2012 1:56 PM, Jim Klimov wrote:
> This is the clustering part, which involves "fencing" around the node
> which is considered dead, perhaps including a hardware reset request
> just to make sure it's dead, before taking over the resources it used
> to master (STONITH - Shoot The Other Node In The Head). [...]

This all makes a lot of sense. I didn't mean to imply there are no failure modes that can take you down entirely. I was aware of the split-brain issue; I was just not sure what Richard was getting at...
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 21:44 UTC
[zfs-discuss] vm server storage mirror
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> There are also loops ;)
>
> # svcs -d filesystem/usr
> STATE          STIME    FMRI
> online         Aug_27   svc:/system/scheduler:default
> ...
>
> # svcs -d scheduler
> STATE          STIME    FMRI
> online         Aug_27   svc:/system/filesystem/minimal:default
> ...
>
> # svcs -d filesystem/minimal
> STATE          STIME    FMRI
> online         Aug_27   svc:/system/filesystem/usr:default
> ...

How is that possible? Why would the system be willing to start up in a situation like that? It *must* be launching one of those, even without its dependencies met...

The answer to this question will, in all likelihood, shed some light on my situation - trying to understand why my iscsi-mounted zpool import/export service is failing to go down or come up in the order I expected, when it's dependent on the iscsi initiator.
2012-10-05 1:44, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> There are also loops ;)
>> [...]
>
> How is that possible? Why would the system be willing to start up in a
> situation like that? It *must* be launching one of those, even without
> its dependencies met...

Well, it seems just like a peculiar effect of required vs. optional dependencies. The loop is in the default installation. Details:

# svcprop filesystem/usr | grep scheduler
svc:/system/filesystem/usr:default/:properties/scheduler_usr/entities fmri svc:/system/scheduler
svc:/system/filesystem/usr:default/:properties/scheduler_usr/external boolean true
svc:/system/filesystem/usr:default/:properties/scheduler_usr/grouping astring optional_all
svc:/system/filesystem/usr:default/:properties/scheduler_usr/restart_on astring none
svc:/system/filesystem/usr:default/:properties/scheduler_usr/type astring service

# svcprop scheduler | grep minimal
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/entities fmri svc:/system/filesystem/minimal
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/grouping astring require_all
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/restart_on astring none
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/type astring service

# svcprop filesystem/minimal | grep usr
usr/entities fmri svc:/system/filesystem/usr
usr/grouping astring require_all
usr/restart_on astring none
usr/type astring service

> The answer to this question will, in all likelihood, shed some light on my
> situation - trying to understand why my iscsi-mounted zpool import/export
> service is failing to go down or come up in the order I expected, when it's
> dependent on the iscsi initiator.

Likewise - see what dependency type you introduced, and verify that you've "svcadm refreshed" the service after config changes.

HTH,
//Jim
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-05 13:47 UTC
[zfs-discuss] vm server storage mirror
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> Well, it seems just like a peculiar effect of required vs. optional
> dependencies. The loop is in the default installation. Details:
>
> # svcprop filesystem/usr | grep scheduler
> [...]
> # svcprop scheduler | grep minimal
> [...]
> # svcprop filesystem/minimal | grep usr
> [...]

I must be missing something - I don't see anything above that indicates any required vs optional dependencies. I'm not quite sure what I'm supposed to be seeing in the svcprop outputs you pasted...

> > The answer to this question will, in all likelihood, shed some light on my
> > situation - trying to understand why my iscsi-mounted zpool import/export
> > service is failing to go down or come up in the order I expected, when
> > it's dependent on the iscsi initiator.
>
> Likewise - see what dependency type you introduced, and verify
> that you've "svcadm refreshed" the service after config changes.

Thank you for the suggestion - I like the direction this is heading, but I don't know how to do that yet. (This email is the first I ever heard of it.) Rest assured, I'll be googling and reading more man pages in the meantime.
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-05 18:53 UTC
[zfs-discuss] vm server storage mirror
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> I must be missing something - I don't see anything above that indicates any
> required vs optional dependencies.

Ok, I see that now. (Thanks to the SMF FAQ.) A dependency may have grouping optional_all, require_any, or require_all. Mine is require_all, and I figured out the problem. I had my automatic zpool import/export script dependent on the initiator... But it wasn't the initiator going down first. It was the target going down first. So the solution is like this:

sudo svccfg -s svc:/network/iscsi/initiator:default
svc:/network/iscsi/initiator:default> addpg iscsi-target dependency
svc:/network/iscsi/initiator:default> setprop iscsi-target/grouping = astring: "require_all"
svc:/network/iscsi/initiator:default> setprop iscsi-target/restart_on = astring: "none"
svc:/network/iscsi/initiator:default> setprop iscsi-target/type = astring: "service"
svc:/network/iscsi/initiator:default> setprop iscsi-target/entities = fmri: "svc:/network/iscsi/target:default"
svc:/network/iscsi/initiator:default> exit
sudo svcadm refresh svc:/network/iscsi/initiator:default

And additionally, create the SMF service dependent on the initiator, which will import/export the iscsi pools automatically:
http://nedharvey.com/blog/?p=105
2012-10-05 22:53, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> http://nedharvey.com/blog/?p=105

Nice writeup, thanks. Perhaps you could also post/link it on the OI wiki so the community can find it more easily? A few comments:

1) For readability I'd use "...| awk '{print $1}'" instead of sed:

-  for GUID in `sudo sbdadm list-lu | grep rdsk | sed 's/ .*//'`
+  for GUID in `sudo sbdadm list-lu | grep rdsk | awk '{print $1}'`

On one hand, different implementations of sed might parse regexps differently; on the other, the column order might change, and changing a number in awk would be more straightforward.

2) Here you can just redirect stdin from /dev/null:

-  sudo format -e    # Make a note of the new device names. And hit Ctrl-C.
+  sudo format -e < /dev/null

3) In iscsi-pool-ctrl.sh it is more readable to replace the 'if "$1"...elif..else' clause with 'case "$1" in ... esac'. That is also easier to expand if needed; for example, to alias 'import|start)' and 'export|stop)' for more standard method naming.

3.1) Also you should probably do "zpool import -o cachefile ..." or plain "zpool import -R / ..." to set a particular cachefile or use none, to avoid auto-import upon boot via the standard file /etc/zfs/zpool.cache (which can break your filesystem/local service). Also note that use of the altroot (-R) option disables the cachefile by default, so you can use it as a shortcut.

3.2) The exit codes should be aligned with the SMF status codes, so you should include /lib/svc/share/smf_include.sh and return one of these:

SMF_EXIT_OK=0
SMF_EXIT_ERR_FATAL=95
SMF_EXIT_ERR_CONFIG=96
SMF_EXIT_MON_DEGRADE=97
SMF_EXIT_MON_OFFLINE=98
SMF_EXIT_ERR_NOSMF=99
SMF_EXIT_ERR_PERM=100

(You can validate the inclusion of that file, so if it fails, you can define these values yourself in the script, i.e. to use it as an initscript on a system without SMF.)

3.3) To catch "device busy" errors you can retry failed zpool export runs with "zpool export -f", which tries a bit harder.

Otherwise, quite LGTM :)

HTH,
//Jim Klimov
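Folding suggestions 3) through 3.3) together, the method script might end up looking roughly like the following. This is a sketch only - "vmpool" is a hypothetical pool name, and the actual iscsi-pool-ctrl.sh from the blog post may differ:

#!/bin/sh
# Sketch: case-based import/export method, no cachefile, SMF exit codes,
# and a forced-export fallback for "device busy".
. /lib/svc/share/smf_include.sh

case "$1" in
import|start)
        zpool import -o cachefile=none vmpool || exit $SMF_EXIT_ERR_FATAL
        ;;
export|stop)
        zpool export vmpool || zpool export -f vmpool || exit $SMF_EXIT_ERR_FATAL
        ;;
*)
        echo "Usage: $0 {start|stop}" >&2
        exit $SMF_EXIT_ERR_CONFIG
        ;;
esac
exit $SMF_EXIT_OK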
Hello Ed and all,

Just for the sake of completeness, I dug out my implementation of SMF services for iscsi-imported pools. As I said, it is kinda ugly due to hardcoded things which should rather be in SMF properties or at least in config files, but this was a single-solution POC. Here is the client side:

1) Method script to import the "dcpool" over iscsi. It has an optional config file that I can touch, fill or remove in order to delay the import of the pool, and another file to disable the import (perhaps by touching it while the sleep is in effect) but not fail or reconfigure the SMF instance. If the latter file is present in advance, the service is temp-disabled (reverse dependency, see XML below). Finally, export of the pool is retried with force until success (or SMF timeout):

$ cat /lib/svc/method/iscsi-mount-dcpool
------
#!/bin/sh

DELAY=600

case "$1" in
start)
        if [ -f /etc/zfs/delay.dcpool ]; then
                D="`head -1 /etc/zfs/delay.dcpool`"
                [ "$D" -gt 0 ] 2>/dev/null && DELAY="$D" || D=10
                echo "`date`: Delay requested... ${DELAY}sec"
                sleep ${DELAY}
                echo "`date`: Done sleeping"
        fi
        if [ -f /etc/zfs/noimport-dcpool ]; then
                echo "`date`: /etc/zfs/noimport-dcpool block-file reappeared. Aborting."
                exit 0
        fi
        [ -d /dcpool/export -o -f /etc/zfs/noimport-dcpool ] || \
        ( echo "`date`: beginning dcpool import..."
          time zpool import -o cachefile=none dcpool
          RET=$?
          echo "`date`: dcpool import complete ($RET)"
          exit $RET )
        ;;
stop)
        [ ! -d /dcpool/export ] || \
        time zpool export dcpool || \
        while ! time zpool export -f dcpool; do sleep 1; done
        ;;
esac
------

2) This script just wraps the call to the original method (and adds a small sleep) and allows me to create a separate service and define dependencies on it - and not touch the original services:

$ cat /lib/svc/method/iscsi-initiator-dcpool
-------
#!/bin/sh

case "$1" in
start)
        /lib/svc/method/iscsi-initiator "$@" && sleep 10
        ;;
stop)
        sleep 10 && /lib/svc/method/iscsi-initiator "$@"
        ;;
*)
        /lib/svc/method/iscsi-initiator "$@"
        ;;
esac
-------

3) The XML manifests for the services:

NOTE: Startup time is unlimited, because pool processing (deferred frees, etc.) could take days on my setup, and the server (target) could be inaccessible for some time too.
$ cat /root/smf/iscsi_mount-dcpool.xml
-------
<?xml version='1.0'?>
<!DOCTYPE service_bundle SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle type='manifest' name='export'>
  <service name='network/iscsi/mount-dcpool' type='service' version='0'>
    <dependency name='loopback' grouping='require_all' restart_on='none' type='service'>
      <service_fmri value='svc:/network/loopback'/>
    </dependency>
    <dependency name='initiator-dcpool' grouping='require_all' restart_on='restart' type='service'>
      <service_fmri value='svc:/network/iscsi/initiator-dcpool:default'/>
    </dependency>
    <dependency name='noimport-file' grouping='exclude_all' restart_on='refresh' type='path'>
      <service_fmri value='file://localhost/etc/zfs/noimport-dcpool'/>
    </dependency>
    <instance name='default' enabled='false'>
      <exec_method name='start' type='method' exec='/lib/svc/method/iscsi-mount-dcpool %m' timeout_seconds='0'/>
      <exec_method name='stop' type='method' exec='/lib/svc/method/iscsi-mount-dcpool %m' timeout_seconds='600'/>
      <property_group name='startd' type='framework'>
        <propval name='duration' type='astring' value='transient'/>
      </property_group>
      <template>
        <common_name>
          <loctext xml:lang='C'>import 'dcpool' over iscsi</loctext>
        </common_name>
      </template>
    </instance>
    <stability value='Unstable'/>
  </service>
</service_bundle>
-------

$ cat /root/smf/iscsi_initiator-dcpool.xml
-------
<?xml version='1.0'?>
<!DOCTYPE service_bundle SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle type='manifest' name='export'>
  <service name='network/iscsi/initiator-dcpool' type='service' version='0'>
    <create_default_instance enabled='false'/>
    <single_instance/>
    <dependency name='loopback' grouping='require_any' restart_on='error' type='service'>
      <service_fmri value='svc:/network/loopback'/>
    </dependency>
    <dependency name='mus' grouping='require_any' restart_on='error' type='service'>
      <service_fmri value='svc:/milestone/multi-user-server:default'/>
    </dependency>
    <exec_method name='start' type='method' exec='/lib/svc/method/iscsi-initiator-dcpool %m' timeout_seconds='600'>
      <method_context>
        <method_credential user='root' group='root' privileges='basic,sys_devices,sys_mount'/>
      </method_context>
    </exec_method>
    <exec_method name='stop' type='method' exec='/lib/svc/method/iscsi-initiator-dcpool %m' timeout_seconds='600'>
      <method_context>
        <method_credential user='root' group='root' privileges='basic,sys_devices,sys_mount'/>
      </method_context>
    </exec_method>
    <property_group name='dependents' type='framework'>
      <property name='iscsi-initiator_multi-user' type='fmri'/>
      <property name='iscsi-mount-dcpool' type='fmri'/>
    </property_group>
    <stability value='Evolving'/>
    <template>
      <common_name>
        <loctext xml:lang='C'>iSCSI initiator daemon for dcpool</loctext>
      </common_name>
      <documentation>
        <manpage title='iscsi' section='7D' manpath='/usr/share/man'/>
      </documentation>
    </template>
  </service>
</service_bundle>
-------

Maybe some of the former script's ideas can wind up in your solution; I'm not sure if the initiator-wrapper is that useful =)

Good luck,
//Jim Klimov
2012-10-06 14:49, Jim Klimov wrote:
> $ cat /lib/svc/method/iscsi-mount-dcpool
> ------
> #!/bin/sh
>
> DELAY=600
>
> case "$1" in
> start)
>         if [ -f /etc/zfs/delay.dcpool ]; then
>                 D="`head -1 /etc/zfs/delay.dcpool`"
>                 [ "$D" -gt 0 ] 2>/dev/null && DELAY="$D" || D=10
>                 echo "`date`: Delay requested... ${DELAY}sec"

Oops, a typo (thoughtlessly entered by hand into the email, although it happens to be harmless in effect due to another typo in the process):

-                [ "$D" -gt 0 ] 2>/dev/null && DELAY="$D" || D=10
+                [ "$D" -gt 0 ] 2>/dev/null && DELAY="$D"

The default delay is defined above, and is large enough (600) so that I can enter a panicking system after reboot and disable the pool-importing service or otherwise influence or monitor the situation.

//Jim
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-19 23:55 UTC
[zfs-discuss] vm server storage mirror
Yikes, I'm back at it again, and sooooo frustrated.

For about 2-3 weeks now, I had the iscsi mirror configuration in production, as previously described. Two disks on system 1 mirror against two disks on system 2, everything done via iscsi, so you could zpool export on machine 1, and then zpool import on machine 2 for a manual failover. Created the dependency - initiator depends on target - and created a new smf service to mount the iscsi zpool after the initiator is up (and consequently export the zpool before the initiator shuts down). Able to reboot, everything working perfectly. Until today.

Today I rebooted one system for some maintenance, and it stayed down longer than expected, so those disks started throwing errors on the second machine. The first system eventually came up again, the second system resilvered, and everything looked good. I zpool clear'd the pool on the second machine just to make the counters look pretty again.

But it wasn't pretty at all. This is so bizarre - throughout the day, the VMs on system 2 kept choking. I had to powercycle system 2 about half a dozen times due to unresponsiveness. Exactly the type of behavior you expect for an IO error - but nothing whatsoever appears in the system log, and the zpool status still looks clean.

Several times, I destroyed the pool and recreated it completely from backup. zfs send and zfs receive both work fine. But strangely - when I launch a VM, the IO grinds to a halt, and I'm forced to powercycle (usually) the host. You might try to conclude it's something wrong with virtualbox - but it's not. I literally copied & pasted the zfs send | zfs receive commands that restored the pool from backup, but this time restored it onto local storage. The only difference is local disk versus iscsi pool. And then it finally worked without any glitches.

During the day, trying to get the iscsi pool up again - this is so bizarre - I did everything I could think of to get back to a pristine state. I removed iscsi targets, I removed luns (lu's), I removed the static discovery and re-added it, got new device names, I wiped the disks (zpool destroy & zpool create), re-created lu's, re-created static discovery, re-created targets, re-created zpools... The behavior was the same no matter what I did. I can create the pool, import it, and zfs receive onto it no problem, but then when I launch the VM, the whole system grinds to a halt. VirtualBox will be in a "sleep" state, VirtualBox shows the green light on the hard drive indicating it's trying to read, meanwhile if I try to X it out, it won't die, and gnome gives me the "Force Quit" dialog; meanwhile I can sudo kill -KILL VirtualBox, and VirtualBox *still* won't die. Any "zpool" or "zfs" command I type in hangs indefinitely (even the time-slider daemon or zfs auto snapshot are hung). I can poke around the system in other areas - on other pools and stuff - but the only way out of it is a power cycle.

It's so weird that once the problem happens, I have not yet found any way to recover from it except to reformat and reinstall the OS for the whole system. I cannot, for the life of me, think of *any*thing that could be storing state like this, preventing me from getting back into a usable iscsi mirror pool.

One thing I haven't tried yet - it appears, I think, that when you make a disk, let's say c2t4d0, an iscsi target, let's say c6t7blahblahblahd0... it appears, I think, that c6t7blahblahblahd0 is actually c2t4d0s2.
I could create a pool using c2t4d0, and/or zero the whole disk, completely obliterating any semblance of partition tables inside there, or old redundant copies of old uberblocks or anything like that. But seriously, I'm grasping at straws here, just trying to find *any* place where some bad state is stored that I haven't thought of yet. I shouldn't need to reformat the host.
> Several times, I destroyed the pool and recreated it completely from
> backup. zfs send and zfs receive both work fine. But strangely - when I
> launch a VM, the IO grinds to a halt, and I'm forced to powercycle
> (usually) the host.

A shot in the dark here, but perhaps one of the disks involved is taking a long time to return from reads, but is returning eventually, so ZFS doesn't notice the problem? Watching 'iostat -x' for busy time while a VM is hung might tell you something.

Tim
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-20 12:39 UTC
[zfs-discuss] vm server storage mirror
> From: Timothy Coalson [mailto:tsc5yc at mst.edu]
> Sent: Friday, October 19, 2012 9:43 PM
>
> A shot in the dark here, but perhaps one of the disks involved is taking a
> long time to return from reads, but is returning eventually, so ZFS doesn't
> notice the problem? Watching 'iostat -x' for busy time while a VM is hung
> might tell you something.

Oh yeah - this is also bizarre. I watched "zpool iostat" for a while. It was showing me:

Operations (read and write) consistently 0
Bandwidth (read and write) consistently non-zero, but something small, like 1k-20k or so.

Maybe that is normal to someone who uses zpool iostat more often than I do. But to me, zero operations resulting in non-zero bandwidth defies logic.
On Sat, Oct 20, 2012 at 7:39 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> Oh yeah - this is also bizarre. I watched "zpool iostat" for a while. It
> was showing me:
>
> Operations (read and write) consistently 0
> Bandwidth (read and write) consistently non-zero, but something small,
> like 1k-20k or so.
>
> Maybe that is normal to someone who uses zpool iostat more often than I
> do. But to me, zero operations resulting in non-zero bandwidth defies
> logic.

It might be operations per second, and it is rounding down (I know this happens in DTrace normalization, not sure about zpool/zfs); try an interval of 1 (perhaps with -v) and see if you still get 0 operations. I haven't seen zero operations with nonzero bandwidth on my pools - I always see lots of operations in bursts - so it sounds like you might be on to something.

Also, iostat -x shows device busy time, which is usually higher on the slowest disk when there is an imbalance, while zpool iostat does not. So, if it happens to be a single device's fault, iostat -nx has a better chance of finding it (the n flag translates the disk names to the device names used by the system, so you can figure out which one is the problem).

Tim