Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Sep-26 17:54 UTC
[zfs-discuss] vm server storage mirror
Here's another one.

Two identical servers are sitting side by side. They could be connected to each other via anything (presently using a crossover ethernet cable), and obviously they both connect to the regular LAN. You want to serve VMs from at least one of them, and even if the VMs aren't fault tolerant, you want at least the storage to be live synced. The first obvious thing to do is simply cron a zfs send | zfs receive at a very frequent interval. But there are a lot of downsides to that - besides the fact that you have to settle for some granularity, you also have a script on one system that will clobber the other system. So in the event of a failure, you might promote the backup into production, and then you have to be careful not to let it get clobbered when the main server comes up again.

I much prefer the idea of using a ZFS mirror between the two systems, even if it comes with a performance penalty from bottlenecking the storage onto Ethernet. But there are several ways to possibly do that, and I'm wondering which will be best.

Option 1: Each system creates a big zpool of the local storage. Then create a zvol within the zpool, and export it over iscsi to the other system. Now both systems can see a local zvol and a remote zvol, which they can use to create a zpool mirror. The reason I don't like this idea is that it's a zpool within a zpool, including the double-checksumming and everything. But the double-checksumming isn't such a concern to me - I'm mostly afraid some horrible performance or reliability problem might result. Naturally, you would only zpool import the nested zpool on one system; the other system would basically just ignore it. But in the event of a primary failure, you could force import the nested zpool on the secondary system.

Option 2: At present, both systems are using local mirroring: 3 mirror pairs of 6 disks. I could break these mirrors and export one side over to the other system... and vice versa. So neither server will be doing local mirroring; they will both be mirroring across iscsi to targets on the other host. Once again, each zpool will only be imported on one host, but in the event of a failure, you could force import it on the other host.

Can anybody think of a reason why Option 2 would be stupid, or can you think of a better solution?
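For concreteness, Option 2 boils down to roughly the following COMSTAR steps on illumos/Solaris. This is a sketch only - the device names, pool name, and placeholders in angle brackets are hypothetical, the exact commands used in this thread may differ, and the same steps would be repeated symmetrically on the other host:

# On host B: expose one local disk as an iSCSI logical unit (COMSTAR)
sudo svcadm enable -r svc:/network/iscsi/target:default
sudo sbdadm create-lu /dev/rdsk/c2t4d0        # note the GUID it prints
sudo stmfadm add-view <GUID-from-previous-step>
sudo itadm create-target                      # note the target IQN

# On host A: attach the remote disk and mirror it against a local one
sudo svcadm enable svc:/network/iscsi/initiator:default
sudo iscsiadm add static-config <target-IQN>,<host-B-IP>
sudo iscsiadm modify discovery --static enable
sudo devfsadm -i iscsi                        # the new c#t...d0 device appears
sudo zpool create vmpool mirror c2t1d0 <new-iscsi-device>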
If you're willing to try FreeBSD, there's HAST (aka Highly Available Storage) for this very purpose. You use hast to create mirror pairs using 1 disk from each box, thus creating /dev/hast/* nodes. Then you use those to create the zpool on the 'primary' box. All writes to the pool on the primary box are mirrored over the network to the secondary box. When the primary box goes down, the secondary imports the pool and carries on. When the primary box comes online, it syncs the data back from the secondary, and then either takes over as primary or becomes the new secondary.

On Sep 26, 2012 10:54 AM, "Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)" <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> You want to serve VMs from at least one of them, and even if the VMs aren't
> fault tolerant, you want at least the storage to be live synced. [...]
> Can anybody think of a reason why Option 2 would be stupid, or can you
> think of a better solution?
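For anyone curious what that looks like in practice, a minimal HAST sketch on FreeBSD might be the following. Hostnames, addresses, and disk names here are hypothetical:

# /etc/hast.conf on both boxes - one resource per disk pair
resource disk0 {
        on nodeA {
                local /dev/ada1
                remote 10.0.0.2
        }
        on nodeB {
                local /dev/ada1
                remote 10.0.0.1
        }
}

# on both boxes:
hastctl create disk0
service hastd onestart

# on the primary only:
hastctl role primary disk0
zpool create tank /dev/hast/disk0   # each hast device is already a network mirror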
"head units" crash or do weird things, but disks persist. There are a couple of HA head-unit solutions out there but most of them have their own separate storage and they effectively just send transaction groups to each other. The other way is to connect 2 nodes to an external SAS/FC chassis. create desired ZPools. Assign some subset of pools to node A, the rest to node B. When failure occurs the other node imports the other''s pools and exports as NFS/iSCSI/whatever. You''ll have to have a clustering/quorum and resource migration subsystem obviously. Or if you want simple act/passive, a means to make sure both heads don''t try to import the same pools.
On Wed, Sep 26, 2012 at 12:54 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> I much prefer the idea of using a ZFS mirror between the two systems, even
> if it comes with a performance penalty from bottlenecking the storage onto
> Ethernet. [...]
> Can anybody think of a reason why Option 2 would be stupid, or can you
> think of a better solution?

I would suggest if you're doing a crossover between systems, you use infiniband rather than ethernet. You can eBay a 40Gb IB card for under $300. Quite frankly the performance issues should become almost a non-factor at that point.

--Tim
On Sep 26, 2012, at 10:54 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> Option 1: Each system creates a big zpool of the local storage. Then create
> a zvol within the zpool, and export it over iscsi to the other system. [...]
> But in the event of a primary failure, you could force import the nested
> zpool on the secondary system.

This was described by Thorsten a few years ago.
http://www.osdevcon.org/2009/slides/high_availability_with_minimal_cluster_torsten_frueauf.pdf

IMHO, the issues are operational: troubleshooting could be very challenging.

> Option 2: At present, both systems are using local mirroring: 3 mirror pairs
> of 6 disks. [...] Once again, each zpool will only be imported on one host,
> but in the event of a failure, you could force import it on the other host.
>
> Can anybody think of a reason why Option 2 would be stupid, or can you
> think of a better solution?

If they are close enough for "crossover cable" where the cable is UTP, then they are close enough for SAS.
 -- richard

--
illumos Day & ZFS Day, Oct 1-2, 2012, San Francisco  www.zfsday.com
Richard.Elling at RichardElling.com
+1-760-896-4422
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Sep-27 17:48 UTC
[zfs-discuss] vm server storage mirror
> From: Tim Cook [mailto:tim at cook.ms]
> Sent: Wednesday, September 26, 2012 3:45 PM
>
> I would suggest if you're doing a crossover between systems, you use
> infiniband rather than ethernet. You can eBay a 40Gb IB card for under
> $300. Quite frankly the performance issues should become almost a
> non-factor at that point.

I like that idea too - but I thought IB couldn't do crossover. I thought a switch is required?
On Thu, Sep 27, 2012 at 12:48 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> I like that idea too - but I thought IB couldn't do crossover. I thought
> a switch is required?

Crossover should be fine as long as you have a subnet manager on one of the hosts. Now you're going to ask me where you can get a subnet manager for illumos/solaris/whatever, and I'm going to have to plead the fifth because I haven't looked into it.

--Tim
2012-09-27 3:11, Richard Elling wrote:
>> Option 2: At present, both systems are using local mirroring: 3 mirror
>> pairs of 6 disks. [...] Once again, each zpool will only be imported on
>> one host, but in the event of a failure, you could force import it on
>> the other host.
>> Can anybody think of a reason why Option 2 would be stupid, or can you
>> think of a better solution?
>
> If they are close enough for "crossover cable" where the cable is UTP,
> then they are close enough for SAS.

Pardon my ignorance, can a system easily serve its local storage devices over SAS to a neighbor system (i.e. using a SAS HBA in place of an Ethernet NIC or an IB card in Ed's crossover scenario)? Would this be doable over today's COMSTAR, using a different storage path from the iSCSI stack most often used now?

//Jim
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-01 13:07 UTC
[zfs-discuss] vm server storage mirror
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> Pardon my ignorance, can a system easily serve its local storage
> devices over SAS to a neighbor system (i.e. using a SAS HBA in
> place of an Ethernet NIC or an IB card in Ed's crossover scenario)?
> Would this be doable over today's COMSTAR, using a different
> storage path from the iSCSI stack most often used now?

I was wondering the same thing - but it turns out to be irrelevant. Remember when I said this?

> Can anybody think of a reason why Option 2 would be stupid, or can you
> think of a better solution?

Well, now I know why it's stupid. Cuz it doesn't work right - it turns out, iscsi devices (and I presume SAS devices) are not removable storage. That means, if the device goes offline and comes back online again, it doesn't just gracefully resilver and move on without any problems; it's in a perpetual state of IO error, device unreadable. If there were simply cksum errors, or something like that, I could handle it. But it's bus error, device error, system can't operate, and I have to remove the device permanently.

The really odd thing is - it doesn't always show as faulted in zpool status. Even when it does show as faulted, I can zpool online, or zpool clear, to make the pool look healthy again. But when an app tries to use something in that zpool, the system grinds, I can see scsi errors spewing into /var/adm/messages, and sometimes the system will halt.

This is all caused because I disconnected / rebooted either the iscsi initiator or target.

Lesson learned: If you create an iscsi target, make *damn* sure it's an always-on system. And don't use just one. And don't do maintenance on them both anywhere near the same week.
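For what it's worth, a few generic places to look when chasing this kind of silent-but-grinding behavior on illumos/Solaris. These commands are not from the original report, just the usual suspects:

iscsiadm list target -v   # is the initiator session actually connected/logged in?
fmadm faulty              # anything FMA has already diagnosed?
fmdump -eV | tail -40     # raw error reports behind the scsi noise in /var/adm/messages
zpool status -v           # per-device read/write/cksum counters, if ZFS noticed at all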
2012-10-01 17:07, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> Well, now I know why it's stupid. Cuz it doesn't work right - it turns out,
> iscsi devices (and I presume SAS devices) are not removable storage. [...]
>
> Lesson learned: If you create an iscsi target, make *damn* sure it's an
> always-on system. And don't use just one. And don't do maintenance on them
> both anywhere near the same week.

And would some sort of clusterware help in this case? I.e. when the target goes down, it informs the initiator to "offline" the disk component gracefully (if that is possible). When the target comes back up, the automation would online the pool components, or replace them in-place, and *properly* resilver and clear the pool.

Wonder if that's possible and if that would help your case?

//Jim
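Done by hand, that graceful sequence would look something like the following sketch. The pool and device names are hypothetical:

# before taking the iscsi target host down for planned maintenance:
zpool offline vmpool c5t600144F0AAAABBBBd0

# once the target is back and the LU is visible to the initiator again:
zpool online vmpool c5t600144F0AAAABBBBd0   # resilvers only the writes it missed
zpool status -x                             # wait for the resilver to complete
zpool clear vmpool                          # then reset the error counters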
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-03 18:03 UTC
[zfs-discuss] vm server storage mirror
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> it doesn't work right - it turns out, iscsi devices (and I presume SAS
> devices) are not removable storage. That means, if the device goes offline
> and comes back online again, it doesn't just gracefully resilver and move
> on without any problems; it's in a perpetual state of IO error, device
> unreadable.

I am revisiting this issue today. I've tried everything I can think of to recreate this issue, and haven't been able to do it. I have certainly encountered some bad behaviors - which I'll expound upon momentarily - but they all seem to be addressable, fixable, logical problems, and none of them result in a supposedly good pool (as reported in zpool status) returning scsi IO errors or halting the system. The most likely explanation right now, for the bad behavior I saw before - perpetual IO error even after restoring the connection - is that I screwed something up in my iscsi config the first time.

Herein lie the new problems:

If I don't export the pool before rebooting, then either the iscsi target or initiator is shut down before the filesystems are unmounted. So the system spews all sorts of error messages while trying to go down, but it eventually succeeds. It's somewhat important to know whether it was the target or the initiator that went down first - if it was the target, then only the local disks became inaccessible, but if it was the initiator, then both the local and remote disks became inaccessible. I don't know yet.

Upon reboot, the pool fails to import, so the svc:/system/filesystem/local service fails and comes up in maintenance mode. The whole world is a mess; you have to log in at the physical text console to export the pool, and reboot. But it comes up cleanly the second time.

These sorts of problems seem like they should be solvable by introducing some service manifest dependencies... But there's no way to make it a generalization for the distribution as a whole (illumos/openindiana/oracle). It's just something that should be solvable on a case-by-case basis.

If you are going to be an initiator only, then it makes sense for svc:/network/iscsi/initiator to be required by svc:/system/filesystem/local.

If you are going to be a target only, then it makes sense for svc:/system/filesystem/local to be required by svc:/network/iscsi/target.

If you are going to be a target & initiator, then you could get yourself into a deadlock situation: make the filesystem depend on the initiator, make the initiator depend on the target, and make the target depend on the filesystem. Uh-oh. But we can break that cycle easily enough in a lot of situations. If, as in my case, the only targets are raw devices (not zvols), then it should be ok to make the filesystem depend on the initiator, which depends on the target, and the target doesn't depend on anything. If you're both a target and an initiator, but all of your targets are zvols that you export to other systems (you're not nesting a filesystem in a zvol of your own, are you?), then it's ok to let the target need the filesystem and the filesystem need the initiator, but the initiator doesn't need anything.

So in my case, I'm sharing raw disks, so I'm going to try to make the filesystem need the initiator, the initiator need the target, and the target need nothing. Haven't tried yet... Hopefully google will help accelerate me figuring out how to do that.
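For reference, the first of those dependencies could be declared interactively with svccfg along these lines. This is a sketch only - the property-group name is arbitrary, and as the follow-ups below note, adding a new service of your own may be preferable to editing a stock one:

sudo svccfg -s svc:/system/filesystem/local:default
svc:/system/filesystem/local:default> addpg iscsi-initiator dependency
svc:/system/filesystem/local:default> setprop iscsi-initiator/grouping = astring: "require_all"
svc:/system/filesystem/local:default> setprop iscsi-initiator/restart_on = astring: "none"
svc:/system/filesystem/local:default> setprop iscsi-initiator/type = astring: "service"
svc:/system/filesystem/local:default> setprop iscsi-initiator/entities = fmri: "svc:/network/iscsi/initiator:default"
svc:/system/filesystem/local:default> exit
sudo svcadm refresh svc:/system/filesystem/local:default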
2012-10-03 22:03, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> If you are going to be an initiator only, then it makes sense for svc:/network/iscsi/initiator to be required by svc:/system/filesystem/local.
> If you are going to be a target only, then it makes sense for svc:/system/filesystem/local to be required by svc:/network/iscsi/target.

Well, on my system that I complained a lot about last year, I've had a physical pool, a zvol in it, shared and imported over iscsi on loopback (or sometimes initiated from another box), and another pool inside that zvol ultimately. Since the overall construct including hardware lent itself to many problems and panics, as you may remember, I ultimately did not import the data pool nor the pool in the zvol via the common services and /etc/zfs/zpool.cache, but made new services for that.

If you want, I'll try to find the pieces and send them (off-list?), but the general idea was that I made two services - one for import (without a cachefile) of the physical pool and one for the virtual dcpool. *Maybe* I also made instances of the iscsi initiator and/or target services, and overall meshed it with proper dependencies to start in order: OS milestone svcs - pool - target - initiator - dcpool, and stop in the proper reverse order.

Ultimately, since the pool imports could occasionally crash that box, there were files to touch or remove in order to delay or cancel imports of pool or dcpool easily. Overall, this let the system boot into interactive mode, enable all of its standard services and mount the rpool filesystems way before attempting risky pool imports and iscsi. Of course, on a particular system you might reconfigure SMF services for zones or VMs to depend on accessibility of their particular storage pools to start up - and reversely for shutdowns.

> These sorts of problems seem like they should be solvable by introducing some service manifest dependencies... But there's no way to make it a generalization for the distribution as a whole (illumos/openindiana/oracle). It's just something that should be solvable on a case-by-case basis.

I think I got pretty close to a generalization, so after some code cleanup (things were hard-coded for this box) and even real-world testing on your setup, we might try to push this into OI or whoever picks it up. Now, I'll try to find these manifests and methods ;)

HTH,
//Jim Klimov
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 12:06 UTC
[zfs-discuss] vm server storage mirror
> From: Jim Klimov [mailto:jimklimov at cos.ru]
>
> Well, on my system that I complained a lot about last year,
> I've had a physical pool, a zvol in it, shared and imported
> over iscsi on loopback (or sometimes initiated from another
> box), and another pool inside that zvol ultimately.

Ick. And it worked?

> These sorts of problems seem like they should be solvable by introducing
> some service manifest dependencies... But there's no way to make it a
> generalization for the distribution as a whole (illumos/openindiana/oracle).
> It's just something that should be solvable on a case-by-case basis.

I started looking at that yesterday, and was surprised by how complex the dependency graph is. Also, I can't get graphviz to install or build, so I don't actually have a graph.

In any event, rather than changing the existing service dependencies, I decided to just make a new service, which would zpool import and zpool export the pools that are on iscsi, after the iscsi initiator comes up and before it goes down.

At present, the new service correctly mounts & dismounts the iscsi pool while I'm sitting there, but for some reason it fails during reboot. I ran out of time digging into it... I'll look some more tomorrow.
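When a hand-rolled service like that fails only at boot, the usual first stops are the SMF diagnostics and the method's own log. The service name below is a hypothetical placeholder, not the one actually created in this thread:

svcs -xv                                         # which services failed at boot, and why
svcs -d svc:/site/iscsi-pool:default             # were its dependencies actually online?
tail /var/svc/log/site-iscsi-pool:default.log    # the method script's own stdout/stderr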
2012-10-04 16:06, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: Jim Klimov [mailto:jimklimov at cos.ru]
>>
>> Well, on my system that I complained a lot about last year,
>> I've had a physical pool, a zvol in it, shared and imported
>> over iscsi on loopback (or sometimes initiated from another
>> box), and another pool inside that zvol ultimately.
>
> Ick. And it worked?

Um, well. Kind of yes, but it ran into many rough corners - many of which I posted and asked about. The fatal one was my choice of smaller blocks in the zvol, so I learned that metadata (on 4k-sectored disks) could consume about as much as the userdata in that zvol/pool, so I ultimately migrated the data off that pool into the system's physical pool - not very easy given that there was little free space left (unexpectedly for me, until I understood the inner workings). Still, technically, there is little problem building such a setup - it just needs some more thorough understanding and planning. I did learn a lot, so it wasn't in vain, too.

>>> These sorts of problems seem like they should be solvable by introducing
>> some service manifest dependencies... But there's no way to make it a
>> generalization for the distribution as a whole (illumos/openindiana/oracle).
>> It's just something that should be solvable on a case-by-case basis.
>
> I started looking at that yesterday, and was surprised by how complex the
> dependency graph is. Also, I can't get graphviz to install or build, so I
> don't actually have a graph.

There are also loops ;)

# svcs -d filesystem/usr
STATE          STIME    FMRI
online         Aug_27   svc:/system/scheduler:default
...

# svcs -d scheduler
STATE          STIME    FMRI
online         Aug_27   svc:/system/filesystem/minimal:default
...

# svcs -d filesystem/minimal
STATE          STIME    FMRI
online         Aug_27   svc:/system/filesystem/usr:default
...

> In any event, rather than changing the existing service dependencies, I
> decided to just make a new service, which would zpool import and zpool
> export the pools that are on iscsi, after the iscsi initiator comes up
> and before it goes down.
>
> At present, the new service correctly mounts & dismounts the iscsi pool
> while I'm sitting there, but for some reason it fails during reboot. I ran
> out of time digging into it... I'll look some more tomorrow.

That's about what I did and described. I too avoid hacking into distro-provided services, so that upgrades don't break my customizations and vice versa. My code is not yet accessible to me, but I think my instance of the target/initiator services did a temp-enable/disable of the stock services as its start/stop methods, and the system iscsi services were kept disabled by default. This way I could start them at a needed moment in time without changing their service definitions.

Also note that if you do prefer to rely on the stock services, you can define reverse dependencies in your own new services (i.e. "I declare that iscsi/target depends on me. Yours, pool-import").

HTH,
//Jim
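In manifest terms, such a reverse dependency is a <dependent> element in the new service's own manifest. A sketch with hypothetical names - the stock target service is cited only as an example:

<!-- in the new pool-import service's manifest: declares that the stock
     iscsi target depends on this service, without editing its manifest -->
<dependent name='pool-import_iscsi-target' grouping='require_all' restart_on='none'>
        <service_fmri value='svc:/network/iscsi/target:default'/>
</dependent>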
This whole thread has been fascinating. I really wish we (OI) had the two following things that FreeBSD supports:

1. HAST - provides a block-level driver that mirrors a local disk to a network "disk", presenting the result as a block device using the GEOM API.

2. CARP.

I have a prototype with two FreeBSD VMs where I can fail over back and forth, and it works beautifully. Block-level replication using all open source software. There were some glitches involving boot and shutdown (dependencies that are not set up properly), but I think if there was enough love in the FreeBSD community that could be fixed.

I could be wrong, but it doesn't *seem* as if either HAST (or an equivalent) or CARP exists in the OI (or other *solaris derivatives) space. Shame if so...
Forgot to mention: my interest in doing this was so I could have my ESXi host point at a CARP-backed IP address for the datastore, and I would have no single point of failure at the storage level.
On Oct 4, 2012, at 8:35 AM, Dan Swartzendruber <dswartz at druber.com> wrote:
>
> This whole thread has been fascinating. I really wish we (OI) had the two
> following things that FreeBSD supports:
>
> 1. HAST - provides a block-level driver that mirrors a local disk to a
> network "disk", presenting the result as a block device using the GEOM API.

This is called AVS in the Solaris world.

In general, these systems suffer from a fatal design flaw: the authoritative view of the data is not also responsible for the replication. In other words, you can provide coherency but not consistency. Both are required to provide a single view of the data.

> 2. CARP.

This exists as part of the OHAC project.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
On 10/4/2012 11:48 AM, Richard Elling wrote:
> On Oct 4, 2012, at 8:35 AM, Dan Swartzendruber <dswartz at druber.com> wrote:
>>
>> 1. HAST - provides a block-level driver that mirrors a local disk to a
>> network "disk", presenting the result as a block device using the GEOM API.
>
> This is called AVS in the Solaris world.
>
> In general, these systems suffer from a fatal design flaw: the
> authoritative view of the data is not also responsible for the replication.
> In other words, you can provide coherency but not consistency. Both are
> required to provide a single view of the data.

Can you expand on this?

>> 2. CARP.
>
> This exists as part of the OHAC project.
> -- richard

These are both freely available?
On Oct 4, 2012, at 9:07 AM, Dan Swartzendruber <dswartz at druber.com> wrote:
> On 10/4/2012 11:48 AM, Richard Elling wrote:
>> In general, these systems suffer from a fatal design flaw: the
>> authoritative view of the data is not also responsible for the replication.
>> In other words, you can provide coherency but not consistency. Both are
>> required to provide a single view of the data.
>
> Can you expand on this?

I could, but I've already written a book on clustering. For a more general approach to understanding clustering, I can highly recommend Pfister's In Search of Clusters.
http://www.amazon.com/In-Search-Clusters-2nd-Edition/dp/0138997098

NB, clustered storage is the same problem as clustered compute wrt state.

>>> 2. CARP.
>>
>> This exists as part of the OHAC project.
>
> These are both freely available?

Yes.
 -- richard

--
Richard.Elling at RichardElling.com
+1-760-896-4422
2012-10-04 19:48, Richard Elling wrote:
>> 2. CARP.
>
> This exists as part of the OHAC project.
> -- richard

Wikipedia says CARP is the open-source equivalent of VRRP. And we have that in OI, don't we? Would it suffice?

# pkg info -r vrrp
          Name: system/network/routing/vrrp
       Summary: Solaris VRRP protocol
   Description: Solaris VRRP protocol service
      Category: System/Administration and Configuration
         State: Not installed
     Publisher: openindiana.org
       Version: 0.5.11
 Build Release: 5.11
        Branch: 0.151.1.5
Packaging Date: Sat Jun 30 20:01:06 2012
          Size: 275.57 kB
          FMRI: pkg://openindiana.org/system/network/routing/vrrp at 0.5.11,5.11-0.151.1.5:20120630T200106Z

HTH,
//Jim Klimov
On 10/4/2012 12:19 PM, Richard Elling wrote:
> On Oct 4, 2012, at 9:07 AM, Dan Swartzendruber <dswartz at druber.com> wrote:
>> On 10/4/2012 11:48 AM, Richard Elling wrote:
>>> In general, these systems suffer from a fatal design flaw: the
>>> authoritative view of the data is not also responsible for the
>>> replication. In other words, you can provide coherency but not
>>> consistency. Both are required to provide a single view of the data.

Sorry to be dense here, but I'm not getting how this is a cluster setup, or what your point wrt authoritative vs replication meant. In the scenario I was looking at, one host is providing access to clients - on the backup host, no services are provided at all. The master node does mirrored writes to the local disk and the network disk. The mirrored write does not return until the backup host confirms the data is safely written to disk. If a failover event occurs, there should not be any writes the client has been told completed that were not completed to both sides. The master node stops responding to the virtual IP, and the backup starts responding to it. Any pending NFS writes will presumably be retried by the client, and the new master node has completely up-to-date data on disk to respond with.

Maybe I am focusing too narrowly here, but in the case I am looking at, there is only a single node which is active at any time, and it is responsible for both replication and access by clients, so I don't see the failure modes you allude to. Maybe I need to shell out for that book :)
2012-10-04 21:19, Dan Swartzendruber writes:
> Maybe I am focusing too narrowly here, but in the case I am looking at,
> there is only a single node which is active at any time, and it is
> responsible for both replication and access by clients, so I don't see
> the failure modes you allude to. Maybe I need to shell out for that book :)

What if the backup host is down (i.e. the ex-master after the failover)? Will your failed-over pool accept no writes until both storage machines are working?

What if the internetworking between these two heads has a glitch, and as a result both of them become masters of their private copies (mirror halves), and perhaps both even manage to accept writes from clients?

This is the clustering part, which involves "fencing" around the node which is considered dead, perhaps including a hardware reset request just to make sure it's dead, before taking over the resources it used to master (STONITH - Shoot The Other Node In The Head). In particular, clusters suggest that for heartbeats, to make sure both machines really are working, you use at least two separate wires (i.e. serial and LAN) without active hardware (switches) in between, separate from the data networking.

HTH,
//Jim
On 10/4/2012 1:56 PM, Jim Klimov wrote:
> This is the clustering part, which involves "fencing" around the node
> which is considered dead, perhaps including a hardware reset request
> just to make sure it's dead, before taking over the resources it used
> to master (STONITH - Shoot The Other Node In The Head). [...]

This all makes a lot of sense. I didn't mean to imply there are no failure modes that can take you down entirely. I was aware of the split-brain issue; I was just not sure what Richard was getting at...
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 21:44 UTC
[zfs-discuss] vm server storage mirror
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> There are also loops ;)
>
> # svcs -d filesystem/usr
> STATE          STIME    FMRI
> online         Aug_27   svc:/system/scheduler:default
> ...
>
> # svcs -d scheduler
> STATE          STIME    FMRI
> online         Aug_27   svc:/system/filesystem/minimal:default
> ...
>
> # svcs -d filesystem/minimal
> STATE          STIME    FMRI
> online         Aug_27   svc:/system/filesystem/usr:default
> ...

How is that possible? Why would the system be willing to start up in a situation like that? It *must* be launching one of those, even without its dependencies met...

The answer to this question will, in all likelihood, shed some light on my situation - trying to understand why my iscsi-mounted zpool import/export service is failing to go down or come up in the order I expected, when it's dependent on the iscsi initiator.
2012-10-05 1:44, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> There are also loops ;)
>> [...]
>
> How is that possible? Why would the system be willing to start up in a
> situation like that? It *must* be launching one of those, even without
> its dependencies met...

Well, it seems just like a peculiar effect of required vs. optional dependencies. The loop is in the default installation. Details:

# svcprop filesystem/usr | grep scheduler
svc:/system/filesystem/usr:default/:properties/scheduler_usr/entities fmri svc:/system/scheduler
svc:/system/filesystem/usr:default/:properties/scheduler_usr/external boolean true
svc:/system/filesystem/usr:default/:properties/scheduler_usr/grouping astring optional_all
svc:/system/filesystem/usr:default/:properties/scheduler_usr/restart_on astring none
svc:/system/filesystem/usr:default/:properties/scheduler_usr/type astring service

# svcprop scheduler | grep minimal
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/entities fmri svc:/system/filesystem/minimal
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/grouping astring require_all
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/restart_on astring none
svc:/application/cups/scheduler:default/:properties/filesystem_minimal/type astring service

# svcprop filesystem/minimal | grep usr
usr/entities fmri svc:/system/filesystem/usr
usr/grouping astring require_all
usr/restart_on astring none
usr/type astring service

> The answer to this question will, in all likelihood, shed some light on my
> situation - trying to understand why my iscsi-mounted zpool import/export
> service is failing to go down or come up in the order I expected, when it's
> dependent on the iscsi initiator.

Likewise - see what dependency type you introduced, and verify that you've "svcadm refreshed" the service after config changes.

HTH,
//Jim
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-05 13:47 UTC
[zfs-discuss] vm server storage mirror
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> Well, it seems just like a peculiar effect of required vs. optional
> dependencies. The loop is in the default installation. Details:
>
> # svcprop filesystem/usr | grep scheduler
> [...]
> # svcprop scheduler | grep minimal
> [...]
> # svcprop filesystem/minimal | grep usr
> [...]

I must be missing something - I don't see anything above that indicates any required vs optional dependencies. I'm not quite sure what I'm supposed to be seeing in the svcprop outputs you pasted...

> > The answer to this question will, in all likelihood, shed some light on my
> > situation - trying to understand why my iscsi-mounted zpool import/export
> > service is failing to go down or come up in the order I expected, when
> > it's dependent on the iscsi initiator.
>
> Likewise - see what dependency type you introduced, and verify
> that you've "svcadm refreshed" the service after config changes.

Thank you for the suggestion - I like the direction this is heading, but I don't know how to do that yet. (This email is the first I ever heard of it.) Rest assured, I'll be googling and reading more man pages in the meantime.
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-05 18:53 UTC
[zfs-discuss] vm server storage mirror
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> I must be missing something - I don't see anything above that indicates any
> required vs optional dependencies.

Ok, I see that now. (Thanks to the SMF FAQ.) A dependency may have grouping optional_all, require_any, or require_all. Mine is require_all, and I figured out the problem. I had my automatic zpool import/export script dependent on the initiator... But it wasn't the initiator going down first. It was the target going down first. So the solution is like this:

sudo svccfg -s svc:/network/iscsi/initiator:default
svc:/network/iscsi/initiator:default> addpg iscsi-target dependency
svc:/network/iscsi/initiator:default> setprop iscsi-target/grouping = astring: "require_all"
svc:/network/iscsi/initiator:default> setprop iscsi-target/restart_on = astring: "none"
svc:/network/iscsi/initiator:default> setprop iscsi-target/type = astring: "service"
svc:/network/iscsi/initiator:default> setprop iscsi-target/entities = fmri: "svc:/network/iscsi/target:default"
svc:/network/iscsi/initiator:default> exit
sudo svcadm refresh svc:/network/iscsi/initiator:default

And additionally, create the SMF service dependent on the initiator, which will import/export the iscsi pools automatically:
http://nedharvey.com/blog/?p=105
2012-10-05 22:53, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
> http://nedharvey.com/blog/?p=105

Nice writeup, thanks. Perhaps you could also post/link it on the OI wiki so the community can find it more easily? A few comments:

1) For readability I'd use "...| awk '{print $1}'" instead of sed:

-  for GUID in `sudo sbdadm list-lu | grep rdsk | sed 's/ .*//'`
+  for GUID in `sudo sbdadm list-lu | grep rdsk | awk '{print $1}'`

On one hand, different implementations of sed might parse regexps differently; on the other, the column order might change, and changing a number in awk would be more straightforward.

2) Here you can just redirect stdin from /dev/null:

-  sudo format -e    # Make a note of the new device names. And hit Ctrl-C.
+  sudo format -e < /dev/null

3) In iscsi-pool-ctrl.sh it is more readable to replace the 'if "$1"...elif..else' clause with 'case "$1" in ... esac'. That is also easier to expand if needed; for example, to alias 'import|start)' and 'export|stop)' for more standard method naming.

3.1) Also you should probably do "zpool import -o cachefile ..." or plain "zpool import -R / ..." to set a particular cachefile or use none, to avoid auto-import upon boot via the standard file /etc/zfs/zpool.cache (which can break your filesystem/local service). Also note that use of the altroot (-R) option disables the cachefile by default, so you can use it as a shortcut.

3.2) The exit codes should be aligned with the SMF status codes, so you should include /lib/svc/share/smf_include.sh and return one of these:

SMF_EXIT_OK=0
SMF_EXIT_ERR_FATAL=95
SMF_EXIT_ERR_CONFIG=96
SMF_EXIT_MON_DEGRADE=97
SMF_EXIT_MON_OFFLINE=98
SMF_EXIT_ERR_NOSMF=99
SMF_EXIT_ERR_PERM=100

(You can validate the inclusion of that file, so if it fails, you can define these values yourself in the script, i.e. to use it as an initscript on a system without SMF.)

3.3) To catch "device busy" errors you can retry failed zpool export runs with "zpool export -f", which tries a bit harder.

Otherwise, quite LGTM :)

HTH,
//Jim Klimov
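Folding suggestions 3) through 3.3) together, the method script might end up looking roughly like the following. This is a sketch only - "vmpool" is a hypothetical pool name, and the actual iscsi-pool-ctrl.sh from the blog post may differ:

#!/bin/sh
# Sketch: case-based import/export method, no cachefile, SMF exit codes,
# and a forced-export fallback for "device busy".
. /lib/svc/share/smf_include.sh

case "$1" in
import|start)
        zpool import -o cachefile=none vmpool || exit $SMF_EXIT_ERR_FATAL
        ;;
export|stop)
        zpool export vmpool || zpool export -f vmpool || exit $SMF_EXIT_ERR_FATAL
        ;;
*)
        echo "Usage: $0 {start|stop}" >&2
        exit $SMF_EXIT_ERR_CONFIG
        ;;
esac
exit $SMF_EXIT_OK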
Hello Ed and all,

Just for the sake of completeness, I dug out my implementation of SMF services for iscsi-imported pools. As I said, it is kinda ugly due to hardcoded things which should rather be in SMF properties or at least in config files, but this was a single-solution POC. Here is the client side:

1) Method script to import the "dcpool" over iscsi. It has an optional config file that I can touch, fill or remove in order to delay the import of the pool, and another file to disable the import (perhaps by touching it while the sleep is in effect) but not fail or reconfigure the SMF instance. If the latter file is present in advance, the service is temp-disabled (reverse dependency, see XML below). Finally, export of the pool is retried with force until success (or SMF timeout):

$ cat /lib/svc/method/iscsi-mount-dcpool
------
#!/bin/sh

DELAY=600

case "$1" in
start)
        if [ -f /etc/zfs/delay.dcpool ]; then
                D="`head -1 /etc/zfs/delay.dcpool`"
                [ "$D" -gt 0 ] 2>/dev/null && DELAY="$D" || D=10
                echo "`date`: Delay requested... ${DELAY}sec"
                sleep ${DELAY}
                echo "`date`: Done sleeping"
        fi
        if [ -f /etc/zfs/noimport-dcpool ]; then
                echo "`date`: /etc/zfs/noimport-dcpool block-file reappeared. Aborting."
                exit 0
        fi
        [ -d /dcpool/export -o -f /etc/zfs/noimport-dcpool ] || \
        ( echo "`date`: beginning dcpool import..."
          time zpool import -o cachefile=none dcpool
          RET=$?
          echo "`date`: dcpool import complete ($RET)"
          exit $RET )
        ;;
stop)
        [ ! -d /dcpool/export ] || \
        time zpool export dcpool || \
        while ! time zpool export -f dcpool; do sleep 1; done
        ;;
esac
------

2) This script just wraps the call to the original method (and adds a small sleep) and allows me to create a separate service and define dependencies on it - and not touch the original services:

$ cat /lib/svc/method/iscsi-initiator-dcpool
-------
#!/bin/sh

case "$1" in
start)
        /lib/svc/method/iscsi-initiator "$@" && sleep 10
        ;;
stop)
        sleep 10 && /lib/svc/method/iscsi-initiator "$@"
        ;;
*)
        /lib/svc/method/iscsi-initiator "$@"
        ;;
esac
-------

3) The XML manifests for the services:

NOTE: Startup time is unlimited, because pool processing (deferred frees, etc.) could take days on my setup, and the server (target) could be inaccessible for some time too.
$ cat /root/smf/iscsi_mount-dcpool.xml
-------
<?xml version='1.0'?>
<!DOCTYPE service_bundle SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle type='manifest' name='export'>
  <service name='network/iscsi/mount-dcpool' type='service' version='0'>
    <dependency name='loopback' grouping='require_all' restart_on='none' type='service'>
      <service_fmri value='svc:/network/loopback'/>
    </dependency>
    <dependency name='initiator-dcpool' grouping='require_all' restart_on='restart' type='service'>
      <service_fmri value='svc:/network/iscsi/initiator-dcpool:default'/>
    </dependency>
    <dependency name='noimport-file' grouping='exclude_all' restart_on='refresh' type='path'>
      <service_fmri value='file://localhost/etc/zfs/noimport-dcpool'/>
    </dependency>
    <instance name='default' enabled='false'>
      <exec_method name='start' type='method' exec='/lib/svc/method/iscsi-mount-dcpool %m' timeout_seconds='0'/>
      <exec_method name='stop' type='method' exec='/lib/svc/method/iscsi-mount-dcpool %m' timeout_seconds='600'/>
      <property_group name='startd' type='framework'>
        <propval name='duration' type='astring' value='transient'/>
      </property_group>
      <template>
        <common_name>
          <loctext xml:lang='C'>import 'dcpool' over iscsi</loctext>
        </common_name>
      </template>
    </instance>
    <stability value='Unstable'/>
  </service>
</service_bundle>
-------

$ cat /root/smf/iscsi_initiator-dcpool.xml
-------
<?xml version='1.0'?>
<!DOCTYPE service_bundle SYSTEM '/usr/share/lib/xml/dtd/service_bundle.dtd.1'>
<service_bundle type='manifest' name='export'>
  <service name='network/iscsi/initiator-dcpool' type='service' version='0'>
    <create_default_instance enabled='false'/>
    <single_instance/>
    <dependency name='loopback' grouping='require_any' restart_on='error' type='service'>
      <service_fmri value='svc:/network/loopback'/>
    </dependency>
    <dependency name='mus' grouping='require_any' restart_on='error' type='service'>
      <service_fmri value='svc:/milestone/multi-user-server:default'/>
    </dependency>
    <exec_method name='start' type='method' exec='/lib/svc/method/iscsi-initiator-dcpool %m' timeout_seconds='600'>
      <method_context>
        <method_credential user='root' group='root' privileges='basic,sys_devices,sys_mount'/>
      </method_context>
    </exec_method>
    <exec_method name='stop' type='method' exec='/lib/svc/method/iscsi-initiator-dcpool %m' timeout_seconds='600'>
      <method_context>
        <method_credential user='root' group='root' privileges='basic,sys_devices,sys_mount'/>
      </method_context>
    </exec_method>
    <property_group name='dependents' type='framework'>
      <property name='iscsi-initiator_multi-user' type='fmri'/>
      <property name='iscsi-mount-dcpool' type='fmri'/>
    </property_group>
    <stability value='Evolving'/>
    <template>
      <common_name>
        <loctext xml:lang='C'>iSCSI initiator daemon for dcpool</loctext>
      </common_name>
      <documentation>
        <manpage title='iscsi' section='7D' manpath='/usr/share/man'/>
      </documentation>
    </template>
  </service>
</service_bundle>
-------

Maybe some of the former script's ideas can wind up in your solution; I'm not sure if the initiator-wrapper is that useful =)

Good luck,
//Jim Klimov
2012-10-06 14:49, Jim Klimov wrote:
> $ cat /lib/svc/method/iscsi-mount-dcpool
> ------
> #!/bin/sh
>
> DELAY=600
>
> case "$1" in
> start)
>         if [ -f /etc/zfs/delay.dcpool ]; then
>                 D="`head -1 /etc/zfs/delay.dcpool`"
>                 [ "$D" -gt 0 ] 2>/dev/null && DELAY="$D" || D=10
>                 echo "`date`: Delay requested... ${DELAY}sec"

Oops, a typo (thoughtlessly entered by hand into the email, although it happens to be harmless in effect due to another typo in the process):

-                [ "$D" -gt 0 ] 2>/dev/null && DELAY="$D" || D=10
+                [ "$D" -gt 0 ] 2>/dev/null && DELAY="$D"

The default delay is defined above, and is large enough (600) so that I can enter a panicking system after reboot and disable the pool-importing service or otherwise influence or monitor the situation.

//Jim
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-19 23:55 UTC
[zfs-discuss] vm server storage mirror
Yikes, I'm back at it again, and sooooo frustrated.

For about 2-3 weeks now, I had the iscsi mirror configuration in production, as previously described. Two disks on system 1 mirror against two disks on system 2, everything done via iscsi, so you could zpool export on machine 1, and then zpool import on machine 2 for a manual failover. Created the dependency - initiator depends on target - and created a new smf service to mount the iscsi zpool after the initiator is up (and consequently export the zpool before the initiator shuts down). Able to reboot, everything working perfectly. Until today.

Today I rebooted one system for some maintenance, and it stayed down longer than expected, so those disks started throwing errors on the second machine. The first system eventually came up again, the second system resilvered, and everything looked good. I zpool clear'd the pool on the second machine just to make the counters look pretty again.

But it wasn't pretty at all. This is so bizarre - throughout the day, the VMs on system 2 kept choking. I had to powercycle system 2 about half a dozen times due to unresponsiveness. Exactly the type of behavior you expect for an IO error - but nothing whatsoever appears in the system log, and the zpool status still looks clean.

Several times, I destroyed the pool and recreated it completely from backup. zfs send and zfs receive both work fine. But strangely - when I launch a VM, the IO grinds to a halt, and I'm forced to powercycle (usually) the host. You might try to conclude it's something wrong with virtualbox - but it's not. I literally copied & pasted the zfs send | zfs receive commands that restored the pool from backup, but this time restored it onto local storage. The only difference is local disk versus iscsi pool. And then it finally worked without any glitches.

During the day, trying to get the iscsi pool up again - this is so bizarre - I did everything I could think of to get back to a pristine state. I removed iscsi targets, I removed luns (lu's), I removed the static discovery and re-added it, got new device names, I wiped the disks (zpool destroy & zpool create), re-created lu's, re-created static discovery, re-created targets, re-created zpools... The behavior was the same no matter what I did. I can create the pool, import it, and zfs receive onto it no problem, but then when I launch the VM, the whole system grinds to a halt. VirtualBox will be in a "sleep" state, VirtualBox shows the green light on the hard drive indicating it's trying to read, meanwhile if I try to X it out, it won't die, and gnome gives me the "Force Quit" dialog; meanwhile I can sudo kill -KILL VirtualBox, and VirtualBox *still* won't die. Any "zpool" or "zfs" command I type in hangs indefinitely (even the time-slider daemon or zfs auto snapshot are hung). I can poke around the system in other areas - on other pools and stuff - but the only way out of it is a power cycle.

It's so weird that once the problem happens, I have not yet found any way to recover from it except to reformat and reinstall the OS for the whole system. I cannot, for the life of me, think of *any*thing that could be storing state like this, preventing me from getting back into a usable iscsi mirror pool.

One thing I haven't tried yet - it appears, I think, that when you make a disk, let's say c2t4d0, an iscsi target, let's say c6t7blahblahblahd0... it appears, I think, that c6t7blahblahblahd0 is actually c2t4d0s2.
I could create a pool using c2t4d0, and/or zero the whole disk, completely obliterating any semblance of partition tables inside there, or old redundant copies of old uberblocks or anything like that. But seriously, I'm grasping at straws here, just trying to find *any* place where some bad state is stored that I haven't thought of yet. I shouldn't need to reformat the host.
> Several times, I destroyed the pool and recreated it completely from
> backup. zfs send and zfs receive both work fine. But strangely - when I
> launch a VM, the IO grinds to a halt, and I'm forced to powercycle
> (usually) the host.

A shot in the dark here, but perhaps one of the disks involved is taking a long time to return from reads, but is returning eventually, so ZFS doesn't notice the problem? Watching 'iostat -x' for busy time while a VM is hung might tell you something.

Tim
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-20 12:39 UTC
[zfs-discuss] vm server storage mirror
> From: Timothy Coalson [mailto:tsc5yc at mst.edu]
> Sent: Friday, October 19, 2012 9:43 PM
>
> A shot in the dark here, but perhaps one of the disks involved is taking a
> long time to return from reads, but is returning eventually, so ZFS doesn't
> notice the problem? Watching 'iostat -x' for busy time while a VM is hung
> might tell you something.

Oh yeah - this is also bizarre. I watched "zpool iostat" for a while. It was showing me:

Operations (read and write) consistently 0
Bandwidth (read and write) consistently non-zero, but something small, like 1k-20k or so.

Maybe that is normal to someone who uses zpool iostat more often than I do. But to me, zero operations resulting in non-zero bandwidth defies logic.
On Sat, Oct 20, 2012 at 7:39 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> Oh yeah - this is also bizarre. I watched "zpool iostat" for a while. It
> was showing me:
>
> Operations (read and write) consistently 0
> Bandwidth (read and write) consistently non-zero, but something small,
> like 1k-20k or so.
>
> Maybe that is normal to someone who uses zpool iostat more often than I
> do. But to me, zero operations resulting in non-zero bandwidth defies
> logic.

It might be operations per second, and it is rounding down (I know this happens in DTrace normalization, not sure about zpool/zfs); try an interval of 1 (perhaps with -v) and see if you still get 0 operations. I haven't seen zero operations with nonzero bandwidth on my pools - I always see lots of operations in bursts - so it sounds like you might be on to something.

Also, iostat -x shows device busy time, which is usually higher on the slowest disk when there is an imbalance, while zpool iostat does not. So, if it happens to be a single device's fault, iostat -nx has a better chance of finding it (the n flag translates the disk names to the device names used by the system, so you can figure out which one is the problem).

Tim