Jim Klimov
2011-Oct-09 17:28 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Hello all,

ZFS developers have long stated that ZFS is not intended, at least not in the near term, for clustered environments (that is, having a pool safely imported by several nodes simultaneously). However, many people on forums have wished for ZFS features in clusters. I have some ideas for at least a limited implementation of clustering which may be useful in some areas. If it is not just my fantasy and it is realistic to build, this might be a good start for further work on ZFS clustering for other uses.

As one use-case example, consider VM farms with VM migration. With shared storage, the physical hosts need only migrate the VM RAM, without copying gigabytes of data between their individual storage. Such copying makes little sense when the hosts' storage is mounted off the same NAS/SAN box(es), because:
* it only wastes bandwidth moving bits around the same storage;
* IP networking speed (NFS/SMB copying) may be lower than that of a dedicated storage network between the hosts and the storage (SAS, FC, etc.);
* with a pre-configured disk layout carving one storage box into LUNs for several hosts, more slack space is wasted than with a single pool shared by several hosts, all drawing on the same "free" pool space;
* it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts, it is problematic to add a 6th server) - but that is not a problem when a single pool consumes the whole SAN and is available to all server nodes.

One feature of this use-case is that specific datasets within the potentially common pool on the NAS/SAN are still dedicated to certain physical hosts. This would be similar to serving iSCSI volumes or NFS datasets with individual VMs from a NAS box - just with a faster connection over SAS/FC. Hopefully this allows for some shortcuts in a clustered ZFS implementation, while such a solution would still be useful in practice.

So, one version of the solution would be to have a single host which imports the pool in read-write mode (i.e. the first one which boots), and other hosts would write through it (like iSCSI or whatever; maybe using SAS or FC to connect the "reader" and "writer" hosts). However, they would read directly from the ZFS pool using the full SAN bandwidth.

WRITES would be consistent because only one node writes data to the active ZFS block tree, using more or less the same code and algorithms as already exist.

In order for READS to be consistent, the "readers" need only rely on whatever latest TXG they know of, and on the cached results of their more recent writes (between the last TXG these nodes know of and the current state).

Here's where this use-case's bonus comes in: the node which currently uses a certain dataset and issues writes for it is the only one expected to write there - so even if its knowledge of the pool is some TXGs behind, it does not matter.

In order to stay up to date, and "know" the current TXG completely, the "reader nodes" should regularly read in the ZIL data (which is anyway available and accessible as part of the pool) and expire changed entries from their local caches. If for some reason a "reader node" has lost track of the pool for too long, so that the ZIL data is not sufficient to update from the "known in-RAM TXG" to the "current on-disk TXG", a full read-only import can be done again (keeping track of newer TXGs appearing while the RO import is being done).

Thanks to ZFS COW, nodes can expect that on-disk data (as pointed to by block addresses/numbers) does not change.
So in the worst case, nodes would read outdated data a few TXGs old - but not completely invalid data.

The second version of the solution is more or less the same, except that all nodes can write to the pool hardware directly, using some dedicated block ranges "owned" by one node at a time. This would work much like a ZIL containing both data and metadata. Perhaps these ranges would be whole metaslabs, or some other ranges "agreed" between the master node and the other nodes.

When a node's write is completed (or a TXG sync happens), the master node would update the ZFS block tree and uberblocks, and those per-node-ZIL blocks which are already on disk would become part of the ZFS tree. At this time new block ranges would be fanned out for writes by each non-master node. A probable optimization would be to give out several TXGs' worth of dedicated block ranges to each node, to reduce hiccups during any lags or even master-node re-elections.

The main difference from the first solution would be in performance - here all nodes would be writing to the pool hardware at full SAN/NAS networking speed, and less load would fall on the "writer node". Actually, instead of a "writer node" (responsible for translating LAN writes into SAN writes in the first solution), there would be a "master node" responsible just for consistent application of TXG updates, and for distribution of new dedicated block ranges to other nodes for new writes. Information about such block ranges would be kept on disk like some sort of "cluster ZIL" - so that writes won't be lost in case of hardware resets, software panics, node re-elections, etc. Applying these cluster ZILs and per-node ZIL ranges would become part of normal ZFS read-write imports (by a master node).

Probably there should be a part of the pool with information about the most current cluster members (who imported the pool and what role each node performs); it could be a "ring" of blocks like the ZFS uberblocks are now.

So... above I presented a couple of possible solutions to the problem of ZFS clustering. These are "off the top of my head" ideas, and as I am not a great specialist in storage clustering, they are probably silly ideas with many flaws "as is". At the very least, I already see a lot of places for possible optimisation, and the solutions (esp. #1) may be unreliable for uses other than VM hosting.

Besides the common clustering problems (quorum, STONITH, loss of connectivity from active nodes and assumption of wrong roles - i.e. attempted pool-mastering by several nodes), which may be solved by various popular methods, not excluding that part of the pool with information about cluster members, there is also the problem of ensuring that all nodes have and use the most current state of the pool - receiving new TXG info/ZIL updates, and ultimately updating uberblocks ASAP.

So besides an invitation to bash these ideas and explain why they are wrong and impossible - if they are - there is also a hope to stir up a constructive discussion finally leading to a working "clustered ZFS" solution, one more reliable than my ideas above ;) I think there is some demand for that in the market, as well as among enthusiasts...

Hope to see some interesting reading,
//Jim Klimov
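PS: To make the second idea a bit more concrete, here is a rough, single-process sketch of the block-range "lease" scheme. All class names, methods and data structures below are invented for illustration only - this is not real ZFS code nor any existing API, just a toy model of who talks to whom and when:

    #!/usr/bin/env python3
    # Toy, single-process model of the scheme above: non-master nodes write
    # directly into block ranges leased to them in advance, and one master
    # node folds those writes into the block tree at every TXG sync.
    # Everything here (class names, the dict standing in for the SAN) is
    # invented for illustration - not real ZFS code or an existing API.

    DISK = {}   # block number -> data; stands in for the shared SAS/FC storage


    class Master:
        """The single node allowed to update the block tree and uberblocks."""

        def __init__(self, blocks_per_lease=8):
            self.txg = 0
            self.block_tree = {}     # committed view: block -> TXG it joined in
            self.txg_log = []        # per-TXG list of blocks that entered the tree
            self.blocks_per_lease = blocks_per_lease
            self._next_block = 0     # naive allocator over an endless block space

        def grant_lease(self):
            """Hand out a range of blocks that one node alone may write into."""
            start = self._next_block
            self._next_block += self.blocks_per_lease
            return list(range(start, start + self.blocks_per_lease))

        def commit(self, per_node_writes):
            """TXG sync: adopt every node's already-on-disk blocks into the tree."""
            self.txg += 1
            touched = []
            for blocks in per_node_writes.values():
                for blk in blocks:
                    self.block_tree[blk] = self.txg
                    touched.append(blk)
            self.txg_log.append(touched)
            return self.txg


    class Node:
        """A non-master blade: writes into its lease, reads straight from disk."""

        def __init__(self, name, master):
            self.name = name
            self.master = master
            self.known_txg = master.txg
            self.lease = master.grant_lease()
            self.pending = []        # written to disk, not yet part of a TXG
            self.cache = {}          # locally cached reads

        def write(self, data):
            if not self.lease:       # the only step that has to wait on the master
                self.lease = self.master.grant_lease()
            blk = self.lease.pop(0)
            DISK[blk] = data         # direct write over the SAN path
            self.pending.append(blk)
            return blk

        def read(self, blk):
            if blk not in self.cache:
                self.cache[blk] = DISK.get(blk)
            return self.cache[blk]

        def catch_up(self):
            """Scan TXG records newer than we know; expire affected cache entries."""
            for touched in self.master.txg_log[self.known_txg:]:
                for blk in touched:
                    self.cache.pop(blk, None)
            self.known_txg = self.master.txg


    if __name__ == "__main__":
        master = Master()
        a, b = Node("blade-a", master), Node("blade-b", master)
        blk = a.write(b"vm-image-chunk")
        master.commit({a.name: a.pending})
        a.pending = []
        b.catch_up()
        print("blade-b at TXG", b.known_txg, "reads", b.read(blk))

In real life the leases, the per-node pending lists and the TXG log would of course have to live on disk (the "cluster ZIL" above) so that they survive resets and master re-election; the sketch only shows the division of labour.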
Richard Elling
2011-Oct-12 04:15 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:> Hello all, > > ZFS developers have for a long time stated that ZFS is not intended, > at least not in near term, for clustered environments (that is, having > a pool safely imported by several nodes simultaneously). However, > many people on forums have wished having ZFS features in clusters....and UFS before ZFS? I''d wager that every file system has this RFE in its wish list :-)> I have some ideas at least for a limited implementation of clustering > which may be useful aat least for some areas. If it is not my fantasy > and if it is realistic to make - this might be a good start for further > optimisation of ZFS clustering for other uses. > > For one use-case example, I would talk about VM farms with VM > migration. In case of shared storage, the physical hosts need only > migrate the VM RAM without copying gigabytes of data between their > individual storages. Such copying makes less sense when the > hosts'' storage is mounted off the same NAS/NAS box(es), because: > * it only wastes bandwidth moving bits around the same storage, andThis is why the best solutions use snapshots? no moving of data and you get the added benefit of shared ARC -- increasing the logical working set size does not increase the physical working set size.> * IP networking speed (NFS/SMB copying) may be less than that of > dedicated storage net between the hosts and storage (SAS, FC, etc.)Disk access is not bandwidth bound by the channel.> * with pre-configured disk layout from one storage box into LUNs for > several hosts, more slack space is wasted than with having a single > pool for several hosts, all using the same "free" pool space;...and you die by latency of metadata traffic.> * it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts, > it would be problematic to add a 6th server) - but it won''t be a problem > when the single pool consumes the whole SAM and is available to > all server nodes.Are you assuming disk access is faster than RAM access?> One feature of this use-case is that specific datasets within the > potentially common pool on the NAS/SAN are still dedicated to > certain physical hosts. This would be similar to serving iSCSI > volumes or NFS datasets with individual VMs from a NAS box - > just with a faster connection over SAS/FC. Hopefully this allows > for some shortcuts in clustering ZFS implementation, while > such solutions would still be useful in practice.I''m still missing the connection of the problem to the solution. The problem, as I see it today: disks are slow and not getting faster. SSDs are fast and getting faster and lower $/IOP. Almost all VM environments and most general purpose environments are overprovisioned for bandwidth and underprovisioned for latency. The Achille''s heel of solutions that cluster for bandwidth (eg lustre, QFS, pNFS, Gluster, GFS, etc) is that you have to trade-off latency. But latency is what we need, so perhaps not the best architectural solution?> So, one version of the solution would be to have a single host > which imports the pool in read-write mode (i.e. the first one > which boots), and other hosts would write thru it (like iSCSI > or whatever; maybe using SAS or FC to connect between > "reader" and "writer" hosts). However they would read directly > from the ZFS pool using the full SAN bandwidth. > > WRITES would be consistent because only one node writes > data to the active ZFS block tree using more or less the same > code and algorithms as already exist. 
> > > In order for READS to be consistent, the "readers" need only > rely on whatever latest TXG they know of, and on the cached > results of their more recent writes (between the last TXG > these nodes know of and current state). > > Here''s where this use-case''s bonus comes in: the node which > currently uses a certain dataset and issues writes for it, is the > only one expected to write there - so even if its knowledge of > the pool is some TXGs behind, it does not matter. > > In order to stay up to date, and "know" the current TXG completely, > the "reader nodes" should regularly read-in the ZIL data (anyway > available and accessible as part of the pool) and expire changed > entries from their local caches.:-)> If for some reason a "reader node" has lost track of the pool for > too long, so that ZIL data is not sufficient to update from "known > in-RAM TXG" to "current on-disk TXG", the full read-only import > can be done again (keeping track of newer TXGs appearing > while the RO import is being done). > > Thanks to ZFS COW, nodes can expect that on-disk data (as > pointed to by block addresses/numbers) does not change. > So in the worst case, nodes would read outdated data a few > TXGs old - but not completely invalid data. > > > Second version of the solution is more or less the same, except > that all nodes can write to the pool hardware directly using some > dedicated block ranges "owned" by one node at a time. This > would work like much a ZIL containing both data and metadata. > Perhaps these ranges would be whole metaslabs or some other > ranges as "agreed" between the master node and other nodes. > > When a node''s write is completed (or a TXG sync happens), the > master node would update the ZFS block tree and uberblocks, > and those per-node-ZIL blocks which are already on disk would > become part of the ZFS tree. At this time new block ranges would > be fanned out for writes by each non-master node. > > A probable optimization would be to give out several TXG''s worth > of dedicated block ranges to each node, to reduce hickups during > any lags or even master-node reelections.:-)> Main difference from the first solution would be in performance - > here all nodes would be writing to the pool hardware at full SAN/NAS > networking speed, and less load would come on the "writer node". > Actually, instead of a "writer node" (responsible for translation of > LAN writes to SAN writes in the first solution), there would be a > "master node" responsible just for consistent application of TXG > updates, and for distribution of new dedicated block ranges to > other nodes for new writes. Information about such block ranges > would be kept on-disk like some sort of a "cluster ZIL" - so that > writes won''t be lost in case of hardware resets, software panics, > node reelections, etc. Applying these cluster ZILs and per-node > ZIL ranges would become part of normal ZFS read-write imports > (by a master node). > > Probably there should be a part of the pool with information about > most-current cluster members (who imported the pool and what > role that node performs); it could be a "ring" of blocks like the ZFS > uberblocks are now. > > > So... above I presented a couple of possible solutions to the > problem of ZFS clustering. These are "off the top of my head" > ideas, and as I am not a great specialist in storage clustering, > they are probably silly ideas with many flaws "as is". At the very > least, I see a lot of possible optimisation locations already, > and the solutions (esp. 
#1) may be unreliable for uses other than VM hosting.

Everything in the ZIL is also in RAM. I can read from RAM with lower latency than reading from a shared slog. So how are you improving latency?

> Other than common clustering problems (quorum, stonith,
> loss of connectivity from active nodes and assumption of wrong
> roles - i.e. attempted pool-mastering by several nodes), which
> may be solved by different popular methods, not excluding that
> part of the pool with information about cluster members, there
> is also a problem of ensuring that all nodes have and use the
> most current state of the pool - receiving new TXG info/ZIL
> updates, and ultimately updating uberblocks ASAP.
>
> So beside an invitation to bash these ideas and explain why they
> are wrong and impossible - if they are - there is also a hope to
> stir up a constructive discussion finally leading up to a working
> "clustered ZFS" solution, and one more reliable than my ideas
> above ;) I think there is some demand for that in the market, as
> well as among enthusiasts?

Definitely not impossible, but please work on the business case. Remember, it is easier to build hardware than software, so your software solution must be sufficiently advanced to not be obsoleted by the next few hardware generations.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
Nico Williams
2011-Oct-12 05:03 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Tue, Oct 11, 2011 at 11:15 PM, Richard Elling <richard.elling at gmail.com> wrote:
> On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
>> ZFS developers have for a long time stated that ZFS is not intended,
>> at least not in near term, for clustered environments (that is, having
>> a pool safely imported by several nodes simultaneously). However,
>> many people on forums have wished having ZFS features in clusters.
>
> ...and UFS before ZFS? I'd wager that every file system has this RFE in its
> wish list :-)

Except the ones that already have it! :)

Nico
--
Nico Williams
2011-Oct-12 05:18 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Sun, Oct 9, 2011 at 12:28 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> So, one version of the solution would be to have a single host
> which imports the pool in read-write mode (i.e. the first one
> which boots), and other hosts would write thru it (like iSCSI
> or whatever; maybe using SAS or FC to connect between
> "reader" and "writer" hosts). However they would read directly
> from the ZFS pool using the full SAN bandwidth.

You need to do more than simply assign a node for writes. You need to send write and lock requests to one node. And then you need to figure out what to do about POSIX write visibility rules (i.e., when a write should be visible to other readers). I think you'd basically end up not meeting POSIX in this regard, just like NFS, though perhaps not with close-to-open semantics.

I don't think ZFS is the beast you're looking for. You want something more like Lustre, GPFS, and so on. I suppose someone might surprise us one day with properly clustered ZFS, but I think it'd be more likely that the filesystem would be ZFS-like, not ZFS proper.

> Second version of the solution is more or less the same, except
> that all nodes can write to the pool hardware directly using some
> dedicated block ranges "owned" by one node at a time. This
> would work much like a ZIL containing both data and metadata.
> Perhaps these ranges would be whole metaslabs or some other
> ranges as "agreed" between the master node and other nodes.

This is much hairier. You need consistency. If two processes on different nodes are writing to the same file, then you need to *internally* lock around all those writes so that the on-disk structure ends up being sane. There's a number of things you could do here, such as, for example, having a per-node log and one node coalescing them (possibly one node per file, but even then one node has to be the master of every txg). And still you need to be careful about POSIX semantics. That does not come for free in any design -- you will need something like the Lustre DLM (distributed lock manager). Or else you'll have to give up on POSIX.

There's a hefty price to be paid for POSIX semantics in a clustered environment. You'd do well to read up on Lustre's experience in detail. And not just Lustre -- that would be just to start. I caution you that this is not a simple project.

Nico
--
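PS: Just to give a feel for the shape of the coordination involved, here is a toy, single-process sketch of the "send lock requests to one node" pattern. All of the names are invented - this is not Lustre's DLM nor any real ZFS interface, only an illustration of the minimum a coordinator has to track:

    #!/usr/bin/env python3
    # Toy coordinator for the "one node grants all locks" pattern.  Invented
    # names and simplified semantics for illustration only - not Lustre's DLM
    # and not any real ZFS interface.

    import threading
    from collections import defaultdict

    class LockManager:
        """One node owns this; all others ask it before touching shared state."""

        def __init__(self):
            self._mutex = threading.Lock()
            self._owners = {}                  # (object, byte range) -> node name
            self._waiters = defaultdict(list)  # queued requests per key

        def acquire(self, node, obj, rng):
            with self._mutex:
                key = (obj, rng)
                if key not in self._owners:
                    self._owners[key] = node
                    return True                # granted immediately
                self._waiters[key].append(node)
                return False                   # caller must wait and retry

        def release(self, node, obj, rng):
            with self._mutex:
                key = (obj, rng)
                if self._owners.get(key) != node:
                    return None                # not the owner: nothing to do
                if self._waiters[key]:
                    nxt = self._waiters[key].pop(0)
                    self._owners[key] = nxt    # hand the lock to the next waiter
                    return nxt
                del self._owners[key]
                return None

    if __name__ == "__main__":
        dlm = LockManager()
        print(dlm.acquire("blade-a", "vm1.img", (0, 4096)))   # True - granted
        print(dlm.acquire("blade-b", "vm1.img", (0, 4096)))   # False - must wait
        print(dlm.release("blade-a", "vm1.img", (0, 4096)))   # blade-b now owns it

A real lock manager additionally has to deal with lock revocation, client death and fencing, which is where most of the complexity (and the latency) comes from.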
Jim Klimov
2011-Oct-14 02:13 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Hello all,

> Definitely not impossible, but please work on the business case.
> Remember, it is easier to build hardware than software, so your
> software solution must be sufficiently advanced to not be obsoleted
> by the next few hardware generations.
> -- richard

I guess Richard was correct about the use-case description - I should detail what I'm thinking about, to give some illustration. Coming from a software company, though, I tend to think of software as the more flexible part of the equation. This is something we have a chance to change. We use whatever hardware is given to us from above, for years...

When thinking about the problem and its applications to life, I have in mind blade server farms like the Intel MFSYS25, which include relatively large internal storage and can possibly take external SAS storage. We use such server farms as self-contained units (a single chassis plugged into the customer's network) for a number of projects, and recently more and more of these deployments become VMware ESX farms with shared VMFS. Due to my stronger love for things Solaris, I would love to see ZFS and any of the Solaris-based hypervisors (VBox, Xen or KVM ports) running there instead. But for things to be as efficient, ZFS would have to become shared - clustered...

I think I have to elaborate more on this hardware, as it tends to be our major use-case, and thus a limitation which influences my approach to clustered ZFS and my belief about which shortcuts are appropriate. These boxes have a shared chassis accommodating 6 server blades, each with 2 CPUs and 2 or 4 gigabit ethernet ports. The chassis also has single or dual ethernet switches to interlink the servers and to connect to the external world (10 external ports each), as well as single or dual storage controllers and 14 internal HDD bays. External SAS boxes can also be attached to the storage controller modules, but I haven't yet seen real setups like that.

In the normal "Intel use-case", the controller(s) implement several RAID LUNs which are accessible to the servers via SAS (with MPIO in case of dual controllers). Usually these LUNs are dedicated to servers - for example, boot/OS volumes. With an additional license from Intel, shared LUNs can be implemented on the chassis - these are primarily aimed at VMware farms with clustered VMFS, to use the available disk space (and the aggregated bandwidth of multiple spindles) more efficiently, as well as to aid VM migration.

To be clearer, I should say that modern VM hypervisors can migrate running virtual machines between two VM hosts. Usually (with dedicated storage for each server host) they do this by copying their HDD image files over the IP network from the "old host" to the "new host", transferring virtual RAM contents, replumbing virtual networks and resuming execution "from the same point" - after just a second-long hiccup to finalize the running VM's migration. With clustered VMFS on shared storage, VMware can migrate VMs faster - it knows not to copy the HDD image file in vain - it will be equally available to the "new host" at the correct point in the migration, just as it was accessible to the "old host".

This is what I kind of hoped to reimplement with VirtualBox or Xen or KVM running on OpenSolaris derivatives (such as OpenIndiana and others), with the proposed "ZFS clustering" using each HDD wholly as an individual LUN, aggregated into a ZFS pool by the servers themselves.
In many cases this would also be cheaper, with OpenIndiana and free hypervisors ;)

As was rightfully noted, with a common ZFS pool as the underlying storage (as happens in current Sun VDI solutions using a ZFS NAS), VM image clones can be instantiated quickly and efficiently - cheaper and faster than copying a golden image.

Now, at the risk of being accused of pushing some "marketing" through the discussion list, I have to state that these servers are relatively cheap (compared to 6 single-unit servers of comparable configuration, dual managed ethernet switches, and a SAN with 14 disks plus dual storage controllers). Price is an important factor in many of our deployments, where these boxes work stand-alone. This usually starts with a POC, when a pre-configured basic MFSYS with some VMs of our software arrives at a customer, gets tailored and works like a "black box". In a year or so an upgrade may come in the form of added disks, server blades and RAM. I have never heard even discussions of adding external storage - too pricey, and often useless with relatively fixed VM sizes - hence my desire to get a single ZFS pool available to all the blades equally. While dedicated storage boxes might be good and great, they would bump the solution price by orders of magnitude (StorEdge 7000 series) and are generally out of the question for our limited deployments.

Thanks to Nico for the concerns about POSIX locking. However, hopefully, in the use-case I described - serving images of VMs in a manner where storage, access and migration are efficient - whole datasets (be they volumes or FS datasets) can be dedicated to one VM host server at a time, just like whole pools are dedicated to one host nowadays. In this case POSIX compliance can be disregarded - access is locked by one host, not available to others, period. Of course, there is the problem of capturing storage from hosts which died, and avoiding corruption - but this is hopefully solved by the past decades of clustering technologies.

Nico also confirmed that "one node has to be a master of all TXGs" - which is conveyed in both ideas of my original post.

More directed replies follow below...

2011-10-12 8:15, Richard Elling wrote:
> On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
>> ... individual storages. Such copying makes less sense when the
>> hosts' storage is mounted off the same NAS/SAN box(es), because:
>> * it only wastes bandwidth moving bits around the same storage, and
> This is why the best solutions use snapshots - no moving of data and
> you get the added benefit of shared ARC -- increasing the logical working
> set size does not increase the physical working set size.

Snapshots would be good for cloning of VMs. They can also help with VM migration between separate hosts, IFF both machines have some common baseline snapshot so as to send increments, but the VM supervisor would have to be really intimate with the FS - like VMware is with VMFS...

I believe that with a "clustered ZFS" rearchitecturing, nobody forbids implementation of either ARCs or L2ARCs local to each host. I believe it is also not a very big challenge to allow non-local L2ARCs, i.e. so that shared ZFS pool caches can be local to individual hosts, and the common shared ZFS pool would have no knowledge of remote L2ARCs - so their absence would not cause the pool to be considered corrupt.

True, though, that in the case of cloned VM images, common blocks used by different datasets on different hosts would use up their ARCs separately and lose some of the benefit of shared ARC described above by Richard.
However, each blade has as much RAM as any other blade which might be dedicated as a storage host, so the combined cache memory in the VM farm overall would be increased. Moreover, such ARCs and L2ARCs local to cluster nodes would be "cluttered" only by data relevant to that particular host. By default there can be no L2ARCs in MFSYS, though - unless a box of SSDs were attached as external SAS storage and portions were dedicated to each blade as LUNs via the storage controller modules.

>> * IP networking speed (NFS/SMB copying) may be less than that of
>> dedicated storage net between the hosts and storage (SAS, FC, etc.)
> Disk access is not bandwidth bound by the channel.

In the case of larger NASes or SANs, where multiple-spindle performance exceeds, say, 125 MB/s, gigabit LAN performance (iSCSI/NFS/SMB) would indeed be a bottleneck compared to a faster SAS or FC link (i.e. 8 Gbit/s). Even in the case of the MFSYS chassis above, LUNs are accessible over a faster link than could be provided by the networking of a blade dedicated to NAS tasks and serving ZFS access to the other blades over the LAN. To say the least, disk access is common and equal for each blade - so having a ZFS server and serving storage over the LAN only adds another layer of latency, and possibly limits bandwidth.

>> * with pre-configured disk layout from one storage box into LUNs for
>> several hosts, more slack space is wasted than with having a single
>> pool for several hosts, all using the same "free" pool space;
> ...and you die by latency of metadata traffic.

That's possible. Hopefully it can be reduced by preallocating adequately large (sets of) fragments from the shared pool for each server's writes, so that actual blocking exchange of metadata would be rare. Since each server knows in advance which on-disk blocks it can safely write into, there should be little danger of conflict, and little added real-time latency.

>> * it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts,
>> it would be problematic to add a 6th server) - but it won't be a problem
>> when the single pool consumes the whole SAN and is available to
>> all server nodes.
> Are you assuming disk access is faster than RAM access?

I am not sure how this question is relevant to the paragraph above. Of course I don't assume THAT ;) I take it this is due to my typo "SAM" being read as "RAM" while it stood for "SAN". To reiterate the idea: a SAN, such as the 14 shared disks in the MFSYS chassis, aggregated by RAID and cut into individual per-server LUNs, is less scalable than a shared LUN on the same chassis. For example, if we have 3 blades during a POC and distribute the whole disk array into 3 individual LUNs, there is no more disk space to allocate when new blades arrive. If we don't preallocate disk space, it is wasted. Of course, in my example we know there can be no more than 6 servers, so we can preallocate 6 LUNs and give some servers 2 or more storage areas for a while. In non-blade setups there is no such luxury of certain prediction ;)

>> One feature of this use-case is that specific datasets within the
>> potentially common pool on the NAS/SAN are still dedicated to
>> certain physical hosts. This would be similar to serving iSCSI
>> volumes or NFS datasets with individual VMs from a NAS box -
>> just with a faster connection over SAS/FC. Hopefully this allows
>> for some shortcuts in clustering ZFS implementation, while
>> such solutions would still be useful in practice.
> I'm still missing the connection of the problem to the solution.
> The problem, as I see it today: disks are slow and not getting
> faster. SSDs are fast and getting faster and lower $/IOP. Almost
> all VM environments and most general purpose environments are
> overprovisioned for bandwidth and underprovisioned for latency.
> The Achilles' heel of solutions that cluster for bandwidth (eg lustre,
> QFS, pNFS, Gluster, GFS, etc) is that you have to trade off latency.
> But latency is what we need, so perhaps not the best architectural
> solution?

Again, back to my MFSYS example:
* Individual server blades have no local HDDs, nor SSDs for L2ARC. They only have CPUs, RAM, SAS-initiator and Pro/1000 chips.
* All blades access LUNs from chassis storage, no matter what. Technically one of the servers can be provisioned as a storage node, but it had better be redundant - taking 2 blades away from other jobs. And repackaging disk access from SAS to LAN is bound to be slower and/or have more latency than accessing these disks (LUNs) directly.

> Everything in the ZIL is also in RAM.

True for the local host which wrote the ZIL. False for the other hosts which use the same shared ZFS pool concurrently. However, these other hosts can read in older (flushed) ZILs to update their local caches and general knowledge of pool metadata.

> I can read from RAM with lower latency than reading from a shared
> slog. So how are you improving latency?

To be honest - I don't know. But I can make some excuses ;)

1) If datasets are dedicated to hosts (i.e. with locking), there is not much going on in other parts of the pool that would be "interesting" to a host. Hosts are interested in two things:
* what they can READ - safely, thanks to COW, and not changed by others thanks to the "dedication" of datasets;
* where they can WRITE so as not to disturb/overwrite other hosts' new writes - distributed in advance by the master host.
In this case, latency is only added when hosts run out of assigned block ranges for writes and are waiting for new assigned block ranges from the master host.

2) Improvement of latency was, truly, not considered. I am not ready to speculate how or why it might improve or worsen. I was after the best utilization of disk space and spindles (by using a single pool), as well as bandwidth (by using direct disk access instead of adding a storage server in the path).

>> Other than common clustering problems (quorum, stonith,
>> loss of connectivity from active nodes and assumption of wrong
>> roles - i.e. attempted pool-mastering by several nodes), which
>> may be solved by different popular methods, not excluding that
>> part of the pool with information about cluster members, there
>> is also a problem of ensuring that all nodes have and use the
>> most current state of the pool - receiving new TXG info/ZIL
>> updates, and ultimately updating uberblocks ASAP.
>>
>> So beside an invitation to bash these ideas and explain why they
>> are wrong and impossible - if they are - there is also a hope to
>> stir up a constructive discussion finally leading up to a working
>> "clustered ZFS" solution, and one more reliable than my ideas
>> above ;) I think there is some demand for that in the market, as
>> well as among enthusiasts?

--
Jim Klimov, CTO, JSC COS&HT
+7-903-7705859 (cellular)  mailto:jimklimov at cos.ru
CC: admin at cos.ru, jimklimov at mail.ru
Edward Ned Harvey
2011-Oct-14 11:53 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Jim Klimov > > I guess Richard was correct about the usecase description - > I should detail what I''m thinking about, to give some illustration.After reading all this, I''m still unclear on what you want to accomplish, that isn''t already done today. Yes I understand what it means when we say ZFS is not a clustering filesystem, and yes I understand what benefits there would be to gain if it were a clustering FS. But in all of what you''re saying below, I don''t see that you need a clustering FS.> of these deployments become VMWare ESX farms with shared > VMFS. Due to my stronger love for things Solaris, I would love > to see ZFS and any of Solaris-based hypervisors (VBox, Xen > or KVM ports) running there instead. But for things to be as > efficient, ZFS would have to become shared - clustered...I think the solution people currently use in this area is either NFS or iscsi. (Or infiniband, and other flavors.) You have a storage server presenting the storage to the various vmware (or whatever) hypervisors. Everything works. What''s missing? And why does this need to be a clustering FS?> To be clearer, I should say that modern VM hypervisors can > migrate running virtual machines between two VM hosts.This works on NFS/iscsi/IB as well. Doesn''t need a clustering FS.> With clustered VMFS on shared storage, VMWare can > migrate VMs faster - it knows not to copy the HDD image > file in vain - it will be equally available to the "new host" > at the correct point in migration, just as it was accessible > to the "old host".Again. NFS/iscsi/IB = ok.
Jim Klimov
2011-Oct-14 12:36 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
2011-10-14 15:53, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> I guess Richard was correct about the usecase description -
>> I should detail what I'm thinking about, to give some illustration.
>
> After reading all this, I'm still unclear on what you want to accomplish that isn't already done today. Yes I understand what it means when we say ZFS is not a clustering filesystem, and yes I understand what benefits there would be to gain if it were a clustering FS. But in all of what you're saying below, I don't see that you need a clustering FS.

In my example - probably not a completely clustered FS. A clustered ZFS pool with datasets individually owned by specific nodes at any given time would suffice for such VM farms. This would give users the benefits of ZFS (resilience, snapshots and clones, shared free space) merged with the speed of direct disk access instead of lagging through a storage server accessing these disks.

This is why I think such a solution may be simpler than a fully-fledged POSIX-compliant shared FS, but it would still have some benefits for specific - and popular - usage cases. And it might pave the way for a more complete solution - or perhaps illustrate what should not be done in those solutions ;)

After all, I think that if the problem of safe multiple-node RW access to ZFS gets fundamentally solved, the usages I described before might just become a couple of new dataset types with specific predefined usage and limitations - just as POSIX-compliant FS datasets and block-based volumes are now defined over ZFS. There is no reason not to call them "clustered FS" and "clustered volume" datasets, for example ;)

AFAIK, VMFS is not a generic filesystem, and cannot quite be used "directly" by software applications, but it has its target market for shared VM farming... I do not know how they solve the problems of consistency control - with master nodes or something else - and, for the sake of not encroaching on patents, I'm afraid I'd rather not know, so as not to copycat someone's solution and get burnt for that ;)

>> of these deployments become VMWare ESX farms with shared
>> VMFS. Due to my stronger love for things Solaris, I would love
>> to see ZFS and any of Solaris-based hypervisors (VBox, Xen
>> or KVM ports) running there instead. But for things to be as
>> efficient, ZFS would have to become shared - clustered...
>
> I think the solution people currently use in this area is either NFS or iscsi. (Or infiniband, and other flavors.) You have a storage server presenting the storage to the various vmware (or whatever) hypervisors.

In fact, no. Based on the MFSYS model, there is no storage server. There is a built-in storage controller which can do RAID over HDDs and present SCSI LUNs to the blades over direct SAS access. These LUNs can be accessed individually by certain servers, or concurrently. In the latter case it is possible that the servers take turns mounting a LUN as an HDD with some single-server FS, or use a clustered FS to share the LUN's disk space simultaneously.

If we were to use in this system an OpenSolaris-based OS and VirtualBox/Xen/KVM as they are now, and hope for live migration of VMs without copying of data, we would have to make a separate LUN for each VM on the controller, and mount/import this LUN on its current running host. I don't need to explain why that would be a clumsy and inflexible solution for a near-infinite number of reasons, do I?
;)

> Everything works. What's missing? And why does this need to be a clustering FS?
>
>> To be clearer, I should say that modern VM hypervisors can
>> migrate running virtual machines between two VM hosts.
>
> This works on NFS/iscsi/IB as well. Doesn't need a clustering FS.

Except that the storage controller doesn't do NFS/iscsi/IB, and doesn't do snapshots and clones. And if I were to dedicate one or two out of six blades to storage tasks, this might be considered an improper waste of resources. And it would repackage SAS access (anyway available to all blades at full bandwidth) into NFS/iscsi access over a Gbit link...

>> With clustered VMFS on shared storage, VMWare can
>> migrate VMs faster - it knows not to copy the HDD image
>> file in vain - it will be equally available to the "new host"
>> at the correct point in migration, just as it was accessible
>> to the "old host".
>
> Again. NFS/iscsi/IB = ok.

True, except that this is not an optimal solution in this described usecase - a farm of server blades with a relatively dumb, fast raw storage (but NOT an intellectual storage server).

//Jim
Tim Cook
2011-Oct-14 15:33 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Fri, Oct 14, 2011 at 7:36 AM, Jim Klimov <jimklimov at cos.ru> wrote:> 2011-10-14 15:53, Edward Ned Harvey ?????: > > From: zfs-discuss-bounces@**opensolaris.org<zfs-discuss-bounces at opensolaris.org>[mailto: >>> zfs-discuss- >>> bounces at opensolaris.org] On Behalf Of Jim Klimov >>> >>> I guess Richard was correct about the usecase description - >>> I should detail what I''m thinking about, to give some illustration. >>> >> After reading all this, I''m still unclear on what you want to accomplish, >> that isn''t already done today. Yes I understand what it means when we say >> ZFS is not a clustering filesystem, and yes I understand what benefits there >> would be to gain if it were a clustering FS. But in all of what you''re >> saying below, I don''t see that you need a clustering FS. >> > > In my example - probably not a completely clustered FS. > A clustered ZFS pool with datasets individually owned by > specific nodes at any given time would suffice for such > VM farms. This would give users the benefits of ZFS > (resilience, snapshots and clones, shared free space) > merged with the speed of direct disk access instead of > lagging through a storage server accessing these disks. > > This is why I think such a solution may be more simple > than a fully-fledged POSIX-compliant shared FS, but it > would still have some benefits for specific - and popular - > usage cases. And it might pave way for a more complete > solution - or perhaps illustrate what should not be done > for those solutions ;) > > After all, I think that if the problem of safe multiple-node > RW access to ZFS gets fundamentally solved, these > usages I described before might just become a couple > of new dataset types with specific predefined usage > and limitations - like POSIX-compliant FS datasets > and block-based volumes are now defined over ZFS. > There is no reason not to call them "clustered FS and > clustered volume datasets", for example ;) > > AFAIK, VMFS is not a generic filesystem, and cannot > quite be used "directly" by software applications, but it > has its target market for shared VM farming... > > I do not know how they solve the problems of consistency > control - with master nodes or something else, and for > the sake of patent un-encroaching, I''m afraid I''d rather > not know - as to not copycat someone''s solution and > get burnt for that ;) > > > >> of these deployments become VMWare ESX farms with shared >>> VMFS. Due to my stronger love for things Solaris, I would love >>> to see ZFS and any of Solaris-based hypervisors (VBox, Xen >>> or KVM ports) running there instead. But for things to be as >>> efficient, ZFS would have to become shared - clustered... >>> >> I think the solution people currently use in this area is either NFS or >> iscsi. (Or infiniband, and other flavors.) You have a storage server >> presenting the storage to the various vmware (or whatever) hypervisors. >> > > In fact, no. Based on the MFSYS model, there is no storage server. > There is a built-in storage controller which can do RAID over HDDs > and represent SCSI LUNs to the blades over direct SAS access. > These LUNs can be accessed individually by certain servers, or > concurrently. In the latter case it is possible that servers take turns > mounting the LUN as a HDD with some single-server FS, or use > a clustered FS to use the LUN''s disk space simultaneously. 
> If we were to use in this system an OpenSolaris-based OS and
> VirtualBox/Xen/KVM as they are now, and hope for live migration
> of VMs without copying of data, we would have to make a separate
> LUN for each VM on the controller, and mount/import this LUN to
> its current running host. I don't need to explain why that would be
> a clumsy and inflexible solution for a near-infinite number of
> reasons, do I? ;)
>
>> Everything works. What's missing? And why does this need to be a
>> clustering FS?
>>
>>> To be clearer, I should say that modern VM hypervisors can
>>> migrate running virtual machines between two VM hosts.
>>
>> This works on NFS/iscsi/IB as well. Doesn't need a clustering FS.
>
> Except that the storage controller doesn't do NFS/iscsi/IB,
> and doesn't do snapshots and clones. And if I were to
> dedicate one or two out of six blades to storage tasks,
> this might be considered an improper waste of resources.
> And would repackage SAS access (anyway available to
> all blades at full bandwidth) into NFS/iscsi access over a
> Gbit link...
>
>>> With clustered VMFS on shared storage, VMWare can
>>> migrate VMs faster - it knows not to copy the HDD image
>>> file in vain - it will be equally available to the "new host"
>>> at the correct point in migration, just as it was accessible
>>> to the "old host".
>>
>> Again. NFS/iscsi/IB = ok.
>
> True, except that this is not an optimal solution in this described
> usecase - a farm of server blades with a relatively dumb fast raw
> storage (but NOT an intellectual storage server).
>
> //Jim

The idea is you would dedicate one of the servers in the chassis to be a Solaris system, which then presents NFS out to the rest of the hosts. From the chassis itself you would present every drive that isn't being used to boot an existing server to this Solaris host as individual disks, and let that server take care of RAID and presenting the storage out to the rest of the vmware hosts.

--Tim
Jim Klimov
2011-Oct-14 16:49 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
2011-10-14 19:33, Tim Cook wrote:
>>>> With clustered VMFS on shared storage, VMWare can
>>>> migrate VMs faster - it knows not to copy the HDD image
>>>> file in vain - it will be equally available to the "new host"
>>>> at the correct point in migration, just as it was accessible
>>>> to the "old host".
>>>
>>> Again. NFS/iscsi/IB = ok.
>>
>> True, except that this is not an optimal solution in this described
>> usecase - a farm of server blades with a relatively dumb fast raw
>> storage (but NOT an intellectual storage server).
>>
>> //Jim
>
> The idea is you would dedicate one of the servers in the chassis to be
> a Solaris system, which then presents NFS out to the rest of the
> hosts. From the chassis itself you would present every drive that
> isn't being used to boot an existing server to this Solaris host as
> individual disks, and let that server take care of RAID and presenting
> out the storage to the rest of the vmware hosts.
>
> --Tim

Yes, I wrote of that as an option - but a relatively poor one (though for now we are limited to doing this). As I wrote several times, the major downsides are:

* probably increased latency due to another added hop of processing delays, just as with extra switches and routers in networks;

* probably reduced LAN bandwidth compared to direct disk access; it certainly won't get increased ;) Besides, the LAN may be (highly) utilized by servers running in VMs or on physical blades, so storage traffic over the LAN would compete with real networking and/or add to latencies;

* in order for the whole chassis to provide HA services and run highly-available VMs, the storage servers have to be redundant - at least one other blade would have to be provisioned for failover ZFS import and serving to other nodes. This is not exactly a showstopper - but the "spare" blade would either have to run no VMs at all, or not as many VMs as the others, and in case of a pool failover event it would probably have to migrate its running VMs away in order to increase ARC and reduce storage latency for the other servers. That's doable, and automatable, but a hassle nonetheless.

Also, I'm not certain how well the other hosts can benefit from caching in their local RAM when using NFS or iSCSI resources. I think they might benefit more from local ARCs if the pool were directly imported on each of them...

The upside is:

* this already works, and reliably, like any other ZFS NAS solution. That's a certain "plus" :)

In this current case one or two out of six blades would have to be dedicated to storage, leaving only 4 or 5 for VMs. In the case of shared pools, there is a new problem of TXG-master failover to some other node (which would probably be no slower than a pool reimport is now), but otherwise all six servers' loads are balanced. And they only cache what they really need. And they have faster disk access times. And they don't use the LAN superfluously for storage access.

//Jim

PS: Anyway, I wanted to say this earlier - thanks to everyone who responded, even (or especially) with criticism and requests for more detail. If nothing else, you helped me describe my idea better and less ambiguously, so that other thinkers can decide whether and how to implement it ;)

PPS: When I earlier asked about getting ZFS under the hood of RAID controllers, I guess I kind of wished to replace the black box of Intel's firmware with a ZFS-aware OS (FreeBSD probably) - the storage controller modules must be some sort of computers running in a failover link...
These SCMs would then export datasets as SAS LUNs to specific servers, like is done now, and possibly would not require clustered ZFS - but might benefit from it too. So my MFSYS illustration is partially relevant for that question as well...
Nico Williams
2011-Oct-14 17:17 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Thu, Oct 13, 2011 at 9:13 PM, Jim Klimov <jimklimov at cos.ru> wrote:> Thanks to Nico for concerns about POSIX locking. However, > hopefully, in the usecase I described - serving images of > VMs in a manner where storage, access and migration are > efficient - whole datasets (be it volumes or FS datasets) > can be dedicated to one VM host server at a time, just like > whole pools are dedicated to one host nowadays. In this > case POSIX compliance can be disregarded - access > is locked by one host, not avaialble to others, period. > Of course, there is a problem of capturing storage from > hosts which died, and avoiding corruptions - but this is > hopefully solved in the past decades of clustering tech''s.It sounds to me like you need horizontal scaling more than anything else. In that case, why not use pNFS or Lustre? Even if you want snapshots, a VM should be able to handle that on its own, and though probably not as nicely as ZFS in some respects, having the application be in control of the exact snapshot boundaries does mean that you don''t have to quiesce your VMs just to snapshot safely.> Nico also confirmed that "one node has to be a master of > all TXGs" - which is conveyed in both ideas of my original > post.Well, at any one time one node would have to be the master of the next TXG, but it doesn''t mean that you couldn''t have some cooperation. There are lots of other much more interesting questions. I think the biggest problem lies in requiring full connectivity from every server to every LUN. I''d much rather take the Lustre / pNFS model (which, incidentally, don''t preclude having snapshots). Nico --
Nico Williams
2011-Oct-14 17:19 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Also, it''s not worth doing a clustered ZFS thing that is too application-specific. You really want to nail down your choices of semantics, explore what design options those yield (or approach from the other direction, or both), and so on. Nico --
Edward Ned Harvey
2011-Oct-15 12:28 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> The idea is you would dedicate one of the servers in the chassis to be a
> Solaris system, which then presents NFS out to the rest of the hosts.

Actually, I looked into a configuration like this, and found it's useful in some cases - VMware boots from a dumb disk, and does PCI pass-thru, presenting the raw HBA to Solaris. Create your pools, and export on the Virtual Switch, so VMware can then use the storage to hold other VMs. Since it's going across only a CPU-limited virtual ethernet switch, it should be nearly as fast as local access to the disk.

In theory. But not in practice. I found that the max throughput of the virtual switch is around 2-3 Gbit/sec. Never mind ZFS or storage or anything - simply the CPU-limited virtual switch is a bottleneck. I see they're developing virtual switches with Cisco and Intel. Maybe it'll improve. But I suspect they're probably adding more functionality (QoS, etc.) rather than focusing on performance.
Edward Ned Harvey
2011-Oct-15 13:14 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Tim Cook > > In my example - probably not a completely clustered FS. > A clustered ZFS pool with datasets individually owned by > specific nodes at any given time would suffice for such > VM farms. This would give users the benefits of ZFS > (resilience, snapshots and clones, shared free space) > merged with the speed of direct disk access instead of > lagging through a storage server accessing these disks.I think I see a couple of points of disconnect. #1 - You seem to be assuming storage is slower when it''s on a remote storage server as opposed to a local disk. While this is typically true over ethernet, it''s not necessarily true over infiniband or fibre channel. That being said, I don''t want to assume everyone should be shoe-horned into infiniband or fibre channel. There are some significant downsides of IB and FC. Such as cost, and centralization of the storage. Single point of failure, and so on. So there is some ground to be gained... Saving cost and/or increasing workload distribution and/or scalability. One size doesn''t fit all. I like the fact that you''re thinking of something different. #2 - You''re talking about a clustered FS, but the characteristics required are more similar to a distributed filesystem. In a clustered FS, you have something like a LUN on a SAN, which is a raw device simultaneously mounted by multiple OSes. In a distributed FS, such as lustre, you have a configurable level of redundancy (maybe zero) distributed across multiple systems (maybe all) and meanwhile all hosts share the same namespace. So each system doing heavy IO is working at local disk speeds, but any system trying to access data that was created by another system must access that data remotely. If the goal is ... to do something like VMotion, including the storage... Doing something like VMotion would be largely pointless if the VM storage still remains on the node that was previously the compute head. So let''s imagine for a moment that you have two systems, which are connected directly to each other over infiniband or any bus whose remote performance is the same as local performance. You have a zpool mirror using the local disk and the remote disk. Then you should be able to (theoretically) do something like VMotion from one system to the other, and kill the original system. Even if the original system dies ungracefully and the VM dies with it, you can still boot up the VM on the second system, and the only loss you''ve suffered was an ungraceful reboot. If you do the same thing over ethernet, then the performance will be degraded to ethernet speeds. So take it for granted, no matter what you do, you either need a bus that performs just as well remotely versus locally... Or else performance will be degraded... Or else it''s kind of pointless because the VM storage lives only on the system that you want to VMotion away from.
Richard Elling
2011-Oct-15 18:43 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Oct 15, 2011, at 6:14 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Tim Cook
>>
>> In my example - probably not a completely clustered FS.
>> A clustered ZFS pool with datasets individually owned by
>> specific nodes at any given time would suffice for such
>> VM farms. This would give users the benefits of ZFS
>> (resilience, snapshots and clones, shared free space)
>> merged with the speed of direct disk access instead of
>> lagging through a storage server accessing these disks.
>
> I think I see a couple of points of disconnect.
>
> #1 - You seem to be assuming storage is slower when it's on a remote storage
> server as opposed to a local disk. While this is typically true over
> ethernet, it's not necessarily true over infiniband or fibre channel.

Ethernet has *always* been faster than a HDD. Even back when we had 3/180s with 10Mbps Ethernet, it was faster than the 30ms average access time for the disks of the day. I tested a simple server the other day and the round trip for 4KB of data on a busy 1GbE switch was 0.2ms. Can you show a HDD as fast? Indeed, many SSDs have trouble reaching that rate under load.

Many people today are deploying 10GbE and it is relatively easy to get wire speed for bandwidth and < 0.1 ms average access for storage.

Today, HDDs aren't fast, and are not getting faster.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
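PS: For anyone who wants to check the arithmetic behind these numbers, a quick back-of-the-envelope comparison (the rpm and seek time below are assumed typical values, not measurements):

    #!/usr/bin/env python3
    # Back-of-the-envelope only: serialization time for a 4 KB payload on the
    # wire vs. the average access time of a spinning disk.  The seek time and
    # rpm are assumed "typical" values, not measurements.

    payload_bits = 4096 * 8

    for name, bits_per_sec in [("1 GbE", 1e9), ("10 GbE", 10e9)]:
        usec = payload_bits / bits_per_sec * 1e6
        print(f"{name}: about {usec:.0f} us just to put 4 KB on the wire")

    rpm, avg_seek_ms = 7200, 8.0              # assumed nearline HDD
    avg_rotation_ms = 60_000 / rpm / 2        # average rotational latency
    print(f"7200 rpm HDD: about {avg_seek_ms + avg_rotation_ms:.1f} ms average access")

Even with switch and protocol overhead pushing the network round trip up to the 0.1-0.2 ms range, the spinning disk is still one to two orders of magnitude slower per access.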
Toby Thain
2011-Oct-15 19:31 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On 15/10/11 2:43 PM, Richard Elling wrote:
> On Oct 15, 2011, at 6:14 AM, Edward Ned Harvey wrote:
>> [...]
>
> Ethernet has *always* been faster than a HDD. Even back when we had 3/180s,
> 10Mbps Ethernet was faster than the 30ms average access time of the disks of
> the day. I tested a simple server the other day, and the round trip for 4KB
> of data on a busy 1GbE switch was 0.2ms. Can you show an HDD that fast?
> Indeed, many SSDs have trouble reaching that rate under load.

Hmm, of course the *latency* of Ethernet has always been much less, but I
did not see it reaching the *throughput* of a single direct-attached disk
until gigabit.

I'm pretty sure direct-attached disk throughput in the Sun 3 era was much
better than 10Mbit Ethernet could manage. IIRC, NFS on a Sun 3 running
NetBSD over 10B2 was only *just* capable of streaming MP3, with tweaking,
from my own experiments. (I ran 10B2 at home until 2004; hey, it was good
enough!)

--Toby

> Many people today are deploying 10GbE, and it is relatively easy to get wire
> speed for bandwidth and < 0.1 ms average access for storage.
>
> Today, HDDs aren't fast, and are not getting faster.
>  -- richard
Jim Klimov
2011-Oct-15 23:57 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Thanks to all that replied. I hope we may continue the discussion,
but I'm afraid the overall verdict so far is disapproval of the idea.
It is my understanding that those active in the discussion considered
it either too limited (in application - for VMs, or for hardware cfg),
or too difficult to implement, so that we should rather use some
alternative solutions. Or at least research them better (thanks Nico).

I guess I am happy to not have seen replies like "won't work
at all, period" or "useless, period". I get "difficult" and "limited",
and hope these can be worked around sometime, and hopefully
this discussion will spark some interest in other software
authors or customers to suggest more solutions and applications -
to make some shared ZFS a possibility ;)

Still, I would like to clear up some misunderstandings in the replies -
because at times we seemed to have been speaking about
different architectures. Thanks to Richard, I stated what exact
hardware I had in mind (and wanted to use most efficiently)
while thinking about this problem, and how it is different from
"general" extensible computers or server+NAS networks.

Namely, with the shared storage architecture built into the Intel
MFSYS25 blade chassis and the lack of expansibility of its servers
beyond that, some suggested solutions are not applicable
(10GbE, FC, InfiniBand) but some networking problems
are already solved in hardware (full and equal connectivity
between all servers and all shared storage LUNs).

So some combined replies follow below:

2011-10-15, Richard Elling and Edward Ned Harvey and Nico Williams wrote:

>> #1 - You seem to be assuming storage is slower when it's on a remote storage
>> server as opposed to a local disk. While this is typically true over
>> ethernet, it's not necessarily true over infiniband or fibre channel.
>
> Many people today are deploying 10GbE, and it is relatively easy to get wire
> speed for bandwidth and < 0.1 ms average access for storage.

Well, I am afraid I have to reiterate: for a number of reasons, including
price, our customers are choosing some specific and relatively fixed
hardware solutions. So, time and again, I am afraid I'll have to remind
you of the sandbox I'm tucked into - I have to work with these boxes, and
I want to do the best with them.

I understand that Richard comes from a background where HW is the
flexible part in the equation and software is designed to be used for
years. But for many people (especially those oriented at fast-evolving
free software) the hardware is something they have to BUY and it
works unchanged for as long as possible. This does not only cover
enthusiasts like the proverbial "red-eyed linuxoids", but also many
small businesses. I still maintain several decade-old computers
running infrastructure tasks (luckily, floorspace and electricity are
near-free there) which were not yet virtualized because "if it ain't
broken - don't touch it!" ;)

In particular, the blade chassis in my example, which I hoped to
utilize to their best using shared ZFS pools, have no extension
slots. There is no 10GbE on either the external RJ45 or the internal
ports (technically there is a 10GbE interlink between the two switch
modules), so each server blade is limited to either 2 or 4 1Gbps ports.
There is no FC. No InfiniBand. There may be one extSAS link
on each storage controller module, and that's it.

> I think the biggest problem lies in requiring full
> connectivity from every server to every LUN.

This is exactly (and the only) sort of connectivity available to
server blades in this chassis.
I think this is as applicable to networked storage where there
is a mesh of reliable connections between disk controllers
and disks (or at least LUNs), be it switched FC or dual-link
SAS or whatnot.

> Doing something like VMotion would be largely pointless if the VM storage
> still remains on the node that was previously the compute head.

True. However, in these Intel MFSYS25 boxes no server blade
has any local disks (unlike most other blades I know of). Any disk
space is fed to them - and is equally accessible over an HA link -
from the storage controller modules (which are in turn connected
to the built-in array of hard disks) that are a part of the chassis,
shared by all servers just like the networking switches are.

> If you do the same thing over ethernet, then the performance will be
> degraded to ethernet speeds. So take it for granted: no matter what you do,
> you either need a bus that performs just as well remotely as it does
> locally, or else performance will be degraded, or else it's kind of
> pointless because the VM storage lives only on the system that you want to
> VMotion away from.

Well, while this is no Infiniband, in terms of disk access this
paragraph is applicable to the MFSYS chassis: disk access
via the storage controller modules can be considered a fast
common bus - if this comforts readers into understanding
my idea better. And yes, I do also think that channeling
disk access over Ethernet via one of the servers is a bad thing,
bound to degrade performance compared to what can
be had anyway with direct disk access.

> Ethernet has *always* been faster than a HDD. Even back when we had 3/180s,
> 10Mbps Ethernet was faster than the 30ms average access time of the disks of
> the day. I tested a simple server the other day, and the round trip for 4KB
> of data on a busy 1GbE switch was 0.2ms. Can you show an HDD that fast?
> Indeed, many SSDs have trouble reaching that rate under load.

As noted by other posters, access times are not bandwidth.
So these are two different "faster"s ;) Besides, (1Gbps)
Ethernet is faster than a single HDD stream. But it is not
quite faster than an array of 14 HDDs...

And if Ethernet is utilized by its direct tasks - whatever they
be, say video streaming off this server to 5000 viewers, or
whatever is needed to saturate the network - disk access
over the same ethernet link would have to compete. And
whatever the QoS settings, viewers would lose - either the
real-time multimedia signal would lag, or the disk data to
feed it would.

Moreover, using an external NAS (a dedicated server with an
Ethernet connection to the blade chassis) would give us
an external box dedicated and perhaps optimized for storage
tasks (i.e. with ZIL/L2ARC devices), and would free up a blade
for VM farming needs, but it would consume much of the LAN
bandwidth of the blades using its storage services.

> Today, HDDs aren't fast, and are not getting faster.
>  -- richard

Well, typical consumer disks did get about 2-3 times faster at
linear RW speeds over the past decade; but for random access
they do still lag a lot. So, "agreed" ;)

//Jim
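[To put numbers on the 14-HDD remark above, using the 2- or 4-port GbE
configuration this chassis offers. The per-disk rate and link efficiency
are assumed ballpark figures:]

    # Aggregate streaming bandwidth of the chassis disk array vs. one blade's
    # bonded 1 GbE ports. All figures are rough assumptions for illustration.
    disks = 14
    hdd_stream_mb_s = 120.0                  # assumed per-disk sequential rate
    array_mb_s = disks * hdd_stream_mb_s     # ~1680 MB/s if all spindles stream

    gbe_usable_mb_s = 1e9 * 0.95 / 8 / 1e6   # ~119 MB/s per 1 GbE port
    for ports in (2, 4):
        print("%d x 1 GbE: ~%4.0f MB/s vs. array ~%4.0f MB/s"
              % (ports, ports * gbe_usable_mb_s, array_mb_s))

[Even a 4-port bond carries well under a third of what the spindles can
stream, which is the core of the argument for letting blades reach the LUNs
directly rather than through another blade over Ethernet.]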
Tim Cook
2011-Oct-16 00:14 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Sat, Oct 15, 2011 at 6:57 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> [...]

Quite frankly, your choice of blade chassis was a horrible design decision.
From your description of its limitations, it should never be the building
block for a VMware cluster in the first place. I would start by rethinking
that decision instead of trying to pound a round ZFS peg into a square hole.

--Tim
Jim Klimov
2011-Oct-16 01:10 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
2011-10-16 4:14, Tim Cook wrote:
> Quite frankly, your choice of blade chassis was a horrible design decision.
> From your description of its limitations, it should never be the building
> block for a VMware cluster in the first place. I would start by rethinking
> that decision instead of trying to pound a round ZFS peg into a square hole.
>
> --Tim

Point taken ;)

Alas, quite often it is not us engineers that make the designs, but a mix
of bookkeeping folks and vendor marketing. The MFSYS boxes are pushed by
Intel or its partners as a good VMware farm in a box - and for that they
work well. As long as the storage capacity on board (4.2TB with the basic
300GB drives, more with larger ones, or even expanded with extSAS) is
sufficient, the chassis is not "a building block" of the VMware cluster.
It is the cluster, all of it.

The box has many HA features, including dual-link SAS, redundant storage
and networking controllers, and so on. It is just not very expandable.
But it is relatively cheap, which as I said is an important factor for many.

For our company, as a software service vendor, it is also suitable - the
customer buys an almost preconfigured appliance, plugs in power and an
Ethernet uplink, and things magically work. This requires little to no
skill from the customers' IT people (I won't always say they are "admins")
to maintain, and there are no intricate external connections to break off...
For relatively small offices, the 20 external gigabit ports of the two
managed switch modules can also become the networking core of the
deployment site.

Thanks,
//Jim
Edward Ned Harvey
2011-Oct-16 02:15 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Toby Thain
>
> Hmm, of course the *latency* of Ethernet has always been much less, but
> I did not see it reaching the *throughput* of a single direct-attached
> disk until gigabit.

Nobody runs a single disk except in laptops, which is of course not a
relevant datum for this conversation. If you want to remotely attach
storage, you'll need at least 1Gb per disk, if not more. This is assuming
the bus is dedicated to storage traffic and nothing else.

Yes, 10G ether is relevant, but for the same price, IB will get 4x the
bandwidth and 10x smaller latency.

So... supposing you have a single local disk and a dedicated 1Gb ethernet
to use for mirroring that device to something like an iSCSI remote
device... That's probably reasonable.
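[A quick sketch of what that dedicated-1GbE mirror costs in the worst
case - refilling one whole side of the mirror over the link. Disk sizes
and link efficiency are assumptions, not measurements:]

    # Estimate full-resilver time for one half of a mirror pushed over a
    # dedicated 1 GbE link, assuming the link is the only bottleneck.
    usable_link_mb_s = 1e9 * 0.90 / 8 / 1e6      # ~112 MB/s after protocol overhead

    for size_gb in (300, 1000, 2000):
        seconds = size_gb * 1e3 / usable_link_mb_s
        print("%5d GB mirror half: ~%.1f hours at wire speed"
              % (size_gb, seconds / 3600.0))

[A real resilver only copies allocated blocks and competes with live I/O,
so this is just an upper bound on the wire time, not a prediction.]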
Richard Elling
2011-Oct-17 12:16 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Oct 15, 2011, at 12:31 PM, Toby Thain wrote:
> On 15/10/11 2:43 PM, Richard Elling wrote:
>> [...]
>
> Hmm, of course the *latency* of Ethernet has always been much less, but I
> did not see it reaching the *throughput* of a single direct-attached disk
> until gigabit.

In practice, there are very, very, very few disk workloads that do not
involve a seek. Just one seek kills your bandwidth. But we do not define
"fast" as "bandwidth", do we?

> I'm pretty sure direct-attached disk throughput in the Sun 3 era was much
> better than 10Mbit Ethernet could manage. IIRC, NFS on a Sun 3 running
> NetBSD over 10B2 was only *just* capable of streaming MP3, with tweaking,
> from my own experiments. (I ran 10B2 at home until 2004; hey, it was good
> enough!)

The max memory you could put into a Sun-3/280 was 32MB. There is no possible
way for such a system to handle 100 Mbps Ethernet; you could exhaust all of
main memory in about 3 seconds :-)
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
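[The arithmetic behind that last remark, for the curious. The 32MB figure
comes from the message above; the rest is just unit conversion:]

    # How long incoming wire-speed data takes to fill 32 MB of RAM.
    ram_bytes = 32 * 2**20
    for mbps in (10, 100):
        rate_bytes_s = mbps * 1e6 / 8
        print("%3d Mbps fills 32 MB in ~%.1f s" % (mbps, ram_bytes / rate_bytes_s))

[At 100 Mbps that is about 2.7 seconds, which matches the "about 3 seconds"
estimate; at the 10 Mbps of the era it is closer to half a minute.]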
Jim Klimov
2011-Nov-08 23:52 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Hello all,

A couple of months ago I wrote up some ideas about clustered ZFS with
shared storage, but the idea was generally disregarded as not something
to be done in the near term due to technological difficulties.

Recently I stumbled upon a Nexenta+Supermicro report [1] about a
cluster-in-a-box with shared storage, boasting an "active-active
cluster" with "transparent failover". Now, I am not certain how
these two phrases fit in the same sentence, and maybe it is some
marketing-people mixup, but I see a few possibilities:

1) The shared storage (all 16 disks are accessible to both motherboards)
is split into two ZFS pools, each mounted by one node normally. If a node
fails, the other imports its pool and continues serving it.

2) All disks are aggregated into one pool, and one node serves it while
the other is in hot standby.

Ideas (1) and (2) may possibly contradict the claim that the failover is
seamless and transparent to clients. A pool import usually takes some
time, maybe a long time if fixups are needed; and TCP sessions are likely
to get broken. Still, maybe the clusterware solves this...

3) Nexenta did implement a shared ZFS pool with both nodes accessing all
of the data instantly and cleanly. Can this be true? ;)

If this is not a deeply-kept trade secret, can the Nexenta people
elaborate in technical terms on how this cluster works?

[1] http://www.nexenta.com/corp/sbb?gclid=CIzBg-aEqKwCFUK9zAodCSscsA

Thanks,
//Jim Klimov
Daniel Carosone
2011-Nov-09 00:09 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Wed, Nov 09, 2011 at 03:52:49AM +0400, Jim Klimov wrote:
> Recently I stumbled upon a Nexenta+Supermicro report [1] about a
> cluster-in-a-box with shared storage, boasting an "active-active
> cluster" with "transparent failover". Now, I am not certain how
> these two phrases fit in the same sentence, and maybe it is some
> marketing-people mixup,

One way they need not be in conflict is if each host normally owns
8 disks and is active for them, and standby for the other 8 disks.

Not sure if this is what the solution in question is doing, just saying.

--
Dan.
Matt Breitbach
2011-Nov-09 02:28 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
This is accomplished with the Nexenta HA cluster plugin. The plugin is
written by RSF, and you can read more about it here:
http://www.high-availability.com/

You can do either option 1 or option 2 that you put forth. There is some
failover time, but the latest version of Nexenta (3.1.1) has some
additional tweaks that bring the failover time down significantly.
Depending on pool configuration and load, failover can be done in under
10 seconds, based on some of my internal testing.

-Matt Breitbach

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org
[mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
Sent: Tuesday, November 08, 2011 5:53 PM
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

[...]
Daniel Carosone
2011-Nov-09 04:18 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Wed, Nov 09, 2011 at 11:09:45AM +1100, Daniel Carosone wrote:
> On Wed, Nov 09, 2011 at 03:52:49AM +0400, Jim Klimov wrote:
>> [...]
>
> One way they need not be in conflict is if each host normally owns
> 8 disks and is active for them, and standby for the other 8 disks.

Which, now that I reread it more carefully, is your case 1.
Sorry for the noise.

--
Dan.
Nico Williams
2011-Nov-09 06:15 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
To some people, "active-active" means all cluster members serve the same
filesystems. To others, "active-active" means all cluster members serve
some filesystems and can serve all filesystems ultimately by taking over
for failed cluster members.

Nico
--