Jim Klimov
2011-Oct-09 17:28 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Hello all,

ZFS developers have long stated that ZFS is not intended, at least not in the near term, for clustered environments (that is, having a pool safely imported by several nodes simultaneously). However, many people on forums have wished for ZFS features in clusters. I have some ideas for at least a limited implementation of clustering which may be useful in some areas. If it is not just my fantasy and it is realistic to build, this might be a good start for further work on ZFS clustering for other uses.

As one use-case example, consider VM farms with VM migration. With shared storage, the physical hosts need only migrate the VM RAM, without copying gigabytes of data between their individual storage. Such copying makes little sense when the hosts' storage is mounted off the same NAS/SAN box(es), because:
* it only wastes bandwidth moving bits around the same storage;
* IP networking speed (NFS/SMB copying) may be lower than that of a dedicated storage network between the hosts and the storage (SAS, FC, etc.);
* with a pre-configured disk layout carving one storage box into LUNs for several hosts, more slack space is wasted than with a single pool shared by several hosts, all drawing on the same "free" pool space;
* it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts, it is problematic to add a 6th server) - but that is not a problem when a single pool consumes the whole SAN and is available to all server nodes.

One feature of this use-case is that specific datasets within the potentially common pool on the NAS/SAN are still dedicated to certain physical hosts. This would be similar to serving iSCSI volumes or NFS datasets with individual VMs from a NAS box - just with a faster connection over SAS/FC. Hopefully this allows for some shortcuts in a clustered ZFS implementation, while such a solution would still be useful in practice.

So, one version of the solution would be to have a single host which imports the pool in read-write mode (i.e. the first one which boots), and other hosts would write through it (like iSCSI or whatever; maybe using SAS or FC to connect the "reader" and "writer" hosts). However, they would read directly from the ZFS pool using the full SAN bandwidth.

WRITES would be consistent because only one node writes data to the active ZFS block tree, using more or less the same code and algorithms as already exist.

In order for READS to be consistent, the "readers" need only rely on whatever latest TXG they know of, and on the cached results of their more recent writes (between the last TXG these nodes know of and the current state).

Here's where this use-case's bonus comes in: the node which currently uses a certain dataset and issues writes for it is the only one expected to write there - so even if its knowledge of the pool is some TXGs behind, it does not matter.

In order to stay up to date, and "know" the current TXG completely, the "reader nodes" should regularly read in the ZIL data (which is anyway available and accessible as part of the pool) and expire changed entries from their local caches. If for some reason a "reader node" has lost track of the pool for too long, so that the ZIL data is not sufficient to update from the "known in-RAM TXG" to the "current on-disk TXG", a full read-only import can be done again (keeping track of newer TXGs appearing while the RO import is being done).

Thanks to ZFS COW, nodes can expect that on-disk data (as pointed to by block addresses/numbers) does not change.
So in the worst case, nodes would read outdated data a few TXGs old - but not completely invalid data.

The second version of the solution is more or less the same, except that all nodes can write to the pool hardware directly, using some dedicated block ranges "owned" by one node at a time. This would work much like a ZIL containing both data and metadata. Perhaps these ranges would be whole metaslabs, or some other ranges "agreed" between the master node and the other nodes.

When a node's write is completed (or a TXG sync happens), the master node would update the ZFS block tree and uberblocks, and those per-node-ZIL blocks which are already on disk would become part of the ZFS tree. At this time new block ranges would be fanned out for writes by each non-master node. A probable optimization would be to give out several TXGs' worth of dedicated block ranges to each node, to reduce hiccups during any lags or even master-node re-elections.

The main difference from the first solution would be in performance - here all nodes would be writing to the pool hardware at full SAN/NAS networking speed, and less load would fall on the "writer node". Actually, instead of a "writer node" (responsible for translating LAN writes into SAN writes in the first solution), there would be a "master node" responsible just for consistent application of TXG updates, and for distribution of new dedicated block ranges to other nodes for new writes. Information about such block ranges would be kept on disk like some sort of "cluster ZIL" - so that writes won't be lost in case of hardware resets, software panics, node re-elections, etc. Applying these cluster ZILs and per-node ZIL ranges would become part of normal ZFS read-write imports (by a master node).

Probably there should be a part of the pool with information about the most current cluster members (who imported the pool and what role each node performs); it could be a "ring" of blocks like the ZFS uberblocks are now.

So... above I presented a couple of possible solutions to the problem of ZFS clustering. These are "off the top of my head" ideas, and as I am not a great specialist in storage clustering, they are probably silly ideas with many flaws "as is". At the very least, I already see a lot of places for possible optimisation, and the solutions (esp. #1) may be unreliable for uses other than VM hosting.

Besides the common clustering problems (quorum, STONITH, loss of connectivity from active nodes and assumption of wrong roles - i.e. attempted pool-mastering by several nodes), which may be solved by various popular methods, not excluding that part of the pool with information about cluster members, there is also the problem of ensuring that all nodes have and use the most current state of the pool - receiving new TXG info/ZIL updates, and ultimately updating uberblocks ASAP.

So besides an invitation to bash these ideas and explain why they are wrong and impossible - if they are - there is also a hope to stir up a constructive discussion finally leading to a working "clustered ZFS" solution, one more reliable than my ideas above ;) I think there is some demand for that in the market, as well as among enthusiasts...

Hope to see some interesting reading,
//Jim Klimov
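PS: To make the second idea a bit more concrete, here is a rough, single-process sketch of the block-range "lease" scheme. All class names, methods and data structures below are invented for illustration only - this is not real ZFS code nor any existing API, just a toy model of who talks to whom and when:

    #!/usr/bin/env python3
    # Toy, single-process model of the scheme above: non-master nodes write
    # directly into block ranges leased to them in advance, and one master
    # node folds those writes into the block tree at every TXG sync.
    # Everything here (class names, the dict standing in for the SAN) is
    # invented for illustration - not real ZFS code or an existing API.

    DISK = {}   # block number -> data; stands in for the shared SAS/FC storage


    class Master:
        """The single node allowed to update the block tree and uberblocks."""

        def __init__(self, blocks_per_lease=8):
            self.txg = 0
            self.block_tree = {}     # committed view: block -> TXG it joined in
            self.txg_log = []        # per-TXG list of blocks that entered the tree
            self.blocks_per_lease = blocks_per_lease
            self._next_block = 0     # naive allocator over an endless block space

        def grant_lease(self):
            """Hand out a range of blocks that one node alone may write into."""
            start = self._next_block
            self._next_block += self.blocks_per_lease
            return list(range(start, start + self.blocks_per_lease))

        def commit(self, per_node_writes):
            """TXG sync: adopt every node's already-on-disk blocks into the tree."""
            self.txg += 1
            touched = []
            for blocks in per_node_writes.values():
                for blk in blocks:
                    self.block_tree[blk] = self.txg
                    touched.append(blk)
            self.txg_log.append(touched)
            return self.txg


    class Node:
        """A non-master blade: writes into its lease, reads straight from disk."""

        def __init__(self, name, master):
            self.name = name
            self.master = master
            self.known_txg = master.txg
            self.lease = master.grant_lease()
            self.pending = []        # written to disk, not yet part of a TXG
            self.cache = {}          # locally cached reads

        def write(self, data):
            if not self.lease:       # the only step that has to wait on the master
                self.lease = self.master.grant_lease()
            blk = self.lease.pop(0)
            DISK[blk] = data         # direct write over the SAN path
            self.pending.append(blk)
            return blk

        def read(self, blk):
            if blk not in self.cache:
                self.cache[blk] = DISK.get(blk)
            return self.cache[blk]

        def catch_up(self):
            """Scan TXG records newer than we know; expire affected cache entries."""
            for touched in self.master.txg_log[self.known_txg:]:
                for blk in touched:
                    self.cache.pop(blk, None)
            self.known_txg = self.master.txg


    if __name__ == "__main__":
        master = Master()
        a, b = Node("blade-a", master), Node("blade-b", master)
        blk = a.write(b"vm-image-chunk")
        master.commit({a.name: a.pending})
        a.pending = []
        b.catch_up()
        print("blade-b at TXG", b.known_txg, "reads", b.read(blk))

In real life the leases, the per-node pending lists and the TXG log would of course have to live on disk (the "cluster ZIL" above) so that they survive resets and master re-election; the sketch only shows the division of labour.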
Richard Elling
2011-Oct-12 04:15 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:> Hello all, > > ZFS developers have for a long time stated that ZFS is not intended, > at least not in near term, for clustered environments (that is, having > a pool safely imported by several nodes simultaneously). However, > many people on forums have wished having ZFS features in clusters....and UFS before ZFS? I''d wager that every file system has this RFE in its wish list :-)> I have some ideas at least for a limited implementation of clustering > which may be useful aat least for some areas. If it is not my fantasy > and if it is realistic to make - this might be a good start for further > optimisation of ZFS clustering for other uses. > > For one use-case example, I would talk about VM farms with VM > migration. In case of shared storage, the physical hosts need only > migrate the VM RAM without copying gigabytes of data between their > individual storages. Such copying makes less sense when the > hosts'' storage is mounted off the same NAS/NAS box(es), because: > * it only wastes bandwidth moving bits around the same storage, andThis is why the best solutions use snapshots? no moving of data and you get the added benefit of shared ARC -- increasing the logical working set size does not increase the physical working set size.> * IP networking speed (NFS/SMB copying) may be less than that of > dedicated storage net between the hosts and storage (SAS, FC, etc.)Disk access is not bandwidth bound by the channel.> * with pre-configured disk layout from one storage box into LUNs for > several hosts, more slack space is wasted than with having a single > pool for several hosts, all using the same "free" pool space;...and you die by latency of metadata traffic.> * it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts, > it would be problematic to add a 6th server) - but it won''t be a problem > when the single pool consumes the whole SAM and is available to > all server nodes.Are you assuming disk access is faster than RAM access?> One feature of this use-case is that specific datasets within the > potentially common pool on the NAS/SAN are still dedicated to > certain physical hosts. This would be similar to serving iSCSI > volumes or NFS datasets with individual VMs from a NAS box - > just with a faster connection over SAS/FC. Hopefully this allows > for some shortcuts in clustering ZFS implementation, while > such solutions would still be useful in practice.I''m still missing the connection of the problem to the solution. The problem, as I see it today: disks are slow and not getting faster. SSDs are fast and getting faster and lower $/IOP. Almost all VM environments and most general purpose environments are overprovisioned for bandwidth and underprovisioned for latency. The Achille''s heel of solutions that cluster for bandwidth (eg lustre, QFS, pNFS, Gluster, GFS, etc) is that you have to trade-off latency. But latency is what we need, so perhaps not the best architectural solution?> So, one version of the solution would be to have a single host > which imports the pool in read-write mode (i.e. the first one > which boots), and other hosts would write thru it (like iSCSI > or whatever; maybe using SAS or FC to connect between > "reader" and "writer" hosts). However they would read directly > from the ZFS pool using the full SAN bandwidth. > > WRITES would be consistent because only one node writes > data to the active ZFS block tree using more or less the same > code and algorithms as already exist. 
> > > In order for READS to be consistent, the "readers" need only > rely on whatever latest TXG they know of, and on the cached > results of their more recent writes (between the last TXG > these nodes know of and current state). > > Here''s where this use-case''s bonus comes in: the node which > currently uses a certain dataset and issues writes for it, is the > only one expected to write there - so even if its knowledge of > the pool is some TXGs behind, it does not matter. > > In order to stay up to date, and "know" the current TXG completely, > the "reader nodes" should regularly read-in the ZIL data (anyway > available and accessible as part of the pool) and expire changed > entries from their local caches.:-)> If for some reason a "reader node" has lost track of the pool for > too long, so that ZIL data is not sufficient to update from "known > in-RAM TXG" to "current on-disk TXG", the full read-only import > can be done again (keeping track of newer TXGs appearing > while the RO import is being done). > > Thanks to ZFS COW, nodes can expect that on-disk data (as > pointed to by block addresses/numbers) does not change. > So in the worst case, nodes would read outdated data a few > TXGs old - but not completely invalid data. > > > Second version of the solution is more or less the same, except > that all nodes can write to the pool hardware directly using some > dedicated block ranges "owned" by one node at a time. This > would work like much a ZIL containing both data and metadata. > Perhaps these ranges would be whole metaslabs or some other > ranges as "agreed" between the master node and other nodes. > > When a node''s write is completed (or a TXG sync happens), the > master node would update the ZFS block tree and uberblocks, > and those per-node-ZIL blocks which are already on disk would > become part of the ZFS tree. At this time new block ranges would > be fanned out for writes by each non-master node. > > A probable optimization would be to give out several TXG''s worth > of dedicated block ranges to each node, to reduce hickups during > any lags or even master-node reelections.:-)> Main difference from the first solution would be in performance - > here all nodes would be writing to the pool hardware at full SAN/NAS > networking speed, and less load would come on the "writer node". > Actually, instead of a "writer node" (responsible for translation of > LAN writes to SAN writes in the first solution), there would be a > "master node" responsible just for consistent application of TXG > updates, and for distribution of new dedicated block ranges to > other nodes for new writes. Information about such block ranges > would be kept on-disk like some sort of a "cluster ZIL" - so that > writes won''t be lost in case of hardware resets, software panics, > node reelections, etc. Applying these cluster ZILs and per-node > ZIL ranges would become part of normal ZFS read-write imports > (by a master node). > > Probably there should be a part of the pool with information about > most-current cluster members (who imported the pool and what > role that node performs); it could be a "ring" of blocks like the ZFS > uberblocks are now. > > > So... above I presented a couple of possible solutions to the > problem of ZFS clustering. These are "off the top of my head" > ideas, and as I am not a great specialist in storage clustering, > they are probably silly ideas with many flaws "as is". At the very > least, I see a lot of possible optimisation locations already, > and the solutions (esp. 
#1) may be unreliable for uses other than VM hosting.

Everything in the ZIL is also in RAM. I can read from RAM with lower latency than reading from a shared slog. So how are you improving latency?

> Other than common clustering problems (quorum, stonith,
> loss of connectivity from active nodes and assumption of wrong
> roles - i.e. attempted pool-mastering by several nodes), which
> may be solved by different popular methods, not excluding that
> part of the pool with information about cluster members, there
> is also a problem of ensuring that all nodes have and use the
> most current state of the pool - receiving new TXG info/ZIL
> updates, and ultimately updating uberblocks ASAP.
>
> So beside an invitation to bash these ideas and explain why they
> are wrong and impossible - if they are - there is also a hope to
> stir up a constructive discussion finally leading up to a working
> "clustered ZFS" solution, and one more reliable than my ideas
> above ;) I think there is some demand for that in the market, as
> well as among enthusiasts?

Definitely not impossible, but please work on the business case. Remember, it is easier to build hardware than software, so your software solution must be sufficiently advanced to not be obsoleted by the next few hardware generations.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
Nico Williams
2011-Oct-12 05:03 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Tue, Oct 11, 2011 at 11:15 PM, Richard Elling <richard.elling at gmail.com> wrote:
> On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
>> ZFS developers have for a long time stated that ZFS is not intended,
>> at least not in near term, for clustered environments (that is, having
>> a pool safely imported by several nodes simultaneously). However,
>> many people on forums have wished having ZFS features in clusters.
>
> ...and UFS before ZFS? I'd wager that every file system has this RFE in its
> wish list :-)

Except the ones that already have it! :)

Nico
--
Nico Williams
2011-Oct-12 05:18 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Sun, Oct 9, 2011 at 12:28 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> So, one version of the solution would be to have a single host
> which imports the pool in read-write mode (i.e. the first one
> which boots), and other hosts would write thru it (like iSCSI
> or whatever; maybe using SAS or FC to connect between
> "reader" and "writer" hosts). However they would read directly
> from the ZFS pool using the full SAN bandwidth.

You need to do more than simply assign a node for writes. You need to send write and lock requests to one node. And then you need to figure out what to do about POSIX write visibility rules (i.e., when a write should be visible to other readers). I think you'd basically end up not meeting POSIX in this regard, just like NFS, though perhaps not with close-to-open semantics.

I don't think ZFS is the beast you're looking for. You want something more like Lustre, GPFS, and so on. I suppose someone might surprise us one day with properly clustered ZFS, but I think it'd be more likely that the filesystem would be ZFS-like, not ZFS proper.

> Second version of the solution is more or less the same, except
> that all nodes can write to the pool hardware directly using some
> dedicated block ranges "owned" by one node at a time. This
> would work much like a ZIL containing both data and metadata.
> Perhaps these ranges would be whole metaslabs or some other
> ranges as "agreed" between the master node and other nodes.

This is much hairier. You need consistency. If two processes on different nodes are writing to the same file, then you need to *internally* lock around all those writes so that the on-disk structure ends up being sane. There's a number of things you could do here, such as, for example, having a per-node log and one node coalescing them (possibly one node per file, but even then one node has to be the master of every txg). And still you need to be careful about POSIX semantics. That does not come for free in any design -- you will need something like the Lustre DLM (distributed lock manager). Or else you'll have to give up on POSIX.

There's a hefty price to be paid for POSIX semantics in a clustered environment. You'd do well to read up on Lustre's experience in detail. And not just Lustre -- that would be just to start. I caution you that this is not a simple project.

Nico
--
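PS: Just to give a feel for the shape of the coordination involved, here is a toy, single-process sketch of the "send lock requests to one node" pattern. All of the names are invented - this is not Lustre's DLM nor any real ZFS interface, only an illustration of the minimum a coordinator has to track:

    #!/usr/bin/env python3
    # Toy coordinator for the "one node grants all locks" pattern.  Invented
    # names and simplified semantics for illustration only - not Lustre's DLM
    # and not any real ZFS interface.

    import threading
    from collections import defaultdict

    class LockManager:
        """One node owns this; all others ask it before touching shared state."""

        def __init__(self):
            self._mutex = threading.Lock()
            self._owners = {}                  # (object, byte range) -> node name
            self._waiters = defaultdict(list)  # queued requests per key

        def acquire(self, node, obj, rng):
            with self._mutex:
                key = (obj, rng)
                if key not in self._owners:
                    self._owners[key] = node
                    return True                # granted immediately
                self._waiters[key].append(node)
                return False                   # caller must wait and retry

        def release(self, node, obj, rng):
            with self._mutex:
                key = (obj, rng)
                if self._owners.get(key) != node:
                    return None                # not the owner: nothing to do
                if self._waiters[key]:
                    nxt = self._waiters[key].pop(0)
                    self._owners[key] = nxt    # hand the lock to the next waiter
                    return nxt
                del self._owners[key]
                return None

    if __name__ == "__main__":
        dlm = LockManager()
        print(dlm.acquire("blade-a", "vm1.img", (0, 4096)))   # True - granted
        print(dlm.acquire("blade-b", "vm1.img", (0, 4096)))   # False - must wait
        print(dlm.release("blade-a", "vm1.img", (0, 4096)))   # blade-b now owns it

A real lock manager additionally has to deal with lock revocation, client death and fencing, which is where most of the complexity (and the latency) comes from.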
Jim Klimov
2011-Oct-14 02:13 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Hello all,

> Definitely not impossible, but please work on the business case.
> Remember, it is easier to build hardware than software, so your
> software solution must be sufficiently advanced to not be obsoleted
> by the next few hardware generations.
> -- richard

I guess Richard was correct about the use-case description - I should detail what I'm thinking about, to give some illustration. Coming from a software company, though, I tend to think of software as the more flexible part of the equation. This is something we have a chance to change. We use whatever hardware is given to us from above, for years...

When thinking about the problem and its applications to life, I have in mind blade server farms like the Intel MFSYS25, which include relatively large internal storage and can possibly take external SAS storage. We use such server farms as self-contained units (a single chassis plugged into the customer's network) for a number of projects, and recently more and more of these deployments become VMware ESX farms with shared VMFS. Due to my stronger love for things Solaris, I would love to see ZFS and any of the Solaris-based hypervisors (VBox, Xen or KVM ports) running there instead. But for things to be as efficient, ZFS would have to become shared - clustered...

I think I have to elaborate more on this hardware, as it tends to be our major use-case, and thus a limitation which influences my approach to clustered ZFS and my belief about which shortcuts are appropriate. These boxes have a shared chassis accommodating 6 server blades, each with 2 CPUs and 2 or 4 gigabit ethernet ports. The chassis also has single or dual ethernet switches to interlink the servers and to connect to the external world (10 external ports each), as well as single or dual storage controllers and 14 internal HDD bays. External SAS boxes can also be attached to the storage controller modules, but I haven't yet seen real setups like that.

In the normal "Intel use-case", the controller(s) implement several RAID LUNs which are accessible to the servers via SAS (with MPIO in case of dual controllers). Usually these LUNs are dedicated to servers - for example, boot/OS volumes. With an additional license from Intel, shared LUNs can be implemented on the chassis - these are primarily aimed at VMware farms with clustered VMFS, to use the available disk space (and the aggregated bandwidth of multiple spindles) more efficiently, as well as to aid VM migration.

To be clearer, I should say that modern VM hypervisors can migrate running virtual machines between two VM hosts. Usually (with dedicated storage for each server host) they do this by copying their HDD image files over the IP network from the "old host" to the "new host", transferring virtual RAM contents, replumbing virtual networks and resuming execution "from the same point" - after just a second-long hiccup to finalize the running VM's migration. With clustered VMFS on shared storage, VMware can migrate VMs faster - it knows not to copy the HDD image file in vain - it will be equally available to the "new host" at the correct point in the migration, just as it was accessible to the "old host".

This is what I kind of hoped to reimplement with VirtualBox or Xen or KVM running on OpenSolaris derivatives (such as OpenIndiana and others), with the proposed "ZFS clustering" using each HDD wholly as an individual LUN, aggregated into a ZFS pool by the servers themselves.
In many cases this would also be cheaper, with OpenIndiana and free hypervisors ;)

As was rightfully noted, with a common ZFS pool as the underlying storage (as happens in current Sun VDI solutions using a ZFS NAS), VM image clones can be instantiated quickly and efficiently - cheaper and faster than copying a golden image.

Now, at the risk of being accused of pushing some "marketing" through the discussion list, I have to state that these servers are relatively cheap (compared to 6 single-unit servers of comparable configuration, dual managed ethernet switches, and a SAN with 14 disks plus dual storage controllers). Price is an important factor in many of our deployments, where these boxes work stand-alone. This usually starts with a POC, when a pre-configured basic MFSYS with some VMs of our software arrives at a customer, gets tailored and works like a "black box". In a year or so an upgrade may come in the form of added disks, server blades and RAM. I have never heard even discussions of adding external storage - too pricey, and often useless with relatively fixed VM sizes - hence my desire to get a single ZFS pool available to all the blades equally. While dedicated storage boxes might be good and great, they would bump the solution price by orders of magnitude (StorEdge 7000 series) and are generally out of the question for our limited deployments.

Thanks to Nico for the concerns about POSIX locking. However, hopefully, in the use-case I described - serving images of VMs in a manner where storage, access and migration are efficient - whole datasets (be they volumes or FS datasets) can be dedicated to one VM host server at a time, just like whole pools are dedicated to one host nowadays. In this case POSIX compliance can be disregarded - access is locked by one host, not available to others, period. Of course, there is the problem of capturing storage from hosts which died, and avoiding corruption - but this is hopefully solved by the past decades of clustering technologies.

Nico also confirmed that "one node has to be a master of all TXGs" - which is conveyed in both ideas of my original post.

More directed replies follow below...

2011-10-12 8:15, Richard Elling wrote:
> On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
>> ... individual storages. Such copying makes less sense when the
>> hosts' storage is mounted off the same NAS/SAN box(es), because:
>> * it only wastes bandwidth moving bits around the same storage, and
> This is why the best solutions use snapshots - no moving of data and
> you get the added benefit of shared ARC -- increasing the logical working
> set size does not increase the physical working set size.

Snapshots would be good for cloning of VMs. They can also help with VM migration between separate hosts, IFF both machines have some common baseline snapshot so as to send increments, but the VM supervisor would have to be really intimate with the FS - like VMware is with VMFS...

I believe that with a "clustered ZFS" rearchitecturing, nobody forbids implementation of either ARCs or L2ARCs local to each host. I believe it is also not a very big challenge to allow non-local L2ARCs, i.e. so that shared ZFS pool caches can be local to individual hosts, and the common shared ZFS pool would have no knowledge of remote L2ARCs - so their absence would not cause the pool to be considered corrupt.

True, though, that in the case of cloned VM images, common blocks used by different datasets on different hosts would use up their ARCs separately and lose some of the benefit of shared ARC described above by Richard.
However, each blade has as much RAM as any other blade which might be dedicated as a storage host, so the combined cache memory in the VM farm overall would be increased. Moreover, such ARCs and L2ARCs local to cluster nodes would be "cluttered" only by data relevant to that particular host. By default there can be no L2ARCs in MFSYS, though - unless a box of SSDs were attached as external SAS storage and portions were dedicated to each blade as LUNs via the storage controller modules.

>> * IP networking speed (NFS/SMB copying) may be less than that of
>> dedicated storage net between the hosts and storage (SAS, FC, etc.)
> Disk access is not bandwidth bound by the channel.

In the case of larger NASes or SANs, where multiple-spindle performance exceeds, say, 125 MB/s, gigabit LAN performance (iSCSI/NFS/SMB) would indeed be a bottleneck compared to a faster SAS or FC link (i.e. 8 Gbit/s). Even in the case of the MFSYS chassis above, LUNs are accessible over a faster link than could be provided by the networking of a blade dedicated to NAS tasks and serving ZFS access to the other blades over the LAN. To say the least, disk access is common and equal for each blade - so having a ZFS server and serving storage over the LAN only adds another layer of latency, and possibly limits bandwidth.

>> * with pre-configured disk layout from one storage box into LUNs for
>> several hosts, more slack space is wasted than with having a single
>> pool for several hosts, all using the same "free" pool space;
> ...and you die by latency of metadata traffic.

That's possible. Hopefully it can be reduced by preallocating adequately large (sets of) fragments from the shared pool for each server's writes, so that actual blocking exchange of metadata would be rare. Since each server knows in advance which on-disk blocks it can safely write into, there should be little danger of conflict, and little added real-time latency.

>> * it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts,
>> it would be problematic to add a 6th server) - but it won't be a problem
>> when the single pool consumes the whole SAN and is available to
>> all server nodes.
> Are you assuming disk access is faster than RAM access?

I am not sure how this question is relevant to the paragraph above. Of course I don't assume THAT ;) I take it this is due to my typo "SAM" being read as "RAM" while it stood for "SAN". To reiterate the idea: a SAN, such as the 14 shared disks in the MFSYS chassis, aggregated by RAID and cut into individual per-server LUNs, is less scalable than a shared LUN on the same chassis. For example, if we have 3 blades during a POC and distribute the whole disk array into 3 individual LUNs, there is no more disk space to allocate when new blades arrive. If we don't preallocate disk space, it is wasted. Of course, in my example we know there can be no more than 6 servers, so we can preallocate 6 LUNs and give some servers 2 or more storage areas for a while. In non-blade setups there is no such luxury of certain prediction ;)

>> One feature of this use-case is that specific datasets within the
>> potentially common pool on the NAS/SAN are still dedicated to
>> certain physical hosts. This would be similar to serving iSCSI
>> volumes or NFS datasets with individual VMs from a NAS box -
>> just with a faster connection over SAS/FC. Hopefully this allows
>> for some shortcuts in clustering ZFS implementation, while
>> such solutions would still be useful in practice.
> I'm still missing the connection of the problem to the solution.
> The problem, as I see it today: disks are slow and not getting
> faster. SSDs are fast and getting faster and lower $/IOP. Almost
> all VM environments and most general purpose environments are
> overprovisioned for bandwidth and underprovisioned for latency.
> The Achilles' heel of solutions that cluster for bandwidth (eg lustre,
> QFS, pNFS, Gluster, GFS, etc) is that you have to trade off latency.
> But latency is what we need, so perhaps not the best architectural
> solution?

Again, back to my MFSYS example:
* Individual server blades have no local HDDs, nor SSDs for L2ARC. They only have CPUs, RAM, SAS-initiator and Pro/1000 chips.
* All blades access LUNs from chassis storage, no matter what. Technically one of the servers can be provisioned as a storage node, but it had better be redundant - taking 2 blades away from other jobs. And repackaging disk access from SAS to LAN is bound to be slower and/or have more latency than accessing these disks (LUNs) directly.

> Everything in the ZIL is also in RAM.

True for the local host which wrote the ZIL. False for the other hosts which use the same shared ZFS pool concurrently. However, these other hosts can read in older (flushed) ZILs to update their local caches and general knowledge of pool metadata.

> I can read from RAM with lower latency than reading from a shared
> slog. So how are you improving latency?

To be honest - I don't know. But I can make some excuses ;)

1) If datasets are dedicated to hosts (i.e. with locking), there is not much going on in other parts of the pool that would be "interesting" to a host. Hosts are interested in two things:
* what they can READ - safely, thanks to COW, and not changed by others thanks to the "dedication" of datasets;
* where they can WRITE so as not to disturb/overwrite other hosts' new writes - distributed in advance by the master host.
In this case, latency is only added when hosts run out of assigned block ranges for writes and are waiting for new assigned block ranges from the master host.

2) Improvement of latency was, truly, not considered. I am not ready to speculate how or why it might improve or worsen. I was after the best utilization of disk space and spindles (by using a single pool), as well as bandwidth (by using direct disk access instead of adding a storage server in the path).

>> Other than common clustering problems (quorum, stonith,
>> loss of connectivity from active nodes and assumption of wrong
>> roles - i.e. attempted pool-mastering by several nodes), which
>> may be solved by different popular methods, not excluding that
>> part of the pool with information about cluster members, there
>> is also a problem of ensuring that all nodes have and use the
>> most current state of the pool - receiving new TXG info/ZIL
>> updates, and ultimately updating uberblocks ASAP.
>>
>> So beside an invitation to bash these ideas and explain why they
>> are wrong and impossible - if they are - there is also a hope to
>> stir up a constructive discussion finally leading up to a working
>> "clustered ZFS" solution, and one more reliable than my ideas
>> above ;) I think there is some demand for that in the market, as
>> well as among enthusiasts?

--
Jim Klimov, CTO, JSC COS&HT
+7-903-7705859 (cellular)  mailto:jimklimov at cos.ru
CC: admin at cos.ru, jimklimov at mail.ru
Edward Ned Harvey
2011-Oct-14 11:53 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Jim Klimov > > I guess Richard was correct about the usecase description - > I should detail what I''m thinking about, to give some illustration.After reading all this, I''m still unclear on what you want to accomplish, that isn''t already done today. Yes I understand what it means when we say ZFS is not a clustering filesystem, and yes I understand what benefits there would be to gain if it were a clustering FS. But in all of what you''re saying below, I don''t see that you need a clustering FS.> of these deployments become VMWare ESX farms with shared > VMFS. Due to my stronger love for things Solaris, I would love > to see ZFS and any of Solaris-based hypervisors (VBox, Xen > or KVM ports) running there instead. But for things to be as > efficient, ZFS would have to become shared - clustered...I think the solution people currently use in this area is either NFS or iscsi. (Or infiniband, and other flavors.) You have a storage server presenting the storage to the various vmware (or whatever) hypervisors. Everything works. What''s missing? And why does this need to be a clustering FS?> To be clearer, I should say that modern VM hypervisors can > migrate running virtual machines between two VM hosts.This works on NFS/iscsi/IB as well. Doesn''t need a clustering FS.> With clustered VMFS on shared storage, VMWare can > migrate VMs faster - it knows not to copy the HDD image > file in vain - it will be equally available to the "new host" > at the correct point in migration, just as it was accessible > to the "old host".Again. NFS/iscsi/IB = ok.
Jim Klimov
2011-Oct-14 12:36 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
2011-10-14 15:53, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> I guess Richard was correct about the usecase description -
>> I should detail what I'm thinking about, to give some illustration.
>
> After reading all this, I'm still unclear on what you want to accomplish that isn't already done today. Yes I understand what it means when we say ZFS is not a clustering filesystem, and yes I understand what benefits there would be to gain if it were a clustering FS. But in all of what you're saying below, I don't see that you need a clustering FS.

In my example - probably not a completely clustered FS. A clustered ZFS pool with datasets individually owned by specific nodes at any given time would suffice for such VM farms. This would give users the benefits of ZFS (resilience, snapshots and clones, shared free space) merged with the speed of direct disk access instead of lagging through a storage server accessing these disks.

This is why I think such a solution may be simpler than a fully-fledged POSIX-compliant shared FS, but it would still have some benefits for specific - and popular - usage cases. And it might pave the way for a more complete solution - or perhaps illustrate what should not be done in those solutions ;)

After all, I think that if the problem of safe multiple-node RW access to ZFS gets fundamentally solved, the usages I described before might just become a couple of new dataset types with specific predefined usage and limitations - just as POSIX-compliant FS datasets and block-based volumes are now defined over ZFS. There is no reason not to call them "clustered FS" and "clustered volume" datasets, for example ;)

AFAIK, VMFS is not a generic filesystem, and cannot quite be used "directly" by software applications, but it has its target market for shared VM farming... I do not know how they solve the problems of consistency control - with master nodes or something else - and, for the sake of not encroaching on patents, I'm afraid I'd rather not know, so as not to copycat someone's solution and get burnt for that ;)

>> of these deployments become VMWare ESX farms with shared
>> VMFS. Due to my stronger love for things Solaris, I would love
>> to see ZFS and any of Solaris-based hypervisors (VBox, Xen
>> or KVM ports) running there instead. But for things to be as
>> efficient, ZFS would have to become shared - clustered...
>
> I think the solution people currently use in this area is either NFS or iscsi. (Or infiniband, and other flavors.) You have a storage server presenting the storage to the various vmware (or whatever) hypervisors.

In fact, no. Based on the MFSYS model, there is no storage server. There is a built-in storage controller which can do RAID over HDDs and present SCSI LUNs to the blades over direct SAS access. These LUNs can be accessed individually by certain servers, or concurrently. In the latter case it is possible that the servers take turns mounting a LUN as an HDD with some single-server FS, or use a clustered FS to share the LUN's disk space simultaneously.

If we were to use in this system an OpenSolaris-based OS and VirtualBox/Xen/KVM as they are now, and hope for live migration of VMs without copying of data, we would have to make a separate LUN for each VM on the controller, and mount/import this LUN on its current running host. I don't need to explain why that would be a clumsy and inflexible solution for a near-infinite number of reasons, do I?
;)

> Everything works. What's missing? And why does this need to be a clustering FS?
>
>> To be clearer, I should say that modern VM hypervisors can
>> migrate running virtual machines between two VM hosts.
>
> This works on NFS/iscsi/IB as well. Doesn't need a clustering FS.

Except that the storage controller doesn't do NFS/iscsi/IB, and doesn't do snapshots and clones. And if I were to dedicate one or two out of six blades to storage tasks, this might be considered an improper waste of resources. And it would repackage SAS access (anyway available to all blades at full bandwidth) into NFS/iscsi access over a Gbit link...

>> With clustered VMFS on shared storage, VMWare can
>> migrate VMs faster - it knows not to copy the HDD image
>> file in vain - it will be equally available to the "new host"
>> at the correct point in migration, just as it was accessible
>> to the "old host".
>
> Again. NFS/iscsi/IB = ok.

True, except that this is not an optimal solution in this described usecase - a farm of server blades with a relatively dumb, fast raw storage (but NOT an intellectual storage server).

//Jim
Tim Cook
2011-Oct-14 15:33 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Fri, Oct 14, 2011 at 7:36 AM, Jim Klimov <jimklimov at cos.ru> wrote:> 2011-10-14 15:53, Edward Ned Harvey ?????: > > From: zfs-discuss-bounces@**opensolaris.org<zfs-discuss-bounces at opensolaris.org>[mailto: >>> zfs-discuss- >>> bounces at opensolaris.org] On Behalf Of Jim Klimov >>> >>> I guess Richard was correct about the usecase description - >>> I should detail what I''m thinking about, to give some illustration. >>> >> After reading all this, I''m still unclear on what you want to accomplish, >> that isn''t already done today. Yes I understand what it means when we say >> ZFS is not a clustering filesystem, and yes I understand what benefits there >> would be to gain if it were a clustering FS. But in all of what you''re >> saying below, I don''t see that you need a clustering FS. >> > > In my example - probably not a completely clustered FS. > A clustered ZFS pool with datasets individually owned by > specific nodes at any given time would suffice for such > VM farms. This would give users the benefits of ZFS > (resilience, snapshots and clones, shared free space) > merged with the speed of direct disk access instead of > lagging through a storage server accessing these disks. > > This is why I think such a solution may be more simple > than a fully-fledged POSIX-compliant shared FS, but it > would still have some benefits for specific - and popular - > usage cases. And it might pave way for a more complete > solution - or perhaps illustrate what should not be done > for those solutions ;) > > After all, I think that if the problem of safe multiple-node > RW access to ZFS gets fundamentally solved, these > usages I described before might just become a couple > of new dataset types with specific predefined usage > and limitations - like POSIX-compliant FS datasets > and block-based volumes are now defined over ZFS. > There is no reason not to call them "clustered FS and > clustered volume datasets", for example ;) > > AFAIK, VMFS is not a generic filesystem, and cannot > quite be used "directly" by software applications, but it > has its target market for shared VM farming... > > I do not know how they solve the problems of consistency > control - with master nodes or something else, and for > the sake of patent un-encroaching, I''m afraid I''d rather > not know - as to not copycat someone''s solution and > get burnt for that ;) > > > >> of these deployments become VMWare ESX farms with shared >>> VMFS. Due to my stronger love for things Solaris, I would love >>> to see ZFS and any of Solaris-based hypervisors (VBox, Xen >>> or KVM ports) running there instead. But for things to be as >>> efficient, ZFS would have to become shared - clustered... >>> >> I think the solution people currently use in this area is either NFS or >> iscsi. (Or infiniband, and other flavors.) You have a storage server >> presenting the storage to the various vmware (or whatever) hypervisors. >> > > In fact, no. Based on the MFSYS model, there is no storage server. > There is a built-in storage controller which can do RAID over HDDs > and represent SCSI LUNs to the blades over direct SAS access. > These LUNs can be accessed individually by certain servers, or > concurrently. In the latter case it is possible that servers take turns > mounting the LUN as a HDD with some single-server FS, or use > a clustered FS to use the LUN''s disk space simultaneously. 
> If we were to use in this system an OpenSolaris-based OS and
> VirtualBox/Xen/KVM as they are now, and hope for live migration
> of VMs without copying of data, we would have to make a separate
> LUN for each VM on the controller, and mount/import this LUN to
> its current running host. I don't need to explain why that would be
> a clumsy and inflexible solution for a near-infinite number of
> reasons, do I? ;)
>
>> Everything works. What's missing? And why does this need to be a
>> clustering FS?
>>
>>> To be clearer, I should say that modern VM hypervisors can
>>> migrate running virtual machines between two VM hosts.
>>
>> This works on NFS/iscsi/IB as well. Doesn't need a clustering FS.
>
> Except that the storage controller doesn't do NFS/iscsi/IB,
> and doesn't do snapshots and clones. And if I were to
> dedicate one or two out of six blades to storage tasks,
> this might be considered an improper waste of resources.
> And would repackage SAS access (anyway available to
> all blades at full bandwidth) into NFS/iscsi access over a
> Gbit link...
>
>>> With clustered VMFS on shared storage, VMWare can
>>> migrate VMs faster - it knows not to copy the HDD image
>>> file in vain - it will be equally available to the "new host"
>>> at the correct point in migration, just as it was accessible
>>> to the "old host".
>>
>> Again. NFS/iscsi/IB = ok.
>
> True, except that this is not an optimal solution in this described
> usecase - a farm of server blades with a relatively dumb fast raw
> storage (but NOT an intellectual storage server).
>
> //Jim

The idea is you would dedicate one of the servers in the chassis to be a Solaris system, which then presents NFS out to the rest of the hosts. From the chassis itself you would present every drive that isn't being used to boot an existing server to this Solaris host as individual disks, and let that server take care of RAID and presenting the storage out to the rest of the vmware hosts.

--Tim
Jim Klimov
2011-Oct-14 16:49 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
2011-10-14 19:33, Tim Cook wrote:
>>>> With clustered VMFS on shared storage, VMWare can
>>>> migrate VMs faster - it knows not to copy the HDD image
>>>> file in vain - it will be equally available to the "new host"
>>>> at the correct point in migration, just as it was accessible
>>>> to the "old host".
>>>
>>> Again. NFS/iscsi/IB = ok.
>>
>> True, except that this is not an optimal solution in this described
>> usecase - a farm of server blades with a relatively dumb fast raw
>> storage (but NOT an intellectual storage server).
>>
>> //Jim
>
> The idea is you would dedicate one of the servers in the chassis to be
> a Solaris system, which then presents NFS out to the rest of the
> hosts. From the chassis itself you would present every drive that
> isn't being used to boot an existing server to this Solaris host as
> individual disks, and let that server take care of RAID and presenting
> out the storage to the rest of the vmware hosts.
>
> --Tim

Yes, I wrote of that as an option - but a relatively poor one (though for now we are limited to doing this). As I wrote several times, the major downsides are:

* probably increased latency due to another added hop of processing delays, just as with extra switches and routers in networks;

* probably reduced LAN bandwidth compared to direct disk access; it certainly won't get increased ;) Besides, the LAN may be (highly) utilized by servers running in VMs or on physical blades, so storage traffic over the LAN would compete with real networking and/or add to latencies;

* in order for the whole chassis to provide HA services and run highly-available VMs, the storage servers have to be redundant - at least one other blade would have to be provisioned for failover ZFS import and serving to other nodes. This is not exactly a showstopper - but the "spare" blade would either have to run no VMs at all, or not as many VMs as the others, and in case of a pool failover event it would probably have to migrate its running VMs away in order to increase ARC and reduce storage latency for the other servers. That's doable, and automatable, but a hassle nonetheless.

Also, I'm not certain how well the other hosts can benefit from caching in their local RAM when using NFS or iSCSI resources. I think they might benefit more from local ARCs if the pool were directly imported on each of them...

The upside is:

* this already works, and reliably, like any other ZFS NAS solution. That's a certain "plus" :)

In this current case one or two out of six blades would have to be dedicated to storage, leaving only 4 or 5 for VMs. In the case of shared pools, there is a new problem of TXG-master failover to some other node (which would probably be no slower than a pool reimport is now), but otherwise all six servers' loads are balanced. And they only cache what they really need. And they have faster disk access times. And they don't use the LAN superfluously for storage access.

//Jim

PS: Anyway, I wanted to say this earlier - thanks to everyone who responded, even (or especially) with criticism and requests for more detail. If nothing else, you helped me describe my idea better and less ambiguously, so that other thinkers can decide whether and how to implement it ;)

PPS: When I earlier asked about getting ZFS under the hood of RAID controllers, I guess I kind of wished to replace the black box of Intel's firmware with a ZFS-aware OS (FreeBSD probably) - the storage controller modules must be some sort of computers running in a failover link...
These SCMs would then export datasets as SAS LUNs to specific servers, like is done now, and possibly would not require clustered ZFS - but might benefit from it too. So my MFSYS illustration is partially relevant for that question as well...
Nico Williams
2011-Oct-14 17:17 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Thu, Oct 13, 2011 at 9:13 PM, Jim Klimov <jimklimov at cos.ru> wrote:> Thanks to Nico for concerns about POSIX locking. However, > hopefully, in the usecase I described - serving images of > VMs in a manner where storage, access and migration are > efficient - whole datasets (be it volumes or FS datasets) > can be dedicated to one VM host server at a time, just like > whole pools are dedicated to one host nowadays. In this > case POSIX compliance can be disregarded - access > is locked by one host, not avaialble to others, period. > Of course, there is a problem of capturing storage from > hosts which died, and avoiding corruptions - but this is > hopefully solved in the past decades of clustering tech''s.It sounds to me like you need horizontal scaling more than anything else. In that case, why not use pNFS or Lustre? Even if you want snapshots, a VM should be able to handle that on its own, and though probably not as nicely as ZFS in some respects, having the application be in control of the exact snapshot boundaries does mean that you don''t have to quiesce your VMs just to snapshot safely.> Nico also confirmed that "one node has to be a master of > all TXGs" - which is conveyed in both ideas of my original > post.Well, at any one time one node would have to be the master of the next TXG, but it doesn''t mean that you couldn''t have some cooperation. There are lots of other much more interesting questions. I think the biggest problem lies in requiring full connectivity from every server to every LUN. I''d much rather take the Lustre / pNFS model (which, incidentally, don''t preclude having snapshots). Nico --
Nico Williams
2011-Oct-14 17:19 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Also, it''s not worth doing a clustered ZFS thing that is too application-specific. You really want to nail down your choices of semantics, explore what design options those yield (or approach from the other direction, or both), and so on. Nico --
Edward Ned Harvey
2011-Oct-15 12:28 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> The idea is you would dedicate one of the servers in the chassis to be a
> Solaris system, which then presents NFS out to the rest of the hosts.

Actually, I looked into a configuration like this, and found it's useful in some cases - VMware boots from a dumb disk, and does PCI pass-thru, presenting the raw HBA to Solaris. Create your pools, and export on the Virtual Switch, so VMware can then use the storage to hold other VMs. Since it's going across only a CPU-limited virtual ethernet switch, it should be nearly as fast as local access to the disk.

In theory. But not in practice. I found that the max throughput of the virtual switch is around 2-3 Gbit/sec. Never mind ZFS or storage or anything - simply the CPU-limited virtual switch is a bottleneck. I see they're developing virtual switches with Cisco and Intel. Maybe it'll improve. But I suspect they're probably adding more functionality (QoS, etc.) rather than focusing on performance.
Edward Ned Harvey
2011-Oct-15 13:14 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Tim Cook > > In my example - probably not a completely clustered FS. > A clustered ZFS pool with datasets individually owned by > specific nodes at any given time would suffice for such > VM farms. This would give users the benefits of ZFS > (resilience, snapshots and clones, shared free space) > merged with the speed of direct disk access instead of > lagging through a storage server accessing these disks.I think I see a couple of points of disconnect. #1 - You seem to be assuming storage is slower when it''s on a remote storage server as opposed to a local disk. While this is typically true over ethernet, it''s not necessarily true over infiniband or fibre channel. That being said, I don''t want to assume everyone should be shoe-horned into infiniband or fibre channel. There are some significant downsides of IB and FC. Such as cost, and centralization of the storage. Single point of failure, and so on. So there is some ground to be gained... Saving cost and/or increasing workload distribution and/or scalability. One size doesn''t fit all. I like the fact that you''re thinking of something different. #2 - You''re talking about a clustered FS, but the characteristics required are more similar to a distributed filesystem. In a clustered FS, you have something like a LUN on a SAN, which is a raw device simultaneously mounted by multiple OSes. In a distributed FS, such as lustre, you have a configurable level of redundancy (maybe zero) distributed across multiple systems (maybe all) and meanwhile all hosts share the same namespace. So each system doing heavy IO is working at local disk speeds, but any system trying to access data that was created by another system must access that data remotely. If the goal is ... to do something like VMotion, including the storage... Doing something like VMotion would be largely pointless if the VM storage still remains on the node that was previously the compute head. So let''s imagine for a moment that you have two systems, which are connected directly to each other over infiniband or any bus whose remote performance is the same as local performance. You have a zpool mirror using the local disk and the remote disk. Then you should be able to (theoretically) do something like VMotion from one system to the other, and kill the original system. Even if the original system dies ungracefully and the VM dies with it, you can still boot up the VM on the second system, and the only loss you''ve suffered was an ungraceful reboot. If you do the same thing over ethernet, then the performance will be degraded to ethernet speeds. So take it for granted, no matter what you do, you either need a bus that performs just as well remotely versus locally... Or else performance will be degraded... Or else it''s kind of pointless because the VM storage lives only on the system that you want to VMotion away from.
Richard Elling
2011-Oct-15 18:43 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Oct 15, 2011, at 6:14 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Tim Cook
>>
>> In my example - probably not a completely clustered FS.
>> A clustered ZFS pool with datasets individually owned by
>> specific nodes at any given time would suffice for such
>> VM farms. This would give users the benefits of ZFS
>> (resilience, snapshots and clones, shared free space)
>> merged with the speed of direct disk access instead of
>> lagging through a storage server accessing these disks.
>
> I think I see a couple of points of disconnect.
>
> #1 - You seem to be assuming storage is slower when it's on a remote storage
> server as opposed to a local disk. While this is typically true over
> ethernet, it's not necessarily true over infiniband or fibre channel.

Ethernet has *always* been faster than a HDD. Even back when we had 3/180s with 10Mbps Ethernet, it was faster than the 30ms average access time for the disks of the day. I tested a simple server the other day and the round trip for 4KB of data on a busy 1GbE switch was 0.2ms. Can you show a HDD as fast? Indeed, many SSDs have trouble reaching that rate under load.

Many people today are deploying 10GbE and it is relatively easy to get wire speed for bandwidth and < 0.1 ms average access for storage.

Today, HDDs aren't fast, and are not getting faster.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
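PS: For anyone who wants to check the arithmetic behind these numbers, a quick back-of-the-envelope comparison (the rpm and seek time below are assumed typical values, not measurements):

    #!/usr/bin/env python3
    # Back-of-the-envelope only: serialization time for a 4 KB payload on the
    # wire vs. the average access time of a spinning disk.  The seek time and
    # rpm are assumed "typical" values, not measurements.

    payload_bits = 4096 * 8

    for name, bits_per_sec in [("1 GbE", 1e9), ("10 GbE", 10e9)]:
        usec = payload_bits / bits_per_sec * 1e6
        print(f"{name}: about {usec:.0f} us just to put 4 KB on the wire")

    rpm, avg_seek_ms = 7200, 8.0              # assumed nearline HDD
    avg_rotation_ms = 60_000 / rpm / 2        # average rotational latency
    print(f"7200 rpm HDD: about {avg_seek_ms + avg_rotation_ms:.1f} ms average access")

Even with switch and protocol overhead pushing the network round trip up to the 0.1-0.2 ms range, the spinning disk is still one to two orders of magnitude slower per access.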
Toby Thain
2011-Oct-15 19:31 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On 15/10/11 2:43 PM, Richard Elling wrote:
> On Oct 15, 2011, at 6:14 AM, Edward Ned Harvey wrote:
>> [...]
>
> Ethernet has *always* been faster than a HDD. Even back when we had 3/180s,
> 10Mbps Ethernet was faster than the 30ms average access time of the disks of
> the day. I tested a simple server the other day, and the round trip for 4KB
> of data on a busy 1GbE switch was 0.2ms. Can you show an HDD that fast?
> Indeed, many SSDs have trouble reaching that rate under load.

Hmm, of course the *latency* of Ethernet has always been much less, but I
did not see it reaching the *throughput* of a single direct-attached disk
until gigabit.

I'm pretty sure direct-attached disk throughput in the Sun 3 era was much
better than 10Mbit Ethernet could manage. IIRC, NFS on a Sun 3 running
NetBSD over 10B2 was only *just* capable of streaming MP3, with tweaking,
from my own experiments. (I ran 10B2 at home until 2004; hey, it was good
enough!)

--Toby

> Many people today are deploying 10GbE, and it is relatively easy to get wire
> speed for bandwidth and < 0.1 ms average access for storage.
>
> Today, HDDs aren't fast, and are not getting faster.
>  -- richard
Jim Klimov
2011-Oct-15 23:57 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Thanks to all that replied. I hope we may continue the discussion,
but I'm afraid the overall verdict so far is disapproval of the idea.
It is my understanding that those active in the discussion considered
it either too limited (in application - for VMs, or for hardware cfg),
or too difficult to implement, so that we should rather use some
alternative solutions. Or at least research them better (thanks Nico).

I guess I am happy to not have seen replies like "won't work
at all, period" or "useless, period". I get "difficult" and "limited",
and hope these can be worked around sometime, and hopefully
this discussion will spark some interest in other software
authors or customers to suggest more solutions and applications -
to make some shared ZFS a possibility ;)

Still, I would like to clear up some misunderstandings in the replies -
because at times we seemed to have been speaking about
different architectures. Thanks to Richard, I stated what exact
hardware I had in mind (and wanted to use most efficiently)
while thinking about this problem, and how it is different from
"general" extensible computers or server+NAS networks.

Namely, with the shared storage architecture built into the Intel
MFSYS25 blade chassis and the lack of expansibility of its servers
beyond that, some suggested solutions are not applicable
(10GbE, FC, InfiniBand) but some networking problems
are already solved in hardware (full and equal connectivity
between all servers and all shared storage LUNs).

So some combined replies follow below:

2011-10-15, Richard Elling and Edward Ned Harvey and Nico Williams wrote:

>> #1 - You seem to be assuming storage is slower when it's on a remote storage
>> server as opposed to a local disk. While this is typically true over
>> ethernet, it's not necessarily true over infiniband or fibre channel.
>
> Many people today are deploying 10GbE, and it is relatively easy to get wire
> speed for bandwidth and < 0.1 ms average access for storage.

Well, I am afraid I have to reiterate: for a number of reasons, including
price, our customers are choosing some specific and relatively fixed
hardware solutions. So, time and again, I am afraid I'll have to remind
you of the sandbox I'm tucked into - I have to work with these boxes, and
I want to do the best with them.

I understand that Richard comes from a background where HW is the
flexible part in the equation and software is designed to be used for
years. But for many people (especially those oriented at fast-evolving
free software) the hardware is something they have to BUY and it
works unchanged for as long as possible. This does not only cover
enthusiasts like the proverbial "red-eyed linuxoids", but also many
small businesses. I still maintain several decade-old computers
running infrastructure tasks (luckily, floorspace and electricity are
near-free there) which were not yet virtualized because "if it ain't
broken - don't touch it!" ;)

In particular, the blade chassis in my example, which I hoped to
utilize to their best using shared ZFS pools, have no extension
slots. There is no 10GbE on either the external RJ45 or the internal
ports (technically there is a 10GbE interlink between the two switch
modules), so each server blade is limited to either 2 or 4 1Gbps ports.
There is no FC. No InfiniBand. There may be one extSAS link
on each storage controller module, and that's it.

> I think the biggest problem lies in requiring full
> connectivity from every server to every LUN.

This is exactly (and the only) sort of connectivity available to
server blades in this chassis.
I think this is as applicable to networked storage where there
is a mesh of reliable connections between disk controllers
and disks (or at least LUNs), be it switched FC or dual-link
SAS or whatnot.

> Doing something like VMotion would be largely pointless if the VM storage
> still remains on the node that was previously the compute head.

True. However, in these Intel MFSYS25 boxes no server blade
has any local disks (unlike most other blades I know of). Any disk
space is fed to them - and is equally accessible over an HA link -
from the storage controller modules (which are in turn connected
to the built-in array of hard disks) that are a part of the chassis,
shared by all servers just like the networking switches are.

> If you do the same thing over ethernet, then the performance will be
> degraded to ethernet speeds. So take it for granted: no matter what you do,
> you either need a bus that performs just as well remotely as it does
> locally, or else performance will be degraded, or else it's kind of
> pointless because the VM storage lives only on the system that you want to
> VMotion away from.

Well, while this is no Infiniband, in terms of disk access this
paragraph is applicable to the MFSYS chassis: disk access
via the storage controller modules can be considered a fast
common bus - if this comforts readers into understanding
my idea better. And yes, I do also think that channeling
disk access over Ethernet via one of the servers is a bad thing,
bound to degrade performance compared to what can
be had anyway with direct disk access.

> Ethernet has *always* been faster than a HDD. Even back when we had 3/180s,
> 10Mbps Ethernet was faster than the 30ms average access time of the disks of
> the day. I tested a simple server the other day, and the round trip for 4KB
> of data on a busy 1GbE switch was 0.2ms. Can you show an HDD that fast?
> Indeed, many SSDs have trouble reaching that rate under load.

As noted by other posters, access times are not bandwidth.
So these are two different "faster"s ;) Besides, (1Gbps)
Ethernet is faster than a single HDD stream. But it is not
quite faster than an array of 14 HDDs...

And if Ethernet is utilized by its direct tasks - whatever they
be, say video streaming off this server to 5000 viewers, or
whatever is needed to saturate the network - disk access
over the same ethernet link would have to compete. And
whatever the QoS settings, viewers would lose - either the
real-time multimedia signal would lag, or the disk data to
feed it would.

Moreover, using an external NAS (a dedicated server with an
Ethernet connection to the blade chassis) would give us
an external box dedicated and perhaps optimized for storage
tasks (i.e. with ZIL/L2ARC devices), and would free up a blade
for VM farming needs, but it would consume much of the LAN
bandwidth of the blades using its storage services.

> Today, HDDs aren't fast, and are not getting faster.
>  -- richard

Well, typical consumer disks did get about 2-3 times faster at
linear RW speeds over the past decade; but for random access
they do still lag a lot. So, "agreed" ;)

//Jim
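[To put numbers on the 14-HDD remark above, using the 2- or 4-port GbE
configuration this chassis offers. The per-disk rate and link efficiency
are assumed ballpark figures:]

    # Aggregate streaming bandwidth of the chassis disk array vs. one blade's
    # bonded 1 GbE ports. All figures are rough assumptions for illustration.
    disks = 14
    hdd_stream_mb_s = 120.0                  # assumed per-disk sequential rate
    array_mb_s = disks * hdd_stream_mb_s     # ~1680 MB/s if all spindles stream

    gbe_usable_mb_s = 1e9 * 0.95 / 8 / 1e6   # ~119 MB/s per 1 GbE port
    for ports in (2, 4):
        print("%d x 1 GbE: ~%4.0f MB/s vs. array ~%4.0f MB/s"
              % (ports, ports * gbe_usable_mb_s, array_mb_s))

[Even a 4-port bond carries well under a third of what the spindles can
stream, which is the core of the argument for letting blades reach the LUNs
directly rather than through another blade over Ethernet.]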
Tim Cook
2011-Oct-16 00:14 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Sat, Oct 15, 2011 at 6:57 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> [...]

Quite frankly, your choice of blade chassis was a horrible design decision.
From your description of its limitations, it should never be the building
block for a VMware cluster in the first place. I would start by rethinking
that decision instead of trying to pound a round ZFS peg into a square hole.

--Tim
Jim Klimov
2011-Oct-16 01:10 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
2011-10-16 4:14, Tim Cook wrote:
> Quite frankly, your choice of blade chassis was a horrible design decision.
> From your description of its limitations, it should never be the building
> block for a VMware cluster in the first place. I would start by rethinking
> that decision instead of trying to pound a round ZFS peg into a square hole.
>
> --Tim

Point taken ;)

Alas, quite often it is not us engineers that make the designs, but a mix
of bookkeeping folks and vendor marketing. The MFSYS boxes are pushed by
Intel or its partners as a good VMware farm in a box - and for that they
work well. As long as the storage capacity on board (4.2TB with the basic
300GB drives, more with larger ones, or even expanded with extSAS) is
sufficient, the chassis is not "a building block" of the VMware cluster.
It is the cluster, all of it.

The box has many HA features, including dual-link SAS, redundant storage
and networking controllers, and so on. It is just not very expandable.
But it is relatively cheap, which as I said is an important factor for many.

For our company, as a software service vendor, it is also suitable - the
customer buys an almost preconfigured appliance, plugs in power and an
Ethernet uplink, and things magically work. This requires little to no
skill from the customers' IT people (I won't always say they are "admins")
to maintain, and there are no intricate external connections to break off...
For relatively small offices, the 20 external gigabit ports of the two
managed switch modules can also become the networking core of the
deployment site.

Thanks,
//Jim
Edward Ned Harvey
2011-Oct-16 02:15 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Toby Thain
>
> Hmm, of course the *latency* of Ethernet has always been much less, but
> I did not see it reaching the *throughput* of a single direct-attached
> disk until gigabit.

Nobody runs a single disk except in laptops, which is of course not a
relevant datum for this conversation. If you want to remotely attach
storage, you'll need at least 1Gb per disk, if not more. This is assuming
the bus is dedicated to storage traffic and nothing else.

Yes, 10G ether is relevant, but for the same price, IB will get 4x the
bandwidth and 10x smaller latency.

So... supposing you have a single local disk and a dedicated 1Gb ethernet
to use for mirroring that device to something like an iSCSI remote
device... That's probably reasonable.
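[A quick sketch of what that dedicated-1GbE mirror costs in the worst
case - refilling one whole side of the mirror over the link. Disk sizes
and link efficiency are assumptions, not measurements:]

    # Estimate full-resilver time for one half of a mirror pushed over a
    # dedicated 1 GbE link, assuming the link is the only bottleneck.
    usable_link_mb_s = 1e9 * 0.90 / 8 / 1e6      # ~112 MB/s after protocol overhead

    for size_gb in (300, 1000, 2000):
        seconds = size_gb * 1e3 / usable_link_mb_s
        print("%5d GB mirror half: ~%.1f hours at wire speed"
              % (size_gb, seconds / 3600.0))

[A real resilver only copies allocated blocks and competes with live I/O,
so this is just an upper bound on the wire time, not a prediction.]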
Richard Elling
2011-Oct-17 12:16 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Oct 15, 2011, at 12:31 PM, Toby Thain wrote:
> On 15/10/11 2:43 PM, Richard Elling wrote:
>> [...]
>
> Hmm, of course the *latency* of Ethernet has always been much less, but I
> did not see it reaching the *throughput* of a single direct-attached disk
> until gigabit.

In practice, there are very, very, very few disk workloads that do not
involve a seek. Just one seek kills your bandwidth. But we do not define
"fast" as "bandwidth", do we?

> I'm pretty sure direct-attached disk throughput in the Sun 3 era was much
> better than 10Mbit Ethernet could manage. IIRC, NFS on a Sun 3 running
> NetBSD over 10B2 was only *just* capable of streaming MP3, with tweaking,
> from my own experiments. (I ran 10B2 at home until 2004; hey, it was good
> enough!)

The max memory you could put into a Sun-3/280 was 32MB. There is no possible
way for such a system to handle 100 Mbps Ethernet; you could exhaust all of
main memory in about 3 seconds :-)
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
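[The arithmetic behind that last remark, for the curious. The 32MB figure
comes from the message above; the rest is just unit conversion:]

    # How long incoming wire-speed data takes to fill 32 MB of RAM.
    ram_bytes = 32 * 2**20
    for mbps in (10, 100):
        rate_bytes_s = mbps * 1e6 / 8
        print("%3d Mbps fills 32 MB in ~%.1f s" % (mbps, ram_bytes / rate_bytes_s))

[At 100 Mbps that is about 2.7 seconds, which matches the "about 3 seconds"
estimate; at the 10 Mbps of the era it is closer to half a minute.]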
Jim Klimov
2011-Nov-08 23:52 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Hello all,

A couple of months ago I wrote up some ideas about clustered ZFS with
shared storage, but the idea was generally disregarded as not something
to be done in the near term due to technological difficulties.

Recently I stumbled upon a Nexenta+Supermicro report [1] about a
cluster-in-a-box with shared storage, boasting an "active-active
cluster" with "transparent failover". Now, I am not certain how
these two phrases fit in the same sentence, and maybe it is some
marketing-people mixup, but I see a few possibilities:

1) The shared storage (all 16 disks are accessible to both motherboards)
is split into two ZFS pools, each mounted by one node normally. If a node
fails, the other imports its pool and continues serving it.

2) All disks are aggregated into one pool, and one node serves it while
the other is in hot standby.

Ideas (1) and (2) may possibly contradict the claim that the failover is
seamless and transparent to clients. A pool import usually takes some
time, maybe a long time if fixups are needed; and TCP sessions are likely
to get broken. Still, maybe the clusterware solves this...

3) Nexenta did implement a shared ZFS pool with both nodes accessing all
of the data instantly and cleanly. Can this be true? ;)

If this is not a deeply-kept trade secret, can the Nexenta people
elaborate in technical terms on how this cluster works?

[1] http://www.nexenta.com/corp/sbb?gclid=CIzBg-aEqKwCFUK9zAodCSscsA

Thanks,
//Jim Klimov
Daniel Carosone
2011-Nov-09 00:09 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Wed, Nov 09, 2011 at 03:52:49AM +0400, Jim Klimov wrote:
> Recently I stumbled upon a Nexenta+Supermicro report [1] about a
> cluster-in-a-box with shared storage, boasting an "active-active
> cluster" with "transparent failover". Now, I am not certain how
> these two phrases fit in the same sentence, and maybe it is some
> marketing-people mixup,

One way they need not be in conflict is if each host normally owns
8 disks and is active for them, and standby for the other 8 disks.

Not sure if this is what the solution in question is doing, just saying.

--
Dan.
Matt Breitbach
2011-Nov-09 02:28 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
This is accomplished with the Nexenta HA cluster plugin. The plugin is
written by RSF, and you can read more about it here:
http://www.high-availability.com/

You can do either option 1 or option 2 that you put forth. There is some
failover time, but the latest version of Nexenta (3.1.1) has some
additional tweaks that bring the failover time down significantly.
Depending on pool configuration and load, failover can be done in under
10 seconds, based on some of my internal testing.

-Matt Breitbach

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org
[mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
Sent: Tuesday, November 08, 2011 5:53 PM
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

[...]
Daniel Carosone
2011-Nov-09 04:18 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Wed, Nov 09, 2011 at 11:09:45AM +1100, Daniel Carosone wrote:
> On Wed, Nov 09, 2011 at 03:52:49AM +0400, Jim Klimov wrote:
>> [...]
>
> One way they need not be in conflict is if each host normally owns
> 8 disks and is active for them, and standby for the other 8 disks.

Which, now that I reread it more carefully, is your case 1.
Sorry for the noise.

--
Dan.
Nico Williams
2011-Nov-09 06:15 UTC
[zfs-discuss] Wanted: sanity check for a clustered ZFS idea
To some people, "active-active" means all cluster members serve the same
filesystems. To others, "active-active" means all cluster members serve
some filesystems and can serve all filesystems ultimately by taking over
for failed cluster members.

Nico
--