Over the months that this list has been active, there have been several queries about using ZFS with clusters. I have responded that we are looking into these issues. We are starting a project to define, in detail, what would be required to make ZFS into a true cluster file system. (It is much too early to speculate on when this project might produce results, so please don't ask.) And we would like some help from the community.

If you are interested in using a cluster file system, please let us know how you would like to do so. We are gathering use cases, and we would like a broad spectrum of possibilities to evaluate. Any description that you can provide of how you would like to share data among the nodes of a cluster would be valuable to us; in particular, we're interested in such things as:

- how many nodes would likely be in the cluster?
- how many of the nodes participate in active data sharing?
- what applications would be sharing data?
- what is the sharing model? Is it single-writer, multi-reader? multi-writer? multi-append? something else? sharing via a single shared file or multiple files?

Initially, at least, we are probably going to confine our thinking to clusters where all nodes have direct access to the storage (e.g., a SAN environment). But we are still interested in use cases that would apply to other circumstances, as well.

Please note that this is in the context of the Sun Cluster product, which provides high-availability clustering for a modest number of nodes (from two to 64). This product is not a high-performance computing solution encompassing hundreds or thousands of nodes.

Thank you very much.

--Ed

-- 
Ed Gould, File System Architect, Sun Cluster, Sun Microsystems
ed.gould at sun.com
Outside of a few rare questions here and there, most of the time I get asked about what turns out to be a cluster filesystem, it's in the context of a web farm that would probably otherwise rely on NFS (server failure issues) or over-the-network syncing (extra storage requirements, data update latency).

They wanted a filesystem with basically a single writer (or at most a single writer at a time) and multiple simultaneous readers (usually fewer than 10). Now this certainly isn't everyone (nor is it a wishlist), but I hear that one come up more than other cases. As this is just my observation over the past couple of years, I have no firm numbers.

In fact, if that came about, I would wish to have a way to override the default endianness used for writing on ZFS. I could easily see a case where I've got a beefy SPARC crunching stuff at the center and doing a few writes to a shared storage where the I/O performance isn't an issue, but lots and lots of reads on x86/x64 boxes at the edge where we'd want to speed things up as much as possible. For now, it sounds like we'd have to incur a translation on almost all the I/O in a setup like that.

-- Darren
On Tue, Jan 24, 2006 at 04:10:54PM -0800, Darren Dunham wrote:
> In fact, if that came about, I would wish to have a way to override
> the default endianness used for writing on ZFS. I could easily see a case
> where I've got a beefy SPARC crunching stuff at the center and doing a
> few writes to a shared storage where the I/O performance isn't an
> issue, but lots and lots of reads on x86/x64 boxes at the edge where
> we'd want to speed things up as much as possible. For now, it sounds
> like we'd have to incur a translation on almost all the I/O in a setup
> like that.

That's not true. You'd only have to pay the byte-swapping tax on ZFS metadata (since that's the only data we know how to byte-swap). This is typically a very small percentage of the overall data in the storage pool (<1%), and we byte-swap it before we put it in the cache. If this is actually a measured performance problem, I'd really like to know so we can look into it.

--Bill
Ed Gould wrote:
> If you are interested in using a cluster file system, please let us
> know how you would like to do so. We are gathering use cases, and we
> would like a broad spectrum of possibilities to evaluate. Any
> description that you can provide of how you would like to share data
> among the nodes of a cluster would be valuable to us; in particular,
> we're interested in such things as

This is a use case for my company, and probably most of the CG/post industry.

> - how many nodes would likely be in the cluster?

Two scenarios here:

1) (which you've already basically ruled out) a large (300-3000) node cluster. Each node is responsible for processing a job and would directly read the data it needs (3d models, textures, etc.) from the file system.

2) (our case) A medium number of file servers that service a much larger pool of nodes. We currently use NetApps, and distribute load with Microsoft's DFS. Servers and clients talk CIFS. A smaller number of NFS clients need access as well.

> - how many of the nodes participate in active data sharing?
> - what applications would be sharing data?
> - what is the sharing model? Is it single-writer, multi-reader? multi-writer?
>   multi-append? something else? sharing via a single shared file or multiple files?

Almost 100% single writer, multi-reader, i.e. a single node writes an image, which is then read by render wranglers, other nodes for Quicktime generation, compositors, etc. Or a 3d scene used by potentially thousands of processes simultaneously to render images. Having a single bandwidth path to a file is a huge bottleneck.

So you have a huge amount of generated data (hair, particles, geometry, animation data, images), of which a relatively tiny amount is in demand at any one time. Lots of disk spindles spread between different heads.

Can elaborate further if you want more gory details on data life cycle...

cheers,
Barry
> Outside of a few rare questions here and there, most
> of the time I get asked about what turns out to be a
> cluster filesystem, it's in the context of a web farm
> that would probably otherwise rely on NFS (server
> failure issues) or over-the-network syncing (extra
> storage requirements, data update latency).
>
> They wanted a filesystem with basically a single
> writer (or at most a single writer at a time) and
> multiple simultaneous readers (usually fewer than 10).
> Now this certainly isn't everyone (nor is it a
> wishlist), but I hear that one come up more than
> other cases. As this is just my observation over the
> past couple of years, I have no firm numbers.

For what it's worth, I encountered this exact scenario about a month ago for a project. I believe we ended up going with clustered VxFS instead (in other words, I definitely concur).
Barry Robison writes:
> > - how many of the nodes participate in active data sharing?
> > - what applications would be sharing data?
> > - what is the sharing model? Is it single-writer, multi-reader? multi-writer?
> >   multi-append? something else? sharing via a single shared file or multiple files?
>
> Almost 100% single writer, multi-reader. ie a single node writes an
> image, which is then read by render wranglers, other nodes for Quicktime
> generation, compositors, etc. Or a 3d scene used by potentially
> thousands of processes simultaneously to render images. Having a single
> bandwidth path to a file is a huge bottleneck.

Pardon my sidetracking, but this doesn't make sense to me except for the case where the system engineer assumes that bandwidth to storage >> bandwidth between nodes. Since that is not the case with today's technology, nor will it ever be the case going forward with magnetic disks, are you making an assumption which is already technologically obsolete?

Back on the main track, QFS today has the model of single writer, multiple reader, which relieves the major architectural bottleneck of arbitration. But it means that workloads with multiple writers and readers suffer; no trade-off is free. I think this is the track Ed is following: where do we make the trade-off for writer arbitration?

But I do not necessarily think it is a good idea to design a system based on today's magnetic disk technology, which may be obsoleted long before the file system is obsoleted. Rather, we should expect radical changes in the storage technology as well. Considering that today's storage technology is at least an order of magnitude slower and smaller in bandwidth than interconnect technology, does that change your architectural view of the system? Is 10:1 readers:writers a reasonable target? 100:1? 1000:1? N.B. arbitration is latency-sensitive, data movement is bandwidth-sensitive, so it is often difficult to determine where the right mix should be. I'm not convinced that the general case has a viable solution (I've not seen one yet).

-- richard
Richard Elling wrote:
> Barry Robison writes:
> > Almost 100% single writer, multi-reader. ie a single node writes an
> > image, which is then read by render wranglers, other nodes for Quicktime
> > generation, compositors, etc. Or a 3d scene used by potentially
> > thousands of processes simultaneously to render images. Having a single
> > bandwidth path to a file is a huge bottleneck.
>
> Pardon my sidetracking, but this doesn't make sense to me
> except for the case where the system engineer assumes that
> bandwidth to storage >> bandwidth between nodes. Since that
> is not the case with today's technology, nor will it ever be the
> case going forward with magnetic disks, are you making an
> assumption which is already technologically obsolete?

Well yes, the first scenario, where all the nodes participate in the cluster, is superior. However that's not the architecture we have currently, and Ed struck down that scenario with the 2-64 cluster member limit. We do have an in-house p2p application that attempts to get requested files from peers that have already cached them from the filers. But it requires hooks into applications, and has its own issues of course.

cheers,
Barry
On Jan 24, 2006, at 22:31, Barry Robison wrote:
> Well yes, the first scenario where all the nodes participate in the
> cluster is superior. However that's not the architecture we have
> currently, and Ed struck down that scenario with the 2-64 cluster
> member limit. We do have an in-house p2p application that attempts to
> get requested files from peers that have already cached them from the
> filers. But it requires hooks into applications, and has its own
> issues of course.

I certainly didn't mean to suggest that the high-performance cluster case (hundreds or thousands of nodes) wasn't also interesting. But the project I'm concerned with (because that's what Sun's cluster product is) is for high-availability clustering, with a modest number of nodes. Perhaps I was too focussed on my task at hand when I phrased my query.

If there is significant interest in adapting ZFS to the high-performance cluster arena as well, we would like to know that, too. As I think more about this possibility, even though it is not, and will not be, the focus of the project I'm working on, it may be valuable to keep the HPC case in mind as we architect for the HA case.

--Ed

-- 
Ed Gould, File System Architect, Sun Cluster, Sun Microsystems
ed.gould at sun.com
On Wed, 2006-01-25 at 05:39, Richard Elling wrote:
> Barry Robison writes:
> > Almost 100% single writer, multi-reader. ie a single node writes an
> > image, which is then read by render wranglers, other nodes for Quicktime
> > generation, compositors, etc. Or a 3d scene used by potentially
> > thousands of processes simultaneously to render images. Having a single
> > bandwidth path to a file is a huge bottleneck.
>
> Pardon my sidetracking, but this doesn't make sense to me
> except for the case where the system engineer assumes that
> bandwidth to storage >> bandwidth between nodes. Since that
> is not the case with today's technology, nor will it ever be the
> case going forward with magnetic disks, are you making an
> assumption which is already technologically obsolete?

While it's true that the bandwidth to a single storage device may be smaller than the bandwidth between nodes, what about the aggregate bandwidth to a large number of storage devices? If I have a large number of devices on a SAN, for example, then having to route all the requests to them through one node is a major bottleneck.

At my previous employer, the single-writer multi-reader scenario would have been a great boon, as we were limited by the NFS server (which could saturate gigabit, and often did). Currently, I'm thinking more about multi-writer - think Oracle RAC.

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Bandwidth to storage can be greater than node-to-node bandwidth in high-end installations; typically the node interconnect is 1Gb Ethernet today, while storage is 2Gb or 4Gb FC per port with perhaps 8 ports into an array (e.g. StorageTek FLX380, DataDirect S2A9500), and data spread across multiple arrays.

QFS supports both a single-writer/multiple-reader model (typically used for web server farms or video distribution) and a multiple-writer/multiple-reader model (for a true "shared" file system). The former doesn't require a network link between nodes, which is an advantage in environments where network security requires that the writer be "firewalled" from the readers; however, this limits how much synchronization is possible between writer & readers. For clustering, the multiple-writer/multiple-reader model makes more sense.

For distributed-compute applications (e.g. seismic analysis), there are two common cases. All nodes read from a common data file; then either each node writes to an independent file, or all the nodes write to non-overlapping ranges of the same file. This is a function of the structure of the computation; changing the relative speeds of storage & interconnect won't change it.

One can, of course, choose to route all writes through a single node attached to the storage by sending data across the network, saving the cost of a storage interconnect at the cost of increased latency and increased utilization of the network interconnect. That's reasonable for some applications. (Obviously one would want to use at least two storage-connected nodes for redundancy.) However, the industry direction appears (IMHO) to be attaching some types of storage directly to the network interconnect (Infiniband or Ethernet), which eliminates the need to centralize I/O through a server which could become a bottleneck.
On Wed, Jan 25, 2006 at 04:33:11PM +0000, Peter Tribble wrote:
> On Wed, 2006-01-25 at 05:39, Richard Elling wrote:
> > Barry Robison writes:
> > > Almost 100% single writer, multi-reader. ie a single node writes an
> > > image, which is then read by render wranglers, other nodes for Quicktime
> > > generation, compositors, etc. Or a 3d scene used by potentially
> > > thousands of processes simultaneously to render images. Having a single
> > > bandwidth path to a file is a huge bottleneck.
> >
> > Pardon my sidetracking, but this doesn't make sense to me
> > except for the case where the system engineer assumes that
> > bandwidth to storage >> bandwidth between nodes. Since that
> > is not the case with today's technology, nor will it ever be the
> > case going forward with magnetic disks, are you making an
> > assumption which is already technologically obsolete?
>
> While it's true that the bandwidth to a single storage device
> may be smaller than the bandwidth between nodes, what about the
> aggregate bandwidth to a large number of storage devices?
>
> If I have a large number of devices on a SAN, for example,
> then having to route all the requests to them through one
> node is a major bottleneck.
>
> At my previous employer, the single-writer multi-reader
> scenario would have been a great boon, as we were limited
> by the NFS server (which could saturate gigabit, and often
> did). Currently, I'm thinking more about multi-writer - think
> Oracle RAC.

Would a combination of single-writer/multi-reader ZFS clustering + pNFS[*] help?

[*] pNFS -> parallelized NFS, where one server handles most filesystem metadata and redirects clients to data servers for file I/O.
> Bandwidth to storage can be greater than node-to-node
> bandwidth in high-end installations; typically the
> node interconnect is 1Gb Ethernet today, while
> storage is 2Gb or 4Gb FC per port with perhaps 8
> ports into an array (e.g. StorageTek FLX380,
> DataDirect S2A9500), and data spread across multiple
> arrays.

By the time this gets specified and developed, GbE will be passe, if it isn't already. The higher speed networks are approaching main memory bandwidth, and that essentially flips the architecture upside down.

> QFS supports both a single-writer/multiple-reader
> model (typically used for web server farms or video
> distribution) and a multiple-writer/multiple-reader
> model (for a true "shared" file system). The former
> doesn't require a network link between nodes, which
> is an advantage in environments where network
> security requires that the writer be "firewalled"
> from the readers; however, this limits how much
> synchronization is possible between writer & readers.
> For clustering, the multiple-writer/multiple-reader
> model makes more sense.

Yes.

> For distributed-compute applications (e.g. seismic
> analysis), there are two common cases. All nodes read
> from a common data file; then either each node writes
> to an independent file, or all the nodes write to
> non-overlapping ranges of the same file. This is a
> function of the structure of the computation;
> changing the relative speeds of storage &
> interconnect won't change it.

Agree.

> One can, of course, choose to route all writes
> through a single node attached to the storage by
> sending data across the network, saving the cost of a
> storage interconnect at the cost of increased latency
> and increased utilization of the network
> interconnect. That's reasonable for some
> applications. (Obviously one would want to use at
> least two storage-connected nodes for redundancy.)
> However, the industry direction appears (IMHO) to be
> attaching some types of storage directly to the
> network interconnect (Infiniband or Ethernet), which
> eliminates the need to centralize I/O through a
> server which could become a bottleneck.

When I hear people make this argument, I always ask them "what is a RAID array?" Usually they don't really know. A RAID array is really just a server. So in such architectures you really need multiple RAID arrays, for the reasons stated above. This also puts another protocol or two in between your processors and the media, as well as a hop or three.

For the high bandwidth, single writer scenarios, this can work quite well. For multi-writer it adds complexity because the RAID array doesn't understand the context of the data.

[the doors are open... just gotta choose which one... :-)]
-- richard
Hello Ed,

Wednesday, January 25, 2006, 8:42:27 AM, you wrote:

EG> On Jan 24, 2006, at 22:31, Barry Robison wrote:
>> Well yes, the first scenario where all the nodes participate in the
>> cluster is superior. However that's not the architecture we have
>> currently, and Ed struck down that scenario with the 2-64 cluster
>> member limit. We do have an in-house p2p application that attempts to
>> get requested files from peers that have already cached them from the
>> filers. But it requires hooks into applications, and has its own
>> issues of course.

EG> I certainly didn't mean to suggest that the high-performance cluster
EG> case (hundreds or thousands of nodes) wasn't also interesting. But the
EG> project I'm concerned with (because that's what Sun's cluster product
EG> is) is for high-availability clustering, with a modest number of nodes.

It isn't clear to me if you plan to make ZFS clustering dependent on Sun Cluster? I hope not. I would really like it if you could get 2-16 nodes even without Sun Cluster and use clustered ZFS (shared, I should probably say). Then using Sun Cluster would be only an option to provide HA and/or scalability to an application.

I also believe that as ZFS is going to hit S10U2, it would be really useful if an equivalent of the HAStorage+ agent (HAZFS?) were created ASAP so people could use Sun Cluster with S10U2 and ZFS (of course not a shared filesystem, yet). I know I would use it immediately.

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
I would be most interested in understanding where QFS is going in relationship with ZFS. ZFS seems to be the long term focus, file systems wise, for Sun; someone can correct me if I am wrong. I have heard/seen references to features such as encryption and clustering coming either soon or, in the case of clustering (multi-reader/multi-writer), in the longer term. Sun already has an HSM product, SAMFS, which works only with QFS, so my questions are simply...

[a] How does it make sense to support both QFS and ZFS longer term? If both will be, or are, "high performance" and reliable, and in the case of ZFS extended to support security features such as encryption or labels. Stating what appears obvious, once ZFS is bootable it seems UFS will "go away"; won't the same hold true for QFS, once performance and other issues are worked out with ZFS?

[b] Will SAMFS ever run with ZFS? Please note, I don't expect Sun to "open source" SAMFS.

[c] What technologies, if any, will be shared between ZFS and QFS? I understand ZFS is much more advanced and different in many areas, but it seems from my perspective that there could be a large degree of value in the 2 teams joining efforts in the file systems area.

I already work with some folks I do consulting for that run SAMFS, so I would love to understand the road map wrt this subject area.

Thanks.
Robert Milkowski wrote:
> It isn't clear to me if you plan to make ZFS clustering dependent on
> Sun Cluster? I hope not. I would really like it if you could get 2-16
> nodes even without Sun Cluster and use clustered ZFS (shared I should
> probably say). Then using Sun Cluster would be only an option to
> provide HA and/or scalability to an application.

At the moment, we are only considering clusterized ZFS in the context of Sun Cluster. People often ask for this sort of decoupling, without really understanding what it entails. In particular, there are parts of Sun Cluster (e.g., membership management) that are most likely required for a cluster file system to function properly (at least with reasonable performance when failures occur), but are integral to the Cluster product and cannot be factored out easily. The request really seems to amount to, "Let me pick and choose the parts of Sun Cluster that I want at the moment, even though they were not designed to be separable." We'll keep this idea in mind, however, and if there is a reasonable way to decouple sharing ZFS from Clustering, we'll look at it.

> I also believe that as ZFS is going to hit S10U2 it would be really
> useful if an equivalent of the HAStorage+ agent (HAZFS?) were created
> ASAP so people could use Sun Cluster with S10U2 and ZFS (of course not
> a shared filesystem, yet). I know I would use it immediately.

We agree that HA-ZFS would be very useful. It is planned for Sun Cluster 3.2; coding is done and testing has begun. I do not know the release schedule for this, however. Due to testing requirements, and that Sun Cluster is part of JES (it's the JES Availability Suite), it won't be concurrent with S10U2, but I do not imagine that it should be too long afterwards.

--Ed

-- 
Ed Gould, File System Architect, Sun Cluster, Sun Microsystems
ed.gould at sun.com
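To make the HA-ZFS discussion a bit more concrete: for a non-shared pool, a failover agent's start and stop methods essentially just move pool ownership between nodes. A minimal sketch of the underlying steps is below; "tank" is a hypothetical pool name, and a real HAStoragePlus-style agent would add fencing, health probes, and dependency ordering on top of this.

  # On the node giving up the service (stop method):
  zpool export tank          # unmount all datasets and release the pool

  # On the node taking over the service (start method):
  zpool import -f tank       # -f: take over even if the old node died uncleanly
  zfs mount -a               # mount the pool's datasets before starting the application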
Hello Ed,

Wednesday, January 25, 2006, 11:30:28 PM, you wrote:

EG> Robert Milkowski wrote:
>> It isn't clear to me if you plan to make ZFS clustering dependent on
>> Sun Cluster? I hope not. I would really like it if you could get 2-16
>> nodes even without Sun Cluster and use clustered ZFS (shared I should
>> probably say). Then using Sun Cluster would be only an option to
>> provide HA and/or scalability to an application.

EG> At the moment, we are only considering clusterized ZFS in the context of
EG> Sun Cluster. People often ask for this sort of decoupling, without
EG> really understanding what it entails. In particular, there are parts of
EG> Sun Cluster (e.g., membership management) that are most likely required
EG> for a cluster file system to function properly (at least with reasonable
EG> performance when failures occur), but are integral to the Cluster
EG> product and cannot be factored out easily. The request really seems to
EG> amount to, "Let me pick and choose the parts of Sun Cluster that I want
EG> at the moment, even though they were not designed to be separable."
EG> We'll keep this idea in mind, however, and if there is a reasonable way
EG> to decouple sharing ZFS from Clustering, we'll look at it.

I haven't used QFS - but doesn't it allow a shared filesystem between nodes and yet doesn't require Sun Cluster?

>> I also believe that as ZFS is going to hit S10U2 it would be really
>> useful if an equivalent of the HAStorage+ agent (HAZFS?) were created
>> ASAP so people could use Sun Cluster with S10U2 and ZFS (of course not
>> a shared filesystem, yet). I know I would use it immediately.

EG> We agree that HA-ZFS would be very useful. It is planned for Sun
EG> Cluster 3.2; coding is done and testing has begun. I do not know the
EG> release schedule for this, however. Due to testing requirements, and
EG> that Sun Cluster is part of JES (it's the JES Availability Suite), it
EG> won't be concurrent with S10U2, but I do not imagine that it should be
EG> too long afterwards.

This is great news! Is it possible to get some beta bits of it?

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
QFS and ZFS presently address somewhat different markets. ZFS is a general-purpose file system which offers very high reliability at some performance cost. QFS can be used as a general-purpose file system as well, but is at its best in high-performance scenarios, where it can be tuned to get absolute peak performance for a particular application. (For instance, metadata can be stored on a separate disk to avoid head seeks when reading or writing data files, and files can be allocated contiguously on disk.)

SAM at present is tied to QFS, which enables some interesting features (for instance, the ability to read from a file on tape without having to ever copy it to disk, increasing performance over a traditional HSM). There is an internal project which has begun looking at decoupling SAM functionality from the file system, with the eventual intent of providing HSM capabilities to ZFS and possibly other file systems. I can't say anything about dates, of course.

There is some contact between the ZFS and QFS teams, but given the fundamentally different architectures, it's more likely that features and interfaces may be shared than implementations. (QFS also has an existing customer base, so features in new releases are driven primarily by customer requests.)
Anton B. Rang (2006-Jan-26 15:50 UTC), [zfs-discuss] Re: Cluster File System Use Cases:

> I haven't used QFS - but doesn't it allow a shared filesystem
> between nodes and yet doesn't require Sun Cluster?

Yes. You do get somewhat more functionality when running QFS in conjunction with SunCluster, though. (In particular, without SunCluster the system administrator is responsible for issuing the commands to reconfigure QFS if the metadata server fails.)
Robert Milkowski wrote:
> I haven't used QFS - but doesn't it allow a shared filesystem
> between nodes and yet doesn't require Sun Cluster?

Yes, there is a Shared QFS that does not depend on Sun Cluster. But, as Anton Rang has already commented, the QFS architecture is substantially different from that of ZFS, and there is functionality that is only available when Shared QFS is coupled with Sun Cluster. It's not at all clear to me that we could do the same with ZFS and maintain the performance and correctness characteristics that we want. But, as I said, we'll keep it in mind, and if there's a way to do it, we'll consider it.

--Ed

-- 
Ed Gould, File System Architect, Sun Cluster, Sun Microsystems
ed.gould at sun.com
> Robert Milkowski wrote:
> > It isn't clear to me if you plan to make ZFS clustering dependent on
> > Sun Cluster? I hope not. I would really like it if you could get 2-16
> > nodes even without Sun Cluster and use clustered ZFS (shared I should
> > probably say). Then using Sun Cluster would be only an option to
> > provide HA and/or scalability to an application.
>
> At the moment, we are only considering clusterized ZFS in the context of
> Sun Cluster. People often ask for this sort of decoupling, without
> really understanding what it entails. In particular, there are parts of
> Sun Cluster (e.g., membership management) that are most likely required
> for a cluster file system to function properly (at least with reasonable
> performance when failures occur), but are integral to the Cluster
> product and cannot be factored out easily. The request really seems to
> amount to, "Let me pick and choose the parts of Sun Cluster that I want
> at the moment, even though they were not designed to be separable."

In my experience, reactions like Robert's are often due to policy decisions made in the Sun Cluster design rather than a desire to not cluster. In particular, the Sun Cluster policy is oriented towards making a cluster have the same data integrity as a single host. For modern single hosts, if hardware breaks, beyond built-in resiliency, then the OS will panic or otherwise attempt to work around the issue. Sun Cluster will do this too, by use of fencing and failfast panics. The problem is that this is counterintuitive to most system administrators, who likely do not have the same expectations of a cluster as we do for single systems. When a failfast panic occurs, they tend to blame the Sun Cluster software as broken rather than recognize that a failfast panic is a symptom that something else in the cluster is broken. For a single system, they would have just seen a panic and immediately understood that the hardware is broken. I don't know how to directly solve this recognition problem.

If we were to allow this policy to be tunable, then we could eliminate some of the real or perceived deficiencies. However, this must be done in a manner such that the right data is protected. I feel that ZFS offers some opportunities here which we simply don't have in other file systems. There are also opportunities to improve fencing at a level more appropriate than LUN reservations and all of the grief associated with vendor implementations of LUN reservations.

Back to Robert's thread, what is it about Sun Cluster that is distasteful?
+ Membership, which is needed to ensure protection of data access and enable cluster-wide system administration?
+ Data protection via LUN reservation?
+ General complexity, system admin interfaces?
+ Resource group management and agent interfaces (which is similar to SMF)?
+ Policies?
+ LVM management?

-- richard
Robert Milkowski (2006-Jan-30 08:14 UTC), [zfs-discuss] Re: Cluster File System Use Cases:

Hello Richard,

Friday, January 27, 2006, 8:33:16 PM, you wrote:

>> Robert Milkowski wrote:
>> > It isn't clear to me if you plan to make ZFS clustering dependent on
>> > Sun Cluster? I hope not. I would really like it if you could get 2-16
>> > nodes even without Sun Cluster and use clustered ZFS (shared I should
>> > probably say). Then using Sun Cluster would be only an option to
>> > provide HA and/or scalability to an application.
>>
>> At the moment, we are only considering clusterized ZFS in the context of
>> Sun Cluster. People often ask for this sort of decoupling, without
>> really understanding what it entails. In particular, there are parts of
>> Sun Cluster (e.g., membership management) that are most likely required
>> for a cluster file system to function properly (at least with reasonable
>> performance when failures occur), but are integral to the Cluster
>> product and cannot be factored out easily. The request really seems to
>> amount to, "Let me pick and choose the parts of Sun Cluster that I want
>> at the moment, even though they were not designed to be separable."

RE> In my experience, reactions like Robert's are often due to
RE> policy decisions made in the Sun Cluster design rather
RE> than a desire to not cluster. In particular, the Sun Cluster
RE> policy is oriented towards making a cluster have the same
RE> data integrity as a single host. For modern single hosts, if
RE> hardware breaks, beyond built-in resiliency, then the OS
RE> will panic or otherwise attempt to work around the issue.
RE> Sun Cluster will do this too, by use of fencing and failfast
RE> panics. The problem is that this is counterintuitive to most
RE> system administrators, who likely do not have the same
RE> expectations of a cluster as we do for single systems.
RE> When a failfast panic occurs, they tend to blame the Sun
RE> Cluster software as broken rather than recognize that a
RE> failfast panic is a symptom that something else in the
RE> cluster is broken. For a single system, they would have
RE> just seen a panic and immediately understood that the
RE> hardware is broken. I don't know how to directly solve
RE> this recognition problem.

RE> If we were to allow this policy to be tunable, then we
RE> could eliminate some of the real or perceived deficiencies.
RE> However, this must be done in a manner such that
RE> the right data is protected. I feel that ZFS offers some
RE> opportunities here which we simply don't have in other
RE> file systems. There are also opportunities to improve
RE> fencing at a level more appropriate than LUN
RE> reservations and all of the grief associated with
RE> vendor implementations of LUN reservations.

RE> Back to Robert's thread, what is it about Sun Cluster
RE> that is distasteful?
RE> + Membership, which is needed to ensure protection
RE>   of data access and enable cluster-wide system
RE>   administration?
RE> + Data protection via LUN reservation?
RE> + General complexity, system admin interfaces?
RE> + Resource group management and agent
RE>   interfaces (which is similar to SMF)?
RE> + Policies?
RE> + LVM management?

I do use SC and I find it a really great product. I'm almost sure I will use ZFS+SC in the near future (in a way I already do). However, I can think of some other environments where SC is just too much complexity and all that is needed is a shared filesystem. If one can just mount the same filesystem on two or three nodes with a standard Solaris installation and not worry about interconnects, clusters, etc., sure there'll be less functionality, but it's not always needed.

I'm also not sure if you can set up SC with different architectures in the same cluster - in theory it should be possible with ZFS and there shouldn't be such a limitation.

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Robert Milkowski wrote:
> I haven't used QFS - but doesn't it allow a shared filesystem
> between nodes and yet doesn't require Sun Cluster?

You are right. There is a flavor of QFS that works with Sun Cluster. QFS has a metadata server. When working with Sun Cluster, the metadata server can be made highly available. When the node hosting the metadata server goes down, it is brought up on another node. All this is done without a tight integration with Sun Cluster. The HA metadata server is an RGM service. It just automates what the sysadmin would have done manually.

By the way, the functionality/guarantees Shared QFS provides are different from what a cluster filesystem like GFS aka PxFS provides. PxFS provides transparent access during failover/switchover scenarios. By transparent, I mean the client would not see an EIO as long as a node is able to master the filesystem. I do not think Shared QFS provides this functionality - HA metadata server or otherwise. My take is that hooking into Sun Cluster configuration changes would be needed to do this. The design team would have to decide how tight the integration is going to be.

The svm-sc aka 'sun cluster svm' aka oban team went with a loosely coupled design. You can actually get it up without installing Sun Cluster (though it is probably not supported). IMO, this loose integration has performance penalties during reconfigurations.

Disclaimer: I have not looked at the QFS source code nor have I worked with it extensively. The above is based on my understanding of how it works. :)

Regards,
Manoj
Hi,

Please see responses inline.

Ellard

> Hello Richard,
>
> Friday, January 27, 2006, 8:33:16 PM, you wrote:
>
> >> Robert Milkowski wrote:
> >> > It isn't clear to me if you plan to make ZFS clustering dependent on
> >> > Sun Cluster? I hope not. I would really like it if you could get 2-16
> >> > nodes even without Sun Cluster and use clustered ZFS (shared I should
> >> > probably say). Then using Sun Cluster would be only an option to
> >> > provide HA and/or scalability to an application.
> >>
> >> At the moment, we are only considering clusterized ZFS in the context of
> >> Sun Cluster. [...]
>
> RE> In my experience, reactions like Robert's are often due to
> RE> policy decisions made in the Sun Cluster design rather
> RE> than a desire to not cluster. [...]
>
> RE> Back to Robert's thread, what is it about Sun Cluster
> RE> that is distasteful?
> RE> + Membership, which is needed to ensure protection
> RE>   of data access and enable cluster-wide system
> RE>   administration?
> RE> + Data protection via LUN reservation?
> RE> + General complexity, system admin interfaces?
> RE> + Resource group management and agent
> RE>   interfaces (which is similar to SMF)?
> RE> + Policies?
> RE> + LVM management?
>
> I do use SC and I find it a really great product.
> I'm almost sure I will use ZFS+SC in the near future (in a way I already
> do).
>
> However, I can think of some other environments where SC is just too
> much complexity and all that is needed is a shared filesystem. If one can
> just mount the same filesystem on two or three nodes with a standard
> Solaris installation and not worry about interconnects, clusters,
> etc., sure there'll be less functionality, but it's not always needed.

Complexity is one complaint that we have received about Sun Cluster. Specifically, the administrative work and hardware restrictions have been cited. We have some ideas about significantly improving each of these areas. I would love to hear from anyone outside the Sun Cluster organization who is familiar with our SC product as to their concerns and issues. It is always good to get independent feedback. Since this email is on "zfs-discuss" and this is really a Sun Cluster topic, please send responses to just me so that we do not flood the ZFS people with SC stuff.

> I'm also not sure if you can set up SC with different architectures in
> the same cluster - in theory it should be possible with ZFS and there
> shouldn't be such a limitation.

At this time Sun Cluster only supports a cluster consisting of machines of the same architecture: either SPARC or x86. Sun Cluster operates mostly at a level where the differences between SPARC and x86 do not matter. The two limitations that I know about are:
1) little-endian vs big-endian translation
2) pxfs assumes all machines have the same OS flavor.

One reason that we have not pursued a mixed SPARC/x86 cluster is that we have not found much customer interest. It would be interesting if you are encountering potential customers for such a product.
Do you believe it is necessary for a host to be part of the Sun Cluster to mount the filesystem? I understand the need for some actions to require an HA component, but it seems to me it would be possible to mount and R/W the filesystem without being part of the cluster, as long as the host could access the services required to be HA. Perhaps by putting the location of the service in a label? Certain restrictions, like requiring z* commands to be executed from inside the cluster, would be acceptable.

To give a real world example, we run ~40 web servers for images. I don't see it as practical to put all those servers into a cluster.
> Do you believe it is necessary for a host to be part
> of the Sun Cluster to mount the filesystem?

Yes, as long as the cluster owns the data.

> I understand the need for some actions to require an HA
> component but it seems to me it would be possible to
> mount and R/W the filesystem without being part of the
> cluster as long as the host could access the services
> required to be HA. Perhaps by putting the location of
> the service in a label?
> Certain restrictions like requiring z* commands to be
> executed from inside the cluster would be acceptable.
>
> To give a real world example we run ~40 web servers
> for images. I don't see it as practical to put all
> those servers into a cluster.

Lots of people do this today. Current sharing technologies seem to work quite well. How would a tight coupling between a (mostly?) read-only client and read-write clients be an improvement? Or, what problem are you trying to solve?

-- richard
Not necessarily a cluster file system use case, but I'd like to be able to make filesystems available to multiple boxes over SAN-attached disk for the purpose of using clones. We have several test environments for each production environment which refresh data from production snapshots regularly. Currently, all systems must have enough disk to hold all of the production data even though the testing usually only changes relatively small amounts of data.

I understand I could create an architecture using NFS to share the clone(s) to the test systems, but the problems with that are:
- I would have to add at least 1 more server to the environment (probably a cluster, so 2+ servers).
- Our systems are pre-wired w/ 2 GigE interfaces and I'd be worried about sharing the general network bandwidth w/ file access.
- Our systems are already pre-wired w/ SAN connections, so it would be a waste to not utilize them.

I also understand that it would be better to just load test data on the test instances, but I have very limited influence in that area :-(

So in essence, what would be really cool would be to be able to import a pool on multiple machines w/ one read/writer of the primary data, and be able to build read/write clones (writable by only one machine) to facilitate near-instant refreshes of data.

Any chance of that becoming a reality?
I'm trying to solve the problem of lots of copies of the same files and a file distribution model based on rsync. I'll have to take a look at Sun Cluster. It's been two years since I touched Solaris. Truthfully, it would have to be shockingly cost effective to consider, and that's no dig on Sun. Just looking around at all the new stuff, I'm seeing amazing value for the money.
> I'm trying to solve the problem of lots of copies of
> the same files and a file distribution model based on
> rsync. I'll have to take a look at Sun Cluster. It's
> been two years since I touched Solaris. Truthfully,
> it would have to be shockingly cost effective to
> consider and that's no dig on Sun. Just looking
> around at all the new stuff I'm seeing amazing value
> for the money.

Your model is backwards. Rather than pushing out to the masses, have the masses cache. q.v. cachefsd(1m), mount_cachefs(1m) [note: cachefs is a nop for NFSv4] This is a much simpler model to manage, almost a no-brainer.

-- richard
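For anyone who has not used CacheFS, the setup is small. A minimal sketch follows; the server name, export path, and cache directory are made up for illustration, and the cachefs man pages should be checked for the options available on your release.

  # Create the local cache on each web-server client:
  cfsadmin -c /var/cache/webimages

  # Mount the NFS export through the cache; repeated reads are then served locally:
  mount -F cachefs -o backfstype=nfs,cachedir=/var/cache/webimages \
      nfsserver:/export/images /images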
> Not necessarily a cluster file system use case, but I'd like to be able
> to make filesystems available to multiple boxes over SAN-attached disk
> for the purpose of using clones. We have several test environments for
> each production environment which refresh data from production snapshots
> regularly. Currently, all systems must have enough disk to hold all of
> the production data even though the testing usually only changes
> relatively small amounts of data.

It is the changing that is problematic. Read-only is a much simpler case.

> I understand I could create an architecture using NFS to share the
> clone(s) to the test systems, but the problems with that are:

If I understand what you are saying, you want something like:

  disk --<SCSI>-- server --<SCSI>-- server[s] --<IP>-- clients

while I tend to advocate:

  disk --<SCSI>-- server --<NFS>-- server[s] --<IP>-- clients

habit, I suppose. But the difference is that the SCSI protocol has no context of the data, whereas NFS has some knowledge of the data. NFS does not have as much knowledge of the data as ZFS, though. Hint: use NFS to share the ZFS clones. Note, in this model a RAID array is a server which speaks the SCSI protocol. You still need to get to a file system level of abstraction at the edge servers.

> - I would have to add at least 1 more server to the environment
>   (probably a cluster, so 2+ servers).

With a RAID array this would be:

  disk --<SCSI>-- server --<SCSI>-- server[s] --<NFS>-- server[s] --<IP>-- clients

Not especially palatable, though common.

> - Our systems are pre-wired w/ 2 GigE interfaces and I'd be worried
>   about sharing the general network bandwidth w/ file access.
> - Our systems are already pre-wired w/ SAN connections, so it would
>   be a waste to not utilize them.
>
> I also understand that it would be better to just load test data on the
> test instances, but I have very limited influence in that area :-(
>
> So in essence, what would be really cool would be to be able to import
> a pool on multiple machines w/ one read/writer of the primary data, and
> be able to build read/write clones (writable by only one machine) to
> facilitate near-instant refreshes of data.
>
> Any chance of that becoming a reality?

This is already available with QFS today. It has an arbitration/synchronization method which is especially suitable for such environments. It follows your desired model. It isn't quite like ZFS, though, so there are some feature trade-offs.

Bringing this back towards ZFS-land, I think that there are some clever things we can do with snapshots and clones. But the age-old problem of arbitration rears its ugly head. I think I could write an option to expose ZFS snapshots to read-only clients. But in doing so, I don't see how to prevent an ill-behaved client from clobbering the data. To solve that problem, an arbiter must decide who can write where. The SCSI protocol has almost nothing to assist us in this cause, but NFS, QFS, and pxfs do. There is room for cleverness, but not at the SCSI or block level.

-- richard
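To make the "use NFS to share the ZFS clones" hint concrete, here is a minimal sketch of a clone-based refresh on the storage server, using hypothetical pool, dataset, and host names; the test boxes simply NFS-mount the clone.

  # On the storage server: refresh a test environment from a production snapshot
  # without copying the full data set.
  zfs snapshot tank/prod/db@refresh
  zfs clone tank/prod/db@refresh tank/test/db1     # writable clone; space is used only for changes
  zfs set sharenfs=on tank/test/db1                # export the clone over NFS

  # On a test host:
  mount -F nfs storageserver:/tank/test/db1 /testdb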
Probably a typical scenario: a three to five node Oracle RAC. Two to 4 nodes read/write, with the last node a data warehouse needing read access.

Specifically:
- how many nodes would likely be in the cluster? 5 at the upper end
- how many of the nodes participate in active data sharing? All
- what applications would be sharing data? Oracle
- what is the sharing model? multi-writer
Hello frank,

Thursday, February 2, 2006, 12:10:18 AM, you wrote:

fg> I'm trying to solve the problem of lots of copies of the same files and a file distribution model based on rsync. I'll have to take a look at Sun Cluster. It's been two years since I touched
fg> Solaris. Truthfully, it would have to be shockingly cost effective to consider and that's no dig on Sun. Just looking around at all the new stuff I'm seeing amazing value for the money.

Sun Cluster is free now. And wouldn't NFS be a better solution?

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
> Probably a typical scenario. Three to five node
> Oracle RAC. Two to 4 nodes read/write, with the last
> node a data warehouse needing read access.

Disagree, this is not typical. You really do want isolation between your OLTP and DSS workloads. RAC tends to run as fast as the slowest node can handle the arbitration, so you don't want to mix your workloads in the same cluster.

> Specifically:
> - how many nodes would likely be in the cluster? 5 at the upper end
> - how many of the nodes participate in active data sharing? All
> - what applications would be sharing data? Oracle
> - what is the sharing model? multi-writer

However, Oracle is heavily invested in, and promoting, ASM. I do not think it is wise for ZFS to try to out-ASM ASM. In fact, most of the recent Sun+Oracle world record performance benchmarks are using ASM.

-- richard
> If I understand what you are saying, you want something like:
>
>   disk --<SCSI>-- server --<SCSI>-- server[s] --<IP>-- clients
>
> while I tend to advocate:
>
>   disk --<SCSI>-- server --<NFS>-- server[s] --<IP>-- clients
>
> habit, I suppose. But the difference is that the SCSI protocol has
> no context of the data, whereas NFS has some knowledge of
> the data. NFS does not have as much knowledge of the data
> as ZFS, though. Hint: use NFS to share the ZFS clones.

I would want something like:

  disk --<SCSI>-- server1 (r/w) --<IP>-- clients (this would actually be a BC/DR copy of the production data, so there would only be client access in a disaster)
       |
       -- server2 (ro w/ the exception of r/w clones) --<IP>-- clients
       |
       -- server3 (ro w/ the exception of different r/w clones) --<IP>-- clients
       .
       .
       .

I understand the possibility of using NFS to share the ZFS clones. My concern is more around our current, well-embedded infrastructure being able to handle this case.

> This is already available with QFS today. It has an arbitration/
> synchronization method which is especially suitable for such environments.
> It follows your desired model. It isn't quite like ZFS, though, so there are
> some feature trade-offs.

QFS doesn't seem to have the "clone" feature, which is the whole point of the approach. The idea is to virtualize the data. Let's see if I can draw this out a little further. If I have a production server hosting a database with 1TB of mirrored disk attached, and 4 test/dev environments which at this point also require 1TB of disk which is in one way or another copied from the production database on a monthly basis, I'm required to have 5TB + whatever redundancy is required for each environment. I would like to be able to create clones of the production data and only require disk to hold the changes for each environment, which is really quite small relatively.

Again, I do understand this could be done over NFS. The question is: would it be possible to implement something like "sub-pools" in ZFS, where the primary pool is available/imported r/w on a master server, while slave servers could import that same pool read-only and create a read/writeable sub-pool on which clones could be created?

> Bringing this back towards ZFS-land, I think that there are some clever
> things we can do with snapshots and clones. But the age-old problem
> of arbitration rears its ugly head. I think I could write an option to expose
> ZFS snapshots to read-only clients. But in doing so, I don't see how to
> prevent an ill-behaved client from clobbering the data.

An ill-behaved client being one that might attempt to take over the primary pool read/writeable? I could understand a concern there, but I would prefer to handle that farther up the stack, probably at a process layer.

> To solve that
> problem, an arbiter must decide who can write where. The SCSI
> protocol has almost nothing to assist us in this cause, but NFS, QFS,
> and pxfs do. There is room for cleverness, but not at the SCSI or block
> level.
> -- richard

Thanks for the inputs so far...

--Andy
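Until something like shared pools or sub-pools exists, one workaround worth sketching is to replicate the production data to each test server once and then ship only the periodic deltas, cloning locally on the test side. The pool, dataset, and host names below are hypothetical, and this assumes the ZFS send/receive facility is available on your release.

  # One-time seeding of the test server:
  zfs snapshot tank/prod/db@base
  zfs send tank/prod/db@base | ssh testhost zfs receive testpool/db

  # Monthly refresh: send only the blocks changed since the last snapshot.
  # (The target must be unmodified since @base, or use "zfs receive -F".)
  zfs snapshot tank/prod/db@feb06
  zfs send -i tank/prod/db@base tank/prod/db@feb06 | ssh testhost zfs receive testpool/db

  # On testhost: a writable clone per test environment, holding only its changes.
  #   zfs clone testpool/db@feb06 testpool/test1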
Ahh. True, I should not have said typical. We use the read access on the "DSS node" to avoid sending that massive amount of data across the network during DW loads. The node reads it "locally". Apparently I'll have to look more closely at the effects.

Yes, ASM would make my life easier, but it's another matter to convince the rest of the organization. From my limited understanding of ASM, it would also require using RMAN. We're currently stuck on file system backups. Something we'll try to rectify this year.

I didn't know about the benchmarks using ASM. That is great ammunition for me. Thanks!!
> Ahh. True I should not have said typical. Using the
> read access on the "DSS node" to avoid sending that
> massive amount of data across the network during DW
> loads. The node reads it "locally". Apparently I'll
> have to look more closely at the effects.

This implies a new use case, which I think we should consider for ZFS: QoS. One of the problems with mixing OLTP and DSS workloads in the same storage is that where OLTP needs a lot of latency-sensitive but small iops, DSS needs fewer, larger iops. Without any QoS, the OLTP app will get killed by the DSS hog. It would seem to reason that ZFS would need an interface into whatever I/O QoS mechanisms are being developed.

-- richard
I'm an Oracle DBA and we are doing ASM on Sun with RAC. I am happy with ASM's performance but am interested in clustering. I mentioned to Bob Netherton that if Sun could make it a clustering file system, that helps them enable the grid further. Oracle wrote and gave OCFS2 to the Linux kernel. Since Solaris is GPL too and CDDL (correct me if I'm wrong), then couldn't they take OCFS2 and port it into Solaris? Any chance at adding clustering to ZFS? Just to see it and play with it would be fun. ZFS is open source, so if someone cares to write their own clustering file system, they can : )
Sun is supposed to have work ongoing on clustered ZFS, but it is also supposed to be out in the 2+ year timeframe. I for one would love it if someone involved in this work would give a little bit of visibility into the effort, and possibly how community members could help, if one were sufficiently talented and inclined.

thanks,
paul

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org on behalf of Thomas Roach
Sent: Wed 2/28/2007 9:23 AM
To: zfs-discuss at opensolaris.org
Subject: [zfs-discuss] Re: Cluster File System Use Cases
"Also Oracle forums and SUN forums have the SAME exact look and feel... hmmm. Even the options are exactly the same... weird." Both are from a company called Jive Software that does enterprise forums. This message posted from opensolaris.org
On Wed, Feb 28, 2007 at 07:23:44AM -0800, Thomas Roach wrote:
> I'm an Oracle DBA and we are doing ASM on Sun with RAC. I am happy with
> ASM's performance but am interested in clustering. I mentioned to Bob
> Netherton that if Sun could make ZFS a clustering file system, that would
> help them enable the grid further. Oracle wrote OCFS2 and gave it to the
> Linux kernel. Since Solaris is open source too, under the CDDL (correct me
> if I'm wrong), couldn't they take OCFS2 and port it to Solaris? Any chance
> of adding clustering to ZFS?

ASM was StorageTek's rebranding of SAM-QFS. SAM-QFS is already a shared (clustering) filesystem. You need to upgrade :) Look for "Shared QFS".

And yes, we're actively pushing the SAM-QFS code through the open-source process. Here's the first blog entry:

http://blogs.sun.com/samqfs/entry/welcome_to_sam_qfs_weblog

Dean
Hi,

my main interest is sharing a zpool between machines, so that the ZFS filesystems on different hosts can share a single LUN. When you run several applications, each in a different zone, and allow the zones to be run on one of several hosts individually (!), this currently means at least one separate LUN for each zone. Therefore you can't use any of the cool features like cloning a zone with snapshots, dynamic space sharing between zones, or easy resizing. Mounting a ZFS filesystem on multiple hosts would be nice to have, but it's not essential for me.

Just my .02

-- Dagobert
This message posted from opensolaris.org
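For comparison, here is a minimal sketch of what those features look like today on a single host that has the pool imported, using hypothetical dataset names under tank/zones; the missing piece is precisely that a second host cannot import the same pool:

# Sketch only: per-zone ZFS datasets on ONE host that owns the pool.
# Pool and dataset names (tank/zones/...) are invented for illustration.
import subprocess

def zfs(*args):
    subprocess.check_call(["zfs"] + list(args))

# Each zone gets its own dataset carved out of the shared pool...
for zone in ("web01", "web02", "db01"):
    ds = "tank/zones/%s" % zone
    zfs("create", "-p", ds)
    zfs("set", "quota=20g", ds)        # cap that can be resized later with one command
    zfs("set", "reservation=5g", ds)   # guaranteed floor; remaining space is shared dynamically

# ...and "cloning a zone" at the storage level is a snapshot plus a clone,
# not a full copy of the data.
zfs("snapshot", "tank/zones/web01@golden")
zfs("clone", "tank/zones/web01@golden", "tank/zones/web04")

# None of this helps when zones fail over to other hosts individually,
# because only one host at a time can have 'tank' imported.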
I read this paper on Sunday. Seems interesting:

The Architecture of PolyServe Matrix Server: Implementing a Symmetric Cluster File System
http://www.polyserve.com/requestinfo_formq1.php?pdf=2

What interested me the most is that the metadata and lock management are spread across all the nodes. I also read the "Parallel NFS (pNFS)" presentation, and it seems like pNFS still has the metadata on one server... (Lisa, correct me if I am wrong).

http://opensolaris.org/os/community/os_user_groups/frosug/pNFS/FROSUG-pNFS.pdf

Rayson
On 2/28/07, Dean Roehrich <dean.roehrich at sun.com> wrote:
> ASM was StorageTek's rebranding of SAM-QFS. SAM-QFS is already a shared
> (clustering) filesystem. You need to upgrade :) Look for "Shared QFS".

ASM, as Oracle uses the term, is Automatic Storage Management. To the best of my knowledge, it shares no heritage with SAM-QFS.

http://www.oracle.com/technology/products/database/asm/index.html

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
On Mon, Mar 05, 2007 at 08:20:33PM -0600, Mike Gerdts wrote:
> On 2/28/07, Dean Roehrich <dean.roehrich at sun.com> wrote:
> > ASM was StorageTek's rebranding of SAM-QFS. SAM-QFS is already a shared
> > (clustering) filesystem. You need to upgrade :) Look for "Shared QFS".
>
> ASM, as Oracle uses the term, is Automatic Storage Management. To the best
> of my knowledge, it shares no heritage with SAM-QFS.
>
> http://www.oracle.com/technology/products/database/asm/index.html

Thanks. This is the ASM I know:

http://www.storagetek.com/products/product_page86.html

Dean
The pNFS protocol doesn't preclude varying metadata server designs and their various locking strategies. As an example, there has been work going on at the University of Michigan/CITI to extend the Linux/NFSv4 implementation to allow for a pNFS server on top of the PolyServe solution.

Spencer

On Mar 5, 2007, at 2:37 PM, Rayson Ho wrote:

> I read this paper on Sunday. Seems interesting:
>
> The Architecture of PolyServe Matrix Server: Implementing a Symmetric
> Cluster File System
>
> http://www.polyserve.com/requestinfo_formq1.php?pdf=2
>
> What interested me the most is that the metadata and lock management are
> spread across all the nodes. I also read the "Parallel NFS (pNFS)"
> presentation, and it seems like pNFS still has the metadata on one
> server... (Lisa, correct me if I am wrong).
>
> http://opensolaris.org/os/community/os_user_groups/frosug/pNFS/FROSUG-pNFS.pdf
>
> Rayson
On Wed, Feb 28, 2007 at 09:54:37AM -0600, Dean Roehrich wrote:
> And yes, we're actively pushing the SAM-QFS code through the open-source
> process. Here's the first blog entry:
>
> http://blogs.sun.com/samqfs/entry/welcome_to_sam_qfs_weblog

I see that libSAM has been released. How long until we see QFS out in the wild?

-brian

--
"Perl can be fast and elegant as much as J2EE can be fast and elegant. In the hands of a skilled artisan, it can and does happen; it's just that most of the shit out there is built by people who'd be better suited to making sure that my burger is cooked thoroughly." -- Jonathan Patschke
> Bringing this back towards ZFS-land, I think that there are some clever
> things we can do with snapshots and clones. But the age-old problem
> of arbitration rears its ugly head. I think I could write an option to expose
> ZFS snapshots to read-only clients. But in doing so, I don't see how to
> prevent an ill-behaved client from clobbering the data. To solve that
> problem, an arbiter must decide who can write where. The SCSI
> protocol has almost nothing to assist us in this cause, but NFS, QFS,
> and pxfs do. There is room for cleverness, but not at the SCSI or block
> level.
> -- richard

Yeah; ISTR that IBM mainframe complexes with what they called "shared DASD" (DASD == Direct Access Storage Device, i.e. disk, drum, or the like) depended on extent reserves. IIRC, SCSI dropped extent-reserve support, and indeed it was never widely nor reliably available anyway. AFAIK, all SCSI offers is reservation of an entire LUN; that doesn't even help with slices, let alone anything else. Nor is ZFS extent-based anyway (unlike the VTOC structure on MVS or VxFS); so even if extent reserves were available, they would only help a little. Which means, as he says, some sort of arbitration.

I wonder whether the hooks for putting the ZIL on a separate device will be of any use for the cluster filesystem problem; it almost makes me wonder if there could be any parallels between pNFS and a refactored ZFS.
This message posted from opensolaris.org
On Jul 13, 2007, at 2:20 AM, Richard L. Hamilton wrote:

>> Bringing this back towards ZFS-land, I think that there are some clever
>> things we can do with snapshots and clones. But the age-old problem
>> of arbitration rears its ugly head. I think I could write an option to
>> expose ZFS snapshots to read-only clients. But in doing so, I don't see
>> how to prevent an ill-behaved client from clobbering the data. To solve
>> that problem, an arbiter must decide who can write where. The SCSI
>> protocol has almost nothing to assist us in this cause, but NFS, QFS,
>> and pxfs do. There is room for cleverness, but not at the SCSI or block
>> level.
>> -- richard
>
> Yeah; ISTR that IBM mainframe complexes with what they called "shared DASD"
> (DASD == Direct Access Storage Device, i.e. disk, drum, or the like)
> depended on extent reserves. IIRC, SCSI dropped extent-reserve support, and
> indeed it was never widely nor reliably available anyway. AFAIK, all SCSI
> offers is reservation of an entire LUN; that doesn't even help with slices,
> let alone anything else. Nor is ZFS extent-based anyway (unlike the VTOC
> structure on MVS or VxFS); so even if extent reserves were available, they
> would only help a little. Which means, as he says, some sort of arbitration.
>
> I wonder whether the hooks for putting the ZIL on a separate device
> will be of any use for the cluster filesystem problem; it almost makes me
> wonder if there could be any parallels between pNFS and a refactored ZFS.

We are busy layering pNFS on ZFS in the NFSv4.1 project and hope to allow for coordination with client access and other interesting features.

Spencer