This may fall into the realm of a religious war (I hope not!), but recently several people on this list have said/implied that ZFS was only acceptable for production use on FreeBSD (or Solaris, of course) rather than Linux with ZoL.

I'm working on a project at work involving a large(-ish) amount of data, about 5TB, working its way up to 12-15TB eventually, spread among a dozen or so nodes. There may or may not be a clustered filesystem involved (probably gluster if we use anything). I've been looking at ZoL as the primary filesystem for this data. We're a Linux shop, so I'd rather not switch to FreeBSD, or any of the Solaris-derived distros--although I have no problem with them, I just don't want to introduce another OS into the mix if I can avoid it.

So, the actual questions are:

Is ZoL really not ready for production use?

If not, what is holding it back? Features? Performance? Stability?

If not, then what kind of timeframe are we looking at to get past whatever is holding it back?
On Wed, Apr 25, 2012 at 05:48:57AM -0700, Paul Archer wrote:
> This may fall into the realm of a religious war (I hope not!), but
> recently several people on this list have said/implied that ZFS was
> only acceptable for production use on FreeBSD (or Solaris, of course)
> rather than Linux with ZoL.
>
> I'm working on a project at work involving a large(-ish) amount of
> data, about 5TB, working its way up to 12-15TB eventually, spread
> among a dozen or so nodes. There may or may not be a clustered
> filesystem involved (probably gluster if we use anything). I've been
> looking at ZoL as the primary filesystem for this data. We're a Linux
> shop, so I'd rather not switch to FreeBSD, or any of the
> Solaris-derived distros--although I have no problem with them, I just
> don't want to introduce another OS into the mix if I can avoid it.
>
> So, the actual questions are:
>
> Is ZoL really not ready for production use?
>
> If not, what is holding it back? Features? Performance? Stability?
>
> If not, then what kind of timeframe are we looking at to get past
> whatever is holding it back?

I can't comment directly on experiences with ZoL as I haven't used it, but it does seem to be under active development. That can be a good thing or a bad thing. :) I for one would be hesitant to use it for anything production based solely on the "youngness" of the effort.

That said, it might be worthwhile to check out the ZoL mailing lists and bug reports to see what types of issues the early adopters are running into and whether or not they are showstoppers for you, or risks you are willing to accept.

For your size requirements and your intent to use Gluster, it sounds like ext4 or xfs would be entirely suitable and are obviously more "mature" on Linux at this point.

Regardless, curious to hear which way you end up going and how things work out.

Ray
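For reference, a minimal sketch of the Gluster approach Ray mentions, layered over local ext4/xfs bricks (node names, brick paths, and the volume name are hypothetical, and the exact CLI may differ between Gluster releases):

    # On each node, prepare a local brick filesystem (ext4 or xfs)
    mkfs.xfs /dev/sdb1
    mkdir -p /bricks/data
    mount /dev/sdb1 /bricks/data

    # From any one node, create and start a replicated volume
    gluster peer probe node02
    gluster volume create projdata replica 2 \
        node01:/bricks/data node02:/bricks/data
    gluster volume start projdata

    # Clients (including the storage nodes themselves) mount the volume
    mount -t glusterfs node01:/projdata /data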
On Apr 25, 2012, at 5:48 AM, Paul Archer wrote:
> This may fall into the realm of a religious war (I hope not!), but recently several people on this list have said/implied that ZFS was only acceptable for production use on FreeBSD (or Solaris, of course) rather than Linux with ZoL.
>
> I'm working on a project at work involving a large(-ish) amount of data, about 5TB, working its way up to 12-15TB

This is pretty small by today's standards. With 4TB disks, that is only 3-4 disks + redundancy.

> eventually, spread among a dozen or so nodes. There may or may not be a clustered filesystem involved (probably gluster if we use anything).

I wouldn't dream of building a clustered file system that small. Maybe when you get into the multiple-PB range, then it might make sense.

> I've been looking at ZoL as the primary filesystem for this data. We're a Linux shop, so I'd rather not switch to FreeBSD, or any of the Solaris-derived distros--although I have no problem with them, I just don't want to introduce another OS into the mix if I can avoid it.
>
> So, the actual questions are:
>
> Is ZoL really not ready for production use?
>
> If not, what is holding it back? Features? Performance? Stability?

The computer science behind ZFS is sound. But it was also developed for Solaris, which is quite different than Linux under the covers. So the Linux and other OS ports have issues around virtual memory system differences and fault management differences. This is the classic "getting it to work is 20% of the effort, getting it to work when all else is failing is the other 80%" case.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
9:59am, Richard Elling wrote:
> On Apr 25, 2012, at 5:48 AM, Paul Archer wrote:
>
>   This may fall into the realm of a religious war (I hope not!), but recently several people on this list have
>   said/implied that ZFS was only acceptable for production use on FreeBSD (or Solaris, of course) rather than Linux
>   with ZoL.
>
>   I'm working on a project at work involving a large(-ish) amount of data, about 5TB, working its way up to 12-15TB
>
> This is pretty small by today's standards. With 4TB disks, that is only 3-4 disks + redundancy.
>
True. At my last job, we were used to researchers asking for individual 4-5TB filesystems, and 1-2TB increases in size. When I left, there was over 100TB online (in '07).

>   eventually, spread among a dozen or so nodes. There may or may not be a clustered filesystem involved (probably
>   gluster if we use anything).
>
> I wouldn't dream of building a clustered file system that small. Maybe when you get into the
> multiple-PB range, then it might make sense.
>
The point of a clustered filesystem was to be able to spread our data out among all nodes and still have access from any node without having to run NFS. The size of the data set (once you get past the point where you can replicate it on each node) is irrelevant.

>   I've been looking at ZoL as the primary filesystem for this data. We're a Linux shop, so I'd rather not switch to
>   FreeBSD, or any of the Solaris-derived distros--although I have no problem with them, I just don't want to
>   introduce another OS into the mix if I can avoid it.
>
>   So, the actual questions are:
>
>   Is ZoL really not ready for production use?
>
>   If not, what is holding it back? Features? Performance? Stability?
>
> The computer science behind ZFS is sound. But it was also developed for Solaris which
> is quite different than Linux under the covers. So the Linux and other OS ports have issues
> around virtual memory system differences and fault management differences. This is the
> classic "getting it to work is 20% of the effort, getting it to work when all else is failing is
> the other 80%" case.
>  -- richard

I understand the 80/20 rule. But this doesn't really answer the question(s). If there weren't any major differences among operating systems, the project probably would have been done long ago.

To put it slightly differently, if I used ZoL in production, would I be likely to experience performance or stability problems? Or would it be lacking in features that I would likely need?
On Apr 25, 2012, at 10:59 AM, Paul Archer wrote:
> 9:59am, Richard Elling wrote:
>
>> On Apr 25, 2012, at 5:48 AM, Paul Archer wrote:
>>
>> This may fall into the realm of a religious war (I hope not!), but recently several people on this list have
>> said/implied that ZFS was only acceptable for production use on FreeBSD (or Solaris, of course) rather than Linux
>> with ZoL.
>>
>> I'm working on a project at work involving a large(-ish) amount of data, about 5TB, working its way up to 12-15TB
>>
>> This is pretty small by today's standards. With 4TB disks, that is only 3-4 disks + redundancy.
>
> True. At my last job, we were used to researchers asking for individual 4-5TB filesystems, and 1-2TB increases in size. When I left, there was over 100TB online (in '07).

100TB is medium sized for today's systems, about 4RU or less :-)

>> eventually, spread among a dozen or so nodes. There may or may not be a clustered filesystem involved (probably
>> gluster if we use anything).
>>
>> I wouldn't dream of building a clustered file system that small. Maybe when you get into the
>> multiple-PB range, then it might make sense.
>
> The point of a clustered filesystem was to be able to spread our data out among all nodes and still have access from any node without having to run NFS. The size of the data set (once you get past the point where you can replicate it on each node) is irrelevant.

Interesting, something more complex than NFS to avoid the complexities of NFS? ;-)

>> I've been looking at ZoL as the primary filesystem for this data. We're a Linux shop, so I'd rather not switch to
>> FreeBSD, or any of the Solaris-derived distros--although I have no problem with them, I just don't want to
>> introduce another OS into the mix if I can avoid it.
>>
>> So, the actual questions are:
>>
>> Is ZoL really not ready for production use?
>>
>> If not, what is holding it back? Features? Performance? Stability?
>>
>> The computer science behind ZFS is sound. But it was also developed for Solaris which
>> is quite different than Linux under the covers. So the Linux and other OS ports have issues
>> around virtual memory system differences and fault management differences. This is the
>> classic "getting it to work is 20% of the effort, getting it to work when all else is failing is
>> the other 80%" case.
>> -- richard
>
> I understand the 80/20 rule. But this doesn't really answer the question(s). If there weren't any major differences among operating systems, the project probably would have been done long ago.

The issues are not only technical :-(

> To put it slightly differently, if I used ZoL in production, would I be likely to experience performance or stability problems? Or would it be lacking in features that I would likely need?

It seems reasonably stable for the casual use cases. As for the features, that is a much more difficult question to answer. For example, if you use ACLs, you might find that some userland tools on some distros have full or no support for ACLs.

Let us know how it works out for you.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
> To put it slightly differently, if I used ZoL in production, would I be likely to experience performance or stability problems?

I saw one team revert from ZoL (CentOS 6) back to ext on some backup servers for an application project, the killer was stat times (find running slow etc.), perhaps more layer 2 cache could have solved the problem, but it was easier to deploy ext/lvm2. The source filesystems were ext so zfs send/rcv was not an option.

You may want to check with the ZoL project about where their development is with respect to performance; I heard that the focus was on stability.

Jordan

On Wed, Apr 25, 2012 at 10:59 AM, Paul Archer <paul at paularcher.org> wrote:
> 9:59am, Richard Elling wrote:
>
>> On Apr 25, 2012, at 5:48 AM, Paul Archer wrote:
>>
>> This may fall into the realm of a religious war (I hope not!), but
>> recently several people on this list have
>> said/implied that ZFS was only acceptable for production use on
>> FreeBSD (or Solaris, of course) rather than Linux
>> with ZoL.
>>
>> I'm working on a project at work involving a large(-ish) amount of
>> data, about 5TB, working its way up to 12-15TB
>>
>> This is pretty small by today's standards. With 4TB disks, that is only
>> 3-4 disks + redundancy.
>
> True. At my last job, we were used to researchers asking for individual
> 4-5TB filesystems, and 1-2TB increases in size. When I left, there was over
> 100TB online (in '07).
>
>> eventually, spread among a dozen or so nodes. There may or may not
>> be a clustered filesystem involved (probably
>> gluster if we use anything).
>>
>> I wouldn't dream of building a clustered file system that small. Maybe
>> when you get into the
>> multiple-PB range, then it might make sense.
>
> The point of a clustered filesystem was to be able to spread our data
> out among all nodes and still have access from any node without having to
> run NFS. The size of the data set (once you get past the point where you
> can replicate it on each node) is irrelevant.
>
>> I've been looking at ZoL as the primary filesystem for this data.
>> We're a Linux shop, so I'd rather not switch to
>> FreeBSD, or any of the Solaris-derived distros--although I have no
>> problem with them, I just don't want to
>> introduce another OS into the mix if I can avoid it.
>>
>> So, the actual questions are:
>>
>> Is ZoL really not ready for production use?
>>
>> If not, what is holding it back? Features? Performance? Stability?
>>
>> The computer science behind ZFS is sound. But it was also developed for
>> Solaris which
>> is quite different than Linux under the covers. So the Linux and other OS
>> ports have issues
>> around virtual memory system differences and fault management
>> differences. This is the
>> classic "getting it to work is 20% of the effort, getting it to work when
>> all else is failing is
>> the other 80%" case.
>> -- richard
>
> I understand the 80/20 rule. But this doesn't really answer the
> question(s). If there weren't any major differences among operating
> systems, the project probably would have been done long ago.
>
> To put it slightly differently, if I used ZoL in production, would I be
> likely to experience performance or stability problems? Or would it be
> lacking in features that I would likely need?
Paul Archer
2012-Apr-25 19:04 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
11:26am, Richard Elling wrote:
> On Apr 25, 2012, at 10:59 AM, Paul Archer wrote:
>
>   The point of a clustered filesystem was to be able to spread our data out among all nodes and still have access
>   from any node without having to run NFS. Size of the data set (once you get past the point where you can replicate
>   it on each node) is irrelevant.
>
> Interesting, something more complex than NFS to avoid the complexities of NFS? ;-)
>
We have data coming in on multiple nodes (with local storage) that is needed on other multiple nodes. The only way to do that with NFS would be with a matrix of cross mounts that would be truly scary.
> I saw one team revert from ZoL (CentOS 6) back to ext on some backup servers
> for an application project, the killer was
> stat times (find running slow etc.), perhaps more layer 2 cache could have
> solved the problem, but it was easier to deploy ext/lvm2.

But stat times (think directory traversal) are horrible on ZFS/Solaris as well, at least on a workstation-class machine that doesn't run 24/7. Maybe on an always-on server with 256GB RAM or more, things would be different. For me, that's really the only pain point of using ZFS.

Sorry for not being able to contribute any ZoL experience. I've been pondering whether it's worth trying for a few months myself already. Last time I checked, it didn't support the .zfs directory (for snapshot access), which you really don't want to miss after getting used to it.
>> To put it slightly differently, if I used ZoL in production, would I be likely to experience performance or stability
>> problems?
>
> I saw one team revert from ZoL (CentOS 6) back to ext on some backup servers for an application project, the killer was
> stat times (find running slow etc.), perhaps more layer 2 cache could have solved the problem, but it was easier to deploy
> ext/lvm2.

Hmm... I've got 1.4TB in about 70K files in 2K directories, and a simple find on a cold FS took me about 6 seconds:

root at hoard22:/hpool/12/db# time find . -type d | wc
   2082    2082   32912

real    0m5.923s
user    0m0.052s
sys     0m1.012s

So I'd say I'm doing OK there. But I've got 10K disks and a fast SSD for caching.
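For anyone comparing setups, a minimal sketch of the SSD-for-caching configuration Paul describes (the pool and device names are hypothetical):

    # Add an SSD as a level-2 ARC (L2ARC) cache device to an existing pool
    zpool add hpool cache /dev/sdc

    # Confirm the cache vdev is present and watch it warm up
    zpool status hpool
    zpool iostat -v hpool 5

    # primarycache / secondarycache control what the ARC and L2ARC may hold
    # (all, metadata, or none); 'all' is the default
    zfs get primarycache,secondarycache hpool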
9:08pm, Stefan Ring wrote:
> Sorry for not being able to contribute any ZoL experience. I've been
> pondering whether it's worth trying for a few months myself already.
> Last time I checked, it didn't support the .zfs directory (for
> snapshot access), which you really don't want to miss after getting
> used to it.
>
Actually, rc8 (or was it rc7?) introduced/implemented the .zfs directory. If you're upgrading, you need to reboot, but other than that, it works perfectly.
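For readers who haven't used it, a quick sketch of what the .zfs snapshot directory provides (dataset, snapshot, and file names are hypothetical):

    # Take a snapshot and browse it read-only through the hidden .zfs directory
    zfs snapshot hpool/db@before-upgrade
    ls /hpool/db/.zfs/snapshot/before-upgrade/

    # Make the .zfs directory show up in directory listings if you prefer
    zfs set snapdir=visible hpool/db

    # Restore a single file without rolling back the whole dataset
    cp /hpool/db/.zfs/snapshot/before-upgrade/config.xml /hpool/db/config.xml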
As I understand it LLNL has very large datasets on ZFS on Linux. You could inquire with them, as well as http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/topics?pli=1 .

My guess is that it's quite stable for at least some use cases (most likely: LLNL's!), but that may not be yours. You could always... test it, but if you do then please tell us how it went :)

Nico
--
Nico Williams
2012-Apr-25 19:31 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
I agree, you need something like AFS, Lustre, or pNFS. And/or an NFS proxy to those.

Nico
--
Robert Milkowski
2012-Apr-25 21:10 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
And he will still need an underlying filesystem like ZFS for them :)

> -----Original Message-----
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
> Sent: 25 April 2012 20:32
> To: Paul Archer
> Cc: ZFS-Discuss mailing list
> Subject: Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
>
> I agree, you need something like AFS, Lustre, or pNFS. And/or an NFS proxy
> to those.
>
> Nico
Richard Elling
2012-Apr-25 21:20 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Apr 25, 2012, at 12:04 PM, Paul Archer wrote:
> 11:26am, Richard Elling wrote:
>
>> On Apr 25, 2012, at 10:59 AM, Paul Archer wrote:
>>
>> The point of a clustered filesystem was to be able to spread our data out among all nodes and still have access
>> from any node without having to run NFS. Size of the data set (once you get past the point where you can replicate
>> it on each node) is irrelevant.
>>
>> Interesting, something more complex than NFS to avoid the complexities of NFS? ;-)
>
> We have data coming in on multiple nodes (with local storage) that is needed on other multiple nodes. The only way to do that with NFS would be with a matrix of cross mounts that would be truly scary.

Ignoring lame NFS clients, how is that architecture different than what you would have with any other distributed file system? If all nodes share data to all other nodes, then...?
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Paul Archer
2012-Apr-25 21:26 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
2:20pm, Richard Elling wrote:
> On Apr 25, 2012, at 12:04 PM, Paul Archer wrote:
>
>   Interesting, something more complex than NFS to avoid the complexities of NFS? ;-)
>
>   We have data coming in on multiple nodes (with local storage) that is needed on other multiple nodes. The only way
>   to do that with NFS would be with a matrix of cross mounts that would be truly scary.
>
> Ignoring lame NFS clients, how is that architecture different than what you would have
> with any other distributed file system? If all nodes share data to all other nodes, then...?
>  -- richard
>
Simple. With a distributed FS, all nodes mount from a single DFS. With NFS, each node would have to mount from each other node. With 16 nodes, that's what, 240 mounts? Not to mention your data is in 16 different mounts/directory structures, instead of being in a unified filespace.
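To make the "matrix of cross mounts" concrete, a sketch of what each node's /etc/fstab would start to look like (hostnames and paths are hypothetical); every node needs an entry for every other node, hence the 16 x 15 = 240 mounts:

    # /etc/fstab fragment on node01 -- one line per peer node
    node02:/export/data  /data/node02  nfs  rw,hard,intr  0  0
    node03:/export/data  /data/node03  nfs  rw,hard,intr  0  0
    node04:/export/data  /data/node04  nfs  rw,hard,intr  0  0
    # ... and so on through node16, repeated (with a different peer list) on every node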
Rich Teer
2012-Apr-25 21:34 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, 25 Apr 2012, Paul Archer wrote:

> Simple. With a distributed FS, all nodes mount from a single DFS. With NFS,
> each node would have to mount from each other node. With 16 nodes, that's
> what, 240 mounts? Not to mention your data is in 16 different mounts/directory
> structures, instead of being in a unified filespace.

Perhaps I'm being overly simplistic, but in this scenario, what would prevent one from having, on a single file server, /exports/nodes/node[0-15], and then having each node NFS-mount /exports/nodes from the server? Much simpler than your example, and all data is available on all machines/nodes.

--
Rich Teer, Publisher
Vinylphile Magazine

www.vinylphilemag.com
Nico Williams
2012-Apr-25 21:53 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 4:26 PM, Paul Archer <paul at paularcher.org> wrote:
> 2:20pm, Richard Elling wrote:
>> Ignoring lame NFS clients, how is that architecture different than what
>> you would have
>> with any other distributed file system? If all nodes share data to all
>> other nodes, then...?
>
> Simple. With a distributed FS, all nodes mount from a single DFS. With NFS,
> each node would have to mount from each other node. With 16 nodes, that's
> what, 240 mounts? Not to mention your data is in 16 different
> mounts/directory structures, instead of being in a unified filespace.

To be fair NFSv4 now has a distributed namespace scheme so you could still have a single mount on the client. That said, some DFSes have better properties, such as striping of data across sets of servers, aggressive caching, and various choices of semantics (e.g., Lustre tries hard to give you POSIX cache coherency semantics).

Nico
--
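As a rough sketch of the client-side view (server and path names are hypothetical): with NFSv4 the server exports a single pseudo-filesystem root, and referrals can stitch other servers into that namespace, so a client needs only one mount:

    # One NFSv4 mount of the server's pseudo-root gives the whole namespace
    mount -t nfs4 fileserver:/ /data

    # Everything exported (or referred to) from that namespace appears underneath
    ls /data
    # node01  node02  node03  ...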
Bob Friesenhahn
2012-Apr-25 21:54 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, 25 Apr 2012, Rich Teer wrote:
>
> Perhaps I'm being overly simplistic, but in this scenario, what would prevent
> one from having, on a single file server, /exports/nodes/node[0-15], and then
> having each node NFS-mount /exports/nodes from the server? Much simpler than
> your example, and all data is available on all machines/nodes.

This solution would limit bandwidth to that available from that single server. With the cluster approach, the objective is for each machine in the cluster to primarily access files which are stored locally. Whole files could be moved as necessary.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On 04/26/12 09:54 AM, Bob Friesenhahn wrote:
> On Wed, 25 Apr 2012, Rich Teer wrote:
>> Perhaps I'm being overly simplistic, but in this scenario, what would prevent
>> one from having, on a single file server, /exports/nodes/node[0-15], and then
>> having each node NFS-mount /exports/nodes from the server? Much simpler than
>> your example, and all data is available on all machines/nodes.
>
> This solution would limit bandwidth to that available from that single
> server. With the cluster approach, the objective is for each machine
> in the cluster to primarily access files which are stored locally.
> Whole files could be moved as necessary.

Distributed software building faces similar issues, but I've found once the common files have been read (and cached) by each node, network traffic becomes one way (to the file server). I guess that topology works well when most access to shared data is read.

--
Ian.
Richard Elling
2012-Apr-25 22:22 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Apr 25, 2012, at 2:26 PM, Paul Archer wrote:
> 2:20pm, Richard Elling wrote:
>
>> On Apr 25, 2012, at 12:04 PM, Paul Archer wrote:
>>
>> Interesting, something more complex than NFS to avoid the complexities of NFS? ;-)
>>
>> We have data coming in on multiple nodes (with local storage) that is needed on other multiple nodes. The only way
>> to do that with NFS would be with a matrix of cross mounts that would be truly scary.
>>
>> Ignoring lame NFS clients, how is that architecture different than what you would have
>> with any other distributed file system? If all nodes share data to all other nodes, then...?
>> -- richard
>
> Simple. With a distributed FS, all nodes mount from a single DFS. With NFS, each node would have to mount from each other node. With 16 nodes, that's what, 240 mounts? Not to mention your data is in 16 different mounts/directory structures, instead of being in a unified filespace.

Unified namespace doesn't relieve you of 240 cross-mounts (or equivalents). FWIW, automounters were invented 20+ years ago to handle this in a nearly seamless manner. Today, we have DFS from Microsoft and NFS referrals that almost eliminate the need for automounter-like solutions. Also, it is not unusual for a NFS environment to have 10,000+ mounts with thousands of mounts on each server. No big deal, happens every day.

On Apr 25, 2012, at 2:53 PM, Nico Williams wrote:
> To be fair NFSv4 now has a distributed namespace scheme so you could
> still have a single mount on the client. That said, some DFSes have
> better properties, such as striping of data across sets of servers,
> aggressive caching, and various choices of semantics (e.g., Lustre
> tries hard to give you POSIX cache coherency semantics).

I think this is where the real value is. NFS & CIFS are intentionally generic and have caching policies that are favorably described as generic. For special-purpose workloads there can be advantages to having policies more explicitly applicable to the workload.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
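For readers who haven't used it, a minimal sketch of the automounter approach with indirect maps (map names, hostnames, and paths are hypothetical); the same two map files, distributed via NIS/LDAP or a config tool, give every client the same /data/nodeNN namespace without static cross-mounts:

    # /etc/auto_master -- anything under /data is handled by the auto_data map
    /data   auto_data   -nosuid

    # /etc/auto_data -- indirect map: key is the directory name under /data,
    # value is the server and export to mount on first access
    node01  node01:/export/data
    node02  node02:/export/data
    node03  node03:/export/data
    # ... one line per node; '&' key substitution can shorten this to:
    # *     &:/export/data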
Paul Archer
2012-Apr-25 22:34 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
2:34pm, Rich Teer wrote:
> On Wed, 25 Apr 2012, Paul Archer wrote:
>
>> Simple. With a distributed FS, all nodes mount from a single DFS. With NFS,
>> each node would have to mount from each other node. With 16 nodes, that's
>> what, 240 mounts? Not to mention your data is in 16 different mounts/directory
>> structures, instead of being in a unified filespace.
>
> Perhaps I'm being overly simplistic, but in this scenario, what would prevent
> one from having, on a single file server, /exports/nodes/node[0-15], and then
> having each node NFS-mount /exports/nodes from the server? Much simpler than
> your example, and all data is available on all machines/nodes.
>
That assumes the data set will fit on one machine, and that machine won't be a performance bottleneck.
Nico Williams
2012-Apr-25 22:36 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 5:22 PM, Richard Elling <richard.elling at gmail.com> wrote:
> Unified namespace doesn't relieve you of 240 cross-mounts (or equivalents). FWIW,
> automounters were invented 20+ years ago to handle this in a nearly seamless manner.
> Today, we have DFS from Microsoft and NFS referrals that almost eliminate the need
> for automounter-like solutions.

I disagree vehemently. automount is a disaster because you need to synchronize changes with all those clients. That's not realistic.

I've built a large automount-based namespace, replete with a distributed configuration system for setting the environment variables available to the automounter. I can tell you this: the automounter does not scale, and it certainly does not avoid the need for outages when storage migrates.

With server-side, referral-based namespace construction that problem goes away, and the whole thing can be transparent w.r.t. migrations.

For my money the key features a DFS must have are:

 - server-driven namespace construction
 - data migration without having to restart clients, reconfigure them, or do anything at all to them
 - aggressive caching
 - striping of file data for HPC and media environments
 - semantics that ultimately allow multiple processes on disparate clients to cooperate (i.e., byte range locking), but I don't think full POSIX semantics are needed (that said, I think O_EXCL is necessary, and it'd be very nice to have O_APPEND, though the latter is particularly difficult to implement and painful when there's contention if you stripe file data across multiple servers)

Nico
--
On 04/26/12 10:34 AM, Paul Archer wrote:
> 2:34pm, Rich Teer wrote:
>
>> On Wed, 25 Apr 2012, Paul Archer wrote:
>>
>>> Simple. With a distributed FS, all nodes mount from a single DFS. With NFS,
>>> each node would have to mount from each other node. With 16 nodes, that's
>>> what, 240 mounts? Not to mention your data is in 16 different mounts/directory
>>> structures, instead of being in a unified filespace.
>>
>> Perhaps I'm being overly simplistic, but in this scenario, what would prevent
>> one from having, on a single file server, /exports/nodes/node[0-15], and then
>> having each node NFS-mount /exports/nodes from the server? Much simpler than
>> your example, and all data is available on all machines/nodes.
>
> That assumes the data set will fit on one machine, and that machine won't be a
> performance bottleneck.

Aren't those general considerations when specifying a file server?

--
Ian.
Richard Elling
2012-Apr-26 00:37 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Apr 25, 2012, at 3:36 PM, Nico Williams wrote:
> On Wed, Apr 25, 2012 at 5:22 PM, Richard Elling
> <richard.elling at gmail.com> wrote:
>> Unified namespace doesn't relieve you of 240 cross-mounts (or equivalents). FWIW,
>> automounters were invented 20+ years ago to handle this in a nearly seamless manner.
>> Today, we have DFS from Microsoft and NFS referrals that almost eliminate the need
>> for automounter-like solutions.
>
> I disagree vehemently. automount is a disaster because you need to
> synchronize changes with all those clients. That's not realistic.

Really? I did it with NIS automount maps and 600+ clients back in 1991. Other than the obvious problems with open files, has it gotten worse since then?

> I've built a large automount-based namespace, replete with a
> distributed configuration system for setting the environment variables
> available to the automounter. I can tell you this: the automounter
> does not scale, and it certainly does not avoid the need for outages
> when storage migrates.

Storage migration is much more difficult with NFSv2, NFSv3, NetWare, etc.

> With server-side, referral-based namespace construction that problem
> goes away, and the whole thing can be transparent w.r.t. migrations.

Agree, but we didn't have NFSv4 back in 1991 :-) Today, of course, this is how one would design it if you had to design a new DFS today.

> For my money the key features a DFS must have are:
>
>  - server-driven namespace construction
>  - data migration without having to restart clients,
>    reconfigure them, or do anything at all to them
>  - aggressive caching
>  - striping of file data for HPC and media environments
>  - semantics that ultimately allow multiple processes
>    on disparate clients to cooperate (i.e., byte range
>    locking), but I don't think full POSIX semantics are
>    needed

Almost any of the popular nosql databases offer this and more. The movement away from POSIX-ish DFS and storing data in traditional "files" is inevitable. Even ZFS is an object store at its core.

> (that said, I think O_EXCL is necessary, and it'd be
> very nice to have O_APPEND, though the latter is
> particularly difficult to implement and painful when
> there's contention if you stripe file data across
> multiple servers)

+1
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
On Wed, Apr 25, 2012 at 5:42 PM, Ian Collins <ian at ianshome.com> wrote:
> Aren't those general considerations when specifying a file server?

There are Lustre clusters with thousands of nodes, hundreds of them being servers, and high utilization rates. Whatever specs you might have for one server head will not meet the demand that hundreds of the same can.

Nico
--
Nico Williams
2012-Apr-26 01:07 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling <richard.elling at gmail.com> wrote:
> On Apr 25, 2012, at 3:36 PM, Nico Williams wrote:
>> I disagree vehemently. automount is a disaster because you need to
>> synchronize changes with all those clients. That's not realistic.
>
> Really? I did it with NIS automount maps and 600+ clients back in 1991.
> Other than the obvious problems with open files, has it gotten worse since
> then?

Nothing's changed. Automounter + data migration -> rebooting clients (or close enough to rebooting). I.e., outage.

> Storage migration is much more difficult with NFSv2, NFSv3, NetWare, etc.

But not with AFS. And spec-wise not with NFSv4 (though I don't know if/when all NFSv4 clients will properly support migration, just that the protocol and some servers do).

> With server-side, referral-based namespace construction that problem
> goes away, and the whole thing can be transparent w.r.t. migrations.

Yes.

> Agree, but we didn't have NFSv4 back in 1991 :-) Today, of course, this
> is how one would design it if you had to design a new DFS today.

Indeed, that's why I built an automounter solution in 1996 (that's still in use, I'm told). Although to be fair AFS existed back then and had global namespace and data migration back then, and was mature. It's taken NFS that long to catch up...

> [...]
>
> Almost any of the popular nosql databases offer this and more.
> The movement away from POSIX-ish DFS and storing data in
> traditional "files" is inevitable. Even ZFS is an object store at its core.

I agree. Except that there are applications where large octet streams are needed. HPC, media come to mind.

Nico
--
Tomorrow, Ian Collins wrote:
> On 04/26/12 10:34 AM, Paul Archer wrote:
>> That assumes the data set will fit on one machine, and that machine won't be a
>> performance bottleneck.
>
> Aren't those general considerations when specifying a file server?
>
I suppose. But I meant specifically that our data will not fit on one single machine, and we are relying on spreading it across more nodes to get it on more spindles as well.
Paul Kraus
2012-Apr-26 01:57 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 9:07 PM, Nico Williams <nico at cryptonector.com> wrote:
> On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling
> <richard.elling at gmail.com> wrote:
>> On Apr 25, 2012, at 3:36 PM, Nico Williams wrote:
>>> I disagree vehemently. automount is a disaster because you need to
>>> synchronize changes with all those clients. That's not realistic.
>>
>> Really? I did it with NIS automount maps and 600+ clients back in 1991.
>> Other than the obvious problems with open files, has it gotten worse since
>> then?
>
> Nothing's changed. Automounter + data migration -> rebooting clients
> (or close enough to rebooting). I.e., outage.

Uhhh, not if you design your automounter architecture correctly and (as Richard said) have NFS clients that are not lame, to which I'll add, automounters that actually work as advertised. I was designing automount architectures that permitted dynamic changes with minimal to no outages in the late 1990's. I only had a little over 100 clients (most of which were also servers) and NIS+ (NIS ver. 3) to distribute the indirect automount maps.

I also had to _redesign_ a number of automount strategies that were built by people who thought that using direct maps for everything was a good idea. That _was_ a pain in the a** due to the changes needed at the applications to point at a different hierarchy.

It all depends on _what_ the application is doing. Something that opens and locks a file and never releases the lock or closes the file until the application exits will require a restart of the application with an automounter / NFS approach.

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
On 4/25/12 6:57 PM, Paul Kraus wrote:
> On Wed, Apr 25, 2012 at 9:07 PM, Nico Williams<nico at cryptonector.com> wrote:
>> On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling
>> <richard.elling at gmail.com> wrote:
>>
>> Nothing's changed. Automounter + data migration -> rebooting clients
>> (or close enough to rebooting). I.e., outage.
>
> Uhhh, not if you design your automounter architecture correctly
> and (as Richard said) have NFS clients that are not lame to which I'll
> add, automounters that actually work as advertised. I was designing

And applications that don't pin the mount points, and can be idled during the migration. If your migration is due to a dead server, and you have pending writes, you have no choice but to reboot the client(s) (and accept the data loss, of course).

Which is why we use AFS for RO replicated data, and NetApp clusters with SnapMirror and VIPs for RW data.

To bring this back to ZFS, sadly ZFS doesn't support NFS HA without shared / replicated storage, as ZFS send / recv can't preserve the data necessary to have the same NFS filehandle, so failing over to a replica causes stale NFS filehandles on the clients. Which frustrates me, because the technology to do NFS shadow copy (which is possible in Solaris - not sure about the open source forks) is a superset of that needed to do HA, but can't be used for HA.

--
Carson
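For context, the ZFS replication Carson refers to would typically be snapshot-based send/receive along these lines (pool, dataset, and host names are hypothetical); it keeps a replica current, but the replica is a different filesystem, so NFSv[23] clients that fail over to it see stale filehandles:

    # Initial full replication of the dataset to a standby server
    zfs snapshot tank/data@rep1
    zfs send tank/data@rep1 | ssh standby zfs receive tank/data

    # Periodic incremental updates thereafter
    zfs snapshot tank/data@rep2
    zfs send -i @rep1 tank/data@rep2 | ssh standby zfs receive -F tank/data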
Nico Williams
2012-Apr-26 03:47 UTC
[zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 8:57 PM, Paul Kraus <pk1048 at gmail.com> wrote:
> On Wed, Apr 25, 2012 at 9:07 PM, Nico Williams <nico at cryptonector.com> wrote:
>> Nothing's changed. Automounter + data migration -> rebooting clients
>> (or close enough to rebooting). I.e., outage.
>
> Uhhh, not if you design your automounter architecture correctly
> and (as Richard said) have NFS clients that are not lame to which I'll
> add, automounters that actually work as advertised. I was designing
> automount architectures that permitted dynamic changes with minimal to
> no outages in the late 1990's. I only had a little over 100 clients
> (most of which were also servers) and NIS+ (NIS ver. 3) to distribute
> the indirect automount maps.

Further below you admit that you're talking about read-only data, effectively. But the world is not static. Sure, *code* is by and large static, and indeed, we segregated data by whether it was read-only (code, historical data) or not (application data, home directories). We were able to migrate *read-only* data with no outages. But for the rest? Yeah, there were always outages.

Of course, we had a periodic maintenance window, with all systems rebooting within a short period, and this meant that some data migration outages were not noticeable, but they were real.

> I also had to _redesign_ a number of automount strategies that
> were built by people who thought that using direct maps for everything
> was a good idea. That _was_ a pain in the a** due to the changes
> needed at the applications to point at a different hierarchy.

We used indirect maps almost exclusively. Moreover, we used hierarchical automount entries, and even -autofs mounts. We also used environment variables to control various things, such as which servers to mount what from (this was particularly useful for spreading the load on read-only static data). We used practically every feature of the automounter except for executable maps (and direct maps, when we eventually stopped using those).

> It all depends on _what_ the application is doing. Something that
> opens and locks a file and never releases the lock or closes the file
> until the application exits will require a restart of the application
> with an automounter / NFS approach.

No kidding! In the real world such applications exist and get used.

Nico
--
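A minimal sketch of the variable-driven indirect map trick Nico mentions for spreading read-only load (the map, variable, and server names are hypothetical):

    # /etc/auto_tools -- indirect map using key substitution ('&') and a
    # per-client variable; $SRVGRP is set per client, e.g. automountd -D SRVGRP=a
    *   toolsrv-$SRVGRP:/export/tools/&

    # A client in group "a" accessing /tools/gcc mounts toolsrv-a:/export/tools/gcc
    # A client in group "b" accessing /tools/gcc mounts toolsrv-b:/export/tools/gcc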
On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:
> On 4/25/12 6:57 PM, Paul Kraus wrote:
>> On Wed, Apr 25, 2012 at 9:07 PM, Nico Williams<nico at cryptonector.com> wrote:
>>> On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling
>>> <richard.elling at gmail.com> wrote:
>>>
>>> Nothing's changed. Automounter + data migration -> rebooting clients
>>> (or close enough to rebooting). I.e., outage.
>>
>> Uhhh, not if you design your automounter architecture correctly
>> and (as Richard said) have NFS clients that are not lame to which I'll
>> add, automounters that actually work as advertised. I was designing
>
> And applications that don't pin the mount points, and can be idled during the migration. If your migration is due to a dead server, and you have pending writes, you have no choice but to reboot the client(s) (and accept the data loss, of course).

Reboot requirement is a lame client implementation.

> Which is why we use AFS for RO replicated data, and NetApp clusters with SnapMirror and VIPs for RW data.
>
> To bring this back to ZFS, sadly ZFS doesn't support NFS HA without shared / replicated storage, as ZFS send / recv can't preserve the data necessary to have the same NFS filehandle, so failing over to a replica causes stale NFS filehandles on the clients. Which frustrates me, because the technology to do NFS shadow copy (which is possible in Solaris - not sure about the open source forks) is a superset of that needed to do HA, but can't be used for HA.

You are correct, a ZFS send/receive will result in different file handles on the receiver, just like rsync, tar, ufsdump+ufsrestore, etc. Do you mean the Sun ZFS Storage 7000 Shadow Migration feature? This is not a HA feature, it is an interposition architecture.

It is possible to preserve NFSv[23] file handles in a ZFS environment using lower-level replication like TrueCopy, SRDF, AVS, etc. But those have other architectural issues (aka suckage). I am open to looking at what it would take to make a ZFS-friendly replicator that would do this, but need to know the business case. [1]

The beauty of AFS and others is that the file handle equivalent is not a number. NFSv4 also has this feature. So I have a little bit of heartburn when people say, "NFS sux because it has a feature I won't use because I won't upgrade to NFSv4 even though it was released 10 years ago."

As Nico points out, there are cases where you really need a Lustre, Ceph, Gluster, or other parallel file system. That is not the design point for ZFS's ZPL or volume interfaces.

[1] FWIW, you can build a metropolitan area ZFS-based, shared storage cluster today for about 1/4 the cost of the NetApp Stretch Metro software license. There is more than one way to skin a cat :-) So if the idea is to get even lower than 1/4 the NetApp cost, it feels like a race to the bottom.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
On Thu, Apr 26, 2012 at 12:10 AM, Richard Elling <richard.elling at gmail.com> wrote:
> On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:
> Reboot requirement is a lame client implementation.

And lame protocol design. You could possibly migrate read-write NFSv3 on the fly by preserving FHs and somehow updating the clients to go to the new server (with a hiccup in between, no doubt), but only entire shares at a time -- you could not migrate only part of a volume with NFSv3.

Of course, having migration support in the protocol does not equate to getting it in the implementation, but it's certainly a good step in that direction.

> You are correct, a ZFS send/receive will result in different file
> handles on the receiver, just like
> rsync, tar, ufsdump+ufsrestore, etc.

That's understandable for NFSv2 and v3, but for v4 there's no reason that an NFSv4 server stack and ZFS could not arrange to preserve FHs (if, perhaps, at the price of making the v4 FHs rather large). Although even for v3 it should be possible for servers in a cluster to arrange to preserve devids...

Bottom line: live migration needs to be built right into the protocol.

For me one of the exciting things about Lustre was/is the idea that you could just have a single volume where all new data (and metadata) is distributed evenly as you go. Need more storage? Plug it in, either to an existing head or via a new head, then flip a switch and there it is. No need to manage allocation. Migration may still be needed, both within a cluster and between clusters, but that's much more manageable when you have a protocol where data locations can be all over the place in a completely transparent manner.

Nico
--
I'll jump into this loop with a different alternative -- an IP-based block device. I have seen a few successful cases with "HAST + UCARP + ZFS + FreeBSD". If zfsonlinux is robust enough, trying "DRBD + PACEMAKER + ZFS + LINUX" is definitely encouraged.

Thanks.
Fred

> -----Original Message-----
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
> Sent: Thursday, April 26, 2012 14:00
> To: Richard Elling
> Cc: zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] cluster vs nfs
>
> On Thu, Apr 26, 2012 at 12:10 AM, Richard Elling
> <richard.elling at gmail.com> wrote:
>> On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:
>> Reboot requirement is a lame client implementation.
>
> And lame protocol design. You could possibly migrate read-write NFSv3
> on the fly by preserving FHs and somehow updating the clients to go to
> the new server (with a hiccup in between, no doubt), but only entire
> shares at a time -- you could not migrate only part of a volume with
> NFSv3.
>
> Of course, having migration support in the protocol does not equate to
> getting it in the implementation, but it's certainly a good step in
> that direction.
>
>> You are correct, a ZFS send/receive will result in different file
>> handles on the receiver, just like
>> rsync, tar, ufsdump+ufsrestore, etc.
>
> That's understandable for NFSv2 and v3, but for v4 there's no reason
> that an NFSv4 server stack and ZFS could not arrange to preserve FHs
> (if, perhaps, at the price of making the v4 FHs rather large).
> Although even for v3 it should be possible for servers in a cluster to
> arrange to preserve devids...
>
> Bottom line: live migration needs to be built right into the protocol.
>
> For me one of the exciting things about Lustre was/is the idea that
> you could just have a single volume where all new data (and metadata)
> is distributed evenly as you go. Need more storage? Plug it in,
> either to an existing head or via a new head, then flip a switch and
> there it is. No need to manage allocation. Migration may still be
> needed, both within a cluster and between clusters, but that's much
> more manageable when you have a protocol where data locations can be
> all over the place in a completely transparent manner.
>
> Nico
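A rough sketch of the ZFS-on-DRBD idea (the DRBD resource, device, and pool names are hypothetical; in practice Pacemaker would drive these failover steps rather than a human):

    # On the active node: the pool lives on the replicated DRBD device
    drbdadm primary r0
    zpool create tank /dev/drbd0
    zfs create tank/export

    # On failover, the standby promotes the DRBD resource and imports the pool
    drbdadm primary r0
    zpool import -f tank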
On 2012-04-26 2:20, Ian Collins wrote:
> On 04/26/12 09:54 AM, Bob Friesenhahn wrote:
>> On Wed, 25 Apr 2012, Rich Teer wrote:
>>> Perhaps I'm being overly simplistic, but in this scenario, what would prevent
>>> one from having, on a single file server, /exports/nodes/node[0-15], and then
>>> having each node NFS-mount /exports/nodes from the server? Much simpler than
>>> your example, and all data is available on all machines/nodes.
>> This solution would limit bandwidth to that available from that single
>> server. With the cluster approach, the objective is for each machine
>> in the cluster to primarily access files which are stored locally.
>> Whole files could be moved as necessary.
>
> Distributed software building faces similar issues, but I've found once
> the common files have been read (and cached) by each node, network
> traffic becomes one way (to the file server). I guess that topology
> works well when most access to shared data is read.

Which reminds me: older Solarises used to have a nifty-looking (via descriptions) cachefs, apparently to speed up NFS clients and reduce traffic, which we did not get to really use in real life. AFAIK Oracle EOLed it for Solaris 11, and I am not sure it is in illumos either.

Does caching in the current Solaris/illumos NFS client replace those benefits, or did the project have some merits of its own (like caching into local storage of the client, so that the cache was not empty after reboot)?
On 26 April, 2012 - Jim Klimov sent me these 1,6K bytes:

> Which reminds me: older Solarises used to have a nifty-looking
> (via descriptions) cachefs, apparently to speed up NFS clients
> and reduce traffic, which we did not get to really use in real
> life. AFAIK Oracle EOLed it for Solaris 11, and I am not sure
> it is in illumos either.
>
> Does caching in the current Solaris/illumos NFS client replace those
> benefits, or did the project have some merits of its own (like
> caching into local storage of the client, so that the cache was not
> empty after reboot)?

It had its share of merits and bugs.

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On 04/26/12 10:12 PM, Jim Klimov wrote:
> On 2012-04-26 2:20, Ian Collins wrote:
>> On 04/26/12 09:54 AM, Bob Friesenhahn wrote:
>>> On Wed, 25 Apr 2012, Rich Teer wrote:
>>>> Perhaps I'm being overly simplistic, but in this scenario, what would prevent
>>>> one from having, on a single file server, /exports/nodes/node[0-15], and then
>>>> having each node NFS-mount /exports/nodes from the server? Much simpler than
>>>> your example, and all data is available on all machines/nodes.
>>> This solution would limit bandwidth to that available from that single
>>> server. With the cluster approach, the objective is for each machine
>>> in the cluster to primarily access files which are stored locally.
>>> Whole files could be moved as necessary.
>> Distributed software building faces similar issues, but I've found once
>> the common files have been read (and cached) by each node, network
>> traffic becomes one way (to the file server). I guess that topology
>> works well when most access to shared data is read.
>
> Which reminds me: older Solarises used to have a nifty-looking
> (via descriptions) cachefs, apparently to speed up NFS clients
> and reduce traffic, which we did not get to really use in real
> life. AFAIK Oracle EOLed it for Solaris 11, and I am not sure
> it is in illumos either.

I don't think it even made it into Solaris 10. I used to use it with Solaris 8 back in the days when 100Mb switches were exotic!

> Does caching in the current Solaris/illumos NFS client replace those
> benefits, or did the project have some merits of its own (like
> caching into local storage of the client, so that the cache was not
> empty after reboot)?

It did have local backing store, but my current desktop has more RAM than that Solaris 8 box had disk and my network is 100 times faster, so it doesn't really matter any more.

--
Ian.
On 2012-04-26 14:47, Ian Collins wrote:
> I don't think it even made it into Solaris 10.

Actually, I see the kernel modules available in both Solaris 10, several builds of OpenSolaris SXCE, and an illumos-current.

$ find /kernel/ /platform/ /usr/platform/ /usr/kernel/ | grep -i cachefs
/kernel/fs/amd64/cachefs
/kernel/fs/cachefs
/platform/i86pc/amd64/archive_cache/kernel/fs/amd64/cachefs
/platform/i86pc/archive_cache/kernel/fs/cachefs
$ uname -a
SunOS summit-blade5 5.11 oi_151a2 i86pc i386 i86pc

> It did have local backing store, but my current desktop has more RAM
> than that Solaris 8 box had disk and my network is 100 times faster, so
> it doesn't really matter any more.

Well, it depends on your working set size. A matter of scale. If those researchers dig into their terabyte of data each ("each" seems important here for conflict/sync resolution), on a gigabit-connected workstation, it would still take them a couple of minutes to just download the dataset from the server, let alone random-seek around it afterwards. And you can easily have a local backing store for such a cachefs (or equivalent) today, even on an SSD or a few.

Just my 2c for a possible build of that cluster they wanted, and perhaps some evolution/revival of cachefs with today's realities and demands - if it's deemed appropriate for their task.

MY THEORY based on marketing info: I believe they could make a central fileserver with enough data space for everyone, and each worker would use cachefs+nfs to access it. Their actual worksets would be stored locally in the cachefs backing stores on each workstation, and not abuse networking traffic and the fileserver until there are some writes to be replicated into central storage. They would have approximately one common share to mount ;)

//Jim
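For reference, the classic cachefs setup being discussed looked roughly like this on a Solaris client (cache directory, server, and mount point names are hypothetical):

    # Create a local cache directory on the client (ideally on fast local disk/SSD)
    cfsadmin -c /var/cache/nfscache

    # Mount the NFS share through cachefs so reads are cached locally on disk
    mount -F cachefs -o backfstype=nfs,cachedir=/var/cache/nfscache \
        fileserver:/export/data /data

    # Inspect cache hit statistics
    cachefsstat /data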
On 04/26/12 04:17 PM, Ian Collins wrote:
> On 04/26/12 10:12 PM, Jim Klimov wrote:
>> On 2012-04-26 2:20, Ian Collins wrote:
>>> On 04/26/12 09:54 AM, Bob Friesenhahn wrote:
>>>> On Wed, 25 Apr 2012, Rich Teer wrote:
>>>>> Perhaps I'm being overly simplistic, but in this scenario, what would prevent
>>>>> one from having, on a single file server, /exports/nodes/node[0-15], and then
>>>>> having each node NFS-mount /exports/nodes from the server? Much simpler than
>>>>> your example, and all data is available on all machines/nodes.
>>>> This solution would limit bandwidth to that available from that single
>>>> server. With the cluster approach, the objective is for each machine
>>>> in the cluster to primarily access files which are stored locally.
>>>> Whole files could be moved as necessary.
>>> Distributed software building faces similar issues, but I've found once
>>> the common files have been read (and cached) by each node, network
>>> traffic becomes one way (to the file server). I guess that topology
>>> works well when most access to shared data is read.
>> Which reminds me: older Solarises used to have a nifty-looking
>> (via descriptions) cachefs, apparently to speed up NFS clients
>> and reduce traffic, which we did not get to really use in real
>> life. AFAIK Oracle EOLed it for Solaris 11, and I am not sure
>> it is in illumos either.
>
> I don't think it even made it into Solaris 10. I used to use it with
> Solaris 8 back in the days when 100Mb switches were exotic!

cachefs is present in Solaris 10. It is EOL'd in S11.

>> Does caching in the current Solaris/illumos NFS client replace those
>> benefits, or did the project have some merits of its own (like
>> caching into local storage of the client, so that the cache was not
>> empty after reboot)?
>
> It did have local backing store, but my current desktop has more RAM
> than that Solaris 8 box had disk and my network is 100 times faster,
> so it doesn't really matter any more.
On Thu, Apr 26, 2012 at 4:34 AM, Deepak Honnalli <deepak.honnalli at oracle.com> wrote:
> cachefs is present in Solaris 10. It is EOL'd in S11.

And for those who need/want to use Linux, the equivalent is FSCache.

--
Freddie Cash
fjwcash at gmail.com
On Apr 25, 2012, at 11:00 PM, Nico Williams wrote:
> On Thu, Apr 26, 2012 at 12:10 AM, Richard Elling
> <richard.elling at gmail.com> wrote:
>> On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:
>> Reboot requirement is a lame client implementation.
>
> And lame protocol design. You could possibly migrate read-write NFSv3
> on the fly by preserving FHs and somehow updating the clients to go to
> the new server (with a hiccup in between, no doubt), but only entire
> shares at a time -- you could not migrate only part of a volume with
> NFSv3.

Requirements, requirements, requirements... boil the ocean while we're at it? :-)

> Of course, having migration support in the protocol does not equate to
> getting it in the implementation, but it's certainly a good step in
> that direction.

NFSv4 has support for migrating volumes and managing the movement of file handles. The technique includes filehandle expiry, similar to methods used in other distributed FSs.

>> You are correct, a ZFS send/receive will result in different file handles on
>> the receiver, just like
>> rsync, tar, ufsdump+ufsrestore, etc.
>
> That's understandable for NFSv2 and v3, but for v4 there's no reason
> that an NFSv4 server stack and ZFS could not arrange to preserve FHs
> (if, perhaps, at the price of making the v4 FHs rather large).

This is already in the v4 spec.

> Although even for v3 it should be possible for servers in a cluster to
> arrange to preserve devids...

We've been doing that for many years.

> Bottom line: live migration needs to be built right into the protocol.

Agree, and volume migration support is already in the NFSv4 spec.

> For me one of the exciting things about Lustre was/is the idea that
> you could just have a single volume where all new data (and metadata)
> is distributed evenly as you go. Need more storage? Plug it in,
> either to an existing head or via a new head, then flip a switch and
> there it is. No need to manage allocation. Migration may still be
> needed, both within a cluster and between clusters, but that's much
> more manageable when you have a protocol where data locations can be
> all over the place in a completely transparent manner.

Many distributed file systems do this, at the cost of being not quite POSIX-ish. In the brave new world of storage vmotion, nosql, and distributed object stores, it is not clear to me that coding to a POSIX file system is a strong requirement.

Perhaps people are so tainted by experiences with v2 and v3 that we can explain the non-migration to v4 as being due to poor marketing? As a leader of NFS, Sun had unimpressive marketing.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
On 4/25/12 10:10 PM, Richard Elling wrote:
> On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:
>
>> And applications that don't pin the mount points, and can be idled
>> during the migration. If your migration is due to a dead server, and
>> you have pending writes, you have no choice but to reboot the
>> client(s) (and accept the data loss, of course).
>
> Reboot requirement is a lame client implementation.

Then it's a lame client misfeature of every single NFS client I've ever
seen, assuming the mount is "hard" (and if a RW mount isn't, you're crazy).

>> To bring this back to ZFS, sadly ZFS doesn't support NFS HA without
>> shared / replicated storage, as ZFS send/recv can't preserve the
>> data necessary to have the same NFS filehandle, so failing over to a
>> replica causes stale NFS filehandles on the clients. Which frustrates
>> me, because the technology to do NFS shadow copy (which is possible in
>> Solaris - not sure about the open source forks) is a superset of that
>> needed to do HA, but can't be used for HA.
>
> You are correct, a ZFS send/receive will result in different file
> handles on the receiver, just like rsync, tar, ufsdump+ufsrestore, etc.

But unlike SnapMirror.

> It is possible to preserve NFSv[23] file handles in a ZFS environment
> using lower-level replication like TrueCopy, SRDF, AVS, etc. But those
> have other architectural issues (aka suckage). I am open to looking at
> what it would take to make a ZFS-friendly replicator that would do
> this, but need to know the business case [1]

See below.

> The beauty of AFS and others is that the file handle equivalent is not
> a number. NFSv4 also has this feature. So I have a little bit of
> heartburn when people say, "NFS sux because it has a feature I won't
> use because I won't upgrade to NFSv4 even though it was released 10
> years ago."

NFSv4 implementations are still iffy. We've tried it - it hasn't been
stable (on Linux, at least). However we haven't tested RHEL 6 yet.

Are you saying that if we have a Solaris NFSv4 server serving Solaris and
Linux NFSv4 clients with ZFS send/recv replication, we can flip a VIP to
point at the replica target and the clients won't get stale filehandles?
Or that this is not the case today, but that it would be easier to make
work for v4 than for v[23] filehandles?

> [1] FWIW, you can build a metropolitan area ZFS-based, shared storage
> cluster today for about 1/4 the cost of the NetApp Stretch Metro
> software license. There is more than one way to skin a cat :-) So if
> the idea is to get even lower than 1/4 the NetApp cost, it feels like
> a race to the bottom.

Shared storage is evil (in this context). Corrupt the storage, and you
have no DR. That goes for all block-based replication products as well.
This is not an acceptable risk. I keep looking for a non-block-based
replication system that allows seamless client failover, and can't find
anything but NetApp SnapMirror. Please tell me I haven't been looking
hard enough.

Lustre et al. don't support Solaris clients (which I find hilarious,
since Oracle owns it). I could build something on top of / under AFS for
RW replication if I tried hard, but it would be fairly fragile.

--
Carson
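(For context, the send/receive replication being discussed looks roughly
like the sketch below; pool, dataset, and host names are made up. The
receiver ends up with identical data but new on-disk object identity,
which is why NFSv[23] clients see stale filehandles after flipping a VIP
to the replica.)

    # initial full replication to the standby host
    zfs snapshot tank/export@rep1
    zfs send tank/export@rep1 | ssh standby zfs receive -F backup/export

    # thereafter, send only the incremental changes
    zfs snapshot tank/export@rep2
    zfs send -i tank/export@rep1 tank/export@rep2 | \
        ssh standby zfs receive backup/export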
> Shared storage is evil (in this context). Corrupt the storage, and you
> have no DR.

Now I am confused. We're talking about storage which can be used for
failover, aren't we? In which case we are talking about HA, not DR.

> That goes for all block-based replication products as well. This is
> not acceptable risk. I keep looking for a non-block-based replication
> system that allows seamless client failover, and can't find anything
> but NetApp SnapMirror.

I don't know SnapMirror, so I may be mistaken, but I don't see how you can
have non-synchronous replication which can allow for seamless client
failover (in the general case). Technically this doesn't have to be block
based, but I've not seen anything which wasn't. Synchronous replication
pretty much precludes DR (again, I can think of theoretical ways around
this, but have never come across anything in practice).

> Carson

Julian
On 4/26/12 2:17 PM, J.P. King wrote:
>
>> Shared storage is evil (in this context). Corrupt the storage, and you
>> have no DR.
>
> Now I am confused. We're talking about storage which can be used for
> failover, aren't we? In which case we are talking about HA, not DR.

Depends on how you define DR - we have shared-storage HA in each
datacenter (NetApp cluster), and replication between them in case we lose
a datacenter (all clients on the MAN hit the same cluster unless we do a
DR failover). The latter is what I'm calling DR.

>> That goes for all block-based replication products as well. This is
>> not acceptable risk. I keep looking for a non-block-based replication
>> system that allows seamless client failover, and can't find anything
>> but NetApp SnapMirror.
>
> I don't know SnapMirror, so I may be mistaken, but I don't see how you
> can have non-synchronous replication which can allow for seamless
> client failover (in the general case). Technically this doesn't have to
> be block based, but I've not seen anything which wasn't. Synchronous
> replication pretty much precludes DR (again, I can think of theoretical
> ways around this, but have never come across anything in practice).

"Seamless" is an overstatement, I agree. NetApp has synchronous
SnapMirror (which is only mostly synchronous...). Worst case, clients may
see a filesystem go backwards in time, but to a point-in-time consistent
state.

--
Carson
> Depends on how you define DR - we have shared-storage HA in each
> datacenter (NetApp cluster), and replication between them in case we
> lose a datacenter (all clients on the MAN hit the same cluster unless
> we do a DR failover). The latter is what I'm calling DR.

It's what I call HA. DR is what snapshots or backups can help you
towards. HA can be used to reduce the likelihood of needing to use DR
measures, of course.

> "Seamless" is an overstatement, I agree. NetApp has synchronous
> SnapMirror (which is only mostly synchronous...). Worst case, clients
> may see a filesystem go backwards in time, but to a point-in-time
> consistent state.

Tell that to my swap file! Here we use synchronous mirroring for our VM
systems' storage. Having that go back in time would cause unpredictable
problems. Worst case is pretty bad!

It may be that for your purposes you can safely treat your filesystems
that way - although you'd better not have any in-memory caching of
files, obviously - but lots and lots of people cannot.

I believe we could do seamless replication and failover of NFS/ZFS,
except that it is very painful to manage, iSCSI (the only way I know to
do the mirroring in this context) caused us a lot of pain the last time
we used it, and the way Oracle treats Solaris and its support has made
it largely untenable for us. Instead we've switched to Linux and DRBD.
And if that doesn't get me sympathy, I don't know what will.

> Carson

Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
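(For reference, the DRBD approach Julian mentions boils down to a
synchronously mirrored block device with a filesystem and NFS export on
top. A minimal resource definition might look like this; hostnames,
devices, and addresses are made up, and the syntax shown is roughly
DRBD 8.x.)

    resource r0 {
        protocol C;                  # synchronous replication
        on nfs-a {
            device    /dev/drbd0;
            disk      /dev/sdb1;     # local backing device (assumed)
            address   10.0.0.1:7788;
            meta-disk internal;
        }
        on nfs-b {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.2:7788;
            meta-disk internal;
        }
    }

The filesystem lives on /dev/drbd0 on whichever node is primary; a
cluster manager (Pacemaker or similar) typically handles promoting the
secondary and moving the service IP on failover.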
On Thu, Apr 26, 2012 at 5:45 PM, Carson Gaspar <carson at taltos.org> wrote:
> On 4/26/12 2:17 PM, J.P. King wrote:
>> I don't know SnapMirror, so I may be mistaken, but I don't see how you
>> can have non-synchronous replication which can allow for seamless
>> client failover (in the general case). Technically this doesn't have
>> to be block based, but I've not seen anything which wasn't.
>> Synchronous replication pretty much precludes DR (again, I can think
>> of theoretical ways around this, but have never come across anything
>> in practice).
>
> "Seamless" is an overstatement, I agree. NetApp has synchronous
> SnapMirror (which is only mostly synchronous...). Worst case, clients
> may see a filesystem go backwards in time, but to a point-in-time
> consistent state.

Sure, if we assume apps make proper use of O_EXCL, O_APPEND,
link(2)/unlink(2)/rename(2), sync(2), fsync(2), and fdatasync(3C), and
can roll their own state back on their own. Databases typically know how
to do that (e.g., SQLite3). Most apps? Doubtful.

Nico
--
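(To make the discipline Nico is describing concrete, here is a minimal
sketch of the write-to-temp-then-rename pattern; the paths and the
state-dumping command are hypothetical. Because rename(2) is atomic
within a filesystem, a rollback to any consistent point leaves either
the old or the new state file, never a torn one.)

    # write the new state to a temporary file on the same filesystem
    tmp=$(mktemp /export/appdata/.state.XXXXXX)
    dump_app_state > "$tmp"        # hypothetical state-producing command

    # atomically replace the old state; readers see old or new, not partial
    mv "$tmp" /export/appdata/state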
On Thu, Apr 26, 2012 at 12:37 PM, Richard Elling
<richard.elling at gmail.com> wrote:
> [...]

NFSv4 had migration in the protocol (excluding protocols between servers)
from the get-go, but it was missing a lot (FedFS) and was not implemented
until recently. I've no idea which clients and servers support it
adequately besides Solaris 11, though that's just my fault (not being
informed). It's taken over a decade to get to where we have any
implementations of NFSv4 migration.

>> For me one of the exciting things about Lustre was/is the idea that
>> you could just have a single volume where all new data (and metadata)
>> is distributed evenly as you go. Need more storage? Plug it in,
>> either to an existing head or via a new head, then flip a switch and
>> there it is. No need to manage allocation. Migration may still be
>> needed, both within a cluster and between clusters, but that's much
>> more manageable when you have a protocol where data locations can be
>> all over the place in a completely transparent manner.
>
> Many distributed file systems do this, at the cost of being not quite
> POSIX-ish.

Well, Lustre does POSIX semantics just fine, including cache coherency
(as opposed to NFS's close-to-open coherency, which is decidedly
non-POSIX).

> In the brave new world of storage vMotion, NoSQL, and distributed
> object stores, it is not clear to me that coding to a POSIX file system
> is a strong requirement.

Well, I don't quite agree. I'm very suspicious of eventually-consistent.
I'm not saying that the enormous DBs that eBay and such run should sport
SQL and ACID semantics -- I'm saying that I think we can do much better
than eventually-consistent (and no-language) while not paying the steep
price that ACID requires. I'm not alone in this either. The trick is to
find the right compromise. Close-to-open semantics works out fine for
NFS, but O_APPEND is too wonderful not to have (ditto O_EXCL, which NFSv2
did not have; v4 has O_EXCL, but not O_APPEND). Whoever first delivers
the right compromise in distributed DB semantics stands to make a
fortune.

> Perhaps people are so tainted by experiences with v2 and v3 that we can
> explain the non-migration to v4 as being due to poor marketing? As a
> leader of NFS, Sun had unimpressive marketing.

Sun did not do much to improve NFS in the 90s, not compared to the v4
work that only really started paying off recently. And since Sun had
lost the client space by then, it doesn't mean all that much to have the
best server if the clients can't take advantage of the server's best
features for lack of a client implementation.

Basically, Sun's ZFS, DTrace, SMF, NFSv4, Zones, and other amazing
innovations came a few years too late to make up for the awful management
Sun was saddled with. But for all the decidedly awful things Sun
management did (or didn't do), the worst was terminating Sun PS (yes,
worse than all the non-marketing, poor marketing, poor acquisitions, poor
strategy, and all the rest, including truly epic mistakes like icing
Solaris on x86 a decade ago).

One of the worst outcomes of the Sun debacle is that there is now a bevy
of senior execs who think the worst thing Sun did was to open-source
Solaris and Java -- which isn't to say that Sun should have open-sourced
as much as it did, or that open source is an end in itself, but that
open-sourcing these things was a legitimate business tool with very
specific goals in mind in each case, and it had nothing to do with the
sinking of the company.
Or maybe that's one of the best outcomes, because the good news is that
those who learn the right lesson (in this case: that open source is a
legitimate business tool and sometimes, often even, a great mind-share
building tool) will be in the minority, and thus will have a huge
advantage over their competition. That's another thing Sun did not learn
until it was too late: mind-share matters enormously to a software
company.

Nico
--
Hi Paul,

I have been testing ZoL for a while now (somewhere around a year?) on two
separate machines:

1) dual Socket 771 Xeon, 8GB ECC RAM, 12 x Seagate 1TB ES.2 HDs
   (2 x 6-disk raidz2), Ubuntu Oneiric, with the zfs-native/stable PPA
2) Intel Xeon CPU E31120, 8GB ECC RAM, 4 x 400GB WD RE2 (1 x 4-disk
   raidz1), Ubuntu Oneiric, zfs-native/daily PPA

I would consider neither the daily nor the stable PPA good enough for
production use. I frequently get pools with all or nearly all of the
disks marked as removed (a simple zpool export; zpool import -f fixes
this). I also got kernel panics under heavy random I/O (BackupPC) on #1.
I recently switched to OpenIndiana on that machine, which is stable for
the same workload. On machine #2 (my office workstation) I get routine
crashes and slow performance (the raidz holds my home directory).

That being said, I haven't lost any data yet, and the bugs that have
affected me were quickly fixed, so I think that at some point it will be
stable -- just not right now.

Richard

On 04/25/2012 05:48 AM, Paul Archer wrote:
> [...]
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
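(For anyone hitting the same "disks marked removed" symptom, the
workaround Richard describes is just an export and forced re-import of
the pool; the pool name below is hypothetical.)

    # most/all devices show as REMOVED even though the data is intact
    zpool status tank

    # drop the pool and re-import it, forcing past the stale device state
    zpool export tank
    zpool import -f tank

    # confirm the devices have come back ONLINE
    zpool status tank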