Hello list,

I've read about your fascinating new fs implementation, ZFS. I've seen
a lot - nbd, lvm, evms, pvfs2, gfs, ocfs - and I have to say: I'm quite
impressed!

I'd set up a few of my boxes with OpenSolaris for storage (using Linux
and lvm right now - it offers pooling, but no built-in fault-tolerance)
if ZFS had one feature: use of more than one machine. Currently, as I
understand it, if disks fail, no problem, but if the server machine
fails, ...

I read in your FAQ that cluster features are on the way and wanted to
ask what the status is here :-)

BTW, I recently read about a filesystem with a pretty good cluster
architecture, called the Google File System. The article on the
English Wikipedia has a good overview, a link to the detailed papers
and a ZDNet interview about it.

I just wanted to point that out to you; maybe some of its design /
architecture is useful in ZFS's cluster mode.

--erj
On Tue, May 30, 2006 at 03:55:09AM -0700, Ernst Rohlicek jun. wrote:
> Hello list,
>
> I've read about your fascinating new fs implementation, ZFS. I've seen
> a lot - nbd, lvm, evms, pvfs2, gfs, ocfs - and I have to say: I'm quite
> impressed!
>
> I'd set up a few of my boxes with OpenSolaris for storage (using Linux
> and lvm right now - it offers pooling, but no built-in fault-tolerance)
> if ZFS had one feature: use of more than one machine. Currently, as I
> understand it, if disks fail, no problem, but if the server machine
> fails, ...
>
> I read in your FAQ that cluster features are on the way and wanted to
> ask what the status is here :-)
>
> BTW, I recently read about a filesystem with a pretty good cluster
> architecture, called the Google File System. The article on the
> English Wikipedia has a good overview, a link to the detailed papers
> and a ZDNet interview about it.
>
> I just wanted to point that out to you; maybe some of its design /
> architecture is useful in ZFS's cluster mode.

For cross-machine tolerance, it should be possible (once the iSCSI
target is integrated) to create ZFS-backed iSCSI targets and then use
RAID-Z from a single host across machines. This is not a true clustered
filesystem, as it has a single point of access, but it does get you
beyond the 'single node = data loss' mode of failure.

As for the true clustered filesystem, we're still gathering
requirements. We have some ideas in the pipeline, and it's definitely a
direction in which we are headed, but there's not much to say at this
point.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
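(As an illustration of the approach Eric describes - a sketch only, assuming the then-unintegrated iSCSI target support surfaces as a shareiscsi property and using the existing Solaris iscsiadm initiator; all pool, volume, address and device names below are made up:)

    # On each storage box: create a pool, carve out a zvol and export it
    # as an iSCSI target (assumes the upcoming shareiscsi property)
    zpool create tank c0t1d0
    zfs create -V 200g tank/export
    zfs set shareiscsi=on tank/export

    # On the head node: discover the targets and build a RAID-Z pool
    # across the iSCSI-backed LUNs
    iscsiadm modify discovery --sendtargets enable
    iscsiadm add discovery-address 192.168.1.11
    iscsiadm add discovery-address 192.168.1.12
    iscsiadm add discovery-address 192.168.1.13
    devfsadm -i iscsi
    zpool create bigpool raidz c2t0d0 c3t0d0 c4t0d0

The head node is still a single point of access, as Eric notes, but the loss of any one storage box is then handled by RAID-Z much like the loss of a single disk.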
Well, I would caution at this point against the iscsi backend if you
are planning on using NFS. We took a long-winded conversation off-list
and have yet to bring it back to this list, but the gist of it is that
the latency of iscsi, along with the tendency of NFS to fsync 3 times
per write, causes performance to drop dramatically, and it gets much
worse for a RAIDZ config. If you want to go this route, FC is currently
the suggested requirement.

On 5/30/06, Eric Schrock <eric.schrock at sun.com> wrote:
> On Tue, May 30, 2006 at 03:55:09AM -0700, Ernst Rohlicek jun. wrote:
> > Hello list,
> >
> > I've read about your fascinating new fs implementation, ZFS. I've seen
> > a lot - nbd, lvm, evms, pvfs2, gfs, ocfs - and I have to say: I'm quite
> > impressed!
> >
> > I'd set up a few of my boxes with OpenSolaris for storage (using Linux
> > and lvm right now - it offers pooling, but no built-in fault-tolerance)
> > if ZFS had one feature: use of more than one machine. Currently, as I
> > understand it, if disks fail, no problem, but if the server machine
> > fails, ...
> >
> > I read in your FAQ that cluster features are on the way and wanted to
> > ask what the status is here :-)
> >
> > BTW, I recently read about a filesystem with a pretty good cluster
> > architecture, called the Google File System. The article on the
> > English Wikipedia has a good overview, a link to the detailed papers
> > and a ZDNet interview about it.
> >
> > I just wanted to point that out to you; maybe some of its design /
> > architecture is useful in ZFS's cluster mode.
>
> For cross-machine tolerance, it should be possible (once the iSCSI
> target is integrated) to create ZFS-backed iSCSI targets and then use
> RAID-Z from a single host across machines. This is not a true clustered
> filesystem, as it has a single point of access, but it does get you
> beyond the 'single node = data loss' mode of failure.
>
> As for the true clustered filesystem, we're still gathering
> requirements. We have some ideas in the pipeline, and it's definitely a
> direction in which we are headed, but there's not much to say at this
> point.
>
> - Eric
>
> --
> Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Hello Joe,

Wednesday, May 31, 2006, 12:44:22 AM, you wrote:

JL> Well, I would caution at this point against the iscsi backend if you
JL> are planning on using NFS. We took a long-winded conversation off-list
JL> and have yet to bring it back to this list, but the gist of it is that
JL> the latency of iscsi, along with the tendency of NFS to fsync 3 times
JL> per write, causes performance to drop dramatically, and it gets much
JL> worse for a RAIDZ config. If you want to go this route, FC is currently
JL> the suggested requirement.

Can you provide more info on NFS+raidz?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Well, here's my previous summary off-list to different Solaris folk
(regarding NFS serving via ZFS and iSCSI):

I want to use ZFS as a NAS with no bounds on the backing hardware (not
restricted to one box's capacity). Thus, there are two options: FC SAN
or iSCSI. In my case, I have multi-building considerations and 10Gb
Ethernet layer-2 interconnects that make iscsi ideal. Our standard users
use NAS for collections of many small files to many large files (source
code repositories, simulations, cad tools, VM images, rendering
meta-forms, and final results). To allow for ongoing growth and drive
replacement across multiple iscsi targets, RAIDZ was selected over
static hardware RAID solutions. This setup is very similar to a gfiler
(iscsi based) or otherwise a standard NetApp Filer product, and it would
appear that Sun is targeting this solution. I need this setup for both
Tier1 primary NAS storage and disk-to-disk Tier2 backup.

In my extensive testing (not so much benchmarking, and definitely
without the time/focus to learn dtrace and the like), we have found that
ZFS can be used for a Tier2 system but not for Tier1, due to
pathologically poor performance via NFS against a ZFS filesystem based
on RAIDZ over non-local storage. With a non-RAIDZ configuration,
performance is still extremely poor but more acceptable. Only with an
expensive FC-SAN implementation does ZFS appear workable. If that is the
only workable solution, then ZFS has lost its benefits over NetApp, as
we approach the same costs but do not get the same current maturity. Is
it a lost cause? Honestly, I need to be convinced that this is workable,
and so far the alternative solutions have been shot down.

Evidence? The final synthetic test used was to generate a directory of
6250 random 8k files. On an NFS client (Solaris, Linux, or even
loop-back on the server itself), run "cp -r SRCDIR DESTDIR" where
DESTDIR is on the NFS server. Averages from memory:

FS    iSCSI backend            Rate
XFS   1.5TB single LUN         ~1-1.1MB/sec
ZFS   1.5TB single LUN         ~250-400KB/sec
ZFS   1.5TB RAIDZ (8 disks)    ~25KB/sec

In the case of mixed-size files, predominantly small files above and
below 8K, I see the XFS solution jump to an average of 2.5-3MB/sec. The
ZFS store over a single LUN stays within 200-420KB/sec, and the RAIDZ
setup ranges from 16-40KB/sec. Likely caching and some dynamic
behaviours cause ZFS to get worse with mixed sizing, whereas XFS and the
like increase performance. Finally, by switching to SMB and not using
NFS, I can maintain rates of over 3MB/sec. Large files over NFS get more
reasonable performance (14-28MB/sec) on any given ZFS backend, and I get
30+MB/sec, with spikes close to 100MB/sec, when writing locally. I can
only maximize performance on my ZFS backend if I use a blocksize (tests
using dd) of 256K or greater; 128K seems to provide lower overall data
rates, and I believe that is the default when I use cp, rsync, or other
commands locally.

In summary, I can make my ZFS-based initiator an NFS client, or
otherwise use rsyncd, to ameliorate the pathological NFS server
performance of the ZFS combination. I can then serve files fine. This
allows us to move forward with ZFS as a Tier2-only solution.
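(For reference, a sketch of how a test along those lines could be reproduced - the directory names, mount point and output paths are illustrative, not Joe's exact scripts:)

    # Generate 6250 random 8k files
    mkdir SRCDIR
    i=0
    while [ $i -lt 6250 ]; do
        dd if=/dev/urandom of=SRCDIR/file.$i bs=8k count=1 2>/dev/null
        i=`expr $i + 1`
    done

    # Copy them onto the NFS-mounted ZFS filesystem and time it
    time cp -r SRCDIR /net/zfs-server/tank/DESTDIR

    # Raw local throughput to the ZFS backend at different block sizes
    dd if=/dev/zero of=/tank/ddtest bs=128k count=8192
    dd if=/dev/zero of=/tank/ddtest bs=256k count=4096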
If _anything_ can be done to address NFS and its interactions with ZFS
and bring it close to 1MB/sec performance (these are gig-e interconnects
after all, think about it), then it will only be 1/10th the performance
of a NetApp in this worst-case scenario, and it would perform similarly
to the NetApp, if not better, in other cases. The NetApp can do around
10MB/sec in the scenario I'm depicting. Currently, we have around 1/20th
to 1/30th that performance level when not using RAIDZ, and 1/200th when
using RAIDZ.

I just can't quite understand it: a "cp -p TESTDIR DESTDIR" of 50MB of
small files completes locally in an instant, the OS returns to the
prompt, and zpool iostat shows the writes committed over the next 3-6
seconds - and this is OK for on-disk consistency. But then for some
reason the NFS client can't commit in a similar fashion, with Solaris
saying "yes, we got it, here's confirmation.. next" just as it does
locally. The data definitely gets there at the same speed, as my tests
with remote iscsi pools and as an NFS client show. My naive sense is
that this should be addressable at some level without inducing
corruption. I have a feeling that it's somehow being overly conservative
in this stance.

On 5/30/06, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> Hello Joe,
>
> Wednesday, May 31, 2006, 12:44:22 AM, you wrote:
>
> JL> Well, I would caution at this point against the iscsi backend if you
> JL> are planning on using NFS. We took a long-winded conversation off-list
> JL> and have yet to bring it back to this list, but the gist of it is that
> JL> the latency of iscsi, along with the tendency of NFS to fsync 3 times
> JL> per write, causes performance to drop dramatically, and it gets much
> JL> worse for a RAIDZ config. If you want to go this route, FC is currently
> JL> the suggested requirement.
>
> Can you provide more info on NFS+raidz?
>
> --
> Best regards,
> Robert                          mailto:rmilkowski at task.gda.pl
>                                 http://milek.blogspot.com
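(A minimal way to watch the behaviour Joe describes - the pool name, paths and interval are illustrative:)

    # Watch pool writes once per second while copying
    zpool iostat tank 1 &

    # Local copy: returns almost immediately, writes trickle out over
    # the next few seconds
    cp -pr TESTDIR /tank/DESTDIR

    # Same copy from an NFS client of that filesystem: each file waits
    # on synchronous commits, so the copy crawls
    cp -pr TESTDIR /net/zfs-server/tank/DESTDIR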
I urgently need a cluster filesystem that runs with Oracle Clusterware.

Currently, cluster filesystems on Solaris require very expensive
solutions (like Veritas Storage Foundation or Sun Cluster) that are no
longer accepted by customers (compared with Linux solutions).

That's why I want to propose thinking about a project to integrate OCFS2
into OpenSolaris.

A clustered ZFS would surely be an interesting next step, but it is
unnecessarily complex for RDBMS purposes, which need a cluster FS only
for some archival purposes and shared access for backup and restore (ZFS
characteristics like copy-on-write are not all too good for an RDBMS
with much scattered I/O, so we would always prefer cluster raw devices
over cluster files for normal database files).

kind rgds.
Karsten
Karsten Hashimoto wrote:
> I urgently need a cluster filesystem that runs with Oracle Clusterware.

Today, we would recommend the Sun Cluster Advanced Edition for Oracle
RAC. This includes QFS, which is a distributed file system. See
http://www.sun.com/software/cluster/faq.xml#q21

> Currently, cluster filesystems on Solaris require very expensive
> solutions (like Veritas Storage Foundation or Sun Cluster) that are no
> longer accepted by customers (compared with Linux solutions).

I can't comment on the price, other than to say that it[1] is free (as
in $) for research, development, or education purposes. There is also a
variety of different pricing options for production systems which
require support. Anecdotally, I hear that it often costs less than VSF
for production systems.

[1] it will depend on what "it" is, what your company or institution
does, and where you live. I find the www.sun.com pages on this topic
confusing, so I'd recommend contacting your local Sun sales office.

> That's why I want to propose thinking about a project to integrate
> OCFS2 into OpenSolaris.

OCFS2 is an Oracle "project," so you should ask them. Without Oracle's
blessing, you'd be wasting your time.
http://oss.oracle.com/projects/ocfs2/

> A clustered ZFS would surely be an interesting next step, but it is
> unnecessarily complex for RDBMS purposes, which need a cluster FS only
> for some archival purposes and shared access for backup and restore
> (ZFS characteristics like copy-on-write are not all too good for an
> RDBMS with much scattered I/O, so we would always prefer cluster raw
> devices over cluster files for normal database files).

For raw devices, zfs offers zvols. The jury is still out pondering the
merits of databases over ZFS. From a RAS perspective, this combination
is excellent. But I don't think we have much real data to address the
performance perspective (that is being done by others... who will speak
up when they are ready :-)

Sun marketing has been collecting requirements for a distributed version
of ZFS. Please check this forum's archives and provide input. Also,
http://www.opensolaris.org/os/community/zfs/faq
 -- richard
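(For example, a raw zvol for database use can be created with something like the following - the pool and volume names are illustrative:)

    # Create a 10GB ZFS volume; the raw device then appears under
    # /dev/zvol/rdsk/tank/oradata01
    zfs create -V 10g tank/oradata01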