Hello list,

I've read about your fascinating new fs implementation, ZFS. I've seen
a lot - nbd, lvm, evms, pvfs2, gfs, ocfs - and I have to say: I'm quite
impressed!

I'd set up a few of my boxes with OpenSolaris for storage (using Linux
and lvm right now - it offers pooling, but no built-in fault-tolerance)
if ZFS had one feature: use of more than one machine. Currently, as I
understand it, if disks fail, no problem, but if the server machine
fails, ...

I read in your FAQ that cluster features are on the way and wanted to
ask what the status is here :-)

BTW, I recently read about a filesystem with a pretty good cluster
architecture, called the Google File System. The article on the
English Wikipedia has a good overview, a link to the detailed papers
and a ZDNet interview about it.

I just wanted to point that out to you; maybe some of its design /
architecture is useful in ZFS's cluster mode.

--erj
On Tue, May 30, 2006 at 03:55:09AM -0700, Ernst Rohlicek jun. wrote:
> Hello list,
>
> I've read about your fascinating new fs implementation, ZFS. I've seen
> a lot - nbd, lvm, evms, pvfs2, gfs, ocfs - and I have to say: I'm quite
> impressed!
>
> I'd set up a few of my boxes with OpenSolaris for storage (using Linux
> and lvm right now - it offers pooling, but no built-in fault-tolerance)
> if ZFS had one feature: use of more than one machine. Currently, as I
> understand it, if disks fail, no problem, but if the server machine
> fails, ...
>
> I read in your FAQ that cluster features are on the way and wanted to
> ask what the status is here :-)
>
> BTW, I recently read about a filesystem with a pretty good cluster
> architecture, called the Google File System. The article on the
> English Wikipedia has a good overview, a link to the detailed papers
> and a ZDNet interview about it.
>
> I just wanted to point that out to you; maybe some of its design /
> architecture is useful in ZFS's cluster mode.

For cross-machine tolerance, it should be possible (once the iSCSI
target is integrated) to create ZFS-backed iSCSI targets and then use
RAID-Z from a single host across machines. This is not a true clustered
filesystem, as it has a single point of access, but it does get you
beyond the 'single node = data loss' mode of failure.

As for the true clustered filesystem, we're still gathering
requirements. We have some ideas in the pipeline, and it's definitely a
direction in which we are headed, but there's not much to say at this
point.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
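(As an illustration of the approach Eric describes - a sketch only, assuming the then-unintegrated iSCSI target support surfaces as a shareiscsi property and using the existing Solaris iscsiadm initiator; all pool, volume, address and device names below are made up:)

    # On each storage box: create a pool, carve out a zvol and export it
    # as an iSCSI target (assumes the upcoming shareiscsi property)
    zpool create tank c0t1d0
    zfs create -V 200g tank/export
    zfs set shareiscsi=on tank/export

    # On the head node: discover the targets and build a RAID-Z pool
    # across the iSCSI-backed LUNs
    iscsiadm modify discovery --sendtargets enable
    iscsiadm add discovery-address 192.168.1.11
    iscsiadm add discovery-address 192.168.1.12
    iscsiadm add discovery-address 192.168.1.13
    devfsadm -i iscsi
    zpool create bigpool raidz c2t0d0 c3t0d0 c4t0d0

The head node is still a single point of access, as Eric notes, but the loss of any one storage box is then handled by RAID-Z much like the loss of a single disk.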
Well, I would caution at this point against the iscsi backend if you
are planning on using NFS. We took a long-winded conversation off-list
and have yet to bring it back to this list, but the gist of it is that
the latency of iscsi, along with the tendency of NFS to fsync 3 times
per write, causes performance to drop dramatically, and it gets much
worse for a RAIDZ config. If you want to go this route, FC is currently
the suggested requirement.

On 5/30/06, Eric Schrock <eric.schrock at sun.com> wrote:
> On Tue, May 30, 2006 at 03:55:09AM -0700, Ernst Rohlicek jun. wrote:
> > Hello list,
> >
> > I've read about your fascinating new fs implementation, ZFS. I've seen
> > a lot - nbd, lvm, evms, pvfs2, gfs, ocfs - and I have to say: I'm quite
> > impressed!
> >
> > I'd set up a few of my boxes with OpenSolaris for storage (using Linux
> > and lvm right now - it offers pooling, but no built-in fault-tolerance)
> > if ZFS had one feature: use of more than one machine. Currently, as I
> > understand it, if disks fail, no problem, but if the server machine
> > fails, ...
> >
> > I read in your FAQ that cluster features are on the way and wanted to
> > ask what the status is here :-)
> >
> > BTW, I recently read about a filesystem with a pretty good cluster
> > architecture, called the Google File System. The article on the
> > English Wikipedia has a good overview, a link to the detailed papers
> > and a ZDNet interview about it.
> >
> > I just wanted to point that out to you; maybe some of its design /
> > architecture is useful in ZFS's cluster mode.
>
> For cross-machine tolerance, it should be possible (once the iSCSI
> target is integrated) to create ZFS-backed iSCSI targets and then use
> RAID-Z from a single host across machines. This is not a true clustered
> filesystem, as it has a single point of access, but it does get you
> beyond the 'single node = data loss' mode of failure.
>
> As for the true clustered filesystem, we're still gathering
> requirements. We have some ideas in the pipeline, and it's definitely a
> direction in which we are headed, but there's not much to say at this
> point.
>
> - Eric
>
> --
> Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Hello Joe,

Wednesday, May 31, 2006, 12:44:22 AM, you wrote:

JL> Well, I would caution at this point against the iscsi backend if you
JL> are planning on using NFS. We took a long-winded conversation off-list
JL> and have yet to bring it back to this list, but the gist of it is that
JL> the latency of iscsi, along with the tendency of NFS to fsync 3 times
JL> per write, causes performance to drop dramatically, and it gets much
JL> worse for a RAIDZ config. If you want to go this route, FC is currently
JL> the suggested requirement.

Can you provide more info on NFS+raidz?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Well, here's my previous summary off-list to different Solaris folk
(regarding NFS serving via ZFS and iSCSI):

I want to use ZFS as a NAS with no bounds on the backing hardware (not
restricted to one box's capacity). Thus, there are two options: FC SAN
or iSCSI. In my case, I have multi-building considerations and 10Gb
Ethernet layer-2 interconnects that make iscsi ideal. Our standard users
use NAS for collections of many small files to many large files (source
code repositories, simulations, cad tools, VM images, rendering
meta-forms, and final results). To allow for ongoing growth and drive
replacement across multiple iscsi targets, RAIDZ was selected over
static hardware RAID solutions. This setup is very similar to a gfiler
(iscsi based) or otherwise a standard NetApp Filer product, and it would
appear that Sun is targeting this solution. I need this setup for both
Tier1 primary NAS storage and disk-to-disk Tier2 backup.

In my extensive testing (not so much benchmarking, and definitely
without the time/focus to learn dtrace and the like), we have found that
ZFS can be used for a Tier2 system but not for Tier1, due to
pathologically poor performance via NFS against a ZFS filesystem based
on RAIDZ over non-local storage. With a non-RAIDZ configuration,
performance is still extremely poor but more acceptable. Only with an
expensive FC-SAN implementation does ZFS appear workable. If that is the
only workable solution, then ZFS has lost its benefits over NetApp, as
we approach the same costs but do not get the same current maturity. Is
it a lost cause? Honestly, I need to be convinced that this is workable,
and so far the alternative solutions have been shot down.

Evidence? The final synthetic test used was to generate a directory of
6250 random 8k files. On an NFS client (Solaris, Linux, or even
loop-back on the server itself), run "cp -r SRCDIR DESTDIR" where
DESTDIR is on the NFS server. Averages from memory:

FS    iSCSI backend            Rate
XFS   1.5TB single LUN         ~1-1.1MB/sec
ZFS   1.5TB single LUN         ~250-400KB/sec
ZFS   1.5TB RAIDZ (8 disks)    ~25KB/sec

In the case of mixed-size files, predominantly small files above and
below 8K, I see the XFS solution jump to an average of 2.5-3MB/sec. The
ZFS store over a single LUN stays within 200-420KB/sec, and the RAIDZ
setup ranges from 16-40KB/sec. Likely caching and some dynamic
behaviours cause ZFS to get worse with mixed sizing, whereas XFS and the
like increase performance. Finally, by switching to SMB and not using
NFS, I can maintain rates of over 3MB/sec. Large files over NFS get more
reasonable performance (14-28MB/sec) on any given ZFS backend, and I get
30+MB/sec, with spikes close to 100MB/sec, when writing locally. I can
only maximize performance on my ZFS backend if I use a blocksize (tests
using dd) of 256K or greater; 128K seems to provide lower overall data
rates, and I believe that is the default when I use cp, rsync, or other
commands locally.

In summary, I can make my ZFS-based initiator an NFS client, or
otherwise use rsyncd, to ameliorate the pathological NFS server
performance of the ZFS combination. I can then serve files fine. This
allows us to move forward with ZFS as a Tier2-only solution.
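(For reference, a sketch of how a test along those lines could be reproduced - the directory names, mount point and output paths are illustrative, not Joe's exact scripts:)

    # Generate 6250 random 8k files
    mkdir SRCDIR
    i=0
    while [ $i -lt 6250 ]; do
        dd if=/dev/urandom of=SRCDIR/file.$i bs=8k count=1 2>/dev/null
        i=`expr $i + 1`
    done

    # Copy them onto the NFS-mounted ZFS filesystem and time it
    time cp -r SRCDIR /net/zfs-server/tank/DESTDIR

    # Raw local throughput to the ZFS backend at different block sizes
    dd if=/dev/zero of=/tank/ddtest bs=128k count=8192
    dd if=/dev/zero of=/tank/ddtest bs=256k count=4096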
If _anything_ can be done to address NFS and its interactions with ZFS
and bring it close to 1MB/sec performance (these are gig-e interconnects
after all, think about it), then it will only be 1/10th the performance
of a NetApp in this worst-case scenario, and it would perform similarly
to the NetApp, if not better, in other cases. The NetApp can do around
10MB/sec in the scenario I'm depicting. Currently, we have around 1/20th
to 1/30th that performance level when not using RAIDZ, and 1/200th when
using RAIDZ.

I just can't quite understand it: a "cp -p TESTDIR DESTDIR" of 50MB of
small files completes locally in an instant, the OS returns to the
prompt, and zpool iostat shows the writes committed over the next 3-6
seconds - and this is OK for on-disk consistency. But then for some
reason the NFS client can't commit in a similar fashion, with Solaris
saying "yes, we got it, here's confirmation.. next" just as it does
locally. The data definitely gets there at the same speed, as my tests
with remote iscsi pools and as an NFS client show. My naive sense is
that this should be addressable at some level without inducing
corruption. I have a feeling that it's somehow being overly conservative
in this stance.

On 5/30/06, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> Hello Joe,
>
> Wednesday, May 31, 2006, 12:44:22 AM, you wrote:
>
> JL> Well, I would caution at this point against the iscsi backend if you
> JL> are planning on using NFS. We took a long-winded conversation off-list
> JL> and have yet to bring it back to this list, but the gist of it is that
> JL> the latency of iscsi, along with the tendency of NFS to fsync 3 times
> JL> per write, causes performance to drop dramatically, and it gets much
> JL> worse for a RAIDZ config. If you want to go this route, FC is currently
> JL> the suggested requirement.
>
> Can you provide more info on NFS+raidz?
>
> --
> Best regards,
> Robert                          mailto:rmilkowski at task.gda.pl
>                                 http://milek.blogspot.com
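(A minimal way to watch the behaviour Joe describes - the pool name, paths and interval are illustrative:)

    # Watch pool writes once per second while copying
    zpool iostat tank 1 &

    # Local copy: returns almost immediately, writes trickle out over
    # the next few seconds
    cp -pr TESTDIR /tank/DESTDIR

    # Same copy from an NFS client of that filesystem: each file waits
    # on synchronous commits, so the copy crawls
    cp -pr TESTDIR /net/zfs-server/tank/DESTDIR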
I urgently need a cluster filesystem that runs with Oracle Clusterware.

Currently, cluster filesystems on Solaris require very expensive
solutions (like Veritas Storage Foundation or Sun Cluster) that are no
longer accepted by customers (compared with Linux solutions).

That's why I want to propose thinking about a project to integrate OCFS2
into OpenSolaris.

A clustered ZFS would surely be an interesting next step, but it is
unnecessarily complex for RDBMS purposes, which need a cluster FS only
for some archival purposes and shared access for backup and restore (ZFS
characteristics like copy-on-write are not all too good for an RDBMS
with much scattered I/O, so we would always prefer cluster raw devices
over cluster files for normal database files).

kind rgds.
Karsten
Karsten Hashimoto wrote:
> I urgently need a cluster filesystem that runs with Oracle Clusterware.

Today, we would recommend the Sun Cluster Advanced Edition for Oracle
RAC. This includes QFS, which is a distributed file system. See
http://www.sun.com/software/cluster/faq.xml#q21

> Currently, cluster filesystems on Solaris require very expensive
> solutions (like Veritas Storage Foundation or Sun Cluster) that are no
> longer accepted by customers (compared with Linux solutions).

I can't comment on the price, other than to say that it[1] is free (as
in $) for research, development, or education purposes. There is also a
variety of different pricing options for production systems which
require support. Anecdotally, I hear that it often costs less than VSF
for production systems.

[1] it will depend on what "it" is, what your company or institution
does, and where you live. I find the www.sun.com pages on this topic
confusing, so I'd recommend contacting your local Sun sales office.

> That's why I want to propose thinking about a project to integrate
> OCFS2 into OpenSolaris.

OCFS2 is an Oracle "project," so you should ask them. Without Oracle's
blessing, you'd be wasting your time.
http://oss.oracle.com/projects/ocfs2/

> A clustered ZFS would surely be an interesting next step, but it is
> unnecessarily complex for RDBMS purposes, which need a cluster FS only
> for some archival purposes and shared access for backup and restore
> (ZFS characteristics like copy-on-write are not all too good for an
> RDBMS with much scattered I/O, so we would always prefer cluster raw
> devices over cluster files for normal database files).

For raw devices, zfs offers zvols. The jury is still out pondering the
merits of databases over ZFS. From a RAS perspective, this combination
is excellent. But I don't think we have much real data to address the
performance perspective (that is being done by others... who will speak
up when they are ready :-)

Sun marketing has been collecting requirements for a distributed version
of ZFS. Please check this forum's archives and provide input. Also,
http://www.opensolaris.org/os/community/zfs/faq
 -- richard
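(For example, a raw zvol for database use can be created with something like the following - the pool and volume names are illustrative:)

    # Create a 10GB ZFS volume; the raw device then appears under
    # /dev/zvol/rdsk/tank/oradata01
    zfs create -V 10g tank/oradata01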