Hi Folks,

I am currently in the midst of setting up a completely new file server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) connected to an Engenio 6994 product (I work for LSI Logic so Engenio is a no brainer). I have configured a couple of zpools from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I then created sub zfs systems below that and set quotas and sharenfs'd them so that it appears that these "file systems" are dynamically shrinkable and growable. It looks very good... I can see the correct file system sizes on all types of machines (Linux 32/64bit and of course Solaris boxes) and if I resize the quota it's picked up in NFS right away. But I would be the first in our organization to use this in an enterprise system so I definitely have some concerns that I'm hoping someone here can address.

1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares and the write cache (8GB) has battery backup, so I'm not too concerned from a hardware side. I'm looking for an idea of how stable ZFS itself is in terms of corruptibility, uptime and OS stability.

2. Recommended config. Above, I have a fairly simple setup. In many of the examples the granularity is home directory level and when you have many many users that could get to be a bit of a nightmare administratively. I am really only looking for high level dynamic size adjustability and am not interested in its built in RAID features. But given that, any real world recommendations?

3. Caveats? Anything I'm missing that isn't in the docs that could turn into a BIG gotcha?

4. Since all data access is via NFS we are concerned that 32 bit systems (Mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less than that. Can anyone comment?

The bottom line is that with anything new there is cause for concern. Especially if it hasn't been tested within our organization. But the convenience/functionality factors are way too hard to ignore.

Thanks,

Jeff

This message posted from opensolaris.org
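For readers who want to reproduce this kind of layout, a minimal sketch of the commands involved follows; the pool name, device name and quota sizes are hypothetical, not taken from Jeff's actual configuration:

    # create a pool on a LUN presented by the array (device name is illustrative)
    zpool create tank c4t600A0B800012345Ad0

    # child filesystems, each capped by a quota and shared over NFS
    zfs create tank/projects
    zfs set quota=500g tank/projects
    zfs set sharenfs=on tank/projects

    # "resizing" the file system later is just a quota change,
    # which NFS clients pick up on the fly
    zfs set quota=750g tank/projects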
Hello Jeffery,

Friday, January 26, 2007, 3:16:44 PM, you wrote:

JM> Hi Folks,
JM> I am currently in the midst of setting up a completely new file
JM> server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM)
JM> connected to an Engenio 6994 product (I work for LSI Logic so
JM> Engenio is a no brainer). I have configured a couple of zpools
JM> from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I
JM> then created sub zfs systems below that and set quotas and
JM> sharenfs'd them so that it appears that these "file systems" are
JM> dynamically shrinkable and growable. It looks very good... I can
JM> see the correct file system sizes on all types of machines (Linux
JM> 32/64bit and of course Solaris boxes) and if I resize the quota
JM> it's picked up in NFS right away. But I would be the first in our
JM> organization to use this in an enterprise system so I definitely
JM> have some concerns that I'm hoping someone here can address.

JM> 1. How stable is ZFS? The Engenio box is completely configured
JM> for RAID5 with hot spares and the write cache (8GB) has battery backup,
JM> so I'm not too concerned from a hardware side. I'm looking for an
JM> idea of how stable ZFS itself is in terms of corruptibility, uptime and OS stability.

When it comes to uptime, OS stability or corruptibility - no problems here. However, if you give ZFS entire LUNs on Engenio devices, IIRC with those arrays, when ZFS issues a write-cache flush to the array it actually flushes, and this can hurt performance. There's a way to set up the array to ignore flush commands, or you can put ZFS on SMI-labelled slices. You have to check whether this problem was actually with Engenio - I'm not sure.

However, depending on the workload, consider doing RAID in ZFS instead of on the array, especially because you then get self-healing from ZFS. At the least, doing a stripe across several RAID5 LUNs would be a good idea.

JM> 2. Recommended config. Above, I have a fairly simple setup. In
JM> many of the examples the granularity is home directory level and
JM> when you have many many users that could get to be a bit of a
JM> nightmare administratively. I am really only looking for high
JM> level dynamic size adjustability and am not interested in its
JM> built in RAID features. But given that, any real world recommendations?

Depending on how many users you have, consider creating a file system for each user, or at least for a group of users if you can group them.

JM> 3. Caveats? Anything I'm missing that isn't in the docs that could turn into a BIG gotcha?

The WRITE CACHE issue I mentioned above - but check whether it was really Engenio - anyway, there are simple workarounds. There are some performance issues in corner cases; I hope you won't hit one. Use at least S10U3 or Nevada (there are some people using Nevada in production :)).

JM> 4. Since all data access is via NFS we are concerned that 32 bit
JM> systems (Mainly Linux and Windows via Samba) will not be able to
JM> access all the data areas of a 2TB+ zpool even if the zfs quota on
JM> a particular share is less than that. Can anyone comment?

If there's a quota on a file system then the NFS client will see that quota as the file system size, IIRC, so it shouldn't be a problem. But that means a file system for each user.

JM> The bottom line is that with anything new there is cause for
JM> concern. Especially if it hasn't been tested within our
JM> organization. But the convenience/functionality factors are way too hard to ignore.

ZFS is new, that's right.
There are some problems, mostly related to performance and hot-spare support (when doing RAID in ZFS). Other than that you should be OK. Quite a lot of people are using ZFS in production. I myself have had ZFS in production for years, right now with well over 100TB of data on it using different storage arrays, and I'm still migrating more and more data. Never lost any data on ZFS - at least I don't know about it :)))))

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
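A rough sketch of the two suggestions above (device names, user names and quota sizes are made up for illustration): stripe the pool across several array-side RAID5 LUNs, and give each user (or group) a filesystem of their own:

    # stripe across several RAID5 LUNs exported by the array
    zpool create tank c4t0d0 c4t1d0 c4t2d0

    # one filesystem per user, each with its own quota, shared over NFS
    zfs create tank/home
    for u in alice bob carol; do
        zfs create tank/home/$u
        zfs set quota=10g tank/home/$u
        zfs set sharenfs=on tank/home/$u
    done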
On Fri, 2007-01-26 at 06:16 -0800, Jeffery Malloch wrote:
> Hi Folks,
>
> I am currently in the midst of setting up a completely new file server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) connected to an Engenio 6994 product (I work for LSI Logic so Engenio is a no brainer). I have configured a couple of zpools from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I then created sub zfs systems below that and set quotas and sharenfs'd them so that it appears that these "file systems" are dynamically shrinkable and growable. It looks very good... I can see the correct file system sizes on all types of machines (Linux 32/64bit and of course Solaris boxes) and if I resize the quota it's picked up in NFS right away. But I would be the first in our organization to use this in an enterprise system so I definitely have some concerns that I'm hoping someone here can address.
>
> 1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares

That partly defeats the purpose of ZFS. ZFS offers raid-z and raid-z2 (double parity) with all the advantages of raid-5 or raid-6 but without several of the raid-5 issues. It also has features that a raid-5 controller could never provide: ensuring data integrity from the kernel to the disk, and self-correction.

> and the write cache (8GB) has battery backup, so I'm not too concerned from a hardware side.

Whereas the cache/battery backup is a requirement if you run raid-5, it is not for ZFS.

> I'm looking for an idea of how stable ZFS itself is in terms of corruptibility, uptime and OS stability.

Since Solaris 10 U3, it is rock solid. No issue here. 1.3TB or so currently assigned in FC drives, in production without any issues. We switched after losing some data with hardware mirroring. Our sysadmin is ecstatic with ZFS. Some of the filesystems have compression enabled, and that even increases throughput if you have the CPU/RAM available.

> 2. Recommended config.

The most reliable setup is a JBOD + zfs. But if you have cache on your box, there might be some magic setup you have to do for that box, and I'm sure somebody on the list will help you with that. I don't have an Engenio.

Francois
On Fri, 2007-01-26 at 06:16 -0800, Jeffery Malloch wrote:
> 2. Recommended config.

1) Since this is a system that many users will depend on, use zfs-managed redundancy, either mirroring or raid-z, between the LUNs exported by the storage system. You may think your storage system is perfect, but are you sure? With a non-redundant zfs, over time, you'll know for sure - but you might find this out at a very inconvenient time. With zfs-managed redundancy, if bit rot happens, you have an excellent chance of slogging through without any application-visible impact.

2) Enable compression. For the software development workloads I'm seeing, this generally recovers the space lost to redundancy.

					- Bill
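As an illustration of both points, here is a sketch assuming three LUNs exported by the array (the pool and device names are hypothetical):

    # zfs-managed redundancy across the array's LUNs: raid-z here,
    # "mirror" would work the same way
    zpool create tank raidz c5t0d0 c5t1d0 c5t2d0

    # compression often wins back the space given up to redundancy
    zfs set compression=on tank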
I've used ZFS since July/August 2006, when Sol 10 Update 2 came out (the first release to integrate ZFS). I've used it extensively on three servers (an E25K domain and 2 E2900s); two of them are production. I've had over 3TB of storage from an EMC SAN under ZFS management for no less than 6 months. Like your configuration, we've deferred data redundancy to the SAN. My observations are:

1. ZFS is stable to a very large extent. There are two known issues that I'm aware of:

   a. You can end up in an endless 'reboot' cycle when you have a corrupt zpool. I came across this when I had data corruption due to an HBA mismatch with the EMC SAN. This mismatch injected data corruption in transit and the EMC faithfully wrote the bad data; upon reading this bad data, ZFS threw up all over the floor for that pool. There is a documented workaround to snap out of the 'reboot' cycle; I've not checked if this is fixed in 11/06 update 3.

   b. Your server will hang when one of the underlying disks disappears. In our case we had a T2000 running 11/06 and had a mirrored zpool against two internal drives. When we pulled one of the drives abruptly the server simply hung. I believe this is a known bug; workaround?

2. When you have I/O operations that either request fsync or open files with the O_DSYNC option, coupled with high I/O, ZFS will choke. It won't crash, but the filesystem I/O runs like molasses on a cold morning.

All my feedback is based on Solaris 10 Update 2 (aka 06/06) and I've no comments on NFS. I strongly recommend that you use ZFS data redundancy (z1, z2, or mirror) and simply delegate the Engenio to stripe the data for performance.

This message posted from opensolaris.org
On Fri, Jan 26, 2007 at 08:06:46AM -0800, Anantha N. Srirama wrote:
>
> b. Your server will hang when one of the underlying disks disappears. In our case we had a T2000 running 11/06 and had a mirrored zpool against two internal drives. When we pulled one of the drives abruptly the server simply hung. I believe this is a known bug; workaround?

This was just covered here, and it looks like the fix will make it into U4 (I think it's in snv_48?). The workaround is to do a 'zpool offline' whenever possible before removing a disk. Yes, this is not always possible (in the case of disk death), but it will help in some situations.

I can't wait for U4. :)

-brian
--
"The reason I don't use Gnome: every single other window manager I know of is very powerfully extensible, where you can switch actions to different mouse buttons. Guess which one is not, because it might confuse the poor users? Here's a hint: it's not the small and fast one."  --Linus
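The workaround looks roughly like this (the pool and device names are made up):

    # take the disk out of service before physically pulling it
    zpool offline tank c1t3d0

    # after swapping in a new drive at the same location, resilver it
    zpool replace tank c1t3d0
    zpool status tank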
ZFS Rule #0: You gotta have redundancy
ZFS Rule #1: Redundancy shall be managed by zfs, and zfs alone.

Whatever you have, junk it. Let ZFS manage mirroring and redundancy. ZFS doesn't forgive even single bit errors!

This message posted from opensolaris.org
Oh yep, I know that "churning" feeling in the stomach that there's got to be a GOTCHA somewhere... it can't be *that* simple!

This message posted from opensolaris.org
On Fri, Jan 26, 2007 at 09:33:40AM -0800, Akhilesh Mritunjai wrote:
> ZFS Rule #0: You gotta have redundancy
> ZFS Rule #1: Redundancy shall be managed by zfs, and zfs alone.
>
> Whatever you have, junk it. Let ZFS manage mirroring and redundancy. ZFS doesn't forgive even single bit errors!

How does this work in an environment with storage that's centrally-managed and shared between many servers? I'm putting together a new IMAP server that will eventually use 3TB of space from our Netapp via an iSCSI SAN. The Netapp provides all of the disk management and redundancy that I'll ever need. The server will only see a virtual disk (a LUN). I want to use ZFS on that LUN because it's superior to UFS in this application, even without the redundancy. There's no way to get the Netapp to behave like a JBOD. Are you saying that this configuration isn't going to work?

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
On Fri, 26 Jan 2007, Gary Mills wrote:
> no way to get the Netapp to behave like a JBOD. Are you saying that
> this configuration isn't going to work?

It'll work, but it may not be optimal.

--
Rich Teer, SCSA, SCNA, SCSECA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
On Jan 26, 2007, at 9:42, Gary Mills wrote:
> How does this work in an environment with storage that's centrally-
> managed and shared between many servers? I'm putting together a new
> IMAP server that will eventually use 3TB of space from our Netapp via
> an iSCSI SAN. The Netapp provides all of the disk management and
> redundancy that I'll ever need. The server will only see a virtual
> disk (a LUN). I want to use ZFS on that LUN because it's superior
> to UFS in this application, even without the redundancy. There's
> no way to get the Netapp to behave like a JBOD. Are you saying that
> this configuration isn't going to work?

It will work, but if the storage system corrupts the data, ZFS will be unable to correct it. It will detect the error.

A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksum will not match). But it can only correct the error if it manages the redundancy.

I would suggest exporting two LUNs from your central storage and let ZFS mirror them. You can get a wider range of space/performance tradeoffs if you give ZFS a JBOD, but that doesn't sound like an option.

--Ed
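Ed's suggestion in command form, as a sketch only (the pool and LUN names are hypothetical); each side of the mirror is a LUN exported by the array:

    # mirror two array LUNs; when a block fails its checksum, ZFS can
    # repair it from the other side of the mirror
    zpool create imappool mirror c2t0d0 c2t1d0
    zpool status imappool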
On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
> On Jan 26, 2007, at 9:42, Gary Mills wrote:
> > How does this work in an environment with storage that's centrally-
> > managed and shared between many servers?
>
> It will work, but if the storage system corrupts the data, ZFS will be
> unable to correct it. It will detect the error.
>
> A number that I've been quoting, albeit without a good reference, comes
> from Jim Gray, who has been around the data-management industry for
> longer than I have (and I've been in this business since 1970); he's
> currently at Microsoft. Jim says that the controller/drive subsystem
> writes data to the wrong sector of the drive without notice about once
> per drive per year. In a 400-drive array, that's once a day. ZFS will
> detect this error when the file is read (one of the blocks' checksum
> will not match). But it can only correct the error if it manages the
> redundancy.

Our Netapp does double-parity RAID. In fact, the filesystem design is remarkably similar to that of ZFS. Wouldn't that also detect the error? I suppose it depends if the 'wrong sector without notice' error is repeated each time. Or is it random?

> I would suggest exporting two LUNs from your central storage and let
> ZFS mirror them. You can get a wider range of space/performance
> tradeoffs if you give ZFS a JBOD, but that doesn't sound like an
> option.

That would double the amount of disk that we'd require. I am actually planning on using two iSCSI LUNs and letting ZFS stripe across them. When we need to expand the ZFS pool, I'd like to just expand the two LUNs on the Netapp. If ZFS won't accommodate that, I can just add a couple more LUNs. This is all convenient and easily manageable.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Gary Mills wrote:
> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>> On Jan 26, 2007, at 9:42, Gary Mills wrote:
>>> How does this work in an environment with storage that's centrally-
>>> managed and shared between many servers?
>> It will work, but if the storage system corrupts the data, ZFS will be
>> unable to correct it. It will detect the error.
>>
>> A number that I've been quoting, albeit without a good reference, comes
>> from Jim Gray, who has been around the data-management industry for
>> longer than I have (and I've been in this business since 1970); he's
>> currently at Microsoft. Jim says that the controller/drive subsystem
>> writes data to the wrong sector of the drive without notice about once
>> per drive per year. In a 400-drive array, that's once a day. ZFS will
>> detect this error when the file is read (one of the blocks' checksum
>> will not match). But it can only correct the error if it manages the
>> redundancy.

The quote from Jim seems to be related to the leaves of the tree (disks). Anecdotally, now that we have ZFS at the trunk, we're seeing that the branches are also corrupting data. We've speculated that it would occur, but now we can measure it, and it is non-zero. See Anantha's post for one such anecdote.

> Our Netapp does double-parity RAID. In fact, the filesystem design is
> remarkably similar to that of ZFS. Wouldn't that also detect the
> error? I suppose it depends if the 'wrong sector without notice'
> error is repeated each time. Or is it random?

We're having a debate related to this, data would be appreciated :-)
Do you get small, random read performance equivalent to N-2 spindles for an N-way double-parity volume?

>> I would suggest exporting two LUNs from your central storage and let
>> ZFS mirror them. You can get a wider range of space/performance
>> tradeoffs if you give ZFS a JBOD, but that doesn't sound like an
>> option.
>
> That would double the amount of disk that we'd require. I am actually
> planning on using two iSCSI LUNs and letting ZFS stripe across them.
> When we need to expand the ZFS pool, I'd like to just expand the two
> LUNs on the Netapp. If ZFS won't accommodate that, I can just add a
> couple more LUNs. This is all convenient and easily manageable.

Sounds reasonable to me :-)
 -- richard
Gary Mills wrote:
> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>> On Jan 26, 2007, at 9:42, Gary Mills wrote:
>>> How does this work in an environment with storage that's centrally-
>>> managed and shared between many servers?
>>
>> It will work, but if the storage system corrupts the data, ZFS will be
>> unable to correct it. It will detect the error.
>>
>> A number that I've been quoting, albeit without a good reference, comes
>> from Jim Gray, who has been around the data-management industry for
>> longer than I have (and I've been in this business since 1970); he's
>> currently at Microsoft. Jim says that the controller/drive subsystem
>> writes data to the wrong sector of the drive without notice about once
>> per drive per year. In a 400-drive array, that's once a day. ZFS will
>> detect this error when the file is read (one of the blocks' checksum
>> will not match). But it can only correct the error if it manages the
>> redundancy.
>
> Our Netapp does double-parity RAID. In fact, the filesystem design is
> remarkably similar to that of ZFS. Wouldn't that also detect the
> error? I suppose it depends if the 'wrong sector without notice'
> error is repeated each time.

If the wrong block is written by the controller then you're out of luck. The filesystem would read the incorrect block and ... who knows. That's why the ZFS checksums are important.
Wade.Stuart at fallon.com
2007-Jan-26 20:20 UTC
[zfs-discuss] Re: ZFS or UFS - what to do?
zfs-discuss-bounces at opensolaris.org wrote on 01/26/2007 01:43:35 PM:

> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
> > On Jan 26, 2007, at 9:42, Gary Mills wrote:
> > > How does this work in an environment with storage that's centrally-
> > > managed and shared between many servers?
> >
> > It will work, but if the storage system corrupts the data, ZFS will be
> > unable to correct it. It will detect the error.
> >
> > A number that I've been quoting, albeit without a good reference, comes
> > from Jim Gray, who has been around the data-management industry for
> > longer than I have (and I've been in this business since 1970); he's
> > currently at Microsoft. Jim says that the controller/drive subsystem
> > writes data to the wrong sector of the drive without notice about once
> > per drive per year. In a 400-drive array, that's once a day. ZFS will
> > detect this error when the file is read (one of the blocks' checksum
> > will not match). But it can only correct the error if it manages the
> > redundancy.
>
> Our Netapp does double-parity RAID. In fact, the filesystem design is
> remarkably similar to that of ZFS. Wouldn't that also detect the
> error? I suppose it depends if the 'wrong sector without notice'
> error is repeated each time. Or is it random?

I do not know; WAFL and the other portions of the NetApp backend are never really described in much technical detail -- even getting real IOPS numbers from them seems to be a hassle. Much magic -- little meat. To me, ZFS has very well defined behavior and methodology (you can even read the source to verify specifics), and this allows you to _know_ what the weak points are. NetApp, EMC and other disk vendors may have financial incentives for allowing edge cases such as the write hole or bit rot (x errors per disk are acceptable losses; after x errors, consider replacing the disk - a cost/benefit analysis - will customers actually know a bit is flipped?). In EMC's case it is very common for a disk to have multiple read/write errors before EMC will swap it out; they even use a substantial portion of the disk for replacement and parity bits (outside of RAID), so they offset or postpone the replacement costs onto the customer.

The most detailed description of WAFL I was able to find last time I looked was:
http://www.netapp.com/library/tr/3002.pdf

> > I would suggest exporting two LUNs from your central storage and let
> > ZFS mirror them. You can get a wider range of space/performance
> > tradeoffs if you give ZFS a JBOD, but that doesn't sound like an
> > option.
>
> That would double the amount of disk that we'd require. I am actually
> planning on using two iSCSI LUNs and letting ZFS stripe across them.
> When we need to expand the ZFS pool, I'd like to just expand the two
> LUNs on the Netapp. If ZFS won't accommodate that, I can just add a
> couple more LUNs. This is all convenient and easily manageable.

If you do have bit errors coming from the NetApp, ZFS will find them but will not be able to correct them in this case.

> --
> -Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
On Jan 26, 2007, at 12:13, Richard Elling wrote:
> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>> A number that I've been quoting, albeit without a good reference,
>> comes from Jim Gray, who has been around the data-management industry
>> for longer than I have (and I've been in this business since 1970);
>> he's currently at Microsoft. Jim says that the controller/drive
>> subsystem writes data to the wrong sector of the drive without notice
>> about once per drive per year. In a 400-drive array, that's once a
>> day. ZFS will detect this error when the file is read (one of the
>> blocks' checksum will not match). But it can only correct the error
>> if it manages the redundancy.
>
> The quote from Jim seems to be related to the leaves of the tree (disks).
> Anecdotally, now that we have ZFS at the trunk, we're seeing that the
> branches are also corrupting data. We've speculated that it would occur,
> but now we can measure it, and it is non-zero. See Anantha's post for
> one such anecdote.

Actually, Jim was referring to everything but the trunk. He didn't specify where from the HBA to the drive the error actually occurs. I don't think it really matters. I saw him give a talk a few years ago at the Usenix FAST conference; that's where I got this information.

--Ed
Chad Leigh -- Shire.Net LLC
2007-Jan-26 20:48 UTC
[zfs-discuss] Re: ZFS or UFS - what to do?
On Jan 26, 2007, at 12:05 PM, Ed Gould wrote:
> I would suggest exporting two LUNs from your central storage and
> let ZFS mirror them. You can get a wider range of space/
> performance tradeoffs if you give ZFS a JBOD, but that doesn't
> sound like an option.

I am doing something similar on a lower-end scale. I am using 2 Areca RAID-6 controllers, each with an 8-disk raid plus 1 hot spare, equal to 1.7TB. ZFS is being used to mirror them. Battery backed, with ECC on controller cache of at least 1GB. I am in the process of building this now.

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net
Ed Gould wrote:
> On Jan 26, 2007, at 12:13, Richard Elling wrote:
>> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>>> A number that I've been quoting, albeit without a good reference,
>>> comes from Jim Gray, who has been around the data-management industry
>>> for longer than I have (and I've been in this business since 1970);
>>> he's currently at Microsoft. Jim says that the controller/drive
>>> subsystem writes data to the wrong sector of the drive without notice
>>> about once per drive per year. In a 400-drive array, that's once a
>>> day. ZFS will detect this error when the file is read (one of the
>>> blocks' checksum will not match). But it can only correct the error
>>> if it manages the redundancy.
>
> Actually, Jim was referring to everything but the trunk. He didn't
> specify where from the HBA to the drive the error actually occurs. I
> don't think it really matters. I saw him give a talk a few years ago at
> the Usenix FAST conference; that's where I got this information.

So this leaves me wondering how often the controller/drive subsystem reads data from the wrong sector of the drive without notice; is it symmetrical with respect to writing, and thus about once a drive/year, or are there factors which change this?

Dana
On Jan 26, 2007, at 12:52, Dana H. Myers wrote:
> So this leaves me wondering how often the controller/drive subsystem
> reads data from the wrong sector of the drive without notice; is it
> symmetrical with respect to writing, and thus about once a drive/year,
> or are there factors which change this?

My guess is that it would be symmetric, but I don't really know.

--Ed
Dana H. Myers wrote:
> Ed Gould wrote:
>
>> On Jan 26, 2007, at 12:13, Richard Elling wrote:
>>
>>> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>>>
>>>> A number that I've been quoting, albeit without a good reference,
>>>> comes from Jim Gray, who has been around the data-management industry
>>>> for longer than I have (and I've been in this business since 1970);
>>>> he's currently at Microsoft. Jim says that the controller/drive
>>>> subsystem writes data to the wrong sector of the drive without notice
>>>> about once per drive per year. In a 400-drive array, that's once a
>>>> day. ZFS will detect this error when the file is read (one of the
>>>> blocks' checksum will not match). But it can only correct the error
>>>> if it manages the redundancy.
>
>> Actually, Jim was referring to everything but the trunk. He didn't
>> specify where from the HBA to the drive the error actually occurs. I
>> don't think it really matters. I saw him give a talk a few years ago at
>> the Usenix FAST conference; that's where I got this information.
>
> So this leaves me wondering how often the controller/drive subsystem
> reads data from the wrong sector of the drive without notice; is it
> symmetrical with respect to writing, and thus about once a drive/year,
> or are there factors which change this?

It's not symmetrical. Often it's a firmware bug. Other times a spurious event causes one block to be read/written instead of another one. (Alpha particles, anyone?)
Torrey McMahon wrote:
> Dana H. Myers wrote:
>> Ed Gould wrote:
>>
>>> On Jan 26, 2007, at 12:13, Richard Elling wrote:
>>>
>>>> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>>>>
>>>>> A number that I've been quoting, albeit without a good reference,
>>>>> comes from Jim Gray, who has been around the data-management industry
>>>>> for longer than I have (and I've been in this business since 1970);
>>>>> he's currently at Microsoft. Jim says that the controller/drive
>>>>> subsystem writes data to the wrong sector of the drive without notice
>>>>> about once per drive per year. In a 400-drive array, that's once a
>>>>> day. ZFS will detect this error when the file is read (one of the
>>>>> blocks' checksum will not match). But it can only correct the error
>>>>> if it manages the redundancy.
>>
>>> Actually, Jim was referring to everything but the trunk. He didn't
>>> specify where from the HBA to the drive the error actually occurs. I
>>> don't think it really matters. I saw him give a talk a few years ago at
>>> the Usenix FAST conference; that's where I got this information.
>>
>> So this leaves me wondering how often the controller/drive subsystem
>> reads data from the wrong sector of the drive without notice; is it
>> symmetrical with respect to writing, and thus about once a drive/year,
>> or are there factors which change this?
>
> It's not symmetrical. Often it's a firmware bug. Other times a spurious
> event causes one block to be read/written instead of another one. (Alpha
> particles, anyone?)

I would tend to expect these spurious events to impact reads and writes equally; more specifically, the chance of any one read or write being mis-addressed is about the same. Since, AFAIK, there are many more reads from a disk typically than writes, this would seem to suggest that there would be more mis-addressed reads in a drive/year than mis-addressed writes. Is this the reason for the asymmetry?

(I'm sure waving my hands here)

Dana
On Jan 26, 2007, at 13:16, Dana H. Myers wrote:
> I would tend to expect these spurious events to impact reads and writes
> equally; more specifically, the chance of any one read or write being
> mis-addressed is about the same. Since, AFAIK, there are many more reads
> from a disk typically than writes, this would seem to suggest that there
> would be more mis-addressed reads in a drive/year than mis-addressed
> writes. Is this the reason for the asymmetry?

Jim's "once per drive per year" number was not very precise. I took it to be just one significant digit. I don't recall if he distinguished reads from writes.

--Ed
it would be good to have real data and not only guesses or anecdotes

this story about wrong blocks being written by RAID controllers sounds like the anti-terrorism propaganda we are living in: exaggerate the facts to catch everyone's attention. It's going to take more than that to prove RAID ctrls have been doing a bad job for the last 30 years. Let's make up real stories with hard facts first.

s.

On 1/26/07, Ed Gould <Ed.Gould at sun.com> wrote:
> On Jan 26, 2007, at 12:52, Dana H. Myers wrote:
> > So this leaves me wondering how often the controller/drive subsystem
> > reads data from the wrong sector of the drive without notice; is it
> > symmetrical with respect to writing, and thus about once a drive/year,
> > or are there factors which change this?
>
> My guess is that it would be symmetric, but I don't really know.
>
> --Ed
On Jan 26, 2007, at 13:29, Selim Daoud wrote:
> it would be good to have real data and not only guesses or anecdotes

Yes, I agree. I'm sorry I don't have the data that Jim presented at FAST, but he did present actual data. Richard Elling (I believe it was Richard) has also posted some related data from ZFS experience to this list. There is more than just anecdotal evidence for this.

--Ed
> From: zfs-discuss-bounces at opensolaris.org
> [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Ed Gould
> Sent: Friday, January 26, 2007 3:38 PM
>
> Yes, I agree. I'm sorry I don't have the data that Jim presented at
> FAST, but he did present actual data. Richard Elling (I believe it
> was Richard) has also posted some related data from ZFS experience to this
> list.

This seems to be from Jim and on point:

http://www.usenix.org/event/fast05/tech/gray.pdf

paul
On Jan 26, 2007, at 13:53, Paul Fisher wrote:
> This seems to be from Jim and on point:
>
> http://www.usenix.org/event/fast05/tech/gray.pdf

Yes, thanks. That's the talk I was referring to. There's a reference in it to a Microsoft tech report with measurement data.

--Ed
Dana H. Myers wrote:
> Torrey McMahon wrote:
>> Dana H. Myers wrote:
>>> Ed Gould wrote:
>>>> On Jan 26, 2007, at 12:13, Richard Elling wrote:
>>>>> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>>>>>> A number that I've been quoting, albeit without a good reference,
>>>>>> comes from Jim Gray, who has been around the data-management industry
>>>>>> for longer than I have (and I've been in this business since 1970);
>>>>>> he's currently at Microsoft. Jim says that the controller/drive
>>>>>> subsystem writes data to the wrong sector of the drive without notice
>>>>>> about once per drive per year. In a 400-drive array, that's once a
>>>>>> day. ZFS will detect this error when the file is read (one of the
>>>>>> blocks' checksum will not match). But it can only correct the error
>>>>>> if it manages the redundancy.
>>>
>>>> Actually, Jim was referring to everything but the trunk. He didn't
>>>> specify where from the HBA to the drive the error actually occurs. I
>>>> don't think it really matters. I saw him give a talk a few years ago at
>>>> the Usenix FAST conference; that's where I got this information.
>>>
>>> So this leaves me wondering how often the controller/drive subsystem
>>> reads data from the wrong sector of the drive without notice; is it
>>> symmetrical with respect to writing, and thus about once a drive/year,
>>> or are there factors which change this?
>>
>> It's not symmetrical. Often it's a firmware bug. Other times a spurious
>> event causes one block to be read/written instead of another one. (Alpha
>> particles, anyone?)
>
> I would tend to expect these spurious events to impact reads and writes
> equally; more specifically, the chance of any one read or write being
> mis-addressed is about the same. Since, AFAIK, there are many more reads
> from a disk typically than writes, this would seem to suggest that there
> would be more mis-addressed reads in a drive/year than mis-addressed
> writes. Is this the reason for the asymmetry?
>
> (I'm sure waving my hands here)

For the spurious events, yes, I would expect things to be impacted symmetrically when it comes to errors during reads and errors during writes - that is, if you could figure out which spurious event occurred. In most cases the spurious errors are caught only at read time and you're left wondering. Was it an incorrect read? Was the data written incorrectly? You end up throwing your hands up and saying, "Let's hope that doesn't happen again." It's much easier to unearth a firmware bug in a particular disk drive operating in certain conditions and fix it.

Now that we're checksumming things I'd expect to find more errors, and hopefully be in a position to fix them, than we have in the past. We will also start getting customer complaints like, "We moved to ZFS and now we are seeing media errors more often. Why is ZFS broken?" This is similar to the StorADE issues we had in NWS - ahhh, the good old days - when we started doing a much better job of discovering issues and reporting them, where in the past we were blissfully silent. We used to have some data on that, with nice graphs, but I can't find them lying about.
Hi Jeff,

We're running a FLX210, which I believe is an Engenio 2884. In our case it also is attached to a T2000. ZFS has run VERY stably for us with data integrity issues at all.

We did have a significant latency problem caused by ZFS flushing the write cache on the array after every write, but that can be fixed by configuring your array to ignore cache flushes. The instructions for Engenio products are here: http://blogs.digitar.com/jjww/?itemid=44

We use the config for a production database, so I can't speak to the NFS issues. All I would mention is to watch the RAM consumption by ZFS.

Does anyone on the list have a recommendation for ARC sizing with NFS?

Best Regards,
Jason
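On the ARC-sizing question: one commonly mentioned approach is to cap the ARC with the zfs_arc_max tunable in /etc/system; whether the tunable is available, and what value makes sense, depends on the Solaris release, and the 4 GB figure below is only an illustration, not a recommendation:

    * /etc/system -- cap the ZFS ARC at 4 GB (value in bytes; takes effect at boot)
    set zfs:zfs_arc_max = 0x100000000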
Correction: "ZFS has run VERY stably for us with data integrity issues at all." should read "ZFS has run VERY stably for us with NO data integrity issues at all." On 1/26/07, Jason J. W. Williams <jasonjwwilliams at gmail.com> wrote:> Hi Jeff, > > We''re running a FLX210 which I believe is an Engenio 2884. In our case > it also is attached to a T2000. ZFS has run VERY stably for us with > data integrity issues at all. > > We did have a significant latency problem caused by ZFS flushing the > write cache on the array after every write, but that can be fixed by > configuring your array to ignore cache flushes. The instructions for > Engenio products are here: http://blogs.digitar.com/jjww/?itemid=44 > > We use the config for a production database, so I can''t speak to the > NFS issues. All I would mention is to watch the RAM consumption by > ZFS. > > Does anyone on the list have a recommendation for ARC sizing with NFS? > > Best Regards, > Jason > > > On 1/26/07, Jeffery Malloch <jeffery.malloch at lsi.com> wrote: > > Hi Folks, > > > > I am currently in the midst of setting up a completely new file server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) connected to an Engenio 6994 product (I work for LSI Logic so Engenio is a no brainer). I have configured a couple of zpools from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I then created sub zfs systems below that and set quotas and sharenfs''d them so that it appears that these "file systems" are dynamically shrinkable and growable. It looks very good... I can see the correct file system sizes on all types of machines (Linux 32/64bit and of course Solaris boxes) and if I resize the quota it''s picked up in NFS right away. But I would be the first in our organization to use this in an enterprise system so I definitely have some concerns that I''m hoping someone here can address. > > > > 1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares and write cache (8GB) has battery backup so I''m not too concerned from a hardware side. I''m looking for an idea of how stable ZFS itself is in terms of corruptability, uptime and OS stability. > > > > 2. Recommended config. Above, I have a fairly simple setup. In many of the examples the granularity is home directory level and when you have many many users that could get to be a bit of a nightmare administratively. I am really only looking for high level dynamic size adjustability and am not interested in its built in RAID features. But given that, any real world recommendations? > > > > 3. Caveats? Anything I''m missing that isn''t in the docs that could turn into a BIG gotchya? > > > > 4. Since all data access is via NFS we are concerned that 32 bit systems (Mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less then that. Can anyone comment? > > > > The bottom line is that with anything new there is cause for concern. Especially if it hasn''t been tested within our organization. But the convenience/functionality factors are way too hard to ignore. > > > > Thanks, > > > > Jeff > > > > > > This message posted from opensolaris.org > > _______________________________________________ > > zfs-discuss mailing list > > zfs-discuss at opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > >
On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>
> A number that I've been quoting, albeit without a good reference, comes
> from Jim Gray, who has been around the data-management industry for
> longer than I have (and I've been in this business since 1970); he's
> currently at Microsoft. Jim says that the controller/drive subsystem
> writes data to the wrong sector of the drive without notice about once
> per drive per year. In a 400-drive array, that's once a day. ZFS will
> detect this error when the file is read (one of the blocks' checksum
> will not match). But it can only correct the error if it manages the
> redundancy.

My only qualification to enter this discussion is that I once wrote a floppy disk format program for Minix. I recollect, however, that each sector on the disk is accompanied by a block that contains the sector address and a CRC. In order to write to the wrong sector, both of these items would have to be read incorrectly. Otherwise, the controller would never find the wrong sector. Are we just talking about a CRC failure here? That would be random, but the frequency of CRC errors would depend on the signal quality.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
On 26-Jan-07, at 7:29 PM, Selim Daoud wrote:
> it would be good to have real data and not only guesses or anecdotes
>
> this story about wrong blocks being written by RAID controllers
> sounds like the anti-terrorism propaganda we are living in: exaggerate
> the facts to catch everyone's attention. It's going to take more than
> that to prove RAID ctrls have been doing a bad job for the last 30 years

It does happen. Hard numbers are available if you look. This sounds a bit like the "RAID expert" I bumped into who just couldn't see the paradigm had shifted under him -- the implications of "end to end".

> Let's make up real stories with hard facts first.
> s.

Related links:

https://www.gelato.unsw.edu.au/archives/comp-arch/2006-September/003008.html

http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf
[A Fresh Look at the Reliability of Long-term Digital Storage, 2006]

http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf
[Challenges of Long-Term Digital Archiving: A Survey, 2006]

http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf
[IRON File Systems, 2006]

http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf
[Latent Sector Faults and Reliability of Disk Arrays, 1997]

--T

> On 1/26/07, Ed Gould <Ed.Gould at sun.com> wrote:
>> On Jan 26, 2007, at 12:52, Dana H. Myers wrote:
>>> So this leaves me wondering how often the controller/drive subsystem
>>> reads data from the wrong sector of the drive without notice; is it
>>> symmetrical with respect to writing, and thus about once a drive/year,
>>> or are there factors which change this?
>>
>> My guess is that it would be symmetric, but I don't really know.
>>
>> --Ed
> My only qualification to enter this discussion is that I once wrote a
> floppy disk format program for minix. I recollect, however, that each
> sector on the disk is accompanied by a block that contains the sector
> address and a CRC.

You'd have to define the layer you're talking about. I presume something like this occurs between a dumb disk and an intelligent controller, or even within the encoding parameters of a disk, but I don't think it does between, say, a SCSI/FC controller and a disk.

So if the drive itself put the head in the wrong sector, maybe it could figure that out. But perhaps the SCSI controller had a bug and sent the wrong address to the drive. I don't think there's anything at that layer that would notice (unless the application/file system is encoding intent into the data).

Corrections about my assumption with SCSI/FC/ATA appreciated.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
Toby Thain wrote:
>
> On 26-Jan-07, at 7:29 PM, Selim Daoud wrote:
>
>> it would be good to have real data and not only guesses or anecdotes
>>
>> this story about wrong blocks being written by RAID controllers
>> sounds like the anti-terrorism propaganda we are living in: exaggerate
>> the facts to catch everyone's attention. It's going to take more than
>> that to prove RAID ctrls have been doing a bad job for the last 30 years
>
> It does happen. Hard numbers are available if you look. This sounds a
> bit like the "RAID expert" I bumped into who just couldn't see the
> paradigm had shifted under him -- the implications of "end to end".

It happens. As long as we look at the numbers in context and don't run around going, "Hey... have you seen these numbers? What have we been doing for the last 35 years!?!?" we're OK.
> A number that I've been quoting, albeit without a good reference, comes
> from Jim Gray, who has been around the data-management industry for
> longer than I have (and I've been in this business since 1970); he's
> currently at Microsoft. Jim says that the controller/drive subsystem
> writes data to the wrong sector of the drive without notice about once
> per drive per year. In a 400-drive array, that's once a day. ZFS will
> detect this error when the file is read (one of the blocks' checksum
> will not match). But it can only correct the error if it manages the
> redundancy.

So now with ZFS, can anyone with a 400-drive array confirm that a "scrub" has to fix roughly one problem a day? (Or modify appropriately for whatever number of drives.)

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
> 1. How stable is ZFS?

It's a new file system; there will be bugs. It appears to be well-tested, though. There are a few known issues; for instance, a write failure can panic the system under some circumstances. UFS has known issues too....

> 2. Recommended config. Above, I have a fairly
> simple setup. In many of the examples the
> granularity is home directory level and when you have
> many many users that could get to be a bit of a
> nightmare administratively.

Do you need user quotas? If so, you need a file system per user with ZFS. That may be an argument against it in some environments, but in my experience it tends to be more important in academic settings than corporations.

> 4. Since all data access is via NFS we are concerned
> that 32 bit systems (Mainly Linux and Windows via
> Samba) will not be able to access all the data areas
> of a 2TB+ zpool even if the zfs quota on a particular
> share is less than that. Can anyone comment?

Not a problem. NFS doesn't really deal with volumes, just files, so the offsets are always file-relative and the volume can be as large as desired.

Anton

This message posted from opensolaris.org
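One way to sanity-check the 32-bit-client concern, as a rough sketch (the server, share and mount-point names are hypothetical):

    # on the server: cap the shared filesystem well below the pool size
    zfs set quota=1t tank/projects
    zfs set sharenfs=on tank/projects

    # on a 32-bit Linux client: the mount should report the quota,
    # not the multi-terabyte pool, as the filesystem size
    mount -t nfs server:/tank/projects /mnt/projects
    df -h /mnt/projects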
Selim Daoud wrote:
> it would be good to have real data and not only guesses or anecdotes
>
> this story about wrong blocks being written by RAID controllers
> sounds like the anti-terrorism propaganda we are living in: exaggerate
> the facts to catch everyone's attention. It's going to take more than
> that to prove RAID ctrls have been doing a bad job for the last 30 years.
> Let's make up real stories with hard facts first.

I have actual hard data, and bitter experience (from support calls), to back up the allegations that RAID controllers can and do write bad blocks. No, I cannot and will not provide specifics - I signed an NDA which expressly deals with confidentiality of customer information.

What I can say is that if we'd had ZFS to manage the filesystems in question, not only would we have detected the problem much earlier, but the flow-on effect to the end-users would have been much more easily managed.

James C. McPherson
--
Solaris kernel software engineer, system admin and troubleshooter
              http://www.jmcp.homeunix.com/blog
Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson
On Jan 26, 2007, at 14:05, Ed Gould wrote:
> It will work, but if the storage system corrupts the data, ZFS will
> be unable to correct it. It will detect the error.

Unless you turn checksumming off. From zfs(1M):

     checksum=on | off | fletcher2 | fletcher4 | sha256

         Controls the checksum used to verify data integrity. The
         default value is "on", which automatically selects an
         appropriate algorithm (currently, fletcher2, but this may
         change in future releases). The value "off" disables
         integrity checking on user data. Disabling checksums is
         NOT a recommended practice.
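For reference, a sketch of how the checksum setting and a verification pass are driven from the command line (the pool name is hypothetical); a scrub re-reads every allocated block and checks it against its checksum:

    # checksums are on by default; sha256 is a stronger (more expensive) choice
    zfs set checksum=sha256 tank

    # walk the whole pool and verify every block against its checksum
    zpool scrub tank
    zpool status -v tank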
On Jan 26, 2007, at 14:43, Gary Mills wrote:
> Our Netapp does double-parity RAID. In fact, the filesystem design is
> remarkably similar to that of ZFS. Wouldn't that also detect the
> error? I suppose it depends if the 'wrong sector without notice'
> error is repeated each time. Or is it random?

On most (all?) other systems the parity only comes into effect when a drive fails. When all the drives are reporting "OK", most (all?) RAID systems don't use the parity data at all. ZFS is the first (only?) system that actively checks the data returned from disk, regardless of whether the drives are reporting they're okay or not.

I'm sure I'll be corrected if I'm wrong. :)
I'm not sure what benefit you foresee in running a COW filesystem (ZFS) on a COW array (NetApp).

Back to regularly scheduled programming: I still say you should let ZFS manage JBOD-type storage. I can personally recount the horror of relying upon an intelligent storage array (an EMC DMX3500 in our case). We had in-flight data corruption that EMC faithfully wrote, just like NetApp would in your case. Everybody is assuming that corruption or data loss occurs only on disks; it can happen everywhere. In a datacenter SAN you've so many more paths that can introduce data corruption. Hence the need for ensuring data integrity closest to the use of data, namely ZFS.

ZFS will not stop alpha particle induced memory corruption after data has been received by the server and verified to be correct. Sadly I've been hit with that as well.

This message posted from opensolaris.org
On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote:
> We had in-flight data corruption that EMC faithfully wrote, just
> like NetApp would in your case. Everybody is assuming that
> corruption or data loss occurs only on disks; it can happen
> everywhere. In a datacenter SAN you've so many more paths that can
> introduce data corruption. Hence the need for ensuring data
> integrity closest to the use of data, namely ZFS.

Now how do we get this message out there and understood, fellow evangelicals? :)

--Toby

> ZFS will not stop alpha particle induced memory corruption after
> data has been received by the server and verified to be correct. Sadly
> I've been hit with that as well.
On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote:
> ... ZFS will not stop alpha particle induced memory corruption
> after data has been received by the server and verified to be correct.
> Sadly I've been hit with that as well.

My brother points out that you can use a rad hardened CPU. ECC should take care of the RAM. :-)

I wonder when the former will become data centre best practice?

--Toby
On Sat, Jan 27, 2007 at 04:15:30PM -0800, Anantha N. Srirama wrote:
>
> I'm not sure what benefit you foresee in running a COW filesystem
> (ZFS) on a COW array (NetApp).

Assuming that that question was addressed to me, the primary feature that I need from ZFS is snapshots. The Netapp has snapshots too, but they are done by disk blocks since, for an iSCSI LUN, the Netapp has no concept of files. ZFS snapshots allow restore of individual files when users accidentally delete them. As well, I do need a filesystem of some sort on the iSCSI LUN. If ZFS is superior to UFS in this application, I'd like to use it.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
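A sketch of that single-file restore workflow (the filesystem, snapshot and file names here are hypothetical):

    # periodic snapshot of the mail filesystem
    zfs snapshot tank/imap@2007-01-28

    # a deleted file can be copied back out of the read-only snapshot,
    # which is reachable under the hidden .zfs directory
    cp /tank/imap/.zfs/snapshot/2007-01-28/users/jdoe/mbox \
       /tank/imap/users/jdoe/mbox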
Casper.Dik at Sun.COM
2007-Jan-28 09:59 UTC
[zfs-discuss] Re: Re: ZFS or UFS - what to do?
>On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote:
>
>> ... ZFS will not stop alpha particle induced memory corruption
>> after data has been received by the server and verified to be correct.
>> Sadly I've been hit with that as well.
>
>My brother points out that you can use a rad hardened CPU. ECC should
>take care of the RAM. :-)
>
>I wonder when the former will become data centre best practice?

Alpha particles which "hit" CPUs must have their origin inside said CPU.

(Alpha particles do not penetrate skin or paper, let alone system cases or CPU packaging.)

Casper
Casper.Dik at Sun.COM wrote:
>> On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote:
>>
>>> ... ZFS will not stop alpha particle induced memory corruption
>>> after data has been received by the server and verified to be correct.
>>> Sadly I've been hit with that as well.
>>
>> My brother points out that you can use a rad hardened CPU. ECC should
>> take care of the RAM. :-)
>>
>> I wonder when the former will become data centre best practice?
>
> Alpha particles which "hit" CPUs must have their origin inside said CPU.
>
> (Alpha particles do not penetrate skin or paper, let alone system cases
> or CPU packaging.)
>
> Casper

But, but, but, they'll get my brain without this nice shiny aluminum cap I made!

Cosmic (aka Gamma) Radiation, folks.

And, I think we've jumped the shark.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Casper.Dik at Sun.COM wrote:

> Alpha particles which "hit" CPUs must have their origin inside said CPU.
>
> (Alpha particles do not penetrate skin or paper, let alone system cases or CPU packaging.)

Gamma rays cannot be shielded in any sensible way.

Jörg

-- EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
On 28-Jan-07, at 7:59 AM, Casper.Dik at Sun.COM wrote:> >> >> On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote: >> >>> ... ZFS will not stop alpha particle induced memory corruption >>> after data has been received by server and verified to be correct. >>> Sadly I''ve been hit with that as well. >> >> >> My brother points out that you can use a rad hardened CPU. ECC should >> take care of the RAM. :-) >> >> I wonder when the former will become data centre best practice? > > Alpha particles which "hit" CPUs must have their origin inside said > CPU. > > (Alpha particles do not penentrate skin, paper, let alone system cases > or CPU packagaging)Thanks. But what about cosmic rays? --T> > Casper
Casper.Dik at Sun.COM
2007-Jan-28 13:50 UTC
[zfs-discuss] Re: Re: ZFS or UFS - what to do?
>On 28-Jan-07, at 7:59 AM, Casper.Dik at Sun.COM wrote:
>
>> Alpha particles which "hit" CPUs must have their origin inside said CPU.
>>
>> (Alpha particles do not penetrate skin or paper, let alone system cases or CPU packaging.)
>
>Thanks. But what about cosmic rays?

I was just in pedantic mode; "cosmic rays" is the term covering all the different particles, including alpha, beta and gamma rays.

Alpha rays don't reach us from the "cosmos"; they are caught long before they can do any harm. Ditto beta rays. Both have an electrical charge that makes passing through magnetic fields or through materials difficult. Both do exist "in the free" but are commonly caused by slow radioactive decay of our natural environment.

Gamma rays are photons with high energy; they are not captured by magnetic fields (such as those in atoms, from electrons and protons). They need to take a direct hit before they're stopped; they can only be stopped by dense materials, such as lead. Unfortunately, naturally occurring lead is polluted by polonium and uranium and is an alpha/beta source in its own right. That's why 100-year-old lead from roofs is worth more money than new lead: its radioisotopes have been depleted.

Casper
Anantha N. Srirama
2007-Jan-28 14:19 UTC
[zfs-discuss] Re: Re: Re: ZFS or UFS - what to do?
You're right that storage-level snapshots are filesystem agnostic. I'm not sure why you believe you won't be able to restore individual files by using a NetApp snapshot. In the case of ZFS you'd take a periodic snapshot and use it to restore files; in the case of NetApp you can do the same (of course you have the additional step of mounting the new snapshot volume). Is this convenience tipping the scales for you to pursue ZFS?
On Sat, Jan 27, 2007 at 04:15:30PM -0800, Anantha N. Srirama wrote:

> I'm not sure what benefit you foresee by running a COW filesystem (ZFS) on a COW array (NetApp).

The application requires a filesystem with POSIX semantics. My first choice would be NFS from the Netapp, but this won't work in this case. My next choice is an iSCSI LUN with a local filesystem on it. I'm assuming that since ZFS is more modern than UFS, ZFS would be the better of the two, even though the JBOD-oriented features of ZFS will not be used.

ZFS does seem to be more manageable than UFS. Filesystems that draw their space from a common pool are ideal for our application. The ability to expand a pool by adding another device, or by extending an existing device, is also ideal. Another feature is snapshots, which I've mentioned earlier.

-- -Gary Mills- -Unix Support- -U of M Academic Computing and Networking-
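For what it's worth, growing a pool by adding another LUN later is a one-liner; a rough sketch with a made-up pool and device name:

  # zpool add tank c6t20d0     (the new LUN becomes another top-level stripe in the pool)
  # zpool list tank            (the extra capacity shows up immediately)

Every filesystem in the pool sees the new space right away, since they all draw from the common pool.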
On Sun, Jan 28, 2007 at 06:19:25AM -0800, Anantha N. Srirama wrote:

> You're right that storage-level snapshots are filesystem agnostic. I'm not sure why you believe you won't be able to restore individual files by using a NetApp snapshot. In the case of ZFS you'd take a periodic snapshot and use it to restore files; in the case of NetApp you can do the same (of course you have the additional step of mounting the new snapshot volume). Is this convenience tipping the scales for you to pursue ZFS?

Yes, we'd run out of LUNs. We're talking about two weeks of daily snapshots on six filesystems. Each snapshot on the Netapp would become a separate iSCSI LUN. They need to be mounted on the server so that our admins can locate and restore missing files when necessary.

-- -Gary Mills- -Unix Support- -U of M Academic Computing and Networking-
Hello Anantha, Friday, January 26, 2007, 5:06:46 PM, you wrote:

ANS> All my feedback is based on Solaris 10 Update 2 (aka 06/06) and I've no comments on NFS. I strongly recommend that you use ZFS data redundancy (z1, z2, or mirror) and simply delegate the Engenio to stripe the data for performance.

Striping on an array and then doing redundancy with ZFS has at least one drawback - what happens if one of the disks fails? You've got to replace the bad disk, re-create the stripe on the array, and resilver in ZFS (or stay on the hot spare). A lot of hassle.

-- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
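To put that in concrete terms, once the array has rebuilt the stripe (or you are ready to move off the hot spare), the ZFS side of the dance is roughly this (pool and device names made up):

  # zpool status -x                (shows the pool DEGRADED and which vdev is affected)
  # zpool replace tank c6t20d0     (resilvers onto the repaired or replacement LUN)
  # zpool status tank              (watch the resilver progress)

It works, it is just more steps than letting a single layer own the redundancy.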
Hello Francois, Friday, January 26, 2007, 4:09:43 PM, you wrote:

FD> On Fri, 2007-01-26 at 06:16 -0800, Jeffery Malloch wrote:
>> 1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares
FD> That partly defeats the purpose of ZFS. ZFS offers raid-z and raid-z2 (double parity) with all the advantages of raid-5 or raid-6 but without several of the raid-5 issues. It also has features that a raid-5 controller could never do: ensure data integrity from the kernel to the disk, and self correction.

Not always true. Actually you can get much more performance for some workloads doing raid-5 in HW than raid-z. Also, with some other entry-level arrays there are limits on how many LUNs can be presented, and you actually can't expose every disk as its own LUN because of that limit (yes, Sun's 3510).

>> and write cache (8GB) has battery backup so I'm not too concerned from a hardware side.
FD> Whereas the cache/battery backup is a requirement if you run raid-5, it is not for zfs.

Still, it doesn't mean it won't help for some workloads.

>> 2. Recommended config.
FD> The most reliable setup is a JBOD + zfs. But if you have cache, on your box, there might be some magic setup you have to do for that box, and I'm sure somebody on the list will help you with that. I dont have an Engenio.

I would argue with this. No matter what, you still get a less reliable setup using ZFS on top of a simple JBOD than with a Symmetrix box. It's just that in many cases the simple JBOD can be good enough. There's a workaround for Engenio devices.

-- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
Agreed, I guess I didn't articulate my point/thought very well. The best config is to present JBODs and let ZFS provide the data protection. This has been a very stimulating conversation thread; it is shedding new light on how best to use ZFS.
On January 28, 2007 7:57:31 PM -0800 "Anantha N. Srirama" <anantha.srirama at cdc.hhs.gov> wrote:> Agreed, I guess I didn''t articulate my point/thought very well. The best > config is to present JBoDs and let ZFS provide the data protection. This > has been a very stimulating conversation thread; it is shedding new light > into how to best use ZFS.Actually it depends on the workload. Best is a very loaded word. -frank
Anantha N. Srirama writes:

> Agreed, I guess I didn't articulate my point/thought very well. The best config is to present JBODs and let ZFS provide the data protection. This has been a very stimulating conversation thread; it is shedding new light on how best to use ZFS.

I would say: to enable the unique ZFS feature of self-healing, ZFS must be allowed to manage a level of redundancy: mirroring or raid-z. The type of LUNs used (JBOD/Raid-*/iSCSI) is not relevant to this statement.

Now, if one also relies on ZFS to reconstruct data in the face of disk failures (as opposed to storage-based reconstruction), better make sure that single/double disk failures do not bring down multiple LUNs at once. So better protection is achieved by configuring LUNs that map to segregated sets of physical things (disks & controllers).

-r
Hi All, In my test setup I have one zpool of size 1000 MB which has only 30 MB of free space (970 MB is used for something else). On this zpool I created one file (using the open() call) and attempted to write 2 MB of data to it (with the write() call), but it failed: only 1.3 MB was written (the return value of write()), because of "No space left on device". After that I tried to truncate this file to 1.3 MB, but that is failing too. Any clues on this? -Masthan
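A minimal way to poke at this kind of corner case is a throwaway pool backed by a file; a rough sketch (paths and sizes made up):

  # mkfile 128m /var/tmp/vdev0
  # zpool create testpool /var/tmp/vdev0
  # dd if=/dev/zero of=/testpool/fill bs=1024k    (runs until write() starts failing with ENOSPC)

One thing to keep in mind with a copy-on-write filesystem is that shrinking or deleting a file still has to write new metadata before the old blocks are freed, so a truncate on a completely full pool can itself fail with ENOSPC; whether that is what is happening here would need a truss of the failing ftruncate() to confirm.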
> > Our Netapp does double-parity RAID. In fact, the filesystem design is remarkably similar to that of ZFS. Wouldn't that also detect the error? I suppose it depends if the `wrong sector without notice' error is repeated each time. Or is it random?
>
> On most (all?) other systems the parity only comes into effect when a drive fails. When all the drives are reporting "OK" most (all?) RAID systems don't use the parity data at all. ZFS is the first (only?) system that actively checks the data returned from disk, regardless of whether the drives are reporting they're okay or not.
>
> I'm sure I'll be corrected if I'm wrong. :)

Netapp/OnTAP does do read verification, but it does it outside the raid-4/raid-dp protection (just like ZFS does it outside the raidz protection). So it's correct that the parity data is not read at all in either OnTAP or ZFS, but both attempt to do verification of the data on all reads.

See also: http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data for a few more specifics on it and the differences from the ZFS data check.

-- Darren Dunham ddunham at taos.com Senior Technical Consultant TAOS http://www.taos.com/ Got some Dr Pepper? San Francisco, CA bay area < This line left intentionally blank to confuse you. >
On Jan 26, 2007, at 09:16, Jeffery Malloch wrote:> Hi Folks, > > I am currently in the midst of setting up a completely new file > server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) > connected to an Engenio 6994 product (I work for LSI Logic so > Engenio is a no brainer). I have configured a couple of zpools > from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I > then created sub zfs systems below that and set quotas and > sharenfs''d them so that it appears that these "file systems" are > dynamically shrinkable and growable.ah - the 6994 is the controller we use in the 6140/6540 if i''m not mistaken .. i guess this thread will go down in a flaming JBOD vs RAID controller religious war again .. oops, too late :P yes - the dynamic LUN expansion bits in ZFS is quite nice and handy for managing dynamic growth of a pool or file system. so going back to Jeffery''s original questions:> > 1. How stable is ZFS? The Engenio box is completely configured > for RAID5 with hot spares and write cache (8GB) has battery backup > so I''m not too concerned from a hardware side. I''m looking for an > idea of how stable ZFS itself is in terms of corruptability, uptime > and OS stability.I think the stability issue has already been answered pretty well .. 8GB battery backed cache is nice .. performance wise you might find some odd interactions with the ZFS adaptive cache integration and the way in which the intent log operates (O_DSYNC writes can potentially impose a lot of in flight commands for relatively little work) - there''s a max blocksize of 128KB (also maxphys), so you might want to experiment with tuning back the stripe width .. i seem to recall the the 6994 controller seemed to perform best with 256KB or 512KB stripe width .. so there may be additional tuning on the read-ahead or write- behind algorithms.> 2. Recommended config. Above, I have a fairly simple setup. In > many of the examples the granularity is home directory level and > when you have many many users that could get to be a bit of a > nightmare administratively. I am really only looking for high > level dynamic size adjustability and am not interested in its built > in RAID features. But given that, any real world recommendations?Not being interested in the RAID functionality as Roch points out eliminates the self-healing functionality and reconstruction bits in ZFS .. but you still get other nice benefits like dynamic LUN expansion As i see it, since we seem to have excess CPU and bus capacity on newer systems (most applications haven''t quite caught up to impose enough of a load yet) .. we''re back to the mid ''90s where host based volume management and caching makes sense and is being proposed again. Being proactive, we might want to consider putting an embedded Solaris/ZFS on a RAID controller to see if we''ve really got something novel in the caching and RAID algorithms for when the application load really does catch up and impose more of a load on the host. Additionally - we''re seeing that there''s a big benefit in moving the filesystem closer to the storage array since most users care more about their consistency of their data (upper level) than the reliability of the disk subsystem or RAID controller. Implementing a RAID controller that''s more intimately aware of the upper data levels seems like the next logical evolutionary step.> 3. Caveats? 
> Anything I'm missing that isn't in the docs that could turn into a BIG gotchya?

I would say be careful of the ease with which you can destroy file systems and pools .. while convenient - there's typically no warning if you or an administrator does a zfs or zpool destroy .. so i could see that turning into an issue. Also if a LUN goes offline, you may not see this right away and you would have the potential to corrupt your pool or panic your system. Hence the self-healing and scrub options to detect and repair failures a little bit faster. People on this forum have been finding RAID controller inconsistencies .. hence the religious JBOD vs RAID ctlr "disruptive paradigm shift"

> 4. Since all data access is via NFS we are concerned that 32 bit systems (Mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less than that. Can anyone comment?

Doing 2TB+ shouldn't be a problem for the NFS or Samba mounted filesystem regardless of whether the host is 32bit or not. The only place where you can run into a problem is if the size of an individual file crosses 2 or 4TB on a 32bit system. I know we've implemented file systems (QFS in this case) that were samba-shared to 32bit windows hosts in excess of 40-100TB without any major issues. I'm sure there are similar cases with ZFS and thumper .. i just don't have that data.

a little late to the discussion, but hth

--- .je
Hi Guys,

SO...

From what I can tell from this thread, ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect. So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures you risk a) bad writes that don't get picked up due to corruption from write cache to disk, and b) failures due to data changes that ZFS is unaware of, which the hardware imposes when it tries to fix itself.

So now I have a $70K+ lump that's useless for what it was designed for. I should have spent $20K on a JBOD. But since I didn't do that, it sounds like a traditional model works best (ie. UFS et al) for the type of hardware I have. No sense paying for something and not using it. And by using ZFS just as a method for ease of file system growth and management, I risk much more corruption.

The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another.

Comments?

Thanks,

Jeff

PS - the responses so far have been great and are much appreciated! Keep 'em coming...
Hi Jeff,

Maybe I mis-read this thread, but I don't think anyone was saying that using ZFS on top of an intelligent array risks more corruption. Given my experience, I wouldn't run ZFS without some level of redundancy, since it will panic your kernel in a RAID-0 scenario where it detects a LUN is missing and can't fix it. That being said, I wouldn't run anything but ZFS anymore. When we had some database corruption issues awhile back, ZFS made it very simple to prove it was the DB. Just did a scrub and boom, verification that the data was laid down correctly.

RAID-5 will have better random read performance than RAID-Z for reasons Robert had to beat into my head. ;-) But if you really need that performance, perhaps RAID-10 is what you should be looking at? Someone smarter than I can probably give a better idea.

Regarding the failure detection, has anyone on the list had the ZFS/FMA traps fed into a network management app yet? I'm curious what the experience with it is?

Best Regards, Jason

On 1/29/07, Jeffery Malloch <jeffery.malloch at lsi.com> wrote:
> From what I can tell from this thread, ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect. [...]
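For reference, the verification Jason describes is just a scrub plus a status check (pool name made up):

  # zpool scrub tank
  # zpool status -v tank     (shows scrub progress and, afterwards, any checksum errors found)

On a redundant pool the scrub repairs what it can; on a non-redundant one it at least tells you what is damaged.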
On Jan 29, 2007, at 14:17, Jeffery Malloch wrote:
> The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another.
>
> Comments?

I put together this chart a while back .. i should probably update it for RAID6 and RAIDZ2

 #  ZFS  ARRAY HW     CAPACITY   COMMENTS
--  ---  --------     --------   --------
 1  R0   R1           N/2        hw mirror - no zfs healing
 2  R0   R5           N-1        hw R5 - no zfs healing
 3  R1   2 x R0       N/2        flexible, redundant, good perf
 4  R1   2 x R5       (N/2)-1    flexible, more redundant, decent perf
 5  R1   1 x R5       (N-1)/2    parity and mirror on same drives (XXX)
 6  RZ   R0           N-1        standard RAID-Z no mirroring
 7  RZ   R1 (tray)    (N/2)-1    RAIDZ+1
 8  RZ   R1 (drives)  (N/2)-1    RAID1+Z (highest redundancy)
 9  RZ   3 x R5       N-4        triple parity calculations (XXX)
10  RZ   1 x R5       N-2        double parity calculations (XXX)

(note: I included the cases where you have multiple arrays with a single lun per vdisk (say) and where you only have a single array split into multiple LUNs.)

The way I see it, you're better off picking either controller parity or zfs parity .. there's no sense in computing parity multiple times unless you have cycles to spare and don't mind the performance hit .. so the questions you should really answer before you choose the hardware are: what level of redundancy-to-capacity balance do you want? and do you want to compute RAID in ZFS host memory or out on a dedicated blackbox controller?

I would say something about double caching too, but I think that's moot since you'll always cache in the ARC if you use ZFS the way it's currently written.

Other feasible filesystem options for Solaris - UFS, QFS, or vxfs with SVM or VxVM for volume mgmt if you're so inclined .. all depends on your budget and application. There's currently tradeoffs in each one, and contrary to some opinions, the death of any of these has been grossly exaggerated.

--- .je
On Mon, Jan 29, 2007 at 11:17:05AM -0800, Jeffery Malloch wrote:> From what I can tell from this thread ZFS if VERY fussy about > managing writes,reads and failures. It wants to be bit perfect. So > if you use the hardware that comes with a given solution (in my case > an Engenio 6994) to manage failures you risk a) bad writes that > don''t get picked up due to corruption from write cache to disk b) > failures due to data changes that ZFS is unaware of that the > hardware imposes when it tries to fix itself. > > So now I have a $70K+ lump that''s useless for what it was designed > for. I should have spent $20K on a JBOD. But since I didn''t do > that, it sounds like a traditional model works best (ie. UFS et al) > for the type of hardware I have. No sense paying for something and > not using it. And by using ZFS just as a method for ease of file > system growth and management I risk much more corruption.Well, ZFS with HW RAID makes sense in some cases. However, it seems that if you are unwilling to lose 50% disk space to RAID 10 or two mirrored HW RAID arrays, you either use RAID 0 on the array with ZFS RAIDZ/RAIDZ2 on top of that or a JBOD with ZFS RAIDZ/RAIDZ2 on top of that. -- albert chin (china at thewrittenword.com)
On January 29, 2007 11:17:05 AM -0800 Jeffery Malloch <jeffery.malloch at lsi.com> wrote:

>> From what I can tell from this thread ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect.

It's funny to call that "fussy". All filesystems WANT to be bit perfect; zfs actually does something to ensure it.

>> So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures you risk a) bad writes that don't get picked up due to corruption from write cache to disk

You would always have that problem, JBOD or RAID. There are many places data can get corrupted, not just in the RAID write cache. zfs will correct it, or at least detect it, depending on your configuration.

>> b) failures due to data changes that ZFS is unaware of that the hardware imposes when it tries to fix itself.

If that happens, you will be lucky to have ZFS to fix it. If the array changes data, it is broken. This is not the same thing as correcting data.

> The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another.

If you need per-user quotas, zfs might not be a good fit. (In many cases per-filesystem quotas can be used effectively, though.) If you need NFS clients to traverse mount points on the server (eg /home/foo), then this won't work yet. Then again, does this work with UFS either? Seems to me it wouldn't. The difference is that zfs encourages you to create more filesystems. But you don't have to.

If you have an application that is very highly tuned for a specific filesystem (e.g. UFS with directio), you might not want to replace it with zfs. If you need incremental restore, you might need to stick with UFS. (snapshots might be enough for you though)

-frank
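For what it's worth, the per-filesystem-quota route Frank mentions is only a few commands per user, so it scripts easily; a rough sketch with made-up names:

  # zfs create tank/home/alice
  # zfs set quota=10G tank/home/alice
  # zfs set sharenfs=on tank/home/alice

The administrative pain is less about creating the filesystems and more about the client-side mount handling, which is where the mount-point traversal limitation above comes in.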
Albert Chin said:

> Well, ZFS with HW RAID makes sense in some cases. However, it seems that if you are unwilling to lose 50% disk space to RAID 10 or two mirrored HW RAID arrays, you either use RAID 0 on the array with ZFS RAIDZ/RAIDZ2 on top of that or a JBOD with ZFS RAIDZ/RAIDZ2 on top of that.

I've been re-evaluating our local decision on this question (how to lay out ZFS on pre-existing RAID hardware). In our case, the array does not allow RAID-0 of any type, and we're unwilling to give up the expensive disk space to a mirrored configuration. In fact, in our last decision, we came to the conclusion that we didn't want to layer RAID-Z on top of HW RAID-5, thinking that the added loss of space is too high, given any of the "XXX" layouts in Jonathan Edwards' chart:

>  #  ZFS  ARRAY HW  CAPACITY  COMMENTS
> --  ---  --------  --------  --------
>  . . .
>  5  R1   1 x R5    (N-1)/2   parity and mirror on same drives (XXX)
>  9  RZ   3 x R5    N-4       triple parity calculations (XXX)
>  . . .
> 10  RZ   1 x R5    N-2       double parity calculations (XXX)

So, we ended up (some months ago) deciding to go with only HW RAID-5, using ZFS to stripe together large-ish LUNs made up of independent HW RAID-5 groups. We'd have no ZFS redundancy, but at least ZFS would catch any corruption that may come along. We can restore individual corrupted files from tape backups (which we're already doing anyway), if necessary.

However, given that the default behavior of ZFS (as of Solaris-10U3) is to panic/halt when it encounters a corrupted block that it can't repair, I'm re-thinking our options, weighing against the possibility of a significant downtime caused by a single-block corruption.

Today I've been pondering a variant of #10 above, the variation being to slice a RAID-5 volume into more than N LUNs, i.e. LUNs smaller than the size of the individual disks that make up the HW R5 volume. A larger number of small LUNs results in less space given up to ZFS parity, which is nice when overall disk space is important to us. We're not expecting RAID-Z across these LUNs to make it possible to survive failure of a whole disk; rather, we only "need" RAID-Z to repair the occasional block corruption, in the hope that this might head off the need to restore a whole multi-TB pool. We'll rely on the HW RAID-5 to protect against whole-disk failure.

Just thinking out loud here. Now I'm off to see what kind of performance cost there is, comparing (with 400GB disks):

  Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume
  8+1 RAID-Z on 9 244.2GB LUNs from a 6+1 HW RAID5 volume

Regards, Marion
On 29/01/2007, at 12:50 AM, Casper.Dik at Sun.COM wrote:> >> >> On 28-Jan-07, at 7:59 AM, Casper.Dik at Sun.COM wrote: >> >>> >>>> >>>> On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote: >>>> >>>>> ... ZFS will not stop alpha particle induced memory corruption >>>>> after data has been received by server and verified to be correct. >>>>> Sadly I''ve been hit with that as well. >>>> >>>> >>>> My brother points out that you can use a rad hardened CPU. ECC >>>> should >>>> take care of the RAM. :-) >>>> >>>> I wonder when the former will become data centre best practice? >>> >>> Alpha particles which "hit" CPUs must have their origin inside said >>> CPU. >>> >>> (Alpha particles do not penentrate skin, paper, let alone system >>> cases >>> or CPU packagaging) >> >> Thanks. But what about cosmic rays? > > > I was just in pedantic mode; "cosmic rays" is the term covering > all different particles, including alpha, beta and gamma rays. > > Alpha rays don''t reach us from the "cosmos"; they are caught > long before they can do any harm. Ditto beta rays. Both have > an electrical charge that makes passing magnetic fields or passing > through materials difficult. Both do exist "in the free" but are > commonly caused by slow radioactive decay of our natural environment. > > Gamma rays are photons with high energy; they are not capture by > magnetic fields (such as those existing in atoms: electons, protons). > They need to take a direct hit before they''re stopped; they can only > be stopped by dense materials, such as lead. Unfortunately, natural > occuring lead is polluted by pollonium and uranium and is an alpha/ > beta > source in its own right. That''s why 100 year old lead from roofs is > worth more money than new lead: it''s radioisotopes have been depleted.<ludicrous_topic_drift> Ok, I''ll bite. It''s been a long day, so that may be why I can''t see why the radioisotopes in lead that was dug up 100 years ago would be any more depleted than the lead that sat in the ground for the intervening 100 years. Half-life is half-life, no? Now if it were something about the modern extraction process that added contaminants, then I can see it. </ludicrous_topic_drift>
Casper.Dik at Sun.COM
2007-Jan-30 08:22 UTC
[zfs-discuss] Re: Re: ZFS or UFS - what to do?
>Ok, I'll bite. It's been a long day, so that may be why I can't see why the radioisotopes in lead that was dug up 100 years ago would be any more depleted than the lead that sat in the ground for the intervening 100 years. Half-life is half-life, no?
>
>Now if it were something about the modern extraction process that added contaminants, then I can see it.

In nature, lead is found in deposits with trace amounts of other heavy radionuclides (U235/238/Th232). These are removed in processing, but one of their decay products is Pb-210. Pb-210 cannot be chemically removed from lead (lead consists mostly of the stable isotopes Pb-204/206/207/208). New lead may also contain trace amounts of Polonium-210.

So lead, when mined, has trace amounts of radioactive Pb-210; as the half-life of Pb-210 is only 22 years, it's fairly radioactive but also decays rapidly (1/32 of the radiation left after 100 years, 1/1000th after 200).

Casper
I wrote:
> Just thinking out loud here. Now I'm off to see what kind of performance cost there is, comparing (with 400GB disks):
>   Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume
>   8+1 RAID-Z on 9 244.2GB LUNs from a 6+1 HW RAID5 volume

Richard.Elling at Sun.COM said:
> Interesting idea. Please post back to let us know how the performance looks.

The short story is, performance is not bad with the raidz arrangement until you get to doing reads, at which point it looks much worse than the 1-LUN setup.

Please bear in mind that I'm not a storage nor benchmarking expert, though I'd say I'm not a neophyte either.

Some specifics:

The array is a low-end Hitachi, 9520V. My two test subjects are a pair of RAID-5 groups in the same shelf, each consisting of 6D+1P 400GB SATA drives. The test host is a Sun T2000, 16GB RAM, connected via 2Gb FC links through a pair of switches (the array/mpxio combination do not support load-balancing, so only one 2Gb channel is in use at a time). It is running Solaris-10U3, patches current as of 12-Jan-2007.

The array was mostly idle except for my tests, although some light I/O to other shelves may have come from another host on occasion. The test host wasn't doing anything else during these tests.

One RAID-5 group was configured as a single 2048GB LUN (with about 150GB left unallocated; the array has a max LUN size). The second RAID-5 group was set up as nine 244.3GB LUNs.

Here are the zpool configurations I used for these tests:

# zpool status -v
  pool: bulk_sp1
 state: ONLINE
 scrub: none requested
config:

        NAME                                            STATE   READ WRITE CKSUM
        bulk_sp1                                        ONLINE     0     0     0
          c6t4849544143484920443630303133323230303230d0 ONLINE     0     0     0

errors: No known data errors

  pool: bulk_zp2
 state: ONLINE
 scrub: none requested
config:

        NAME                                              STATE   READ WRITE CKSUM
        bulk_zp2                                          ONLINE     0     0     0
          raidz1                                          ONLINE     0     0     0
            c6t4849544143484920443630303133323230303330d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303331d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303332d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303333d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303334d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303335d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303336d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303337d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303338d0 ONLINE     0     0     0

errors: No known data errors

# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
bulk_sp1    83K  1.95T  24.5K  /sp1
bulk_zp2  73.8K  1.87T  2.67K  /zp2

I used two benchmarks. One was a "bunzip2 | tar" extract of the Sun Studio-11 SPARC distribution tarball, extracting from the T2000's internal drives onto the test zpools. For this benchmark, both zpools gave similar results:

pool sp1 (single-LUN stripe):
  du -s -k:  1155141
  time -p:   real 713.67  user 614.42  sys 7.56
  1.6MB/sec overall

pool zp2 (8+1-LUN raidz1):
  du -s -k:  1169020
  time -p:   real 714.96  user 614.78  sys 7.56
  1.6MB/sec overall

The 2nd benchmark was bonnie++ v1.03, run single-threaded with default arguments, which means a 32GB dataset made up of 1GB files. Observations of "vmstat" and "mpstat" during the tests showed that bonnie++ is CPU-limited on the T2000, especially for the getc()/putc() tests, so I later ran 3x bonnie++'s simultaneously (13GB dataset each), and got the same results in total throughput for the block read/write tests on the single-LUN zpool (I was not patient enough to sit through the getc/putc tests again :-).

pool sp1 (single-LUN stripe):

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
filer1          32G 15497  99 66245  84 16652  30 15210  90 106600 59 322.3   3
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5204 100 +++++ +++  8076 100  4551 100 +++++ +++  7509 100
filer1,32G,15497,99,66245,84,16652,30,15210,90,106600,59,322.3,3,16,5204,100,+++++,+++,8076,100,4551,100,+++++,+++,7509,100

pool zp2 (8+1-LUN raidz1):

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
filer1          32G 16118 100 29702  40  7416  13 15828  94 30204  20  25.0   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5215 100 +++++ +++  8527 100  4453 100 +++++ +++  8918 100
filer1,32G,16118,100,29702,40,7416,13,15828,94,30204,20,25.0,0,16,5215,100,+++++,+++,8527,100,4453,100,+++++,+++,8918,100

I'm not sure what to add in the way of comments. It seems clear from the results, and from watching "iostat -xn", "vmstat", "mpstat", etc. during the tests, that the raidz pool apparently suffers from not being able to make as good use of the array's 1GB cache (the sequential block read test seems to match well with Hitachi's read prefetch algorithms, I guess). There's also the potential of too much seeking going on for the raidz pool, since there are 9 LUNs on top of 7 physical disk drives (though how Hitachi divides/stripes those LUNs is not clear to me).

One thing I noticed which puzzles me is that in both configurations, though more so in the divided-up raidz pool, there were long periods of time where the LUNs showed in "iostat -xn" output at 100% busy but with no I/Os happening at all. No paging, CPU 100% idle, no less than 2GB of free RAM, for as long as 20-30 seconds. Sure puts a dent in the throughput.

I'm doing some more testing of NFS throughput over these two zpools, since the test machine will eventually become an NFS and samba server. I've got some questions about the performance issues in the NFS scenario, but will address those in a separate message.

Questions, observations, and/or suggestions are welcome.

Regards,

Marion
fishy smell way below... Marion Hakanson wrote:> I wrote: >> Just thinking out loud here. Now I''m off to see what kind of performance >> cost there is, comparing (with 400GB disks): >> Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume >> 8+1 RAID-Z on 9 244.2GB LUN''s from a 6+1 HW RAID5 volume > > > Richard.Elling at Sun.COM said: >> Interesting idea. Please post back to let us know how the performance looks. > > > The short story is, performance is not bad with the raidz arrangement, until > you get to doing reads, at which point it looks much worse than the 1-LUN setup. > > Please bear in mind that I''m not a storage nor benchmarking expert, though > I''d say I''m not a neophyte either. > > Some specifics: > > The array is a low-end Hitachi, 9520V. My two test subjects are a pair > of RAID-5 groups in the same shelf, each consisting of 6D+1P 400GB SATA > drives. The test host is a Sun T2000, 16GB RAM, connected via 2Gb FC > links through a pair of switches (the array/mpxio combination do not > support load-balancing, so only one 2Gb channel is in use at a time). > It is running Solaris-10U3, patches current as of 12-Jan-2007. > > The array was mostly idle except for my tests, although some light > I/O to other shelves may have come from another host on occasion. > The test host wasn''t doing anything else during these tests. > > One RAID-5 group was configured as a single 2048GB LUN (with about 150GB > left unallocated, the array has a max LUN size); The second RAID-5 group > was setup as nine 244.3GB LUN''s. > > Here are the zpool configurations I used for these tests: > # zpool status -v > pool: bulk_sp1 > state: ONLINE > scrub: none requested > config: > > NAME STATE READ WRITE CKSUM > bulk_sp1 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303230d0 ONLINE 0 0 0 > > errors: No known data errors > > pool: bulk_zp2 > state: ONLINE > scrub: none requested > config: > > NAME STATE READ WRITE CKSUM > bulk_zp2 ONLINE 0 0 0 > raidz1 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303330d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303331d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303332d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303333d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303334d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303335d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303336d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303337d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303338d0 ONLINE 0 0 0 > > errors: No known data errors > # zfs list > NAME USED AVAIL REFER MOUNTPOINT > bulk_sp1 83K 1.95T 24.5K /sp1 > bulk_zp2 73.8K 1.87T 2.67K /zp2 > > > I used two benchmarks: One was a "bunzip2 | tar" extract of the Sun > Studio-11 SPARC distribution tarball, extracting from the T2000''s > internal drives onto the test zpools. For this benchmark, both zpools > gave similar results: > > pool sp1 (single-LUN stripe): > du -s -k: > 1155141 > time -p: > real 713.67 > user 614.42 > sys 7.56 > 1.6MB/sec overall > > pool zp2 (8+1-LUN raidz1): > du -s -k: > 1169020 > time -p: > real 714.96 > user 614.78 > sys 7.56 > 1.6MB/sec overall > > > > The 2nd benchmark was bonnie++ v1.03, run single-threaded with default > arguments, which means a 32GB dataset made of up 1GB files. 
Observations of > "vmstat" and "mpstat" during the tests showed that bonnie++ is CPU-limited > on the T2000, especially for the getc()/putc() tests, so I later ran 3x > bonnie++''s simultaneously (13GB dataset each), and got the same results > in total throughput for the block read/write tests on the single-LUN zpool > (I was not patient enough to sit through the getc/putc tests again :-). > > pool sp1 (single-LUN stripe): > Version 1.03 ------Sequential Output------ --Sequential Input- --Random- > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- > Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP > filer1 32G 15497 99 66245 84 16652 30 15210 90 106600 59 322.3 3 > ------Sequential Create------ --------Random Create-------- > -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- > files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP > 16 5204 100 +++++ +++ 8076 100 4551 100 +++++ +++ 7509 100 > filer1,32G,15497,99,66245,84,16652,30,15210,90,106600,59,322.3,3,16,5204,100,+++++,+++,8076,100,4551,100,+++++,+++,7509,100 > > pool zp2 (8+1-LUN raidz1): > Version 1.03 ------Sequential Output------ --Sequential Input- --Random- > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- > Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP > filer1 32G 16118 100 29702 40 7416 13 15828 94 30204 20 25.0 0 > ------Sequential Create------ --------Random Create-------- > -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- > files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP > 16 5215 100 +++++ +++ 8527 100 4453 100 +++++ +++ 8918 100 > filer1,32G,16118,100,29702,40,7416,13,15828,94,30204,20,25.0,0,16,5215,100,+++++,+++,8527,100,4453,100,+++++,+++,8918,100 >[is it just me, or can anybody understand Bonnie++ output? Maybe they just lose too much info trying to cram the results onto a VT-52 screen...]> I''m not sure what to add in the way of comments. It seems clear from the > results, and from watching "iostat -xn", "vmstat", "mpstat", etc. during > the tests, that the raidz pool apparently suffers from not being able to > make as good use of the array''s 1GB cache (the sequential block read test > seems to match well with Hitachi''s read prefetch algorithms, I guess). > There''s also the potential of too much seeking going on for the raidz pool, > since there are 9 LUN''s on top of 7 physical disk drives (though how Hitachi > divides/stripes those LUN''s is not clear to me). > > One thing I noticed which puzzles me is that in both configurations, though > more so in the divided-up raidz pool, there were long periods of time where > the LUN''s showed in "iostat -xn" output at 100% busy but with no I/O''s > happening at all. No paging, CPU 100% idle, no less than 2GB of free RAM, > for as long as 20-30 seconds. Sure puts a dent in the throughput.IIRC, the calculation for %busy is the amount of time that an I/O is on the device. These symptoms would occur if an I/O is dropped somewhere along the way or at the array. Eventually, we''ll timeout and retry, though by default that should be after 60 seconds. I think we need to figure out what is going on here before accepting the results. It could be that we''re overrunning the queue on the Hitachi. By default, ZFS will send 35 concurrent commands per vdev and the ssd driver will send up to 256 to a target. IIRC, Hitachi has a formula for calculating sdd_max_throttle to avoid such overruns, but I''m not sure if that applies to this specific array. 
-- richard

> I'm doing some more testing of NFS throughput over these two zpools, since the test machine will eventually become an NFS and samba server. I've got some questions about the performance issues in the NFS scenario, but will address those in a separate message.
>
> Questions, observations, and/or suggestions are welcome.
>
> Regards,
>
> Marion
On 2/1/07, Marion Hakanson <hakansom at ohsu.edu> wrote:

> There's also the potential of too much seeking going on for the raidz pool, since there are 9 LUNs on top of 7 physical disk drives (though how Hitachi divides/stripes those LUNs is not clear to me).

Marion,

That is the part of your setup that puzzled me. You took the same 7-disk raid5 set and split it into 9 LUNs. The Hitachi likely splits the "virtual disk" into 9 contiguous partitions, so each LUN maps back to a different part of the 7 disks. I speculate that ZFS thinks it is talking to 9 different disks and spreads out the writes accordingly. What ZFS thinks are sequential writes become widely spaced writes across the entire disk and send your seek times through the roof.

I'm interested in how it looks from the Hitachi end. If you can, repeat the test with the Hitachi presenting all 7 disks directly to ZFS as LUNs?

> One thing I noticed which puzzles me is that in both configurations, though more so in the divided-up raidz pool, there were long periods of time where the LUNs showed in "iostat -xn" output at 100% busy but with no I/Os happening at all. No paging, CPU 100% idle, no less than 2GB of free RAM, for as long as 20-30 seconds. Sure puts a dent in the throughput.

Interesting... what you are suggesting is that %b is 100% when w/s and r/s are 0?

-- Just me, Wire ...
weeyeh at gmail.com said:> That is the part of your setup that puzzled me. You took the same 7 disk > raid5 set and split them into 9 LUNS. The Hitachi likely splits the "virtual > disk" into 9 continuous partitions so each LUN maps back to different parts > of the 7 disks. I speculate that ZFS thinks it is talking to 9 different > disks so spreads out the writes accordingly. What ZFS thinks is sequential > writes becomes well spaced writes across the entire disk & blows your seek > time off the roof.That''s what I thought might happen before I even tried this, although it''s also possible the Hitachi "stripes" each LUN across all 7 disks. Either way, one could be getting too many seeks. Note that I''m just trying to see if it was so bad that the self-healing capability wasn''t worth the cost. I do realize these are 7200rpm SATA disks, so seeking isn''t what they do best.> I''m interested how it looks like from the Hitachi end. If you can, > repeat the test with the Hitachi presenting all 7 disks directly to > ZFS as LUNs?The array doesn''t give us that capability.> Interesting... what you are suggesting is that %b is 100% when w/s and r/s is > 0?Correct. Sometimes all "iostat -xn" columns are 0 except %b; Sometimes the asvc_t column stays at "4.0" for the duration of the quiet period. I''ve also observed times where all columns were 0, including %b. Sure is puzzling. Richard.Elling at Sun.COM said:> IIRC, the calculation for %busy is the amount of time that an I/O is on the > device. These symptoms would occur if an I/O is dropped somewhere along the > way or at the array. Eventually, we''ll timeout and retry, though by default > that should be after 60 seconds. I think we need to figure out what is going > on here before accepting the results. It could be that we''re overrunning the > queue on the Hitachi. By default, ZFS will send 35 concurrent commands per > vdev and the ssd driver will send up to 256 to a target. IIRC, Hitachi has a > formula for calculating sdd_max_throttle to avoid such overruns, but I''m not > sure if that applies to this specific array.Hmm, it''s true that I have made no tuning changes on the T2000 side. It would make sense if the array just stopped responding. I''ll have to poke at the array and see if it has any diagnostics logged somewhere. I recall that the Hitachi docs do have some recommendations on max-throttle settings, so I''ll go dig those up and see what I can find out. Thanks for the comments, Marion
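If the queue-overrun theory pans out, the knob Richard is referring to is the FC disk driver's max throttle, set in /etc/system and read at boot; a rough sketch (the value 32 is only a placeholder -- the right number comes from the array vendor's guidance, typically something like the controller's per-port queue depth divided by the number of LUNs):

  # echo 'set ssd:ssd_max_throttle=32' >> /etc/system
  # init 6     (the setting only takes effect after a reboot)

On configurations where the disks show up under the sd driver instead, the equivalent tunable is sd:sd_max_throttle.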
Marion Hakanson wrote:

> However, given the default behavior of ZFS (as of Solaris-10U3) is to panic/halt when it encounters a corrupted block that it can't repair, I'm re-thinking our options, weighing against the possibility of a significant downtime caused by a single-block corruption.

Guess what happens when UFS finds an inconsistency it can't fix either? The issue is that ZFS has the chance to fix the inconsistency if the zpool is a mirror or raidz, not that it finds the inconsistency in the first place. ZFS will just find more of them, for a given set of errors, than other filesystems do.