Using raidz or raidz2 in ZFS, do all the disks have to be the same size?

This message posted from opensolaris.org
Hello Kory,

Monday, March 19, 2007, 4:47:27 PM, you wrote:

KW> Using raidz or raidz2 in ZFS, do all the disks have to be the same size?

No, they don't have to be the same size. However, all disks will be treated as if they were the size of the smallest one, and once you replace (online) all of the disks with bigger ones, the pool size will automatically expand to the new common size.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
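A minimal sketch of the replace-and-grow workflow Robert describes (the pool and device names are assumptions; on later ZFS releases the final growth step additionally requires the autoexpand pool property):

  # Replace each smaller disk with a larger one, letting each resilver finish.
  zpool replace tank c1t0d0 c2t0d0
  zpool status tank          # wait for the resilver to complete before the next replace
  zpool replace tank c1t1d0 c2t1d0
  zpool replace tank c1t2d0 c2t2d0
  # Once the last small disk is gone, the pool reports the new, larger size.
  zpool list tank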
Hi Kory,

No, they don't have to be the same size. But the pool size will be constrained by the smallest disk, which might not be the best use of your disk space. See the output below. I'd be better off mirroring the two 136-GB disks and using the 4-GB disk for something else. :-)

Cindy

c0t0d0  =   4 GB
c1t17d0 = 136 GB
c1t18d0 = 136 GB

# zpool create rpool raidz2 c0t0d0 c1t17d0 c1t18d0
# zpool list
NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
rpool  11.9G   243K  11.9G   0%  ONLINE  -

Kory Wheatley wrote:
> Using raidz or raidz2 in ZFS, do all the disks have to be the same size?
The reason for this question is that we currently have our disks set up in a hardware RAID5 on an EMC device, and these disks are configured as a ZFS file system. Would it benefit us to have the disks set up as a raidz on top of the hardware RAID5 that is already there? Or would this double RAID slow our performance, with both a software and a hardware RAID setup? Or would a raidz setup be better than the hardware RAID5 setup?

Also, if we do set the disks up as a raidz, would it benefit us more to specify each disk in the raidz, or to create them as LUNs and then build the raidz on those?

This message posted from opensolaris.org
Hello Kory,

Tuesday, March 20, 2007, 4:38:03 PM, you wrote:

KW> The reason for this question is that we currently have our disks set up
KW> in a hardware RAID5 on an EMC device, and these disks are configured
KW> as a ZFS file system. Would it benefit us to have the disks set up
KW> as a raidz on top of the hardware RAID5 that is already there?
KW> Or would this double RAID slow our performance, with both a software
KW> and a hardware RAID setup? Or would a raidz setup be better than the
KW> hardware RAID5 setup?

KW> Also, if we do set the disks up as a raidz, would it benefit us more
KW> to specify each disk in the raidz, or to create them as LUNs and then
KW> build the raidz on those?

RAIDZ vs. HW RAID5 - generally you can expect much worse performance for parallel, small, random reads and better performance in other cases.

RAIDZ on top of HW RAID5 - well, it really depends on whether the performance hit and the storage-capacity cost are acceptable to you. Then there's the somewhat lacking hot-spare support in ZFS right now.

If raidz performance is acceptable to you, I would go with each disk presented as a LUN and then put raidz or raidz2 on top of them, plus hot spares.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
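A minimal sketch of the layout Robert suggests, assuming each EMC disk is exported to the host as its own LUN (the c4tNd0 names and the vdev width are assumptions):

  # raidz2 across six single-disk LUNs, with a seventh LUN as a hot spare
  zpool create tank raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 spare c4t6d0
  zpool status tank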
(I'm probably not the best person to answer this, but that has never stopped me before, and I need to give Richard Elling a little more time to get the Goats, Cows and Horses fed, sip his morning coffee, and offer a proper response...)

> Would it benefit us to have the disks set up as a raidz on top of the hardware RAID5 that is already there?

Way back when, we called such configurations "plaiding", which described a host-based RAID configuration that criss-crossed hardware RAID LUNs. In doing such things, we had potentially better data availability with a configuration that could survive more failure modes. Alternatively, we used the hardware RAID for the availability configuration (hardware RAID5), and used host-based RAID to stripe across hardware RAID5 LUNs for performance. Seemed to work pretty well.

In theory, a raidz pool spread across some number of underlying hardware RAID5 LUNs would offer protection against more failure modes, such as the loss of an entire RAID5 LUN. So from a failure protection/data availability point of view, it offers some benefit. Now, whether or not you experience a real, measurable benefit over time is hard to say. Each additional level of protection/redundancy has a diminishing return, oftentimes at a dramatic incremental cost (e.g. getting from "four nines" to "five nines").

> Or would this double RAID slow our performance, with both a software and a hardware RAID setup?

You will certainly pay a performance penalty - using raidz across the RAID5 LUNs will reduce deliverable IOPS from the RAID5 LUNs. Whether or not the performance trade-off is worth the RAS gain depends on your RAS and data availability requirements.

> Or would a raidz setup be better than the hardware RAID5 setup?

Assuming a robust RAID5 implementation with battery-backed NVRAM (to protect against the "write hole" and partial-stripe writes), I think a raidz zpool covers more of the datapath than a hardware RAID5 LUN, but I'll wait for Richard to elaborate here (or tell me I'm wrong).

> Also, if we do set the disks up as a raidz, would it benefit us more to specify each disk in the raidz, or to create them as LUNs and then build the raidz on those?

Isn't this the same question as the first question? I'm not sure what you're asking here...

The questions you're asking are good ones, and date back to the decades-old struggle around configuration tradeoffs for performance / availability / cost. My knee-jerk reaction is that one level of RAID, either hardware RAID5 or ZFS raidz, is sufficient for availability, and keeps things relatively simple (and simple also improves RAS). The advantage host-based RAID has always had over hardware RAID is the ability to create software LUNs (like a raidz1 or raidz2 zpool) across physical disk controllers, which may also cross SAN switches, etc. So, 'twas me, I'd go with non-hardware-RAID5 devices from the storage frame, and create raidz1 or raidz2 zpools across controllers.

But, that's me...
:^)

/jim
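A hedged sketch of the "plaid" layout Jim describes, where each device handed to ZFS is itself an entire hardware RAID5 LUN from the EMC frame (the LUN names are assumptions):

  # Each c5tNd0 below is a hardware RAID5 LUN; raidz on top lets the pool
  # survive the loss of a whole RAID5 LUN, at the cost of extra parity overhead.
  zpool create tank raidz c5t0d0 c5t1d0 c5t2d0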
Jim Mauro wrote:
> (I'm probably not the best person to answer this, but that has never stopped me
> before, and I need to give Richard Elling a little more time to get the Goats, Cows
> and Horses fed, sip his morning coffee, and offer a proper response...)

chores are done, wading through the morning e-mail...

>> Would it benefit us to have the disks set up as a raidz on top of
>> the hardware RAID5 that is already there?
> Way back when, we called such configurations "plaiding", which described a host-based
> RAID configuration that criss-crossed hardware RAID LUNs. In doing such things, we had
> potentially better data availability with a configuration that could survive more
> failure modes. Alternatively, we used the hardware RAID for the availability
> configuration (hardware RAID5), and used host-based RAID to stripe across hardware
> RAID5 LUNs for performance. Seemed to work pretty well.

Yep, there are various ways to do this and, in general, the more copies of the data you have, the better reliability you have. Space is also fairly easy to calculate. Performance can be tricky, and you may need to benchmark with your workload to see which is better, due to the difficulty in modeling such systems.

> In theory, a raidz pool spread across some number of underlying hardware RAID5 LUNs
> would offer protection against more failure modes, such as the loss of an entire RAID5
> LUN. So from a failure protection/data availability point of view, it offers some
> benefit. Now, whether or not you experience a real, measurable benefit over time is
> hard to say. Each additional level of protection/redundancy has a diminishing return,
> oftentimes at a dramatic incremental cost (e.g. getting from "four nines" to "five nines").

If money was no issue, I'm sure we could come up with an awesome solution :-)

>> Or would this double RAID slow our performance, with both a software and
>> a hardware RAID setup?
> You will certainly pay a performance penalty - using raidz across the RAID5 LUNs
> will reduce deliverable IOPS from the RAID5 LUNs. Whether or not the performance
> trade-off is worth the RAS gain depends on your RAS and data availability requirements.

Fast, inexpensive, reliable: pick two.

>> Or would a raidz setup be better than the hardware RAID5 setup?
>>
> Assuming a robust RAID5 implementation with battery-backed NVRAM (to protect against
> the "write hole" and partial-stripe writes), I think a raidz zpool covers more of the
> datapath than a hardware RAID5 LUN, but I'll wait for Richard to elaborate here
> (or tell me I'm wrong).

In general, you want the data protection in the application, or as close to the application as you can get. Since programmers tend to be lazy (Gosling said it, not me! :-) most rely on the file system and underlying constructs to ensure data protection. So, having ZFS manage the data protection will always be better than having some box at the other end of a wire managing the protection.

>> Also, if we do set the disks up as a raidz, would it benefit us more to
>> specify each disk in the raidz, or to create them as LUNs and then
>> build the raidz on those?
>>
> Isn't this the same question as the first question? I'm not sure what
> you're asking here...
>
> The questions you're asking are good ones, and date back to the decades-old struggle
> around configuration tradeoffs for performance / availability / cost.
> My knee-jerk reaction is that one level of RAID, either hardware RAID5 or ZFS raidz,
> is sufficient for availability, and keeps things relatively simple (and simple also
> improves RAS). The advantage host-based RAID has always had over hardware RAID is the
> ability to create software LUNs (like a raidz1 or raidz2 zpool) across physical disk
> controllers, which may also cross SAN switches, etc. So, 'twas me, I'd go with
> non-hardware-RAID5 devices from the storage frame, and create raidz1 or raidz2 zpools
> across controllers.

This is reasonable.

> But, that's me...
> :^)
>
> /jim

The important thing is to protect your data. You have lots of options here, so we'd need to know more precisely what the other requirements are before we could give better advice.
 -- richard
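A hedged illustration of "raidz2 zpools across controllers" (the controller and target numbers are assumptions): each raidz2 vdev takes one disk from each of four controllers, so losing an entire controller or SAN path costs the pool at most one device per vdev.

  zpool create tank \
      raidz2 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
      raidz2 c2t1d0 c3t1d0 c4t1d0 c5t1d0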
Hi Kory - Your problem came our way through other Sun folks a few days ago, and I wish I had that magic setting to help, but the reality is that I'm not aware of anything that will improve the time required to mount 12k file systems.

I would add (not that this helps) that I'm not convinced this problem is unique to ZFS, but I do not have experience or empirical data on mount time for 12k UFS, QFS, ext4, etc., file systems.

There is an RFE filed on this:
http://bugs.opensolaris.org/view_bug.do?bug_id=6478980

As I said, I wish I had a better answer.

Thanks,
/jim

Kory Wheatley wrote:
> Currently we are trying to set up ZFS file systems for all our user
> accounts under /homea /homec /homef /homei /homem /homep /homes and
> /homet. Right now on our Sun Fire V890 with 4 dual-core processors and
> 16 GB of memory we have 12,000 ZFS file systems set up, which Sun has
> promised will work, but we didn't know that it would take over an hour
> to do a reboot on this machine to mount and umount all these file
> systems. What we're trying to accomplish is the best performance along
> with the best data protection. Sun says that ZFS supports millions of
> file systems, but what they left out is how long it takes to do a
> reboot when you have thousands of file systems.
> Currently we have three LUNs on our EMC disk array on which we've created
> one ZFS storage pool, and we've created these 12,000 ZFS file systems
> in this pool.
>
> We really don't want to have to go to UFS to create our student user
> accounts. We like the flexibility of ZFS, but with the slow boot
> process it will kill us when we have to implement patches that require
> a reboot. These ZFS file systems will contain all the student data,
> so reliability and performance are key for us. Do you know of a way or
> a different setup for ZFS that would allow our system to boot up faster?
> I know each mount takes up memory, so that's part of the slowness when
> mounting and umounting. We know when the system is up that the kernel
> is using 3 GB of memory out of the 16 GB, and there's nothing else on
> this box right now but ZFS. There's no data in those thousands of file
> systems yet.
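A minimal sketch of how one might reproduce the boot-time cost being discussed, assuming a test pool named tank (the pool name, parent dataset, and loop are assumptions; the 12,000 count comes from the thread):

  zfs create tank/home
  i=1
  while [ $i -le 12000 ]; do
      zfs create tank/home/user$i
      i=`expr $i + 1`
  done
  # Unmount and remount everything to approximate the work done at boot.
  zfs umount -a
  time zfs mount -a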
I think this is a systems engineering problem, not just a ZFS problem. Few have bothered to look at mount performance in the past because most systems have only a few mounted file systems [1]. Since ZFS does file system quotas instead of user quotas, we now have situations where there can be thousands of mounts, so we do need to look at mount performance more closely. We're doing some of that work now, and looking at other possible solutions (CR 6478980).

[1] We've done some characterization of this while benchmarking Sun Cluster failovers. The time required for a UFS mount can be quite substantial, even when fsck is not required, and is also somewhat variable (from a few seconds to tens of seconds). We've made some minor changes to help improve cluster failover wrt mounts, so perhaps we can look at our characterization data again and see if there is some low-hanging fruit which would also apply more generally.
 -- richard
We've got some work going on in the NFS group to alleviate this problem. Doug McCallum has introduced the sharemgr (see http://blogs.sun.com/dougm) and I'm about to putback the In-Kernel Sharetab bits (look in http://blogs.sun.com/tdh - especially http://blogs.sun.com/tdh/entry/in_kernel_sharetab_have_a).

Doug has been doing some performance optimization to the sharemgr to allow faster loading of shares at boot, specifically for ZFS - see for example http://bugs.opensolaris.org/view_bug.do?bug_id=6491973. It is funny, he just told me a couple of hours ago that he was doing 15k entries. I know he has significantly reduced the times for 3k and 5k filesystems. We are still working on the 15k entries.

We want to combine his changes with my changes to see if we can get the 15k time down. With my changes, we remove going to disk for the sharetab and locking the file.

As you can see, this is a very hot spot for us right now. We really want these times down.

Also, for the interested, I gave a presentation at Connectathon last year which highlights some of the issues here: Scaling NFS Services (http://www.connectathon.org/talks06/haynes.pdf). I also presented an overview of Doug's and my projects at the latest Connectathon: The Management of Shares (http://www.connectathon.org/talks07/ScaleShares.pdf).

This message posted from opensolaris.org
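A hedged illustration of why the share count tracks the filesystem count here (the pool and dataset names are assumptions): the sharenfs property is inherited, so setting it once on the parent causes every one of the thousands of child home filesystems to be shared at boot, which is the large-share-count path this sharemgr and in-kernel sharetab work is trying to speed up.

  zfs set sharenfs=rw tank/home
  zfs get -r sharenfs tank/home     # every child inherits the share setting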
Richard Elling wrote:
> I think this is a systems engineering problem, not just a ZFS problem.
> Few have bothered to look at mount performance in the past because
> most systems have only a few mounted file systems [1]. Since ZFS does
> file system quotas instead of user quotas, we now have situations
> where there can be thousands of mounts, so we do need to look at
> mount performance more closely. We're doing some of that work now, and
> looking at other possible solutions (CR 6478980).
>
> [1] We've done some characterization of this while benchmarking Sun
> Cluster failovers. The time required for a UFS mount can be quite
> substantial, even when fsck is not required, and is also somewhat
> variable (from a few seconds to tens of seconds). We've made some minor
> changes to help improve cluster failover wrt mounts, so perhaps we
> can look at our characterization data again and see if there is some
> low-hanging fruit which would also apply more generally.

The problem is that in order to restrict disk usage, ZFS *requires* that you create this many filesystems. I think most in this situation would prefer not to have to do that. The two solutions I see would be to add user quotas to ZFS, or to be able to set a quota on a directory without it becoming its own filesystem.

We've ruled out using ZFS for our systems at this time due to these limitations and the fact that thousands of mounts on a host entail a very long reboot (and the fact that snapshots count toward the filesystem quota).

Any chance that user quotas will be added in the future? It would go a long way toward alleviating this problem. Ideally, snapshots would not count against user quotas if possible.

Jim
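For context, a minimal sketch of the filesystem-per-user workaround being criticized here (the pool, dataset, and user names are assumptions): disk usage can only be capped by giving each user a dedicated filesystem with its own quota.

  zfs create -o quota=2g tank/home/jsmith
  zfs create -o quota=2g tank/home/kwheatley
  zfs set quota=5g tank/home/kwheatley     # raising one user's cap later
  zfs get quota tank/home/jsmith tank/home/kwheatley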
> The problem is that in order to restrict disk usage, ZFS *requires*
> that you create this many filesystems. I think most in this situation
> would prefer not to have to do that. The two solutions I see would
> be to add user quotas to ZFS, or to be able to set a quota on a
> directory without it becoming its own filesystem.

What this really means is that ZFS filesystems need to be about as cheap as UFS quotas, and not (considerably) more expensive. As it stands now, ZFS filesystems have two serious limitations:

 - they cost ~100K of memory per fs (when mounted)
 - they are, by default, all mounted all the time

> Any chance that user quotas will be added in the future? It would go
> a long way toward alleviating this problem. Ideally, snapshots would not
> count against user quotas if possible.

If snapshots don't count against quota, then you give users an easy way to extend their quotas by (number of snapshots)-fold.

Casper
zfs-discuss-bounces at opensolaris.org wrote on 03/21/2007 11:00:43 AM:

> > The problem is that in order to restrict disk usage, ZFS *requires*
> > that you create this many filesystems. I think most in this situation
> > would prefer not to have to do that. The two solutions I see would
> > be to add user quotas to ZFS, or to be able to set a quota on a
> > directory without it becoming its own filesystem.
>
> What this really means is that ZFS filesystems need to be about as
> cheap as UFS quotas, and not (considerably) more expensive.

Well... and they need to apply to user/group restriction workflows, not just to the total size of child nodes. Many times you can create a bunch of ZFS mounts to limit usage or set reservations, but at some level this is not as flexible as user/group-level restrictions in many workflows (or even possible, given some structures and workflows). For instance, how would you limit John and Sarah in accounting from gobbling up all the space in the accounting group's folder, while allowing the managers Mike and Amy to do so, without changing their file structure to match the restrictions instead of their workflow?

User quotas have their place and are very well suited for certain tasks; ZFS quotas and reservations are more focused on protecting pooled storage from overuse/constraint. Building out a ton of fs mounts to try to work around missing user quotas does not scale well in any terms -- system cost, admin cost, or artificial filesystem layout/restructuring. The cost (CPU/memory) of ZFS mountpoints is one concern, but by no means the only or core concern.

> As it stands now, ZFS filesystems have two serious limitations:
>  - they cost ~100K of memory per fs (when mounted)
>  - they are, by default, all mounted all the time
>
> > Any chance that user quotas will be added in the future? It would go
> > a long way toward alleviating this problem. Ideally, snapshots would not
> > count against user quotas if possible.
>
> If snapshots don't count against quota, then you give users an
> easy way to extend their quotas by (number of snapshots)-fold.

I see the point and agree to some extent, and will add that when user quotas are put back and do account for snapshot space hogging, administrators will _need_ way better snapshot space usage reporting tools. I think many workflows may not want to force users to "own" the snapshot space -- for instance, a home directory where administrators set the snapshot rotation and the users have no control over when or how the snapshots will be released. This space accounting (live + snap vs. live) should be a setting.

-Wade
The fix for CR 6491973 won't have much effect on boot time since it is more specific to the act of setting the sharenfs property, but as Tom said, we are looking at anything that can reduce the time it takes to share out large numbers of shares. The time to share is separate from the mount time, since we don't start sharing until all the mounts are done. With such a large configuration, the console prompt will appear before the shares get started. The time prior to that is all in the mounts.

This message posted from opensolaris.org
Kory,

I'm sorry that you had to go through this. We're all working very hard to make ZFS better for everyone. We've noted this problem on the ZFS Best Practices wiki to try and help avoid future problems until we can get the quotas issue resolved.
 -- richard

Kory Wheatley wrote:
> Richard,
>
> I appreciate your information and insight. At this time, since ZFS is
> not capable of handling thousands of file systems and has several
> limitations, we are forced to focus our migration on using UFS, "after
> wasting time" -- Sun told us, before we thought of migrating our user
> accounts to ZFS, that everything would be fine, but they failed to
> mention the terrible slowness of the boot process. We told them we
> would be adding thousands of file systems under ZFS, and they said
> there would be no problems. Very unprofessional from my standpoint,
> since we invested so much time in ZFS. It's forced us to hold back on
> our migration and caused us to spend another $12k of maintenance on our
> current system, because we can't do our migration before the time our
> maintenance contract runs out. We have to restructure our migration
> plans around using UFS.
>
> ZFS needs to be described accurately in Sun's documentation and the
> presentations that I've looked at. Sure, it supports thousands and
> millions of file systems, but there are ramifications, resulting in a
> very slow boot process (if that had been stated, that would have been
> enough). This has cost us a considerable amount of time we've spent
> on ZFS, and now we have to turn our attention to UFS for our migration.
> From what I understand this problem was identified last year. I'm
> wondering how much time has been invested in it, since ZFS is such a key
> element for everyone migrating to or installing Solaris 10. You
> definitely would not want to use ZFS with thousands of file systems;
> it will not work for us at all at this time.
> Doug has been doing some performance optimization to the sharemgr
> to allow faster loading of shares at boot

Doug has blogged about his performance numbers here:
http://blogs.sun.com/dougm/entry/recent_performance_improvement_in_zfs

This message posted from opensolaris.org
"The important thing is to protect your data. You have lots of options here, so we''d need to know more precisely what the other requirements are before we could give better advice. -- richard" Please let me come in with a parallel need, the answer to which should contribute to this thread. -Physical details: 3-drive (plus DVD) box with Micro-ATX board, 1 on-board controller and the option for one raid card. Actual board, CPU and Memory yet-to-be-spec''d, but we''ll throw in whatever the "hardware-compatible" Micro-ATX board can handle. -Software details: OpenSolaris 2008-05, ZFS+PostgreSQL+Python. -Mission: ZFS box is to watch a Windoze box (or a MAC box) on which new files are being created and old ones changed, plus many deletions (animation system). -Objectives: (a) make periodic snapshots of animator''s box (actual copies of files) onto ZFS box, and (b) Write metadata into the PostgreSQL database to record event changes happening to key files. -Design concept: Integrate ZFS+SQL+Python into a rules-based backup device that notifies a third party elsewhere in the world about project progress (or lack thereof), and forwards key files and the SQL metadata (via internet) to some host ZFS box elsewhere. -Observations: (a) The local and the host ZFS boxes are not expected to contain the same images; indeed, many local ZFS boxes will be distributed, and one host ZFS box will be the ultimate repository of "completed" works. (b) High Performance is not an overriding consideration because this box "serves" only two users (the watched box on the local network and the host down the internet pipe). Question that relates to the on-going thread: What configuration of ZFS and the hardware would serve "reliable and cheap"? David Singer This message posted from opensolaris.org
Hello...

If I have understood well, you will have a host with EMC RAID5 disks. Is that right? You pay a lot of money to have EMC disks, and I think it is not a good idea to have another layer of *any* RAID on top of it. If you have EMC RAID5 (e.g. Symmetrix), you don't need to have a software RAID... ZFS was designed to provide a RAID solution for cheap disks! I think that is not your case, and anything that is "too much" is not good. It generates complexity and loops... :) I think ZFS can "trust" the EMC box...

Leal.

This message posted from opensolaris.org
Leal,

The entire configuration through our corporation is being defined. One of our team members is heavy into EMC - 200 TB is his "normal" operating range. However, for this need we are focused just on local "smart appliances", the purpose of which is to do more than just automatically mirror the entirety of another local computer. What is desired is "reliable and cheap", plus remotely controlled, virus-free, and easily updated by the local bone-head. We expect to have many of these appliances, each in a separate spot in the world, each serving one local computer (operated by one local bone-head), and each reporting to one common central repository via the internet. We don't expect the appliance to have (relatively) much CPU stress, but the files are rather large (video, animation, and all the underlying constructs, tracks, and undo's thereof).

We've come to the conclusion that hardware RAID of any sort is not required. Remember, the source data on Local Bone-Head's computer (not being disparaging, just being practical that an un-supervised person thousands of miles away has to be considered less-than-optimal in computer habits) is being copied to a ZFS machine (backup location number one) and then forwarded to a central repository (backup location number two) which will itself have a mirror in some distant location (backup location number three).

We'll try compression level #9. We'll set "scrub" to 30 days automatic. We'll have unique virus protection: the bootable drive will be read-only.

Here's the configuration we'll build for our first appliance:

Case: Antec NSK1380 - Micro-ATX format with one 5.25" bay, (3) 3.5" drive bays, a 350 W power supply, a 120 mm fan, plus an interior side fan (uses one PCI slot).

MOBO: still to be determined, but we are currently evaluating the ASUS P5E-VM, LGA775, Intel G35 northbridge and ICH9R southbridge. Comes with 4 memory slots of dual-channel DDR2, (2) PCIe x1, (1) PCIe x16, (1) PCI, (6) SATA 3 Gb/s, (1) IDE PATA, (1) FDD, (6) USB, (1) FireWire, 5.1 surround, HDMI, XVGA, PS/2 (mouse/keyboard), (1) gigabit LAN port, (1) coaxial S/PDIF, a RAID controller, and a flaky BIOS that needs an immediate flash update before doing anything. But wait! Most of those features will go unused (read on).

Setup:
(1) single-core CPU, minimizing heat being the most important factor
(8) GB memory (4 x 2 GB) w/heat spreader
(3) 500 GB 7,200 rpm "whatever's in stock cheapest" drives C: D: E:
100% use as ZFS tank, raidz-1. No floppy, no CD, no DVD, no boot from C: or D: or E:
Shut off sound, FireWire, S/PDIF (and anything else we can figure out how)
BOOT & run from a 4 GB USB FLASH (thumb) DRIVE "F:" - READ ONLY!
(0) monitors, (0) keyboards, (0) mice - operate remotely via the Internet

Software (all loaded onto the flash drive):
OpenSolaris
ZFS
PostgreSQL
Python
A browser
A VPN

Our patch/upgrade pipeline: FedEx a replacement read-only USB flash drive.

We think we can get a brand-new 1 TB custom appliance for about $900 US. We understand there will be a learning curve to this, but are willing to cut ourselves on the bleeding edge. We think each part (with the possible exception of the mobo) has been successfully employed - that we are just the first to assemble it all in this particular fashion.

(Many thanks, Richard - chime in if I left anything out)

David

This message posted from opensolaris.org
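A minimal sketch of the pool side of this build, assuming the three 500 GB drives show up as c1t0d0 through c1t2d0 (the device names are assumptions, and the 30-day scrub would have to be driven externally, e.g. from cron, since scrub scheduling is not a pool property):

  zpool create tank raidz c1t0d0 c1t1d0 c1t2d0
  zfs set compression=gzip-9 tank     # "compression level #9", per the plan above
  zpool scrub tank                    # run periodically, e.g. every 30 days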
Very cool! Just one comment. You said:

> We'll try compression level #9.

gzip-9 is *really* CPU-intensive, often for little gain over gzip-1. As in, it can take 100 times longer and yield just a few percent gain. The CPU cost will limit write bandwidth to a few MB/sec per core.

I'd suggest that you begin by doing a simple experiment -- create a filesystem at each compression level, copy representative identical data to each one, and compare space usage. My guess is that you'll find the knee in the cost/benefit curve well below gzip-9. Also, if you're storing jpegs or video files, those are already compressed, in which case the benefit will be zero even at gzip-9.

That said, the other consideration is how you're using the storage. If the write rate is modest and disk space is at a premium, the CPU cost may simply not matter. And note that only writes are affected: when reading data back, gzip is equally fast regardless of level.

Jeff
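A minimal sketch of the experiment Jeff suggests (the pool name, dataset names, and sample data path are assumptions):

  for level in 1 3 6 9; do
      zfs create -o compression=gzip-$level tank/comptest-gzip$level
      cp -rp /export/sample/. /tank/comptest-gzip$level/
  done
  # Compare how much space each level actually saved on this data set.
  zfs list -o name,used,compressratio -r tank | grep comptest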