Hello,

Is it possible to non-destructively change RAID types in a zpool while the data remains on-line?

-J
Jason J. W. Williams wrote:
> Is it possible to non-destructively change RAID types in a zpool while
> the data remains on-line?

Yes. With constraints, however. What exactly are you trying to do?
 -- richard
Hi Richard,

Originally, my thinking was I'd like to drop one member out of a three-member
RAID-Z and turn it into a RAID-1 zpool. Although, at the moment I'm not sure.

Currently, I have 3 volume groups in my array with 4 disks each (12 disks
total). These VGs are sliced into 3 volumes each. I then have two database
servers, each using one LUN from each of the 3 VGs, RAID-Z'd together. For
redundancy it's great; for performance it's pretty bad.

One of the major issues is disk seek contention between the servers, since
they're all using the same disks and RAID-Z tries to utilize all the devices
it has access to on every write.

What I thought I'd move to was 6 RAID-1 VGs on the array, and assign the VGs
to each server via a one-device striped zpool. However, given the fact that
ZFS will kernel panic in the event of bad data, I'm reconsidering how to lay
it out.

Essentially I've got 12 disks to work with.

Anyway, that's the long form of trying to convert from RAID-Z to RAID-1. Any
help is much appreciated.

Best Regards,
Jason

On 11/28/06, Richard Elling <Richard.Elling at sun.com> wrote:
> Jason J. W. Williams wrote:
> > Is it possible to non-destructively change RAID types in a zpool while
> > the data remains on-line?
>
> Yes. With constraints, however. What exactly are you trying to do?
>  -- richard
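[For reference, the existing layout Jason describes would have been built
roughly like this -- a single RAID-Z vdev made from one LUN per array volume
group. The pool name and device paths are placeholders, not from the thread:

    # one LUN from each of the three array volume groups, RAID-Z'd together
    # (pool name and c2t*d0 devices are hypothetical)
    zpool create dbpool raidz c2t0d0 c2t1d0 c2t2d0
    zpool status dbpool
]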
comment below...

Jason J. W. Williams wrote:
> Hi Richard,
>
> Originally, my thinking was I'd like to drop one member out of a
> three-member RAID-Z and turn it into a RAID-1 zpool.
>
> Although, at the moment I'm not sure.
>
> Currently, I have 3 volume groups in my array with 4 disks each (12 disks
> total). These VGs are sliced into 3 volumes each. I then have two database
> servers, each using one LUN from each of the 3 VGs, RAID-Z'd together. For
> redundancy it's great; for performance it's pretty bad.
>
> One of the major issues is disk seek contention between the servers, since
> they're all using the same disks and RAID-Z tries to utilize all the
> devices it has access to on every write.
>
> What I thought I'd move to was 6 RAID-1 VGs on the array, and assign the
> VGs to each server via a one-device striped zpool. However, given the fact
> that ZFS will kernel panic in the event of bad data, I'm reconsidering how
> to lay it out.
>
> Essentially I've got 12 disks to work with.
>
> Anyway, that's the long form of trying to convert from RAID-Z to RAID-1.
> Any help is much appreciated.

No such conversion-in-place is possible today. The consensus for databases,
such as Oracle, is that you want your logs on a different zpool than your
data. The simplest way to implement this with redundancy is to mirror the
log zpool. You might try that first, before you re-lay out the data.
 -- richard
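[A minimal sketch of the mirrored log pool Richard suggests. The pool name,
filesystem name, and device paths are hypothetical placeholders:

    # separate pool for database transaction logs, mirrored for redundancy
    # (c3t0d0/c3t1d0 stand in for two LUNs on different volume groups)
    zpool create logpool mirror c3t0d0 c3t1d0
    zfs create logpool/mysql-logs
]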
Jason J. W. Williams wrote:
> Hi Richard,
>
> Originally, my thinking was I'd like to drop one member out of a
> three-member RAID-Z and turn it into a RAID-1 zpool.

You would need to destroy the pool to do this -- requiring the data to be
copied twice.

> Although, at the moment I'm not sure.

So many options, so little time... :-)

> Currently, I have 3 volume groups in my array with 4 disks each (12 disks
> total). These VGs are sliced into 3 volumes each. I then have two database
> servers, each using one LUN from each of the 3 VGs, RAID-Z'd together. For
> redundancy it's great; for performance it's pretty bad.
>
> One of the major issues is disk seek contention between the servers, since
> they're all using the same disks and RAID-Z tries to utilize all the
> devices it has access to on every write.

This is difficult to pin down. The disks cache and the RAID controller
caches. So while it is true that you would have contention, it is difficult
to predict what effect, if any, the hosts would see.

> What I thought I'd move to was 6 RAID-1 VGs on the array, and assign the
> VGs to each server via a one-device striped zpool. However, given the fact
> that ZFS will kernel panic in the event of bad data, I'm reconsidering how
> to lay it out.

NB: all other file systems will similarly panic. We get spoiled to some
extent because there are errors where ZFS won't panic. In the future, there
will be more errors that ZFS can handle without panicking.

> Essentially I've got 12 disks to work with.
>
> Anyway, that's the long form of trying to convert from RAID-Z to RAID-1.
> Any help is much appreciated.

send/receive = copy/copy = backup/restore
It may be possible to do this as a rolling reconfiguration.
 -- richard
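[A sketch of the send/receive migration Richard alludes to, assuming a
hypothetical source pool "dbpool" and spare devices for the new mirrored
pool; all names here are illustrative, not from the thread:

    # build the new pool on spare/freed devices (placeholder names)
    zpool create newpool mirror c4t0d0 c4t1d0

    # snapshot the source filesystem and copy it to the new pool
    zfs snapshot dbpool/data@migrate
    zfs send dbpool/data@migrate | zfs receive newpool/data

    # after a final incremental pass and application cutover,
    # the old pool can be retired, e.g.:
    #   zfs send -i dbpool/data@migrate dbpool/data@final | zfs receive newpool/data
    #   zpool destroy dbpool
]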
Hi Richard,

Been watching the stats on the array and the cache hits are < 3% on these
volumes. We're very write heavy, and rarely write similar enough data twice.
Having random-oriented database data and sequential-oriented database log
data on the same volume groups, it seems to me, was causing a lot of head
repositioning.

By shutting down the slave database servers we cut the latency tremendously,
which would seem to me to indicate a lot of contention. But I'm trying to
come up to speed on this, so I may be wrong. "iostat -xtcnz 5" showed the
latency dropped from 200 to 20 once we cut the replication. Since the masters
and slaves were using the same volume groups, and RAID-Z was striping across
all of them on both the masters and the slaves, I think this was a big
problem.

Any comments?

Best Regards,
Jason
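[For readers following along, the iostat invocation Jason cites and the
columns most relevant to this latency comparison; the header shown is the
usual extended "-xn" layout (verify against your own output):

    # extended per-device stats, 5-second intervals, skipping idle devices
    iostat -xtcnz 5
    #   r/s  w/s  kr/s  kw/s wait actv wsvc_t asvc_t  %w  %b device
    # wsvc_t/asvc_t are wait and active service times in ms; the 200 -> 20
    # change Jason describes would show up in asvc_t, with wait/actv giving
    # the queue depths behind it
]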
Hi Jason,
It seems to me that there is some detailed information which would be needed
for a full analysis. So, to keep the ball rolling, I'll respond generally.

Jason J. W. Williams wrote:
> Hi Richard,
>
> Been watching the stats on the array and the cache hits are < 3% on these
> volumes. We're very write heavy, and rarely write similar enough data
> twice. Having random-oriented database data and sequential-oriented
> database log data on the same volume groups, it seems to me, was causing a
> lot of head repositioning.

In general, writes are buffered. For many database workloads, the sequential
log writes won't be write cache hits and will be coalesced. There are several
ways you could account for this, but suffice it to say that the read cache
hit rate is more interesting for performance improvement opportunities. The
random reads are often cache misses, and adding prefetch is often a waste of
resources -- the nature of the beast.

For ZFS, all data writes should be sequential until you get near the capacity
of the volume, when there will be a search for free blocks which may be
randomly dispersed. One way to look at this is that for new and
not-yet-filled volumes, ZFS will write sequentially, unlike other file
systems. Once you get filled, then ZFS will write more like other file
systems. Hence, your write performance for ZFS may change over time, though
this will be somewhat mitigated by the RAID array's write buffer cache.

> By shutting down the slave database servers we cut the latency
> tremendously, which would seem to me to indicate a lot of contention. But
> I'm trying to come up to speed on this, so I may be wrong.

This is likely.

Note that RAID controllers are really just servers which speak a block-level
protocol to other hosts. Some RAID controllers are underpowered.

ZFS on a modern server can create a significant workload. This can also
clobber a RAID array. For example, by default, ZFS will queue up to 35 iops
per vdev before blocking. If you have one RAID array which is connected to 4
hosts, each host having 5 vdevs, then the RAID array would need to be able to
handle 700 (35 * 4 * 5) concurrent iops. There are RAID arrays, which will
remain nameless, that will not handle that workload very well. Under lab
conditions you should be able to empirically determine the knee in the
response-time curve as you add workload.

To compound the problem, fibre channel has pitiful flow control. Thus it may
also be necessary to throttle the concurrent iops at the source. I'm not sure
what the current thinking is on tuning vq_max_pending (35) for ZFS; you might
search for it in the archives. [The intent is to have no tunables -- let the
system figure out what to do best.]

> "iostat -xtcnz 5" showed the latency dropped from 200 to 20 once we cut the
> replication. Since the masters and slaves were using the same volume
> groups, and RAID-Z was striping across all of them on both the masters and
> the slaves, I think this was a big problem.

It is hard for me to visualize your setup, but this is a tell-tale sign that
you've overrun the RAID box. Changing the volume partitioning will likely
help, perhaps tremendously.
 -- richard
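[If you do experiment with the per-vdev queue depth Richard mentions: on
builds that expose it as the zfs_vdev_max_pending kernel variable (an
assumption -- check the archives and your release before relying on it), the
usual way to lower it is an /etc/system entry, e.g.:

    * /etc/system -- assumes this release exposes the vq_max_pending default
    * (35) as zfs:zfs_vdev_max_pending; verify for your build before use
    set zfs:zfs_vdev_max_pending = 10
]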
Hi Richard,

Thank you for taking so much time on this! The array is a StorageTek FLX210,
so it is a bit underpowered... best we could afford at the time.

In terms of the load on it, we have two servers running Solaris 10. Each
physical server has two containers, and each container has a MySQL instance
in it. The primary physical server has the masters, and the secondary
physical server has the slaves. The slaves use MySQL binlog replication to
get the INSERTs/UPDATEs from the masters.

Each physical server has 3 vdevs that are RAID-Z'd together. We then lay out
two file systems in the zpool, one per container. The vdevs in the zpools are
actually LUNs, each one in a separate volume group on the FLX210. So we have
three volume groups on the RAID array with two LUNs per VG. That's six LUNs,
and each physical server has a RAID-Z zpool built out of its three LUNs
(striped across the volume groups). Each VG has 4 disks in it, so this
maximized use of the 12 drives.

Unfortunately, since both servers are striping across the same 3 volume
groups, I think it caused our performance issue. Also, the volume groups were
RAID-5, so we had RAID-Z on top of RAID-5. This meant we could lose 3 disks
and still be OK in a worst-case scenario, but it's killing the performance.
The FLX210 doesn't have RAID ASICs, as I recently learned. :-(

As a stop gap, we stopped the replication to the slaves and converted the
RAID array's volume groups to RAID-1. This seems to have reduced the issue
tremendously, for now. Given the limited number of disks we have to work
with, the new layout we've decided on is 5 volume groups:

* VG 1 (2-disk RAID-1): physical_server1_DB
  * LUN 0: SQL Master 1 DB
  * LUN 1: SQL Master 2 DB
* VG 2 (2-disk RAID-1): physical_server1_logs
  * LUN 0: SQL Master 1 Logs
  * LUN 1: SQL Master 2 Logs
* VG 3 (2-disk RAID-1): physical_server2_DB
  * LUN 0: SQL Slave 1 DB
  * LUN 1: SQL Slave 2 DB
* VG 4 (2-disk RAID-1): physical_server2_logs
  * LUN 0: SQL Slave 1 Logs
  * LUN 1: SQL Slave 2 Logs
* VG 5 (2-disk RAID-1): Windows Server LUNs
  * LUN 0: Exchange Server LUN
  * LUN 1: Maintenance LUN

Each SQL LUN will be a vdev in its own striped zpool. And we've got two disks
in reserve for additional capacity (not counting the 2 array hot-spares).

The main concern at the moment is that, given the current layout, we don't
waste much space at all filesystem-wise (though we do lose a lot of space
with the double RAID-5). The new layout, however, would give the logs a ton
of space of which they might use 10% (but we don't want to consolidate all
the logs on the same VG lest we get the contention problems back). It's a
tough trade-off to make -- space for speed. I'm somewhat of a mind to have
all the logs use a single VG and see how the performance fares, and add a
second VG only if necessary.

Currently, we see about 40-70 IOPS per vdev. So that can average 120-200 IOPS
per VG with a peak of 300. About 60-70% of those IOPS are writes as well. One
thing all those VGs in the new layout would let us do is figure out how many
of those IOPS are random and how many are sequential log writes.

As always, advice/thoughts are appreciated.

Best Regards,
Jason
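[A rough sketch of how the per-server pools in that layout might be created,
one single-LUN pool per database and per log area as Jason describes. Pool
names and device paths are placeholders, not from the thread:

    # on physical server 1 (c5t*d0 names stand in for the FLX210 LUNs)
    zpool create master1_db   c5t0d0    # VG 1, LUN 0
    zpool create master2_db   c5t1d0    # VG 1, LUN 1
    zpool create master1_logs c5t2d0    # VG 2, LUN 0
    zpool create master2_logs c5t3d0    # VG 2, LUN 1
    # server 2 would mirror this pattern against VG 3 and VG 4
]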