Hello,

Is it possible to non-destructively change RAID types in a zpool while the data remains on-line?

-J
Jason J. W. Williams wrote:
> Is it possible to non-destructively change RAID types in a zpool while
> the data remains on-line?

Yes. With constraints, however. What exactly are you trying to do?
 -- richard
Hi Richard,

Originally, my thinking was I'd like to drop one member out of a three-member
RAID-Z and turn it into a RAID-1 zpool. Although, at the moment I'm not sure.

Currently, I have 3 volume groups in my array with 4 disks each (12 disks
total). These VGs are sliced into 3 volumes each. I then have two database
servers, each using one LUN from each of the 3 VGs, RAID-Z'd together. For
redundancy it's great; for performance it's pretty bad.

One of the major issues is disk seek contention between the servers, since
they're all using the same disks and RAID-Z tries to utilize all the devices
it has access to on every write.

What I thought I'd move to was 6 RAID-1 VGs on the array, and assign the VGs
to each server via a one-device striped zpool. However, given the fact that
ZFS will kernel panic in the event of bad data, I'm reconsidering how to lay
it out.

Essentially I've got 12 disks to work with.

Anyway, that's the long form of trying to convert from RAID-Z to RAID-1. Any
help is much appreciated.

Best Regards,
Jason

On 11/28/06, Richard Elling <Richard.Elling at sun.com> wrote:
> Jason J. W. Williams wrote:
> > Is it possible to non-destructively change RAID types in a zpool while
> > the data remains on-line?
>
> Yes. With constraints, however. What exactly are you trying to do?
>  -- richard
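[For reference, the existing layout Jason describes would have been built
roughly like this -- a single RAID-Z vdev made from one LUN per array volume
group. The pool name and device paths are placeholders, not from the thread:

    # one LUN from each of the three array volume groups, RAID-Z'd together
    # (pool name and c2t*d0 devices are hypothetical)
    zpool create dbpool raidz c2t0d0 c2t1d0 c2t2d0
    zpool status dbpool
]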
comment below...

Jason J. W. Williams wrote:
> Hi Richard,
>
> Originally, my thinking was I'd like to drop one member out of a
> three-member RAID-Z and turn it into a RAID-1 zpool.
>
> Although, at the moment I'm not sure.
>
> Currently, I have 3 volume groups in my array with 4 disks each (12 disks
> total). These VGs are sliced into 3 volumes each. I then have two database
> servers, each using one LUN from each of the 3 VGs, RAID-Z'd together. For
> redundancy it's great; for performance it's pretty bad.
>
> One of the major issues is disk seek contention between the servers, since
> they're all using the same disks and RAID-Z tries to utilize all the
> devices it has access to on every write.
>
> What I thought I'd move to was 6 RAID-1 VGs on the array, and assign the
> VGs to each server via a one-device striped zpool. However, given the fact
> that ZFS will kernel panic in the event of bad data, I'm reconsidering how
> to lay it out.
>
> Essentially I've got 12 disks to work with.
>
> Anyway, that's the long form of trying to convert from RAID-Z to RAID-1.
> Any help is much appreciated.

No such conversion-in-place is possible today. The consensus for databases,
such as Oracle, is that you want your logs on a different zpool than your
data. The simplest way to implement this with redundancy is to mirror the
log zpool. You might try that first, before you re-lay out the data.
 -- richard
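[A minimal sketch of the mirrored log pool Richard suggests. The pool name,
filesystem name, and device paths are hypothetical placeholders:

    # separate pool for database transaction logs, mirrored for redundancy
    # (c3t0d0/c3t1d0 stand in for two LUNs on different volume groups)
    zpool create logpool mirror c3t0d0 c3t1d0
    zfs create logpool/mysql-logs
]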
Jason J. W. Williams wrote:
> Hi Richard,
>
> Originally, my thinking was I'd like to drop one member out of a
> three-member RAID-Z and turn it into a RAID-1 zpool.

You would need to destroy the pool to do this -- requiring the data to be
copied twice.

> Although, at the moment I'm not sure.

So many options, so little time... :-)

> Currently, I have 3 volume groups in my array with 4 disks each (12 disks
> total). These VGs are sliced into 3 volumes each. I then have two database
> servers, each using one LUN from each of the 3 VGs, RAID-Z'd together. For
> redundancy it's great; for performance it's pretty bad.
>
> One of the major issues is disk seek contention between the servers, since
> they're all using the same disks and RAID-Z tries to utilize all the
> devices it has access to on every write.

This is difficult to pin down. The disks cache and the RAID controller
caches. So while it is true that you would have contention, it is difficult
to predict what effect, if any, the hosts would see.

> What I thought I'd move to was 6 RAID-1 VGs on the array, and assign the
> VGs to each server via a one-device striped zpool. However, given the fact
> that ZFS will kernel panic in the event of bad data, I'm reconsidering how
> to lay it out.

NB: all other file systems will similarly panic. We get spoiled to some
extent because there are errors where ZFS won't panic. In the future, there
will be more errors that ZFS can handle without panicking.

> Essentially I've got 12 disks to work with.
>
> Anyway, that's the long form of trying to convert from RAID-Z to RAID-1.
> Any help is much appreciated.

send/receive = copy/copy = backup/restore
It may be possible to do this as a rolling reconfiguration.
 -- richard
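[A sketch of the send/receive migration Richard alludes to, assuming a
hypothetical source pool "dbpool" and spare devices for the new mirrored
pool; all names here are illustrative, not from the thread:

    # build the new pool on spare/freed devices (placeholder names)
    zpool create newpool mirror c4t0d0 c4t1d0

    # snapshot the source filesystem and copy it to the new pool
    zfs snapshot dbpool/data@migrate
    zfs send dbpool/data@migrate | zfs receive newpool/data

    # after a final incremental pass and application cutover,
    # the old pool can be retired, e.g.:
    #   zfs send -i dbpool/data@migrate dbpool/data@final | zfs receive newpool/data
    #   zpool destroy dbpool
]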
Hi Richard,

Been watching the stats on the array and the cache hits are < 3% on these
volumes. We're very write heavy, and rarely write similar enough data twice.
Having random-oriented database data and sequential-oriented database log
data on the same volume groups, it seems to me, was causing a lot of head
repositioning.

By shutting down the slave database servers we cut the latency tremendously,
which would seem to me to indicate a lot of contention. But I'm trying to
come up to speed on this, so I may be wrong. "iostat -xtcnz 5" showed the
latency dropped from 200 to 20 once we cut the replication. Since the masters
and slaves were using the same volume groups, and RAID-Z was striping across
all of them on both the masters and the slaves, I think this was a big
problem.

Any comments?

Best Regards,
Jason
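[For readers following along, the iostat invocation Jason cites and the
columns most relevant to this latency comparison; the header shown is the
usual extended "-xn" layout (verify against your own output):

    # extended per-device stats, 5-second intervals, skipping idle devices
    iostat -xtcnz 5
    #   r/s  w/s  kr/s  kw/s wait actv wsvc_t asvc_t  %w  %b device
    # wsvc_t/asvc_t are wait and active service times in ms; the 200 -> 20
    # change Jason describes would show up in asvc_t, with wait/actv giving
    # the queue depths behind it
]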
Hi Jason,
It seems to me that there is some detailed information which would be needed
for a full analysis. So, to keep the ball rolling, I'll respond generally.

Jason J. W. Williams wrote:
> Hi Richard,
>
> Been watching the stats on the array and the cache hits are < 3% on these
> volumes. We're very write heavy, and rarely write similar enough data
> twice. Having random-oriented database data and sequential-oriented
> database log data on the same volume groups, it seems to me, was causing a
> lot of head repositioning.

In general, writes are buffered. For many database workloads, the sequential
log writes won't be write cache hits and will be coalesced. There are several
ways you could account for this, but suffice it to say that the read cache
hit rate is more interesting for performance improvement opportunities. The
random reads are often cache misses, and adding prefetch is often a waste of
resources -- the nature of the beast.

For ZFS, all data writes should be sequential until you get near the capacity
of the volume, when there will be a search for free blocks which may be
randomly dispersed. One way to look at this is that for new and
not-yet-filled volumes, ZFS will write sequentially, unlike other file
systems. Once you get filled, then ZFS will write more like other file
systems. Hence, your write performance for ZFS may change over time, though
this will be somewhat mitigated by the RAID array's write buffer cache.

> By shutting down the slave database servers we cut the latency
> tremendously, which would seem to me to indicate a lot of contention. But
> I'm trying to come up to speed on this, so I may be wrong.

This is likely.

Note that RAID controllers are really just servers which speak a block-level
protocol to other hosts. Some RAID controllers are underpowered.

ZFS on a modern server can create a significant workload. This can also
clobber a RAID array. For example, by default, ZFS will queue up to 35 iops
per vdev before blocking. If you have one RAID array which is connected to 4
hosts, each host having 5 vdevs, then the RAID array would need to be able to
handle 700 (35 * 4 * 5) concurrent iops. There are RAID arrays, which will
remain nameless, that will not handle that workload very well. Under lab
conditions you should be able to empirically determine the knee in the
response-time curve as you add workload.

To compound the problem, fibre channel has pitiful flow control. Thus it may
also be necessary to throttle the concurrent iops at the source. I'm not sure
what the current thinking is on tuning vq_max_pending (35) for ZFS; you might
search for it in the archives. [The intent is to have no tunables -- let the
system figure out what to do best.]

> "iostat -xtcnz 5" showed the latency dropped from 200 to 20 once we cut the
> replication. Since the masters and slaves were using the same volume
> groups, and RAID-Z was striping across all of them on both the masters and
> the slaves, I think this was a big problem.

It is hard for me to visualize your setup, but this is a tell-tale sign that
you've overrun the RAID box. Changing the volume partitioning will likely
help, perhaps tremendously.
 -- richard
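[If you do experiment with the per-vdev queue depth Richard mentions: on
builds that expose it as the zfs_vdev_max_pending kernel variable (an
assumption -- check the archives and your release before relying on it), the
usual way to lower it is an /etc/system entry, e.g.:

    * /etc/system -- assumes this release exposes the vq_max_pending default
    * (35) as zfs:zfs_vdev_max_pending; verify for your build before use
    set zfs:zfs_vdev_max_pending = 10
]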
Hi Richard,

Thank you for taking so much time on this! The array is a StorageTek FLX210,
so it is a bit underpowered... best we could afford at the time.

In terms of the load on it, we have two servers running Solaris 10. Each
physical server has two containers, and each container has a MySQL instance
in it. The primary physical server has the masters, and the secondary
physical server has the slaves. The slaves use MySQL binlog replication to
get the INSERTs/UPDATEs from the masters.

Each physical server has 3 vdevs that are RAID-Z'd together. We then lay out
two file systems in the zpool, one per container. The vdevs in the zpools are
actually LUNs, each one in a separate volume group on the FLX210. So we have
three volume groups on the RAID array with two LUNs per VG. That's six LUNs,
and each physical server has a RAID-Z zpool built out of its three LUNs
(striped across the volume groups). Each VG has 4 disks in it, so this
maximized use of the 12 drives.

Unfortunately, since both servers are striping across the same 3 volume
groups, I think it caused our performance issue. Also, the volume groups were
RAID-5, so we had RAID-Z on top of RAID-5. This meant we could lose 3 disks
and still be OK in a worst-case scenario, but it's killing the performance.
The FLX210 doesn't have RAID ASICs, as I recently learned. :-(

As a stop gap, we stopped the replication to the slaves and converted the
RAID array's volume groups to RAID-1. This seems to have reduced the issue
tremendously, for now. Given the limited number of disks we have to work
with, the new layout we've decided on is 5 volume groups:

* VG 1 (2-disk RAID-1): physical_server1_DB
  * LUN 0: SQL Master 1 DB
  * LUN 1: SQL Master 2 DB
* VG 2 (2-disk RAID-1): physical_server1_logs
  * LUN 0: SQL Master 1 Logs
  * LUN 1: SQL Master 2 Logs
* VG 3 (2-disk RAID-1): physical_server2_DB
  * LUN 0: SQL Slave 1 DB
  * LUN 1: SQL Slave 2 DB
* VG 4 (2-disk RAID-1): physical_server2_logs
  * LUN 0: SQL Slave 1 Logs
  * LUN 1: SQL Slave 2 Logs
* VG 5 (2-disk RAID-1): Windows Server LUNs
  * LUN 0: Exchange Server LUN
  * LUN 1: Maintenance LUN

Each SQL LUN will be a vdev in its own striped zpool. And we've got two disks
in reserve for additional capacity (not counting the 2 array hot-spares).

The main concern at the moment is that, given the current layout, we don't
waste much space at all filesystem-wise (though we do lose a lot of space
with the double RAID-5). The new layout, however, would give the logs a ton
of space of which they might use 10% (but we don't want to consolidate all
the logs on the same VG lest we get the contention problems back). It's a
tough trade-off to make -- space for speed. I'm somewhat of a mind to have
all the logs use a single VG and see how the performance fares, and add a
second VG only if necessary.

Currently, we see about 40-70 IOPS per vdev. So that can average 120-200 IOPS
per VG with a peak of 300. About 60-70% of those IOPS are writes as well. One
thing all those VGs in the new layout would let us do is figure out how many
of those IOPS are random and how many are sequential log writes.

As always, advice/thoughts are appreciated.

Best Regards,
Jason
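[A rough sketch of how the per-server pools in that layout might be created,
one single-LUN pool per database and per log area as Jason describes. Pool
names and device paths are placeholders, not from the thread:

    # on physical server 1 (c5t*d0 names stand in for the FLX210 LUNs)
    zpool create master1_db   c5t0d0    # VG 1, LUN 0
    zpool create master2_db   c5t1d0    # VG 1, LUN 1
    zpool create master1_logs c5t2d0    # VG 2, LUN 0
    zpool create master2_logs c5t3d0    # VG 2, LUN 1
    # server 2 would mirror this pattern against VG 3 and VG 4
]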