Hi All.

The ZFS documentation says ZFS schedules its I/O in such a way that it manages
to saturate a single disk's bandwidth using enough concurrent 128K I/Os. The
number of concurrent I/Os is decided by vq_max_pending. The default value for
vq_max_pending is 35.

We have created a 4-disk raid-z group inside a ZFS pool on a Thumper. The ZFS
record size is set to 128k. When we read/write a 128K record, it issues a
128K/3 I/O to each of the 3 data disks in the 4-disk raid-z group.

We need to saturate all three data disks' bandwidth in the raid-z group. Is it
required to set the vq_max_pending value to 35*3=105?

Thanks
Manoj Nayak
Manoj Nayak wrote:
> We have created a 4-disk raid-z group inside a ZFS pool on a Thumper. The
> ZFS record size is set to 128k. When we read/write a 128K record, it issues
> a 128K/3 I/O to each of the 3 data disks in the 4-disk raid-z group.

Yes, this is how it works for a read without errors. For a write, you
should see 4 writes, each 128 KBytes/3. Writes may also be coalesced,
so you may see larger physical writes.

> We need to saturate all three data disks' bandwidth in the raid-z group.
> Is it required to set the vq_max_pending value to 35*3=105?

No. vq_max_pending applies to each vdev. Use iostat to see what the
device load is. For the commonly used Hitachi 500 GByte disks in a
Thumper, the read media bandwidth is 31-64.8 MBytes/s. Writes will be
about 80% of reads, or 24.8-51.8 MBytes/s. In a Thumper, the disk
bandwidth will be the limiting factor for the hardware.
 -- richard
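As a concrete starting point (the interval below is arbitrary), the per-device
load can be watched with:

    # iostat -xnz 5

The actv column is the average number of commands outstanding at the device,
asvc_t is the average service time, and %b is percent busy. If actv is already
sitting near vq_max_pending and asvc_t keeps climbing, the queue is full and
pushing more concurrent requests will not buy additional bandwidth.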
> Manoj Nayak wrote:
>> We have created a 4-disk raid-z group inside a ZFS pool on a Thumper. The
>> ZFS record size is set to 128k. When we read/write a 128K record, it
>> issues a 128K/3 I/O to each of the 3 data disks in the 4-disk raid-z group.
>
> Yes, this is how it works for a read without errors. For a write, you
> should see 4 writes, each 128 KBytes/3. Writes may also be coalesced,
> so you may see larger physical writes.
>
>> We need to saturate all three data disks' bandwidth in the raid-z group.
>> Is it required to set the vq_max_pending value to 35*3=105?
>
> No. vq_max_pending applies to each vdev.

A 4-disk raidz group issues a 128k/3 = 42.6k I/O to each individual data
disk. If 35 concurrent 128k I/Os are enough to saturate a disk (vdev), then
35*3 = 105 concurrent 42k I/Os will be required to saturate the same disk.

Thanks
Manoj Nayak

> Use iostat to see what the device load is. For the commonly used Hitachi
> 500 GByte disks in a Thumper, the read media bandwidth is 31-64.8 MBytes/s.
> Writes will be about 80% of reads, or 24.8-51.8 MBytes/s. In a Thumper,
> the disk bandwidth will be the limiting factor for the hardware.
>  -- richard
manoj nayak wrote:
> A 4-disk raidz group issues a 128k/3 = 42.6k I/O to each individual data
> disk. If 35 concurrent 128k I/Os are enough to saturate a disk (vdev),
> then 35*3 = 105 concurrent 42k I/Os will be required to saturate the
> same disk.

ZFS doesn't know anything about disk saturation. It will send up to
vq_max_pending I/O requests per vdev (usually a vdev is a disk). It
will try to keep vq_max_pending I/O requests queued to the vdev.

For writes, you should see them become coalesced, so rather than
sending 3 42.6 kByte write requests to a vdev, you might see one
128 kByte write request.

In other words, ZFS has an I/O scheduler which is responsible for
sending I/O requests to vdevs.
 -- richard
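If you do want to experiment with that limit, the per-vdev queue depth was
exposed at the time through the zfs_vdev_max_pending kernel tunable rather
than the internal vq_max_pending field itself -- treat the exact name and
availability as an assumption to verify on your build. A minimal sketch:

    * /etc/system (takes effect at next boot); the value 16 is only an example
    set zfs:zfs_vdev_max_pending = 16

    # or, on a live system, with mdb (the 0t prefix means decimal)
    echo zfs_vdev_max_pending/W0t16 | mdb -kw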
----- Original Message -----
From: "Richard Elling" <Richard.Elling at Sun.COM>
To: "manoj nayak" <Manoj.Nayak at Sun.COM>
Cc: <zfs-discuss at opensolaris.org>
Sent: Wednesday, January 23, 2008 7:20 AM
Subject: Re: [zfs-discuss] ZFS vq_max_pending value ?

> ZFS doesn't know anything about disk saturation. It will send up to
> vq_max_pending I/O requests per vdev (usually a vdev is a disk). It
> will try to keep vq_max_pending I/O requests queued to the vdev.

I can see the "avg pending I/Os" hitting my vq_max_pending limit, so
raising the limit would be a good thing. I think it is due to the many
42k read I/Os to the individual disks in the 4-disk raidz group.

Thanks
Manoj Nayak

> For writes, you should see them become coalesced, so rather than
> sending 3 42.6 kByte write requests to a vdev, you might see one
> 128 kByte write request.
>
> In other words, ZFS has an I/O scheduler which is responsible for
> sending I/O requests to vdevs.
>  -- richard
manoj nayak wrote:
> I can see the "avg pending I/Os" hitting my vq_max_pending limit, so
> raising the limit would be a good thing. I think it is due to the many
> 42k read I/Os to the individual disks in the 4-disk raidz group.

You're dealing with a queue here. iostat's average pending I/Os
represents the queue depth. Some devices can't handle a large queue.
In any case, queuing theory applies.

Note that for reads, the disk will likely have a track cache, so it is
not a good assumption that a read I/O will require a media access.
 -- richard
> You're dealing with a queue here. iostat's average pending I/Os
> represents the queue depth. Some devices can't handle a large queue.
> In any case, queuing theory applies.
>
> Note that for reads, the disk will likely have a track cache, so it is
> not a good assumption that a read I/O will require a media access.

My workload issues around 5000 MB of read I/O, and iopattern says around
55% of the I/Os are random in nature. I don't know how much prefetching
through the track cache is going to help here. Probably I can try
disabling the vdev cache through "set 'zfs_vdev_cache_max' 1".

Thanks
Manoj Nayak
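For reference, that setting is normally placed in /etc/system; a minimal
sketch, assuming the tunable keeps that name on this build:

    * /etc/system -- reads smaller than zfs_vdev_cache_max are inflated into
    * larger vdev-cache reads; setting it to 1 effectively disables that
    * read-ahead behaviour
    set zfs:zfs_vdev_cache_max = 1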
Manoj Nayak writes:
> We have created a 4-disk raid-z group inside a ZFS pool on a Thumper. The
> ZFS record size is set to 128k. When we read/write a 128K record, it
> issues a 128K/3 I/O to each of the 3 data disks in the 4-disk raid-z group.
>
> We need to saturate all three data disks' bandwidth in the raid-z group.
> Is it required to set the vq_max_pending value to 35*3=105?

Nope.

Once a disk controller is working on 35 requests, we don't expect to
get any more out of it by queueing more requests, and we might even
confuse the firmware and get less.

Now for an array controller and a vdev fronting for a large number of
disks, 35 might be a low number that does not allow full throughput.
Rather than tuning 35 up, we suggest splitting devices into smaller
LUNs, since each LUN is given a 35-deep queue.

Tuning vq_max_pending down helps read and synchronous write (ZIL)
latency. Today the preferred way to help ZIL latency is to use a
Separate Intent Log.

-r
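A separate intent log is attached with zpool; a minimal sketch (the pool and
device names here are made up, and slog support needs a recent enough pool
version):

    # add a dedicated, low-latency log device to an existing pool
    zpool add tank log c4t0d0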
Roch - PAE wrote:
> Once a disk controller is working on 35 requests, we don't expect to
> get any more out of it by queueing more requests, and we might even
> confuse the firmware and get less.
>
> Now for an array controller and a vdev fronting for a large number of
> disks, 35 might be a low number that does not allow full throughput.
> Rather than tuning 35 up, we suggest splitting devices into smaller
> LUNs, since each LUN is given a 35-deep queue.

It means the 4-disk raid-z group inside the ZFS pool is exported to ZFS as
a single device (vdev), and ZFS assigns a vq_max_pending value of 35 to
this vdev. To get higher throughput, do I need to do one of the following?

1. Reduce the number of disks in the raidz group from four to three, so
that the same pending queue of 35 is available for fewer disks.

Or

2. Create slices out of the physical disks and build the raidz group out
of four slices of a physical disk, so that the same pending queue of 35
is available to four slices of one physical disk.

Thanks
Manoj Nayak

> Tuning vq_max_pending down helps read and synchronous write (ZIL)
> latency. Today the preferred way to help ZIL latency is to use a
> Separate Intent Log.
>
> -r
On Jan 23, 2008 6:36 AM, Manoj Nayak <Manoj.Nayak at sun.com> wrote:
> It means the 4-disk raid-z group inside the ZFS pool is exported to ZFS
> as a single device (vdev), and ZFS assigns a vq_max_pending value of 35
> to this vdev. To get higher throughput, do I need to do one of the
> following?
>
> 1. Reduce the number of disks in the raidz group from four to three, so
> that the same pending queue of 35 is available for fewer disks.
>
> Or
>
> 2. Create slices out of the physical disks and build the raidz group out
> of four slices of a physical disk, so that the same pending queue of 35
> is available to four slices of one physical disk.

Or switch to mirrors instead, if you can live with the capacity hit.
Mirrors will have much better random read performance than raidz, since
they don't need to read from every disk to make sure the checksum matches.

Will
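With the same four disks, that layout would look something like the
following (device names are invented for illustration); usable capacity
drops to two disks' worth, but a random read only needs to touch one side
of one mirror:

    # two 2-way mirrors striped together, instead of one 4-disk raidz
    zpool create tank mirror c0t0d0 c1t0d0 mirror c0t4d0 c1t4d0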
Manoj Nayak wrote:
> It means the 4-disk raid-z group inside the ZFS pool is exported to ZFS
> as a single device (vdev), and ZFS assigns a vq_max_pending value of 35
> to this vdev. To get higher throughput, do I need to do one of the
> following?

This is not the terminology we use to describe ZFS. Quite simply, a
storage pool contains devices configured in some way, hopefully using
some form of data protection (mirror, raidz[12]) -- see zpool(1m).
Each storage pool can contain one or more file systems or volumes --
see zfs(1m). The term "export" is used to describe transition of
ownership of a storage pool between different hosts.

> 1. Reduce the number of disks in the raidz group from four to three, so
> that the same pending queue of 35 is available for fewer disks.

35 is for each physical disk.

> 2. Create slices out of the physical disks and build the raidz group out
> of four slices of a physical disk, so that the same pending queue of 35
> is available to four slices of one physical disk.

This will likely have a negative scaling effect. Some devices,
especially raw disks, have wimpy microprocessors and limited memory.
You can easily overload them and see the response time increase
dramatically, just as queuing theory will suggest. Some research has
shown that a value of 8-16 is better, at least for some storage
devices. A value of 1 is perhaps too low, at least for devices which
can handle multiple outstanding I/Os.

> My workload issues around 5000 MB of read I/O, and iopattern says around
> 55% of the I/Os are random in nature. I don't know how much prefetching
> through the track cache is going to help here. Probably I can try
> disabling the vdev cache through "set 'zfs_vdev_cache_max' 1".

We can't size something like this unless we also know the I/O size.
If you are talking small iops, say 8 kBytes, then you'll need lots of
disks. For larger iops, you may be able to get by with fewer disks.
 -- richard
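As a rough back-of-envelope illustration (the per-disk figures are generic
assumptions, not measurements of the Thumper drives): a 7200 rpm SATA disk
manages on the order of 100-150 random IOPS. At 8 kByte I/Os that is only
about 1 MByte/s of random-read bandwidth per disk, and even at the ~43 kByte
I/Os produced by the raidz split of a 128k record it is roughly 4-6 MBytes/s
per disk, far below the 31-64.8 MBytes/s streaming figure. For a workload
that is 55% random, the number of spindles, not the queue depth, is what
sets the achievable throughput.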