Hi,

zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
                  raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
                  raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
                  raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
                  [...]
                  raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0

zfs set atime=off test
zfs set recordsize=16k test
(I know...)

Now if I create one large file with filebench and simulate a random-read
workload with 1 or more threads, then the disks on the c2 and c3 controllers
get about 80% more reads than the others. This happens on both 111b and
snv_134. I would rather expect all of them to get about the same number of
IOPS.

Any idea why?

--
Robert Milkowski
http://milek.blogspot.com
Reaching into the dusty regions of my brain, I seem to recall that, since
RAID-Z does not work like a traditional RAID 5 (particularly because of its
variably sized stripes), the data may not hit all of the disks, but it will
always be redundant. I apologize for not having a reference for this
assertion, so I may be completely wrong.

I assume your hardware is recent, the controllers are on PCIe x4 buses, etc.

-Scott
Hey Robert,

How big of a file are you making? RAID-Z does not explicitly do the parity
distribution that RAID-5 does. Instead, it relies on non-uniform stripe
widths to distribute IOPS.

Adam

On Jun 18, 2010, at 7:26 AM, Robert Milkowski wrote:

> Now if I create one large file with filebench and simulate a random-read
> workload with 1 or more threads, then the disks on the c2 and c3 controllers
> get about 80% more reads than the others. This happens on both 111b and
> snv_134. I would rather expect all of them to get about the same number of
> IOPS.
>
> Any idea why?

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
128GB.

Does it mean that for a dataset used for databases and similar environments,
where basically all blocks have a fixed size and there is no other data, all
parity information will end up on one (raidz1) or two (raidz2) specific disks?

On 23/06/2010 17:51, Adam Leventhal wrote:
> How big of a file are you making? RAID-Z does not explicitly do the parity
> distribution that RAID-5 does. Instead, it relies on non-uniform stripe
> widths to distribute IOPS.
> Does it mean that for a dataset used for databases and similar environments,
> where basically all blocks have a fixed size and there is no other data, all
> parity information will end up on one (raidz1) or two (raidz2) specific
> disks?

No. There are always smaller writes to metadata that will distribute parity.
What is the total width of your raidz1 stripe?

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
On Jun 23, 2010, at 1:48 PM, Robert Milkowski <milek at task.gda.pl> wrote:

> 128GB.
>
> Does it mean that for a dataset used for databases and similar environments,
> where basically all blocks have a fixed size and there is no other data, all
> parity information will end up on one (raidz1) or two (raidz2) specific
> disks?

What's the record size on those datasets? 8k?

-Ross
On 23/06/2010 18:50, Adam Leventhal wrote:
> No. There are always smaller writes to metadata that will distribute parity.
> What is the total width of your raidz1 stripe?

4x disks, 16KB recordsize, 128GB file, random read with a 16KB block size.

--
Robert Milkowski
http://milek.blogspot.com
On 23/06/2010 19:29, Ross Walker wrote:
> What's the record size on those datasets? 8k?

16K
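For reference, here is roughly how a single 16 KB record should lay out on one
of the 4-wide raidz1 vdevs, going by a reading of vdev_raidz_asize() and
vdev_raidz_map_alloc(). Treat it as a sketch only: the 512-byte sector size,
the one-parity-sector-per-row rule, and the round-up to a multiple of
nparity+1 sectors are assumptions from memory, not verified against the
source.

    # Sketch of raidz1 geometry for one record; the rules below are assumed,
    # not verified against the actual vdev_raidz code.
    SECTOR = 512                       # assumed ashift=9 devices

    def raidz_layout(psize, ndisks, nparity):
        data = (psize + SECTOR - 1) // SECTOR        # data sectors
        rows = -(-data // (ndisks - nparity))        # rows in the stripe
        parity = nparity * rows                      # one parity sector per row
        alloc = data + parity
        pad = -alloc % (nparity + 1)                 # assumed round-up rule
        return data, parity, pad, (alloc + pad) * SECTOR

    data, parity, pad, nbytes = raidz_layout(16 * 1024, ndisks=4, nparity=1)
    print(data, parity, pad, nbytes)   # -> 32 11 1 22528

If that is right, each 16 KB record consumes 32 data sectors, 11 parity
sectors and one pad sector, the parity sits in the column where the allocation
happens to start, and a read of the record touches only the data columns. So
which disks end up busiest would depend on how the allocator places
consecutive records, which is speculation rather than a confirmed explanation
of the skew.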
On Jun 24, 2010, at 5:40 AM, Robert Milkowski <milek at task.gda.pl> wrote:

> 4x disks, 16KB recordsize, 128GB file, random read with a 16KB block size.

From what I gather, each 16KB record (plus parity) is spread across the raidz
disks. This causes the total random IOPS (write AND read) of the raidz to be
that of the slowest disk in the raidz.

Raidz is definitely made for sequential IO patterns, not random. To get good
random IO with raidz you need a zpool with X raidz vdevs, where
X = desired IOPS / IOPS of a single drive.

-Ross
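As a quick worked example of that rule of thumb (the figures below are made-up
placeholders for illustration, not numbers from this thread):

    # Hypothetical figures, purely to illustrate the vdev-count rule of thumb.
    desired_iops = 4000      # assumed target random-read IOPS for the pool
    drive_iops = 180         # assumed random IOPS of one 7200 rpm disk
    vdevs = -(-desired_iops // drive_iops)   # ceiling division
    print(vdevs, "raidz vdevs")              # -> 23 raidz vdevs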
On 24/06/2010 14:32, Ross Walker wrote:
> Raidz is definitely made for sequential IO patterns, not random. To get good
> random IO with raidz you need a zpool with X raidz vdevs, where
> X = desired IOPS / IOPS of a single drive.

I know that, and it wasn't my question.

--
Robert Milkowski
http://milek.blogspot.com
On Thu, 24 Jun 2010, Ross Walker wrote:
> Raidz is definitely made for sequential IO patterns, not random. To get good
> random IO with raidz you need a zpool with X raidz vdevs, where
> X = desired IOPS / IOPS of a single drive.

Remarkably, I have yet to see mention of someone testing a raidz comprised
entirely of FLASH SSDs. This should help with the IOPS, particularly when
reading.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On 24/06/2010 15:54, Bob Friesenhahn wrote:
> Remarkably, I have yet to see mention of someone testing a raidz comprised
> entirely of FLASH SSDs. This should help with the IOPS, particularly when
> reading.

I have. Briefly:

X4270, 2x quad-core 2.93GHz, 72GB RAM
OpenSolaris 2009.06 (snv_111b), ARC limited to 4GB
44x SSDs in an F5100; 4x SAS HBAs, 4x physical SAS connections to the F5100
(16x SAS channels in total), each to a different domain.

1. RAID-10 pool: 22x mirrors, each mirror across domains
   ZFS: recordsize=16k, atime=off
   filebench randomread benchmark, 16KB block size, 1-128 threads,
   128GB working set.
   Maximum performance at 128 threads: ~137,000 ops/s

2. RAID-Z pool: 11x 4-way raidz, each raidz vdev across domains
   ZFS: recordsize=16k, atime=off
   filebench randomread benchmark, 16KB block size, 1-128 threads,
   128GB working set.
   Maximum performance at 64-128 threads: ~34,000 ops/s
   With a ZFS recordsize of 32KB it got up to ~41,000 ops/s; larger ZFS
   record sizes produced worse results.

RAID-Z delivered about 3.3x fewer ops/s than RAID-10 here. SSDs do not make
any fundamental change: RAID-Z characteristics are basically the same whether
it is built out of SSDs or HDDs. However, SSDs could of course provide
good-enough performance even with RAID-Z, as at the end of the day it is not
about benchmarks but about your environment's requirements. A given number of
SSDs in a RAID-Z configuration can deliver the same performance as a much
greater number of disk drives in a RAID-10 configuration, and if you don't
need much space it could make sense.

--
Robert Milkowski
http://milek.blogspot.com
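Put another way, just doing the arithmetic on the figures quoted above (no new
measurements here):

    # Simple arithmetic on the numbers quoted above; nothing new is measured.
    mirror_vdevs, mirror_ops = 22, 137000
    raidz_vdevs, raidz_ops_16k, raidz_ops_32k = 11, 34000, 41000
    print(round(mirror_ops / mirror_vdevs))      # ~6227 ops/s per 2-way mirror
    print(round(raidz_ops_16k / raidz_vdevs))    # ~3091 ops/s per 4-way raidz1
    print(round(mirror_ops / raidz_ops_32k, 1))  # ~3.3x, the ratio quoted above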
On Jun 24, 2010, at 10:42 AM, Robert Milkowski <milek at task.gda.pl> wrote:

> I know that, and it wasn't my question.

Sorry, that was meant for the OP...
Hey Robert,

I've filed a bug to track this issue. We'll try to reproduce the problem and
evaluate the cause. Thanks for bringing this to our attention.

Adam

On Jun 24, 2010, at 2:40 AM, Robert Milkowski wrote:

> 4x disks, 16KB recordsize, 128GB file, random read with a 16KB block size.

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Ross Walker wrote:
> Raidz is definitely made for sequential IO patterns, not random. To get good
> random IO with raidz you need a zpool with X raidz vdevs, where
> X = desired IOPS / IOPS of a single drive.

I have seen statements like this repeated several times, though I haven't been
able to find an in-depth discussion of why this is the case. From what I've
gathered, every block (what is the correct term for this? zio block?) written
is spread across the whole raid-z. But in what units? Will a 4k write be split
into 512-byte writes? And in the opposite direction, does every block need to
be read fully, even if only parts of it are being requested, because the
checksum needs to be checked? Will the parity be read, too?

If this is all the case, I can see why raid-z reduces the performance of an
array effectively to that of one device w.r.t. random reads.

Thanks,
Arne
On 24/06/2010 20:52, Arne Jansen wrote:
> I have seen statements like this repeated several times, though I haven't
> been able to find an in-depth discussion of why this is the case.

http://blogs.sun.com/roch/entry/when_to_and_not_to

--
Robert Milkowski
http://milek.blogspot.com