Under Solaris 10 on a 4 core Sun Ultra 40 with 20GB RAM, I am setting
up a Sun StorageTek 2540 with 12 300GB 15K RPM SAS drives and
connected via load-shared 4Gbit FC links.  This week I have tried many
different configurations, using firmware managed RAID, ZFS managed
RAID, and with the controller cache enabled or disabled.

My objective is to obtain the best single-file write performance.
Unfortunately, I am hitting some sort of write bottleneck and I am not
sure how to solve it.  I was hoping for a write speed of 300MB/second.
With ZFS on top of a firmware managed RAID 0 across all 12 drives, I
hit a peak of 200MB/second.  With each drive exported as a LUN and a
ZFS pool of 6 pairs, I see a write rate of 154MB/second.  The number
of drives used has not had much effect on write rate.

Information on my pool is shown at the end of this email.

I am driving the writes using 'iozone' since 'filebench' does not seem
to want to install/work on Solaris 10.

I suspect that the problem is that I am running out of IOPS, since the
drive array indicates an average IOPS of 214 for one drive even though
the peak write speed is only 26MB/second (peak read is 42MB/second).

Can someone share with me what they think the write bottleneck might
be and how I can surmount it?

Thanks,

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

% zpool status
   pool: Sun_2540
  state: ONLINE
  scrub: none requested
config:
         NAME                                       STATE     READ WRITE CKSUM
         Sun_2540                                   ONLINE       0     0     0
           mirror                                   ONLINE       0     0     0
             c4t600A0B80003A8A0B0000096A47B4559Ed0  ONLINE       0     0     0
             c4t600A0B80003A8A0B0000096E47B456DAd0  ONLINE       0     0     0
           mirror                                   ONLINE       0     0     0
             c4t600A0B80003A8A0B0000096147B451BEd0  ONLINE       0     0     0
             c4t600A0B80003A8A0B0000096647B453CEd0  ONLINE       0     0     0
           mirror                                   ONLINE       0     0     0
             c4t600A0B80003A8A0B0000097347B457D4d0  ONLINE       0     0     0
             c4t600A0B800039C9B500000A9C47B4522Dd0  ONLINE       0     0     0
           mirror                                   ONLINE       0     0     0
             c4t600A0B800039C9B500000AA047B4529Bd0  ONLINE       0     0     0
             c4t600A0B800039C9B500000AA447B4544Fd0  ONLINE       0     0     0
           mirror                                   ONLINE       0     0     0
             c4t600A0B800039C9B500000AA847B45605d0  ONLINE       0     0     0
             c4t600A0B800039C9B500000AAC47B45739d0  ONLINE       0     0     0
           mirror                                   ONLINE       0     0     0
             c4t600A0B800039C9B500000AB047B457ADd0  ONLINE       0     0     0
             c4t600A0B800039C9B500000AB447B4595Fd0  ONLINE       0     0     0
errors: No known data errors
freddy:~% zpool iostat
                capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
Sun_2540    64.0G  1.57T    808    861  99.8M   105M
freddy:~% zpool iostat -v
                                            capacity     operations    bandwidth
pool                                     used  avail   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
Sun_2540                                64.0G  1.57T    809    860   100M   105M
   mirror                                10.7G   267G    135    143  16.7M  17.6M
     c4t600A0B80003A8A0B0000096A47B4559Ed0      -      -     66    141  8.37M  17.6M
     c4t600A0B80003A8A0B0000096E47B456DAd0      -      -     67    141  8.37M  17.6M
   mirror                                10.7G   267G    135    143  16.7M  17.6M
     c4t600A0B80003A8A0B0000096147B451BEd0      -      -     66    141  8.37M  17.6M
     c4t600A0B80003A8A0B0000096647B453CEd0      -      -     66    141  8.37M  17.6M
   mirror                                10.7G   267G    134    143  16.7M  17.6M
     c4t600A0B80003A8A0B0000097347B457D4d0      -      -     66    141  8.34M  17.6M
     c4t600A0B800039C9B500000A9C47B4522Dd0      -      -     66    141  8.32M  17.6M
   mirror                                10.7G   267G    134    143  16.6M  17.6M
     c4t600A0B800039C9B500000AA047B4529Bd0      -      -     66    141  8.32M  17.6M
     c4t600A0B800039C9B500000AA447B4544Fd0      -      -     66    141  8.30M  17.6M
   mirror                                10.7G   267G    134    143  16.6M  17.6M
     c4t600A0B800039C9B500000AA847B45605d0      -      -     66    141  8.31M  17.6M
     c4t600A0B800039C9B500000AAC47B45739d0      -      -     66    141  8.30M  17.6M
   mirror                                10.7G   267G    134    143  16.6M  17.6M
     c4t600A0B800039C9B500000AB047B457ADd0      -      -     66    141  8.30M  17.6M
     c4t600A0B800039C9B500000AB447B4595Fd0      -      -     66    141  8.29M  17.6M
--------------------------------------  -----  -----  -----  -----  -----  -----

On 2/14/08, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> My objective is to obtain the best single-file write performance.
> Unfortunately, I am hitting some sort of write bottleneck and I am not
> sure how to solve it.  I was hoping for a write speed of 300MB/second.
> With ZFS on top of a firmware managed RAID 0 across all 12 drives, I
> hit a peak of 200MB/second.  With each drive exported as a LUN and a
> ZFS pool of 6 pairs, I see a write rate of 154MB/second.  The number
> of drives used has not had much effect on write rate.

If you're going for best single file write performance, why are you doing
mirrors of the LUNs?  Perhaps I'm misunderstanding why you went from one
giant raid-0 to what is essentially a raid-10.

--Tim

On Thu, 14 Feb 2008, Tim wrote:
>
> If you're going for best single file write performance, why are you doing
> mirrors of the LUNs?  Perhaps I'm misunderstanding why you went from one
> giant raid-0 to what is essentially a raid-10.

That decision was made because I also need data reliability.

As mentioned before, the write rate peaked at 200MB/second using
RAID-0 across 12 disks exported as one big LUN.  Other firmware-based
methods I tried typically offered about 170MB/second.  Even a four
disk firmware-managed RAID-5 with ZFS on top offered about
165MB/second.  Given that I would like to achieve 300MB/second, a few
tens of MB don't make much difference.  It may be that I bought the
wrong product, but perhaps there is a configuration change which will
help make up some of the difference without sacrificing data
reliability.

Bob

On Fri, Feb 15, 2008 at 2:34 AM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:
> As mentioned before, the write rate peaked at 200MB/second using
> RAID-0 across 12 disks exported as one big LUN.  Other firmware-based
> methods I tried typically offered about 170MB/second.  Even a four
> disk firmware-managed RAID-5 with ZFS on top offered about
> 165MB/second.  Given that I would like to achieve 300MB/second, a few
> tens of MB don't make much difference.

What is the workload for this system?  Benchmarks are fine and good,
but application performance is the determining factor of whether a
system is performing acceptably.

Perhaps iozone is behaving in a bad way; you might investigate
bonnie++: http://www.sunfreeware.com/programlistintel10.html

Will

On Fri, 15 Feb 2008, Will Murnane wrote:
> What is the workload for this system?  Benchmarks are fine and good,
> but application performance is the determining factor of whether a
> system is performing acceptably.

The system is primarily used for image processing where the image data
is uncompressed and a typical file is 12MB.  In some cases the files
will be hundreds of MB or GB.  The typical case is to read a file and
output a new file.  For some very large files, an uncompressed
temporary file is edited in place with random access.

I am the author of the application and need the filesystem to be fast
enough that it will uncover any slowness in my code. :-)

> Perhaps iozone is behaving in a bad way; you might investigate

That is always possible.  Iozone (http://www.iozone.org/) has been
around for a very long time and has seen a lot of improvement by many
smart people so it does not seem very suspect.

> bonnie++: http://www.sunfreeware.com/programlistintel10.html

I will check it out.

Bob

On 15 Feb 08, at 03:34, Bob Friesenhahn wrote:
> On Thu, 14 Feb 2008, Tim wrote:
>>
>> If you're going for best single file write performance, why are you
>> doing mirrors of the LUNs?  Perhaps I'm misunderstanding why you went
>> from one giant raid-0 to what is essentially a raid-10.
>
> That decision was made because I also need data reliability.
>
> As mentioned before, the write rate peaked at 200MB/second using
> RAID-0 across 12 disks exported as one big LUN.

What was the interlace on the LUN?

> Other firmware-based
> methods I tried typically offered about 170MB/second.  Even a four
> disk firmware-managed RAID-5 with ZFS on top offered about
> 165MB/second.  Given that I would like to achieve 300MB/second, a few
> tens of MB don't make much difference.  It may be that I bought the
> wrong product, but perhaps there is a configuration change which will
> help make up some of the difference without sacrificing data
> reliability.

If this is a 165MB/second application rate, consider that ZFS sends that
much to each side of the mirror.  Your data channel rate was 330MB/sec.

-r

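The arithmetic behind that reply can be sketched in a few lines of shell. This is only an illustration: `wire_rate` is a made-up helper name, and it assumes the simple model stated in the reply, namely that a host-side mirror sends every application byte once per mirror side.

```shell
#!/bin/sh
# wire_rate is a hypothetical helper, not part of any ZFS or Solaris tool.
# usage: wire_rate APP_MB_PER_SEC COPIES
# Prints the MB/second that must cross the host-to-array links when the
# host writes COPIES copies of every application byte.
wire_rate() {
    awk -v app="$1" -v copies="$2" 'BEGIN { printf "%d\n", app * copies }'
}

wire_rate 165 2    # the 165MB/second figure discussed above, 2-way mirror
wire_rate 154 2    # the 154MB/second reported for the pool of 6 mirrored pairs
```

The two calls print 330 and 308, so a mirrored pool that the application sees as roughly 150 to 165MB/second is already moving a little over 300MB/second across the FC links.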
On Fri, 15 Feb 2008, Roch Bourbonnais wrote:
>>
>> As mentioned before, the write rate peaked at 200MB/second using
>> RAID-0 across 12 disks exported as one big LUN.
>
> What was the interlace on the LUN?

There are two 4Gbit FC interfaces on an Emulex LPe11002 card which are
supposedly acting in a load-share configuration.

> If this is 165MB application rate consider that ZFS sends that much to each
> side of the mirror.
> Your data channel rate was 330MB/sec.

Yes, I am aware of the ZFS RAID "write penalty" but in fact it has
only cost 20MB per second vs doing the RAID using controller firmware
(150MB vs 170MB/second).  This indicates that there is plenty of
communications bandwidth from the host to the array.  The measured
read rates are in the 470MB to 510MB/second range.

While writing, it is clear that ZFS does not use all of the drives for
writes at once since the drive LEDs show that some remain temporarily
idle and ZFS cycles through them.

I would be very happy to hear from other StorageTek 2540 owners as to
the write rate they were able to achieve.

Bob

On 15 Feb 08, at 18:24, Bob Friesenhahn wrote:
> On Fri, 15 Feb 2008, Roch Bourbonnais wrote:
>>>
>>> As mentioned before, the write rate peaked at 200MB/second using
>>> RAID-0 across 12 disks exported as one big LUN.
>>
>> What was the interlace on the LUN?

The question was about LUN interlace, not interface.
128K to 1M works better.

> There are two 4Gbit FC interfaces on an Emulex LPe11002 card which are
> supposedly acting in a load-share configuration.
>
>> If this is 165MB application rate consider that ZFS sends that much
>> to each side of the mirror.
>> Your data channel rate was 330MB/sec.
>
> Yes, I am aware of the ZFS RAID "write penalty" but in fact it has
> only cost 20MB per second vs doing the RAID using controller firmware
> (150MB vs 170MB/second).  This indicates that there is plenty of
> communications bandwidth from the host to the array.  The measured
> read rates are in the 470MB to 510MB/second range.

Any compression?

Does turning off checksums help the numbers?  (That would point to
CPU-limited throughput.)

-r

On Fri, 15 Feb 2008, Roch Bourbonnais wrote:
>>> What was the interlace on the LUN?
>
> The question was about LUN interlace not interface.
> 128K to 1M works better.

The "segment size" is set to 128K.  The max the 2540 allows is 512K.
Unfortunately, the StorageTek 2540 and CAM documentation does not
really define what "segment size" means.

> Any compression?

Compression is disabled.

> Does turning off checksums help the numbers (that would point to a CPU
> limited throughput).

I have not tried that but this system is loafing during the benchmark.
It has four 3GHz Opteron cores.

Does this output from 'iostat -xnz 20' help to understand issues?

                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    3.0    0.7   26.4     3.5  0.0  0.0    0.0    4.2   0   2 c1t1d0
    0.0  154.2    0.0 19680.3  0.0 20.7    0.0  134.2   0  59 c4t600A0B80003A8A0B0000096147B451BEd0
    0.0  211.5    0.0 26940.5  1.1 33.9    5.0  160.5  99 100 c4t600A0B800039C9B500000A9C47B4522Dd0
    0.0  211.5    0.0 26940.6  1.1 33.9    5.0  160.4  99 100 c4t600A0B800039C9B500000AA047B4529Bd0
    0.0  154.0    0.0 19654.7  0.0 20.7    0.0  134.2   0  59 c4t600A0B80003A8A0B0000096647B453CEd0
    0.0  211.3    0.0 26915.0  1.1 33.9    5.0  160.5  99 100 c4t600A0B800039C9B500000AA447B4544Fd0
    0.0  152.4    0.0 19447.0  0.0 20.5    0.0  134.5   0  59 c4t600A0B80003A8A0B0000096A47B4559Ed0
    0.0  213.2    0.0 27183.8  0.9 34.1    4.2  159.9  90 100 c4t600A0B800039C9B500000AA847B45605d0
    0.0  152.5    0.0 19453.4  0.0 20.5    0.0  134.5   0  59 c4t600A0B80003A8A0B0000096E47B456DAd0
    0.0  213.2    0.0 27177.4  0.9 34.1    4.2  159.9  90 100 c4t600A0B800039C9B500000AAC47B45739d0
    0.0  213.2    0.0 27195.3  0.9 34.1    4.2  159.9  90 100 c4t600A0B800039C9B500000AB047B457ADd0
    0.0  154.4    0.0 19711.8  0.0 20.7    0.0  134.0   0  59 c4t600A0B80003A8A0B0000097347B457D4d0
    0.0  211.3    0.0 26958.6  1.1 33.9    5.0  160.6  99 100 c4t600A0B800039C9B500000AB447B4595Fd0

Bob

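In the iostat output above the LUNs fall into two groups: names beginning c4t600A0B80003A8A0B show about 154 writes/second at 59% busy, while names beginning c4t600A0B800039C9B5 show about 212 writes/second at 100% busy. A small sketch for totalling each group follows. The function name `by_array_prefix` is made up, and treating the first 16 hex digits of the LUN name as a meaningful grouping key (for example, one value per array controller) is an assumption that the thread has not confirmed at this point.

```shell
#!/bin/sh
# by_array_prefix is a hypothetical helper. It reads 'iostat -xn' style
# rows (r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device) from the
# named files or stdin and groups the array LUNs by the 16 hex digits
# that follow "c4t" in the device name.
by_array_prefix() {
    awk '
    $NF ~ /^c[0-9]+t600A0B80/ {
        key = substr($NF, 4, 16)   # e.g. 600A0B80003A8A0B
        n[key]++                   # number of LUNs in this group
        kw[key] += $4              # kw/s column
        busy[key] += $10           # %b column
    }
    END {
        for (k in n)
            printf "%s luns=%d write_MBps=%.1f avg_busy=%.0f%%\n",
                   k, n[k], kw[k] / 1024, busy[k] / n[k]
    }' "$@" | LC_ALL=C sort
}

# Example use on a saved sample (the file name is illustrative only):
#   by_array_prefix iostat-sample.txt
```

Rows such as the internal disk c1t1d0 are ignored because they do not match the array's LUN naming pattern.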
Hi Bob,

I'm assuming you're measuring sequential write speed - posting the iozone
results would help guide the discussion.

For the configuration you describe, you should definitely be able to
sustain 200 MB/s write speed for a single file, single thread due to your
use of 4Gbps Fibre Channel interfaces and RAID1.  Someone else brought up
that with host based mirroring over that interface you will be sending the
data twice over the FC-AL link, so since you only have 400 MB/s on the
FC-AL interface (load balancing will only work for two writes), then you
have to divide that by two.

If you do the mirroring on the RAID hardware you'll get double that speed
on writing, or 400MB/s and the bottleneck is still the single FC-AL
interface.

By comparison, we get 750 MB/s sequential read using six 15K RPM 300GB
disks on an Adaptec (Sun OEM) in-host SAS RAID adapter in RAID10 on four
streams and I think I saw 350 MB/s write speed on one stream.  Each disk is
capable of 130 MB/s of read and write speed.

- Luke

On Fri, 15 Feb 2008, Luke Lonergan wrote:
> I'm assuming you're measuring sequential write speed - posting the iozone
> results would help guide the discussion.

Posted below.  I am also including the output from mpathadm in case
there is something wrong with the load sharing.

> For the configuration you describe, you should definitely be able to sustain
> 200 MB/s write speed for a single file, single thread due to your use of
> 4Gbps Fibre Channel interfaces and RAID1.  Someone else brought up that with

I only managed to get 200 MB/s write when I did RAID 0 across all
drives using the 2540's RAID controller and with ZFS on top.

> host based mirroring over that interface you will be sending the data twice
> over the FC-AL link, so since you only have 400 MB/s on the FC-AL interface
> (load balancing will only work for two writes), then you have to divide that
> by two.

While I agree that data is sent twice (actually up to 8X if striping
across four mirrors), it seems to me that the load balancing should
still work for one application write since ZFS is what does the
multiple device I/Os.

> If you do the mirroring on the RAID hardware you'll get double that speed on
> writing, or 400MB/s and the bottleneck is still the single FC-AL interface.

I didn't see that level of performance.  Perhaps there is something I
should be investigating?

Bob

Output of 'mpathadm list lu':

        /scsi_vhci/disk@g600a0b800039c9b50000000000000000
                Total Path Count: 1
                Operational Path Count: 1
        /scsi_vhci/disk@g600a0b80003a8a0b0000000000000000
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c4t600A0B80003A8A0B0000096147B451BEd0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t600A0B800039C9B500000A9C47B4522Dd0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t600A0B800039C9B500000AA047B4529Bd0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t600A0B80003A8A0B0000096647B453CEd0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t600A0B800039C9B500000AA447B4544Fd0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t600A0B80003A8A0B0000096A47B4559Ed0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t600A0B800039C9B500000AA847B45605d0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t600A0B80003A8A0B0000096E47B456DAd0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t600A0B800039C9B500000AAC47B45739d0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t600A0B800039C9B500000AB047B457ADd0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t600A0B80003A8A0B0000097347B457D4d0s2
                Total Path Count: 2
                Operational Path Count: 2
        /dev/rdsk/c4t600A0B800039C9B500000AB447B4595Fd0s2
                Total Path Count: 2
                Operational Path Count: 2

Output of 'mpathadm show lu /dev/rdsk/c4t600A0B800039C9B500000AB047B457ADd0s2':

Logical Unit:  /dev/rdsk/c4t600A0B800039C9B500000AB047B457ADd0s2
        mpath-support:  libmpscsi_vhci.so
        Vendor:  SUN
        Product:  LCSM100_F
        Revision:  0617
        Name Type:  unknown type
        Name:  600a0b800039c9b500000ab047b457ad
        Asymmetric:  yes
        Current Load Balance:  round-robin
        Logical Unit Group ID:  NA
        Auto Failback:  on
        Auto Probing:  NA

        Paths:
                Initiator Port Name:  10000000c967c830
                Target Port Name:  200400a0b83a8a0c
                Override Path:  NA
                Path State:  OK
                Disabled:  no

                Initiator Port Name:  10000000c967c82f
                Target Port Name:  200500a0b83a8a0c
                Override Path:  NA
                Path State:  OK
                Disabled:  no

        Target Port Groups:
                ID:  4
                Explicit Failover:  yes
                Access State:  standby
                Target Ports:
                        Name:  200400a0b83a8a0c
                        Relative ID:  0

                ID:  1
                Explicit Failover:  yes
                Access State:  active
                Target Ports:
                        Name:  200500a0b83a8a0c
                        Relative ID:  0

Performance test run using iozone:

        Iozone: Performance Test of File I/O
                Version $Revision: 3.283 $
                Compiled for 64 bit mode.
                Build: Solaris10gcc-64

        Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
                     Al Slater, Scott Rhine, Mike Wisner, Ken Goss
                     Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                     Randy Dunlap, Mark Montague, Dan Million,
                     Jean-Marc Zucconi, Jeff Blomberg,
                     Benny Halevy, Erik Habbinga, Kris Strecker, Walter Wong.

        Run began: Thu Feb 14 16:35:51 2008

        Auto Mode
        Using Minimum Record Size 64 KB
        Using Maximum Record Size 512 KB
        Using minimum file size of 33554432 kilobytes.
        Using maximum file size of 67108864 kilobytes.
        Command line used: iozone -a -i 0 -i 1 -y 64 -q 512 -n 32G -g 64G
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                          random  random    bkwd  record  stride
              KB  reclen   write rewrite    read  reread    read   write    read rewrite    read  fwrite frewrite   fread freread
        33554432      64  150370  113779  454731  456158
        33554432     128  147032  181308  455496  456239
        33554432     256  148182  169944  454192  456252
        33554432     512  153843  194189  473982  516130
        67108864      64  151047  111227  463406  456302
        67108864     128  148597  159236  456959  488100
        67108864     256  148995  165041  463519  453896
        67108864     512  154556  166802  458304  456833

iozone test complete.

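Because iozone reports Kbytes/sec while the rest of the thread talks in MB/second, a small conversion helper makes the results table easier to compare against the 300MB/second goal. This is a sketch only: `iozone_mb` is a made-up name, it assumes the six-column result rows shown above (KB, reclen, write, rewrite, read, reread), and it takes 1 MB as 1024 KB.

```shell
#!/bin/sh
# iozone_mb is a hypothetical helper. It reads iozone result rows from the
# named files or stdin and prints file size in GB, record length in KB,
# and the write and read rates converted from KB/s to MB/s (1 MB = 1024 KB).
# Lines whose first two fields are not plain integers are skipped.
iozone_mb() {
    awk 'NF >= 6 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
        printf "%dG reclen=%dK write=%.1f read=%.1f\n",
               $1 / 1048576, $2, $3 / 1024, $5 / 1024
    }' "$@"
}

echo '33554432 64 150370 113779 454731 456158' | iozone_mb
```

For the first row of the table this prints `32G reclen=64K write=146.8 read=444.1`, which is the roughly 150MB/second write rate described earlier in the thread.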
On Fri, Feb 15, 2008 at 12:30 AM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:
> My objective is to obtain the best single-file write performance.
> Unfortunately, I am hitting some sort of write bottleneck and I am not
> sure how to solve it.  I was hoping for a write speed of 300MB/second.
> With ZFS on top of a firmware managed RAID 0 across all 12 drives, I
> hit a peak of 200MB/second.  With each drive exported as a LUN and a
> ZFS pool of 6 pairs, I see a write rate of 154MB/second.  The number
> of drives used has not had much effect on write rate.

May not be relevant, but still worth checking - I have a 2530 (which ought
to be the same, only SAS instead of FC), and got fairly poor performance
at first.  Things improved significantly when I got the LUNs properly
balanced across the controllers.

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/

On Fri, 15 Feb 2008, Peter Tribble wrote:
>
> May not be relevant, but still worth checking - I have a 2530 (which ought
> to be the same, only SAS instead of FC), and got fairly poor performance
> at first.  Things improved significantly when I got the LUNs properly
> balanced across the controllers.

What do you mean by "properly balanced across the controllers"?  Are
you using the multipath support in Solaris 10 or are you relying on
ZFS to balance the I/O load?  Do some disks have more affinity for a
controller than the other?

With the 2540, there is a FC connection to each redundant controller.
The Solaris 10 multipathing presumably load-shares the I/O to each
controller.  The controllers then perform some sort of magic to get
the data to and from the SAS drives.

The controller stats are below.  I notice that it seems that
controller B has seen a bit more activity than controller A but the
firmware does not provide a controller uptime value so it is possible
that one controller was up longer than another:

Performance Statistics - A on Storage System Array-1
  Timestamp:               Fri Feb 15 14:37:39 CST 2008
  Total IOPS:              1098.83
  Average IOPS:            355.83
  Read %:                  38.28
  Write %:                 61.71
  Total Data Transferred:  139284.41 KBps
  Read:                    53844.26 KBps
  Average Read:            17224.04 KBps
  Peak Read:               242232.70 KBps
  Written:                 85440.15 KBps
  Average Written:         26966.58 KBps
  Peak Written:            139918.90 KBps
  Average Read Size:       639.96 KB
  Average Write Size:      629.94 KB
  Cache Hit %:             85.32

Performance Statistics - B on Storage System Array-1
  Timestamp:               Fri Feb 15 14:37:45 CST 2008
  Total IOPS:              1526.69
  Average IOPS:            497.32
  Read %:                  34.90
  Write %:                 65.09
  Total Data Transferred:  193594.58 KBps
  Read:                    68200.00 KBps
  Average Read:            24052.61 KBps
  Peak Read:               339693.55 KBps
  Written:                 125394.58 KBps
  Average Written:         37768.40 KBps
  Peak Written:            183534.66 KBps
  Average Read Size:       895.80 KB
  Average Write Size:      883.38 KB
  Cache Hit %:             75.05

If I then go to the performance stats on an individual disk, I see

Performance Statistics - Disk-08 on Storage System Array-1
  Timestamp:               Fri Feb 15 14:43:36 CST 2008
  Total IOPS:              196.33
  Average IOPS:            72.01
  Read %:                  9.65
  Write %:                 90.34
  Total Data Transferred:  25076.91 KBps
  Read:                    2414.11 KBps
  Average Read:            3521.44 KBps
  Peak Read:               48422.00 KBps
  Written:                 22662.79 KBps
  Average Written:         5423.78 KBps
  Peak Written:            28036.43 KBps
  Average Read Size:       127.29 KB
  Average Write Size:      127.77 KB
  Cache Hit %:             89.30

Bob

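One way to relate the IOPS and throughput figures in these statistics is to divide "Total Data Transferred" by "Total IOPS", which gives the average size of one I/O. The sketch below does only that; `avg_io_kb` is a made-up helper name, and it assumes both figures cover the same sampling interval, which the array's report does not state.

```shell
#!/bin/sh
# avg_io_kb is a hypothetical helper.
# usage: avg_io_kb TOTAL_KB_PER_SEC TOTAL_IOPS
# Prints the average I/O size in KB, or fails if TOTAL_IOPS is zero.
avg_io_kb() {
    awk -v kbps="$1" -v iops="$2" 'BEGIN {
        if (iops + 0 == 0) exit 1
        printf "%.1f\n", kbps / iops
    }'
}

avg_io_kb 139284.41 1098.83   # controller A
avg_io_kb 193594.58 1526.69   # controller B
avg_io_kb 25076.91 196.33     # Disk-08
```

The three calls print 126.8, 126.8 and 127.7. That is close to the 127.77 KB "Average Write Size" the array reports for Disk-08 and to the 128K segment size mentioned earlier in the thread, so at these request sizes a disk doing about 200 IOPS is moving only about 25MB/second.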
On Fri, Feb 15, 2008 at 8:50 PM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:
> On Fri, 15 Feb 2008, Peter Tribble wrote:
> >
> > May not be relevant, but still worth checking - I have a 2530 (which ought
> > to be that same only SAS instead of FC), and got fairly poor performance
> > at first. Things improved significantly when I got the LUNs properly
> > balanced across the controllers.
>
> What do you mean by "properly balanced across the controllers"?  Are
> you using the multipath support in Solaris 10 or are you relying on
> ZFS to balance the I/O load?  Do some disks have more affinity for a
> controller than the other?

Each LUN is accessed through only one of the controllers (I presume the
2540 works the same way as the 2530 and 61X0 arrays). The paths are
active/passive (if the active fails it will relocate to the other path).
When I set mine up the first time it allocated all the LUNs to controller B
and performance was terrible. I then manually transferred half the LUNs
to controller A and it started to fly.

I'm using SAS multipathing for failover and just get ZFS to dynamically
stripe across the LUNs.

Your figures show asymmetry, but that may just be a reflection of the
setup where you just created a single raid-0 LUN which would only use
one path.

(I don't really understand any of this stuff. Too much fiddling around
for my liking.)

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
On Fri, 15 Feb 2008, Peter Tribble wrote:
> Each LUN is accessed through only one of the controllers (I presume the
> 2540 works the same way as the 2530 and 61X0 arrays). The paths are
> active/passive (if the active fails it will relocate to the other path).
> When I set mine up the first time it allocated all the LUNs to controller B
> and performance was terrible. I then manually transferred half the LUNs
> to controller A and it started to fly.

I assume that you either altered the "Access State" shown for the LUN
in the output of 'mpathadm show lu DEVICE' or you noticed and
observed the pattern:

    Target Port Groups:
        ID:  3
        Explicit Failover:  yes
        Access State:  active
        Target Ports:
            Name:  200400a0b83a8a0c
            Relative ID:  0

        ID:  2
        Explicit Failover:  yes
        Access State:  standby
        Target Ports:
            Name:  200500a0b83a8a0c
            Relative ID:  0

I find this all very interesting and illuminating:

    for dev in c4t600A0B80003A8A0B0000096A47B4559Ed0 \
               c4t600A0B80003A8A0B0000096E47B456DAd0 \
               c4t600A0B80003A8A0B0000096147B451BEd0 \
               c4t600A0B80003A8A0B0000096647B453CEd0 \
               c4t600A0B80003A8A0B0000097347B457D4d0 \
               c4t600A0B800039C9B500000A9C47B4522Dd0 \
               c4t600A0B800039C9B500000AA047B4529Bd0 \
               c4t600A0B800039C9B500000AA447B4544Fd0 \
               c4t600A0B800039C9B500000AA847B45605d0 \
               c4t600A0B800039C9B500000AAC47B45739d0 \
               c4t600A0B800039C9B500000AB047B457ADd0 \
               c4t600A0B800039C9B500000AB447B4595Fd0
    do
        echo "=== $dev ==="
        mpathadm show lu /dev/rdsk/$dev | grep 'Access State'
    done

    === c4t600A0B80003A8A0B0000096A47B4559Ed0 ==
        Access State:  active
        Access State:  standby
    === c4t600A0B80003A8A0B0000096E47B456DAd0 ==
        Access State:  active
        Access State:  standby
    === c4t600A0B80003A8A0B0000096147B451BEd0 ==
        Access State:  active
        Access State:  standby
    === c4t600A0B80003A8A0B0000096647B453CEd0 ==
        Access State:  active
        Access State:  standby
    === c4t600A0B80003A8A0B0000097347B457D4d0 ==
        Access State:  active
        Access State:  standby
    === c4t600A0B800039C9B500000A9C47B4522Dd0 ==
        Access State:  active
        Access State:  standby
    === c4t600A0B800039C9B500000AA047B4529Bd0 ==
        Access State:  standby
        Access State:  active
    === c4t600A0B800039C9B500000AA447B4544Fd0 ==
        Access State:  standby
        Access State:  active
    === c4t600A0B800039C9B500000AA847B45605d0 ==
        Access State:  standby
        Access State:  active
    === c4t600A0B800039C9B500000AAC47B45739d0 ==
        Access State:  standby
        Access State:  active
    === c4t600A0B800039C9B500000AB047B457ADd0 ==
        Access State:  standby
        Access State:  active
    === c4t600A0B800039C9B500000AB447B4595Fd0 ==
        Access State:  standby
        Access State:  active

Notice that the first six LUNs are active to one controller while the
second six LUNs are active to the other controller.  Based on this, I
should rebuild my pool by splitting my mirrors across this boundary.

I am really happy that ZFS makes such things easy to try out.

Bob
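The idea of splitting each mirror across the controller boundary can be sketched in a few lines. This is only an illustration, assuming the 'mpathadm show lu' output has the shape of the excerpt quoted in the message above; the function names active_target_ports and mirror_pairs are made up for this sketch and are not part of any Solaris tool.

```python
# Sketch only: the function names are hypothetical, and the sample text is the
# "Target Port Groups" excerpt quoted in the message above.

SAMPLE = """\
Target Port Groups:
    ID:  3
    Explicit Failover:  yes
    Access State:  active
    Target Ports:
        Name:  200400a0b83a8a0c
        Relative ID:  0
    ID:  2
    Explicit Failover:  yes
    Access State:  standby
    Target Ports:
        Name:  200500a0b83a8a0c
        Relative ID:  0
"""


def active_target_ports(mpathadm_output):
    """Return target port names that sit in a port group whose state is 'active'."""
    ports = []
    in_groups = False   # ignore any "Name:" lines before the port group section
    state = None
    for raw in mpathadm_output.splitlines():
        line = raw.strip()
        if line.startswith("Target Port Groups:"):
            in_groups = True
        elif not in_groups:
            continue
        elif line.startswith("Access State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Name:") and state == "active":
            ports.append(line.split(":", 1)[1].strip())
    return ports


def mirror_pairs(active_port_by_lun):
    """Pair each LUN owned by one controller with a LUN owned by the other."""
    by_port = {}
    for lun, port in active_port_by_lun.items():
        by_port.setdefault(port, []).append(lun)
    if len(by_port) != 2:
        raise ValueError("expected LUNs to be split across exactly two controllers")
    side_a, side_b = by_port.values()
    # zip() stops at the shorter side, so unmatched LUNs are simply left out.
    return list(zip(side_a, side_b))


print(active_target_ports(SAMPLE))
```

Each pair returned by mirror_pairs would become one 'mirror' vdev, so every mirror has one side on each controller.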
On Fri, 15 Feb 2008, Bob Friesenhahn wrote:
>
> Notice that the first six LUNs are active to one controller while the
> second six LUNs are active to the other controller.  Based on this, I
> should rebuild my pool by splitting my mirrors across this boundary.
>
> I am really happy that ZFS makes such things easy to try out.

Now that I have tried this out, I can unhappily say that it made no
measurable difference to actual performance.  However it seems like a
better layout anyway.

Bob
On Fri, Feb 15, 2008 at 09:00:05PM +0000, Peter Tribble wrote:
> On Fri, Feb 15, 2008 at 8:50 PM, Bob Friesenhahn
> <bfriesen at simple.dallas.tx.us> wrote:
> > On Fri, 15 Feb 2008, Peter Tribble wrote:
> > >
> > > May not be relevant, but still worth checking - I have a 2530 (which ought
> > > to be that same only SAS instead of FC), and got fairly poor performance
> > > at first. Things improved significantly when I got the LUNs properly
> > > balanced across the controllers.
> >
> > What do you mean by "properly balanced across the controllers"?  Are
> > you using the multipath support in Solaris 10 or are you relying on
> > ZFS to balance the I/O load?  Do some disks have more affinity for a
> > controller than the other?
>
> Each LUN is accessed through only one of the controllers (I presume the
> 2540 works the same way as the 2530 and 61X0 arrays). The paths are
> active/passive (if the active fails it will relocate to the other path).
> When I set mine up the first time it allocated all the LUNs to controller B
> and performance was terrible. I then manually transferred half the LUNs
> to controller A and it started to fly.

http://groups.google.com/group/comp.unix.solaris/browse_frm/thread/59b43034602a7b7f/0b500afc4d62d434?lnk=st&q=#0b500afc4d62d434

-- 
albert chin (china at thewrittenword.com)
Hi Bob,

On 2/15/08 12:13 PM, "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us> wrote:

> I only managed to get 200 MB/s write when I did RAID 0 across all
> drives using the 2540's RAID controller and with ZFS on top.

Ridiculously bad.  You should max out both FC-AL links and get 800 MB/s.

> While I agree that data is sent twice (actually up to 8X if striping
> across four mirrors)

Still only twice the data that would otherwise be sent, in other words: the
mirroring causes a duplicate set of data to be written.

> it seems to me that the load balancing should
> still work for one application write since ZFS is what does the
> multiple device I/Os.

Depends on how the LUNs are used within the pool, but yes that's what you
should expect in which case you should get 400 MB/s writes on one file using
RAID10.

>> If you do the mirroring on the RAID hardware you'll get double that speed on
>> writing, or 400MB/s and the bottleneck is still the single FC-AL interface.
>
> I didn't see that level of performance.  Perhaps there is something I
> should be investigating?

Yes, if it weren't for the slow FC-AL in your data path you should be able to
sustain 20 x 130 MB/s = 2,600 MB/s based on the drive speeds.  Given that
you're not even saturating the FC-AL links, the problem is in the hardware
RAID.  I suggest disabling read and write caching in the hardware RAID.

- Luke
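The link arithmetic in the message above can be written out as a small sketch. It assumes roughly 400 MB/s of usable bandwidth per 4 Gbit FC link (the figure implied by "both links = 800 MB/s") and that a ZFS mirror sends every application byte over the links twice; the function names are made up for illustration.

```python
# Sketch only: 400 MB/s per 4 Gbit FC link is the nominal figure implied by the
# "800 MB/s over both links" estimate above, not a measured value.
FC_LINK_MB_PER_S = 400


def max_app_write_mb_per_s(links, copies, link_rate=FC_LINK_MB_PER_S):
    """Upper bound on application write rate when each byte crosses the links `copies` times."""
    return links * link_rate / copies


def link_utilisation(app_write_mb_per_s, links, copies, link_rate=FC_LINK_MB_PER_S):
    """Fraction of total link bandwidth used at a given application write rate."""
    return app_write_mb_per_s * copies / (links * link_rate)


print(max_app_write_mb_per_s(links=2, copies=1))   # striping only, no mirror
print(max_app_write_mb_per_s(links=2, copies=2))   # host-side (ZFS) mirroring
print(link_utilisation(154, links=2, copies=2))    # the 154 MB/s mirrored-pool result
```

Under these assumptions the 154 MB/s mirrored pool uses well under half of the two links, which is consistent with the point that the links themselves are not saturated.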
On Fri, 15 Feb 2008, Luke Lonergan wrote:
>> I only managed to get 200 MB/s write when I did RAID 0 across all
>> drives using the 2540's RAID controller and with ZFS on top.
>
> Ridiculously bad.

I agree. :-(

>> While I agree that data is sent twice (actually up to 8X if striping
>> across four mirrors)
>
> Still only twice the data that would otherwise be sent, in other words: the
> mirroring causes a duplicate set of data to be written.

Right.  But more little bits of data to be sent due to ZFS striping.

> Given that you're not even saturating the FC-AL links, the problem is in the
> hardware RAID.  I suggest disabling read and write caching in the hardware
> RAID.

Hardware RAID is not an issue in this case since each disk is exported
as a LUN.  Performance with ZFS is not much different than when
hardware RAID was used.  I previously tried disabling caching in the
hardware and it did not make a difference in the results.

Bob
Bob Friesenhahn wrote:
> On Fri, 15 Feb 2008, Luke Lonergan wrote:
>
>>> I only managed to get 200 MB/s write when I did RAID 0 across all
>>> drives using the 2540's RAID controller and with ZFS on top.
>>>
>> Ridiculously bad.
>>
>
> I agree. :-(
>
>>> While I agree that data is sent twice (actually up to 8X if striping
>>> across four mirrors)
>>>
>> Still only twice the data that would otherwise be sent, in other words: the
>> mirroring causes a duplicate set of data to be written.
>>
>
> Right.  But more little bits of data to be sent due to ZFS striping.
>

These "little bits" should be 128kBytes by default, which should
be plenty to saturate the paths.  There seems to be something else
going on here...

from the iostat data:

                    extended device statistics
    r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
...
    0.0  211.5    0.0  26940.5  1.1 33.9    5.0  160.5  99 100 c4t600A0B800039C9B500000A9C47B4522Dd0
    0.0  211.5    0.0  26940.6  1.1 33.9    5.0  160.4  99 100 c4t600A0B800039C9B500000AA047B4529Bd0
    0.0  154.0    0.0  19654.7  0.0 20.7    0.0  134.2   0  59 c4t600A0B80003A8A0B0000096647B453CEd0
...

shows that we have an average of 33.9 iops of 128kBytes each queued
to the storage device at a given time.  There is an iop queued to the
storage device at all times (100% busy).  The 59% busy device
might not always be 59% busy, but it is difficult to see from this output
because you used the "z" flag.  Looks to me like ZFS is keeping the
queues full, and the device is slow to service them (asvc_t).  This
is surprising, to a degree, because we would expect faster throughput
to a nonvolatile write cache.

It would be interesting to see the response for a stable idle system,
start the workload, see the fast response as we hit the write cache,
followed by the slowdown as we fill the write cache.  This sort of
experiment is usually easy to create.

 -- richard
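The figures read off the iostat rows above can be cross-checked with two lines of arithmetic: average write size is kw/s divided by w/s, and by Little's law the number of I/Os in service (actv) should be close to the I/O rate times the average service time (asvc_t). The sketch below uses only numbers from the quoted rows; the function names are made up for illustration.

```python
# Sketch only: the inputs are the w/s, kw/s, actv and asvc_t columns from the
# iostat rows quoted above.

def avg_write_kb(kw_per_s, w_per_s):
    """Average size of one write in KB."""
    return kw_per_s / w_per_s


def expected_queue_depth(w_per_s, asvc_ms):
    """Little's law: I/Os in service = arrival rate x time each one spends in service."""
    return w_per_s * asvc_ms / 1000.0


rows = [
    # label, w/s, kw/s, actv, asvc_t (ms)
    ("100% busy LUN", 211.5, 26940.5, 33.9, 160.5),
    ("59% busy LUN", 154.0, 19654.7, 20.7, 134.2),
]

for label, w_s, kw_s, actv, asvc_t in rows:
    print(label,
          "avg write KB:", round(avg_write_kb(kw_s, w_s), 1),
          "predicted actv:", round(expected_queue_depth(w_s, asvc_t), 1),
          "reported actv:", actv)
```

Both rows come out at roughly 127 to 128 KB per write, matching the 128 KB default mentioned above, and the predicted queue depths agree with the reported actv column to within rounding.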
On Fri, 15 Feb 2008, Albert Chin wrote:
>
> http://groups.google.com/group/comp.unix.solaris/browse_frm/thread/59b43034602a7b7f/0b500afc4d62d434?lnk=st&q=#0b500afc4d62d434

This is really discouraging.  Based on these newsgroup postings I am
thinking that the Sun StorageTek 2540 was not a good investment for
me, especially given that the $23K for it came right out of my own
paycheck and it took me 6 months of frustration (first shipment was
damaged) to receive it.  Regardless, this was the best I was able to
afford unless I built the drive array myself.

The page at
http://www.sun.com/storagetek/disk_systems/workgroup/2540/benchmarks.jsp
claims "546.22 MBPS" for the large file processing benchmark.  So I go
to look at the actual SPC2 full disclosure report and see that for one
stream, the average data rate is 105MB/second (compared with
102MB/second with RAID-5), and rises to 284MB/second with 10 streams.
The product obviously performs much better for reads than it does for
writes and is better for multi-user performance than single-user.

It seems like I am getting a good bit more performance from my own
setup than what the official benchmark suggests (they used 72GB
drives, with 24 drives total) so it seems that everything is working
fine.

This is a lesson for me, and I have certainly learned a fair amount
about drive arrays, fiber channel, and ZFS, in the process.

Bob
The segment size is the amount of contiguous space that each drive
contributes to a single stripe.

So if you have a 5 drive RAID-5 set @ 128k segment size, a single
stripe = (5-1)*128k = 512k

BTW, did you tweak the cache sync handling on the array?

-Joel

This message posted from opensolaris.org
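The stripe formula in the message above generalises to other layouts discussed in this thread. A minimal sketch, assuming one parity segment per stripe for RAID-5 and none for RAID-0; the function name is made up for illustration.

```python
# Sketch only: full data stripe = (drives - parity drives) x segment size,
# as in the (5-1)*128k example above.

def full_stripe_kb(drives, segment_kb, parity_drives=1):
    """Data capacity of one full stripe, in KB."""
    if drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return (drives - parity_drives) * segment_kb


print(full_stripe_kb(5, 128))                     # 5-drive RAID-5, 128K segments
print(full_stripe_kb(6, 128))                     # a 5 data + 1 parity RAID-5 LUN
print(full_stripe_kb(12, 128, parity_drives=0))   # 12-drive RAID-0
```

With 128K segments, each segment holds exactly one default-sized 128K ZFS block, so a 5-drive RAID-5 stripe is filled by four such blocks.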
On Feb 15, 2008 10:20 PM, Luke Lonergan <llonergan at greenplum.com> wrote:
> Hi Bob,
>
> On 2/15/08 12:13 PM, "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us> wrote:
>
> > I only managed to get 200 MB/s write when I did RAID 0 across all
> > drives using the 2540's RAID controller and with ZFS on top.
>
> Ridiculously bad.

Agreed. My 2530 gives me about 450MB/s on writes and 800 on reads.
That's zfs striped across 4 LUNs, each of which is hardware raid-5
(24 drives in total, so each raid-5 LUN is 5 data + 1 parity).

What matters to me is that this is higher than the network bandwidth
into the server, and more bandwidth than the users can make use of
at the moment.

-- 
-Peter Tribble
Hi Tim;

The 2540 controller can achieve a maximum of 250 MB/sec on writes on the
first 12 drives, so you are pretty close to maximum throughput already.
RAID 5 can be a little bit slower.

Please try to distribute the LUNs between the controllers and try to
benchmark with cache mirroring disabled (this is different from disabling
the cache).

Best regards
Mertol

Mertol Ozyoney 
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at Sun.COM

From: zfs-discuss-bounces at opensolaris.org
[mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Tim
Sent: Friday, 15 February 2008 03:13
To: Bob Friesenhahn
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Performance with Sun StorageTek 2540
 
If you're going for best single file write performance, why are you doing
mirrors of the LUNs?  Perhaps I'm misunderstanding why you went from one
giant raid-0 to what is essentially a raid-10.

--Tim
Bob,

Here is how you can tell the array to ignore cache sync commands and
the force unit access bits...(Sorry if it wraps..)

On a Solaris CAM install, the 'service' command is in "/opt/SUNWsefms/bin"

To read the current settings:

    service -d arrayname -c read -q nvsram region=0xf2 host=0x00

save this output so you can reverse the changes below easily if needed...

To set new values:

    service -d arrayname -c set -q nvsram region=0xf2 offset=0x17 value=0x01 host=0x00
    service -d arrayname -c set -q nvsram region=0xf2 offset=0x18 value=0x01 host=0x00
    service -d arrayname -c set -q nvsram region=0xf2 offset=0x21 value=0x01 host=0x00

Host region 00 is Solaris (w/Traffic Manager)

You will need to reboot both controllers after making the change before
it becomes active.

-Joel
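For anyone scripting the steps above, the following dry-run sketch only builds and prints the command lines; it does not run them. It uses nothing beyond the path, flags, region, offsets and host value given in the message above. The array name "arrayname" is the same placeholder used there, and the function names are made up for illustration.

```python
# Sketch only: prints the 'service' invocations from the message above without
# executing anything. "arrayname" is a placeholder, as in the original message.

SERVICE = "/opt/SUNWsefms/bin/service"
REGION = "region=0xf2"
HOST = "host=0x00"             # host region 00: Solaris (w/Traffic Manager)
OFFSETS = (0x17, 0x18, 0x21)   # the three offsets listed in the message


def read_command(array):
    """Command that dumps the current NVSRAM settings (save its output first)."""
    return [SERVICE, "-d", array, "-c", "read", "-q", "nvsram", REGION, HOST]


def set_commands(array, value=0x01):
    """One 'set' command per offset, in the order given in the message."""
    return [
        [SERVICE, "-d", array, "-c", "set", "-q", "nvsram",
         REGION, f"offset={offset:#04x}", f"value={value:#04x}", HOST]
        for offset in OFFSETS
    ]


print(" ".join(read_command("arrayname")))
for command in set_commands("arrayname"):
    print(" ".join(command))
```

Keeping the read command's output alongside the printed set commands makes it straightforward to restore the original values later.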
On Sat, 16 Feb 2008, Peter Tribble wrote:
> Agreed. My 2530 gives me about 450MB/s on writes and 800 on reads.
> That's zfs striped across 4 LUNs, each of which is hardware raid-5
> (24 drives in total, so each raid-5 LUN is 5 data + 1 parity).

Is this single-file bandwidth or multiple-file/thread bandwidth?
According to Sun's own benchmark data, the 2530 was capable of
20MB/second more than the 2540 on writes for a single large file, and
the difference went away after that.  For multi-user activity the
throughput clearly improves to be similar to what you describe.  Most
people are likely interested in maximizing multi-user performance, and
particularly for reads.

Visit
http://www.storageperformance.org/results/benchmark_results_spc2/#sun_spc2
to see the various benchmark results.  According to these results, for
large-file writes the 2530/2540 compares well with other StorageTek
products, including the more expensive 6140 and 6540 arrays.  It also
compares well with similarly-sized storage products from other
vendors.

Bob
On Sat, 16 Feb 2008, Mertol Ozyoney wrote:
>
> Please try to distribute Lun's between controllers and try to benchmark by
> disabling cache mirroring. (it's different then disableing cache)

By the term "disabling cache mirroring" are you talking about "Write
Cache With Replication Enabled" in the Common Array Manager?

Does this feature maintain a redundant cache (two data copies) between
controllers?

Bob
Yes, it does replicate data between controllers. Usually it slows things
down a lot, especially in write-heavy environments. If you tune ZFS
properly, you may not need this feature for consistency...

Mertol Ozyoney

-----Original Message-----
From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
Sent: Saturday, 16 February 2008 18:43
To: Mertol Ozyoney
Cc: zfs-discuss at opensolaris.org
Subject: RE: [zfs-discuss] Performance with Sun StorageTek 2540

> By the term "disabling cache mirroring" are you talking about "Write
> Cache With Replication Enabled" in the Common Array Manager?
>
> Does this feature maintain a redundant cache (two data copies) between
> controllers?
On Sat, 16 Feb 2008, Joel Miller wrote:
> Here is how you can tell the array to ignore cache sync commands and
> the force unit access bits...(Sorry if it wraps..)

Thanks to the kind advice of yourself and Mertol Ozyoney, there is a
huge boost in write performance:

    Was: 154MB/second
    Now: 279MB/second

The average service time for each disk LUN has dropped considerably.

The numbers provided by 'zpool iostat' are very close to what is
measured by 'iozone'.

This is like night and day and gets me very close to my original
target write speed of 300MB/second.

Thank you very much!

Bob
Hi Bob;

When you have some spare time, can you prepare a simple benchmark report in
PDF that I can share with my customers to demonstrate the performance of
the 2540?

Best regards
Mertol

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org
[mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
Sent: Saturday, 16 February 2008 19:57
To: Joel Miller
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Performance with Sun StorageTek 2540

> Thanks to the kind advice of yourself and Mertol Ozyoney, there is a
> huge boost in write performance:
>
>     Was: 154MB/second
>     Now: 279MB/second
Mertol Ozyoney wrote:
>
> 2540 controler can achieve maximum 250 MB/sec on writes on the first
> 12 drives. So you are pretty close to maximum throughput already.
>
> Raid 5 can be a little bit slower.
>

I'm a bit irritated now. I have ZFS running for some Sybase ASE 12.5
databases using X4600 servers (8x dual core, 64 GB RAM, Solaris 10
11/06) and 4 GBit/s lowest cost Infortrend Fibrechannel JBODs with a
total of 4x 16 FC drives imported in a single mirrored zpool. I
benchmarked them with tiobench, using a filesize of 64 GB and 32
parallel threads. With an untweaked ZFS the average throughput I got
was: sequential & random read > 1GB/s, sequential write 296 MB/s,
random write 353 MB/s, leading to a total of approx. 650,000 IOPS with a
maximum latency of < 350 ms after the databases went into production,
and the bottlenecks are basically the FC HBAs. These are averages; the
peaks flatline pretty soon afterwards on reaching the 4 GBit/s
FibreChannel maximum capacity.

I'm a bit disturbed because I am thinking about switching to 2530/2540
shelves, but a maximum of 250 MB/sec would disqualify them instantly, even
with individual RAID controllers for each shelf. So my question is: Can
I do the same thing I did with the IFT shelves? Can I buy only 2501
JBODs and attach them directly to the server, thus *not* using the 2540
RAID controller and still having access to the single drives?

I'm quite nervous about this, because I'm not just talking about a
single database - I'd need a total of 42 shelves and I'm pretty sure
SUN doesn't offer Try&Buy deals at such a scale.

-- 
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
ralf.ramge at webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

District court (Amtsgericht) Montabaur, HRB 6484
Management board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich,
Andreas Gauger, Thomas Gottschlich, Matthias Greve, Robert Hoffmann,
Markus Huhn, Norbert Lang, Achim Weiss
Chairman of the supervisory board: Michael Scheeren
On Mon, 18 Feb 2008, Ralf Ramge wrote:
> I'm a bit disturbed because I think about switching to 2530/2540
> shelves, but a maximum 250 MB/sec would disqualify them instantly, even

Note that this is single-file/single-thread I/O performance. I suggest
that you read the formal benchmark report for this equipment since it
covers multi-thread I/O performance as well.  The multi-user
performance is considerably higher.

Given ZFS's smarts, the JBOD approach seems like a good one as long as
the hardware provides a non-volatile cache.

Bob
Bob Friesenhahn writes:
 > On Fri, 15 Feb 2008, Roch Bourbonnais wrote:
 > >>> What was the interlace on the LUN ?
 > >
 > > The question was about LUN interlace not interface.
 > > 128K to 1M works better.
 >
 > The "segment size" is set to 128K.  The max the 2540 allows is 512K.
 > Unfortunately, the StorageTek 2540 and CAM documentation does not
 > really define what "segment size" means.
 >
 > > Any compression ?
 >
 > Compression is disabled.
 >
 > > Does turn off checksum helps the number (that would point to a CPU limited
 > > throughput).
 >
 > I have not tried that but this system is loafing during the benchmark.
 > It has four 3GHz Opteron cores.
 >
 > Does this output from 'iostat -xnz 20' help to understand issues?
 >
 >                     extended device statistics
 >     r/s    w/s   kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
 >     3.0    0.7   26.4      3.5  0.0  0.0    0.0    4.2   0   2 c1t1d0
 >     0.0  154.2    0.0  19680.3  0.0 20.7    0.0  134.2   0  59 c4t600A0B80003A8A0B0000096147B451BEd0
 >     0.0  211.5    0.0  26940.5  1.1 33.9    5.0  160.5  99 100 c4t600A0B800039C9B500000A9C47B4522Dd0
 >     0.0  211.5    0.0  26940.6  1.1 33.9    5.0  160.4  99 100 c4t600A0B800039C9B500000AA047B4529Bd0
 >     0.0  154.0    0.0  19654.7  0.0 20.7    0.0  134.2   0  59 c4t600A0B80003A8A0B0000096647B453CEd0
 >     0.0  211.3    0.0  26915.0  1.1 33.9    5.0  160.5  99 100 c4t600A0B800039C9B500000AA447B4544Fd0
 >     0.0  152.4    0.0  19447.0  0.0 20.5    0.0  134.5   0  59 c4t600A0B80003A8A0B0000096A47B4559Ed0
 >     0.0  213.2    0.0  27183.8  0.9 34.1    4.2  159.9  90 100 c4t600A0B800039C9B500000AA847B45605d0
 >     0.0  152.5    0.0  19453.4  0.0 20.5    0.0  134.5   0  59 c4t600A0B80003A8A0B0000096E47B456DAd0
 >     0.0  213.2    0.0  27177.4  0.9 34.1    4.2  159.9  90 100 c4t600A0B800039C9B500000AAC47B45739d0
 >     0.0  213.2    0.0  27195.3  0.9 34.1    4.2  159.9  90 100 c4t600A0B800039C9B500000AB047B457ADd0
 >     0.0  154.4    0.0  19711.8  0.0 20.7    0.0  134.0   0  59 c4t600A0B80003A8A0B0000097347B457D4d0
 >     0.0  211.3    0.0  26958.6  1.1 33.9    5.0  160.6  99 100 c4t600A0B800039C9B500000AB447B4595Fd0
 >

Interesting that a subset of 5 disks are responding faster (which
also leads to smaller actv queues and so lower service times) than
the 7 others.

.... and the slow ones are subject to more writes...haha.

If the sizes of the LUNs are different (or they have different amounts
of free blocks), then maybe ZFS is now trying to rebalance free space
by targeting a subset of the disks with more new data.  Pool
throughput will be impacted by this.

-r
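The 5-versus-7 split noted above can be reproduced directly from the quoted iostat rows. The sketch below groups the array LUNs by their %b column and averages the write rate and service time per group; the parsing assumes the 11-column 'iostat -xn' layout shown in the quote, and the function names are made up for illustration.

```python
# Sketch only: rows are copied from the 'iostat -xnz 20' output quoted above
# (the local disk c1t1d0 is left out). Columns:
# r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device

IOSTAT = """\
0.0 154.2 0.0 19680.3 0.0 20.7 0.0 134.2 0 59 c4t600A0B80003A8A0B0000096147B451BEd0
0.0 211.5 0.0 26940.5 1.1 33.9 5.0 160.5 99 100 c4t600A0B800039C9B500000A9C47B4522Dd0
0.0 211.5 0.0 26940.6 1.1 33.9 5.0 160.4 99 100 c4t600A0B800039C9B500000AA047B4529Bd0
0.0 154.0 0.0 19654.7 0.0 20.7 0.0 134.2 0 59 c4t600A0B80003A8A0B0000096647B453CEd0
0.0 211.3 0.0 26915.0 1.1 33.9 5.0 160.5 99 100 c4t600A0B800039C9B500000AA447B4544Fd0
0.0 152.4 0.0 19447.0 0.0 20.5 0.0 134.5 0 59 c4t600A0B80003A8A0B0000096A47B4559Ed0
0.0 213.2 0.0 27183.8 0.9 34.1 4.2 159.9 90 100 c4t600A0B800039C9B500000AA847B45605d0
0.0 152.5 0.0 19453.4 0.0 20.5 0.0 134.5 0 59 c4t600A0B80003A8A0B0000096E47B456DAd0
0.0 213.2 0.0 27177.4 0.9 34.1 4.2 159.9 90 100 c4t600A0B800039C9B500000AAC47B45739d0
0.0 213.2 0.0 27195.3 0.9 34.1 4.2 159.9 90 100 c4t600A0B800039C9B500000AB047B457ADd0
0.0 154.4 0.0 19711.8 0.0 20.7 0.0 134.0 0 59 c4t600A0B80003A8A0B0000097347B457D4d0
0.0 211.3 0.0 26958.6 1.1 33.9 5.0 160.6 99 100 c4t600A0B800039C9B500000AB447B4595Fd0
"""


def parse_rows(text):
    """Turn each 11-column iostat line into a dict of the fields used below."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 11:
            continue  # skip headers or blank lines
        rows.append({
            "kw_s": float(fields[3]),
            "asvc_t": float(fields[7]),
            "busy": int(fields[9]),
            "device": fields[10],
        })
    return rows


def summarise_by_busy(rows):
    """Group LUNs by %b and report count, mean kw/s and mean asvc_t per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["busy"], []).append(row)
    return {
        busy: {
            "luns": len(members),
            "mean_kw_s": sum(r["kw_s"] for r in members) / len(members),
            "mean_asvc_t": sum(r["asvc_t"] for r in members) / len(members),
        }
        for busy, members in groups.items()
    }


for busy, stats in sorted(summarise_by_busy(parse_rows(IOSTAT)).items()):
    print(busy, stats)
```

The grouping key is simply the %b value; whether the two groups correspond to the two controllers is not something this output alone can confirm.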
Hello Joel,
Saturday, February 16, 2008, 4:09:11 PM, you wrote:
JM> Bob,
JM> Here is how you can tell the array to ignore cache sync commands
JM> and the force unit access bits...(Sorry if it wraps..)
JM> On a Solaris CAM install, the 'service' command is in "/opt/SUNWsefms/bin"
JM> To read the current settings:
JM> service -d arrayname -c read -q nvsram region=0xf2 host=0x00
JM> save this output so you can reverse the changes below easily if needed...
JM> To set new values:
JM> service -d arrayname -c set -q nvsram region=0xf2 offset=0x17 value=0x01 host=0x00
JM> service -d arrayname -c set -q nvsram region=0xf2 offset=0x18 value=0x01 host=0x00
JM> service -d arrayname -c set -q nvsram region=0xf2 offset=0x21 value=0x01 host=0x00
JM> Host region 00 is Solaris (w/Traffic Manager)
JM> You will need to reboot both controllers after making the change before it becomes active.
Is it also necessary and does it work on 2530?
-- 
Best regards,
 Robert                            mailto:milek at task.gda.pl
                                       http://milek.blogspot.com
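Because the commands Joel posted wrap badly in mail, a small dry-run helper that prints them as single lines may be handy. This is only a sketch: every flag, region, offset, and value is copied from the message quoted above, "arrayname" is the same placeholder Joel used, and the function names are made up for illustration. It prints the commands and does not run them. The value to write back when reverting is not stated in the thread, so take it from the saved output of the read command.

```python
# Path built from the directory given in the quoted message.
SERVICE = "/opt/SUNWsefms/bin/service"
REGION = "0xf2"
HOST = "0x00"  # host region 00 is Solaris (w/Traffic Manager), per the message
OFFSETS = ("0x17", "0x18", "0x21")  # the three offsets listed in the message


def read_command(array):
    """Build the command that reads the current NVSRAM settings."""
    return [SERVICE, "-d", array, "-c", "read", "-q", "nvsram",
            f"region={REGION}", f"host={HOST}"]


def set_commands(array, value="0x01"):
    """Build one 'set' command per offset; 0x01 is the value from the message."""
    return [
        [SERVICE, "-d", array, "-c", "set", "-q", "nvsram",
         f"region={REGION}", f"offset={offset}", f"value={value}",
         f"host={HOST}"]
        for offset in OFFSETS
    ]


if __name__ == "__main__":
    # Dry run only: print the commands for review instead of executing them.
    for command in [read_command("arrayname")] + set_commands("arrayname"):
        print(" ".join(command))
```

Save the output of the read command before applying anything, and remember that both controllers must be rebooted before the change becomes active.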
It is the same for the 2530, and I am fairly certain it is also valid
for the 6130, 6140, & 6540.

-Joel

On Feb 18, 2008, at 3:51 PM, Robert Milkowski <milek at task.gda.pl> wrote:
> Is it also necessary and does it work on 2530?
On Sun, 17 Feb 2008, Mertol Ozyoney wrote:
> Hi Bob;
>
> When you have some spare time can you prepare a simple benchmark report in
> PDF that I can share with my customers to demonstrate the performance of
> 2540 ?

While I do not claim that it is "simple" I have created a report on my
configuration and experience.  It should be useful for users of the
Sun StorageTek 2540, ZFS, and Solaris 10 multipathing.

See

http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf

or http://tinyurl.com/2djewn for the URL challenged.

Feel free to share this document with anyone who is interested.

Thanks

Bob
On Wed, Feb 27, 2008 at 6:17 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> See
>
> http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf

Nov 26, 2008 ??? May I borrow your time machine ? ;-)

-- 
Regards,
Cyril
On Wed, 27 Feb 2008, Cyril Plisko wrote:
>> http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf
>
> Nov 26, 2008 ??? May I borrow your time machine ? ;-)

Are there any stock prices you would like to know about?  Perhaps you
are interested in the outcome of the elections?

There was a time inversion layer in Texas.  Fixed now ...

Bob
Paul Van Der Zwan
2008-Feb-28  10:59 UTC
[zfs-discuss] Performance with Sun StorageTek 2540
> On Wed, 27 Feb 2008, Cyril Plisko wrote:
> >> http://www.simplesystems.org/users/bfriesen/zfs-discuss/2540-zfs-performance.pdf
> >
> > Nov 26, 2008 ??? May I borrow your time machine ? ;-)
>
> Are there any stock prices you would like to know about?  Perhaps you
> are interested in the outcome of the elections?

No need for a time machine, the US presidential election outcome is
already known:

http://www.theonion.com/content/video/diebold_accidentally_leaks

Paul