I know there has been a bunch of discussion of various ZFS performance issues, but I
did not see anything specifically on this. In testing a new configuration of an SE-3511
(SATA) array, I ran into an interesting ZFS performance issue. I do not believe that this
is creating a major issue for our end users (though it may be), but it is certainly
impacting our nightly backups. I am only seeing 10-20 MB/sec per thread of random read
throughput using iozone for testing. Here is the full config:

SF-V480
--- 4 x 1.2 GHz III+
--- 16 GB memory
--- Solaris 10U6 with ZFS patch and IDR for the snapshot / resilver bug
SE-3511
--- 12 x 500 GB SATA drives
--- 11 disk R5
--- dual 2 Gbps FC host connection

I have the ARC size limited to 1 GB so that I can test with a rational data set size.
The total amount of data that I am testing with is 3 GB, with a 256 KB record size. I
tested with 1 through 20 threads.

With 1 thread I got the following results:
sequential write: 112 MB/sec.
sequential read: 221 MB/sec.
random write: 96 MB/sec.
random read: 18 MB/sec.

As I scaled the number of threads (and kept the total data size the same) I got the
following (throughput is in MB/sec):

threads    sw     sr     rw     rr
   2      105    218     93     34
   4      106    219     88     52
   8       95    189     69     92
  16       71    153     76    128

As the number of threads climbs, the first three values drop once you get above 4
threads (one per CPU), but the fourth (random read) climbs well past 4 threads. It is
just about linear through 9 threads and then it starts fluctuating, but continues
climbing to at least 20 threads (I did not test past 20). Above 16 threads the random
read even exceeds the sequential read values.

Looking at iostat output for the LUN I am using, in the 1 thread case, for the first
three tests (sequential write, sequential read, random write) I see %b at 100 and actv
climb to 35 and hang out there. For the random read test I see %b at 5 to 7, actv at
less than 1 (usually around 0.5 to 0.6), wsvc_t essentially 0, and asvc_t around 14. As
the number of threads increases, the iostat values don't really change for the first
three tests, but they climb for the random read. The array is close to saturated at
about 170 MB/sec. random read (18 threads), so I know that the 18 MB/sec. value for one
thread is _not_ limited by the array.

I know the 3511 is not a high performance array, but we needed lots of bulk storage and
could not afford better when we bought these 3 years ago. Still, it seems to me that
there is something wrong with the random read performance of ZFS. To test whether this
is an effect of the 3511, I ran some tests on another system we have, as follows:

T2000
--- 32 thread 1 GHz
--- 32 GB memory
--- Solaris 10U8
--- 4 internal 72 GB SAS drives

We have a zpool built of one slice on each of the 4 internal drives configured as a
striped mirror layout (2 vdevs, each of 2 slices), so I/O is spread over all 4 spindles.
I started with 4 threads of 8 GB each (32 GB total, to ensure I got past the ARC; it is
not tuned down on this system). I saw exactly the same ratio of sequential read to
random read (the random read performance was 23% of the sequential read performance in
both cases). Based on iostat values during the test, I am saturating all four drives
with the write operations with just 1 thread. The sequential read saturates the drives
with anything more than 1 thread, and the random read does not saturate the drives until
I get to about 6 threads.

threads    sw     sr     rw     rr
   1      100    207     88     30
   2      103    370     88     53
   4       98    350     90     82
   8      101    434     92     95

I confirmed that the problem is not unique to either 10U6 or the IDR; 10U8 has the same
behavior. I confirmed that the problem is not unique to an FC attached disk array or the
SE-3511 in particular.

Then I went back and took another look at my original data (SF-V480/SE-3511) and looked
at throughput per thread. For the sequential operations and the random write, the
throughput per thread fell pretty far and pretty fast, but the per-thread random read
numbers fell very slowly.

Per thread throughput in MB/sec.

threads    sw     sr     rw     rr
   1      112    221     96     18
   2       53    109     46     17
   4       26     55     22     13
   8       12     24      9     12
  16        5     10      5      8

So this makes me think that the random read performance issue is a per-thread
limitation. Does anyone have any idea why ZFS is not reading as fast as the underlying
storage can handle in the case of random reads? Or am I seeing an artifact of iozone
itself? Is there another benchmark I should be using?

P.S. I posted an OpenOffice.org spreadsheet of my test results here:
http://www.ilk.org/~ppk/Geek/throughput-summary.ods

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Lunacon 2010 ( http://www.lunacon.org/ )
-> Technical Advisor, RPI Players
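For readers trying to reproduce this: the thread does not show the exact commands used,
so the following is only a sketch of an equivalent setup. The /etc/system value matches
the 1 GB ARC cap described above; the iozone file names and the split of the 3 GB
working set across threads are illustrative.

    * /etc/system: cap the ARC at 1 GB (0x40000000 bytes), reboot to take effect
    set zfs:zfs_arc_max = 0x40000000

    # iozone throughput-mode run approximating the test described above:
    # write/rewrite, read/reread, and random read/write, 256 KB records,
    # ~3 GB total split across 4 threads (paths are placeholders)
    iozone -i 0 -i 1 -i 2 -r 256k -s 768m -t 4 \
        -F /testpool/io/f1 /testpool/io/f2 /testpool/io/f3 /testpool/io/f4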
Try disabling prefetch.
 -- richard

On Nov 24, 2009, at 6:45 AM, Paul Kraus wrote:
> I know there has been a bunch of discussion of various ZFS performance
> issues, but I did not see anything specifically on this. In testing a
> new configuration of an SE-3511 (SATA) array, I ran into an interesting
> ZFS performance issue.
<snip>
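On Solaris 10 of that era, prefetch is controlled by the zfs_prefetch_disable tunable;
the thread does not say which method was used, but a sketch of the usual two ways:

    # temporarily, on the live system (takes effect immediately)
    echo "zfs_prefetch_disable/W0t1" | mdb -kw

    * persistently, in /etc/system (takes effect after reboot)
    set zfs:zfs_prefetch_disable = 1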
On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling <richard.elling at gmail.com> wrote:
> Try disabling prefetch.

Just tried it... no change in random read (still 17-18 MB/sec for a single thread), but
sequential read performance dropped from about 200 MB/sec. to 100 MB/sec. (as expected).
The test case is a 3 GB file accessed in 256 KB records. The ARC is set to a max of 1 GB
for testing. arcstat.pl shows that the vast majority (>95%) of reads are missing the
cache.

The reason I don't think that this is hitting our end users is that the cache hit ratio
(reported by arc_summary.pl) is 95% on the production system (I am working on our test
system and am the only one using it right now, so all the I/O load is iozone).

I think my next step (beyond more poking with DTrace) is to try a backup and see what I
get for ARC hit ratio ... I expect it to be low, but I may be surprised (then I have to
figure out why backups are as slow as they are). We are using NetBackup and it takes
about 3 days to do a FULL on a 3.3 TB zfs with about 30 million files. Differential
incrementals take 16-22 hours (and almost no data changes). The production server is an
M4000, 4 dual core CPUs, 16 GB memory, and about 25 TB of data overall. A big SAMBA file
server.

--
Paul Kraus
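The observability tools referenced here are the usual ones; a sketch of how they would
be run while the benchmark is going (the 5-second interval is illustrative):

    # sample ARC reads/hits/misses every 5 seconds during the test
    arcstat.pl 5

    # one-shot summary including the overall ARC hit ratio
    arc_summary.pl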
On Tue, 24 Nov 2009, Paul Kraus wrote:
> On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
> <richard.elling at gmail.com> wrote:
>> Try disabling prefetch.
>
> Just tried it... no change in random read (still 17-18 MB/sec for a
> single thread), but sequential read performance dropped from about 200
> MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
> accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
> arcstat.pl shows that the vast majority (>95%) of reads are missing
> the cache.

You will often see the best random access performance if you access the data using the
same record size that zfs uses. For example, if you request data in 256KB records, but
zfs is using 128KB records, then zfs needs to access, reconstruct, and concatenate two
128K zfs records before it can return any data to the user. This increases the access
latency and decreases opportunity to take advantage of concurrency.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
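Checking and matching the record size is a two-liner; a sketch (the dataset and file
paths are placeholders, and recordsize only affects files written after it is set):

    # see what record size the dataset is actually using (128K is the default)
    zfs get recordsize testpool/iozone

    # re-run write plus random read with the application record size
    # matched to the ZFS recordsize
    iozone -i 0 -i 2 -r 128k -s 3g -f /testpool/iozone/testfile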
more below...

On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote:
> Just tried it... no change in random read (still 17-18 MB/sec for a
> single thread), but sequential read performance dropped from about 200
> MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
> accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
> arcstat.pl shows that the vast majority (>95%) of reads are missing
> the cache.

hmmm... more testing needed. The question is whether the low I/O rate is because of zfs
itself, or the application? Disabling prefetch will expose the application, because zfs
is not creating additional and perhaps unnecessary read I/O.

Your data, which shows the sequential write, random write, and sequential read driving
actv to 35, is because prefetching is enabled for the read. We expect the writes to
drive actv to 35 with a sustained write workload of any flavor. The random read (with
cache misses) will stall the application, so it takes a lot of threads (>>16?) to keep
35 concurrent I/Os in the pipeline without prefetching. The ZFS prefetching algorithm is
"intelligent", so it actually complicates the interpretation of the data.

You're peaking at 658 256KB random IOPS for the 3511, or ~66 IOPS per drive. Since ZFS
will max out at 128KB per I/O, the disks see something more than 66 IOPS each. The IOPS
data from iostat would be a better metric to observe than bandwidth. These drives are
good for about 80 random IOPS each, so you may be close to disk saturation. The iostat
data for IOPS and svc_t will confirm.

The T2000 data (sheet 3) shows pretty consistently around 90 256KB IOPS per drive. Like
the 3511 case, this is perhaps 20% less than I would expect, perhaps due to the
measurement.

Also, the 3511 RAID-5 configuration will perform random reads at around 1/2 of IOPS
capacity if the partition offset is 34. This was the default long ago. The new default
is 256. The reason is that with a 34 block offset, you are almost guaranteed that a
larger I/O will stride 2 disks. You won't notice this as easily with a single thread,
but it will be measurable with more threads. Double check the offset with prtvtoc or
format.

Writes are a completely different matter. ZFS has a tendency to turn random writes into
sequential writes, so it is pretty much useless to look at random write data. The
sequential writes should easily blow through the cache on the 3511. Squinting my eyes, I
would expect the array can do around 70 MB/s writes, or 25 256KB IOPS saturated writes.
By contrast, the T2000 JBOD data shows consistent IOPS at the disk level and exposes the
track cache effect on the sequential read test.

Did I mention that I'm a member of BAARF? www.baarf.com :-)

Hint: for performance work with HDDs, pay close attention to IOPS, then convert to
bandwidth for the PHB.

> The reason I don't think that this is hitting our end users is the
> cache hit ratio (reported by arc_summary.pl) is 95% on the production
> system (I am working on our test system and am the only one using it
> right now, so all the I/O load is iozone).
>
> I think my next step (beyond more poking with DTrace) is to try a
> backup and see what I get for ARC hit ratio ... I expect it to be low,
> but I may be surprised (then I have to figure out why backups are as
> slow as they are). We are using NetBackup and it takes about 3 days to
> do a FULL on a 3.3 TB zfs with about 30 million files. Differential
> incrementals take 16-22 hours (and almost no data changes). The
> production server is an M4000, 4 dual core CPUs, 16 GB memory, and
> about 25 TB of data overall. A big SAMBA file server.

b119 has improved stat() performance, which should make a positive improvement of such
backups. But eventually you may need to move to a multi-stage backup, depending on your
business requirements.
 -- richard
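Checking the slice offset is a one-liner with prtvtoc; a sketch (the device name is a
placeholder for the 3511 LUN):

    # print the label; the "First Sector" column shows the starting offset
    # of each slice (34 vs. 256 is the concern Richard raises above)
    prtvtoc /dev/rdsk/c2t40d0s0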
Richard,

First, thank you for the detailed reply ... (comments inline below)

On Tue, Nov 24, 2009 at 6:31 PM, Richard Elling <richard.elling at gmail.com> wrote:

> hmmm... more testing needed. The question is whether the low
> I/O rate is because of zfs itself, or the application? Disabling
> prefetch will expose the application, because zfs is not creating
> additional and perhaps unnecessary read I/O.

The values reported by iozone are in pretty close agreement with what we are seeing with
iostat during the test runs. Compression is off on zfs (the iozone test data compresses
very well and yields bogus results). I am looking for a good alternative to iozone for
random testing. I did put together a crude script to spawn many dd processes accessing
the block device itself, each with a different seek over the range of the disk, and saw
results much greater than the iozone single threaded random performance.

> Your data which shows the sequential write, random write, and
> sequential read driving actv to 35 is because prefetching is enabled
> for the read. We expect the writes to drive to 35 with a sustained
> write workload of any flavor.

Understood. I tried tuning the queue size to 50 and observed that actv went to 50 (with
very little difference in performance), so returned it to the default of 35.

> The random read (with cache misses) will stall the application, so it
> takes a lot of threads (>>16?) to keep 35 concurrent I/Os in the
> pipeline without prefetching. The ZFS prefetching algorithm is
> "intelligent" so it actually complicates the interpretation of the
> data.

What bothers me is that iostat is showing the 'disk' device as not being saturated
during the random read test. I'll post iostat output that I captured yesterday to
http://www.ilk.org/~ppk/Geek/ You can clearly see the various test phases (sequential
write, rewrite, sequential read, reread, random read, then random write).

> You're peaking at 658 256KB random IOPS for the 3511, or ~66 IOPS per
> drive. Since ZFS will max out at 128KB per I/O, the disks see
> something more than 66 IOPS each. The IOPS data from iostat would be
> a better metric to observe than bandwidth. These drives are good for
> about 80 random IOPS each, so you may be close to disk saturation.
> The iostat data for IOPS and svc_t will confirm.

But ... if I am saturating the 3511 with one thread, then why do I get many times that
performance with multiple threads?

> The T2000 data (sheet 3) shows pretty consistently around 90 256KB
> IOPS per drive. Like the 3511 case, this is perhaps 20% less than I
> would expect, perhaps due to the measurement.

I ran the T2000 test to see if 10U8 behaved better and to make sure I wasn't seeing an
oddity of the 480 / 3511 case. I wanted to see if the random read behavior was similar,
and it was (in relative terms).

> Also, the 3511 RAID-5 configuration will perform random reads at
> around 1/2 IOPS capacity if the partition offset is 34. This was the
> default long ago. The new default is 256.

Our 3511's have been running 421F (latest) for a long time :-) We are religious about
keeping all the 3511 FW current and matched.

> The reason is that with a 34 block offset, you are almost guaranteed
> that a larger I/O will stride 2 disks. You won't notice this as
> easily with a single thread, but it will be measurable with more
> threads. Double check the offset with prtvtoc or format.

How do I check the offset? Output of format -> verify from one of the partitions is
below:

format> ver

Volume name = <        >
ascii name  = <SUN-StorEdge 3511-421F-517.23GB>
bytes/sector       = 512
sectors            = 1084710911
accessible sectors = 1084710878
Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm              256     517.22GB         1084694494
  1 unassigned    wm                0           0                   0
  2 unassigned    wm                0           0                   0
  3 unassigned    wm                0           0                   0
  4 unassigned    wm                0           0                   0
  5 unassigned    wm                0           0                   0
  6 unassigned    wm                0           0                   0
  8   reserved    wm       1084694495        8.00MB        1084710878

format>

> Writes are a completely different matter. ZFS has a tendency to turn
> random writes into sequential writes, so it is pretty much useless to
> look at random write data. The sequential writes should easily blow
> through the cache on the 3511.

I am seeing cache utilization of 25-30% during write tests, with occasional peaks close
to 50%. Which is expected, as I am testing against one partition on one logical drive.

> Squinting my eyes, I would expect the array can do around 70 MB/s
> writes, or 25 256KB IOPS saturated writes.

iostat and the 3511 transfer rate monitor are showing peaks of 150-180 MB/sec with
sustained throughput of 100 MB/sec.

> By contrast, the T2000 JBOD data shows consistent IOPS at the disk
> level and exposes the track cache effect on the sequential read test.

Yup, it is clear that we are easily hitting the read I/O limits of the drives in the
T2000.

> Did I mention that I'm a member of BAARF? www.baarf.com :-)

Not yet :-)

> Hint: for performance work with HDDs, pay close attention to IOPS,
> then convert to bandwidth for the PHB.

PHB ??? I do look at IOPS, but what struck me as odd was the disparate results.

<snip>

> b119 has improved stat() performance, which should make a positive
> improvement of such backups. But eventually you may need to move to a
> multi-stage backup, depending on your business requirements.

Due to contract issues (I am consulting at a government agency), we cannot yet run
OpenSolaris in production.

On our previous server for this application (Apple G5) we had 4 TB of data and about 50
million files (under HFS+) and a full backup took 3 WEEKS. We went the route of
explicitly specifying each directory in the NetBackup config and got _some_ reliability.
Today we have about 22 TB in over 200 ZFS datasets (not evenly distributed,
unfortunately), the largest of which is about 3.5 TB and 30 million files.

BTW, our overall configuration is based on h/w we bought years ago and we are having to
adapt as best we can. We are pushing to replace the SE-3511 arrays with J4400 JBODs. Our
current config has 11 disk R5 sets and 1 hot spare per 3511 tray; we carve up 'standard'
512 GB partitions, which we mirror at the zpool layer across 3511 arrays. We just add
additional mirror pairs as the data in each department grows, keeping the mirrors on
different arrays :-)

More testing results in a separate email, this one is already too long.

--
Paul Kraus
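Paul's "crude script" is not included in the thread; a minimal sketch of the same idea,
many concurrent dd readers starting at pseudo-random offsets on the raw LUN, could look
like this (device name, reader count, and sizes are all illustrative, scaled for a
roughly 500 GB LUN read in 256 KB blocks):

    #!/bin/ksh
    # spawn NPROC dd readers, each at a different pseudo-random offset,
    # to measure concurrent random read throughput from the raw device
    DEV=/dev/rdsk/c2t40d0s0     # placeholder device
    NPROC=8                     # number of concurrent readers
    BS=256k                     # read size per I/O
    COUNT=4096                  # I/Os per reader (~1 GB each)

    i=0
    while [ $i -lt $NPROC ]; do
            # RANDOM is 0-32767; scale into a 256 KB block offset that
            # stays within ~480 GB of the LUN
            SEEK=$(( RANDOM * 60 ))
            dd if=$DEV of=/dev/null bs=$BS iseek=$SEEK count=$COUNT &
            i=$(( i + 1 ))
    done
    wait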
I posted baseline stats at http://www.ilk.org/~ppk/Geek/

The baseline test was 1 thread, 3 GiB file, 64 KiB to 512 KiB record size.

480-3511-baseline.xls is an iozone output file.
iostat-baseline.txt is the iostat output for the device in use (annotated).

I also noted an odd behavior yesterday and have not had a chance to better qualify it. I
was testing various combinations of vdev quantities and mirror quantities.

As I changed the number of vdevs (stripes) from 1 through 8 (all backed by partitions on
the same logical disk on the 3511) there was no real change in sequential write, random
write, or random read performance. Sequential read performance did show a drop from 216
MiB/sec. at 1 vdev to 180 MiB/sec. at 8 vdevs. This was about as expected.

As I changed the number of mirror components things got interesting. Keep in mind that I
only have one 3511 for testing right now; I had to use partitions from two other
production 3511's to get three mirror components on different arrays. As expected, as I
went from 1 to 2 to 3 mirror components the write performance did not change, but the
read performance was interesting... see below:

             read performance
mirrors    sequential       random
   1       174 MiB/sec.      23 MiB/sec.
   2       229 MiB/sec.      30 MiB/sec.
   3       223 MiB/sec.     125 MiB/sec.

What the heck happened here? Going from 1 to 2 mirrors gave a large increase in
sequential read performance, and going from 2 to 3 mirrors gave a HUGE increase in
random read performance. It "feels" like the behavior of the zfs code changed between 2
and 3 mirrors for the random read data.

To investigate further, I tried multiple mirror components on the same array (my test
3511), not that you would do this in production, but I was curious what would happen. In
this case the throughput degraded across the board as I added mirror components, as one
would expect. In the random read case the array was delivering less overall performance
than it was when it was one part of the earlier test (16 MiB/sec. combined vs. 1/3 of
125 MiB/sec.). See sheet 7 of http://www.ilk.org/~ppk/Geek/throughput-summary.ods for
these test results.

Sheet 8 is the last test I did last night, using the NRAID logical disk type to try to
get the 3511 to pass a disk through to zfs but keep the advantage of the cache on the
3511. I'm not sure what to read into those numbers.

--
Paul Kraus
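The pool manipulations behind the mirror-component test are not shown; growing a single
vdev from 1-way to 2-way to 3-way is done with zpool attach, roughly as follows (device
names are placeholders for partitions on the three 3511 arrays):

    # single-device vdev ("1 mirror component")
    zpool create testpool c2t40d0s0

    # attach a second side on a second array -> 2-way mirror
    zpool attach testpool c2t40d0s0 c3t44d0s0

    # attach a third side on a third array -> 3-way mirror
    zpool attach testpool c2t40d0s0 c4t46d0s0

    # wait for the resilver to finish before benchmarking
    zpool status testpool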
If you are using (3) 3511's, then won't it be possible that your 3 GB workload will be
largely or entirely served out of RAID controller cache?

Also, I had a question about your production backups (millions of small files): do you
have atime=off set for the filesystems? That might be helpful.
--
This message posted from opensolaris.org
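Turning atime off is a one-line property change per filesystem (the dataset name here is
a placeholder):

    # stop updating access times on reads; avoids a metadata write per file
    # touched during backups of millions of small files
    zfs set atime=off testpool/department1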
On Wed, Nov 25, 2009 at 7:54 AM, Paul Kraus <pk1048 at gmail.com> wrote:
>> You're peaking at 658 256KB random IOPS for the 3511, or ~66 IOPS per
>> drive. Since ZFS will max out at 128KB per I/O, the disks see
>> something more than 66 IOPS each. The IOPS data from iostat would be
>> a better metric to observe than bandwidth. These drives are good for
>> about 80 random IOPS each, so you may be close to disk saturation.
>> The iostat data for IOPS and svc_t will confirm.
>
> But ... if I am saturating the 3511 with one thread, then why do I get
> many times that performance with multiple threads?

I'm having trouble making sense of the iostat data (I can't tell how many threads are
running at any given point), but I do see lots of times where asvc_t * reads is in the
range of 850 ms to 950 ms. That is, this is as fast as a single threaded app with a
little bit of think time can issue reads (100 reads * 9 ms svc_t + 100 reads * 1 ms
think_time = 1 sec). The %busy shows that 90+% of the time there is an I/O in flight
(100 reads * 9 ms = 900/1000 = 90%). However, %busy isn't aware of how many I/Os could
be in flight simultaneously. When you fire up more threads, you are able to have more
I/Os in flight concurrently.

I don't believe that the IOPS per drive is really a limiting factor in the single
threaded case, as the spec sheet for the 3511 says that it has 1 GB of cache per
controller. Your working set is small enough that it is somewhat likely that many of
those random reads will be served from cache.

A dtrace analysis of just how random the reads are would be interesting. I think that
hotspot.d from the DTrace Toolkit would be a good starting place.

--
Mike Gerdts
http://mgerdts.blogspot.com/
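A back-of-envelope check of that reading (the ~9 ms figure comes from the asvc_t in the
iostat data above; the rest is straight arithmetic): one outstanding read at a time at
roughly 9 ms service time plus ~1 ms of think time allows about 100 reads per second,
and 100 reads/sec of 128 KB ZFS records is about 12.8 MB/s, or roughly 25 MB/s if each
256 KB application record maps to two back-to-back 128 KB reads. Either way it is in the
same ballpark as the 18-30 MB/s single-thread numbers reported earlier, which is
consistent with a per-thread latency bound rather than an array limit.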
more below...

On Nov 25, 2009, at 5:54 AM, Paul Kraus wrote:
> The values reported by iozone are in pretty close agreement with what
> we are seeing with iostat during the test runs. Compression is off on
> zfs (the iozone test data compresses very well and yields bogus
> results). I am looking for a good alternative to iozone for random
> testing. I did put together a crude script to spawn many dd processes
> accessing the block device itself, each with a different seek over the
> range of the disk, and saw results much greater than the iozone single
> threaded random performance.

filebench is usually bundled in /usr/benchmarks or as a pkg. vdbench is easy to use and
very portable, www.vdbench.org

> Understood. I tried tuning the queue size to 50 and observed that actv
> went to 50 (with very little difference in performance), so returned
> it to the default of 35.

Yep, the bottleneck is on the back end (physical HDDs). For arrays with lots of HDDs,
this queue can be deeper, but the 3500 series is way too small to see this. If SSDs are
used on the back end, then you can revisit this.

From the data, it does look like the random read tests are converging on the media
capabilities of the disks in the array. For the array you can see the read-modify-write
penalty of RAID-5 as well as the caching and prefetching of reads.

Note: the physical I/Os are 128 KB, regardless of the iozone size setting. This is
expected, since 128 KB is the default recordsize limit for ZFS.

> What bothers me is that iostat is showing the 'disk' device as not
> being saturated during the random read test. I'll post iostat output
> that I captured yesterday to http://www.ilk.org/~ppk/Geek/

Is this a single thread? Usually this means that you aren't creating enough load. ZFS
won't be prefetching (as much) for a random read workload, so iostat will expose client
bottlenecks.

> How do I check the offset? Output of format -> verify from one of the
> partitions is below:
>
> Part      Tag    Flag     First Sector        Size        Last Sector
>   0        usr    wm              256     517.22GB         1084694494

This is it: First Sector = 256. Good.

> iostat and the 3511 transfer rate monitor are showing peaks of 150-180
> MB/sec with sustained throughput of 100 MB/sec.

[Richard tries to remember if the V480 uses schizo?]
[searching...]
[found it]
Ok, a quick browse shows that the V480 uses two schizo ASICs as the UPA to PCI bridges.
Don't expect more than 200 MB/s from a schizo.
http://www.sun.com/processors/manuals/External_Schizo_PRM.pdf

>> Did I mention that I'm a member of BAARF? www.baarf.com :-)
>
> Not yet :-)
>
>> Hint: for performance work with HDDs, pay close attention to IOPS,
>> then convert to bandwidth for the PHB.
>
> PHB ???

Not a fan of Dilbert? :-)

>> b119 has improved stat() performance, which should make a positive
>> improvement of such backups. But eventually you may need to move to a
>> multi-stage backup, depending on your business requirements.
>
> Due to contract issues (I am consulting at a government agency), we
> cannot yet run OpenSolaris in production.

Look for CR6775100 to be rolled into a Solaris 10 patch. It might take another 6 months
or so, if it gets backported.

> On our previous server for this application (Apple G5) we had 4 TB of
> data and about 50 million files (under HFS+) and a full backup took 3
> WEEKS. We went the route of explicitly specifying each directory in
> the NetBackup config and got _some_ reliability. Today we have about
> 22 TB in over 200 ZFS datasets (not evenly distributed,
> unfortunately), the largest of which is about 3.5 TB and 30 million
> files.

Yep, this is becoming more common as people build larger file systems. I briefly
describe the multistage backup here:
http://richardelling.blogspot.com/2009/08/backups-for-file-systems-with-millions.html
Of course, there are quite a few design details that will vary based on business
requirements...

> BTW, our overall configuration is based on h/w we bought years ago and
> we are having to adapt as best we can. We are pushing to replace the
> SE-3511 arrays with J4400 JBODs. Our current config has 11 disk R5
> sets and 1 hot spare per 3511 tray; we carve up 'standard' 512 GB
> partitions, which we mirror at the zpool layer across 3511 arrays. We
> just add additional mirror pairs as the data in each department grows,
> keeping the mirrors on different arrays :-)

In general, RAID-5 (or raidz) performs poorly for random reads. It gets worse when the
reads are small.
 -- richard
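To make the IOPS-to-bandwidth conversion concrete, using the figures quoted earlier in
the thread: 658 random IOPS at 256 KB is about 658 x 0.25 MB ≈ 165 MB/s, which matches
the ~170 MB/s saturation point Paul saw at 18 threads; spread over the 10 data drives of
an 11-disk RAID-5 set that is roughly 66 IOPS per drive, close to the ~80 random IOPS
per drive Richard cites as the drives' capability.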
more below...

On Nov 25, 2009, at 7:10 AM, Paul Kraus wrote:
> As I changed the number of mirror components things got interesting.
<snip>
>              read performance
> mirrors    sequential       random
>    1       174 MiB/sec.      23 MiB/sec.
>    2       229 MiB/sec.      30 MiB/sec.
>    3       223 MiB/sec.     125 MiB/sec.
>
> What the heck happened here? Going from 1 to 2 mirrors gave a large
> increase in sequential read performance, and going from 2 to 3 mirrors
> gave a HUGE increase in random read performance. It "feels" like the
> behavior of the zfs code changed between 2 and 3 mirrors for the
> random read data.

I can't explain this. It may require a detailed understanding of the hardware
configuration to identify the potential bottleneck. The ZFS mirroring code doesn't care
how many mirrors there are, it just goes through the list. If the performance is not
symmetrical from all sides of the mirror, then YMMV.

> To investigate further, I tried multiple mirror components on the same
> array (my test 3511), not that you would do this in production, but I
> was curious what would happen. In this case the throughput degraded
> across the board as I added mirror components, as one would expect. In
> the random read case the array was delivering less overall performance
> than it was when it was one part of the earlier test (16 MiB/sec.
> combined vs. 1/3 of 125 MiB/sec.). See sheet 7 of
> http://www.ilk.org/~ppk/Geek/throughput-summary.ods for these test
> results. Sheet 8 is the last test I did last night, using the NRAID
> logical disk type to try to get the 3511 to pass a disk through to zfs
> but keep the advantage of the cache on the 3511. I'm not sure what to
> read into those numbers.

I read it as the single array, as configured, with 10+1 RAID-5 can deliver around 130
random read IOPS @ 128 KB.
 -- richard
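For reference, 130 random read IOPS at 128 KB works out to about 16 MB/s, which lines up
with the 16 MiB/sec. combined figure Paul measured when all of the mirror components sat
on the single test array.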