Hi all, just after sending a message to sunmanagers I realized that my question should rather have gone here, so sunmanagers please excuse the double post.

I have inherited an X4140 (8 SAS slots) and have just set up the system with Solaris 10 09. I first set up the system on a mirrored pool over the first two disks:

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

and then tried to add the second pair of disks to this pool, which did not work (the famous error message regarding the disk label / root-pool BIOS issue). I therefore simply created an additional pool, tank:

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c1t0d0s0  ONLINE       0     0     0
            c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0

errors: No known data errors

So far so good. I have now replaced the last two SAS disks with 32GB SSDs and am wondering how to add these to the system. I googled a lot for best practices but found nothing so far that made me any wiser. My current approach is still to simply do

  zpool add tank mirror c0t6d0 c0t7d0

as I would do with normal disks, but I am wondering whether that's the right approach to significantly increase system performance. Will ZFS automatically use these SSDs and optimize accesses to tank? Probably! But it won't optimize accesses to rpool, of course. I am not sure whether I need that or should look for it. Should I try to get all disks into rpool in spite of the BIOS label issue, so that the SSDs are used for all accesses to the disk system?

Hints (best practices) are greatly appreciated!

Thanks a lot,
Andreas
I don't think adding an SSD mirror to an existing pool will do much for performance. Some of your data will surely land on those SSDs, but Solaris will not know they are SSDs and will not move blocks in and out according to usage patterns to give you an all-around boost. They will just be used to store data, nothing more.

It will probably be more useful to add the SSDs as either an L2ARC (cache) device or a separate log for the ZIL, but that depends on your workload. If you serve NFS or iSCSI, putting the ZIL onto the SSD drive(s) will speed up synchronous writes; adding them as L2ARC will speed up reads.

Here is the ZFS best practices guide, which should help with this decision:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Read that, then come back with more questions.

Best,
Scott
--
This message posted from opensolaris.org
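For anyone following along, a minimal sketch of what Scott describes, assuming the new SSDs really are c0t6d0/c0t7d0 as in Andreas's message and that each has been sliced with format(1M) into a small s0 for the log and the remainder as s1 (the slice layout is an assumption, not something from the original post):

    # mirrored separate log (ZIL) on the small slices
    zpool add tank log mirror c0t6d0s0 c0t7d0s0

    # L2ARC on the large slices; cache devices are simply listed, not mirrored
    zpool add tank cache c0t6d0s1 c0t7d0s1

Using one whole SSD as log and the other as cache would also work; which split is better depends on how write-heavy the workload is.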
I have a similar question. I put together a cheap RAID with four 1TB WD Black (7200 rpm) SATAs in a 3TB RAIDZ1, and I added a 64GB OCZ Vertex SSD, with slice 0 (5GB) for the ZIL and the rest of the SSD for cache:

# zpool status dpool
  pool: dpool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        dpool         ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c0t0d1    ONLINE       0     0     0
            c0t0d2    ONLINE       0     0     0
            c0t0d3    ONLINE       0     0     0
        logs
          c0t0d4s0    ONLINE       0     0     0
        cache
          c0t0d4s1    ONLINE       0     0     0
        spares
          c0t0d6      AVAIL
          c0t0d7      AVAIL

                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
----------    -----  -----  -----  -----  -----  -----
dpool         72.1G  3.55T    237     12  29.7M   597K
  raidz1      72.1G  3.55T    237      9  29.7M   469K
    c0t0d0        -      -    166      3  7.39M   157K
    c0t0d1        -      -    166      3  7.44M   157K
    c0t0d2        -      -    166      3  7.39M   157K
    c0t0d3        -      -    167      3  7.45M   157K
  c0t0d4s0       20K  4.97G      0      3      0   127K
cache             -      -      -      -      -      -
  c0t0d4s1    17.6G  36.4G      3      1   249K   119K
----------    -----  -----  -----  -----  -----  -----

I just don't seem to be getting the bang for the buck I should be. This was taken while rebuilding an Oracle index, with all files stored in this pool. The WD disks are at 100%, and nothing is coming from the cache. The cache does hold the entire DB (17.6G used), but hardly anything is read from it. I also am not seeing the spike of data flowing into the ZIL, although iostat shows there is just write traffic hitting the SSD:

                  extended device statistics                      cpu
device    r/s    w/s    kr/s   kw/s  wait  actv  svc_t  %w  %b  us sy wt id
sd0     170.0    0.4  7684.7    0.0   0.0  35.0  205.3   0 100  11  8  0 82
sd1     168.4    0.4  7680.2    0.0   0.0  34.6  205.1   0 100
sd2     172.0    0.4  7761.7    0.0   0.0  35.0  202.9   0 100
sd3       0.0    0.0     0.0    0.0   0.0   0.0    0.0   0   0
sd4     170.0    0.4  7727.1    0.0   0.0  35.0  205.3   0 100
sd5       1.6    2.6   182.4  104.8   0.0   0.5  117.8   0  31   <-- the SSD

Since this SSD is in a RAID array and just presents as a regular disk LUN, is there a special incantation required to turn on the turbo mode? Doesn't it seem that all this traffic should be maxing out the SSD, with reads from the cache and writes to the ZIL? I have a second identical SSD I wanted to add as a mirror, but it seems pointless if there's no zip to be had...

help?

Thanks,
Tracey
--
This message posted from opensolaris.org
On Fri, Feb 12, 2010 at 02:25:51PM -0800, TMB wrote:
> I have a similar question. I put together a cheap RAID with four 1TB WD
> Black (7200 rpm) SATAs in a 3TB RAIDZ1, and I added a 64GB OCZ Vertex SSD,
> with slice 0 (5GB) for the ZIL and the rest of the SSD for cache:
> [...]
> I just don't seem to be getting the bang for the buck I should be. This was
> taken while rebuilding an Oracle index, with all files stored in this pool.
> The WD disks are at 100%, and nothing is coming from the cache. The cache
> does hold the entire DB (17.6G used), but hardly anything is read from it.
> [...]
> Doesn't it seem that all this traffic should be maxing out the SSD, with
> reads from the cache and writes to the ZIL? I have a second identical SSD
> I wanted to add as a mirror, but it seems pointless if there's no zip to
> be had...

The most likely reason is that this workload has been identified as streaming by ZFS, which is prefetching from disk instead of the L2ARC (l2arc_noprefetch=1).

It also looks like you've used a 128 Kbyte ZFS record size. Is Oracle doing 128 Kbyte random I/O? We usually tune that down before creating the database, which will use the L2ARC device more efficiently.

Brendan

--
Brendan Gregg, Fishworks                    http://blogs.sun.com/brendan
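A quick sketch of the record-size tuning Brendan describes; the dataset name dpool/oracle is hypothetical, and note that recordsize only affects files written after the change, so existing datafiles have to be recreated or copied to pick it up:

    # match the ZFS record size to Oracle's db_block_size (commonly 8K)
    zfs set recordsize=8k dpool/oracle

    # verify the property
    zfs get recordsize dpool/oracle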
Thanks, Brendan. I was going to move it over to an 8 KB record size once I got through this index rebuild. My thinking was that a disproportionate block size would show up as excessive I/O throughput, not a lack of it.

The question about the cache comes from the fact that the 18GB or so that it says is in the cache IS the database. This was why I was thinking the index rebuild should be CPU constrained, and I should see a spike in reading from the cache. If the entire file is cached, why would it go to the disks at all for the reads? The disks are delivering about 30MB/s of reads, but this SSD is rated for 70MB/s sustained, so there should be a chance to pick up a 100% gain.

I've seen lots of mention of kernel settings, but those only seem to apply to cache flushes on sync writes. Any idea where to look next? I've spent about a week tinkering with it. I'm trying to get a major customer to switch over to ZFS and an open storage solution, but I'm afraid that if I can't get it to work at the small scale, I can't convince them about the large scale.

Thanks,
Tracey

On Fri, Feb 12, 2010 at 4:43 PM, Brendan Gregg - Sun Microsystems <brendan at sun.com> wrote:
> The most likely reason is that this workload has been identified as
> streaming by ZFS, which is prefetching from disk instead of the L2ARC
> (l2arc_noprefetch=1).
>
> It also looks like you've used a 128 Kbyte ZFS record size. Is Oracle
> doing 128 Kbyte random I/O? We usually tune that down before creating
> the database, which will use the L2ARC device more efficiently.
>
> Brendan
--
Tracey Bernath
913-488-6284
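One way to check whether reads are actually being served from the ARC/L2ARC during the rebuild is to watch the arcstats kstats; a minimal sketch, with kstat names as found in OpenSolaris-era builds (verify they exist on your release):

    # overall ARC hit/miss counters and current size
    kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses zfs:0:arcstats:size

    # L2ARC counters: size held, hits, misses
    kstat -p zfs:0:arcstats | egrep 'l2_(hits|misses|size)'

If l2_hits stays flat while the pool disks sit at 100% busy, the L2ARC is being bypassed (as Brendan suggests, via the prefetch path) rather than simply being cold.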
comment below...

On Feb 12, 2010, at 2:25 PM, TMB wrote:
> I have a similar question. I put together a cheap RAID with four 1TB WD
> Black (7200 rpm) SATAs in a 3TB RAIDZ1, and I added a 64GB OCZ Vertex SSD,
> with slice 0 (5GB) for the ZIL and the rest of the SSD for cache:
> [...]
>                   extended device statistics                      cpu
> device    r/s    w/s    kr/s   kw/s  wait  actv  svc_t  %w  %b  us sy wt id
> sd0     170.0    0.4  7684.7    0.0   0.0  35.0  205.3   0 100  11  8  0 82
> sd1     168.4    0.4  7680.2    0.0   0.0  34.6  205.1   0 100
> sd2     172.0    0.4  7761.7    0.0   0.0  35.0  202.9   0 100
> sd3       0.0    0.0     0.0    0.0   0.0   0.0    0.0   0   0
> sd4     170.0    0.4  7727.1    0.0   0.0  35.0  205.3   0 100
> sd5       1.6    2.6   182.4  104.8   0.0   0.5  117.8   0  31

iostat has an "n" option, which is very useful for looking at device names :-)

The SSD here is performing well. The rest are clobbered; a 205 millisecond response time will be agonizingly slow.

By default, for this version of ZFS, up to 35 I/Os will be queued to each disk, which is why you see 35.0 in the "actv" column. The combination of actv=35 and svc_t>200 indicates that this is the place to start working. Begin by reducing zfs_vdev_max_pending from 35 to something like 1 to 4. This will reduce the concurrent load on the disks, thus reducing svc_t.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29

 -- richard
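For reference, a sketch of Richard's suggestion in the form the Evil Tuning Guide documents; the value 4 is just an example starting point to measure against, not a recommendation:

    # live, on a running system (takes effect immediately, lost at reboot)
    echo zfs_vdev_max_pending/W0t4 | mdb -kw

    # persistent: add this line to /etc/system (takes effect at next boot)
    set zfs:zfs_vdev_max_pending = 4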
OK, that was the magic incantation I was looking for:
- changing the noprefetch option opened the floodgates to the L2ARC
- changing the max queue depth relieved the wait time on the drives, although I may undo this again in the benchmarking since these drives all have NCQ

I went from all four disks of the array at 100%, doing about 170 read IOPS / 25MB/s, to all four disks of the array at 0%, with nearly 500 IOPS / 65MB/s coming off the cache drive (at only 50% load). This bodes well for adding a second mirrored cache drive to push for the 1K IOPS.

Now I am ready to insert the mirror for the ZIL and the CACHE, and we will be ready for some production benchmarking.

BEFORE:
device    r/s    w/s    kr/s   kw/s  wait  actv  svc_t  %w  %b  us sy wt id
sd0     170.0    0.4  7684.7    0.0   0.0  35.0  205.3   0 100  11  8  0 82
sd1     168.4    0.4  7680.2    0.0   0.0  34.6  205.1   0 100
sd2     172.0    0.4  7761.7    0.0   0.0  35.0  202.9   0 100
sd4     170.0    0.4  7727.1    0.0   0.0  35.0  205.3   0 100
sd5       1.6    2.6   182.4  104.8   0.0   0.5  117.8   0  31

AFTER:
                    extended device statistics
  r/s    w/s      kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  0.0    0.0       0.0    0.0   0.0   0.0     0.0     0.0   0   0  c0t0d0
  0.0    0.0       0.0    0.0   0.0   0.0     0.0     0.0   0   0  c0t0d1
  0.0    0.0       0.0    0.0   0.0   0.0     0.0     0.0   0   0  c0t0d2
  0.0    0.0       0.0    0.0   0.0   0.0     0.0     0.0   0   0  c0t0d3
285.2    0.8   36236.2   14.4   0.0   0.5     0.0     1.8   1  37  c0t0d4

And, keep in mind this was on less than $1000 of hardware.

Thanks for the pointers,
Tracey

On Sat, Feb 13, 2010 at 9:22 AM, Richard Elling <richard.elling at gmail.com> wrote:
> The SSD here is performing well. The rest are clobbered; a 205 millisecond
> response time will be agonizingly slow.
>
> By default, for this version of ZFS, up to 35 I/Os will be queued to each
> disk, which is why you see 35.0 in the "actv" column. [...] Begin by
> reducing zfs_vdev_max_pending from 35 to something like 1 to 4. This will
> reduce the concurrent load on the disks, thus reducing svc_t.
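Tracey doesn't show the exact command behind "changing the noprefetch option"; a likely form, based on the l2arc_noprefetch tunable Brendan named (treat the variable name and module prefix as assumptions to verify against your build):

    # persistent: add to /etc/system, effective at next boot;
    # lets the L2ARC cache prefetched/streaming reads as well
    set zfs:l2arc_noprefetch = 0

    # or live, via mdb (lost at reboot)
    echo l2arc_noprefetch/W0t0 | mdb -kw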
For those following the saga: with the prefetch problem fixed, and data coming off the L2ARC instead of the disks, the system switched from I/O bound to CPU bound. I opened up the throttles with some explicit PARALLEL hints in the Oracle commands, and we were finally able to max out the single SSD:

  r/s    w/s      kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
826.0    3.2  104361.8   35.2   0.0   9.9     0.0    12.0   3 100  c0t0d4

So, when we maxed out the SSD cache, it was delivering 100+ MB/s and 830 IOPS, with 3.4 TB behind it in a 4-disk SATA RAIDZ1. I still have to remap it to 8K blocks to get more efficiency, but for raw numbers, it's exactly what I was looking for.

Now, to add the second SSD ZIL/L2ARC for a mirror. I may even splurge for one more to get a three-way mirror. That will completely saturate the SCSI channel. Now I need a bigger server.... Did I mention it was <$1000 for the whole setup? Bah-ha-ha-ha.....

Tracey

On Sat, Feb 13, 2010 at 11:51 PM, Tracey Bernath <tbernath at ix.netcom.com> wrote:
> OK, that was the magic incantation I was looking for:
> - changing the noprefetch option opened the floodgates to the L2ARC
> - changing the max queue depth relieved the wait time on the drives
> [...]
--
Tracey Bernath
913-488-6284
On Sun, Feb 14, 2010 at 11:08:52PM -0600, Tracey Bernath wrote:
> Now, to add the second SSD ZIL/L2ARC for a mirror.

Just to be clear: mirror the ZIL by all means, but don't mirror the l2arc; just add more devices and let them load-balance. This is especially true if you're sharing SSD writes with the ZIL, as slices on the same devices.

> I may even splurge for one more to get a three way mirror.

With more devices, questions about selecting different devices appropriate for each purpose come into play.

> Now I need a bigger server....

See? :)

--
Dan.
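A sketch of the two operations Dan distinguishes, using Tracey's existing c0t0d4 slices and a hypothetical second SSD at c0t0d5 sliced the same way (the second device name and its slice layout are assumptions):

    # mirror the separate log: attach the new slog slice to the existing one
    zpool attach dpool c0t0d4s0 c0t0d5s0

    # do NOT mirror the cache; just add the second cache slice and let
    # ZFS load-balance reads and fills across both
    zpool add dpool cache c0t0d5s1

zpool attach on a log device turns it into a mirrored log vdev; cache devices cannot be mirrored and are simply used side by side.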
On Mon, Feb 15, 2010 at 5:51 PM, Daniel Carosone <dan at geek.com.au> wrote:
> Just to be clear: mirror the ZIL by all means, but don't mirror the
> l2arc; just add more devices and let them load-balance. This is
> especially true if you're sharing SSD writes with the ZIL, as slices
> on the same devices.

Well, the problem I am trying to solve is: wouldn't it read 2x faster with the mirror? It seems that once I can drive the single device to 10 queued actions and 100% busy, it would be more useful to have two channels to the same data. Is ZFS not smart enough to understand that there are two identical mirror devices in the cache to split requests across? Or are you saying that ZFS is smart enough to cache it in two places, although not mirrored?

If the device itself were full, and items were falling off the L2ARC, then I could see having two separate cache devices, but since I am only at about 50% utilization of the available capacity, and maxing out the I/O, mirroring seemed smarter.

Am I missing something here?

Tracey
On Mon, 15 Feb 2010, Tracey Bernath wrote:
> If the device itself were full, and items were falling off the L2ARC,
> then I could see having two separate cache devices, but since I am only
> at about 50% utilization of the available capacity, and maxing out the
> I/O, mirroring seemed smarter.
>
> Am I missing something here?

I doubt it. The only way to know for sure is to test it, but it seems unlikely to me that the ZFS implementors would fail to load-share reads across mirrored L2ARC devices.

Richard's points about L2ARC bandwidth vs. pool disk bandwidth are still good ones. L2ARC is all about read latency, but it does not necessarily help with read bandwidth. It is also useful to keep in mind that L2ARC offers at least 40x less bandwidth than ARC in RAM, so always populate RAM first if you can afford it.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mon, Feb 15, 2010 at 09:11:02PM -0600, Tracey Bernath wrote:
> Well, the problem I am trying to solve is: wouldn't it read 2x faster
> with the mirror? It seems that once I can drive the single device to 10
> queued actions and 100% busy, it would be more useful to have two
> channels to the same data. Is ZFS not smart enough to understand that
> there are two identical mirror devices in the cache to split requests
> across? Or are you saying that ZFS is smart enough to cache it in two
> places, although not mirrored?

First, Bob is right: measurement trumps speculation. Try it.

As for speculation, you're thinking only about reads. I expect reading from l2arc devices will be the same as reading from any other zfs mirror, and largely the same in both cases above: load-balanced across either device. In the rare case of a bad read from an unmirrored l2arc, the data will be fetched from the pool, so mirroring the l2arc doesn't add any resiliency benefit.

However, your cache needs to be populated and maintained as well, and this needs writes. Twice as many of them for the mirror as for the "stripe", and half of what is written never needs to be read again. These writes go to the same SSD devices you're using for the ZIL; on commodity SSDs, which are not well write-optimised, they may be hurting ZIL latency by making the SSD do more writing, stealing from the total IOPS count on the channel, and (as a lesser concern) adding wear cycles to the device.

When you're already maxing out the I/O, eliminating wasted cycles opens your bottleneck, even if only a little.

Once you reach steady state, I don't know how much turnover in l2arc contents you will have, and therefore how many extra writes we're talking about. It may not be many, but they are unnecessary ones.

Normally, we'd talk about measuring a potential benefit and then choosing based on the results. In this case, if I were you, I'd eliminate the unnecessary writes and measure the difference more as a matter of curiosity and research, since I was already set up to do so.

--
Dan.
On Feb 16, 2010, at 12:39 PM, Daniel Carosone wrote:
> However, your cache needs to be populated and maintained as well, and
> this needs writes. Twice as many of them for the mirror as for the
> "stripe", and half of what is written never needs to be read again.
> These writes go to the same SSD devices you're using for the ZIL; on
> commodity SSDs, which are not well write-optimised, they may be hurting
> ZIL latency by making the SSD do more writing, stealing from the total
> IOPS count on the channel, and (as a lesser concern) adding wear cycles
> to the device.

The L2ARC writes are throttled to 8 MB/sec, except during cold start, where the throttle is 16 MB/sec. This should not be noticeable on the channels.

> When you're already maxing out the I/O, eliminating wasted cycles opens
> your bottleneck, even if only a little.

+1
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
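The throttle Richard describes corresponds to the l2arc_write_max and l2arc_write_boost variables in the ARC code (8 MB per fill interval, with the boost added on top during warm-up); a hedged sketch of how one might inspect or override them, since the variable names, defaults, and mdb format letters can differ between builds:

    # read the current fill throttle values (bytes), live
    echo l2arc_write_max/E | mdb -k
    echo l2arc_write_boost/E | mdb -k

    # persistent override in /etc/system, e.g. 16 MB (0x1000000), next boot
    set zfs:l2arc_write_max = 0x1000000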
On Sun, Feb 14, 2010 at 12:51 PM, Tracey Bernath <tbernath at ix.netcom.com> wrote:
> I went from all four disks of the array at 100%, doing about 170 read
> IOPS / 25MB/s, to all four disks of the array at 0%, with nearly 500
> IOPS / 65MB/s coming off the cache drive (at only 50% load).

> And, keep in mind this was on less than $1000 of hardware.

Really? The complete box and all, or just the disks? Because the four disks alone should cost about $400. Did you use ECC RAM?

--
Fajar