repost - Sorry for cc'ing the other forums.

I'm running into an issue where there seems to be a high number of read IOPS hitting the disks, and physical free memory is fluctuating between 200MB and 450MB out of 16GB total. We have the L2ARC configured on a 32GB Intel X25-E SSD and the slog on another 32GB X25-E SSD. According to our tester, Oracle writes are extremely slow (high latency). Below is a snippet of iostat:

    r/s    w/s  Mr/s  Mw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0
    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0t0d0
 4898.3   34.2  23.2   1.4   0.1 385.3     0.0    78.1   0 1246  c1
    0.0    0.8   0.0   0.0   0.0   0.0     0.0    16.0   0    1  c1t0d0
  401.7    0.0   1.9   0.0   0.0  31.5     0.0    78.5   1  100  c1t1d0
  421.2    0.0   2.0   0.0   0.0  30.4     0.0    72.3   1   98  c1t2d0
  403.9    0.0   1.9   0.0   0.0  32.0     0.0    79.2   1  100  c1t3d0
  406.7    0.0   2.0   0.0   0.0  33.0     0.0    81.3   1  100  c1t4d0
  414.2    0.0   1.9   0.0   0.0  28.6     0.0    69.1   1   98  c1t5d0
  406.3    0.0   1.8   0.0   0.0  32.1     0.0    79.0   1  100  c1t6d0
  404.3    0.0   1.9   0.0   0.0  31.9     0.0    78.8   1  100  c1t7d0
  404.1    0.0   1.9   0.0   0.0  34.0     0.0    84.1   1  100  c1t8d0
  407.1    0.0   1.9   0.0   0.0  31.2     0.0    76.6   1  100  c1t9d0
  407.5    0.0   2.0   0.0   0.0  33.2     0.0    81.4   1  100  c1t10d0
  402.8    0.0   2.0   0.0   0.0  33.5     0.0    83.2   1  100  c1t11d0
  408.9    0.0   2.0   0.0   0.0  32.8     0.0    80.3   1  100  c1t12d0
    9.6   10.8   0.1   0.9   0.0   0.4     0.0    20.1   0   17  c1t13d0
    0.0   22.7   0.0   0.5   0.0   0.5     0.0    22.8   0   33  c1t14d0

Is this an indicator that we need more physical memory? From http://blogs.sun.com/brendan/entry/test, the order in which a read request is satisfied is:

1) ARC
2) vdev cache of L2ARC devices
3) L2ARC devices
4) vdev cache of disks
5) disks

Using arc_summary.pl, we determined that prefetch was not helping much, so we disabled it.

CACHE HITS BY DATA TYPE:
    Demand Data:        22%  158853174
    Prefetch Data:      17%  123009991  <--- not helping???
    Demand Metadata:    60%  437439104
    Prefetch Metadata:   0%    2446824

The write IOPS then started to kick in more and latency dropped on the spinning disks:

    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0
    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0t0d0
 1629.0  968.0  17.4   7.3   0.0  35.9     0.0    13.8   0 1088  c1
    0.0    1.9   0.0   0.0   0.0   0.0     0.0     1.7   0    0  c1t0d0
  126.7   67.3   1.4   0.2   0.0   2.9     0.0    14.8   0   90  c1t1d0
  129.7   76.1   1.4   0.2   0.0   2.8     0.0    13.7   0   90  c1t2d0
  128.0   73.9   1.4   0.2   0.0   3.2     0.0    16.0   0   91  c1t3d0
  128.3   79.1   1.3   0.2   0.0   3.6     0.0    17.2   0   92  c1t4d0
  125.8   69.7   1.3   0.2   0.0   2.9     0.0    14.9   0   89  c1t5d0
  128.3   81.9   1.4   0.2   0.0   2.8     0.0    13.1   0   89  c1t6d0
  128.1   69.2   1.4   0.2   0.0   3.1     0.0    15.7   0   93  c1t7d0
  128.3   80.3   1.4   0.2   0.0   3.1     0.0    14.7   0   91  c1t8d0
  129.2   69.3   1.4   0.2   0.0   3.0     0.0    15.2   0   90  c1t9d0
  130.1   80.0   1.4   0.2   0.0   2.9     0.0    13.6   0   89  c1t10d0
  126.2   72.6   1.3   0.2   0.0   2.8     0.0    14.2   0   89  c1t11d0
  129.7   81.0   1.4   0.2   0.0   2.7     0.0    12.9   0   88  c1t12d0
   90.4   41.3   1.0   4.0   0.0   0.2     0.0     1.2   0    6  c1t13d0
    0.0   24.3   0.0   1.2   0.0   0.0     0.0     0.2   0    0  c1t14d0

Is it true that if your MFU stats start to go over 50%, more memory is needed?

CACHE HITS BY CACHE LIST:
    Anon:                        10%   74845266              [ New Customer, First Cache Hit ]
    Most Recently Used:          19%  140478087 (mru)        [ Return Customer ]
    Most Frequently Used:        65%  475719362 (mfu)        [ Frequent Customer ]
    Most Recently Used Ghost:     2%   20785604 (mru_ghost)  [ Return Customer Evicted, Now Back ]
    Most Frequently Used Ghost:   1%    9920089 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]
CACHE HITS BY DATA TYPE:
    Demand Data:        22%  158852935
    Prefetch Data:      17%  123009991
    Demand Metadata:    60%  437438658
    Prefetch Metadata:   0%    2446824

My theory is that since there is not enough memory for the ARC to cache data, reads hit the L2ARC, miss there as well, and have to go to the disks. This causes contention between reads and writes, which inflates the service times.

uname: 5.10 Generic_141445-09 i86pc i386 i86pc
Sun Fire X4270: 11+1 raidz (SAS)
l2arc: Intel X25-E
slog:  Intel X25-E

Thoughts?
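[For reference: low "free" memory on a ZFS host is often just the ARC consuming otherwise-idle pages rather than a true shortage. A minimal way to see where the memory actually sits, assuming only the stock Solaris mdb and kstat interfaces (nothing site-specific):

  # kernel memory breakdown; recent Solaris 10 updates break out "ZFS File Data"
  echo ::memstat | mdb -k

  # current ARC size versus its adaptive target and configured max
  kstat -p zfs:0:arcstats:size
  kstat -p zfs:0:arcstats:c
  kstat -p zfs:0:arcstats:c_max

If the ARC is pinned at its max and the miss counters keep climbing, that points the same way as the iostat data above.]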
OK, I'll take a stab at it...

On Dec 26, 2009, at 9:52 PM, Brad wrote:

> repost - Sorry for cc'ing the other forums.
>
> I'm running into an issue where there seems to be a high number of read IOPS hitting the disks, and physical free memory is fluctuating between 200MB and 450MB out of 16GB total. We have the l2arc configured on a 32GB Intel X25-E ssd and slog on another 32GB X25-E ssd.

OK, this shows that memory is being used... a good thing.

> According to our tester, Oracle writes are extremely slow (high latency).

OK, this is a workable problem statement... another good thing.

> Below is a snippet of iostat:
> [first iostat listing quoted in the previous message]

You are getting 400+ IOPS @ 4 KB out of HDDs. Count your lucky stars. Don't expect that kind of performance as normal; it is much better than normal.

> Is this an indicator that we need more physical memory? From http://blogs.sun.com/brendan/entry/test, the order in which a read request is satisfied is:

0) Oracle SGA

> 1) ARC
> 2) vdev cache of L2ARC devices
> 3) L2ARC devices
> 4) vdev cache of disks
> 5) disks
>
> Using arc_summary.pl, we determined that prefetch was not helping much, so we disabled it.
>
> [arc_summary cache-hit breakdown quoted in the previous message]
>
> The write IOPS then started to kick in more and latency dropped on the spinning disks:
> [second iostat listing quoted in the previous message]

Latency is reduced, but you are also now only seeing 200 IOPS, not 400+ IOPS. This is closer to what you would see as a max for HDDs.

I cannot tell which device is the cache device. I would expect to see one disk with significantly more reads than the others. What do the l2arc stats show?

> Is it true that if your MFU stats start to go over 50%, more memory is needed?

That is a good indicator. It means that most of the cache entries are frequently used. Grow your SGA and you should see this go down.

> [CACHE HITS BY CACHE LIST / BY DATA TYPE breakdown quoted in the previous message]
>
> My theory is that since there is not enough memory for the ARC to cache data, reads hit the L2ARC, miss there as well, and have to go to the disks. This causes contention between reads and writes, which inflates the service times.

If you have a choice of where to use memory, always choose closer to the application. Try a larger SGA first. Be aware of large page stealing -- consider increasing the SGA immediately after a reboot, before the database or applications are started.
 -- richard

> uname: 5.10 Generic_141445-09 i86pc i386 i86pc
> Sun Fire X4270: 11+1 raidz (SAS)
> l2arc: Intel X25-E
> slog:  Intel X25-E
> Thoughts?
Richard - the l2arc is c1t13d0. What tools can be used to show the l2arc stats?

  raidz1    2.68T   580G   543   453  4.22M  3.70M
    c1t1d0      -      -   258   102   689K   358K
    c1t2d0      -      -   256   103   684K   354K
    c1t3d0      -      -   258   102   690K   359K
    c1t4d0      -      -   260   103   687K   354K
    c1t5d0      -      -   255   101   686K   358K
    c1t6d0      -      -   263   103   685K   354K
    c1t7d0      -      -   259   101   689K   358K
    c1t8d0      -      -   259   103   687K   354K
    c1t9d0      -      -   260   102   689K   358K
    c1t10d0     -      -   263   103   686K   354K
    c1t11d0     -      -   260   102   687K   359K
    c1t12d0     -      -   263   104   684K   354K
  c1t14d0     396K  29.5G     0    65      7  3.61M
  cache           -      -     -     -      -      -
    c1t13d0   29.7G  11.1M   157    84  3.93M  6.45M

We've added 16GB to the box, bringing the overall total to 32GB. arc_max is set to 8GB:

  set zfs:zfs_arc_max = 8589934592

arc_summary output:

ARC Size:
    Current Size:             8192 MB (arcsize)
    Target Size (Adaptive):   8192 MB (c)
    Min Size (Hard Limit):    1024 MB (zfs_arc_min)
    Max Size (Hard Limit):    8192 MB (zfs_arc_max)

ARC Size Breakdown:
    Most Recently Used Cache Size:    39%  3243 MB (p)
    Most Frequently Used Cache Size:  60%  4948 MB (c-p)

ARC Efficency:
    Cache Access Total:      154663786
    Cache Hit Ratio:   41%    64221251  [Defined State for buffer]
    Cache Miss Ratio:  58%    90442535  [Undefined State for Buffer]
    REAL Hit Ratio:    41%    64221251  [MRU/MFU Hits Only]

    Data Demand Efficiency:    38%
    Data Prefetch Efficiency:  DISABLED (zfs_prefetch_disable)

    CACHE HITS BY CACHE LIST:
        Anon:                        --%  Counter Rolled.
        Most Recently Used:          17%  11118906 (mru)        [ Return Customer ]
        Most Frequently Used:        82%  53102345 (mfu)        [ Frequent Customer ]
        Most Recently Used Ghost:    14%   9427708 (mru_ghost)  [ Return Customer Evicted, Now Back ]
        Most Frequently Used Ghost:   6%   4344287 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]
    CACHE HITS BY DATA TYPE:
        Demand Data:        84%  54444108
        Prefetch Data:       0%         0
        Demand Metadata:    15%   9777143
        Prefetch Metadata:   0%         0
    CACHE MISSES BY DATA TYPE:
        Demand Data:        96%  87542292
        Prefetch Data:       0%         0
        Demand Metadata:     3%   2900243
        Prefetch Metadata:   0%         0

Also disabled file-level prefetch and the vdev cache max:

  set zfs:zfs_prefetch_disable = 1
  set zfs:zfs_vdev_cache_max = 0x1

After reading about some issues with concurrent I/Os, I tweaked the setting down from 35 to 1 and it reduced the response times greatly (to 2-8 ms):

  set zfs:zfs_vdev_max_pending = 1

It did increase the actv... I'm still unsure about the side effects here:

    r/s    w/s  Mr/s  Mw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0
    0.0    0.0   0.0   0.0   0.0   0.0     0.0     0.0   0    0  c0t0d0
 2295.2  398.7   4.2   7.2   0.0  18.6     0.0     6.9   0 1084  c1
    0.0    0.8   0.0   0.0   0.0   0.0     0.0     0.1   0    0  c1t0d0
  190.3   22.9   0.4   0.0   0.0   1.5     0.0     7.0   0   87  c1t1d0
  180.9   20.6   0.3   0.0   0.0   1.7     0.0     8.5   0   95  c1t2d0
  195.0   43.0   0.3   0.2   0.0   1.6     0.0     6.8   0   93  c1t3d0
  193.2   21.7   0.4   0.0   0.0   1.5     0.0     6.8   0   88  c1t4d0
  195.7   34.8   0.3   0.1   0.0   1.7     0.0     7.5   0   97  c1t5d0
  186.8   20.6   0.3   0.0   0.0   1.5     0.0     7.3   0   88  c1t6d0
  188.4   21.0   0.4   0.0   0.0   1.6     0.0     7.7   0   91  c1t7d0
  189.6   21.2   0.3   0.0   0.0   1.6     0.0     7.4   0   91  c1t8d0
  193.8   22.6   0.4   0.0   0.0   1.5     0.0     7.1   0   91  c1t9d0
  192.6   20.8   0.3   0.0   0.0   1.4     0.0     6.8   0   88  c1t10d0
  195.7   22.2   0.3   0.0   0.0   1.5     0.0     6.7   0   88  c1t11d0
  184.7   20.3   0.3   0.0   0.0   1.4     0.0     6.8   0   84  c1t12d0
    7.3   82.4   0.1   5.5   0.0   0.0     0.0     0.2   0    1  c1t13d0
    1.3   23.9   0.0   1.3   0.0   0.0     0.0     0.2   0    0  c1t14d0

I'm still in talks with the DBA about raising the SGA from 4GB to 6GB to see if it helps.

The changes that showed a lot of improvement were disabling file/device-level prefetch and reducing concurrent I/Os from 35 to 1 (tried 10, but it didn't help much). Is there anything else that could be tweaked to increase write performance? Record sizes are set accordingly: 8K, and 128K for the redo logs.
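[On the "what tools" question: the L2ARC feed and hit/miss counters live in the same arcstats kstat as the ARC counters, and zpool iostat -v (as shown above) gives the per-device view of the cache disk. A minimal check, assuming only the standard kstat names; the pool name is a placeholder:

  # L2ARC counters (l2_hits, l2_misses, l2_size, l2_read_bytes, l2_write_bytes, ...)
  kstat -p zfs:0:arcstats | grep l2_

  # per-vdev activity, including the cache and log devices, sampled every 5 seconds
  zpool iostat -v <pool> 5
]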
Hi Brad, comments below...

On Dec 27, 2009, at 10:24 PM, Brad wrote:

> Richard - the l2arc is c1t13d0. What tools can be used to show the l2arc stats?
>
> [zpool iostat -v listing quoted in the previous message]
>
> We've added 16GB to the box, bringing the overall total to 32GB.

In general, this is always a good idea.

> arc_max is set to 8GB:
>   set zfs:zfs_arc_max = 8589934592

You will be well served to add much more memory to the SGA and reduce that to the ARC. More below...

> [arc_summary output quoted in the previous message]
>
> Also disabled file-level prefetch and the vdev cache max:
>   set zfs:zfs_prefetch_disable = 1
>   set zfs:zfs_vdev_cache_max = 0x1

I think this is a waste of time. The database will prefetch, by default, so you might as well start that work ahead of time. Note that ZFS uses an intelligent prefetch algorithm, so if it detects that the accesses are purely random, it won't prefetch.

> After reading about some issues with concurrent I/Os, I tweaked the setting down from 35 to 1 and it reduced the response times greatly (to 2-8 ms):
>   set zfs:zfs_vdev_max_pending = 1

This can be a red herring. Judging by the number of IOPS below, it has not improved. At this point, I will assume you are using disks that have NCQ or CTQ (e.g. most SATA and all FC/SAS drives). If you only issue one command at a time, you effectively disable NCQ and thus cannot take advantage of its efficiencies.

> It did increase the actv... I'm still unsure about the side effects here:
> [iostat listing quoted in the previous message]
>
> I'm still in talks with the DBA about raising the SGA from 4GB to 6GB to see if it helps.

Try an SGA more like 20-25 GB. Remember, the database can cache more effectively than any file system underneath. The best I/O is the I/O you don't have to make.
 -- richard

> The changes that showed a lot of improvement were disabling file/device-level prefetch and reducing concurrent I/Os from 35 to 1 (tried 10, but it didn't help much). Is there anything else that could be tweaked to increase write performance? Record sizes are set accordingly: 8K, and 128K for the redo logs.
"Try an SGA more like 20-25 GB. Remember, the database can cache more effectively than any file system underneath. The best I/O is the I/O you don''t have to make." We''ll be turning up the SGA size from 4GB to 16GB. The arc size will be set from 8GB to 4GB. "This can be a red herring. Judging by the number of IOPS below, it has not improved. At this point, I will assume you are using disks that have NCQ or CTQ (eg most SATA and all FC/SAS drives). If you only issue one command at a time, you effectively disable NCQ and thus cannot take advantage of its efficiencies." Here''s another sample of the data taken at another time after the number of concurrent ios change from 10 to 1. We''re using Seagate Savio 10K SAS drives...I could not pull up info if the drives support NCQ or not. What''s the recommended value to set concurrent IOs to? r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0 1402.2 7805.3 2.7 36.2 0.2 54.9 0.0 6.0 0 940 c1 10.8 1.0 0.1 0.0 0.0 0.1 0.0 7.0 0 7 c1t0d0 117.1 640.7 0.2 1.8 0.0 4.5 0.0 5.9 1 76 c1t1d0 116.9 638.2 0.2 1.7 0.0 4.6 0.0 6.1 1 78 c1t2d0 116.4 639.1 0.2 1.8 0.0 4.6 0.0 6.0 1 78 c1t3d0 116.6 638.1 0.2 1.7 0.0 4.6 0.0 6.1 1 77 c1t4d0 113.2 638.0 0.2 1.8 0.0 4.6 0.0 6.1 1 77 c1t5d0 116.6 635.3 0.2 1.7 0.0 4.5 0.0 6.0 1 76 c1t6d0 116.2 637.8 0.2 1.8 0.0 4.7 0.0 6.2 1 79 c1t7d0 115.3 636.7 0.2 1.8 0.0 4.4 0.0 5.8 1 77 c1t8d0 115.4 637.8 0.2 1.8 0.0 4.5 0.0 5.9 1 77 c1t9d0 114.8 635.0 0.2 1.8 0.0 4.3 0.0 5.7 1 76 c1t10d0 114.9 639.9 0.2 1.8 0.0 4.7 0.0 6.2 1 78 c1t11d0 115.1 638.7 0.2 1.8 0.0 4.4 0.0 5.9 1 77 c1t12d0 1.6 140.0 0.0 15.1 0.0 0.6 0.0 4.4 0 8 c1t13d0 1.3 9.1 0.0 0.1 0.0 0.0 0.0 1.0 0 0 c1t14d0 -- This message posted from opensolaris.org
On Dec 28, 2009, at 12:40 PM, Brad wrote:

> "Try an SGA more like 20-25 GB. Remember, the database can cache more effectively than any file system underneath. The best I/O is the I/O you don't have to make."
>
> We'll be turning the SGA size up from 4GB to 16GB. The arc size will be reduced from 8GB to 4GB.

This doesn't make sense to me. You've got 32 GB, why not use it? Artificially limiting the memory use to 20 GB seems like a waste of good money.

> "This can be a red herring. Judging by the number of IOPS below, it has not improved. At this point, I will assume you are using disks that have NCQ or CTQ (e.g. most SATA and all FC/SAS drives). If you only issue one command at a time, you effectively disable NCQ and thus cannot take advantage of its efficiencies."
>
> Here's another sample of the data, taken at another time after the number of concurrent I/Os was changed from 10 to 1. We're using Seagate Savvio 10K SAS drives... I could not find information on whether the drives support NCQ or not. What's the recommended value to set concurrent I/Os to?
>
> [iostat listing quoted in the previous message]

SAS will be CTQ, basically the same thing as NCQ for SATA disks. You can see here that you're averaging 4.6 I/Os queued at the disks (actv column) and the response time is quite good. Meanwhile, the disks are handling more than 700 IOPS with less than 10 ms response time. Not bad at all for HDDs, but not a level that can be expected, either.

Here we see more than 600 small write IOPS. These will be sequential (as in contiguous blocks, not sequential as in large blocks), so they get buffered and efficiently written by the disk. When your workload returns to the read-mostly random activity, the IOPS will go down.

As to what is the magic number? It is hard to say. In this case, more than 4 is good. Remember, the default of 35 is as much of a guess as anything. For HDDs, 35 might be a little bit too much, but for a RAID array, something more like 1,000 might be optimal. Keeping an eye on the actv column of iostat can help you make that decision.
 -- richard
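[If you want to experiment with the queue depth while watching the actv column, the tunable can also be poked on a live system rather than only via /etc/system. A sketch using the usual mdb syntax; the value 10 is just an example, not a recommendation:

  # per-device queue depth (actv) and service times, 5-second samples, non-idle devices only
  iostat -xnz 5

  # read the current value, then set it to 10 on the fly (0t prefix = decimal)
  echo "zfs_vdev_max_pending/D" | mdb -k
  echo "zfs_vdev_max_pending/W0t10" | mdb -kw

The /etc/system entry is still needed to make the setting survive a reboot.]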
"This doesn''t make sense to me. You''ve got 32 GB, why not use it? Artificially limiting the memory use to 20 GB seems like a waste of good money." I''m having a hard time convincing the dbas to increase the size of the SGA to 20GB because their philosophy is, no matter what eventually you''ll have to hit disk to pick up data thats not stored in cache (arc or l2arc). The typical database server in our environment holds over 3TB of data. If the performance does not improve then we''ll possibly have to change the raid layout from raidz to a raid10. -- This message posted from opensolaris.org
On Mon, 28 Dec 2009, Brad wrote:

> I'm having a hard time convincing the DBAs to increase the size of the SGA to 20GB because their philosophy is that, no matter what, eventually you'll have to hit disk to pick up data that's not stored in cache (ARC or L2ARC). The typical database server in our environment holds over 3TB of data.

But if the working set is 25GB, then things will be magically better. If it is 50GB or 500GB, then performance may still suck.

> If the performance does not improve, then we'll possibly have to change the raid layout from raidz to a raid10.

Mirror vdevs are what is definitely recommended for use with databases.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Dec 28, 2009, at 1:40 PM, Brad wrote:

> "This doesn't make sense to me. You've got 32 GB, why not use it? Artificially limiting the memory use to 20 GB seems like a waste of good money."
>
> I'm having a hard time convincing the DBAs to increase the size of the SGA to 20GB because their philosophy is that, no matter what, eventually you'll have to hit disk to pick up data that's not stored in cache (ARC or L2ARC). The typical database server in our environment holds over 3TB of data.

Wow! Where did you find DBAs who didn't want more resources? :-) If that is the case, then you might need many more disks to keep the (hungry) database fed.

> If the performance does not improve, then we'll possibly have to change the raid layout from raidz to a raid10.

Yes, the notion of adding more disks and using them as mirrors are closely aligned. However, you know that the data in the ARC is more than 50% frequently used, which makes the argument that a larger SGA (or ARC) should benefit the workload.
 -- richard
On Mon, Dec 28, 2009 at 01:40:03PM -0800, Brad wrote:

> "This doesn't make sense to me. You've got 32 GB, why not use it? Artificially limiting the memory use to 20 GB seems like a waste of good money."
>
> I'm having a hard time convincing the DBAs to increase the size of the SGA to 20GB because their philosophy is that, no matter what, eventually you'll have to hit disk to pick up data that's not stored in cache (ARC or L2ARC). The typical database server in our environment holds over 3TB of data.

Brad, are your DBAs aware that if you increase your SGA (currently 4 GB):

- to 8 GB, you get 100% more memory for the SGA
- to 16 GB, you get 300% more memory for the SGA
- to 20 GB, you get 400% ...

If they are not aware, well... But try to be patient - I had a similar situation. It took quite a long time to convince our DBAs to increase the SGA from 16 GB to 20 GB. Finally they did :-) You can always use the "stronger" argument that not using already-bought memory is a waste of _money_.

Regards
Przemyslaw Bak (przemol)
--
http://przemol.blogspot.com/
Thanks for the suggestion!

I have heard that mirrored-vdev configurations are preferred for Oracle, but what's the difference between a raidz mirrored vdev and a raid10 setup?

We have tested a ZFS stripe configuration before with 15 disks, and our tester was extremely happy with the performance. After talking to our tester, she doesn't feel comfortable with the current raidz setup.
On Dec 29, 2009, at 7:55 AM, Brad <beneri3 at yahoo.com> wrote:

> Thanks for the suggestion!
>
> I have heard that mirrored-vdev configurations are preferred for Oracle, but what's the difference between a raidz mirrored vdev and a raid10 setup?

A mirrored raidz provides redundancy at a steep cost to performance, and might I add, a high monetary cost. Because each write to a raidz is striped across the disks, the effective IOPS of the vdev is equal to that of a single disk. This can be improved by utilizing multiple (smaller) raidz vdevs which are striped, but not by mirroring them.

With raid10, each mirrored pair has the IOPS of a single drive. Since these mirrors are typically 2-disk vdevs, you can have a lot more of them and thus a lot more IOPS (some people talk about using 3-disk mirrors, but that's probably just as good as setting copies=2 on a regular pool of mirrors).

> We have tested a ZFS stripe configuration before with 15 disks, and our tester was extremely happy with the performance. After talking to our tester, she doesn't feel comfortable with the current raidz setup.

How many LUNs are you working with now? 15? Is the storage direct-attached, or is it coming from a storage server that may have the physical disks in a raid configuration already? If direct-attached, create a pool of mirrors. If it's coming from a storage server where the disks are in a raid already, just create a striped pool and set copies=2.

-Ross
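[A rough sketch of the two layouts Ross describes; the pool and device names are placeholders, and note that copies=2 only applies to data written after the property is set:

  # direct-attached disks: a pool of 2-way mirrors
  zpool create tank mirror c1t1d0 c1t2d0 mirror c1t3d0 c1t4d0 mirror c1t5d0 c1t6d0

  # LUNs already RAID-protected by the array: plain stripe plus extra ZFS copies
  zpool create tank c2t0d0 c2t1d0 c2t2d0
  zfs set copies=2 tank
]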
On Tue, Dec 29 at 4:55, Brad wrote:

> Thanks for the suggestion!
>
> I have heard that mirrored-vdev configurations are preferred for Oracle, but what's the difference between a raidz mirrored vdev and a raid10 setup?
>
> We have tested a ZFS stripe configuration before with 15 disks, and our tester was extremely happy with the performance. After talking to our tester, she doesn't feel comfortable with the current raidz setup.

As a general rule of thumb, each vdev has random performance roughly the same as a single member of that vdev. Having six RAIDZ vdevs in a pool should give roughly the same performance as a stripe of six bare drives, for random I/O.

When your workload is expected to be bounded by random I/O performance, in ZFS you want to increase the number of vdevs to be as large as possible, which distributes the random work across all of your disks. Building a pool out of 2-disk mirrors, then, is the preferred layout for random performance, since it's the highest ratio of disks to vdevs you can achieve (short of non-fault-tolerant configurations).

This winds up looking similar to RAID10 in layout, in that you're striping across a lot of disks, each of which consists of a mirror, though the checksumming rules are different. Performance should also be similar, though it's possible RAID10 may give slightly better random read performance at the expense of some data quality guarantees, since I don't believe RAID10 normally validates checksums on returned data if the device didn't return an error. In normal practice, RAID10 and a pool of mirrored vdevs should benchmark against each other within your margin of error.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
@ross "Because each write of a raidz is striped across the disks the effective IOPS of the vdev is equal to that of a single disk. This can be improved by utilizing multiple (smaller) raidz vdevs which are striped, but not by mirroring them." So with random reads, would it perform better on a raid5 layout since the FS blocks are written to each disk instead of a stripe? With zfs''s implementation of raid10, would we still get data protection and checksumming? "How many luns are you working with now? 15? Is the storage direct attached or is it coming from a storage server that may have the physical disks in a raid configuration already? If direct attached, create a pool of mirrors. If it''s coming from a storage server where the disks are in a raid already, just create a striped pool and set copies=2." We''re not using a SAN but a Sun X4270 with sixteen SAS drives (two dedicated to OS, two for ssd, raid 11+1. There''s a total of seven datasets from a single pool. -- This message posted from opensolaris.org
@eric "As a general rule of thumb, each vdev has the random performance roughly the same as a single member of that vdev. Having six RAIDZ vdevs in a pool should give roughly the performance as a stripe of six bare drives, for random IO." It sounds like we''ll need 16 vdevs striped in a pool to at least get the performance of 15 drives plus another 16 mirrored for redundancy. If we are bounded in iops by the vdev, would it make sense to go with the bare minimum of drives (3) per vdev? "This winds up looking similar to RAID10 in layout, in that you''re striping across a lot of disks that each consists of a mirror, though the checksumming rules are different. Performance should also be similar, though it''s possible RAID10 may give slightly better random read performance at the expense of some data quality guarantees, since I don''t believe RAID10 normally validates checksums on returned data if the device didn''t return an error. In normal practice, RAID10 and a pool of mirrored vdevs should benchmark against each other within your margin of error." That''s interesting to know that with ZFS''s implementation of raid10 it doesn''t have checksumming built-in. -- This message posted from opensolaris.org
On Tue, 29 Dec 2009, Ross Walker wrote:

> A mirrored raidz provides redundancy at a steep cost to performance, and might I add, a high monetary cost.

I am not sure what a "mirrored raidz" is. I have never heard of such a thing before.

> With raid10, each mirrored pair has the IOPS of a single drive. Since these mirrors are typically 2-disk vdevs, you can have a lot more of them and thus a lot more IOPS (some people talk about using 3-disk mirrors, but that's probably just as good as setting copies=2 on a regular pool of mirrors).

This is another case where using a term like "raid10" does not make sense when discussing zfs. ZFS does not support "raid10". ZFS does not support RAID 0 or RAID 1, so it can't support RAID 1+0 (RAID 10).

Some important points to consider are that every write to a raidz vdev must be synchronous. In other words, the write needs to complete on all the drives in the stripe before the write may return as complete. This is also true of "RAID 1" (mirrors), which specifies that the drives are perfect duplicates of each other. However, zfs does not implement "RAID 1" either. This is easily demonstrated since you can unplug one side of the mirror and the writes to the zfs mirror will still succeed, catching up the mirror which is behind as soon as it is plugged back in. When using mirrors, zfs supports logic which will catch that mirror back up (only sending the missing updates) when connectivity improves. With RAID 1 there is no way to recover a mirror other than a full copy from the other drive.

Zfs load-shares across vdevs, so it will load-share across mirror vdevs rather than striping (as RAID 10 would require).

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
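[Bob's point about a detached mirror side catching up can be seen with a quick experiment; pool and device names here are placeholders, and only the transactions written while the device was away get resilvered:

  zpool offline tank c1t2d0    # take one side of a mirror out of service
  # ... writes continue against the remaining side ...
  zpool online tank c1t2d0     # ZFS resilvers only the missing transaction groups
  zpool status tank            # shows the (usually very short) resilver
]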
On Tue, Dec 29, 2009 at 18:16, Brad <beneri3 at yahoo.com> wrote:

> @eric
>
> "As a general rule of thumb, each vdev has the random performance roughly the same as a single member of that vdev. Having six RAIDZ vdevs in a pool should give roughly the performance as a stripe of six bare drives, for random IO."
>
> It sounds like we'll need 16 vdevs striped in a pool to at least get the performance of 15 drives, plus another 16 mirrored for redundancy.
>
> If we are bounded in IOPS by the vdev, would it make sense to go with the bare minimum of drives (3) per vdev?

The minimum is 1 drive per vdev. The minimum with redundancy is 2 if you use mirroring. You should use mirroring to get the best performance.

> "This winds up looking similar to RAID10 in layout, in that you're striping across a lot of disks that each consists of a mirror, though the checksumming rules are different. [...]"
>
> That's interesting to know that with ZFS's implementation of raid10 it doesn't have checksumming built in.

He was talking about RAID10, not mirroring in ZFS. ZFS will always use checksums.
On Tue, Dec 29 at 9:16, Brad wrote:

> @eric
>
> "As a general rule of thumb, each vdev has the random performance roughly the same as a single member of that vdev. Having six RAIDZ vdevs in a pool should give roughly the performance as a stripe of six bare drives, for random IO."
>
> It sounds like we'll need 16 vdevs striped in a pool to at least get the performance of 15 drives, plus another 16 mirrored for redundancy.

If you were striping across 16 devices before, you will achieve similar random I/O performance by striping across 16 vdevs, regardless of their type. Sequential throughput is more a function of the number of devices, not the number of vdevs, in that a 3-disk RAIDZ will have (roughly) the sequential write throughput of a pair of drives.

You still get checksumming, but if a device fails or you get a corruption in your non-redundant stripe, zfs may not have enough information to repair your data. For a read-only data reference, maybe a restore from backup in these situations is okay, but for most installations that is unacceptable.

The disk cost of a raidz pool of mirrors is identical to the disk cost of raid10.

> If we are bounded in IOPS by the vdev, would it make sense to go with the bare minimum of drives (3) per vdev?

ZFS supports non-redundant vdev layouts, but they're generally not recommended. The smallest mirror you can build is 2 devices, and the smallest raidz is 3 devices per vdev.

> "This winds up looking similar to RAID10 in layout, in that you're striping across a lot of disks that each consists of a mirror, though the checksumming rules are different. [...]"
>
> That's interesting to know that with ZFS's implementation of raid10 it doesn't have checksumming built in.

I don't believe I said this. I am reasonably certain that all zpool/zfs layouts validate checksums, even if built with no redundancy. The "RAID10-similar" layout in ZFS is an array of mirrors, such that you build a bunch of 2-device mirrored vdevs and add them all into a single pool. You wind up with a layout like:

  Pool0
    mirror-0
      disk0
      disk1
    mirror-1
      disk2
      disk3
    mirror-2
      disk4
      disk5
    ...
    mirror-N
      disk-2N
      disk-2N+1

This will give you the best random I/O performance possible with ZFS, independent of the type of disks used. (Obviously some of the same rules may not apply with ramdisks or SSDs, but those are special cases for most.)

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
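[For reference, the layout Eric sketches above falls out of a single zpool create with one "mirror" clause per vdev, and the stripe of mirrors can be grown later; the disk names here are placeholders for real device paths, and how zpool status labels each mirror vdev depends on the zpool version:

  zpool create Pool0 mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5
  zpool add Pool0 mirror disk6 disk7   # add another mirror pair to the stripe later
  zpool status Pool0                   # shows one mirror vdev per pair, as in the layout above
]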
On Dec 29, 2009, at 9:16 AM, Brad wrote:

> @eric
>
> "As a general rule of thumb, each vdev has the random performance roughly the same as a single member of that vdev. Having six RAIDZ vdevs in a pool should give roughly the performance as a stripe of six bare drives, for random IO."

This model begins to break down with raidz2 and further breaks down with raidz3. Since I wrote about this simple model here:
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
we've refined it a bit to take into account the number of parity devices. For small, random read IOPS, the performance of a single, top-level vdev is:

    performance = performance of a disk * (N / (N - P))

where
    N = number of disks in the vdev
    P = number of parity devices in the vdev

For example, using 5 disks @ 100 IOPS we get something like:

    2-disk mirror: 200 IOPS
    4+1 raidz:     125 IOPS
    3+2 raidz2:    167 IOPS
    2+3 raidz3:    250 IOPS

Once again, it is clear that mirroring will offer the best small, random read IOPS.

> It sounds like we'll need 16 vdevs striped in a pool to at least get the performance of 15 drives, plus another 16 mirrored for redundancy.
>
> If we are bounded in IOPS by the vdev, would it make sense to go with the bare minimum of drives (3) per vdev?
>
> "This winds up looking similar to RAID10 in layout, in that you're striping across a lot of disks that each consists of a mirror, though the checksumming rules are different. [...]"
>
> That's interesting to know that with ZFS's implementation of raid10 it doesn't have checksumming built in.

ZFS always checksums everything unless you explicitly disable checksumming for data. Metadata is always checksummed.
 -- richard
On Tue, Dec 29, 2009 at 12:07 PM, Richard Elling <richard.elling at gmail.com> wrote:

> This model begins to break down with raidz2 and further breaks down with raidz3. [...] For small, random read IOPS, the performance of a single, top-level vdev is:
>     performance = performance of a disk * (N / (N - P))
> [...] Once again, it is clear that mirroring will offer the best small, random read IOPS.
>
> ZFS always checksums everything unless you explicitly disable checksumming for data. Metadata is always checksummed.
> -- richard

I imagine he's referring to the fact that it cannot fix any checksum errors it finds.

<flamesuit>Let me open the can of worms by saying this is nearly as bad as not doing checksumming at all. Knowing the data is bad when you can't do anything to fix it doesn't really help if you have no way to regenerate it.</flamesuit>

--
--Tim
Eric D. Mudama wrote:

> On Tue, Dec 29 at 9:16, Brad wrote:
> The disk cost of a raidz pool of mirrors is identical to the disk cost of raid10.

ZFS can't do a raidz of mirrors or a mirror of raidz. Members of a mirror or raidz[123] must be a fundamental device (i.e. a file or a drive).

>> "This winds up looking similar to RAID10 in layout, in that you're striping across a lot of disks that each consists of a mirror, though the checksumming rules are different. [...]"
>>
>> That's interesting to know that with ZFS's implementation of raid10 it doesn't have checksumming built in.
>
> I don't believe I said this. I am reasonably certain that all zpool/zfs layouts validate checksums, even if built with no redundancy. The "RAID10-similar" layout in ZFS is an array of mirrors, such that you build a bunch of 2-device mirrored vdevs and add them all into a single pool. You wind up with a layout like:

Yes. PLEASE be careful - checksumming and redundancy are DIFFERENT concepts.

In ZFS, EVERYTHING is checksummed - the data blocks and the metadata. This is separate from redundancy. Regardless of the zpool layout (mirrors, raidz, or no redundancy), ZFS stores a checksum of all objects, and this checksum is used to determine whether an object has been corrupted. This check is done on any /read/.

Should the checksum determine that the object is corrupt, one of two things happens. If your zpool has some form of redundancy for that object, ZFS will reread the object from the redundant side of the mirror, or reconstruct the data using parity. It will then re-write the object to another place in the zpool and eliminate the "bad" object. Otherwise, if there is no redundancy, it will fail to return the data and log an error message to the syslog. In the case of metadata, even in a non-redundant zpool, some of that metadata is stored multiple times, so there is the possibility that you will be able to recover/reconstruct some metadata which fails checksumming.

In short, checksumming is how ZFS /determines/ data corruption, and redundancy is how ZFS /fixes/ it. Checksumming is /always/ present, while redundancy depends on the pool layout and options (cf. the "copies" property).

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
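[A concrete way to see the detect-versus-repair split Erik describes: a scrub walks every allocated block, verifies checksums, and repairs from whatever redundancy exists. The pool name is a placeholder:

  zpool scrub tank
  zpool status -v tank   # CKSUM column counts detected errors; anything that could not be repaired is listed
]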
@relling "For small, random read IOPS the performance of a single, top-level vdev is performance = performance of a disk * (N / (N - P)) 133 * 12/(12-1) 133 * 12/11 where, N = number of disks in the vdev P = number of parity devices in the vdev" performance of a disk => Is this a rough estimate of the disk''s IOP? "For example, using 5 disks @ 100 IOPS we get something like: 2-disk mirror: 200 IOPS 4+1 raidz: 125 IOPS 3+2 raidz2: 167 IOPS 2+3 raidz3: 250 IOPS" So if the rated iops on our disks is @133 iops 133 * 12/(12-1) = 145 11+1 raidz: 145 IOPS????? If that''s the rate for a 11+1 raidz vdev, then why is iostat showing about 700 combined IOPS (reads/writes) per disk? r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0 1402.2 7805.3 2.7 36.2 0.2 54.9 0.0 6.0 0 940 c1 10.8 1.0 0.1 0.0 0.0 0.1 0.0 7.0 0 7 c1t0d0 117.1 640.7 0.2 1.8 0.0 4.5 0.0 5.9 1 76 c1t1d0 116.9 638.2 0.2 1.7 0.0 4.6 0.0 6.1 1 78 c1t2d0 116.4 639.1 0.2 1.8 0.0 4.6 0.0 6.0 1 78 c1t3d0 116.6 638.1 0.2 1.7 0.0 4.6 0.0 6.1 1 77 c1t4d0 113.2 638.0 0.2 1.8 0.0 4.6 0.0 6.1 1 77 c1t5d0 116.6 635.3 0.2 1.7 0.0 4.5 0.0 6.0 1 76 c1t6d0 116.2 637.8 0.2 1.8 0.0 4.7 0.0 6.2 1 79 c1t7d0 115.3 636.7 0.2 1.8 0.0 4.4 0.0 5.8 1 77 c1t8d0 115.4 637.8 0.2 1.8 0.0 4.5 0.0 5.9 1 77 c1t9d0 114.8 635.0 0.2 1.8 0.0 4.3 0.0 5.7 1 76 c1t10d0 114.9 639.9 0.2 1.8 0.0 4.7 0.0 6.2 1 78 c1t11d0 115.1 638.7 0.2 1.8 0.0 4.4 0.0 5.9 1 77 c1t12d0 1.6 140.0 0.0 15.1 0.0 0.6 0.0 4.4 0 8 c1t13d0 1.3 9.1 0.0 0.1 0.0 0.0 0.0 1.0 0 0 c1t14d0 -- This message posted from opensolaris.org
On Dec 29, 2009, at 11:26 AM, Brad wrote:

> @relling
> "For small, random read IOPS the performance of a single, top-level vdev is
>     performance = performance of a disk * (N / (N - P))
> where,
>     N = number of disks in the vdev
>     P = number of parity devices in the vdev"
>
> performance of a disk => Is this a rough estimate of the disk's IOPS?
>
> "For example, using 5 disks @ 100 IOPS we get something like:
>     2-disk mirror: 200 IOPS
>     4+1 raidz:     125 IOPS
>     3+2 raidz2:    167 IOPS
>     2+3 raidz3:    250 IOPS"
>
> So if the rated IOPS of our disks is 133:
>     133 * 12/(12-1) = 133 * 12/11 = 145
> 11+1 raidz: 145 IOPS?????
>
> If that's the rate for an 11+1 raidz vdev, then why is iostat showing about 700 combined IOPS (reads/writes) per disk?

Because the model is for small, random read IOPS over the full size of the disk. What you are seeing is caching and seek optimization at work (a good thing). But, AFAIK, there are no decent performance models which take caching into account. In most cases, storage is sized based on empirical studies.
 -- richard

> [iostat listing quoted in the previous message]
On Tue, Dec 29 at 11:14, Erik Trimble wrote:

> Eric D. Mudama wrote:
>> On Tue, Dec 29 at 9:16, Brad wrote:
>> The disk cost of a raidz pool of mirrors is identical to the disk cost of raid10.
>
> ZFS can't do a raidz of mirrors or a mirror of raidz. Members of a mirror or raidz[123] must be a fundamental device (i.e. a file or a drive).

Sorry, typo/thinko... I meant to say a zpool of mirrors, not a raidz pool of mirrors.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
On Dec 29, 2009, at 12:36 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Tue, 29 Dec 2009, Ross Walker wrote:
>> A mirrored raidz provides redundancy at a steep cost to performance, and might I add, a high monetary cost.
>
> I am not sure what a "mirrored raidz" is. I have never heard of such a thing before.
>
>> With raid10, each mirrored pair has the IOPS of a single drive. Since these mirrors are typically 2-disk vdevs, you can have a lot more of them and thus a lot more IOPS (some people talk about using 3-disk mirrors, but that's probably just as good as setting copies=2 on a regular pool of mirrors).
>
> This is another case where using a term like "raid10" does not make sense when discussing zfs. ZFS does not support "raid10". ZFS does not support RAID 0 or RAID 1, so it can't support RAID 1+0 (RAID 10).

Did it again... I understand the difference. I hope it didn't confuse the OP by throwing that out there. What I meant to say was a zpool of mirror vdevs.

> Some important points to consider are that every write to a raidz vdev must be synchronous. In other words, the write needs to complete on all the drives in the stripe before the write may return as complete. This is also true of "RAID 1" (mirrors), which specifies that the drives are perfect duplicates of each other.

I believe mirrored vdevs can do this in parallel, though, while raidz vdevs need to do it serially due to the ordered nature of the transaction, which makes sync writes faster on the mirrors.

> However, zfs does not implement "RAID 1" either. This is easily demonstrated since you can unplug one side of the mirror and the writes to the zfs mirror will still succeed, catching up the mirror which is behind as soon as it is plugged back in. When using mirrors, zfs supports logic which will catch that mirror back up (only sending the missing updates) when connectivity improves. With RAID 1 there is no way to recover a mirror other than a full copy from the other drive.

That's not completely true these days, as a lot of raid implementations use bitmaps to track changed blocks and a raid1 continues to function when the other side disappears. The real difference is that the mirror implementation in ZFS is in the file system and not at an abstracted block-I/O layer, so it is more intelligent in its use and layout.

> Zfs load-shares across vdevs, so it will load-share across mirror vdevs rather than striping (as RAID 10 would require).

Bob, an interesting question was brought up to me about how copies may affect random read performance. I didn't know the answer, but if ZFS knows there are additional copies, would it not also spread the load across those as well, to make sure the wait queues on each spindle are as even as possible?

-Ross
On 29-Dec-09, at 11:53 PM, Ross Walker wrote:

> On Dec 29, 2009, at 12:36 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> ...
>> However, zfs does not implement "RAID 1" either. This is easily demonstrated since you can unplug one side of the mirror and the writes to the zfs mirror will still succeed, catching up the mirror which is behind as soon as it is plugged back in. When using mirrors, zfs supports logic which will catch that mirror back up (only sending the missing updates) when connectivity improves. With RAID 1 there is no way to recover a mirror other than a full copy from the other drive.
>
> That's not completely true these days, as a lot of raid implementations use bitmaps to track changed blocks and a raid1 continues to function when the other side disappears. The real difference is that the mirror implementation in ZFS is in the file system and not at an abstracted block-I/O layer, so it is more intelligent in its use and layout.

Another important difference is that ZFS has the means to know which side of a mirror returned valid data.

--Toby
On Tue, 29 Dec 2009, Ross Walker wrote:

>> Some important points to consider are that every write to a raidz vdev must be synchronous. In other words, the write needs to complete on all the drives in the stripe before the write may return as complete. This is also true of "RAID 1" (mirrors), which specifies that the drives are perfect duplicates of each other.
>
> I believe mirrored vdevs can do this in parallel, though, while raidz vdevs need to do it serially due to the ordered nature of the transaction, which makes sync writes faster on the mirrors.

I don't think that raidz needs to write the stripe serially, but it does need to ensure that the data is committed to the drives before considering the write to be completed. This is due to the nature of the RAID5 stripe, which needs to be completely written. It seems that mirrors are more sloppy, in that writing and committing to one mirror is enough.

> Bob, an interesting question was brought up to me about how copies may affect random read performance. I didn't know the answer, but if ZFS knows there are additional copies, would it not also spread the load across those as well, to make sure the wait queues on each spindle are as even as possible?

Previously we were told that zfs uses a semi-random algorithm to select which mirror side (or copy) to read from. Presumably more (double) performance would be available if zfs was able to precisely schedule and interleave reads from the mirror devices, but perfecting that could be quite challenging. With mirrors, we do see more read performance than one device can provide, but we don't see double.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Wed, Dec 30, 2009 at 12:35 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Tue, 29 Dec 2009, Ross Walker wrote:
>>> Some important points to consider are that every write to a raidz vdev must be synchronous. In other words, the write needs to complete on all the drives in the stripe before the write may return as complete. This is also true of "RAID 1" (mirrors), which specifies that the drives are perfect duplicates of each other.
>>
>> I believe mirrored vdevs can do this in parallel, though, while raidz vdevs need to do it serially due to the ordered nature of the transaction, which makes sync writes faster on the mirrors.
>
> I don't think that raidz needs to write the stripe serially, but it does need to ensure that the data is committed to the drives before considering the write to be completed. This is due to the nature of the RAID5 stripe, which needs to be completely written. It seems that mirrors are more sloppy, in that writing and committing to one mirror is enough.

OK, that makes sense, as long as the metadata is committed, which can happen for mirrors as soon as one side is written, but not for raidz until the whole stripe is written. So I was accurate about the increased latency for raidz, but for the wrong reason.

>> Bob, an interesting question was brought up to me about how copies may affect random read performance. I didn't know the answer, but if ZFS knows there are additional copies, would it not also spread the load across those as well, to make sure the wait queues on each spindle are as even as possible?
>
> Previously we were told that zfs uses a semi-random algorithm to select which mirror side (or copy) to read from. Presumably more (double) performance would be available if zfs was able to precisely schedule and interleave reads from the mirror devices, but perfecting that could be quite challenging. With mirrors, we do see more read performance than one device can provide, but we don't see double.

That isn't quite what I was getting at. Say one has a pool of mirrors and then sets copies=2 on the pool, which in theory should create a second copy of the data on another vdev in the pool. When servicing a read request, is the ZFS scheduler smart enough to read from the second copy if the vdev where the first copy lies is busy with another read/write request? If so, this would increase the read performance of the pool of mirrors at the cost of some write performance.

-Ross
On Dec 30, 2009, at 9:35 AM, Bob Friesenhahn wrote:

> On Tue, 29 Dec 2009, Ross Walker wrote:
>>> Some important points to consider are that every write to a raidz vdev must be synchronous. In other words, the write needs to complete on all the drives in the stripe before the write may return as complete. This is also true of "RAID 1" (mirrors), which specifies that the drives are perfect duplicates of each other.
>>
>> I believe mirrored vdevs can do this in parallel, though, while raidz vdevs need to do it serially due to the ordered nature of the transaction, which makes sync writes faster on the mirrors.
>
> I don't think that raidz needs to write the stripe serially, but it does need to ensure that the data is committed to the drives before considering the write to be completed. This is due to the nature of the RAID5 stripe, which needs to be completely written. It seems that mirrors are more sloppy, in that writing and committing to one mirror is enough.

Yes, though I wouldn't call it "sloppy" ;-) With traditional software RAID, you have to make sure both sides of the mirror are written because you also assume that you can later read either side. For ZFS, if only one side of the mirror is written, you know the bad side is bad because of the checksum. The checksum is owned by the parent, which is an important design decision that applies here, too.

Methinks it might be a good idea to start a comparison wiki to share some of the details...
 -- richard
On Wed, 30 Dec 2009, Richard Elling wrote:

> are written because you also assume that you can later read either side. For ZFS, if only one side of the mirror is written, you know the bad side is bad because of the checksum. The checksum is owned by the parent, which is an important design decision that applies here, too.

In a mirror vdev, each mirror contains a complete copy of the vdev, and the last committed TXG can be used to know whether a mirror is up to date with its peers, or whether it needs a little bit of resilvering to catch up. This differs from a raidz/raidz2 vdev, where most of the disks (N-1 or N-2) need to be available in order to have a complete copy of the vdev.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/