Jim Mauro
2009-Oct-24 19:31 UTC
[zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
Posting to zfs-discuss. There's no reason this needs to be
kept confidential.

5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
Seems pointless - they'd be much better off using mirrors,
which is a better choice for random IO...

Looking at this now...

/jim


Jeff Savit wrote:
> Hi all,
>
> I'm looking for suggestions for the following situation: I'm helping
> another SE with a customer using Thumper with a large ZFS pool mostly
> used as an NFS server, and disappointments in performance. The storage
> is an intermediate holding place for data to be fed into a relational
> database, and the statement is that the NFS side can't keep up with
> data feeds written to it as flat files.
>
> The ZFS pool has 8 5-volume RAIDZ2 groups, for 7.3TB of storage, with
> 1.74TB available. Plenty of idle CPU as shown by vmstat and mpstat.
> iostat shows queued I/O and I'm not happy about the total latencies -
> wsvc_t in excess of 75ms at times. Average of ~60KB per read and only
> ~2.5KB per write. Evil Tuning guide tells me that RAIDZ2 is happiest
> for long reads and writes, and this is not the use case here.
>
> I was surprised to see commands like tar, rm, and chown running
> locally on the NFS server, so it looks like they're locally doing file
> maintenance and pruning at the same time it's being accessed remotely.
> That makes sense to me for the short write lengths and for the high
> ZFS ACL activity shown by DTrace. I wonder if there is a lot of sync
> I/O that would benefit from separately defined ZILs (whether SSD or
> not), so I've asked them to look for fsync activity.
>
> Data collected thus far is listed below. I've asked for verification
> of the Solaris 10 level (I believe it's S10u6) and ZFS recordsize.
> Any suggestions will be appreciated.
>
> regards, Jeff
>
> ---- stuff starts here ----
>
> zpool iostat -v gives figures like:
>
> bash-3.00# zpool iostat -v
>                 capacity     operations    bandwidth
> pool          used  avail   read  write   read  write
> ----------   -----  -----  -----  -----  -----  -----
> mdpool       7.32T  1.74T    290    455  1.57M  3.21M
>   raidz2      937G   223G     36     56   201K   411K
>     c0t0d0       -      -     18     40  1.13M   141K
>     c1t0d0       -      -     18     40  1.12M   141K
>     c4t0d0       -      -     18     40  1.13M   141K
>     c6t0d0       -      -     18     40  1.13M   141K
>     c7t0d0       -      -     18     40  1.13M   141K
>
> ---the other 7 raidz2 groups have almost identical numbers on their
> devices---
>
> iostat -iDnxz looks like:
>
>                     extended device statistics
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.1   0   0 c5t0d0
>    15.8   95.9  996.9  233.1  4.3  1.3   38.2   12.0  20  37 c6t0d0
>    16.1   95.6 1018.5  232.4  2.5  2.6   22.2   23.2  16  36 c7t0d0
>    16.1   96.0 1012.5  232.8  2.8  2.9   24.5   26.1  19  38 c4t0d0
>    16.0   93.1 1012.9  242.2  3.6  1.5   33.2   14.2  18  36 c5t1d0
>    15.9   82.2 1000.5  235.0  1.9  1.6   19.2   16.0  12  31 c5t2d0
>    16.6   95.6 1046.7  232.7  2.5  2.7   22.2   23.7  18  37 c0t0d0
>    16.6   96.1 1042.4  232.8  4.7  0.6   42.0    5.2  19  38 c1t0d0
> ...snip...
>    16.5   95.4 1027.2  263.0  5.9  0.4   53.0    3.6  26  40 c0t4d0
>    16.6   95.4 1041.1  263.6  3.9  1.0   34.5    9.3  18  36 c1t4d0
>    16.8   99.1 1060.6  248.6  7.2  0.7   62.0    6.0  32  45 c0t5d0
>    16.5   99.6 1034.7  248.9  8.2  1.1   70.5    9.1  38  48 c1t5d0
>    17.0   82.5 1072.9  219.8  4.8  0.5   48.4    4.7  21  38 c0t6d0
>
> prstat looks like:
>
> bash-3.00# prstat
>    PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
>    815 daemon   3192K 2560K sleep   60  -20  83:10:07 0.6% nfsd/24
>  27918 root     1092K  920K cpu2    37    4   0:01:37 0.2% rm/1
>  19142 root      248M  247M sleep   60    0   1:24:24 0.1% chown/1
>  28794 root     2552K 1304K sleep   59    0   0:00:00 0.1% tar/1
>  29957 root     1192K  908K sleep   59    0   0:57:30 0.1% find/1
>  14737 root     7620K 1964K sleep   59    0   0:03:56 0.0% sshd/1
>  ...
>
> prstat -Lm looks like:
>
> bash-3.00# prstat -Lm
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>  27918 root     0.0 0.9 0.0 0.0 0.0 0.0  99 0.0 194   7  2K   0 rm/1
>  28794 root     0.1 0.6 0.0 0.0 0.0 0.0  99 0.0 209  10 909   0 tar/1
>  19142 root     0.0 0.6 0.0 0.0 0.0 0.0  99 0.0 224   3  1K   0 chown/1
>  29957 root     0.0 0.4 0.0 0.0 0.0 0.0 100 0.0 213   6 420   0 find/1
>    815 daemon   0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 197   0   0   0 nfsd/28230
>    815 daemon   0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 191   0   0   0 nfsd/28222
>    815 daemon   0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 185   0   0   0 nfsd/28211
> ---many more nfsd lines of similar appearance---
>
> A small DTrace script for ZFS gives me:
>
> # dtrace -n 'fbt::zfs*:entry{@[pid,execname,probefunc] = count()} END
> {trunc(@,20); printa(@)}'
> ^C
> ...some lines trimmed...
>  28835  tar  zfs_dirlook                  67761
>  28835  tar  zfs_lookup                   67761
>  28835  tar  zfs_zaccess                  69166
>  28835  tar  zfs_dirent_lock              71083
>  28835  tar  zfs_dirent_unlock            71084
>  28835  tar  zfs_zaccess_common
>  28835  tar  zfs_acl_node_read            77251
>  28835  tar  zfs_acl_node_read_internal   77251
>  28835  tar  zfs_acl_alloc                78656
>  28835  tar  zfs_acl_free                 78656
>  27918  rm   zfs_acl_alloc                85888
>  27918  rm   zfs_acl_free                 85888
>  27918  rm   zfs_acl_node_read            85888
>  27918  rm   zfs_acl_node_read_internal   85888
>  27918  rm   zfs_zaccess_common           85888
>
> --
> Jeff Savit
> Principal Field Technologist
> Sun Microsystems, Inc.    Phone: 732-537-3451 (x63451)
> 2398 E Camelback Rd       Email: jeff.savit at sun.com
> Phoenix, AZ 85016         http://blogs.sun.com/jsavit/
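A minimal DTrace sketch of the fsync check mentioned above, assuming a stock Solaris 10 system (on Solaris, fsync(3C) enters the kernel via the fdsync syscall, and synchronous ZFS writes generally funnel through zil_commit()):

# Count fsync calls and ZIL commits per process; run under load for a
# minute or so, then Ctrl-C to print the aggregations.
dtrace -n 'syscall::fdsync:entry { @fsync[execname] = count(); }
           fbt::zil_commit:entry { @zil[execname]   = count(); }'

A large zil_commit count alongside the tiny ~2.5KB average write size would support trying a separate (SSD) log device.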
Jeff Savit
2009-Oct-24 20:13 UTC
[zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
On 10/24/09 12:31 PM, Jim Mauro wrote:
> Posting to zfs-discuss. There's no reason this needs to be
> kept confidential.

Okay.

> 5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
> Seems pointless - they'd be much better off using mirrors,
> which is a better choice for random IO...

Hmm, they're giving up so much % capacity as is, they could just as well give up some more and get better performance. Great idea!

--
Jeff Savit
Principal Field Technologist
Sun Microsystems, Inc.    Phone: 732-537-3451 (x63451)
2398 E Camelback Rd       Email: jeff.savit at sun.com
Phoenix, AZ 85016         http://blogs.sun.com/jsavit/
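For the sake of illustration, a mirrored layout on the same drives would be built roughly as follows. This is only a sketch - the device names are borrowed from the iostat output above, the command as shown creates just the first three pairs, and it would of course mean rebuilding the pool from scratch:

# Hypothetical: 2-way mirrors instead of 8x 5-disk raidz2 vdevs,
# pairing drives across controllers. Device names are illustrative.
zpool create mdpool \
    mirror c0t0d0 c1t0d0 \
    mirror c4t0d0 c6t0d0 \
    mirror c7t0d0 c5t1d0
# ...then keep appending "mirror <diskA> <diskB>" pairs for the rest
# of the 40 drives, giving 20 mirrors in total.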
Albert Chin
2009-Oct-24 23:07 UTC
[zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
On Sat, Oct 24, 2009 at 03:31:25PM -0400, Jim Mauro wrote:
> Posting to zfs-discuss. There's no reason this needs to be
> kept confidential.
>
> 5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
> Seems pointless - they'd be much better off using mirrors,
> which is a better choice for random IO...

Is it really pointless? Maybe they want the insurance RAIDZ2 provides. Given the choice between insurance and performance, I'll take insurance, though it depends on your use case. We're using 5-disk RAIDZ2 vdevs. While I want the performance a mirrored vdev would give, it scares me that you're just one drive away from a failed pool. Of course, you could have two mirrors in each vdev but I don't want to sacrifice that much space. However, over the last two years, we haven't had any demonstrable failures that would give us cause for concern. But, it's still unsettling.

Would love to hear other opinions on this.

> Looking at this now...
>
> /jim
>
> Jeff Savit wrote:
>> Hi all,
>>
>> I'm looking for suggestions for the following situation: I'm helping
>> another SE with a customer using Thumper with a large ZFS pool mostly
>> used as an NFS server, and disappointments in performance. The storage
>> is an intermediate holding place for data to be fed into a relational
>> database, and the statement is that the NFS side can't keep up with
>> data feeds written to it as flat files.
>>
>> The ZFS pool has 8 5-volume RAIDZ2 groups, for 7.3TB of storage, with
>> 1.74TB available. Plenty of idle CPU as shown by vmstat and mpstat.
>> iostat shows queued I/O and I'm not happy about the total latencies -
>> wsvc_t in excess of 75ms at times. Average of ~60KB per read and only
>> ~2.5KB per write. Evil Tuning guide tells me that RAIDZ2 is happiest
>> for long reads and writes, and this is not the use case here.
>>
>> I was surprised to see commands like tar, rm, and chown running
>> locally on the NFS server, so it looks like they're locally doing file
>> maintenance and pruning at the same time it's being accessed remotely.
>> That makes sense to me for the short write lengths and for the high
>> ZFS ACL activity shown by DTrace. I wonder if there is a lot of sync
>> I/O that would benefit from separately defined ZILs (whether SSD or
>> not), so I've asked them to look for fsync activity.
>>
>> Data collected thus far is listed below. I've asked for verification
>> of the Solaris 10 level (I believe it's S10u6) and ZFS recordsize.
>> Any suggestions will be appreciated.
>>
>> regards, Jeff

--
albert chin (china at thewrittenword.com)
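The space arithmetic behind the trade-off, assuming equal-sized drives and ignoring metadata overhead, works out as follows:

    5-disk raidz2:   3 data drives out of 5  =  60% of raw capacity
    2-way mirrors:   1 data drive  out of 2  =  50% of raw capacity
    3-way mirrors:   1 data drive  out of 3  = ~33% of raw capacity

So going from 5-disk raidz2 vdevs to plain 2-way mirrors gives up about ten percentage points of raw capacity, while adding a third side to every mirror costs considerably more.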
Bob Friesenhahn
2009-Oct-25 01:10 UTC
[zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
On Sat, 24 Oct 2009, Albert Chin wrote:
>>
>> 5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
>> Seems pointless - they'd be much better off using mirrors,
>> which is a better choice for random IO...
>
> Is it really pointless? Maybe they want the insurance RAIDZ2
> provides. Given the choice between insurance and performance, I'll
> take insurance, though it depends on your use case. We're using
> 5-disk RAIDZ2 vdevs. While I want the performance a mirrored vdev
> would give, it scares me that you're just one drive away from a
> failed pool. Of course, you could have two mirrors in each vdev but
> I don't want to sacrifice that much space. However, over the last
> two years, we haven't had any demonstrable failures that would
> give us cause for concern. But, it's still unsettling.

I am using duplex mirrors here even though, once a drive fails, the pool is just one more drive failure away from being lost. I do feel that it is safer than raidz1 because resilvering is much less complex, so there is less to go wrong, and the resilver time should be the best possible. For heavy multi-user use (like this Sun customer has) it is impossible to beat the mirrored configuration for performance.

If the I/O load is heavy and the storage is "an intermediate holding place for data", then it makes sense to use mirrors. If it was for long-term archival storage, then raidz2 would make more sense.

>>> iostat shows queued I/O and I'm not happy about the total latencies -
>>> wsvc_t in excess of 75ms at times. Average of ~60KB per read and only
>>> ~2.5KB per write. Evil Tuning guide tells me that RAIDZ2 is happiest
>>> for long reads and writes, and this is not the use case here.

~2.5KB per write is definitely problematic. NFS writes are usually synchronous, so this is using up the available IOPS, and consuming them at a 5X elevated rate with a 5-disk raidz2. It seems that an SSD for the intent log would help quite a lot in this situation, so that ZFS can aggregate the writes. If the typical writes are small, it would also help to reduce the filesystem blocksize to 8K.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
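Bob's two suggestions translate into commands along these lines - a sketch only, since the SSD device name and filesystem name below are placeholders, and a separate log device requires a pool version that supports it (S10u6 does):

# Hypothetical: add an SSD as a separate intent log (slog), so small
# synchronous NFS writes no longer land on the raidz2 vdevs directly.
zpool add mdpool log c5t4d0           # "c5t4d0" is a placeholder SSD

# Hypothetical: shrink the recordsize on the busy filesystem; this
# affects only files written after the change.
zfs set recordsize=8k mdpool/nfsdata  # "mdpool/nfsdata" is a placeholder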
Marion Hakanson
2009-Oct-26 19:10 UTC
[zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
opensolaris-zfs-discuss at mlists.thewrittenword.com said:
> Is it really pointless? Maybe they want the insurance RAIDZ2 provides. Given
> the choice between insurance and performance, I'll take insurance, though it
> depends on your use case. We're using 5-disk RAIDZ2 vdevs.
> . . .
> Would love to hear other opinions on this.

Hi again Albert,

On our Thumper, we use 7x 6-disk raidz2's (750GB drives). It seems a good compromise between capacity, IOPS, and data protection. Like you, we are afraid of the possibility of a 2nd disk failure during resilvering of these large drives.

Our usage is a mix of disk-to-disk-to-tape backups, archival, and multi-user (tens of users) NFS/SFTP service, in roughly that order of load. We have had no performance problems with this layout.

Regards,

Marion
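For comparison, the rough capacity of that layout (assuming 750GB drives and ignoring filesystem overhead): 7 vdevs x 4 data drives x 750GB comes to about 21TB usable out of 31.5TB raw, i.e. roughly two thirds of raw capacity, versus 60% for 5-disk raidz2 vdevs and 50% for 2-way mirrors.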