Jim Mauro
2009-Oct-24 19:31 UTC
[zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
Posting to zfs-discuss. There's no reason this needs to be
kept confidential.

5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
Seems pointless - they'd be much better off using mirrors,
which is a better choice for random IO...

Looking at this now...

/jim


Jeff Savit wrote:
> Hi all,
>
> I'm looking for suggestions for the following situation: I'm helping
> another SE with a customer using Thumper with a large ZFS pool mostly
> used as an NFS server, and disappointments in performance. The storage
> is an intermediate holding place for data to be fed into a relational
> database, and the statement is that the NFS side can't keep up with
> data feeds written to it as flat files.
>
> The ZFS pool has 8 5-volume RAIDZ2 groups, for 7.3TB of storage, with
> 1.74TB available. Plenty of idle CPU as shown by vmstat and mpstat.
> iostat shows queued I/O and I'm not happy about the total latencies -
> wsvc_t in excess of 75ms at times. Average of ~60KB per read and only
> ~2.5KB per write. Evil Tuning guide tells me that RAIDZ2 is happiest
> for long reads and writes, and this is not the use case here.
>
> I was surprised to see commands like tar, rm, and chown running
> locally on the NFS server, so it looks like they're locally doing file
> maintenance and pruning at the same time it's being accessed remotely.
> That makes sense to me for the short write lengths and for the high
> ZFS ACL activity shown by DTrace. I wonder if there is a lot of sync
> I/O that would benefit from separately defined ZILs (whether SSD or
> not), so I've asked them to look for fsync activity.
>
> Data collected thus far is listed below. I've asked for verification
> of the Solaris 10 level (I believe it's S10u6) and ZFS recordsize.
> Any suggestions will be appreciated.
>
> regards, Jeff
>
> ---- stuff starts here ----
>
> zpool iostat -v gives figures like:
>
> bash-3.00# zpool iostat -v
>                 capacity     operations    bandwidth
> pool          used  avail   read  write   read  write
> ----------   -----  -----  -----  -----  -----  -----
> mdpool       7.32T  1.74T    290    455  1.57M  3.21M
>   raidz2      937G   223G     36     56   201K   411K
>     c0t0d0       -      -     18     40  1.13M   141K
>     c1t0d0       -      -     18     40  1.12M   141K
>     c4t0d0       -      -     18     40  1.13M   141K
>     c6t0d0       -      -     18     40  1.13M   141K
>     c7t0d0       -      -     18     40  1.13M   141K
>
> ---the other 7 raidz2 groups have almost identical numbers on their
> devices---
>
> iostat -iDnxz looks like:
>
>                     extended device statistics
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.1   0   0 c5t0d0
>    15.8   95.9  996.9  233.1  4.3  1.3   38.2   12.0  20  37 c6t0d0
>    16.1   95.6 1018.5  232.4  2.5  2.6   22.2   23.2  16  36 c7t0d0
>    16.1   96.0 1012.5  232.8  2.8  2.9   24.5   26.1  19  38 c4t0d0
>    16.0   93.1 1012.9  242.2  3.6  1.5   33.2   14.2  18  36 c5t1d0
>    15.9   82.2 1000.5  235.0  1.9  1.6   19.2   16.0  12  31 c5t2d0
>    16.6   95.6 1046.7  232.7  2.5  2.7   22.2   23.7  18  37 c0t0d0
>    16.6   96.1 1042.4  232.8  4.7  0.6   42.0    5.2  19  38 c1t0d0
> ...snip...
>    16.5   95.4 1027.2  263.0  5.9  0.4   53.0    3.6  26  40 c0t4d0
>    16.6   95.4 1041.1  263.6  3.9  1.0   34.5    9.3  18  36 c1t4d0
>    16.8   99.1 1060.6  248.6  7.2  0.7   62.0    6.0  32  45 c0t5d0
>    16.5   99.6 1034.7  248.9  8.2  1.1   70.5    9.1  38  48 c1t5d0
>    17.0   82.5 1072.9  219.8  4.8  0.5   48.4    4.7  21  38 c0t6d0
>
> prstat looks like:
>
> bash-3.00# prstat
>    PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
>    815 daemon   3192K 2560K sleep   60  -20  83:10:07 0.6% nfsd/24
>  27918 root     1092K  920K cpu2    37    4   0:01:37 0.2% rm/1
>  19142 root      248M  247M sleep   60    0   1:24:24 0.1% chown/1
>  28794 root     2552K 1304K sleep   59    0   0:00:00 0.1% tar/1
>  29957 root     1192K  908K sleep   59    0   0:57:30 0.1% find/1
>  14737 root     7620K 1964K sleep   59    0   0:03:56 0.0% sshd/1
>  ...
>
> prstat -Lm looks like:
>
> bash-3.00# prstat -Lm
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>  27918 root     0.0 0.9 0.0 0.0 0.0 0.0  99 0.0 194   7  2K   0 rm/1
>  28794 root     0.1 0.6 0.0 0.0 0.0 0.0  99 0.0 209  10 909   0 tar/1
>  19142 root     0.0 0.6 0.0 0.0 0.0 0.0  99 0.0 224   3  1K   0 chown/1
>  29957 root     0.0 0.4 0.0 0.0 0.0 0.0 100 0.0 213   6 420   0 find/1
>    815 daemon   0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 197   0   0   0 nfsd/28230
>    815 daemon   0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 191   0   0   0 nfsd/28222
>    815 daemon   0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 185   0   0   0 nfsd/28211
> ---many more nfsd lines of similar appearance---
>
> A small DTrace script for ZFS gives me:
>
> # dtrace -n 'fbt::zfs*:entry{@[pid,execname,probefunc] = count()} END
> {trunc(@,20); printa(@)}'
> ^C
> ...some lines trimmed...
>  28835  tar  zfs_dirlook                  67761
>  28835  tar  zfs_lookup                   67761
>  28835  tar  zfs_zaccess                  69166
>  28835  tar  zfs_dirent_lock              71083
>  28835  tar  zfs_dirent_unlock            71084
>  28835  tar  zfs_zaccess_common
>  28835  tar  zfs_acl_node_read            77251
>  28835  tar  zfs_acl_node_read_internal   77251
>  28835  tar  zfs_acl_alloc                78656
>  28835  tar  zfs_acl_free                 78656
>  27918  rm   zfs_acl_alloc                85888
>  27918  rm   zfs_acl_free                 85888
>  27918  rm   zfs_acl_node_read            85888
>  27918  rm   zfs_acl_node_read_internal   85888
>  27918  rm   zfs_zaccess_common           85888
>
> --
> Jeff Savit
> Principal Field Technologist
> Sun Microsystems, Inc.    Phone: 732-537-3451 (x63451)
> 2398 E Camelback Rd       Email: jeff.savit at sun.com
> Phoenix, AZ 85016         http://blogs.sun.com/jsavit/
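A minimal DTrace sketch of the fsync check mentioned above, assuming a stock Solaris 10 system (on Solaris, fsync(3C) enters the kernel via the fdsync syscall, and synchronous ZFS writes generally funnel through zil_commit()):

# Count fsync calls and ZIL commits per process; run under load for a
# minute or so, then Ctrl-C to print the aggregations.
dtrace -n 'syscall::fdsync:entry { @fsync[execname] = count(); }
           fbt::zil_commit:entry { @zil[execname]   = count(); }'

A large zil_commit count alongside the tiny ~2.5KB average write size would support trying a separate (SSD) log device.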
Jeff Savit
2009-Oct-24 20:13 UTC
[zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
On 10/24/09 12:31 PM, Jim Mauro wrote:
> Posting to zfs-discuss. There's no reason this needs to be
> kept confidential.

Okay.

> 5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
> Seems pointless - they'd be much better off using mirrors,
> which is a better choice for random IO...

Hmm, they're giving up so much % capacity as is, they could just as well give up some more and get better performance. Great idea!

--
Jeff Savit
Principal Field Technologist
Sun Microsystems, Inc.    Phone: 732-537-3451 (x63451)
2398 E Camelback Rd       Email: jeff.savit at sun.com
Phoenix, AZ 85016         http://blogs.sun.com/jsavit/
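For the sake of illustration, a mirrored layout on the same drives would be built roughly as follows. This is only a sketch - the device names are borrowed from the iostat output above, the command as shown creates just the first three pairs, and it would of course mean rebuilding the pool from scratch:

# Hypothetical: 2-way mirrors instead of 8x 5-disk raidz2 vdevs,
# pairing drives across controllers. Device names are illustrative.
zpool create mdpool \
    mirror c0t0d0 c1t0d0 \
    mirror c4t0d0 c6t0d0 \
    mirror c7t0d0 c5t1d0
# ...then keep appending "mirror <diskA> <diskB>" pairs for the rest
# of the 40 drives, giving 20 mirrors in total.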
Albert Chin
2009-Oct-24 23:07 UTC
[zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
On Sat, Oct 24, 2009 at 03:31:25PM -0400, Jim Mauro wrote:
> Posting to zfs-discuss. There's no reason this needs to be
> kept confidential.
>
> 5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
> Seems pointless - they'd be much better off using mirrors,
> which is a better choice for random IO...

Is it really pointless? Maybe they want the insurance RAIDZ2 provides. Given the choice between insurance and performance, I'll take insurance, though it depends on your use case. We're using 5-disk RAIDZ2 vdevs. While I want the performance a mirrored vdev would give, it scares me that you're just one drive away from a failed pool. Of course, you could have two mirrors in each vdev but I don't want to sacrifice that much space. However, over the last two years, we haven't had any demonstrable failures that would give us cause for concern. But, it's still unsettling.

Would love to hear other opinions on this.

> Looking at this now...
>
> /jim
>
> Jeff Savit wrote:
>> Hi all,
>>
>> I'm looking for suggestions for the following situation: I'm helping
>> another SE with a customer using Thumper with a large ZFS pool mostly
>> used as an NFS server, and disappointments in performance. The storage
>> is an intermediate holding place for data to be fed into a relational
>> database, and the statement is that the NFS side can't keep up with
>> data feeds written to it as flat files.
>>
>> The ZFS pool has 8 5-volume RAIDZ2 groups, for 7.3TB of storage, with
>> 1.74TB available. Plenty of idle CPU as shown by vmstat and mpstat.
>> iostat shows queued I/O and I'm not happy about the total latencies -
>> wsvc_t in excess of 75ms at times. Average of ~60KB per read and only
>> ~2.5KB per write. Evil Tuning guide tells me that RAIDZ2 is happiest
>> for long reads and writes, and this is not the use case here.
>>
>> I was surprised to see commands like tar, rm, and chown running
>> locally on the NFS server, so it looks like they're locally doing file
>> maintenance and pruning at the same time it's being accessed remotely.
>> That makes sense to me for the short write lengths and for the high
>> ZFS ACL activity shown by DTrace. I wonder if there is a lot of sync
>> I/O that would benefit from separately defined ZILs (whether SSD or
>> not), so I've asked them to look for fsync activity.
>>
>> Data collected thus far is listed below. I've asked for verification
>> of the Solaris 10 level (I believe it's S10u6) and ZFS recordsize.
>> Any suggestions will be appreciated.
>>
>> regards, Jeff

--
albert chin (china at thewrittenword.com)
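The space arithmetic behind the trade-off, assuming equal-sized drives and ignoring metadata overhead, works out as follows:

    5-disk raidz2:   3 data drives out of 5  =  60% of raw capacity
    2-way mirrors:   1 data drive  out of 2  =  50% of raw capacity
    3-way mirrors:   1 data drive  out of 3  = ~33% of raw capacity

So going from 5-disk raidz2 vdevs to plain 2-way mirrors gives up about ten percentage points of raw capacity, while adding a third side to every mirror costs considerably more.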
Bob Friesenhahn
2009-Oct-25 01:10 UTC
[zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
On Sat, 24 Oct 2009, Albert Chin wrote:
>>
>> 5-disk RAIDZ2 - doesn't that equate to only 3 data disks?
>> Seems pointless - they'd be much better off using mirrors,
>> which is a better choice for random IO...
>
> Is it really pointless? Maybe they want the insurance RAIDZ2
> provides. Given the choice between insurance and performance, I'll
> take insurance, though it depends on your use case. We're using
> 5-disk RAIDZ2 vdevs. While I want the performance a mirrored vdev
> would give, it scares me that you're just one drive away from a
> failed pool. Of course, you could have two mirrors in each vdev but
> I don't want to sacrifice that much space. However, over the last
> two years, we haven't had any demonstrable failures that would
> give us cause for concern. But, it's still unsettling.

I am using duplex mirrors here even though, once a drive fails, the pool is just one more drive failure away from being lost. I do feel that it is safer than raidz1 because resilvering is much less complex, so there is less to go wrong, and the resilver time should be the best possible. For heavy multi-user use (like this Sun customer has) it is impossible to beat the mirrored configuration for performance.

If the I/O load is heavy and the storage is "an intermediate holding place for data", then it makes sense to use mirrors. If it was for long-term archival storage, then raidz2 would make more sense.

>>> iostat shows queued I/O and I'm not happy about the total latencies -
>>> wsvc_t in excess of 75ms at times. Average of ~60KB per read and only
>>> ~2.5KB per write. Evil Tuning guide tells me that RAIDZ2 is happiest
>>> for long reads and writes, and this is not the use case here.

~2.5KB per write is definitely problematic. NFS writes are usually synchronous, so this is using up the available IOPS, and consuming them at a 5X elevated rate with a 5-disk raidz2. It seems that an SSD for the intent log would help quite a lot in this situation, so that ZFS can aggregate the writes. If the typical writes are small, it would also help to reduce the filesystem blocksize to 8K.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
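Bob's two suggestions translate into commands along these lines - a sketch only, since the SSD device name and filesystem name below are placeholders, and a separate log device requires a pool version that supports it (S10u6 does):

# Hypothetical: add an SSD as a separate intent log (slog), so small
# synchronous NFS writes no longer land on the raidz2 vdevs directly.
zpool add mdpool log c5t4d0           # "c5t4d0" is a placeholder SSD

# Hypothetical: shrink the recordsize on the busy filesystem; this
# affects only files written after the change.
zfs set recordsize=8k mdpool/nfsdata  # "mdpool/nfsdata" is a placeholder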
Marion Hakanson
2009-Oct-26 19:10 UTC
[zfs-discuss] Performance problems with Thumper and >7TB ZFS pool using RAIDZ2
opensolaris-zfs-discuss at mlists.thewrittenword.com said:
> Is it really pointless? Maybe they want the insurance RAIDZ2 provides. Given
> the choice between insurance and performance, I'll take insurance, though it
> depends on your use case. We're using 5-disk RAIDZ2 vdevs.
> . . .
> Would love to hear other opinions on this.

Hi again Albert,

On our Thumper, we use 7x 6-disk raidz2's (750GB drives). It seems a good compromise between capacity, IOPS, and data protection. Like you, we are afraid of the possibility of a 2nd disk failure during resilvering of these large drives.

Our usage is a mix of disk-to-disk-to-tape backups, archival, and multi-user (tens of users) NFS/SFTP service, in roughly that order of load. We have had no performance problems with this layout.

Regards,

Marion
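For comparison, the rough capacity of that layout (assuming 750GB drives and ignoring filesystem overhead): 7 vdevs x 4 data drives x 750GB comes to about 21TB usable out of 31.5TB raw, i.e. roughly two thirds of raw capacity, versus 60% for 5-disk raidz2 vdevs and 50% for 2-way mirrors.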