Hello,

I am running the SPEC SFS benchmark [1] on a dual Xeon 2.80GHz box with 4GB of memory. More details: snv_56, zil_disable=1, zfs_arc_max = 0x80000000 (2GB).

Configurations that were tested:
160 dirs / 1 zfs / 1 zpool / 4 SAN LUNs
160 zfs'es / 1 zpool / 4 SAN LUNs
40 zfs'es / 4 zpools / 4 SAN LUNs

One zpool was created on 4 SAN LUNs. The SAN storage array used doesn't honor flush-cache commands. NFSD_SERVERS=1024; NFSv3 over UDP was used.

Max. number of SPEC NFS IOPS obtained: 5K
Max. number of SPEC NFS IOPS obtained earlier for an SVM/VxFS configuration: 24K [2]

So we have almost a five-fold difference. Can we improve this? How can we accelerate this NFS/ZFS setup?

Two serious problems were observed:
1. Degradation of benchmark results on the same setup: the same benchmark gave 4030 IOPS the first time and 2037 IOPS when run a second time.
2. When 4 zpools were used instead of 1, the result degraded about 4 times.

The benchmark report shows an abnormally high share of [b]readdirplus[/b] operations, reaching 50% of the test time; their share in the SFS mix is 9%. Does this point to some known problem? Increasing the DNLC size doesn't help in the ZFS case - I checked this.

I will appreciate your help very much. This testing is part of the preparation for a production deployment. I will provide any additional information that may be needed.

Thank you,
[i]-- leon[/i]

[1] http://www.spec.org/osg/sfs/
[2] http://napobo3.blogspot.com/2006/08/spec-sfs-bencmark-of-zfsufsvxfs.html
[3] http://www.opensolaris.org/jive/thread.jspa?threadID=23263
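For reference, the "160 zfs'es / 1 zpool / 4 SAN LUNs" layout would have been built roughly as in the sketch below; the device names are placeholders, not the actual SAN LUNs used.

# sketch only - placeholder device names
zpool create tank1 c4t0d0 c4t1d0 c4t2d0 c4t3d0
i=1
while [ $i -le 160 ]; do
        zfs create tank1/$i
        i=`expr $i + 1`
done
zfs set sharenfs=on tank1       # inherited by the 160 child filesystems

(The zil_disable and zfs_arc_max values above would typically correspond to /etc/system lines such as "set zfs:zil_disable = 1" and "set zfs:zfs_arc_max = 0x80000000".)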
Leon Koll
2007-Feb-14 09:35 UTC
[zfs-discuss] Re: SPEC SFS benchmark of NFS/ZFS/B56 - please help to improve it!
An update:

I am not sure whether it is related to fragmentation, but I can say that the serious performance degradation in my NFS/ZFS benchmarks is a result of the on-disk ZFS data layout. Read operations on directories (NFSv3 readdirplus) are abnormally time-consuming. That kills the server. After a cold restart of the host the performance is still on the floor.

My conclusion: it's not CPU, not memory - it's the ZFS on-disk structures.
Robert Milkowski
2007-Feb-14 09:43 UTC
[zfs-discuss] Re: SPEC SFS benchmark of NFS/ZFS/B56 - please help to improve it!
Hello Leon,

Wednesday, February 14, 2007, 10:35:05 AM, you wrote:

LK> An update:
LK> I am not sure whether it is related to fragmentation, but I can say
LK> that the serious performance degradation in my NFS/ZFS benchmarks is
LK> a result of the on-disk ZFS data layout.
LK> Read operations on directories (NFSv3 readdirplus) are abnormally
LK> time-consuming. That kills the server. After a cold restart of the
LK> host the performance is still on the floor.
LK> My conclusion: it's not CPU, not memory - it's the ZFS on-disk structures.

Before jumping to any conclusions - first try to eliminate NFS and do the readdirs locally; I guess that would be quite fast. Then check on a client (dtrace) the time distribution of NFS requests and send us the results.

You may also want to fiddle with async clusters on an NFS client to see if it makes any difference.

--
Best regards,
Robert                      mailto:rmilkowski at task.gda.pl
                            http://milek.blogspot.com
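As a concrete starting point for the dtrace suggestion above, something like the following could be run on the client. This is an untested sketch: it times the getdents calls the test program issues rather than individual NFS RPCs, and the execname may need adjusting (e.g. "rdir.x86" for the x86 binary).

dtrace -n '
syscall::getdents*:entry
/execname == "rdir"/
{
        self->ts = timestamp;
}
syscall::getdents*:return
/self->ts/
{
        @["getdents latency (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}'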
Leon Koll
2007-Feb-18 19:29 UTC
[zfs-discuss] Re: Re: SPEC SFS benchmark of NFS/ZFS/B56 - please help to improve it!
Robert wrote:

> Before jumping to any conclusions - first try to
> eliminate NFS and do the readdirs locally; I guess that would be quite fast.
> Then check on a client (dtrace) the time distribution of NFS requests
> and send us the results.

We used a test program that does readdirs and takes one argument, the name of the directory to dig into, like in this example: [b]rdir /mnt[/b]

The program can be downloaded here:
http://tinyurl.com/ywcyyp/rdir.c (source code)
http://tinyurl.com/ywcyyp/rdir (executable for sparc)
http://tinyurl.com/ywcyyp/rdir.x86 (executable for x86)

Results:

1. Local ZFS - we have 160 zfs'es under /tank1: /tank1/1 ... /tank1/160, created during the SFS benchmark run.

# ptime /var/tmp/rdir /tank1
real    [b]1:37.824[/b]
user       1.637
sys       38.498

again:
real    1:27.001
user       1.595
sys       32.146

To avoid an influence of the local runs on the NFS runs:

# zfs unmount -a
# zfs mount -a
# zfs share -a (160 shares)

2. NFS: ssh to the NFS client, create 160 dirs under /mnt, mount /tank1/i on /mnt/i (i=1...160) from the NFS server (see the sketch after this message).

> ptime /var/tmp/rdir /mnt
real    [b]1:48.983[/b]
user       1.096
sys       17.265

again:
real    [b]3:51.001[/b]
user       1.657
sys       27.468

There is definitely a problem - the 2nd NFS run is more than 2 times longer! What is the reason?
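The client-side mount setup in step 2 was done roughly as in this sketch ("nfsserver" is a placeholder for the real server name; the vers/proto options match the NFSv3-over-UDP setup described earlier):

i=1
while [ $i -le 160 ]; do
        mkdir -p /mnt/$i
        mount -F nfs -o vers=3,proto=udp nfsserver:/tank1/$i /mnt/$i
        i=`expr $i + 1`
done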
Roch - PAE
2007-Feb-19 16:42 UTC
[zfs-discuss] Re: SPEC SFS benchmark of NFS/ZFS/B56 - please help to improve it!
Leon Koll writes:

> An update:
>
> I am not sure whether it is related to fragmentation, but I can say that the serious
> performance degradation in my NFS/ZFS benchmarks is a result of the on-disk ZFS data layout.
> Read operations on directories (NFSv3 readdirplus) are abnormally time-consuming.
> That kills the server. After a cold restart of the host the performance is still on the floor.
> My conclusion: it's not CPU, not memory - it's the ZFS on-disk structures.

As I understand the issue, a readdirplus is 2X slower when the data is already cached in the client than when it is not. Given that the on-disk structure does not change between the 2 runs, I can't really place the fault on it.

-r
Leon Koll
2007-Feb-20 11:21 UTC
[zfs-discuss] Re: Re: SPEC SFS benchmark of NFS/ZFS/B56 - please help to improve it!
> As I understand the issue, a readdirplus is
> 2X slower when the data is already cached in the client
> than when it is not.

Yes, that's the issue. It's not always 2X slower, but it is ALWAYS slower. Another two of my runs on NFS/ZFS show:

1.
real    3:14.185
user       2.249
sys       33.083

2.
real    4:47.681
user       2.578
sys       40.733

> Given that the on-disk structure does not change
> between the 2 runs, I can't really place the fault on it.

You are mixing two different tests described in this thread: the first is the spec.org SFS benchmark, which shows bad results on NFS/ZFS even after a reboot, and the second is our own "rdir" program, which was written to understand the SFS problem and exposed the weird/erroneous behaviour of the NFS/ZFS combination.

Thank you for your attention.

[i]-- leon[/i]
Leon Koll
2007-Feb-21 12:14 UTC
[zfs-discuss] Re: Re: SPEC SFS benchmark of NFS/ZFS/B56 - please help to improve it!
A more detailed description of the readdir test, with a conclusion at the end.

Roch asked me:

> Is this a NFS V3 or V4 test or don't care?

I am running NFSv3, but a short test with NFSv4 showed that the problem is there too.

Then Roch asked:

> I've run rdir on a few of my large directories. However my
> large directories are not much larger than ncsize, maybe
> yours are. Do I understand that you hit the issue only upon
> the first large rdir after reboot?

After a reboot of the NFS client (see below).

Then Roch added:

> If so, it might be that we get a speedup from the part of
> the run in which we are initially filling the dnlc cache.
> That could explain the increase in sys time. But the real
> time increase seems too much to be due to this.
>
> Anyway I'm interested in the directory size rdir reports and
> the ncsize/D from mdb -k. Also a third pass through might
> yield a lead.
>
> -r

ncsize has its default value. People told me "don't increase the DNLC size when running ZFS".

# echo 'ncsize/D' | mdb -k
ncsize:
ncsize:         129675

Directory size? There are 160 ZFS'es under zpool tank1; each ZFS is 202MB, 31.5GB in total, 1,224,000 files.

# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
tank1                   382G   31.5G    351G     8%  ONLINE     -

More detailed results:

ZFS local runs - "normal" behavior:
1. 2:33.406
2. 2:25.353
3. 2:27.033

NFSv3/ZFS runs - the first is OK, then they jump up:
1. 3:14.185
2. 4:47.681
3. 4:52.213
4. 4:49.841
5. 4:53.069
6. 4:45.290

After a reboot of the NFS client:
1. 2:56.760
2. 4:43.397

After a reboot of both client and server:
1. real 3:12.841
2. real 4:50.869

After a reboot of the NFS server only:
1. 5:15.048
2. 4:54.686
3. 4:48.713

This means the problem is on the NFS client: after a reboot of the client the first run is "ok", then all the rest are "bad". When the server was rebooted, it didn't help and the results stayed "bad".

Roch replied:

> I'd hypothesize that when the client doesn't know about a file he
> just gets the data and boom. But once he's got a cached copy
> he needs more time to figure out if the data is up to date.
>
> This seems to have been a tradeoff of metadata operations in favor of
> faster data ops (!?).
>
> Note also that SFS doesn't use the client's NFS code. It
> runs its own user-space client.

Given that the described problem is 100% an NFS client problem, there is nothing to do in the ZFS code to improve the situation. And the SFS problem we observed (see the first message in this thread) has nothing in common with this one. Unfortunately, the abnormal behavior of NFS/ZFS during an SFS test didn't get much attention, so I don't have any clue. Anyway, I'll update this thread when I have more information on the problem.
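One way to check the cache-validation hypothesis without rebooting would be to compare the client-side RPC counts around a "good" run and a "bad" run, along these lines (a rough sketch; the interesting lines in the output are the getattr, access and readdirplus counts):

nfsstat -c > /var/tmp/nfs.run1
ptime /var/tmp/rdir /mnt        # first (fast) run
nfsstat -c > /var/tmp/nfs.run2
ptime /var/tmp/rdir /mnt        # second (slow) run
nfsstat -c > /var/tmp/nfs.run3
diff /var/tmp/nfs.run1 /var/tmp/nfs.run2
diff /var/tmp/nfs.run2 /var/tmp/nfs.run3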
Matthew Ahrens
2007-Feb-22 05:09 UTC
[zfs-discuss] Re: Re: SPEC SFS benchmark of NFS/ZFS/B56 - please help to improve it!
Leon Koll wrote:

> Given that the described problem is 100% an NFS client problem, there
> is nothing to do in the ZFS code to improve the situation.

You may want to see if the folks over at nfs-discuss at opensolaris.org have any ideas on your NFS problem.

--matt