Joseph Little wrote:
> I'd love to "vote" to have this addressed, but apparently votes for
> bugs are not available to outsiders.
>
> What's limiting Stanford EE's move to using ZFS entirely for our
> snapshotting filesystems and multi-tier storage is the inability to
> access .zfs directories, and snapshots in particular, on NFSv3 clients.
> We simply can't move the majority of clients on various OSes to NFSv4
> overnight, or even within the year (vendors aren't all there yet).
>
> Is there any progress on fixing Solaris 11/OpenSolaris nfsd to
> support ZFS with NFSv3?
>
> I've also noticed in our testing that NFS and ZFS don't perform as
> well together as Solaris NFS and UFS. Or perhaps it's just the RAID-Z
> use underlying my ZFS tests. Either way, I do hope that this is tuned
> over time!

Good to see a high demand for it... and funny you should ask... we're
working on it right now. Here's the bug:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6344186

As for NFS-on-ZFS performance, that is also being actively worked on -
one bug in particular that's holding back (at least) SPECsfs results is:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6350001

6350001 improves lookup performance, and should be putback very soon.

Neil has some ZIL and range lock changes coming, Matt and Noel have some
ZAP perf changes in the pipeline, and Mark's ARC changes are going to be
quite, quite nice.

(and now I see Gordon beat me to replying - Gordon!)

eric
> 6350001 improves lookup performance, and should be putback very soon.
>
> Neil has some ZIL and range lock changes coming, Matt and Noel have some
> ZAP perf changes in the pipeline, and Mark's ARC changes are going to be
> quite, quite nice.

It seems all of this is included in on-20060228 now. OpenSolaris "nightly"
build times have improved quite a bit on a system that has almost everything
on compressed ZFS (with the exception of the "/" root). Before on-20060228,
my laptop needed 4.5 - 5.5 hours for a nightly build; now I'm at ~3.5 hours.
That's about the same time needed when everything was on UFS.
On Fri, Mar 03, 2006 at 06:35:19AM -0800, Jürgen Keil wrote:
> > 6350001 improves lookup performance, and should be putback very soon.
> >
> > Neil has some ZIL and range lock changes coming, Matt and Noel have some
> > ZAP perf changes in the pipeline, and Mark's ARC changes are going to be
> > quite, quite nice.
>
> It seems all of this is included in on-20060228 now. OpenSolaris "nightly"
> build times have improved quite a bit on a system that has almost everything
> on compressed ZFS (with the exception of the "/" root). Before on-20060228,
> my laptop needed 4.5 - 5.5 hours for a nightly build; now I'm at ~3.5 hours.
> That's about the same time needed when everything was on UFS.

Build 33 contained the following CLI fixes:

6377671 zfs mount -a shouldn't bother checking snapshots
6378361 'zfs share -a' needs to avoid expensive checks during boot
6378377 zfs_get_stats() is way too expensive
6378388 zfs_for_each() iterates unnecessarily

Build 34 contained the following fix from Neil, which improved mount,
unmount, and several other less common, but still important, code paths:

6377670 zil_replay() does unnecessary txg_wait_synced(), slowing down mount

Build 35 contains the following performance-related fixes from Eric K.,
Neil, and Noel:

6350001 ZFS lookup performance still much slower than UFS : help tar : help specSFS
6381994 zfs_putpage() serializes I/O unnecessarily
6389368 fat zap should use 16k blocks (with backwards compatibility)

The fix for 6381994 is especially important for ON builds due to the way
in which ld(1) writes files - a microbenchmark of msync() over a 10MB
region shows a 43x improvement!

Build 36 will contain fixes for:

6284889 arc should replace the znode cache
6333092 concurrent reads to a file not scaling with number of readers

These are "Mark's ARC changes" and "Neil's range lock changes". The
former is especially important, as it has wide-ranging implications for
all benchmarks. For example, I measured a 'zfs list' performance boost
of 83x.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
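[For readers who want to reproduce the msync() measurement Eric mentions,
here is a minimal sketch of that kind of microbenchmark: dirty a 10MB
file-backed mapping, then time one synchronous msync(). The file path and
the use of Solaris gethrtime() are assumptions for illustration; the
actual benchmark code was not posted.]

    /* msync_bench.c - hypothetical sketch, not the original benchmark */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/time.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            size_t len = 10 * 1024 * 1024;  /* 10MB region */
            char *p;
            hrtime_t t0;
            int fd;

            fd = open("/data/msync_test", O_RDWR | O_CREAT | O_TRUNC, 0644);
            if (fd < 0 || ftruncate(fd, (off_t)len) < 0) {
                    perror("setup");
                    return (1);
            }
            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return (1);
            }
            memset(p, 'x', len);            /* dirty every page */

            t0 = gethrtime();
            if (msync(p, len, MS_SYNC) != 0) /* push the dirty pages out */
                    perror("msync");
            printf("msync(10MB) took %lld us\n",
                (long long)(gethrtime() - t0) / 1000);
            return (0);
    }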
On Fri, Mar 03, 2006 at 08:27:17AM -0800, Eric Schrock wrote:
> 6284889 arc should replace the znode cache
> 6333092 concurrent reads to a file not scaling with number of readers
>
> These are "Mark's ARC changes" and "Neil's range lock changes". The
> former is especially important, as it has wide-ranging implications for
> all benchmarks. For example, I measured a 'zfs list' performance boost
> of 83x.

Neil points out that I confused 6333092 with his range lock fixes. The
above are both Mark's fixes. Neil's range locking should also make build
36, but is tracked by:

6343608 ZFS file range locking
6365101 zfs: copying from NFS to ZFS makes ksh response very sluggish

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Hello Eric,

Friday, March 3, 2006, 5:27:17 PM, you wrote:

ES> Build 34 contained the following fix from Neil which improved mount,
ES> unmount, and several other less common, but still important, code paths.
ES>
ES> 6377670 zil_replay() does unnecessary txg_wait_synced(), slowing down mount

Does it solve the problem with long zfs mount times after a system crash?
Does it mean that the fs will be mounted and available, and checked in
the background?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Robert Milkowski wrote On 03/06/06 04:37,:
> Hello Eric,
>
> Friday, March 3, 2006, 5:27:17 PM, you wrote:
>
> ES> Build 34 contained the following fix from Neil which improved mount,
> ES> unmount, and several other less common, but still important, code paths.
>
> ES> 6377670 zil_replay() does unnecessary txg_wait_synced(), slowing down mount
>
> Does it solve the problem with long zfs mount times after a system crash?

The previous code always did a txg_wait_synced() on mount. The new code
only calls txg_wait_synced() when there are intent log records to replay.
This will only happen if there was synchronous activity (e.g. fsync or
O_DSYNC writes) which wasn't committed to the pool before the crash or
power outage. This should be rare and doesn't take that long.

> Does it mean that the fs will be mounted and available, and checked in
> the background?

The fs is not mounted until the log replay (if any) processing has
occurred. It does not happen in the background.

-- Neil
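[A small sketch of the two kinds of "synchronous activity" Neil mentions
- an O_DSYNC write and an explicit fsync(). Either one generates intent
log records that would have to be replayed on the next mount after a
crash. The file names below are made up for illustration.]

    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            char buf[512] = { 0 };
            int fd;

            /* Case 1: O_DSYNC makes every write() synchronous. */
            fd = open("/tank/fs/a.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
            if (fd >= 0) {
                    (void) write(fd, buf, sizeof (buf));
                    (void) close(fd);
            }

            /* Case 2: a buffered write followed by an explicit fsync(). */
            fd = open("/tank/fs/b.dat", O_WRONLY | O_CREAT, 0644);
            if (fd >= 0) {
                    (void) write(fd, buf, sizeof (buf));
                    (void) fsync(fd);
                    (void) close(fd);
            }
            return (0);
    }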
On Mon, Mar 06, 2006 at 12:37:34PM +0100, Robert Milkowski wrote:
> Does it solve the problem with long zfs mount times after a system crash?

Yes. But if you have a filesystem making excessive use of O_DSYNC (a
database, for example), you will still have to wait to push out the
intent log transactions. If there is no intent log data (the common
case) mount will take on the order of 50 milliseconds. If there is
intent log data, it will take on the order of 400 milliseconds. If I
remember correctly from when I did these experiments, of course...

> Does it mean that the fs will be mounted and available, and checked in
> the background?

ZFS filesystems do not need to be "checked". If you suspect that your
data may be corrupted due to bad hardware, you can always run "zpool
scrub", an online operation that verifies the checksums of all your
data.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Hello Eric,

Monday, March 6, 2006, 5:56:13 PM, you wrote:

ES> On Mon, Mar 06, 2006 at 12:37:34PM +0100, Robert Milkowski wrote:
>> Does it solve the problem with long zfs mount times after a system crash?

ES> Yes. But if you have a filesystem making excessive use of O_DSYNC (a
ES> database, for example), you will still have to wait to push out the
ES> intent log transactions. If there is no intent log data (the common
ES> case) mount will take on the order of 50 milliseconds. If there is
ES> intent log data, it will take on the order of 400 milliseconds. If I
ES> remember correctly from when I did these experiments, of course...

>> Does it mean that the fs will be mounted and available, and checked in
>> the background?

ES> ZFS filesystems do not need to be "checked". If you suspect that your
ES> data may be corrupted due to bad hardware, you can always run "zpool
ES> scrub", an online operation that verifies the checksums of all your
ES> data.

It's not that I suspect bad data; it's just that sometimes, when my
server with ZFS on it crashes, during system boot I can see that the zfs
filesystems are being mounted, and depending on the host I can see
95MB/s being read from the disks in use by zfs for about 10-20 minutes,
or 5-10 minutes on an x64 host (see SDR-0070 for example). Hope the new
intent log fixes will solve this.

btw: if it's intent log replay, why do I see 95MB/s of reads for about
10-20 minutes? What is zfs actually reading?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
I wrote:
> > 6350001 improves lookup performance, and should be putback very soon.
> >
> > Neil has some ZIL and range lock changes coming, Matt and Noel have some
> > ZAP perf changes in the pipeline, and Mark's ARC changes are going to be
> > quite, quite nice.
>
> It seems all of this is included in on-20060228 now. OpenSolaris "nightly"
> build times have improved quite a bit on a system that has almost everything
> on compressed ZFS (with the exception of the "/" root). Before on-20060228,
> my laptop needed 4.5 - 5.5 hours for a nightly build; now I'm at ~3.5 hours.
> That's about the same time needed when everything was on UFS.

Hmm, it seems I spoke too soon.

A dual processor (AMD-MP 32-bit) S-x86 machine (snv_34, bfu'ed to
on-20060228 release kernel modules), 2GB main memory, is just compiling
on-20060307, and I see this:

% zpool iostat 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        40,9G   171G     42     10  3,42M   358K
data        40,9G   171G     73     14  6,59M   555K
data        40,9G   171G     77     14  6,89M   374K
data        40,9G   171G     77      7  6,75M   233K
data        40,9G   171G     82     16  6,86M   715K
data        40,9G   171G     80      4  6,15M   258K
data        40,9G   171G     81     14  5,97M   578K
data        40,9G   171G     72     10  5,90M   500K
data        40,9G   171G     84     14  6,20M   575K
data        40,9G   171G     80     13  6,16M   680K
data        40,9G   171G     84     12  6,44M   591K
data        40,9G   171G     77      9  5,89M   431K
data        40,9G   171G     80     14  5,93M   582K
data        40,9G   171G     79     12  6,19M   437K
data        40,9G   171G     74     17  6,44M   632K
data        40,9G   171G     77     12  6,68M   374K
.... (similar data for at least half an hour) ....

The "write bandwidth" is OK, but the huge amount of "read bandwidth"
that is used isn't OK. I'd expect to see something close to zero in the
"read bandwidth" column (all source and header files cached).

% vmstat 10
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr cd cd f0 s0   in   sy   cs us sy id
 1 0 0 5357996 1405360 219 4255 11 1 1 0 5 56 0 0 0 2930 6136 3430 27 18 55
 3 0 0 5340636 1399200 68 2373 0 0 0 0 0 99 0 0 0 3592 4989 8409 25 16 58
 0 0 0 5346560 1402772 147 2120 0 0 0 0 0 97 0 0 0 2756 4441 6592 27 15 59
 0 0 0 5340348 1397344 46 1284 0 0 0 0 0 98 0 0 0 3984 3623 7794 17 13 70
 0 0 0 5342648 1399692 44 1425 0 0 0 0 0 96 0 0 0 3813 3895 7520 15 14 71
 0 0 0 5342396 1398632 57 1550 0 0 0 0 0 93 0 0 0 4246 2834 8086 10 14 77
 0 0 0 5330936 1391036 83 2482 0 0 0 0 0 94 0 0 0 4198 6528 8239 22 16 62
 0 0 0 5323952 1384344 59 2335 0 0 0 0 0 89 0 0 0 4356 3504 8325 22 16 62
 0 0 0 5329624 1390472 81 2478 0 0 0 0 0 91 0 0 0 4657 5594 8885 18 18 64
 0 0 0 5328392 1388456 51 2154 0 0 0 0 0 87 0 0 0 4702 3844 9797 17 17 67
 0 0 0 5333440 1395116 57 1005 0 0 0 0 0 94 0 0 0 3772 3462 8390  8 12 81
 0 0 0 5327920 1388884 61 2363 0 0 0 0 0 94 0 0 0 4204 3861 8428 19 16 65
 0 0 0 5329084 1391144 95 3365 0 0 0 0 0 92 0 0 0 4453 6698 9441 22 20 59
 0 0 0 5339968 1397656 40 1130 0 0 0 0 0 94 0 0 0 3952 2578 7869 11 12 77
 0 0 0 5330712 1393812 71 2066 0 0 0 0 0 90 0 0 0 4228 2794 8160 12 15 73
 0 0 0 5328300 1391200 69 2268 0 0 0 0 0 99 0 0 0 4157 3896 7592 15 16 69
 1 0 0 5326928 1390572 70 2655 0 0 0 0 0 96 0 0 0 4246 5792 8268 28 18 54
...

Lots of free memory, but the machine is "idle" most of the time,
obviously waiting for the frequent reads to complete.
# echo '::memstat' | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      86064               336   16%
Anon                        71556               279   14%
Exec and libs               16225                63    3%
Page cache                   2811                10    1%
Free (cachelist)           109783               428   21%
Free (freelist)            235687               920   45%

Total                      522126              2039
Physical                   522125              2039

My "arc.d" dtrace script (see the attachment) produces output like this
(it monitors the sdt:::arc-hit, miss, delete and evict counters in
5-second intervals):

# ~jk/src/dtrace/arc.d
hit miss delete evict
4217 651 0 695 anon 0, mru 4/477, mfu 5/1422, arc c 482 p 85 anon 0, mru 0/477, mfu 3/1422 (list)
2378 597 0 612 anon 1, mru 4/477, mfu 2/1424, arc c 482 p 85 anon 0, mru 0/477, mfu 1/1424 (list)
4431 803 0 646 anon 0, mru 4/477, mfu 7/1421, arc c 482 p 85 anon 0, mru 0/477, mfu 5/1421 (list)
arc_delete_state 175808
arc_delete_state 39168
4412 518 6 811 anon 0, mru 4/477, mfu 4/1423, arc c 482 p 85 anon 0, mru 0/477, mfu 2/1423 (list)
2987 818 0 645 anon 0, mru 4/477, mfu 6/1421, arc c 482 p 85 anon 0, mru 0/477, mfu 5/1421 (list)
4668 545 0 459 anon 0, mru 4/477, mfu 13/1414, arc c 482 p 85 anon 0, mru 0/477, mfu 12/1414 (list)
arc_delete_state 69312
arc_delete_state 38144
arc_delete_state 56512
10549 971 4 1316 anon 0, mru 4/477, mfu 1/1426, arc c 482 p 86 anon 0, mru 0/477, mfu 0/1426 (list)
1865 616 0 674 anon 0, mru 4/477, mfu 2/1426, arc c 482 p 85 anon 0, mru 0/477, mfu 0/1426 (list)
1293 680 0 526 anon 0, mru 4/477, mfu 6/1422, arc c 482 p 85 anon 0, mru 0/477, mfu 4/1422 (list)
2255 606 0 417 anon 1, mru 4/476, mfu 7/1420, arc c 482 p 86 anon 0, mru 0/476, mfu 6/1420 (list)
3017 605 0 573 anon 0, mru 5/476, mfu 9/1419, arc c 482 p 86 anon 0, mru 0/476, mfu 7/1419 (list)
5581 659 0 411 anon 1, mru 4/476, mfu 14/1415, arc c 482 p 86 anon 0, mru 0/476, mfu 12/1415 (list)

It seems as if the arc cache doesn't work any more. It seems as if the
kernel tries to shrink the arc cache (arc-evict), and whatever is removed
from the cache is immediately put back into it (arc-miss). The arc hit
count is quite low.

A new problem with b35?

[Attachment: arc.d, 1298 bytes -
http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060310/71ce3b49/attachment.obj]
Jürgen Keil wrote:
> I wrote:
>
>>> 6350001 improves lookup performance, and should be putback very soon.
>>>
>>> Neil has some ZIL and range lock changes coming, Matt and Noel have some
>>> ZAP perf changes in the pipeline, and Mark's ARC changes are going to be
>>> quite, quite nice.
>>
>> It seems all of this is included in on-20060228 now. OpenSolaris "nightly"
>> build times have improved quite a bit on a system that has almost everything
>> on compressed ZFS (with the exception of the "/" root). Before on-20060228,
>> my laptop needed 4.5 - 5.5 hours for a nightly build; now I'm at ~3.5 hours.
>> That's about the same time needed when everything was on UFS.
>
> Hmm, it seems I spoke too soon.
>
> A dual processor (AMD-MP 32-bit) S-x86 machine (snv_34, bfu'ed to
> on-20060228 release kernel modules), 2GB main memory, is just compiling
> on-20060307, and I see this:

I'm seeing good ZFS performance on a dual-core box (Opteron 180), snv_34,
BFUed to a more recent nightly:

http://blogs.sun.com/roller/page/danasblog?entry=zfs_v_ufs_performance_for

Note that I did this testing after BFUing a nightly build including a
large set of changes integrated on 3 March. It's quite possible/likely
that this wad of fixes explains the difference.

I reconfigured my storage so I now have a single ZFS pool, and I continue
to see fine nightly build times (essentially the same as UFS). I'll take
a look at vmstat during the next nightly build I do.

Cheers,
Dana
> A dual processor (AMD-MP 32-bit) S-x86 machine (snv_34, bfu'ed to
> on-20060228 release kernel modules), 2GB main memory, is just compiling
> on-20060307, and I see this:

The build finished after 6 hours, instead of the expected 2.5 hours...

Now that the box is running snv_34, bfu'ed to on-20060307, once again
compiling on-20060307 from scratch, I'm getting:

% zpool iostat 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        41,3G   171G      4      3   384K  18,9K
data        41,3G   171G      5     52   346K   294K
data        41,3G   171G      3     39   230K   211K
data        41,3G   171G      4     38   313K   189K
data        41,3G   171G      3     33   205K   222K
data        41,3G   171G      5     34   339K   267K
data        41,3G   171G      8     42   525K   236K
data        41,3G   171G     10     46   657K   272K
data        41,3G   171G      6     36   408K   281K
data        41,3G   171G      4     43   314K   296K
...
data        41,7G   170G     19     51  1,25M   434K
data        41,7G   170G     10     54   811K  1,20M
data        41,7G   170G      2     33   193K   325K
data        41,7G   170G      4     34   425K   749K
data        41,7G   170G      2     14   190K   349K
data        41,7G   170G      2     23   157K   418K
data        41,7G   170G      3     16   211K   306K
data        41,7G   170G     32     20  2,15M   280K
data        41,7G   170G     16     14  1,07M   190K
data        41,7G   170G      9     11   579K   183K
data        41,7G   170G      8     18   577K   423K
data        41,7G   170G      2     17   140K  87,4K
data        41,7G   170G      4     16   232K  93,3K
data        41,7G   170G      0     13  53,9K  77,4K
data        41,7G   170G      1     26  67,3K   155K
data        41,7G   170G      1     15  68,4K   119K
data        41,7G   170G      0     15  44,3K  65,4K
data        41,7G   170G      0     17  46,2K   115K
data        41,7G   170G      3     13   225K  96,9K

==> # read operations < # write operations

This looks much better now.

% vmstat 10
...
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr cd cd f0 lf   in   sy   cs us sy id
 1 0 0 5188692 1256136 438 9297 0 0 0 0 0 15 0 0 0 2296 13008 1111 76 24 0
 1 0 0 5200976 1265608 706 11240 0 0 0 0 0 17 0 0 0 2257 14544 1188 70 30 0
 1 0 0 5209192 1268324 666 9437 0 0 0 0 0 25 0 0 0 2376 15359 1620 69 29 2
 2 0 0 5196400 1261460 782 11368 0 0 0 0 0 34 0 0 0 2134 17015 1485 67 32 0
 1 0 0 5206780 1267316 681 9764 0 0 0 0 0 31 0 0 0 2121 14903 1679 66 31 2
 3 0 0 5188496 1254476 585 10241 0 0 0 0 0 35 0 0 0 2350 17507 1565 71 29 0
 3 0 0 5182512 1251840 494 9849 0 0 0 0 0 33 0 0 0 2322 18516 1228 72 28 0
 2 0 0 5195664 1258104 574 9710 0 0 0 0 0 29 0 0 0 2424 13983 1406 71 28 1
 2 0 0 5178332 1245860 542 10997 0 0 0 0 0 28 0 0 0 2270 14184 1109 72 28 0
 1 0 0 5192076 1253392 592 9503 0 0 0 0 0 25 0 0 0 2493 12731 1320 72 27 1
 0 0 0 5198268 1256868 538 9199 0 0 0 0 0 20 0 0 0 2005 14809 1117 71 27 2
 2 0 0 5182896 1248788 557 10506 0 0 0 0 0 30 0 0 0 2070 20196 1238 69 29 1
 3 0 0 5158948 1229804 417 9082 0 0 0 0 0 23 0 0 0 2733 16636 1405 74 26 0
 2 0 0 5185080 1249544 580 9803 0 0 0 0 0 18 0 0 0 2525 14231 1276 71 29 1
 1 0 0 5197224 1257556 709 10492 0 0 0 0 0 26 0 0 0 2040 19656 1364 68 30 2
 1 0 0 5193492 1255184 706 10552 0 0 0 0 0 23 0 0 0 2255 17574 1627 68 31 1
 4 0 0 5173660 1246084 515 10714 0 0 0 0 0 34 0 0 0 2920 19964 1383 70 30 0
 4 0 0 5172684 1244480 337 8163 0 0 0 0 0 22 0 0 0 2427 13858 1323 76 24 0
 0 0 0 5201424 1263696 458 9002 0 0 0 0 0 31 0 0 0 2524 16930 1384 73 27 0

==> no idle time, no more waiting for reads.
> > A dual processor (AMD-MP 32-bit) S-x86 machine (snv_34, bfu'ed to
> > on-20060228 release kernel modules), 2GB main memory, is just compiling
> > on-20060307, and I see this:
>
> I'm seeing good ZFS performance on a dual-core box (Opteron 180), snv_34,
> BFUed to a more recent nightly:
>
> http://blogs.sun.com/roller/page/danasblog?entry=zfs_v_ufs_performance_for
>
> Note that I did this testing after BFUing a nightly build including a
> large set of changes integrated on 3 March. It's quite possible/likely
> that this wad of fixes explains the difference.

Nope. Once again, this AMD-MP 32-bit S-x86 machine with 2GB main memory
is in a state where the ARC cache doesn't work at all. The machine is
running on-20060307, compiled as a release build. An incremental nightly
build is running, and I see this all the time:

% zpool iostat 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        41,8G   170G     59     13  3,70M   174K
data        41,8G   170G     86     15  5,37M  87,6K
data        41,8G   170G     84     14  5,49M   104K
data        41,8G   170G     81     17  5,20M  89,2K
data        41,8G   170G     86     16  5,46M   116K
data        41,8G   170G     82     16  5,13M  96,9K
data        41,8G   170G     85     21  5,26M   128K
data        41,8G   170G     85     17  5,29M   102K
data        41,8G   170G     92     17  5,76M   148K
data        41,8G   170G     86     14  5,18M   157K
data        41,8G   170G     88     14  5,64M   128K
data        41,8G   170G     85     14  5,37M   108K
data        41,8G   170G     87     12  5,19M   185K
data        41,8G   170G     87     13  5,49M   118K
data        41,8G   170G     79     18  4,83M   433K
data        41,8G   170G     89     17  5,43M   505K
data        41,8G   170G     86     16  5,25M   182K
data        41,8G   170G     82     13  4,94M   139K
data        41,8G   170G     97     14  5,51M  95,2K
data        41,8G   170G     91     19  5,11M   708K
data        41,8G   170G     94     12  5,63M  81,3K
data        41,8G   170G     88     14  5,42M   131K
data        41,8G   170G     86     13  5,48M   171K
data        41,8G   170G     87     14  5,34M   156K
data        41,8G   170G     87      8  5,50M  38,0K
data        41,8G   170G     88     11  5,59M   101K
data        41,8G   170G     89     19  5,66M   276K
data        41,8G   170G     87     13  5,48M   133K
data        41,8G   170G     90     18  5,74M   126K
data        41,8G   170G     88     11  5,43M   229K
data        41,8G   170G     87     17  5,53M   146K
data        41,8G   170G     90     10  5,61M   132K
data        41,8G   170G     88     12  5,40M  69,2K
data        41,8G   170G     88     22  5,70M   197K
data        41,8G   170G     83     17  5,04M   124K
data        41,8G   170G     85     15  5,35M   108K
data        41,8G   170G     80     15  5,01M   135K
data        41,8G   170G     86     17  5,21M  93,8K
data        41,8G   170G     84     17  5,12M   130K
data        41,8G   170G     88     14  5,38M   110K
data        41,8G   170G     88     18  5,65M   114K
data        41,8G   170G     83     15  5,14M    96K
data        41,8G   170G     94     10  5,94M  75,5K
data        41,8G   170G     86     10  5,29M  72,4K
data        41,8G   170G     83     15  5,21M   206K
data        41,8G   170G     85     12  5,44M  96,0K
data        41,8G   170G     79     14  5,21M   200K
data        41,8G   170G     91     20  5,99M   154K
data        41,8G   170G     80     25  5,33M  1,62M
data        41,8G   170G     71     31  4,65M  1,83M
data        41,8G   170G     88     14  5,77M   134K
data        41,8G   170G     87     12  5,70M  76,5K
data        41,8G   170G     87     13  5,88M   115K
data        41,8G   170G     88     26  5,55M   321K
data        41,8G   170G     85      8  5,21M  61,4K
data        41,8G   170G     88     16  5,64M   238K
data        41,8G   170G     93     15  5,82M   151K
....
Monitoring the ARC cache shows that it has shrunk to ~4-5 MB, and there
are *lots* of ARC cache misses:

# ~jk/src/dtrace/arc2.d
hit miss delete evict
931 1302 0 1297 anon 0, mru 4/49, mfu 0/478, arc c 64 p 63 anon 0, mru 0/49, mfu 0/478 (list)
352 1101 0 1103 anon 0, mru 3/49, mfu 0/477, arc c 64 p 63 anon 0, mru 0/49, mfu 0/477 (list)
29 1900 0 1818 anon 0, mru 3/48, mfu 2/476, arc c 64 p 62 anon 0, mru 0/48, mfu 1/476 (list)
365 1024 0 1017 anon 0, mru 3/49, mfu 2/476, arc c 64 p 61 anon 0, mru 0/49, mfu 0/476 (list)
123 913 0 924 anon 0, mru 4/48, mfu 2/477, arc c 64 p 62 anon 0, mru 0/48, mfu 1/477 (list)
217 825 0 965 anon 0, mru 4/50, mfu 0/478, arc c 64 p 63 anon 0, mru 0/50, mfu 0/478 (list)
170 1391 0 1405 anon 0, mru 3/51, mfu 0/479, arc c 64 p 63 anon 0, mru 0/51, mfu 0/479 (list)
355 1043 0 1042 anon 0, mru 3/51, mfu 1/478, arc c 64 p 62 anon 0, mru 0/51, mfu 0/478 (list)
38 1052 0 1095 anon 0, mru 3/51, mfu 0/479, arc c 64 p 64 anon 0, mru 0/51, mfu 0/479 (list)
847 1595 0 1503 anon 0, mru 3/53, mfu 2/478, arc c 64 p 61 anon 0, mru 0/53, mfu 0/478 (list)
171 998 0 991 anon 0, mru 3/53, mfu 2/478, arc c 64 p 61 anon 0, mru 0/53, mfu 0/478 (list)
185 1037 0 1012 anon 1, mru 4/55, mfu 3/477, arc c 64 p 61 anon 0, mru 0/55, mfu 0/477 (list)
3 865 0 980 anon 1, mru 5/54, mfu 0/479, arc c 64 p 62 anon 0, mru 2/54, mfu 0/479 (list)
739 1732 0 1799 anon 1, mru 4/56, mfu 0/481, arc c 64 p 63 anon 0, mru 0/56, mfu 0/481 (list)
17 1724 0 1643 anon 0, mru 4/55, mfu 2/481, arc c 64 p 61 anon 0, mru 0/55, mfu 1/481 (list)
49 623 0 750 anon 0, mru 4/55, mfu 0/482, arc c 64 p 63 anon 0, mru 0/55, mfu 0/482 (list)
642 1107 0 1149 anon 0, mru 3/54, mfu 0/483, arc c 64 p 63 anon 0, mru 0/54, mfu 0/483 (list)
699 742 0 705 anon 0, mru 3/54, mfu 1/482, arc c 64 p 62 anon 0, mru 0/54, mfu 0/482 (list)
35 1969 0 2012 anon 0, mru 3/53, mfu 0/483, arc c 64 p 63 anon 0, mru 0/53, mfu 0/483 (list)
297 2079 0 2043 anon 0, mru 3/57, mfu 1/482, arc c 64 p 61 anon 0, mru 0/57, mfu 0/482 (list)
121 1232 52 1159 anon 3, mru 3/56, mfu 2/481, arc c 64 p 61 anon 0, mru 0/56, mfu 1/481 (list)
300 857 0 1005 anon 0, mru 4/58, mfu 0/485, arc c 64 p 64 anon 0, mru 0/58, mfu 0/485 (list)
262 1825 0 1831 anon 0, mru 4/56, mfu 0/487, arc c 64 p 63 anon 0, mru 0/56, mfu 0/487 (list)
559 1212 0 972 anon 0, mru 4/55, mfu 3/484, arc c 64 p 60 anon 0, mru 0/55, mfu 3/484 (list)
107 626 0 919 anon 0, mru 4/54, mfu 0/487, arc c 64 p 62 anon 0, mru 0/54, mfu 0/487 (list)
19 1262 0 1153 anon 0, mru 4/54, mfu 2/485, arc c 64 p 61 anon 0, mru 0/54, mfu 1/485 (list)
581 660 0 582 anon 0, mru 4/53, mfu 3/485, arc c 64 p 60 anon 0, mru 0/53, mfu 2/485 (list)
726 887 0 1082 anon 0, mru 4/53, mfu 1/487, arc c 64 p 63 anon 0, mru 0/53, mfu 0/487 (list)
314 2285 0 2310 anon 0, mru 3/52, mfu 1/489, arc c 64 p 62 anon 0, mru 0/52, mfu 0/489 (list)
647 1355 0 1380 anon 0, mru 3/52, mfu 1/489, arc c 64 p 62 anon 0, mru 0/52, mfu 0/489 (list)
59 1479 0 1393 anon 0, mru 3/52, mfu 2/488, arc c 64 p 61 anon 0, mru 0/52, mfu 2/488 (list)
407 973 0 1074 anon 0, mru 3/51, mfu 1/491, arc c 64 p 62 anon 0, mru 0/51, mfu 0/491 (list)
225 703 0 577 anon 0, mru 4/50, mfu 2/489, arc c 64 p 61 anon 0, mru 0/50, mfu 1/489 (list)
1255 650 0 643 anon 0, mru 4/51, mfu 3/488, arc c 64 p 61 anon 0, mru 0/51, mfu 1/488 (list)
239 763 0 961 anon 0, mru 3/52, mfu 1/491, arc c 64 p 62 anon 0, mru 0/52, mfu 0/491 (list)
25 1952 0 1894 anon 0, mru 3/52, mfu 1/491, arc c 64 p 60 anon 0, mru 0/52, mfu 1/491 (list)
811 1568 0 1593 anon 1, mru 4/51, mfu 0/492, arc c 64 p 63 anon 0, mru 0/51, mfu 0/492 (list)
370 1170 0 1133 anon 0, mru 3/52, mfu 2/490, arc c 64 p 61 anon 0, mru 0/52, mfu 1/490 (list)
581 643 0 694 anon 0, mru 4/51, mfu 1/492, arc c 64 p 63 anon 0, mru 0/51, mfu 0/492 (list)
31 1287 0 1397 anon 0, mru 3/52, mfu 0/492, arc c 64 p 63 anon 0, mru 0/52, mfu 0/492 (list)
11 1119 0 1097 anon 1, mru 4/52, mfu 0/493, arc c 64 p 63 anon 0, mru 0/52, mfu 0/493 (list)
321 1001 0 904 anon 0, mru 3/52, mfu 2/491, arc c 64 p 61 anon 0, mru 0/52, mfu 2/491 (list)
423 1071 0 1122 anon 0, mru 3/52, mfu 2/491, arc c 64 p 61 anon 0, mru 0/52, mfu 0/491 (list)
337 863 0 891 anon 0, mru 3/53, mfu 2/492, arc c 64 p 61 anon 0, mru 0/53, mfu 1/492 (list)
79 1042 0 1090 anon 0, mru 4/52, mfu 0/493, arc c 64 p 63 anon 0, mru 0/52, mfu 0/493 (list)
489 912 0 902 anon 0, mru 4/52, mfu 0/493, arc c 64 p 63 anon 0, mru 0/52, mfu 0/493 (list)
311 1277 0 1259 anon 0, mru 3/53, mfu 2/492, arc c 64 p 61 anon 0, mru 0/53, mfu 1/492 (list)
514 779 0 850 anon 0, mru 3/53, mfu 1/493, arc c 64 p 62 anon 0, mru 0/53, mfu 0/493 (list)
318 1413 0 1450 anon 0, mru 3/53, mfu 0/494, arc c 64 p 62 anon 0, mru 0/53, mfu 0/494 (list)
50 1317 0 1233 anon 1, mru 3/52, mfu 1/492, arc c 64 p 61 anon 0, mru 0/52, mfu 1/492 (list)
451 878 0 1029 anon 0, mru 3/53, mfu 0/494, arc c 64 p 63 anon 0, mru 0/53, mfu 0/494 (list)
35 896 0 874 anon 0, mru 3/53, mfu 0/494, arc c 64 p 61 anon 0, mru 0/53, mfu 0/494 (list)
317 1305 0 1277 anon 0, mru 4/53, mfu 0/494, arc c 64 p 63 anon 0, mru 0/53, mfu 0/494 (list)
433 1059 0 1041 anon 0, mru 3/53, mfu 1/493, arc c 64 p 63 anon 0, mru 0/53, mfu 1/493 (list)
72 925 0 1045 anon 0, mru 3/54, mfu 0/494, arc c 64 p 63 anon 0, mru 0/54, mfu 0/494 (list)
382 882 0 834 anon 0, mru 4/53, mfu 0/495, arc c 64 p 63 anon 0, mru 0/53, mfu 0/495 (list)
46 1418 0 1408 anon 0, mru 3/53, mfu 0/494, arc c 64 p 62 anon 0, mru 0/53, mfu 0/494 (list)
539 998 0 929 anon 0, mru 3/54, mfu 2/493, arc c 64 p 62 anon 0, mru 0/54, mfu 1/493 (list)
74 954 0 1116 anon 0, mru 3/54, mfu 0/495, arc c 64 p 63 anon 0, mru 0/54, mfu 0/495 (list)
301 991 0 992 anon 0, mru 3/54, mfu 0/495, arc c 64 p 63 anon 0, mru 0/54, mfu 0/495 (list)
hit miss delete evict
27 1347 0 1326 anon 0, mru 3/54, mfu 0/495, arc c 64 p 61 anon 0, mru 0/54, mfu 0/495 (list)
624 951 0 859 anon 0, mru 3/54, mfu 1/493, arc c 64 p 62 anon 0, mru 0/54, mfu 1/493 (list)
420 954 0 1012 anon 0, mru 4/53, mfu 0/494, arc c 64 p 62 anon 0, mru 0/53, mfu 0/494 (list)
107 1109 0 1079 anon 0, mru 4/54, mfu 1/493, arc c 64 p 61 anon 0, mru 0/54, mfu 1/493 (list)
428 688 0 794 anon 0, mru 3/54, mfu 0/495, arc c 64 p 63 anon 0, mru 0/54, mfu 0/495 (list)
285 1228 0 1127 anon 0, mru 3/54, mfu 2/493, arc c 64 p 61 anon 0, mru 0/54, mfu 1/493 (list)
278 914 0 921 anon 0, mru 3/56, mfu 2/493, arc c 64 p 61 anon 0, mru 0/56, mfu 1/493 (list)
611 867 0 774 anon 0, mru 3/55, mfu 3/491, arc c 64 p 60 anon 0, mru 0/55, mfu 3/491 (list)
640 656 0 748 anon 0, mru 4/56, mfu 1/494, arc c 64 p 63 anon 0, mru 0/56, mfu 0/494 (list)
448 1032 0 1170 anon 0, mru 4/56, mfu 0/495, arc c 64 p 63 anon 0, mru 0/56, mfu 0/495 (list)
79 1535 0 1538 anon 0, mru 3/57, mfu 0/495, arc c 64 p 63 anon 0, mru 0/57, mfu 0/495 (list)
64 1219 3 1223 anon 0, mru 4/58, mfu 0/496, arc c 64 p 62 anon 0, mru 0/58, mfu 0/496 (list)
35 955 94 936 anon 0, mru 4/59, mfu 1/495, arc c 64 p 62 anon 0, mru 0/59, mfu 0/495 (list)
273 1048 69 1049 anon 0, mru 5/58, mfu 0/496, arc c 64 p 63 anon 0, mru 1/58, mfu 0/496 (list)
266 1286 268 1333 anon 0, mru 4/59, mfu 0/496, arc c 64 p 62 anon 0, mru 0/59, mfu 0/496 (list)
5 1394 233 1379 anon 1, mru 4/57, mfu 1/496, arc c 64 p 63 anon 0, mru 0/57, mfu 0/496 (list)
45 1258 10 1192 anon 1, mru 4/57, mfu 2/496, arc c 64 p 64 anon 0, mru 0/57, mfu 1/496 (list)
86 1758 0 1839 anon 5, mru 4/51, mfu 1/503, arc c 64 p 63 anon 0, mru 0/51, mfu 0/503 (list)
10 1848 0 1729 anon 3, mru 10/45, mfu 2/507, arc c 64 p 62 anon 0, mru 6/45, mfu 0/507 (list)
71 1015 0 1055 anon 15, mru 8/41, mfu 1/512, arc c 64 p 63 anon 0, mru 3/41, mfu 0/512 (list)
742 1544 330 1747 anon 9, mru 4/45, mfu 1/513, arc c 64 p 62 anon 0, mru 0/45, mfu 0/513 (list)
9 999 561 1052 anon 11, mru 4/39, mfu 2/513, arc c 64 p 61 anon 0, mru 0/39, mfu 1/513 (list)
17 1974 0 2084 anon 0, mru 5/39, mfu 1/514, arc c 64 p 63 anon 0, mru 0/39, mfu 0/514 (list)
9 984 0 1039 anon 0, mru 4/38, mfu 0/515, arc c 64 p 63 anon 0, mru 0/38, mfu 0/515 (list)
464 1235 0 1157 anon 0, mru 3/39, mfu 3/513, arc c 64 p 61 anon 0, mru 0/39, mfu 2/513 (list)
73 835 0 893 anon 0, mru 3/39, mfu 3/513, arc c 64 p 61 anon 0, mru 0/39, mfu 2/513 (list)
3 894 0 989 anon 0, mru 3/39, mfu 1/515, arc c 64 p 63 anon 0, mru 0/39, mfu 0/515 (list)
246 1325 0 1194 anon 0, mru 4/38, mfu 2/514, arc c 64 p 61 anon 0, mru 0/38, mfu 1/514 (list)
49 1054 0 1123 anon 3, mru 4/38, mfu 1/516, arc c 64 p 61 anon 0, mru 0/38, mfu 1/516 (list)
44 901 0 982 anon 0, mru 4/30, mfu 1/515, arc c 64 p 62 anon 0, mru 0/30, mfu 1/515 (list)
398 1285 0 1325 anon 0, mru 4/30, mfu 0/516, arc c 64 p 63 anon 0, mru 0/30, mfu 0/516 (list)
417 1378 0 1386 anon 0, mru 4/30, mfu 1/515, arc c 64 p 63 anon 0, mru 0/30, mfu 0/515 (list)
1306 1602 0 1667 anon 0, mru 4/30, mfu 0/516, arc c 64 p 63 anon 0, mru 0/30, mfu 0/516 (list)
38 1660 0 1658 anon 0, mru 5/30, mfu 0/517, arc c 64 p 63 anon 0, mru 1/30, mfu 0/517 (list)
43 2302 0 2255 anon 1, mru 4/29, mfu 1/518, arc c 64 p 63 anon 0, mru 0/29, mfu 0/518 (list)
713 1774 0 1761 anon 0, mru 3/31, mfu 2/517, arc c 64 p 62 anon 0, mru 0/31, mfu 0/517 (list)
141 1185 0 1215 anon 0, mru 3/31, mfu 2/517, arc c 64 p 62 anon 0, mru 0/31, mfu 1/517 (list)
12 1665 0 1639 anon 0, mru 4/31, mfu 2/517, arc c 64 p 62 anon 0, mru 0/31, mfu 0/517 (list)
295 1405 0 1462 anon 0, mru 4/31, mfu 1/518, arc c 64 p 63 anon 0, mru 0/31, mfu 1/518 (list)
499 1613 0 1608 anon 0, mru 4/31, mfu 2/518, arc c 64 p 63 anon 0, mru 0/31, mfu 1/518 (list)

[Attachment: arc2.d, 1290 bytes -
http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060313/0f1a12f3/attachment.obj]
Jürgen Keil
2006-Mar-13 13:23 UTC
[zfs-discuss] ARC cache issues with b35/b36 [was Re: [nfs-discuss] bug 6344186]
> I'm seeing good ZFS performance on a dual-core box (Opteron 180), snv_34,
> BFUed to a more recent nightly:
>
> http://blogs.sun.com/roller/page/danasblog?entry=zfs_v_ufs_performance_for

Are you running an amd64 kernel? And the box has 1GB of RAM?

What happens when you boot the 32-bit kernel? Is there still no
performance issue when doing nightly builds? Maybe with 2GB of RAM
(instead of 1GB) on a 32-bit kernel the problem is reproducible?

AFAICT, the problem is that arc_reclaim_needed() returns 1, most likely
because the heap_arena is full (this can only happen on 32-bit x86):

	/*
	 * If we're on an i386 platform, it's possible that we'll exhaust the
	 * kernel heap space before we ever run out of available physical
	 * memory. Most checks of the size of the heap_area compare against
	 * tune.t_minarmem, which is the minimum available real memory that we
	 * can have in the system. However, this is generally fixed at 25 pages
	 * which is so low that it's useless. In this comparison, we seek to
	 * calculate the total heap-size, and reclaim if more than 3/4ths of the
	 * heap is allocated. (Or, in the calculation, if less than 1/4th is
	 * free)
	 */
#if defined(__i386)
	if (btop(vmem_size(heap_arena, VMEM_FREE)) <
	    (btop(vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC)) >> 2))
		return (1);
#endif

In arc_reclaim_needed(), the condition
(freemem < lotsfree + needfree + extra) isn't true:

> freemem::print
0x51d50
> lotsfree::print
0x1fde
> needfree::print
0
> desfree::print
0xfef

And (availrmem < swapfs_minfree + swapfs_reserve + extra) isn't true:

> availrmem ::print
0x6662e
> swapfs_minfree::print
0xfef1
> swapfs_reserve::print
0xfef

(extra == desfree == 0xfef)

The ARC cache enters the "arc.no_grow == TRUE" state because of this.
But it seems the arc_kmem_reap_now() calls from arc_reclaim_thread() are
unable to free enough heap memory, so the system enters a permanent
arc_reclaim_needed() == TRUE state.

I guess it would help if arc_kmem_reap_now() would start to free entries
from the arc.mfu_ghost list, which has collected > 800MB up to now. But
it doesn't:

> arc::print mfu_ghost[0].lsize
mfu_ghost[0].lsize = 0x2e586200
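[To sanity-check that reading, here is a throwaway userland sketch that
plugs the mdb values above into the two memory conditions (values
hard-coded from this mail; this is plain arithmetic, not kernel code).
Both conditions come out false, which leaves the i386-only heap_arena
test as the only way arc_reclaim_needed() can be returning 1 here.]

    #include <stdio.h>

    int
    main(void)
    {
            /* values from the ::print output above, in pages */
            long freemem = 0x51d50, lotsfree = 0x1fde, needfree = 0;
            long desfree = 0xfef;
            long extra = desfree;   /* extra == desfree, as noted above */
            long availrmem = 0x6662e;
            long swapfs_minfree = 0xfef1, swapfs_reserve = 0xfef;

            printf("freemem < lotsfree + needfree + extra?  %s\n",
                freemem < lotsfree + needfree + extra ? "yes" : "no");
            printf("availrmem < swapfs_minfree + swapfs_reserve + extra?  %s\n",
                availrmem < swapfs_minfree + swapfs_reserve + extra ?
                "yes" : "no");
            return (0);
    }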
Dana H. Myers
2006-Mar-13 16:29 UTC
[zfs-discuss] ARC cache issues with b35/b36 [was Re: [nfs-discuss] bug 6344186]
Jürgen Keil wrote:
>> I'm seeing good ZFS performance on a dual-core box (Opteron 180), snv_34,
>> BFUed to a more recent nightly:
>>
>> http://blogs.sun.com/roller/page/danasblog?entry=zfs_v_ufs_performance_for
>
> Are you running an amd64 kernel? And the box has 1GB of RAM?

Yes, and yes.

> What happens when you boot the 32-bit kernel? Is there still no
> performance issue when doing nightly builds? Maybe with 2GB of RAM
> (instead of 1GB) on a 32-bit kernel the problem is reproducible?

Haven't tried 32-bit mode yet; good idea. I don't have a spare gig of
memory lying around for this machine, or I'd try bumping it up to 2GB.
Will let you know about 32-bit mode.

Looks like a good problem diagnosis.

Thanks,
Dana
Dana H. Myers
2006-Mar-14 15:38 UTC
[zfs-discuss] ARC cache issues with b35/b36 [was Re: [nfs-discuss] bug 6344186]
Jürgen Keil wrote:
>> I'm seeing good ZFS performance on a dual-core box (Opteron 180), snv_34,
>> BFUed to a more recent nightly:
>>
>> http://blogs.sun.com/roller/page/danasblog?entry=zfs_v_ufs_performance_for
>
> Are you running an amd64 kernel? And the box has 1GB of RAM?
>
> What happens when you boot the 32-bit kernel? Is there still no
> performance issue when doing nightly builds? Maybe with 2GB of RAM
> (instead of 1GB) on a 32-bit kernel the problem is reproducible?

Actually, with 1GB of RAM and booted into 32-bit mode, the same system
took 6h 19m to complete the same nightly, with a constant high level of
read activity reported by 'iostat -x'.

I'm seeing the same problem.

Dana
Dana H. Myers
2006-Mar-14 16:45 UTC
[zfs-discuss] ARC cache issues with b35/b36 [was Re: [nfs-discuss] bug 6344186]
Dana H. Myers wrote:
> Jürgen Keil wrote:
>>> I'm seeing good ZFS performance on a dual-core box (Opteron 180), snv_34,
>>> BFUed to a more recent nightly:
>>>
>>> http://blogs.sun.com/roller/page/danasblog?entry=zfs_v_ufs_performance_for
>> Are you running an amd64 kernel? And the box has 1GB of RAM?
>>
>> What happens when you boot the 32-bit kernel? Is there still no
>> performance issue when doing nightly builds? Maybe with 2GB of RAM
>> (instead of 1GB) on a 32-bit kernel the problem is reproducible?
>
> Actually, with 1GB of RAM and booted into 32-bit mode, the same system
> took 6h 19m to complete the same nightly, with a constant high level of
> read activity reported by 'iostat -x'.
>
> I'm seeing the same problem.

... and I've opened CR 6398177 to report it.

Cheers,
Dana
> Joseph Little wrote:
> > I'd love to "vote" to have this addressed, but apparently votes for
> > bugs are not available to outsiders.
> >
> > What's limiting Stanford EE's move to using ZFS entirely for our
> > snapshotting filesystems and multi-tier storage is the inability to
> > access .zfs directories, and snapshots in particular, on NFSv3 clients.
> > We simply can't move the majority of clients on various OSes to NFSv4
> > overnight, or even within the year (vendors aren't all there yet).
> >
> > Is there any progress on fixing Solaris 11/OpenSolaris nfsd to
> > support ZFS with NFSv3?

6344186 has been putback into build 36, and now NFSv3 clients can
access .zfs!

happy .zfs'ing,
eric
Jürgen Keil
2006-Mar-15 12:19 UTC
[zfs-discuss] Re: ARC cache issues with b35/b36 [was Re: [nfs-discuss] bug 6344186]
> >> What happens when you boot the 32-bit kernel? Is there still no
> >> performance issue when doing nightly builds? Maybe with 2GB of RAM
> >> (instead of 1GB) on a 32-bit kernel the problem is reproducible?
> >
> > Actually, with 1GB of RAM and booted into 32-bit mode, the same system
> > took 6h 19m to complete the same nightly, with a constant high level of
> > read activity reported by 'iostat -x'.
> >
> > I'm seeing the same problem.
>
> ... and I've opened CR 6398177 to report it.

I also filed a bug, CR 6397610. One of the two bugs should be closed as
a duplicate of the other...