Hello zfs-discuss,

Relatively low traffic to the pool, but sync takes far too long to
complete, and other operations are also not that fast.

Disks are on a 3510 array. zil_disable=1.

bash-3.00# ptime sync

real     1:21.569
user        0.001
sys         0.027

During the sync, zpool iostat and vmstat look like this:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G    370    859   995K  10.2M
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G    697    929  2.91M  10.5M
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G  1.21K     90  6.33M  1.57M
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G  1.38K      6  6.83M   256K
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G  1.29K      0  4.10M   127K
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G  1.35K      0  6.98M   127K
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G   1012    229  3.06M   631K
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G    683  1.74K  7.00M  13.8M
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G    966    722  3.00M  6.63M
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G    702    134  1.85M  1.96M
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G     1K     78  3.05M   880K
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G    899    154  2.59M  1.45M
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
f3-1         504G   720G  1.00K      0  4.35M      0
misc        20.6M  52.0G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
^C

 kthr      memory            page            disk          faults      cpu
 r b w   swap    free   re   mf pi po fr de sr m0 m1 m2 m1   in   sy    cs us sy id
 0 0 0 8266008 1100560   0    0  0  0  0  0  0  0  0  0  0 2392  589  9592  0 21 79
 1 0 0 8266008 1100560   0    0  0  0  0  0  0  0  0  0  0 3909 1458 13330  0 39 61
 0 0 0 8265400 1099952   0    0  0  0  0  0  0  0  0  0  0 6892 1104 21023  0 47 53
 0 0 0 8262648 1097200   0    0  0  0  0  0  0 65 64 65  0 7904 1327 22531  0 50 50
 0 0 0 8259496 1094048   0    0  0  0  0  0  0 16 16 16  0 7037  986 20123  0 50 50
 1 0 0 8258536 1093088   0    0  0  0  0  0  0  0  0  0  0 4363 1084 12107  0 39 61
 0 0 0 8250856 1085408   0    0  0  0  0  0  0  0  0  0  0 4378  414 16436  0 30 70
 0 0 0 8247888 1080736 580 1048  0  0  0  0  0  0  0  0  0 7283 2409 21480  4 35 61
 0 0 0 8248600 1083152   0    0  0  0  0  0  0  0  0  0  0 3045 1184 10368  0 36 64
 0 0 0 8248600 1083152   0    0  0  0  0  0  0  0  0  0  0 1659 1543  5847  0 34 66
 0 0 0 8248600 1083152   0    0  0  0  0  0  0  0  0  0  0 1755 1743  6639  0 35 65
 1 0 0 8248600 1083152   0    0  0  0  0  0  0  0  0  0  0 2723 1259  7973  0 36 64
 0 0 0 8250280 1085040   0    0  0  0  0  0  0  0  0  0  0 1104 1308  3944  0 30 69
 0 0 0 8250280 1085040   0    0  0  0  0  0  0  0  0  0  0 2348  705  9212  0 29 70
 0 0 0 8250016 1084776   0    0  0  0  0  0  0  0  0  0  0 5152  384 17753  0 22 78
 1 0 0 8249928 1084688   0    0  0  0  0  0  0  0  0  0  0 2397 1193  7311  0 30 70
^C

bash-3.00# uname -a
SunOS nfs-10-1.srv 5.10 Generic_125100-04 sun4u sparc SUNW,Sun-Fire-V440
bash-3.00# showrev -p|grep IDR
Patch: IDR126199-01 Obsoletes: Requires: 120473-05 Incompatibles: 120473-06 Packages: SUNWzfskr
bash-3.00#

--
Best regards,
 Robert                        mailto:rmilkowski at task.gda.pl
                                     http://milek.blogspot.com
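[For context: on this vintage of Solaris 10, zil_disable was usually set
one of two ways -- a sketch, not a recommendation, since disabling the
ZIL gives up synchronous-write semantics for NFS clients:]

    # /etc/system -- persistent, takes effect at the next boot
    set zfs:zil_disable = 1

    # or live on a running kernel via mdb (reverts at reboot); note it
    # only takes effect for datasets mounted after the change
    echo "zil_disable/W 1" | mdb -kw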
On 4/23/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
>
> Relatively low traffic to the pool, but sync takes far too long to
> complete, and other operations are also not that fast.
>
> Disks are on a 3510 array. zil_disable=1.
>
> bash-3.00# ptime sync
>
> real     1:21.569
> user        0.001
> sys         0.027

Hey, that is *quick*!

On Friday I typed sync mid-afternoon. Nothing had happened a couple of
hours later when I went home. It looked as though it had finished by
11pm, when I checked in from home.

This was on a thumper running S10U3. As far as I could tell, all writes
to the pool stopped completely. There were applications trying to
write, but they had just stopped (and picked up later in the evening).
A fairly consistent few hundred K per second of reads; no writes; and
pretty low system load.

It did recover, but a write latency of a few hours is rather
undesirable.

What on earth was it doing?

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Hello Peter,

Monday, April 23, 2007, 9:27:56 PM, you wrote:

PT> On 4/23/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
>>
>> Relatively low traffic to the pool, but sync takes far too long to
>> complete, and other operations are also not that fast.
>>
>> Disks are on a 3510 array. zil_disable=1.
>>
>> bash-3.00# ptime sync
>>
>> real     1:21.569
>> user        0.001
>> sys         0.027

PT> Hey, that is *quick*!

PT> On Friday I typed sync mid-afternoon. Nothing had happened a couple
PT> of hours later when I went home. It looked as though it had
PT> finished by 11pm, when I checked in from home.

PT> This was on a thumper running S10U3. As far as I could tell, all
PT> writes to the pool stopped completely. There were applications
PT> trying to write, but they had just stopped (and picked up later in
PT> the evening). A fairly consistent few hundred K per second of
PT> reads; no writes; and pretty low system load.

PT> It did recover, but a write latency of a few hours is rather
PT> undesirable.

PT> What on earth was it doing?

I've seen it too :(

Other than that, I can see that even while reads and writes are going
on, ZFS issues its write cache flush commands minutes apart instead of
at the 5s default, and nfsd goes crazy then.

ZFS commands like zpool status, zfs list, etc. can then hang for
hours... nothing unusual in iostat.

--
Best regards,
 Robert                        mailto:rmilkowski at task.gda.pl
                                     http://milek.blogspot.com
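[One way to measure the flush cadence Robert describes: a minimal DTrace
sketch, assuming this build still exposes zio_ioctl(), the entry point
ZFS of this era uses to send DKIOCFLUSHWRITECACHE to its vdevs --
function names vary between releases, so treat the probe name as an
assumption:]

    # Count cache-flush requests and print a running total every 10s;
    # long stretches of zeros followed by bursts would match the
    # minutes-apart behaviour described above.
    dtrace -n 'fbt::zio_ioctl:entry { @flushes = count(); }' \
           -n 'tick-10s { printa("flushes in last 10s: %@d\n", @flushes); trunc(@flushes); }'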
On Mon, Apr 23, 2007 at 20:27:56 +0100, Peter Tribble wrote:
: On 4/23/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
: >Relatively low traffic to the pool, but sync takes far too long to
: >complete, and other operations are also not that fast.
: >Disks are on a 3510 array. zil_disable=1.
: >bash-3.00# ptime sync
: >real     1:21.569
: >user        0.001
: >sys         0.027

: Hey, that is *quick*!

: On Friday I typed sync mid-afternoon. Nothing had happened a couple of
: hours later when I went home. It looked as though it had finished by
: 11pm, when I checked in from home.

: This was on a thumper running S10U3. As far as I could tell, all writes
: to the pool stopped completely. There were applications trying to write,
: but they had just stopped (and picked up later in the evening). A fairly
: consistent few hundred K per second of reads; no writes; and pretty low
: system load.

I'm glad I'm not the only one to have seen this.

I'm currently playing with ZFS on a T2000 with 24x500GB SATA discs in
an external array that presents as SCSI. After having much 'fun' with
the Solaris SCSI driver not handling LUNs >2TB, I reconfigured the
array to present as one target with 24 LUNs, one per disc, and threw
ZFS at it in a raidz2 configuration. I admit this isn't optimal, but it
has the behaviour I wanted: namely lots of space with a little
redundancy for safety.

Having had said 'fun' with the sd driver, I thought I'd thoroughly
check large-object handling, and started eight 'dd if=/dev/zero's
before retiring to the pub and leaving it overnight. The next morning,
I discovered a bunch of rather large files, 340GB in size.

Everything seemed OK, so I issued an 'rm *', expecting it to return
rather quickly. How wrong I was. It took a minute (61s from memory) to
delete a single 320GB file, which flattened the SCSI bus with
4.5MB/s/disc of reads (as reported by iostat -x), during which time all
writes were suspended. This is not good. Once that had finished, a
'ptime sync' sat for 25 minutes running at about 1MB/s/disc. Again, all
reads.

Given what I intend to use this filesystem for -- dropping all the
BBC's Freeview muxes to disc in 24-hour chunks -- performance on large
objects is rather important to me. I've reconfigured to 3x(7+1) raidz,
and this has helped a lot (as I expected it would), but it's still not
great having multi-second write locks when deleting 16GB objects.

100MB/s write speed and 200MB/s read speed isn't bad, though. Quite
impressed with that.

: It did recover, but a write latency of a few hours is rather
: undesirable.

To put it mildly.

: What on earth was it doing?

I wish I knew.

Anyone any ideas on how to optimise it further? I'm using the defaults
(whatever's created by an 8GB RAM T2000 with 8 1GHz cores); no
compression, no nothing.

--
Dickon Hood

Due to digital rights management, my .sig is temporarily unavailable.
Normal service will be resumed as soon as possible. We apologise for
the inconvenience in the meantime.

No virus was found in this outgoing message as I didn't bother looking.
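[For reference, the 3x(7+1) raidz layout described above would be built
with something like the following -- the device names are hypothetical,
assuming one target presenting 24 LUNs:]

    # Three 8-disc raidz top-level vdevs (7 data + 1 parity each).
    # Each block lands in a single 8-disc group, so the raidz stripes
    # stay much narrower than one wide 24-disc group would give,
    # reducing per-block parity and reconstruction work.
    zpool create tank \
        raidz c2t0d0  c2t0d1  c2t0d2  c2t0d3  c2t0d4  c2t0d5  c2t0d6  c2t0d7 \
        raidz c2t0d8  c2t0d9  c2t0d10 c2t0d11 c2t0d12 c2t0d13 c2t0d14 c2t0d15 \
        raidz c2t0d16 c2t0d17 c2t0d18 c2t0d19 c2t0d20 c2t0d21 c2t0d22 c2t0d23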
Hello Robert,

Monday, April 23, 2007, 10:44:00 PM, you wrote:

RM> Hello Peter,

RM> Monday, April 23, 2007, 9:27:56 PM, you wrote:

PT>> On 4/23/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
>>>
>>> Relatively low traffic to the pool, but sync takes far too long to
>>> complete, and other operations are also not that fast.
>>>
>>> Disks are on a 3510 array. zil_disable=1.
>>>
>>> bash-3.00# ptime sync
>>>
>>> real     1:21.569
>>> user        0.001
>>> sys         0.027

PT>> Hey, that is *quick*!

PT>> On Friday I typed sync mid-afternoon. Nothing had happened a couple
PT>> of hours later when I went home. It looked as though it had
PT>> finished by 11pm, when I checked in from home.

PT>> This was on a thumper running S10U3. As far as I could tell, all
PT>> writes to the pool stopped completely. There were applications
PT>> trying to write, but they had just stopped (and picked up later in
PT>> the evening). A fairly consistent few hundred K per second of
PT>> reads; no writes; and pretty low system load.

PT>> It did recover, but a write latency of a few hours is rather
PT>> undesirable.

PT>> What on earth was it doing?

RM> I've seen it too :(

RM> Other than that, I can see that even while reads and writes are
RM> going on, ZFS issues its write cache flush commands minutes apart
RM> instead of at the 5s default, and nfsd goes crazy then.

RM> ZFS commands like zpool status, zfs list, etc. can then hang for
RM> hours... nothing unusual in iostat.

Also, stopping nfsd can take a dozen minutes to complete.

I've never observed this with nfsd/ufs.

--
Best regards,
 Robert                        mailto:rmilkowski at task.gda.pl
                                     http://milek.blogspot.com
Hello Robert,

Monday, April 23, 2007, 11:12:39 PM, you wrote:

RM> Hello Robert,

RM> Monday, April 23, 2007, 10:44:00 PM, you wrote:

RM>> Hello Peter,

RM>> Monday, April 23, 2007, 9:27:56 PM, you wrote:

PT>>> On 4/23/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
>>>>
>>>> Relatively low traffic to the pool, but sync takes far too long to
>>>> complete, and other operations are also not that fast.
>>>>
>>>> Disks are on a 3510 array. zil_disable=1.
>>>>
>>>> bash-3.00# ptime sync
>>>>
>>>> real     1:21.569
>>>> user        0.001
>>>> sys         0.027

PT>>> Hey, that is *quick*!

PT>>> On Friday I typed sync mid-afternoon. Nothing had happened a
PT>>> couple of hours later when I went home. It looked as though it
PT>>> had finished by 11pm, when I checked in from home.

PT>>> This was on a thumper running S10U3. As far as I could tell, all
PT>>> writes to the pool stopped completely. There were applications
PT>>> trying to write, but they had just stopped (and picked up later
PT>>> in the evening). A fairly consistent few hundred K per second of
PT>>> reads; no writes; and pretty low system load.

PT>>> It did recover, but a write latency of a few hours is rather
PT>>> undesirable.

PT>>> What on earth was it doing?

RM>> I've seen it too :(

RM>> Other than that, I can see that even while reads and writes are
RM>> going on, ZFS issues its write cache flush commands minutes apart
RM>> instead of at the 5s default, and nfsd goes crazy then.

RM>> ZFS commands like zpool status, zfs list, etc. can then hang for
RM>> hours... nothing unusual in iostat.

RM> Also, stopping nfsd can take a dozen minutes to complete.

RM> I've never observed this with nfsd/ufs.

Run on the server itself, on ZFS:

bash-3.00# dtrace -n 'fbt::fop_*:entry { self->t = timestamp; }' \
           -n 'fbt::fop_*:return /self->t/ { @[probefunc] = quantize((timestamp - self->t) / 1000000000); self->t = 0; }' \
           -n 'tick-10s { printa(@); }'

[after some time]
[only the longer ops shown]

  fop_readdir
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 35895
               1 |                                         81
               2 |                                         4
               4 |                                         0

  fop_mkdir
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 864
               1 |                                         9
               2 |                                         5
               4 |                                         0
               8 |                                         0
              16 |                                         1
              32 |                                         2
              64 |                                         2
             128 |                                         0

  fop_space
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 426
               1 |                                         0
               2 |                                         0
               4 |                                         0
               8 |                                         0
              16 |                                         0
              32 |                                         0
              64 |                                         0
             128 |                                         3
             256 |                                         0

  fop_lookup
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1181242
               1 |                                         311
               2 |                                         47
               4 |                                         3
               8 |                                         0

  fop_read
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 100799
               1 |                                         26
               2 |                                         1
               4 |                                         3
               8 |                                         5
              16 |                                         5
              32 |                                         9
              64 |                                         3
             128 |                                         3
             256 |                                         3
             512 |                                         0

  fop_remove
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 16085
               1 |                                         43
               2 |                                         6
               4 |                                         0
               8 |                                         0
              16 |                                         1
              32 |                                         29
              64 |                                         54
             128 |                                         75
             256 |                                         0

  fop_create
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 21883
               1 |@                                        300
               2 |                                         243
               4 |                                         118
               8 |                                         31
              16 |                                         15
              32 |                                         69
              64 |                                         228
             128 |@                                        359
             256 |                                         1
             512 |                                         0

  fop_symlink
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8067
               1 |@                                        215
               2 |@                                        183
               4 |                                         114
               8 |                                         47
              16 |                                         6
              32 |                                         35
              64 |@                                        180
             128 |@@@                                      689
             256 |                                         2
             512 |                                         0

  fop_write
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 134052
               1 |                                         174
               2 |                                         20
               4 |                                         1
               8 |                                         3
              16 |                                         179
              32 |                                         148
              64 |                                         412
             128 |                                         632
             256 |                                         0
^C

And the same environment, but
on UFS (both are NFS servers, on the same hardware):

bash-3.00# dtrace -n 'fbt::fop_*:entry { self->t = timestamp; }' \
           -n 'fbt::fop_*:return /self->t/ { @[probefunc] = quantize((timestamp - self->t) / 1000000000); self->t = 0; }' \
           -n 'tick-10s { printa(@); }'

[after some time]
[only ops over 1s shown]

  fop_putpage
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 540731
               1 |                                         1
               2 |                                         0

  fop_read
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 122344
               1 |                                         4
               2 |                                         6
               4 |                                         0
               8 |                                         0
              16 |                                         0
              32 |                                         0
              64 |                                         0
             128 |                                         0
             256 |                                         1
             512 |                                         0
^C

Well, this looks much better on ufs/nfsd than on zfs/nfsd. The hardware
is the same, the workload is the same, at the same time -- and the ZFS
server is the one with zil_disable=1. Under a smaller load ZFS rocks;
under a higher load it "suxx" :( At least in an nfsd environment.

--
Best regards,
 Robert                        mailto:rmilkowski at task.gda.pl
                                     http://milek.blogspot.com
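[Note the quantize buckets above are in whole seconds, because of the
divide by 1000000000. A variant of the same one-liner at millisecond
resolution -- a sketch, nothing else changed -- would separate 100ms
outliers from genuine multi-second hangs:]

    # Same VOP-latency histogram, but bucketed in milliseconds so
    # sub-second outliers show up instead of collapsing into bucket 0.
    dtrace -n 'fbt::fop_*:entry { self->t = timestamp; }' \
           -n 'fbt::fop_*:return /self->t/ { @[probefunc] = quantize((timestamp - self->t) / 1000000); self->t = 0; }' \
           -n 'tick-10s { printa(@); }'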
Dickon Hood wrote:
> [snip]
>
> I'm currently playing with ZFS on a T2000 with 24x500GB SATA discs in
> an external array that presents as SCSI. After having much 'fun' with
> the Solaris SCSI driver not handling LUNs >2TB

That should work if you have the latest KJP and friends. (Actually, it
should have been working for a while, so if not....) What release are
you on?
On Mon, Apr 23, 2007 at 17:43:31 -0400, Torrey McMahon wrote:
: Dickon Hood wrote:
: >[snip]
: >I'm currently playing with ZFS on a T2000 with 24x500GB SATA discs in
: >an external array that presents as SCSI. After having much 'fun' with
: >the Solaris SCSI driver not handling LUNs >2TB

: That should work if you have the latest KJP and friends. (Actually, it
: should have been working for a while, so if not....) What release are
: you on?

Google suggested it may or may not, depending on how lucky I was. I
assume I was just unlucky, or didn't find the correct set of patches.
Actually, I thought I had at one point, but writes past the first 2TB
returned I/O errors.

I tried every recentish version on our Jumpstart server -- 0305, 0606,
and 1106 -- with and without the latest 10_Recommended patch cluster,
and with various other sd patches I could find. Which versions exactly,
I couldn't honestly say; I gave up.

1106 out of the box won't even see the SCSI card with a 2TB LUN, which
has some interesting side effects when installing: the expansion cards
appear first, and if the installer can't see one, suddenly your boot
devices change once the system is patched.

I got one combination -- sorry, I don't recall which, but I think it
was an 0606 with a patch -- to see the device, but as I say, writes
beyond 2TB failed with an I/O error. This is unhelpful.

I gave up and restructured the array to export all the discs
individually, as I said. AIUI, that's better for ZFS anyway.

--
Dickon Hood

Due to digital rights management, my .sig is temporarily unavailable.
Normal service will be resumed as soon as possible. We apologise for
the inconvenience in the meantime.

No virus was found in this outgoing message as I didn't bother looking.
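[A quicker way to reproduce the >2TB write failure than filling the LUN
overnight with dd streams is to seek straight past the boundary on the
raw device. A sketch: the device path is hypothetical, and this writes
to the raw LUN, destroying whatever is there, so only use a scratch
device:]

    # oseek counts output blocks of the 1024k block size, so 2097153
    # blocks places the single 1MB write just past the 2TiB mark; on a
    # broken driver stack this fails with an I/O error immediately.
    dd if=/dev/zero of=/dev/rdsk/c2t0d0s0 bs=1024k oseek=2097153 count=1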