Joe Little
2007-Nov-17 00:35 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
I have historically noticed that in ZFS, whenever there is a heavy writer to a pool via NFS, the reads can be held back (basically paused). An example is a RAID10 pool of 6 disks, whereby a directory of files, including some large ones 100+MB in size, being written can cause other clients over NFS to pause for seconds (5-30 or so). This is on B70 bits. I've gotten used to this behavior over NFS, but didn't see it perform as such when on the server itself doing similar actions.

To improve upon the situation, I thought perhaps I could dedicate a log device outside the pool, in the hope that while heavy writes went to the log device, reads would merrily be allowed to coexist from the pool itself. My test case isn't ideal per se, but I added a local 9GB SCSI (80) drive for a log, and added two LUNs for the pool itself. You'll see from the below that while the log device is pegged at 15MB/sec (sd5), my directory-list requests on devices sd15 and sd16 are never answered. I tried this with both no-cache-flush enabled and off, with negligible difference. Is there any way to force a better balance of reads/writes during heavy writes?

                 extended device statistics
device    r/s    w/s   kr/s     kw/s  wait  actv  svc_t  %w  %b
fd0       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd0       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd1       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd2       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd3       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd4       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd5       0.0  118.0    0.0  15099.9   0.0  35.0  296.7   0 100
sd6       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd7       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd8       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd9       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd10      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd11      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd12      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd13      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd14      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd15      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd16      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0

The next eight samples were identical except for the sd5 line; every other device stayed at zero throughout:

sd5       0.0  117.0    0.0  14970.1   0.0  35.0  299.2   0 100
sd5       0.0  118.1    0.0  15111.9   0.0  35.0  296.4   0 100
sd5       0.0  116.9    0.0  14968.9   0.0  35.0  299.3   0 100
sd5       0.0  118.0    0.0  15103.8   0.0  35.0  296.6   0 100
sd5       0.0  117.1    0.0  14983.9   0.0  35.0  299.0   0 100
sd5       0.0  117.9    0.0  15095.3   0.0  35.0  296.8   0 100
sd5       0.0  117.0    0.0  14977.6   0.0  35.0  299.1   0 100
sd5       0.0  118.0    0.0  15108.8   0.0  35.0  296.5   0 100
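[For anyone reproducing the setup: attaching a separate intent-log device is a one-liner with the standard zpool syntax. A minimal sketch; the pool name "tank" and the device name c0t5d0 are placeholders, since the actual names aren't given above.]

    # Attach the 9GB SCSI drive as a dedicated slog (separate intent log):
    zpool add tank log c0t5d0

    # Verify: the log appears as its own top-level vdev in the layout.
    zpool status tank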
Neil Perrin
2007-Nov-17 05:13 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
Joe,

I don't think adding a slog helped in this case. In fact I believe it made performance worse. Previously the ZIL would be spread out over all devices, but now all synchronous traffic is directed at one device (and everything is synchronous in NFS). Mind you, 15MB/s seems a bit on the slow side - especially if cache flushing is disabled.

It would be interesting to see what all the threads are waiting on. I think the problem may be that everything is backed up waiting to start a transaction, because the txg train is slow due to NFS requiring the ZIL to push everything synchronously.

Neil.

Joe Little wrote:
> I have historically noticed that in ZFS, whenever there is a heavy
> writer to a pool via NFS, the reads can be held back (basically paused).
> An example is a RAID10 pool of 6 disks, whereby a directory of files,
> including some large ones 100+MB in size, being written can cause other
> clients over NFS to pause for seconds (5-30 or so). This is on B70 bits.
> I've gotten used to this behavior over NFS, but didn't see it perform
> as such when on the server itself doing similar actions.
>
> To improve upon the situation, I thought perhaps I could dedicate a
> log device outside the pool, in the hope that while heavy writes went
> to the log device, reads would merrily be allowed to coexist from the
> pool itself. My test case isn't ideal per se, but I added a local 9GB
> SCSI (80) drive for a log, and added two LUNs for the pool itself.
> You'll see from the below that while the log device is pegged at
> 15MB/sec (sd5), my directory-list requests on devices sd15 and sd16
> are never answered. I tried this with both no-cache-flush enabled and
> off, with negligible difference. Is there any way to force a better
> balance of reads/writes during heavy writes?
>
>                  extended device statistics
> device    r/s    w/s   kr/s     kw/s  wait  actv  svc_t  %w  %b
> ...
> sd5       0.0  118.0    0.0  15099.9   0.0  35.0  296.7   0 100
> ...
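[Neil's "see what all the threads are waiting on" can be answered directly with mdb on the server; a sketch, assuming it is run while the heavy write is in flight.]

    # Snapshot every kernel thread's stack:
    echo "::threadlist -v" | mdb -k > /var/tmp/threads.txt

    # Stacks parked in txg_wait_open() are threads blocked waiting to
    # enter a transaction group; stacks in zil_commit() are waiting on
    # the intent log itself.
    grep -c txg_wait_open /var/tmp/threads.txt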
Joe Little
2007-Nov-17 05:17 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
On Nov 16, 2007 9:13 PM, Neil Perrin <Neil.Perrin at sun.com> wrote:
> Joe,
>
> I don't think adding a slog helped in this case. In fact I
> believe it made performance worse. Previously the ZIL would be
> spread out over all devices, but now all synchronous traffic
> is directed at one device (and everything is synchronous in NFS).
> Mind you, 15MB/s seems a bit on the slow side - especially if
> cache flushing is disabled.
>
> It would be interesting to see what all the threads are waiting
> on. I think the problem may be that everything is backed
> up waiting to start a transaction, because the txg train is
> slow due to NFS requiring the ZIL to push everything synchronously.
>

I agree completely. The log (even though slow) was an attempt to isolate writes away from the pool. I guess the question is how to provide for async access for NFS. We may have 16, 32 or whatever threads, but if a single writer keeps the ZIL pegged and prohibits reads, it's all for naught. Is there any way to tune/configure the ZFS/NFS combination to balance reads/writes so as not to starve one for the other? It's either feast or famine, or so tests have shown.

> Neil.
>
> Joe Little wrote:
> > I have historically noticed that in ZFS, whenever there is a heavy
> > writer to a pool via NFS, the reads can be held back (basically paused).
> > ...
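[On the "16, 32 or whatever threads": the NFS server's thread ceiling is tunable, though more threads won't help if they all end up blocked in the ZIL, which is the point here. A sketch; 256 is an illustrative value.]

    # /etc/default/nfs -- raise the maximum number of nfsd threads:
    NFSD_SERVERS=256

    # Then restart the NFS server to pick up the change:
    svcadm restart svc:/network/nfs/server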
Joe Little
2007-Nov-17 05:52 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
On Nov 16, 2007 9:17 PM, Joe Little <jmlittle at gmail.com> wrote:
> On Nov 16, 2007 9:13 PM, Neil Perrin <Neil.Perrin at sun.com> wrote:
> > Joe,
> >
> > I don't think adding a slog helped in this case. In fact I
> > believe it made performance worse. Previously the ZIL would be
> > spread out over all devices, but now all synchronous traffic
> > is directed at one device (and everything is synchronous in NFS).
> > Mind you, 15MB/s seems a bit on the slow side - especially if
> > cache flushing is disabled.
> >
> > It would be interesting to see what all the threads are waiting
> > on. I think the problem may be that everything is backed
> > up waiting to start a transaction, because the txg train is
> > slow due to NFS requiring the ZIL to push everything synchronously.
> >

Roch wrote this before (thus my interest in the log or NVRAM-like solution):

"There are 2 independent things at play here.

a) NFS sync semantics conspire against single-thread performance with any backend filesystem. However, NVRAM normally offers some relief of the issue.

b) ZFS sync semantics, along with the storage software and the imprecise protocol in between, conspire against ZFS performance for some workloads on NVRAM-backed storage, NFS being one of the affected workloads.

The conjunction of the 2 causes worse than expected NFS performance over a ZFS backend running __on NVRAM-backed storage__. If you are not considering NVRAM storage, then I know of no ZFS/NFS-specific problems.

Issue b) is being dealt with by both Solaris and storage vendors (we need a refined protocol); issue a) is not related to ZFS and is rather a fundamental NFS issue. Maybe a future NFS protocol will help.

Net net: if one finds a way to 'disable cache flushing' on the storage side, then one reaches the state we'll be in, out of the box, when b) is implemented by Solaris _and_ the storage vendor. At that point, ZFS becomes a fine NFS server not only on JBOD, as it is today, but also on NVRAM-backed storage.

It's complex enough, I thought it was worth repeating."

> I agree completely. The log (even though slow) was an attempt to
> isolate writes away from the pool. I guess the question is how to
> provide for async access for NFS. We may have 16, 32 or whatever
> threads, but if a single writer keeps the ZIL pegged and prohibits
> reads, it's all for naught. Is there any way to tune/configure the
> ZFS/NFS combination to balance reads/writes so as not to starve one
> for the other? It's either feast or famine, or so tests have shown.
>
> > Neil.
> >
> > Joe Little wrote:
> > > I have historically noticed that in ZFS, whenever there is a heavy
> > > writer to a pool via NFS, the reads can be held back (basically paused).
> > > ...
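[The "disable cache flushing on the storage side" workaround Roch describes maps to a host-side tunable on these builds. A sketch; the tunable name changed across Nevada builds (zil_noflush on older bits, zfs_nocacheflush on later ones), and it is only safe when every device in the pool really has non-volatile cache.]

    * /etc/system: stop ZFS from issuing cache-flush requests to the
    * storage (safe only with battery-backed/NVRAM write caches):
    set zfs:zfs_nocacheflush = 1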
Neil Perrin
2007-Nov-17 06:41 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
Joe Little wrote:
> On Nov 16, 2007 9:13 PM, Neil Perrin <Neil.Perrin at sun.com> wrote:
>> Joe,
>>
>> I don't think adding a slog helped in this case. In fact I
>> believe it made performance worse. Previously the ZIL would be
>> spread out over all devices, but now all synchronous traffic
>> is directed at one device (and everything is synchronous in NFS).
>> Mind you, 15MB/s seems a bit on the slow side - especially if
>> cache flushing is disabled.
>>
>> It would be interesting to see what all the threads are waiting
>> on. I think the problem may be that everything is backed
>> up waiting to start a transaction, because the txg train is
>> slow due to NFS requiring the ZIL to push everything synchronously.
>>
>
> I agree completely. The log (even though slow) was an attempt to
> isolate writes away from the pool. I guess the question is how to
> provide for async access for NFS. We may have 16, 32 or whatever
> threads, but if a single writer keeps the ZIL pegged and prohibits
> reads, it's all for naught. Is there any way to tune/configure the
> ZFS/NFS combination to balance reads/writes so as not to starve one
> for the other? It's either feast or famine, or so tests have shown.

No, there's currently no way to give reads preference over writes. All transactions get equal priority to enter a transaction group. Three txgs can be outstanding, as we use a 3-phase commit model: open, quiescing, and syncing.

Neil.
Joe Little
2007-Nov-17 14:52 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
On Nov 16, 2007 10:41 PM, Neil Perrin <Neil.Perrin at sun.com> wrote:
>
> Joe Little wrote:
> > On Nov 16, 2007 9:13 PM, Neil Perrin <Neil.Perrin at sun.com> wrote:
> >> Joe,
> >>
> >> I don't think adding a slog helped in this case. In fact I
> >> believe it made performance worse. ...
> >
> > I agree completely. The log (even though slow) was an attempt to
> > isolate writes away from the pool. ... Is there any way to
> > tune/configure the ZFS/NFS combination to balance reads/writes so
> > as not to starve one for the other? It's either feast or famine,
> > or so tests have shown.
>
> No, there's currently no way to give reads preference over writes.
> All transactions get equal priority to enter a transaction group.
> Three txgs can be outstanding, as we use a 3-phase commit model:
> open, quiescing, and syncing.
>

Any way to improve the balance? It would appear that zil_disable is still a requirement to get NFS to behave in a practical "real world" way with ZFS. Even with zil_disable, we end up with periods of pausing on the heaviest of writes, and then I think it's mostly just ZFS having too much outstanding I/O to commit.

If zil_disable is enabled, is the slog disk ignored?

> Neil.
>
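[For reference, the zil_disable switch under discussion is the build-70-era kernel tunable documented in the Evil Tuning Guide. A sketch of both forms; test rigs only, since with the ZIL off a server crash can silently lose NFS writes the client believed committed.]

    * /etc/system -- takes effect at the next boot:
    set zfs:zil_disable = 1

    # Or flip it live with mdb (filesystems must be remounted to notice):
    echo "zil_disable/W 1" | mdb -kw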
Richard Elling
2007-Nov-18 21:44 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
one more thing...

Joe Little wrote:
> I have historically noticed that in ZFS, whenever there is a heavy
> writer to a pool via NFS, the reads can be held back (basically paused).
> ...
> Is there any way to force a better
> balance of reads/writes during heavy writes?
>
>                  extended device statistics
> device    r/s    w/s   kr/s     kw/s  wait  actv  svc_t  %w  %b
> ...
> sd5       0.0  118.0    0.0  15099.9   0.0  35.0  296.7   0 100

When you see actv = 35 and svc_t > ~20, then it is possible that you can improve performance by reducing the zfs_vdev_max_pending queue depth. See
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29

This will be particularly true for JBODs.

Doing a little math (118 writes/s at ~15,100 KBytes/s works out to ~128 KBytes per write, with 35 of them outstanding), there is ~4.5 MBytes queued in the drive waiting to be written. 4.5 MBytes isn't much for a typical RAID array, but for a disk it is often a sizeable chunk of its available cache. A 9 GByte disk, being rather old, has a pretty wimpy microprocessor, so you are basically beating the poor thing senseless. Reducing the queue depth will allow the disk to perform more efficiently.
 -- richard
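[The knob Richard names is set as follows; a sketch per the linked guide, with 10 as an illustrative value (the era's default of 35 is exactly the actv figure in the iostat above).]

    # Live change via mdb ("0t" marks the value as decimal):
    echo "zfs_vdev_max_pending/W 0t10" | mdb -kw

    * Or persistently, in /etc/system:
    set zfs:zfs_vdev_max_pending = 10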
Joe Little
2007-Nov-19 01:40 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
On Nov 18, 2007 1:44 PM, Richard Elling <Richard.Elling at sun.com> wrote:
> one more thing...
>
> Joe Little wrote:
> > I have historically noticed that in ZFS, whenever there is a heavy
> > writer to a pool via NFS, the reads can be held back (basically paused).
> > ...
> >
> >                  extended device statistics
> > device    r/s    w/s   kr/s     kw/s  wait  actv  svc_t  %w  %b
> > ...
> > sd5       0.0  118.0    0.0  15099.9   0.0  35.0  296.7   0 100
>
> When you see actv = 35 and svc_t > ~20, then it is possible that
> you can improve performance by reducing the zfs_vdev_max_pending
> queue depth. See
> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
>
> This will be particularly true for JBODs.
>
> Doing a little math, there is ~4.5 MBytes queued in the drive
> waiting to be written. 4.5 MBytes isn't much for a typical RAID
> array, but for a disk it is often a sizeable chunk of its
> available cache. A 9 GByte disk, being rather old, has a pretty
> wimpy microprocessor, so you are basically beating the poor thing
> senseless. Reducing the queue depth will allow the disk to perform
> more efficiently.

I'll be trying an 18G 10K drive tomorrow. Again, the test was simply to see if, by having a slog, I'd enable NFS to allow for concurrent reads and writes. Especially in the iSCSI case, but even with JBOD, I find _any_ heavy writing completely postpones reads to NFS clients. This makes ZFS and NFS impractical under I/O duress. My goal was simply to see how things work. It appears from Neil's answer that it won't, and that the per-filesystem synchronicity RFE is what is needed, or at least zil_disable, for NFS to be practically usable currently. As for max_pending, I did try lowering that without any success (for values of 10 and 20) on a JBOD.

> -- richard
>
Roch - PAE
2007-Nov-19 17:41 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
Neil Perrin writes:
> ...
> No, there's currently no way to give reads preference over writes.
> All transactions get equal priority to enter a transaction group.
> Three txgs can be outstanding, as we use a 3-phase commit model:
> open, quiescing, and syncing.

That makes me wonder if this is not just the lack-of-write-throttling issue. If one txg is syncing and the other is quiesced out, I think it means we have let in too many writes. We do need a better balance.

Neil, is it correct that reads never hit txg_wait_open(), but they just need an I/O scheduler slot?

If so, it seems to me just a matter of

6429205 each zpool needs to monitor its throughput and throttle heavy writers

However, if this is it, disabling the ZIL would not solve the issue (it might even make it worse). So I am lost as to what could be blocking the reads, other than lack of I/O slots. As another way to improve the I/O scheduler we have:

6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops

-r
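[Roch's question about whether anything actually sits in txg_wait_open() is directly observable; a sketch with DTrace's fbt provider, assuming that function isn't inlined on this build.]

    # Count, per process, how often threads block waiting for an open
    # txg during the heavy write (Ctrl-C prints the aggregation):
    dtrace -n 'fbt::txg_wait_open:entry { @[execname] = count(); }'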
Joe Little
2007-Nov-19 18:39 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
On Nov 19, 2007 9:41 AM, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
>
> Neil Perrin writes:
> > ...
>
> That makes me wonder if this is not just the lack-of-write-throttling
> issue. If one txg is syncing and the other is quiesced out, I think
> it means we have let in too many writes. We do need a better balance.
>
> Neil, is it correct that reads never hit txg_wait_open(), but they
> just need an I/O scheduler slot?
>
> If so, it seems to me just a matter of
>
> 6429205 each zpool needs to monitor its throughput and throttle heavy writers
>
> However, if this is it, disabling the ZIL would not solve the
> issue (it might even make it worse). So I am lost as to
> what could be blocking the reads, other than lack of I/O
> slots. As another way to improve the I/O scheduler we have:
>
> 6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops
>

So, when are each of these solutions hitting or expected to hit Nevada? I think the second is the winner, as I see in the iostat either one or the other getting I/O at a time, and under duress it just feels like the system simply doesn't have balance anymore between reads and writes. It's like a smoothly moving road with traffic in both directions; then the construction crew arrives with flagmen who only allow one direction at a time to go.

> -r
>
Neil Perrin
2007-Nov-19 19:13 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
Joe Little wrote:
> On Nov 16, 2007 10:41 PM, Neil Perrin <Neil.Perrin at sun.com> wrote:
>>
>> Joe Little wrote:
>>> ...
>>
>> No, there's currently no way to give reads preference over writes.
>> All transactions get equal priority to enter a transaction group.
>> Three txgs can be outstanding, as we use a 3-phase commit model:
>> open, quiescing, and syncing.
>>
>
> Any way to improve the balance? It would appear that zil_disable is
> still a requirement to get NFS to behave in a practical "real world"
> way with ZFS. Even with zil_disable, we end up with periods of
> pausing on the heaviest of writes, and then I think it's mostly just
> ZFS having too much outstanding I/O to commit.
>
> If zil_disable is enabled, is the slog disk ignored?
>

Yes.
Neil Perrin
2007-Nov-19 19:19 UTC
[zfs-discuss] slog tests on read throughput exhaustion (NFS)
Roch - PAE wrote:
> Neil Perrin writes:
> > ...
> > No, there's currently no way to give reads preference over writes.
> > All transactions get equal priority to enter a transaction group.
> > Three txgs can be outstanding, as we use a 3-phase commit model:
> > open, quiescing, and syncing.
>
> That makes me wonder if this is not just the lack-of-write-throttling
> issue. If one txg is syncing and the other is quiesced out, I think
> it means we have let in too many writes. We do need a better balance.
>
> Neil, is it correct that reads never hit txg_wait_open(), but they
> just need an I/O scheduler slot?

Yes, they don't modify any metadata (except access time, which is handled separately). I'm less clear about what happens further down in the DMU and SPA.