Nathan Kroenert
2011-Feb-13 08:56 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
Hi all,

Exec summary: I have a situation where I'm seeing lots of large reads starving writes from being able to get through to disk.

Some detail:
I have a newly constructed box (was an old box, but blew the mobo - different story - sigh).

Anyhoo - it's a Gigabyte 890GPA-UD3H - with lots of onboard SATA - and an HP P400 RAID controller (PCI-E, 512MB, battery backed, presenting 2 spindles as single-member stripes, so, yeah, the nearest thing to JBOD that this controller gets to):

    pci bus 0x0002 cardnum 0x00 function 0x00: vendor 0x103c device 0x3230
     Hewlett-Packard Company Smart Array Controller

And it's off this HP controller I'm hanging my data zpool.

config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0

CPU is an AMD Phenom II, 6-core 1075T, for what it's worth.

I guess my problem is more one that the ZFS folks should be aware of rather than something directly impacting me, as the workload I have created is not something I typically see - but it is something I can see easily impacting customers - and in a nasty way should they encounter it. It *is* also a case I'll create from time to time - when I'm moving DVD images backwards and forwards...

I was stress testing the box, giving the new kit's legs a stretch, and kicked off the following:
 - create a test file to use as source for my 'full speed streaming write' (lazy way):
     dd if=/dev/urandom > /tmp/1
   (and let that run for a few seconds, creating about 100MB of random junk)
 - start some jobs:
     while :; do cat /tmp/1 >> /data/delete.me/2; done &
   (the write workload, which is fine and dandy by itself)
     while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &

Before I kicked off the read workload, everything looked as expected. I was getting between 40 and 60MB/s to each of the disks and all was good. BUT - as soon as I introduced the read workload, my write throughput dropped to virtually zero, and remained there until the read workload was killed.

The starvation is immediate. I can 100% reproducibly go from many MB/s of write throughput with no read workload to virtually 0MB/s write throughput, simply by kicking off that reading dd. Write performance picks up again as soon as I kill the read workload. It also behaves the same way if the file I'm reading is NOT the same one I'm writing to (e.g. cat >> file3 while the dd reads file 2).

Other things to know about the system:
 - Disks are Seagate 2TB, 512-byte-sector SATA disks
 - OS is Solaris 11 Express (build 151a)
 - zpool version is old. I'm still hedging my bets on having to go back to Nevada (SXCE, build 124 or so, which is what I was at before installing S11 Express):
     Cached configuration:
             version: 19
 - Plenty of space remains in the pool:
     bash-4.0$ zpool list
     NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
     data  1.81T  1.34T  480G  74%  1.00x  ONLINE  -
 - The box has 8GB of memory - and ZFS is getting a fair whack at it.
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     211843               827   11%
ZFS File Data             1426054              5570   73%
Anon                       106814               417    5%
Exec and libs                9364                36    0%
Page cache                  47192               184    2%
Free (cachelist)            31448               122    2%
Free (freelist)            130431               509    7%

Total                     1963146              7668
Physical                  1963145              7668

 - Rest of the zfs dataset properties:

# zfs get all data
NAME  PROPERTY               VALUE                  SOURCE
data  type                   filesystem             -
data  creation               Mon May 24 10:46 2010  -
data  used                   1.34T                  -
data  available              451G                   -
data  referenced             500G                   -
data  compressratio          1.02x                  -
data  mounted                yes                    -
data  quota                  none                   default
data  reservation            none                   default
data  recordsize             128K                   default
data  mountpoint             /data                  default
data  sharenfs               ro,anon=0              local
data  checksum               on                     default
data  compression            off                    local
data  atime                  off                    local
data  devices                on                     default
data  exec                   on                     default
data  setuid                 on                     default
data  readonly               off                    default
data  zoned                  off                    default
data  snapdir                hidden                 default
data  aclinherit             restricted             default
data  canmount               on                     default
data  xattr                  on                     default
data  copies                 1                      default
data  version                3                      -
data  utf8only               off                    -
data  normalization          none                   -
data  casesensitivity        sensitive              -
data  vscan                  off                    default
data  nbmand                 off                    default
data  sharesmb               off                    default
data  refquota               none                   default
data  refreservation         none                   local
data  primarycache           all                    default
data  secondarycache         all                    default
data  usedbysnapshots        12.2G                  -
data  usedbydataset          500G                   -
data  usedbychildren         864G                   -
data  usedbyrefreservation   0                      -
data  logbias                latency                default
data  dedup                  off                    default
data  mlslabel               none                   default
data  sync                   standard               default
data  encryption             off                    -
data  keysource              none                   default
data  keystatus              none                   -
data  rekeydate              -                      default
data  rstchown               on                     default
data  com.sun:auto-snapshot  true                   local

Obviously, the potential for performance issues is considerable - and should it be required, I can provide some other detail, but given that this is so easy to reproduce, I thought I'd get it out there, just in case.

It is also worthy of note that commands like 'zfs list' take anywhere from 20 to 40 seconds to run when I have that sort of workload running - which also seems less than optimal.

I tried to recreate this issue on the boot pool (rpool), which is a single 2.5" 7200rpm disk (to take the cache controller out of the configuration) - but this seemed to hard-hang the system (yep - even caps lock / num lock were non-responsive) - and I did not have any watchdog/snooping set up, and ran out of steam myself, so just hit the big button.

When I get the chance, I'll give the rpool thing a crack again, but overall, it seems to me that the behaviour I'm observing is not great...

I'm also happy to supply lockstat / dtrace output etc. if it'll help.

Thoughts?

Cheers!

Nathan.
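For anyone who wants to reproduce this, the test boils down to something like the sketch below - a rough consolidation of the steps described above, with the paths, sizes and dataset names being just the examples from this setup:

    #!/bin/sh
    # Sketch: reproduce the read-starves-write behaviour described above.
    # Assumes a ZFS dataset mounted at /data with a scratch directory.
    mkdir -p /data/delete.me

    # ~100MB of incompressible junk as the streaming-write source.
    dd if=/dev/urandom of=/tmp/1 bs=1024k count=100

    # Write workload: append the junk file forever.
    while :; do cat /tmp/1 >> /data/delete.me/2; done &
    WRITER=$!

    # Let the writer run alone for a bit; writes flow at full speed.
    sleep 30

    # Read workload: stream the (now large) file back.
    while :; do dd if=/data/delete.me/2 of=/dev/null bs=65536; done &
    READER=$!

    # Clean up with: kill $WRITER $READER

Watching 'iostat -x 1' in another terminal while this runs should show kw/s collapsing within a second or two of the reader starting, per the report above.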
Richard Elling
2011-Feb-13 17:31 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
On Feb 13, 2011, at 12:56 AM, Nathan Kroenert <nathan at tuneunix.com> wrote:

> Hi all,
>
> Exec summary: I have a situation where I'm seeing lots of large reads starving writes from being able to get through to disk.
>
> Some detail:
> I have a newly constructed box (was an old box, but blew the mobo - different story - sigh).
>
> Anyhoo - it's a Gigabyte 890GPA-UD3H - with lots of onboard SATA - and an HP P400 RAID controller (PCI-E, 512MB, battery backed, presenting 2 spindles as single-member stripes, so, yeah, the nearest thing to JBOD that this controller gets to)

What is the average service time of each disk? Multiply that by the average active queue depth. If that number is greater than, say, 100ms, then the ZFS I/O scheduler is not able to be very effective because the disks are too slow. Reducing the active queue depth can help; see zfs_vdev_max_pending in the ZFS Evil Tuning Guide. Faster disks help, too.

NexentaStor fans, note that you can do this easily, on the fly, via the Settings -> Preferences -> System web GUI.
 -- richard
> <snip - remainder of original message quoted in full>
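For reference, the knob Richard points at can be read and changed on a live Solaris system roughly like this (the mdb incantations are the ones documented in the ZFS Evil Tuning Guide; the value 2 below is just an example, and the live change does not survive a reboot):

    # Check the current value (decimal)
    echo zfs_vdev_max_pending/D | mdb -k

    # Change it on the fly (here to 2); takes effect immediately
    echo zfs_vdev_max_pending/W0t2 | mdb -kw

    # To make it persistent across reboots, add to /etc/system:
    #   set zfs:zfs_vdev_max_pending = 2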
Nathan Kroenert
2011-Feb-14 04:04 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
Hi Steve,

Thanks for the thoughts - I think that everything you asked about is in the original email - but for reference again, it's 151a (S11 Express).

Are you really suggesting, for a single user system, I need 16GB of memory just to get ZFS to be able to write when it's reading? (And even then, that would be contingent on getting repeat, cached hits in the ARC.) That's hardly sensible, and anything but enterprise. I know I'm only talking about my little baby box at the moment, but extend that to a large database application, and I'm seeing badness all round.

Worse - if I'm reading a 45GB contiguous file (say, HD video), the only way the ARC will help me is if I have 64GB, and have read it in the past... especially if I'm reading it sequentially. That's inconceivable!! (cue reference to The Princess Bride :)

I'd also add that for the most part, 8GB is plenty for ZFS, and there are a lot of Sun/Oracle customers using it now in LDOM environments where 8GB is just great in the control/IO domain.

I don't think trying to blame the system in this case is the right answer. ZFS schedules the read/write activities, and to me it seems that it's just not doing that fairly.

I was suspicious of the impact the HP RAID controller was having - and how it might be reacting to what's being pushed at it - so re-created exactly this problem again on a different system with native, non-cached SATA controllers. The issue is identical. (Though I have since determined that my HP RAID controller is actually *slowing* my reads and writes to disk! ;)

Cheers!

Nathan.

On 14/02/2011 4:08 AM, gonczi at comcast.net wrote:
> Hi Nathan,
>
> Maybe it is buried somewhere in your email, but I did not see what
> zfs version you are using.
>
> This is rather important, because the 145+ kernels work a lot better
> in many ways than the early ones (say 134-ish).
>
> So whenever you are reporting various ZFS issues, something like
> `uname -a` to report the kernel rev is most useful.
>
> Writes starved by reads has been a complaint in early ZFS; I certainly
> do not see any evidence of this in the 145+ kernels.
>
> There is a fair amount of tuning and configuration that can be done
> (adding SSDs to your pool, zil vs no zil, how caching is configured,
> ie what to cache..)
> 8 Gig is not a lot of memory for ZFS; I would recommend double that.
>
> If all goes well, most reads would be satisfied from ARC, and not
> interfere with writes.
>
> Steve
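As an aside on the ARC-sizing point: on Solaris-derived systems you can sanity-check how big the ARC actually is via its kstats. A minimal sketch, assuming the usual arcstats kstat names:

    # Current ARC size, target size, and ceiling, in bytes
    kstat -p zfs:0:arcstats:size
    kstat -p zfs:0:arcstats:c
    kstat -p zfs:0:arcstats:c_max

In this thread's case, the ::memstat output in the first message already shows the ARC holding about 5.5GB of the 8GB, so the ARC is not starved for memory.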
Nathan Kroenert
2011-Feb-14 04:28 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
On 14/02/2011 4:31 AM, Richard Elling wrote:
> On Feb 13, 2011, at 12:56 AM, Nathan Kroenert<nathan at tuneunix.com> wrote:
>
>> Hi all,
>>
>> Exec summary: I have a situation where I'm seeing lots of large reads starving writes from being able to get through to disk.
>>
>> <snip>
> What is the average service time of each disk? Multiply that by the average
> active queue depth. If that number is greater than, say, 100ms, then the ZFS
> I/O scheduler is not able to be very effective because the disks are too slow.
> Reducing the active queue depth can help; see zfs_vdev_max_pending in the
> ZFS Evil Tuning Guide. Faster disks help, too.
>
> NexentaStor fans, note that you can do this easily, on the fly, via the Settings ->
> Preferences -> System web GUI.
>  -- richard

Hi Richard,

Long time no speak! Anyhoo - see below.

I'm unconvinced that faster disks would help. I think faster disks, at least in what I'm observing, would make it suck just as bad, just reading faster... ;) Maybe I'm missing something.

Queue depth is around 10 (default and unchanged since install), and average service time is about 25ms... Below are 1-second samples with iostat - while I have included only about 10 seconds, they are representative of what I'm seeing all the time.

                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        360.9   13.0  46190.5    351.4   0.0  10.0   26.7   1 100
    sd7        342.9   12.0  43887.3    329.9   0.0  10.0   28.1   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100
    sd7        422.1    0.0  54025.0      0.0   0.0  10.0   23.6   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        370.0   11.0  47360.4    342.0   0.0  10.0   26.2   1 100
    sd7        327.0   16.0  41856.4    632.0   0.0   9.6   28.0   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        388.0    7.0  49406.4    290.0   0.0   9.8   24.8   1 100
    sd7        409.0    1.0  52350.3      2.0   0.0   9.5   23.2   1  99
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        423.0    0.0  54148.6      0.0   0.0  10.0   23.6   1 100
    sd7        413.0    0.0  52868.5      0.0   0.0  10.0   24.2   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        400.0    2.0  51081.2      2.0   0.0  10.0   24.8   1 100
    sd7        384.0    4.0  49153.2      4.0   0.0  10.0   25.7   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        401.9    1.0  51448.9      8.0   0.0  10.0   24.8   1 100
    sd7        424.9    0.0  54392.4      0.0   0.0  10.0   23.5   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        215.1  208.1  26751.9  25433.5   0.0   9.3   22.1   1 100
    sd7        189.1  216.1  24199.1  26833.9   0.0   8.9   22.1   1  91
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        295.0  162.0  37756.8  20610.2   0.0  10.0   21.8   1 100
    sd7        307.0  150.0  39292.6  19198.4   0.0  10.0   21.8   1 100
                     extended device statistics
    device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
    sd6        405.0    2.0  51843.8      6.0   0.0  10.0   24.5   1 100
    sd7        408.0    3.0  52227.8     10.0   0.0  10.0   24.3   1 100

Bottom line is that ZFS does not seem to care about getting my writes to disk when there is a heavy read workload.

I have also confirmed that it's not the RAID controller either - behaviour is identical with direct-attach SATA.

But - to your excellent theory: setting zfs_vdev_max_pending to 1 causes things to swing dramatically!
 - At 1, writes proceed much more than reads - 20MB/s read per spindle : 35MB/s write per spindle
 - At 2, writes still outstrip reads - 15MB/s read per spindle : 44MB/s write
 - At 3, it's starting to lean more heavily to reads again, but writes at least get a whack - 35MB/s per spindle read : 15-20MB/s write
 - At 4, we are closer to 35-40MB/s read, 15MB/s write

By the time we get back to the default of 0xa (10), writes drop off almost completely.

The crossover (on the box with no RAID controller) seems to be 5. Anything more than that, and writes get shouldered out of the way almost completely. (One way to script such a sweep is sketched after this message.)

So - aside from the obvious - manually setting zfs_vdev_max_pending - do you have any thoughts on ZFS being able to make this sort of determination by itself? It would be somewhat of a shame to have to bust out such 'whacky knobs' for plain old direct-attach SATA disks just to get balance...

Also - can I set this property per-vdev? (Just in case I have SATA and, say, a USP-V connected to the same box.)

Thanks again, and good to see you are still playing close by!

Cheers!

Nathan.

>> <snip - original message quoted in full>
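The sweep described in the list above can be scripted. A rough sketch (not the actual commands used in this thread) for a live Solaris box, assuming root privileges and the sd6/sd7 device names from the iostat output:

    #!/bin/sh
    # Sweep zfs_vdev_max_pending and watch the read/write split at each depth.
    # WARNING: pokes live kernel state via mdb - test boxes only.
    for depth in 1 2 3 4 5 10; do
        echo "zfs_vdev_max_pending/W0t${depth}" | mdb -kw
        echo "=== max_pending=${depth} ==="
        # 10 one-second samples per setting; drop the header lines
        iostat -x sd6 sd7 1 10 | grep -v device
    done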
Richard Elling
2011-Feb-14 19:44 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
Hi Nathan,
comments below...

On Feb 13, 2011, at 8:28 PM, Nathan Kroenert wrote:

> On 14/02/2011 4:31 AM, Richard Elling wrote:
>> <snip>
>
> Hi Richard,
>
> Long time no speak! Anyhoo - see below.
>
> I'm unconvinced that faster disks would help. I think faster disks, at least in what I'm observing, would make it suck just as bad, just reading faster... ;) Maybe I'm missing something.

Faster disks always help :-)

> Queue depth is around 10 (default and unchanged since install), and average service time is about 25ms... Below are 1-second samples with iostat - while I have included only about 10 seconds, they are representative of what I'm seeing all the time.
>
>                      extended device statistics
>     device       r/s    w/s     kr/s     kw/s  wait  actv  svc_t  %w  %b
>     sd6        360.9   13.0  46190.5    351.4   0.0  10.0   26.7   1 100
>     sd7        342.9   12.0  43887.3    329.9   0.0  10.0   28.1   1 100

ok, we'll take sd6 as an example (the math is easy :-) ...

    actv  = 10
    svc_t = 26.7

    actv * svc_t = 267 milliseconds

This is the queue at the disk. ZFS manages its own queue for the disk, but once an I/O leaves ZFS, there is no way for ZFS to manage it. In the case of the active queue, the I/Os have left the OS, so even the OS is unable to change what is in the queue or directly influence when the I/Os will be finished.

In ZFS, the queue has a priority scheduler and does place a higher priority on async writes than async reads (since b130 or so). But what you can see is that the intermittent nature of the async writes gets them stuck behind the 267 milliseconds as the queue drains the reads. [no, I'm not sure if that makes sense, try again...] If it sends reads continuously and writes occasionally, it will appear that reads dominate much more.
In older releases, when the reads and writes had the same priority, this looked even worse.

> <snip - further iostat samples>
>
> Bottom line is that ZFS does not seem to care about getting my writes to disk when there is a heavy read workload.
>
> I have also confirmed that it's not the RAID controller either - behaviour is identical with direct-attach SATA.
>
> But - to your excellent theory: setting zfs_vdev_max_pending to 1 causes things to swing dramatically!
> - At 1, writes proceed much more than reads - 20MB/s read per spindle : 35MB/s write per spindle
> - At 2, writes still outstrip reads - 15MB/s read per spindle : 44MB/s write

Though the NexentaStor docs recommend "1" for SATA disks, I find that "2" works better.

> - At 3, it's starting to lean more heavily to reads again, but writes at least get a whack - 35MB/s per spindle read : 15-20MB/s write
> - At 4, we are closer to 35-40MB/s read, 15MB/s write

Isn't queuing theory fun! :-)

> By the time we get back to the default of 0xa (10), writes drop off almost completely.
>
> The crossover (on the box with no RAID controller) seems to be 5. Anything more than that, and writes get shouldered out of the way almost completely.
>
> So - aside from the obvious - manually setting zfs_vdev_max_pending - do you have any thoughts on ZFS being able to make this sort of determination by itself? It would be somewhat of a shame to have to bust out such 'whacky knobs' for plain old direct-attach SATA disks just to get balance...
>
> Also - can I set this property per-vdev? (Just in case I have SATA and, say, a USP-V connected to the same box.)

Today, there is not a per-vdev setting. There are several changes in the works for this and other scheduling.

Incidentally, you can change the priorities on the fly, so you could experiment with different settings for mixed workloads. Obviously, non-mixed workloads won't be very interesting.
Also FWIW, for SATA disks in particular, it is not unusual for us to recommend dropping zfs_vdev_max_pending to 2. It can make a big difference for some workloads.
 -- richard
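Richard's rule of thumb from earlier in the thread (average active queue depth x average service time greater than ~100ms means the ZFS scheduler has little room to work) is easy to check mechanically. A minimal sketch, assuming the classic 'iostat -x' column layout quoted above:

    # Flag devices whose in-flight time (actv * svc_t) exceeds ~100ms.
    # First iostat report is the since-boot average; the second is live.
    iostat -x 5 2 | awk '
        $1 ~ /^sd|^c[0-9]/ && $7 * $8 > 100 {
            printf "%s: actv=%s svc_t=%s => %.0f ms in flight\n", $1, $7, $8, $7 * $8
        }'

For the sd6 sample above this prints roughly 267ms, well past the threshold, which is why reducing zfs_vdev_max_pending restores balance.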
Nathan Kroenert
2011-Feb-15 00:27 UTC
[zfs-discuss] ZFS read/write fairness algorithm for single pool
Thanks for all the thoughts, Richard.

One thing that still sticks in my craw is that I'm not wanting to write intermittently. I'm wanting to write flat out, and those writes are being held up... It seems to me that ZFS should know about that, and do something about it, without me needing to tune zfs_vdev_max_pending...

Nonetheless, I'm now at a far more balanced point than when I started, so that's a good thing. :)

Cheers,

Nathan.

On 15/02/2011 6:44 AM, Richard Elling wrote:
> Hi Nathan,
> comments below...
>
> <snip>