As of 2009.06, what is the policy for reordering ZFS file reads? Consider the following timeline:

T0: Process A issues a read of size 20K and gets its thread switched out.
T1: Process B issues reads of size 8 bytes and gets its thread switched out.

Are the 8-byte reads from B going to fall in queue _behind_ A:
--> if A and B are from separate users?
--> if B is from the system process?

Regards
banks
Per POSIX, there are no read-ordering guarantees for a file with concurrent non-exclusive readers. Use queues/locks in the application if you need ordering like this.

Regards,
Andrey
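A minimal sketch of the application-level ordering Andrey suggests, assuming POSIX threads within a single process: serialize the reads behind one lock so the application, not the filesystem, defines the order. The wrapper name and the process-wide lock are illustrative choices, not anything mandated by POSIX:

    #include <fcntl.h>
    #include <pthread.h>
    #include <unistd.h>

    /* One process-wide lock: whoever acquires it reads next.
     * This imposes the ordering that POSIX itself does not. */
    static pthread_mutex_t read_lock = PTHREAD_MUTEX_INITIALIZER;

    static ssize_t
    ordered_pread(int fd, void *buf, size_t len, off_t off)
    {
            pthread_mutex_lock(&read_lock);
            ssize_t n = pread(fd, buf, len, off);
            pthread_mutex_unlock(&read_lock);
            return (n);
    }

Ordering across separate processes would need the same idea with a shared lock, e.g. a process-shared mutex or an fcntl(2) record lock.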
I was asking from the starvation point of view, to see whether B can be starved by a long burst from A.
Then you're actually asking for a fair I/O scheduler.

Regards,
Andrey
> Then you're actually asking for a fair I/O scheduler.

Yes. Are we currently fair? Is there any good documentation on the priority model as it exists today?
On Mon, Jan 11, 2010 at 9:29 PM, bank kus <kus.bank at gmail.com> wrote:
>> Then you're actually asking for a fair I/O scheduler.
>
> Yes. Are we currently fair? Is there any good documentation on the
> priority model as it exists today?

I doubt it; first come, first served is most common. The same holds for memory as well.

Regards,
Andrey
> I doubt it; first come, first served is most common. The same holds
> for memory as well.
> Regards,
> Andrey

And is that because it was considered and rejected for specific reasons (or for lack of sufficient reasons), or is it simply something that has not been evaluated? I would argue the following problem is a side effect of not having fair I/O scheduling:

http://opensolaris.org/jive/thread.jspa?threadID=121374&tstart=0

But again, I find it hard to believe that no read preemption takes place. What happens if, while executing a read, you take a memory fault and need to issue other reads? Clearly these new reads initiated by the OS cannot simply be ordered behind the ones already queued.
On Jan 11, 2010, at 8:05 AM, bank kus wrote:

> As of 2009.06, what is the policy for reordering ZFS file reads?
> T0: Process A issues a read of size 20K and gets its thread switched out.
> T1: Process B issues reads of size 8 bytes and gets its thread switched out.
> Are the 8-byte reads from B going to fall in queue _behind_ A:

Order is not preserved in either the OS or the device. The data will be cached in the device, the ZFS vdev cache, the ARC, and the L2ARC. At some point in time the data will be evicted, depending on cache demand and policies. It is unlikely that the media will be read twice if both reads are issued at nearly the same time.

> --> if A and B are from separate users?
> --> if B is from the system process?

Doesn't matter. None of those caches are tagged by pid.
-- richard
I misinterpreted the question. My answer assumes reads from the same file.

AFAIK, there is no thread-level I/O scheduler in Solaris. ZFS uses a priority scheduler based on the type of I/O, and there are some other resource-management policies implemented. By default, ZFS will queue 35 I/Os to each leaf vdev, so it is not clear that scheduling above the ZFS level would be as effective as one might presume based on how other systems implement I/O scheduling.

Solaris does have CPU, network, and memory resource management.
-- richard
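For what it's worth, on builds of this era the per-vdev queue depth Richard mentions is governed by the zfs_vdev_max_pending tunable; assuming that name holds on your build, the current value can be printed with mdb:

    # print the per-vdev queue depth in decimal
    echo zfs_vdev_max_pending/D | mdb -k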
On Mon, Jan 11, 2010 at 10:30 PM, Richard Elling <richard.elling at gmail.com> wrote:
> Solaris does have CPU, network, and memory resource management.

Do you mean that Solaris supports fair memory-bandwidth sharing?

Regards,
Andrey
On Jan 11, 2010, at 11:41 AM, Andrey Kuzmin wrote:
> Do you mean that Solaris supports fair memory-bandwidth sharing?

That seems like a loaded question, since Solaris is NUMA-aware and offers resource management at the process, project, and zone levels. I mean that you can manage memory resources using rcapd(1m) and friends.
-- richard
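A minimal sketch of the memory capping that rcapd(1m) provides, following the Solaris 10-era documentation; the project name here is hypothetical:

    # enable the resource-cap daemon
    rcapadm -E
    # cap the resident set size of a (hypothetical) project
    projmod -s -K "rcap.max-rss=2GB" user.banks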
> By default, ZFS will queue 35 I/Os to each leaf vdev, so it is not
> clear that scheduling above the ZFS level will be as effective

It doesn't have to be above the _ZFS_ layer, no? In place of a single queue, one could maintain separate queues that are either timesliced between or fair-share scheduled (some form of ticket scheduling where ticket = priority * age). You could still have elevator + deadline behavior underneath, and therefore not hurt disk bandwidth.
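A minimal sketch of that ticket idea (ticket = priority * age); the types and names are illustrative, not anything in ZFS. Because age grows without bound, a low-priority queue's ticket eventually exceeds any fixed-priority competitor's, which is exactly the no-starvation property asked about earlier in the thread:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-process I/O queue; not ZFS code. */
    typedef struct io_queue {
            int      priority;    /* static priority, > 0          */
            uint64_t oldest_enq;  /* enqueue time of head request  */
            size_t   depth;       /* requests currently waiting    */
    } io_queue_t;

    /*
     * Pick the non-empty queue whose head request holds the largest
     * ticket (priority * age). Aging guarantees every queue wins
     * eventually, so no process can be starved.
     */
    static io_queue_t *
    pick_next_queue(io_queue_t *queues, size_t nqueues, uint64_t now)
    {
            io_queue_t *best = NULL;
            uint64_t best_ticket = 0;

            for (size_t i = 0; i < nqueues; i++) {
                    if (queues[i].depth == 0)
                            continue;
                    uint64_t ticket = (uint64_t)queues[i].priority *
                        (now - queues[i].oldest_enq);
                    if (best == NULL || ticket > best_ticket) {
                            best = &queues[i];
                            best_ticket = ticket;
                    }
            }
            return (best);
    }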
On Jan 11, 2010, at 8:21 PM, bank kus wrote:
> It doesn't have to be above the _ZFS_ layer, no? In place of a single
> queue, one could maintain separate queues that are either timesliced
> between or fair-share scheduled.

I think if you look at the majority of performance problems reported on this forum, they are latency bound, not bandwidth bound. Modern systems tend to be over-provisioned for bandwidth.
-- richard
> I think if you look at the majority of performance problems reported
> on this forum, they are latency bound, not bandwidth bound.

There is the __latency__ of reads in a highly contended system (lots of reads from different processes ahead of yours), and there is the speed-of-light case (the read queues are empty: fetch me the data, from cache if possible, otherwise from disk; there is no contention). The former will improve in latency with fair queuing, while the speed-of-light case, it is true, will incur a certain overhead. Again, I would be interested to see whether there have been mail threads / discussions around _intentionally_ not providing fair queuing in Solaris for performance reasons, or whether it is simply something that has not been evaluated so far.
On Jan 12, 2010, at 10:27 AM, bank kus wrote:
> Again, I would be interested to see whether there have been mail
> threads / discussions around _intentionally_ not providing fair
> queuing in Solaris for performance reasons.

Good question. I suggest you look at the resource management community:
http://hub.opensolaris.org/bin/view/Project+rm/
-- richard
It looks like

    set zfs:zfs_vdev_max_pending = 1

in /etc/system fixes this problem __very__ elegantly. Now, with a 16GB file copy running in the background, I can launch an intensive application like Eclipse very fast.
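For reference, the two usual ways to apply that tuning on builds of this era, as described in the ZFS Evil Tuning Guide; treat the exact syntax as an assumption for your build:

    # /etc/system (persistent; takes effect at next boot)
    set zfs:zfs_vdev_max_pending = 1

    # live change via mdb, no reboot needed (0t1 = decimal 1)
    echo zfs_vdev_max_pending/W0t1 | mdb -kw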