As of 2009.06, what is the policy for reordering ZFS file reads? Consider the following timeline:

T0: Process A issues a read of size 20K and gets its thread switched out.
T1: Process B issues reads of size 8 bytes and gets its thread switched out.

Are the 8-byte reads from B going to fall in queue _behind_ A:
--> if A and B are from separate users?
--> if B is from the system process?

Regards
banks
Per POSIX, there are no read-ordering guarantees for a file with concurrent non-exclusive readers. Use queues/locks in the application if you need ordering like this.

Regards,
Andrey
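A minimal sketch of the application-level ordering Andrey suggests, assuming POSIX threads within a single process: serialize the reads behind one lock so the application, not the filesystem, defines the order. The wrapper name and the process-wide lock are illustrative choices, not anything mandated by POSIX:

    #include <fcntl.h>
    #include <pthread.h>
    #include <unistd.h>

    /* One process-wide lock: whoever acquires it reads next.
     * This imposes the ordering that POSIX itself does not. */
    static pthread_mutex_t read_lock = PTHREAD_MUTEX_INITIALIZER;

    static ssize_t
    ordered_pread(int fd, void *buf, size_t len, off_t off)
    {
            pthread_mutex_lock(&read_lock);
            ssize_t n = pread(fd, buf, len, off);
            pthread_mutex_unlock(&read_lock);
            return (n);
    }

Ordering across separate processes would need the same idea with a shared lock, e.g. a process-shared mutex or an fcntl(2) record lock.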
I was asking from the starvation point of view, to see whether B can be starved by a long burst from A.
Then you're actually asking for a fair I/O scheduler.

Regards,
Andrey
> Then you're actually asking for a fair I/O scheduler.

Yes. Are we currently fair? Is there any good documentation on the priority model as it exists today?
On Mon, Jan 11, 2010 at 9:29 PM, bank kus <kus.bank at gmail.com> wrote:
>> Then you're actually asking for a fair I/O scheduler.
>
> Yes. Are we currently fair? Is there any good documentation on the
> priority model as it exists today?

I doubt it; first come, first served is most common. The same holds for memory as well.

Regards,
Andrey
> I doubt it; first come, first served is most common. The same holds
> for memory as well.
> Regards,
> Andrey

And is that because it was considered and rejected for specific reasons (or for lack of sufficient reasons), or is it simply something that has not been evaluated? I would argue the following problem is a side effect of not having fair I/O scheduling:

http://opensolaris.org/jive/thread.jspa?threadID=121374&tstart=0

But again, I find it hard to believe that no read preemption takes place. What happens if, while executing a read, you take a memory fault and need to issue other reads? Clearly these new reads initiated by the OS cannot simply be ordered behind the ones already queued.
On Jan 11, 2010, at 8:05 AM, bank kus wrote:

> As of 2009.06, what is the policy for reordering ZFS file reads?
> T0: Process A issues a read of size 20K and gets its thread switched out.
> T1: Process B issues reads of size 8 bytes and gets its thread switched out.
> Are the 8-byte reads from B going to fall in queue _behind_ A:

Order is not preserved in either the OS or the device. The data will be cached in the device, the ZFS vdev cache, the ARC, and the L2ARC. At some point in time the data will be evicted, depending on cache demand and policies. It is unlikely that the media will be read twice if both reads are issued at nearly the same time.

> --> if A and B are from separate users?
> --> if B is from the system process?

Doesn't matter. None of those caches are tagged by pid.
-- richard
I misinterpreted the question. My answer assumes reads from the same file.

AFAIK, there is no thread-level I/O scheduler in Solaris. ZFS uses a priority scheduler based on the type of I/O, and there are some other resource-management policies implemented. By default, ZFS will queue 35 I/Os to each leaf vdev, so it is not clear that scheduling above the ZFS level would be as effective as one might presume based on how other systems implement I/O scheduling.

Solaris does have CPU, network, and memory resource management.
-- richard
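For what it's worth, on builds of this era the per-vdev queue depth Richard mentions is governed by the zfs_vdev_max_pending tunable; assuming that name holds on your build, the current value can be printed with mdb:

    # print the per-vdev queue depth in decimal
    echo zfs_vdev_max_pending/D | mdb -k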
On Mon, Jan 11, 2010 at 10:30 PM, Richard Elling <richard.elling at gmail.com> wrote:
> Solaris does have CPU, network, and memory resource management.

Do you mean that Solaris supports fair memory-bandwidth sharing?

Regards,
Andrey
On Jan 11, 2010, at 11:41 AM, Andrey Kuzmin wrote:
> Do you mean that Solaris supports fair memory-bandwidth sharing?

That seems like a loaded question, since Solaris is NUMA-aware and offers resource management at the process, project, and zone levels. I mean that you can manage memory resources using rcapd(1m) and friends.
-- richard
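A minimal sketch of the memory capping that rcapd(1m) provides, following the Solaris 10-era documentation; the project name here is hypothetical:

    # enable the resource-cap daemon
    rcapadm -E
    # cap the resident set size of a (hypothetical) project
    projmod -s -K "rcap.max-rss=2GB" user.banks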
> By default, ZFS will queue 35 I/Os to each leaf vdev, so it is not
> clear that scheduling above the ZFS level will be as effective

It doesn't have to be above the _ZFS_ layer, no? In place of a single queue, one could maintain separate queues that are either timesliced between or fair-share scheduled (some form of ticket scheduling where ticket = priority * age). You could still have elevator + deadline behavior underneath, and therefore not hurt disk bandwidth.
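A minimal sketch of that ticket idea (ticket = priority * age); the types and names are illustrative, not anything in ZFS. Because age grows without bound, a low-priority queue's ticket eventually exceeds any fixed-priority competitor's, which is exactly the no-starvation property asked about earlier in the thread:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-process I/O queue; not ZFS code. */
    typedef struct io_queue {
            int      priority;    /* static priority, > 0          */
            uint64_t oldest_enq;  /* enqueue time of head request  */
            size_t   depth;       /* requests currently waiting    */
    } io_queue_t;

    /*
     * Pick the non-empty queue whose head request holds the largest
     * ticket (priority * age). Aging guarantees every queue wins
     * eventually, so no process can be starved.
     */
    static io_queue_t *
    pick_next_queue(io_queue_t *queues, size_t nqueues, uint64_t now)
    {
            io_queue_t *best = NULL;
            uint64_t best_ticket = 0;

            for (size_t i = 0; i < nqueues; i++) {
                    if (queues[i].depth == 0)
                            continue;
                    uint64_t ticket = (uint64_t)queues[i].priority *
                        (now - queues[i].oldest_enq);
                    if (best == NULL || ticket > best_ticket) {
                            best = &queues[i];
                            best_ticket = ticket;
                    }
            }
            return (best);
    }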
On Jan 11, 2010, at 8:21 PM, bank kus wrote:
> It doesn't have to be above the _ZFS_ layer, no? In place of a single
> queue, one could maintain separate queues that are either timesliced
> between or fair-share scheduled.

I think if you look at the majority of performance problems reported on this forum, they are latency bound, not bandwidth bound. Modern systems tend to be over-provisioned for bandwidth.
-- richard
> I think if you look at the majority of performance problems reported
> on this forum, they are latency bound, not bandwidth bound.

There is the __latency__ of reads in a highly contended system (lots of reads from different processes ahead of yours), and there is the speed-of-light case (the read queues are empty: fetch me the data, from cache if possible, otherwise from disk; there is no contention). The former will improve in latency with fair queuing, while the speed-of-light case, it is true, will incur a certain overhead. Again, I would be interested to see whether there have been mail threads / discussions around _intentionally_ not providing fair queuing in Solaris for performance reasons, or whether it is simply something that has not been evaluated so far.
On Jan 12, 2010, at 10:27 AM, bank kus wrote:
> Again, I would be interested to see whether there have been mail
> threads / discussions around _intentionally_ not providing fair
> queuing in Solaris for performance reasons.

Good question. I suggest you look at the resource management community:
http://hub.opensolaris.org/bin/view/Project+rm/
-- richard
It looks like

    set zfs:zfs_vdev_max_pending = 1

in /etc/system fixes this problem __very__ elegantly. Now, with a 16GB file copy running in the background, I can launch an intensive application like Eclipse very fast.
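For reference, the two usual ways to apply that tuning on builds of this era, as described in the ZFS Evil Tuning Guide; treat the exact syntax as an assumption for your build:

    # /etc/system (persistent; takes effect at next boot)
    set zfs:zfs_vdev_max_pending = 1

    # live change via mdb, no reboot needed (0t1 = decimal 1)
    echo zfs_vdev_max_pending/W0t1 | mdb -kw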