Vincent, Pradeep
2011-May-02 07:04 UTC
[Xen-devel] [PATCH] blkback: Fix block I/O latency issue
In blkback driver, after I/O requests are submitted to Dom-0 block I/O
subsystem, blkback goes to 'sleep' effectively without letting blkfront know
about it (req_event isn't set appropriately). Hence blkfront doesn't notify
blkback when it submits a new I/O, thus delaying the 'dispatch' of the new I/O
to Dom-0 block I/O subsystem. The new I/O is not dispatched until one of the
previous I/Os completes.

As a result of this issue, the block I/O latency performance is degraded for
some workloads on Xen guests using the blkfront-blkback stack.

The following change addresses this issue:


Signed-off-by: Pradeep Vincent <pradeepv@amazon.com>

diff --git a/drivers/xen/blkback/blkback.c b/drivers/xen/blkback/blkback.c
--- a/drivers/xen/blkback/blkback.c
+++ b/drivers/xen/blkback/blkback.c
@@ -383,6 +383,12 @@ static int do_block_io_op(blkif_t *blkif)
 		cond_resched();
 	}
 
+	/* If blkback might go to sleep (i.e. more_to_do == 0) then we better
+	   let blkfront know about it (by setting req_event appropriately) so that
+	   blkfront will bother to wake us up (via interrupt) when it submits a
+	   new I/O */
+	if (!more_to_do)
+		RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do);
 	return more_to_do;
 }
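RING_FINAL_CHECK_FOR_REQUESTS() comes from the shared ring header
(xen/include/public/io/ring.h). The sketch below paraphrases its shape (not a
verbatim copy of any particular tree) to show what the call added by the patch
does before blkback sleeps: if nothing is pending it advances req_event past
req_cons, so the frontend's next push will raise an event, and then it
re-checks for a request that may have raced in.

/* Paraphrased sketch of the io/ring.h macro -- details may differ
 * slightly between trees. */
#define RING_FINAL_CHECK_FOR_REQUESTS(_r, _work_to_do) do {		\
	(_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);		\
	if (_work_to_do)						\
		break;							\
	/* Nothing pending: ask the frontend to send an event		\
	 * for the next request it produces. */				\
	(_r)->sring->req_event = (_r)->req_cons + 1;			\
	mb();								\
	/* Re-check, in case a request arrived before req_event	\
	 * became visible to the frontend. */				\
	(_work_to_do) = RING_HAS_UNCONSUMED_REQUESTS(_r);		\
} while (0)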
Jan Beulich
2011-May-02 08:13 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
>>> On 02.05.11 at 09:04, "Vincent, Pradeep" <pradeepv@amazon.com> wrote:
> In blkback driver, after I/O requests are submitted to Dom-0 block I/O
> subsystem, blkback goes to 'sleep' effectively without letting blkfront know
> about it (req_event isn't set appropriately). Hence blkfront doesn't notify
> blkback when it submits a new I/O, thus delaying the 'dispatch' of the new I/O
> to Dom-0 block I/O subsystem. The new I/O is not dispatched until one of the
> previous I/Os completes.
>
> As a result of this issue, the block I/O latency performance is degraded for
> some workloads on Xen guests using the blkfront-blkback stack.
>
> The following change addresses this issue:
>
> Signed-off-by: Pradeep Vincent <pradeepv@amazon.com>
>
> diff --git a/drivers/xen/blkback/blkback.c b/drivers/xen/blkback/blkback.c
> --- a/drivers/xen/blkback/blkback.c
> +++ b/drivers/xen/blkback/blkback.c
> @@ -383,6 +383,12 @@ static int do_block_io_op(blkif_t *blkif)
> 		cond_resched();
> 	}
>
> +	/* If blkback might go to sleep (i.e. more_to_do == 0) then we better
> +	   let blkfront know about it (by setting req_event appropriately) so that
> +	   blkfront will bother to wake us up (via interrupt) when it submits a
> +	   new I/O */
> +	if (!more_to_do)
> +		RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do);
> 	return more_to_do;
> }

To me this contradicts the comment preceding the use of
RING_FINAL_CHECK_FOR_REQUESTS() in make_response() (there it's
supposedly used to avoid unnecessary notification, here you say it's
used to force notification). Albeit I agree that the change looks
consistent with the comments in io/ring.h.

Even if correct, you're not holding blkif->blk_ring_lock here, and
hence I think you'll need to explain how this is not a problem.

From a formal perspective, you also want to correct usage of tabs,
and (assuming this is intended for the 2.6.18 tree) you'd also need
to indicate so for Keir to pick this up and apply it to that tree (and
it might then also be a good idea to submit an equivalent patch for
the pv-ops trees).

Jan
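For context, the make_response() tail check Jan refers to runs under
blkif->blk_ring_lock and looks roughly as sketched below (paraphrased from
blkback, not a verbatim excerpt; helper names may differ between trees). The
comment there frames RING_FINAL_CHECK_FOR_REQUESTS() as notification
avoidance, which is the apparent contradiction Jan points out.

	/* Rough shape of the end of make_response() -- paraphrased. */
	spin_lock_irqsave(&blkif->blk_ring_lock, flags);
	/* ... write the response, RING_PUSH_RESPONSES_AND_CHECK_NOTIFY ... */
	if (blk_rings->common.rsp_prod_pvt == blk_rings->common.req_cons) {
		/*
		 * Tail check for pending requests. Allows frontend to avoid
		 * notifications if requests are already in flight (lower
		 * overheads and promotes batching).
		 */
		RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do);
	} else if (RING_HAS_UNCONSUMED_REQUESTS(&blk_rings->common)) {
		more_to_do = 1;
	}
	spin_unlock_irqrestore(&blkif->blk_ring_lock, flags);

	if (more_to_do)
		blkif_notify_work(blkif);	/* wake the blkback thread */
	if (notify)
		notify_remote_via_irq(blkif->irq);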
Vincent, Pradeep
2011-May-03 01:10 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
Thanks Jan.

Re: avoid unnecessary notification

If this was a deliberate design choice, then the duration of the delay is
at the mercy of the pending I/O latencies & I/O patterns, and the delay is
simply too long in some cases. E.g. a write I/O stuck behind a read I/O
could see more than double the latency on a Xen guest compared to a
baremetal host. Avoiding notifications this way results in significant
latency degradation perceived by many applications.

If this is about allowing the I/O scheduler to coalesce more I/Os, then I bet
the I/O scheduler's 'wait and coalesce' logic is a great substitute for the
delays introduced by blkback.

I totally agree IRQ coalescing or delay is useful for both blkback and
netback, but we need a logic that doesn't impact I/O latencies
significantly. Also, I don't think netback has this type of notification
avoidance logic (at least in the 2.6.18 code base).

Re: Other points

Good call. Changed the patch to use tabs.

I wasn't very sure about blk_ring_lock usage and I should have clarified
it before sending out the patch.

Assuming blk_ring_lock was meant to protect shared ring manipulations
within blkback, is there a reason 'blk_rings->common.req_cons'
manipulation in do_block_io_op is not protected? The reasons for the
differences between the locking logic in do_block_io_op and make_response
weren't terribly obvious, although the failure mode for the race condition
may very well be benign.

Anyway, I am attaching a patch with the appropriate changes.

Jeremy, can you apply this patch to pvops Dom-0
(http://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git)? Should I
submit another patch for 2.6.18 Dom-0?


Signed-off-by: Pradeep Vincent <pradeepv@amazon.com>

diff --git a/drivers/xen/blkback/blkback.c b/drivers/xen/blkback/blkback.c
--- a/drivers/xen/blkback/blkback.c
+++ b/drivers/xen/blkback/blkback.c
@@ -315,6 +315,7 @@ static int do_block_io_op(blkif_t *blkif)
 	pending_req_t *pending_req;
 	RING_IDX rc, rp;
 	int more_to_do = 0;
+	unsigned long flags;
 
 	rc = blk_rings->common.req_cons;
 	rp = blk_rings->common.sring->req_prod;
@@ -383,6 +384,15 @@ static int do_block_io_op(blkif_t *blkif)
 		cond_resched();
 	}
 
+	/* If blkback might go to sleep (i.e. more_to_do == 0) then we better
+	   let blkfront know about it (by setting req_event appropriately) so that
+	   blkfront will bother to wake us up (via interrupt) when it submits a
+	   new I/O */
+	if (!more_to_do) {
+		spin_lock_irqsave(&blkif->blk_ring_lock, flags);
+		RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do);
+		spin_unlock_irqrestore(&blkif->blk_ring_lock, flags);
+	}
 	return more_to_do;
 }
Konrad Rzeszutek Wilk
2011-May-03 14:55 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
On Mon, May 02, 2011 at 06:10:22PM -0700, Vincent, Pradeep wrote:
> Thanks Jan.
>
> Re: avoid unnecessary notification
>
> If this was a deliberate design choice, then the duration of the delay is
> at the mercy of the pending I/O latencies & I/O patterns, and the delay is
> simply too long in some cases. E.g. a write I/O stuck behind a read I/O
> could see more than double the latency on a Xen guest compared to a
> baremetal host. Avoiding notifications this way results in significant
> latency degradation perceived by many applications.

You sure this is not the fault of the IO scheduler? I had similar issues
with the CFQ scheduler upstream and found out that I needed to add
REQ_SYNC on write requests.
Vincent, Pradeep
2011-May-03 17:16 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
Yes - I am positive that what I am seeing isn't an 'I/O scheduler issue due to
REQ_SYNC'. Trace data from blkback showed that blkback was simply not
submitting the 2nd I/O to the I/O scheduler. The type of I/O (read vs write)
doesn't matter.

Recreation steps:

1. Generate I/O requests so that two I/Os are pending at any given time.
   The I/O submissions shouldn't be synchronized. Potentially use two threads
   for I/O submissions, each submitting a small random direct I/O.
2. Verify that the guest sends out two I/Os at a given time. 'iostat'
   avgqu-sz will be '2'.
3. Now check iostat in Dom-0 for the corresponding block device. avgqu-sz
   will be '1'.
4. An 'await' comparison in DomU vs Dom-0 will show a fairly big difference.

And I confirmed that the patch I submitted fixes this issue.

- Pradeep Vincent
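One way to generate the workload described in steps 1-2 is a small direct-I/O
load generator like the hypothetical sketch below (the device path, block size
and thread count are assumptions; a tool such as fio with two jobs at iodepth 1
works just as well). Each thread keeps exactly one 512-byte O_DIRECT random
read outstanding, so the guest queue depth stays at two.

	/* Hypothetical load generator for the recreation steps above.
	 * Two threads, each keeping one O_DIRECT random read outstanding
	 * against an assumed guest device (/dev/xvdb). */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define DEV	"/dev/xvdb"	/* assumption: device under test */
	#define BS	512

	static void *worker(void *arg)
	{
		int fd = open(DEV, O_RDONLY | O_DIRECT);
		off_t blocks = lseek(fd, 0, SEEK_END) / BS;
		void *buf;

		posix_memalign(&buf, 4096, BS);	/* O_DIRECT needs alignment */
		srand((long)arg);
		for (;;) {
			off_t off = (off_t)(rand() % blocks) * BS;
			if (pread(fd, buf, BS, off) != BS)	/* one I/O in flight */
				perror("pread");
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t[2];
		int i;

		for (i = 0; i < 2; i++)
			pthread_create(&t[i], NULL, worker, (void *)(long)i);
		pthread_join(t[0], NULL);	/* runs until interrupted */
		return 0;
	}

Run it in the guest, then compare avgqu-sz and await from 'iostat -x 1' in
DomU and Dom-0 as in steps 2-4.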
Daniel Stodden
2011-May-03 17:51 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
On Tue, 2011-05-03 at 13:16 -0400, Vincent, Pradeep wrote:
> Yes - I am positive what I am seeing isn't an 'I/O scheduler issue due to
> REQ_SYNC'. Trace data from blkback showed that blkback was simply not
> submitting the 2nd I/O to the I/O scheduler. Type of I/O (read vs write)
> doesn't matter.

Block trace?

Just to make sure we're debugging the right thing -- what I/O scheduler
are you running?

Daniel
Daniel Stodden
2011-May-03 17:52 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
On Mon, 2011-05-02 at 21:10 -0400, Vincent, Pradeep wrote:
> Thanks Jan.
>
> Re: avoid unnecessary notification
>
> If this was a deliberate design choice, then the duration of the delay is
> at the mercy of the pending I/O latencies & I/O patterns, and the delay is
> simply too long in some cases. E.g. a write I/O stuck behind a read I/O
> could see more than double the latency on a Xen guest compared to a
> baremetal host. Avoiding notifications this way results in significant
> latency degradation perceived by many applications.

I'm trying to follow - let me know if I misread you - but I think you're
misunderstanding this stuff.

The notification avoidance these macros implement does not promote
deliberate latency. This stuff is not dropping events or deferring guest
requests.

It only avoids a gratuitous notification sent by the remote end in
cases where the local one didn't go to sleep yet, and therefore can
guarantee that it's going to process the message ASAP, right after
finishing what's still pending from the previous kick.

It's only a mechanism to avoid excess interrupt signaling. Think about a
situation where you ask the guy at the front door to take his thumb off
the buzzer while you're already running down the hallway.

R/W reordering is a matter dealt with by I/O schedulers.

Any case of write I/O behind the read you describe is supposed to be
queued back-to-back. It should never get stuck. A backend can obviously
reserve the right to override guest submit order, but blkback doesn't do
this; it's just pushing everything down the disk queue as soon as it
sees it.

So, that'd be the basic idea. Now, we've got that extra stuff in there
mixing that up between request and response processing, and it's
admittedly somewhat hard to read.

If you found a bug in there, well, yoho. Normally the slightest mistake
on the event processing front rather leads to deadlocks, and we
currently don't see any.

If you're right -- I guess the better fix would look different. If this
stuff is actually broken, maybe we can rather simplify things again, not
add more extra checks on top. :)

Daniel
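The gating Daniel describes is implemented on the frontend side by the push
macro: an event is sent only if the backend's req_event falls within the batch
of requests just published, i.e. the backend armed it because it was about to
sleep. Paraphrased from io/ring.h (details may differ slightly between trees):

/* Paraphrased sketch of the frontend-side notification check. */
#define RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(_r, _notify) do {		\
	RING_IDX __old = (_r)->sring->req_prod;				\
	RING_IDX __new = (_r)->req_prod_pvt;				\
	wmb(); /* backend sees requests before the new producer index */ \
	(_r)->sring->req_prod = __new;					\
	mb();  /* backend sees new requests before we read req_event */ \
	(_notify) = ((RING_IDX)(__new - (_r)->sring->req_event) <	\
		     (RING_IDX)(__new - __old));			\
} while (0)

If blkback consumes everything without re-arming req_event (the unpatched
do_block_io_op() path), req_event stays at or below the old producer index,
the comparison yields 0, and the next request raises no event -- which is the
gap the patch closes.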
Vincent, Pradeep
2011-May-03 23:41 UTC
RE: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
Hey Daniel,

CFQ primarily, but the I/O scheduler doesn't come into the picture here.

>> Block trace?

Nope.. Systemtap

- Pradeep Vincent
Vincent, Pradeep
2011-May-04 01:54 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
Hey Daniel,

Thanks for your comments.

>> The notification avoidance these macros implement does not promote
>> deliberate latency. This stuff is not dropping events or deferring guest
>> requests. It only avoids a gratuitous notification sent by the remote end
>> in cases where the local one didn't go to sleep yet, and therefore can
>> guarantee that it's going to process the message ASAP, right after
>> finishing what's still pending from the previous kick.

If the design goal was to simply avoid unnecessary interrupts but not
delay I/Os, then the blkback code has a bug.

If the design goal was to delay the I/Os in order to reduce the interrupt
rate, then I am arguing that the design introduces way too much latency,
which affects many applications.

Either way, this issue needs to be addressed.

Perhaps a timeline example will help shed some light on this issue. Let's
say IO-1 and IO-2 are submitted with a gap of 200 usecs. Let's assume the
interrupt latency is 10 usecs and the disk drive takes ~10,000 usecs to
process each I/O.

t1: IO-1 arrives at blkfront. RING_PUSH_REQUESTS_AND_CHECK_NOTIFY is
called, which updates 'sring->req_prod' and uses 'sring->req_event' to
determine if an interrupt must be generated. In this case, blkfront
generates the interrupt.

t1+10 usecs: The interrupt is received by blkback. do_block_io_op is
eventually invoked, which dispatches the I/O after incrementing
'common.req_cons'. Note that 'req_event' is NOT updated. There are no more
I/Os to be processed and hence the blkback thread goes to sleep.

t1+200 usecs: IO-2 arrives at blkfront. RING_PUSH_REQUESTS_AND_CHECK_NOTIFY
is called, which updates 'sring->req_prod' and uses 'sring->req_event' to
determine if an interrupt must be generated. Unfortunately, 'req_event' was
NOT updated in the previous step, and as a result blkfront decides not to
send an interrupt. As a result blkback doesn't wake up immediately to
process the I/O that has been added to the shared ring by blkfront.

t1+10000 usecs: IO-1 completes. 'make_response' is invoked, which signals
the completion of IO-1 to blkfront. Now it goes through the following code
and decides there is 'more_to_do':

	if (blk_rings->common.rsp_prod_pvt == blk_rings->common.req_cons) {
		/*
		 * Tail check for pending requests. Allows frontend to avoid
		 * notifications if requests are already in flight (lower
		 * overheads and promotes batching).
		 */
		RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do);

Hence the blkback thread is woken up, which then invokes 'do_block_io_op'.
'do_block_io_op' then dispatches IO-2.

t1+20000 usecs: IO-2 completes.

From the guest point of view, IO-1 took ~10,000 usecs to complete, which is
fine. But IO-2 took 19,800 usecs, which is obviously very bad.

Now, once the patch is applied:

t1+10 usecs: The interrupt is received by blkback. do_block_io_op is
eventually invoked, which dispatches the I/O after incrementing
'common.req_cons'. RING_FINAL_CHECK_FOR_REQUESTS is invoked, which updates
'req_event'. There are no more I/Os to be processed and hence the blkback
thread goes to sleep.

t1+200 usecs: IO-2 arrives at blkfront. RING_PUSH_REQUESTS_AND_CHECK_NOTIFY
is called, which updates 'sring->req_prod' and uses 'sring->req_event' to
determine if an interrupt must be generated. Since req_event was updated in
the previous step, blkfront decides to generate an interrupt.

t1+210 usecs: The interrupt is received by blkback. do_block_io_op is
eventually invoked, which dispatches IO-2 after incrementing
'common.req_cons'. RING_FINAL_CHECK_FOR_REQUESTS is invoked, which updates
'req_event'. There are no more I/Os to be processed and hence the blkback
thread goes to sleep.

t1+10000 usecs: IO-1 completes.

t1+10210 usecs: IO-2 completes.

Both I/Os take ~10,000 usecs to complete and the application lives happily
ever after.

Does that make sense?

>> Normally the slightest mistake on the event processing front rather
>> leads to deadlocks, and we currently don't see any.

Yeah - I had the same thought initially. In this case, the fact that
make_response kicks off any pending I/Os turns potential deadlocks into
latency issues.

>> If you're right -- I guess the better fix would look different. If this
>> stuff is actually broken, maybe we can rather simplify things again, not
>> add more extra checks on top. :)

Love to hear better ways of fixing this issue. Any proposals?

Thanks,

- Pradeep Vincent
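The "goes to sleep" steps in the timeline refer to blkback's dispatch kthread,
which blocks on a wait queue that only the frontend's event (or a completing
response via the make_response tail check) wakes. A rough sketch, paraphrased
from the blkback driver and not a verbatim excerpt of either tree:

/* Paraphrased sketch of blkback's dispatch thread and its wake-up path. */
int blkif_schedule(void *arg)
{
	blkif_t *blkif = arg;

	while (!kthread_should_stop()) {
		/* Sleep here until blkif_notify_work() is called. */
		wait_event_interruptible(
			blkif->wq,
			blkif->waiting_reqs || kthread_should_stop());

		blkif->waiting_reqs = 0;
		smp_mb();	/* clear flag *before* checking for work */

		/* Without the patch, do_block_io_op() can return 0 while
		 * req_event is stale, so the frontend never raises the
		 * event that would trigger blkif_notify_work() for the
		 * next request. */
		if (do_block_io_op(blkif))
			blkif->waiting_reqs = 1;
	}
	return 0;
}

/* Called from the event-channel interrupt handler and from
 * make_response() when the tail check finds more work. */
void blkif_notify_work(blkif_t *blkif)
{
	blkif->waiting_reqs = 1;
	wake_up(&blkif->wq);
}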
Konrad Rzeszutek Wilk
2011-May-09 20:24 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
On Tue, May 03, 2011 at 06:54:38PM -0700, Vincent, Pradeep wrote:
> Hey Daniel,
>
> Thanks for your comments.
>
> >> The notification avoidance these macros implement does not promote
> >> deliberate latency. This stuff is not dropping events or deferring guest
> >> requests. It only avoids a gratuitous notification sent by the remote end
> >> in cases where the local one didn't go to sleep yet, and therefore can
> >> guarantee that it's going to process the message ASAP, right after
> >> finishing what's still pending from the previous kick.
>
> If the design goal was to simply avoid unnecessary interrupts but not
> delay I/Os, then the blkback code has a bug.
>
> If the design goal was to delay the I/Os in order to reduce the interrupt
> rate, then I am arguing that the design introduces way too much latency,
> which affects many applications.
>
> Either way, this issue needs to be addressed.

I agree we need to fix this. What I am curious about is:
 - what are the workloads under which this patch has a negative effect.
 - I presume you have tested this in production - what were the numbers
   when it came to high bandwidth (so imagine four or six threads putting
   as much I/O as possible)? Did the level of IRQs go way up compared to
   not running with this patch?

I am wondering if it might be worth looking into something NAPI-type in the
block layer (so polling, basically). The concern I've got is that this patch
would trigger an interrupt storm for small-sized requests which might be
happening at a high rate (say, 512-byte random writes).

But perhaps the way for this to work is to have rate-limiting code in it,
so that there is no chance of interrupt storms.
Vincent, Pradeep
2011-May-13 00:40 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
Thanks Konrad.>>I presume you have tested this in the productionYes. Absolutely.>>what were the numbers when it came to high bandwith numbersUnder high I/O workload, where the blkfront would fill up the queue as blkback works the queue, the I/O latency problem in question doesn''t manifest itself and as a result this patch doesn''t make much of a difference in terms of interrupt rate. My benchmarks didn''t show any significant effect. The above rationale combined with relatively high disk I/O latencies (compared to IRQ latency) generally minimizes excessive interrupt rate. Also, blkfront interrupt generation mechanism works exactly the same way as the patched blkback. Netfront and netback generate interrupts the same way as well. Under ''moderate'' I/O workload, the rate of interrupt does go up but the I/O latency benefit clearly outweighs the cost extra interrupt rate (which isn''t much for moderate I/O load anyways) Overall, advantages of this patch (I/O latency improvement) outweighs any potential fringe negative effects by a large margin and the fact that netfront, netback and blkfront already have the same interrupt generation design point should give us a lot of confidence. That said, I do think a comprehensive interrupt throttling mechanism for netback, blkback and other backend I/O drivers would be useful and should be pursued as a separate initiative. Such a mechanism would be particularly useful for netfront-netback stack which is more susceptible to interrupt storms than blkfront-blkback. ''IRQ coalescing'' type mechanism that could induce delays in the order of 10s of microsecs (certainly not in millisecs though) to minimize interrupt generation rate would be useful (similar to what NICs do). Thanks, - Pradeep Vincent On 5/9/11 1:24 PM, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:>On Tue, May 03, 2011 at 06:54:38PM -0700, Vincent, Pradeep wrote: >> Hey Daniel, >> >> Thanks for your comments. >> >> >> The notification avoidance these macros implement does not promote >> >>deliberate latency. This stuff is not dropping events or deferring >>guest >> requests. >> >> It only avoids a gratuitious notification sent by the remote end in >> cases where the local one didn''t go to sleep yet, and therefore can >> >>guarantee that it''s going to process the message ASAP, right after >> >>finishing what''s still pending from the previous kick. >> >> If the design goal was to simply avoid unnecessary interrupts but not >> delay I/Os, then blkback code has a bug. >> >> If the design goal was to delay the I/Os in order to reducing interrupt >> rate, then I am arguing that the design introduces way too much latency >> that affects many applications. >> >> Either way, this issue needs to be addressed. > > >I agree we need to fix this. What I am curious is: > - what are the workloads under which this patch has a negative effect. > - I presume you have tested this in the production - what were the >numbers > when it came to high bandwith numbers (so imagine, four or six threads > putting as much I/O as possible)? Did the level of IRQs go way up > compared to not running with this patch? > > >I am wondering if it might be worth looking in something NAPI-type in the >block layer (so polling basically). The concern I''ve is that this >patch would trigger a interrupt storm for small sized requests which >might be >happening at a high rate (say, 512 bytes random writes). 
> >But perhaps the way for this work is to have a ratelimiting code in it >so that there is no chance of interrupt storms._______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
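For what it is worth, the claim that blkfront already generates notifications this way can be checked against the frontend's completion handler, which uses the same final-check pattern: drain responses, then either re-arm rsp_event or loop again. Roughly, trimmed and paraphrased from drivers/block/xen-blkfront.c:

static irqreturn_t blkif_interrupt(int irq, void *dev_id)
{
	struct blkfront_info *info = dev_id;
	struct blkif_response *bret;
	RING_IDX i, rp;

again:
	rp = info->ring.sring->rsp_prod;
	rmb();	/* read responses only after rsp_prod */

	for (i = info->ring.rsp_cons; i != rp; i++) {
		bret = RING_GET_RESPONSE(&info->ring, i);
		/* ... complete the request identified by bret->id ... */
	}
	info->ring.rsp_cons = i;

	if (i != info->ring.req_prod_pvt) {
		int more_to_do;
		/* Same pattern as the blkback patch: re-arm rsp_event,
		 * then re-check so a response cannot be missed. */
		RING_FINAL_CHECK_FOR_RESPONSES(&info->ring, more_to_do);
		if (more_to_do)
			goto again;
	} else
		info->ring.sring->rsp_event = i + 1;

	return IRQ_HANDLED;
}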
Konrad Rzeszutek Wilk
2011-May-13 02:51 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
> >>what were the numbers when it came to high bandwidth numbers > > Under high I/O workload, where the blkfront would fill up the queue as > blkback works the queue, the I/O latency problem in question doesn''t > manifest itself and as a result this patch doesn''t make much of a > difference in terms of interrupt rate. My benchmarks didn''t show any > significant effect.I have to rerun my benchmarks. Under high load (so 64Kb, four threads writting as much as they can to a iSCSI disk), the IRQ rate for each blkif went from 2-3/sec to ~5K/sec. But I did not do a good job on capturing the submission latency to see if the I/Os get the response back as fast (or the same) as without your patch. And the iSCSI disk on the target side was an RAMdisk, so latency was quite small which is not fair to your problem. Do you have a program to measure the latency for the workload you had encountered? I would like to run those numbers myself.> > The above rationale combined with relatively high disk I/O latencies > (compared to IRQ latency) generally minimizes excessive interrupt rate. > Also, blkfront interrupt generation mechanism works exactly the same way > as the patched blkback. Netfront and netback generate interrupts the same > way as well. > > Under ''moderate'' I/O workload, the rate of interrupt does go up but the > I/O latency benefit clearly outweighs the cost extra interrupt rate (which > isn''t much for moderate I/O load anyways) > > > Overall, advantages of this patch (I/O latency improvement) outweighs any > potential fringe negative effects by a large margin and the fact that > netfront, netback and blkfront already have the same interrupt generation > design point should give us a lot of confidence.If we can come up with a patch that does decrease the I/O latency _and_ does not cause a huge spike in interrupt generation that would be super. I am particularly worried about the high interrupt generation when there is a high I/O load. But that seems to be tied directly to the "disk" I have. I am curious to how this works with drivers that are SSDs for example.> > That said, I do think a comprehensive interrupt throttling mechanism for > netback, blkback and other backend I/O drivers would be useful and should > be pursued as a separate initiative. Such a mechanism would beYeah, my fear is that you will disappear once this patch is in and I will have to go forth to do this (and I don''t know when I would get to it). Would prefer to have your help here or (better) you write the patch for both problems right now :-)> particularly useful for netfront-netback stack which is more susceptible > to interrupt storms than blkfront-blkback. ''IRQ coalescing'' type mechanism > that could induce delays in the order of 10s of microsecs (certainly not > in millisecs though) to minimize interrupt generation rate would be useful > (similar to what NICs do).Good point. Awaiting your patch for the netfront-netback code :-) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
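In case a self-contained probe helps with the latency comparison requested above, here is a minimal userspace sketch (not the tool used in the reported benchmarks): it times random 4K O_DIRECT reads at queue depth 1 and prints the average per-I/O latency. Running it in the guest against /dev/xvdX and in dom0 against the backing device gives the two numbers to compare.

/* latprobe.c - gcc -O2 -o latprobe latprobe.c (add -lrt on older glibc) */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const size_t bs = 4096;
	const int count = 1000;
	struct timespec t0, t1;
	double total_us = 0;
	off_t dev_size;
	void *buf;
	int fd, i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }
	dev_size = lseek(fd, 0, SEEK_END);
	if (dev_size < (off_t)bs) { fprintf(stderr, "device too small\n"); return 1; }
	if (posix_memalign(&buf, 4096, bs)) return 1;

	for (i = 0; i < count; i++) {
		off_t off = (random() % (dev_size / bs)) * bs;
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (pread(fd, buf, bs, off) != (ssize_t)bs) { perror("pread"); return 1; }
		clock_gettime(CLOCK_MONOTONIC, &t1);
		total_us += (t1.tv_sec - t0.tv_sec) * 1e6 +
			    (t1.tv_nsec - t0.tv_nsec) / 1e3;
	}
	printf("avg latency: %.1f us over %d reads\n", total_us / count, count);
	free(buf);
	close(fd);
	return 0;
}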
Konrad Rzeszutek Wilk
2011-May-16 15:22 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
On Thu, May 12, 2011 at 10:51:32PM -0400, Konrad Rzeszutek Wilk wrote:> > >>what were the numbers when it came to high bandwidth numbers > > > > Under high I/O workload, where the blkfront would fill up the queue as > > blkback works the queue, the I/O latency problem in question doesn''t > > manifest itself and as a result this patch doesn''t make much of a > > difference in terms of interrupt rate. My benchmarks didn''t show any > > significant effect. > > I have to rerun my benchmarks. Under high load (so 64Kb, four threads > writting as much as they can to a iSCSI disk), the IRQ rate for each > blkif went from 2-3/sec to ~5K/sec. But I did not do a good > job on capturing the submission latency to see if the I/Os get the > response back as fast (or the same) as without your patch. > > And the iSCSI disk on the target side was an RAMdisk, so latency > was quite small which is not fair to your problem. > > Do you have a program to measure the latency for the workload you > had encountered? I would like to run those numbers myself.Ran some more benchmarks over this week. This time I tried to run it on: - iSCSI target (1GB, and on the "other side" it wakes up every 1msec, so the latency is set to 1msec). - scsi_debug delay=0 (no delay and as fast possible. Comes out to be about 4 microseconds completion with queue depth of one with 32K I/Os). - local SATAI 80GB ST3808110AS. Still running as it is quite slow. With only one PV guest doing a round (three times) of two threads randomly writting I/Os with a queue depth of 256. Then a different round of four threads writting/reading (80/20) 512bytes up to 64K randomly over the disk. I used the attached patch against #master (git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git) to gauge how well we are doing (and what the interrupt generation rate is). These workloads I think would be considered ''high I/O'' and I was expecting your patch to not have any influence on the numbers. But to my surprise the case where the I/O latency is high, the interrupt generation was quite small. But where the I/O latency was very very small (4 microseconds) the interrupt generation was on average about 20K/s. And this is with a queue depth of 256 with four threads. I was expecting the opposite. Hence quite curious to see your use case. What do you consider a middle I/O and low I/O cases? Do you use ''fio'' for your testing? With the high I/O load, the numbers came out to give us about 1% benefit with your patch. However, I am worried (maybe unneccassarily?) about the 20K interrupt generation when the iometer tests kicked in (this was only when using the unrealistic ''scsi_debug'' drive). The picture of this using iSCSI target: http://darnok.org/xen/amazon/iscsi_target/iometer-bw.png And when done on top of local RAMdisk: http://darnok.org/xen/amazon/scsi_debug/iometer-bw.png _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Vincent, Pradeep
2011-May-20 06:12 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
Hey Konrad, Thanks for running the tests. Very useful data. Re: Experiment to show latency improvement I never ran anything on ramdisk. You should be able to see the latency benefit with ''orion'' tool but I am sure other tools can be used as well. For a volume backed by a single disk drive, keep the number of small random I/O outstanding to 2 (I think "num_small" parameter in orion should do the job) with a 50-50 mix of write and read. Measure the latency reported by the guest and Dom-0 & compare them. For LVM volumes that present multiple drives as a single LUN (inside the guest), the latency improvement will be the highest when the number of I/O outstanding is 2X the number of spindles. This is the ''moderate I/O'' scenario I was describing and you should see significant improvement in latencies. If you allow page cache to perform sequential I/O using dd or other sequential non-direct I/O generation tool, you should find that the interrupt rate doesn''t go up for high I/O load. Thinking about this, I think burstiness of I/O submission as seen by the driver is also a key player particularly in the absence of I/O coalescing waits introduced by I/O scheduler. Page cache draining is notoriously bursty.>>queue depth of 256.What ''queue depth'' is this ? If I am not wrong, blkfront-blkback is restricted to ~32 max pending I/Os due to the limit of one page being used for mailbox entries - no ?>>But to my surprise the case where the I/O latency is high, the interrupt >>generation was quite smallIf this patch results in an extra interrupt, it will very likely result in reduction of latency for the next I/O. If the interrupt generation increase is not high, then the number of I/Os whose latencies this patch has improved is low. Looks like your workload belonged to this category. Perhaps that''s why you didn''t much of an improvement in overall performance ? I think this is close to the high I/O workload scenario I described.>>But where the I/O latency was very very small (4 microseconds) the >>interrupt generation was on average about 20K/s.This is not a scenario I tested but the results aren''t surprising. This isn''t the high I/O load I was describing though (I didn''t test ramdisk). SSD is probably the closest real world workload. An increase of 20K/sec means this patch very likely improved latency of 20K I/Os per sec although the absolute value of latency improvements would be smaller in this case. 20K/sec interrupt rate (50usec delay between interrupt) is something I would be comfortable with if they directly translate to latency improvement for the users. The graphs seem to indicate a 5% increase in throughput for this case - Am I reading the graphs right ? Overall, Very useful tests indeed and I haven''t seen anything too concerning or unexpected except that I don''t think you have seen the 50+% latency benefit that the patch got me in my moderate I/O benchmark :-) Feel free to ping me offline if you aren''t able to see the latency impact using the ''moderate I/O'' methodology described above. About IRQ coalescing: Stepping back a bit, there are few different use cases that irq coalescing mechanism would be useful for 1. Latency sensitive workload: Wait time of 10s of usecs. Particularly useful for SSDs. 2. Interrupt rate conscious workload/environment: Wait time of 200+ usecs which will essentially cap the theoretical interrupt rate to 5K. 3. Excessive CPU consumption Mitigation: This is similar to (2) but includes the case of malicious guests. 
Perhaps not a big concern unless you have lots of drives attached to each guest. I suspect the implementation for (1) and (2) would be different (spin vs sleep perhaps). (3) can''t be implemented by manipulation of ''req_event'' since a guest has the ability to abuse irq channel independent of what ''blkback'' tries to tell ''blkfront'' via ''req_event'' manipulation. (3) could be implemented in the hypervisor as a generic irq throttler that could be leveraged for all irqs heading to Dom-0 from DomUs including blkback/netback. Such a mechanism could potentially solve (1) and/or (2) as well. Thoughts ? One crude way to address (3) for ''many disk drive'' scenario is to pin all/most blkback interrupts for an instance to the same CPU core in Dom-0 and throttle down the thread wake up (wake_up(&blkif->wq) in blkif_notify_work) that usually results in IPIs. Not an elegant solution but might be a good crutch. Another angle to (1) and (2) is whether these irq coalesce settings should be controllable by the guest, perhaps within limits set by the administrator. Thoughts ? Suggestions ? Konrad, Love to help out if you are already working on something around irq coalescing. Or when I have irq coalescing functionality that can be consumed by community I will certainly submit them. Meanwhile, I wouldn''t want to deny Xen users the advantage of this patch just because there is no irq coalescing functionality. Particularly since the downside is very minimal on blkfront-blkback stack. My 2 cents.. Thanks much Konrad, - Pradeep Vincent On 5/16/11 8:22 AM, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:>On Thu, May 12, 2011 at 10:51:32PM -0400, Konrad Rzeszutek Wilk wrote: >> > >>what were the numbers when it came to high bandwidth numbers >> > >> > Under high I/O workload, where the blkfront would fill up the queue as >> > blkback works the queue, the I/O latency problem in question doesn''t >> > manifest itself and as a result this patch doesn''t make much of a >> > difference in terms of interrupt rate. My benchmarks didn''t show any >> > significant effect. >> >> I have to rerun my benchmarks. Under high load (so 64Kb, four threads >> writting as much as they can to a iSCSI disk), the IRQ rate for each >> blkif went from 2-3/sec to ~5K/sec. But I did not do a good >> job on capturing the submission latency to see if the I/Os get the >> response back as fast (or the same) as without your patch. >> >> And the iSCSI disk on the target side was an RAMdisk, so latency >> was quite small which is not fair to your problem. >> >> Do you have a program to measure the latency for the workload you >> had encountered? I would like to run those numbers myself. > >Ran some more benchmarks over this week. This time I tried to run it on: > > - iSCSI target (1GB, and on the "other side" it wakes up every 1msec, so >the > latency is set to 1msec). > - scsi_debug delay=0 (no delay and as fast possible. Comes out to be >about > 4 microseconds completion with queue depth of one with 32K I/Os). > - local SATAI 80GB ST3808110AS. Still running as it is quite slow. > >With only one PV guest doing a round (three times) of two threads randomly >writting I/Os with a queue depth of 256. Then a different round of four >threads writting/reading (80/20) 512bytes up to 64K randomly over the >disk. > >I used the attached patch against #master >(git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git) >to gauge how well we are doing (and what the interrupt generation rate >is). 
> >These workloads I think would be considered ''high I/O'' and I was expecting >your patch to not have any influence on the numbers. > >But to my surprise the case where the I/O latency is high, the interrupt >generation >was quite small. But where the I/O latency was very very small (4 >microseconds) >the interrupt generation was on average about 20K/s. And this is with a >queue depth >of 256 with four threads. I was expecting the opposite. Hence quite >curious >to see your use case. > >What do you consider a middle I/O and low I/O cases? Do you use ''fio'' for >your >testing? > >With the high I/O load, the numbers came out to give us about 1% benefit >with your >patch. However, I am worried (maybe unneccassarily?) about the 20K >interrupt generation >when the iometer tests kicked in (this was only when using the >unrealistic ''scsi_debug'' >drive). > >The picture of this using iSCSI target: >http://darnok.org/xen/amazon/iscsi_target/iometer-bw.png > >And when done on top of local RAMdisk: >http://darnok.org/xen/amazon/scsi_debug/iometer-bw.png >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
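To make coalescing option (1) above concrete, a minimal sketch of the "short wait before re-arming" idea: after the ring drains, poll for a few tens of microseconds before paying for another event. The delay value and the function name are illustrative only, and a real implementation would have to settle the spin-vs-sleep question raised above.

/* Hypothetical sketch of request-side coalescing, not a proposed patch. */
#define BLKBACK_COALESCE_US 50	/* illustrative, tens of microseconds */

static int blkback_check_requests_coalesced(struct xen_blkif *blkif)
{
	union blkif_back_rings *blk_rings = &blkif->blk_rings;
	int more_to_do = 0;

	/* Give the frontend a short window to queue its next request:
	 * cheaper than an extra interrupt if it was about to submit anyway. */
	udelay(BLKBACK_COALESCE_US);
	if (RING_HAS_UNCONSUMED_REQUESTS(&blk_rings->common))
		return 1;

	/* Nothing arrived: re-arm req_event so the frontend kicks us. */
	RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do);
	return more_to_do;
}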
Konrad Rzeszutek Wilk
2011-May-24 16:02 UTC
Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
On Thu, May 19, 2011 at 11:12:25PM -0700, Vincent, Pradeep wrote:> Hey Konrad, > > Thanks for running the tests. Very useful data. > > Re: Experiment to show latency improvement > > I never ran anything on ramdisk. > > You should be able to see the latency benefit with ''orion'' tool but I amLink?> sure other tools can be used as well. For a volume backed by a single disk > drive, keep the number of small random I/O outstanding to 2 (I think > "num_small" parameter in orion should do the job) with a 50-50 mix of > write and read. Measure the latency reported by the guest and Dom-0 & > compare them. For LVM volumes that present multiple drives as a single LUN > (inside the guest), the latency improvement will be the highest when the > number of I/O outstanding is 2X the number of spindles. This is the > ''moderate I/O'' scenario I was describing and you should see significant > improvement in latencies.Ok.> > > If you allow page cache to perform sequential I/O using dd or other > sequential non-direct I/O generation tool, you should find that the > interrupt rate doesn''t go up for high I/O load. Thinking about this, I > think burstiness of I/O submission as seen by the driver is also a key > player particularly in the absence of I/O coalescing waits introduced by > I/O scheduler. Page cache draining is notoriously bursty.Sure, .. thought most of the tests I''ve been doing have been bypassing the page cache.> > >>queue depth of 256. > > What ''queue depth'' is this ? If I am not wrong, blkfront-blkback isThe ''request_queue'' one. This is the block API one.> restricted to ~32 max pending I/Os due to the limit of one page being used > for mailbox entries - no ?This is the frontend''s block API queue I was thinking about. In regards to the ring buffer .. um, not exactly sure the right number (would have to compute it), but it is much bigger I believe. The ring buffer entries are for ''requests'', wherein each request can contain up to 11 pages of data (nr segments).> > >>But to my surprise the case where the I/O latency is high, the interrupt > >>generation was quite small > > If this patch results in an extra interrupt, it will very likely result in > reduction of latency for the next I/O. If the interrupt generation > increase is not high, then the number of I/Os whose latencies this patch > has improved is low. Looks like your workload belonged to this category. > Perhaps that''s why you didn''t much of an improvement in overall > performance ? I think this is close to the high I/O workload scenario I > described.Ok> > >>But where the I/O latency was very very small (4 microseconds) the > >>interrupt generation was on average about 20K/s. > > This is not a scenario I tested but the results aren''t surprising. This > isn''t the high I/O load I was describing though (I didn''t test ramdisk). > SSD is probably the closest real world workload. > An increase of 20K/sec means this patch very likely improved latency of > 20K I/Os per sec although the absolute value of latency improvements would > be smaller in this case. 20K/sec interrupt rate (50usec delay between > interrupt) is something I would be comfortable with if they directly > translate to latency improvement for the users. The graphs seem to > indicate a 5% increase in throughput for this case - Am I reading theI came up with 1%. But those are a bit unrealistic - and I ordered an SSD to do some proper testing.> graphs right ? 
> > Overall, Very useful tests indeed and I haven''t seen anything too > concerning or unexpected except that I don''t think you have seen the 50+% > latency benefit that the patch got me in my moderate I/O benchmark :-)Let me redo the tests again.> Feel free to ping me offline if you aren''t able to see the latency impact > using the ''moderate I/O'' methodology described above. > > About IRQ coalescing: Stepping back a bit, there are few different use > cases that irq coalescing mechanism would be useful for > > 1. Latency sensitive workload: Wait time of 10s of usecs. Particularly > useful for SSDs. > 2. Interrupt rate conscious workload/environment: Wait time of 200+ usecs > which will essentially cap the theoretical interrupt rate to 5K. > 3. Excessive CPU consumption Mitigation: This is similar to (2) but > includes the case of malicious guests. Perhaps not a big concern unless > you have lots of drives attached to each guest. > > I suspect the implementation for (1) and (2) would be different (spin vs > sleep perhaps). (3) can''t be implemented by manipulation of ''req_event'' > since a guest has the ability to abuse irq channel independent of what > ''blkback'' tries to tell ''blkfront'' via ''req_event'' manipulation. > > (3) could be implemented in the hypervisor as a generic irq throttler that > could be leveraged for all irqs heading to Dom-0 from DomUs including > blkback/netback. Such a mechanism could potentially solve (1) and/or (2) > as well. Thoughts ?The hypervisor does have some irq storm avoidancy mechanism. Thought the number is 100K/sec and it only applies to physical IRQs.> > One crude way to address (3) for ''many disk drive'' scenario is to pin > all/most blkback interrupts for an instance to the same CPU core in Dom-0 > and throttle down the thread wake up (wake_up(&blkif->wq) in > blkif_notify_work) that usually results in IPIs. Not an elegant solution > but might be a good crutch. > > Another angle to (1) and (2) is whether these irq coalesce settings should > be controllable by the guest, perhaps within limits set by the > administrator. > > Thoughts ? Suggestions ? > > Konrad, Love to help out if you are already working on something around > irq coalescing. Or when I have irq coalescing functionality that can beNot yet. Hence hinting for you to do it :-)> consumed by community I will certainly submit them. > > Meanwhile, I wouldn''t want to deny Xen users the advantage of this patch > just because there is no irq coalescing functionality. Particularly since > the downside is very minimal on blkfront-blkback stack. My 2 cents.. > > Thanks much Konrad, > > - Pradeep Vincent > > > > > On 5/16/11 8:22 AM, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote: > > >On Thu, May 12, 2011 at 10:51:32PM -0400, Konrad Rzeszutek Wilk wrote: > >> > >>what were the numbers when it came to high bandwidth numbers > >> > > >> > Under high I/O workload, where the blkfront would fill up the queue as > >> > blkback works the queue, the I/O latency problem in question doesn''t > >> > manifest itself and as a result this patch doesn''t make much of a > >> > difference in terms of interrupt rate. My benchmarks didn''t show any > >> > significant effect. > >> > >> I have to rerun my benchmarks. Under high load (so 64Kb, four threads > >> writting as much as they can to a iSCSI disk), the IRQ rate for each > >> blkif went from 2-3/sec to ~5K/sec. 
But I did not do a good > >> job on capturing the submission latency to see if the I/Os get the > >> response back as fast (or the same) as without your patch. > >> > >> And the iSCSI disk on the target side was an RAMdisk, so latency > >> was quite small which is not fair to your problem. > >> > >> Do you have a program to measure the latency for the workload you > >> had encountered? I would like to run those numbers myself. > > > >Ran some more benchmarks over this week. This time I tried to run it on: > > > > - iSCSI target (1GB, and on the "other side" it wakes up every 1msec, so > >the > > latency is set to 1msec). > > - scsi_debug delay=0 (no delay and as fast possible. Comes out to be > >about > > 4 microseconds completion with queue depth of one with 32K I/Os). > > - local SATAI 80GB ST3808110AS. Still running as it is quite slow. > > > >With only one PV guest doing a round (three times) of two threads randomly > >writting I/Os with a queue depth of 256. Then a different round of four > >threads writting/reading (80/20) 512bytes up to 64K randomly over the > >disk. > > > >I used the attached patch against #master > >(git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git) > >to gauge how well we are doing (and what the interrupt generation rate > >is). > > > >These workloads I think would be considered ''high I/O'' and I was expecting > >your patch to not have any influence on the numbers. > > > >But to my surprise the case where the I/O latency is high, the interrupt > >generation > >was quite small. But where the I/O latency was very very small (4 > >microseconds) > >the interrupt generation was on average about 20K/s. And this is with a > >queue depth > >of 256 with four threads. I was expecting the opposite. Hence quite > >curious > >to see your use case. > > > >What do you consider a middle I/O and low I/O cases? Do you use ''fio'' for > >your > >testing? > > > >With the high I/O load, the numbers came out to give us about 1% benefit > >with your > >patch. However, I am worried (maybe unneccassarily?) about the 20K > >interrupt generation > >when the iometer tests kicked in (this was only when using the > >unrealistic ''scsi_debug'' > >drive). > > > >The picture of this using iSCSI target: > >http://darnok.org/xen/amazon/iscsi_target/iometer-bw.png > > > >And when done on top of local RAMdisk: > >http://darnok.org/xen/amazon/scsi_debug/iometer-bw.png > > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Vincent, Pradeep
2011-May-24 22:40 UTC
RE: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
Response inline.. -----Original Message----- From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com] Sent: Tuesday, May 24, 2011 9:03 AM To: Vincent, Pradeep Cc: Daniel@acsinet11.oracle.com; Jeremy Fitzhardinge; xen-devel@lists.xensource.com; Jan Beulich; Stodden Subject: Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue On Thu, May 19, 2011 at 11:12:25PM -0700, Vincent, Pradeep wrote:> Hey Konrad, > > Thanks for running the tests. Very useful data. > > Re: Experiment to show latency improvement > > I never ran anything on ramdisk. > > You should be able to see the latency benefit with ''orion'' tool but I am>Link?PV: http://www.oracle.com/technetwork/topics/index-089595.html> sure other tools can be used as well. For a volume backed by a single disk > drive, keep the number of small random I/O outstanding to 2 (I think > "num_small" parameter in orion should do the job) with a 50-50 mix of > write and read. Measure the latency reported by the guest and Dom-0 & > compare them. For LVM volumes that present multiple drives as a single LUN > (inside the guest), the latency improvement will be the highest when the > number of I/O outstanding is 2X the number of spindles. This is the > ''moderate I/O'' scenario I was describing and you should see significant > improvement in latencies.Ok.> > > If you allow page cache to perform sequential I/O using dd or other > sequential non-direct I/O generation tool, you should find that the > interrupt rate doesn''t go up for high I/O load. Thinking about this, I > think burstiness of I/O submission as seen by the driver is also a key > player particularly in the absence of I/O coalescing waits introduced by > I/O scheduler. Page cache draining is notoriously bursty.Sure, .. thought most of the tests I''ve been doing have been bypassing the page cache.> > >>queue depth of 256. > > What ''queue depth'' is this ? If I am not wrong, blkfront-blkback isThe ''request_queue'' one. This is the block API one. PV: Got it.> restricted to ~32 max pending I/Os due to the limit of one page being used > for mailbox entries - no ?>This is the frontend''s block API queue I was thinking about. In regardsto the ring buffer .. um, not exactly sure the right number (would have to compute it), but it is much bigger I believe. The ring buffer entries are for ''requests'', wherein each request can contain>up to 11 pages of data (nr segments).PV: I just did a back of the envelope calculation for size of blkif_request that gave me ~78 bytes, primarily dominated by 6 bytes per segment for 11 segments per request. This would result in max pending I/O count of 32. This matches my recollection from long time back but not sure if I missed something. Of course, like you said each I/O req can have 44K of data but small sized random I/O can''t take advantage of it. (If I am not wrong, netback takes a slightly different approach where each slot is essentially a 4K page and multiple slots are used for larger sized packets.)> > >>But to my surprise the case where the I/O latency is high, the interrupt > >>generation was quite small > > If this patch results in an extra interrupt, it will very likely result in > reduction of latency for the next I/O. If the interrupt generation > increase is not high, then the number of I/Os whose latencies this patch > has improved is low. Looks like your workload belonged to this category. > Perhaps that''s why you didn''t much of an improvement in overall > performance ? 
I think this is close to the high I/O workload scenario I > described.Ok> > >>But where the I/O latency was very very small (4 microseconds) the > >>interrupt generation was on average about 20K/s. > > This is not a scenario I tested but the results aren''t surprising. This > isn''t the high I/O load I was describing though (I didn''t test ramdisk). > SSD is probably the closest real world workload. > An increase of 20K/sec means this patch very likely improved latency of > 20K I/Os per sec although the absolute value of latency improvements would > be smaller in this case. 20K/sec interrupt rate (50usec delay between > interrupt) is something I would be comfortable with if they directly > translate to latency improvement for the users. The graphs seem to > indicate a 5% increase in throughput for this case - Am I reading the>I came up with 1%. But those are a bit unrealistic - and I ordered >an SSD to do some proper testing.PV: Terrific.> graphs right ? > > Overall, Very useful tests indeed and I haven''t seen anything too > concerning or unexpected except that I don''t think you have seen the 50+% > latency benefit that the patch got me in my moderate I/O benchmark :-)Let me redo the tests again. PV: Thanks much. Let me know if you need more info on test setup.> Feel free to ping me offline if you aren''t able to see the latency impact > using the ''moderate I/O'' methodology described above. > > About IRQ coalescing: Stepping back a bit, there are few different use > cases that irq coalescing mechanism would be useful for > > 1. Latency sensitive workload: Wait time of 10s of usecs. Particularly > useful for SSDs. > 2. Interrupt rate conscious workload/environment: Wait time of 200+ usecs > which will essentially cap the theoretical interrupt rate to 5K. > 3. Excessive CPU consumption Mitigation: This is similar to (2) but > includes the case of malicious guests. Perhaps not a big concern unless > you have lots of drives attached to each guest. > > I suspect the implementation for (1) and (2) would be different (spin vs > sleep perhaps). (3) can''t be implemented by manipulation of ''req_event'' > since a guest has the ability to abuse irq channel independent of what > ''blkback'' tries to tell ''blkfront'' via ''req_event'' manipulation. > > (3) could be implemented in the hypervisor as a generic irq throttler that > could be leveraged for all irqs heading to Dom-0 from DomUs including > blkback/netback. Such a mechanism could potentially solve (1) and/or (2) > as well. Thoughts ?The hypervisor does have some irq storm avoidancy mechanism. Thought the>number is 100K/sec and it only applies to physical IRQs.PV: I will take a closer look to see what hypervisor already does here.> > One crude way to address (3) for ''many disk drive'' scenario is to pin > all/most blkback interrupts for an instance to the same CPU core in Dom-0 > and throttle down the thread wake up (wake_up(&blkif->wq) in > blkif_notify_work) that usually results in IPIs. Not an elegant solution > but might be a good crutch. > > Another angle to (1) and (2) is whether these irq coalesce settings should > be controllable by the guest, perhaps within limits set by the > administrator. > > Thoughts ? Suggestions ? > > Konrad, Love to help out if you are already working on something around > irq coalescing. Or when I have irq coalescing functionality that can beNot yet. Hence hinting for you to do it :-)> consumed by community I will certainly submit them. 
> > Meanwhile, I wouldn''t want to deny Xen users the advantage of this patch > just because there is no irq coalescing functionality. Particularly since > the downside is very minimal on blkfront-blkback stack. My 2 cents.. > > Thanks much Konrad, > > - Pradeep Vincent > > > > > On 5/16/11 8:22 AM, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote: > > >On Thu, May 12, 2011 at 10:51:32PM -0400, Konrad Rzeszutek Wilk wrote: > >> > >>what were the numbers when it came to high bandwidth numbers > >> > > >> > Under high I/O workload, where the blkfront would fill up the queue as > >> > blkback works the queue, the I/O latency problem in question doesn''t > >> > manifest itself and as a result this patch doesn''t make much of a > >> > difference in terms of interrupt rate. My benchmarks didn''t show any > >> > significant effect. > >> > >> I have to rerun my benchmarks. Under high load (so 64Kb, four threads > >> writting as much as they can to a iSCSI disk), the IRQ rate for each > >> blkif went from 2-3/sec to ~5K/sec. But I did not do a good > >> job on capturing the submission latency to see if the I/Os get the > >> response back as fast (or the same) as without your patch. > >> > >> And the iSCSI disk on the target side was an RAMdisk, so latency > >> was quite small which is not fair to your problem. > >> > >> Do you have a program to measure the latency for the workload you > >> had encountered? I would like to run those numbers myself. > > > >Ran some more benchmarks over this week. This time I tried to run it on: > > > > - iSCSI target (1GB, and on the "other side" it wakes up every 1msec, so > >the > > latency is set to 1msec). > > - scsi_debug delay=0 (no delay and as fast possible. Comes out to be > >about > > 4 microseconds completion with queue depth of one with 32K I/Os). > > - local SATAI 80GB ST3808110AS. Still running as it is quite slow. > > > >With only one PV guest doing a round (three times) of two threads randomly > >writting I/Os with a queue depth of 256. Then a different round of four > >threads writting/reading (80/20) 512bytes up to 64K randomly over the > >disk. > > > >I used the attached patch against #master > >(git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git) > >to gauge how well we are doing (and what the interrupt generation rate > >is). > > > >These workloads I think would be considered ''high I/O'' and I was expecting > >your patch to not have any influence on the numbers. > > > >But to my surprise the case where the I/O latency is high, the interrupt > >generation > >was quite small. But where the I/O latency was very very small (4 > >microseconds) > >the interrupt generation was on average about 20K/s. And this is with a > >queue depth > >of 256 with four threads. I was expecting the opposite. Hence quite > >curious > >to see your use case. > > > >What do you consider a middle I/O and low I/O cases? Do you use ''fio'' for > >your > >testing? > > > >With the high I/O load, the numbers came out to give us about 1% benefit > >with your > >patch. However, I am worried (maybe unneccassarily?) about the 20K > >interrupt generation > >when the iometer tests kicked in (this was only when using the > >unrealistic ''scsi_debug'' > >drive). 
> > > >The picture of this using iSCSI target: > >http://darnok.org/xen/amazon/iscsi_target/iometer-bw.png > > > >And when done on top of local RAMdisk: > >http://darnok.org/xen/amazon/scsi_debug/iometer-bw.png > > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
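The ~32 in-flight figure mentioned above can be sanity-checked with rough arithmetic on the single-page ring. The sizes below are approximations (the real request struct is padded, so sizeof() against the actual headers is authoritative), but they land on the same power-of-two result the ring macros produce:

/* Rough userspace check of the blkif ring depth; sizes are approximate. */
#include <stdio.h>

int main(void)
{
	unsigned page = 4096;		/* one shared ring page              */
	unsigned hdr = 64;		/* sring indexes plus padding (est.) */
	unsigned req = 16 + 11 * 8;	/* fixed fields + 11 segment entries */
	unsigned slots = (page - hdr) / req;
	unsigned ring = 1;

	/* __RING_SIZE rounds the slot count down to a power of two. */
	while (ring * 2 <= slots)
		ring *= 2;

	printf("~%u raw slots -> %u requests in flight\n", slots, ring);
	return 0;
}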
Daniel Stodden
2011-May-28 20:12 UTC
[RE-PATCH] Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
Hi. I got a chance to look into the problem below while chasing after some issues with big ring support. The way the blkback request dispatching presently works is indeed critically retarded. What the present approach essentially does is blocking any further request processing until the previously observed batch starts to complete. This completely kills throughput in the very common case where guests unplug or notify early, while rightfully assuming the reminder of their batch will be picked up as timely as the initial one. Proposed solution is to render request consumption and response production independent, as usual. Without having worked my way through the remainder of this thread again, the original goes into the right direction, but I think it should distinguish more between more_to_do -> we got more requests, go for them, and more_to_do -> we got more requests, but we''re resource congested, so wait(). Remaining request related processing in make_response is garbage, so removed. Patch follows and I''d strongly recommend to apply this or a similar fix to any tree under maintenance too. Daniel On Mon, 2011-05-02 at 03:04 -0400, Vincent, Pradeep wrote:> In blkback driver, after I/O requests are submitted to Dom-0 block I/O subsystem, blkback goes to ''sleep'' effectively without letting blkfront know about it (req_event isn''t set appropriately). Hence blkfront doesn''t notify blkback when it submits a new I/O thus delaying the ''dispatch'' of the new I/O to Dom-0 block I/O subsystem. The new I/O is dispatched as soon as one of the previous I/Os completes. > > As a result of this issue, the block I/O latency performance is degraded for some workloads on Xen guests using blkfront-blkback stack. > > The following change addresses this issue: > > > Signed-off-by: Pradeep Vincent <pradeepv@amazon.com> > > diff --git a/drivers/xen/blkback/blkback.c b/drivers/xen/blkback/blkback.c > --- a/drivers/xen/blkback/blkback.c > +++ b/drivers/xen/blkback/blkback.c > @@ -383,6 +383,12 @@ static int do_block_io_op(blkif_t *blkif) > cond_resched(); > } > > + /* If blkback might go to sleep (i.e. more_to_do == 0) then we better > + let blkfront know about it (by setting req_event appropriately) so that > + blkfront will bother to wake us up (via interrupt) when it submits a > + new I/O */ > + if (!more_to_do) > + RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); > return more_to_do; > }_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2011-May-28 20:21 UTC
[Xen-devel] [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
Running RING_FINAL_CHECK_FOR_REQUESTS from make_response is a bad
idea. It means that in-flight I/O is essentially blocking continued
batches. This essentially kills throughput on frontends which unplug
(or even just notify) early and rightfully assume additional requests
will be picked up on time, not synchronously.

Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
---
 drivers/block/xen-blkback/blkback.c |   36 ++++++++++++++++++----------------
 1 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 9dee545..48ad7fa 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -451,7 +451,8 @@ static void end_block_io_op(struct bio *bio, int error)
  * (which has the sectors we want, number of them, grant references, etc),
  * and transmute it to the block API to hand it over to the proper block disk.
  */
-static int do_block_io_op(struct xen_blkif *blkif)
+static int
+__do_block_io_op(struct xen_blkif *blkif)
 {
 	union blkif_back_rings *blk_rings = &blkif->blk_rings;
 	struct blkif_request req;
@@ -508,6 +509,23 @@ static int do_block_io_op(struct xen_blkif *blkif)
 	return more_to_do;
 }
 
+static int
+do_block_io_op(struct xen_blkif *blkif)
+{
+	union blkif_back_rings *blk_rings = &blkif->blk_rings;
+	int more_to_do;
+
+	do {
+		more_to_do = __do_block_io_op(blkif);
+		if (more_to_do)
+			break;
+
+		RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do);
+	} while (more_to_do);
+
+	return more_to_do;
+}
+
 /*
  * Transmutation of the 'struct blkif_request' to a proper 'struct bio'
  * and call the 'submit_bio' to pass it to the underlying storage.
@@ -698,7 +716,6 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 	struct blkif_response resp;
 	unsigned long flags;
 	union blkif_back_rings *blk_rings = &blkif->blk_rings;
-	int more_to_do = 0;
 	int notify;
 
 	resp.id = id;
@@ -725,22 +742,7 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 	}
 	blk_rings->common.rsp_prod_pvt++;
 	RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blk_rings->common, notify);
-	if (blk_rings->common.rsp_prod_pvt == blk_rings->common.req_cons) {
-		/*
-		 * Tail check for pending requests. Allows frontend to avoid
-		 * notifications if requests are already in flight (lower
-		 * overheads and promotes batching).
-		 */
-		RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do);
-
-	} else if (RING_HAS_UNCONSUMED_REQUESTS(&blk_rings->common)) {
-		more_to_do = 1;
-	}
-
 	spin_unlock_irqrestore(&blkif->blk_ring_lock, flags);
-
-	if (more_to_do)
-		blkif_notify_work(blkif);
 	if (notify)
 		notify_remote_via_irq(blkif->irq);
 }
-- 
1.7.4.1

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Vincent, Pradeep
2011-May-29 08:09 UTC
[Xen-devel] Re: [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
Opportunistically avoiding interrupts by checking for I/Os in the flight doesn''t sound like a bad idea. I think the RING_HAS_UNCONSUMED_REQUESTS call and what follows should be retained in ''make_response''. Also, should RING_FINAL_CHECK_FOR_REQUESTS be protected by blk_ring_lock ? - Pradeep Vincent On 5/28/11 1:21 PM, "Daniel Stodden" <daniel.stodden@citrix.com> wrote:>Running RING_FINAL_CHECK_FOR_REQUESTS from make_response is a bad >idea. It means that in-flight I/O is essentially blocking continued >batches. This essentially kills throughput on frontends which unplug >(or even just notify) early and rightfully assume addtional requests >will be picked up on time, not synchronously. > >Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> >--- > drivers/block/xen-blkback/blkback.c | 36 >++++++++++++++++++---------------- > 1 files changed, 19 insertions(+), 17 deletions(-) > >diff --git a/drivers/block/xen-blkback/blkback.c >b/drivers/block/xen-blkback/blkback.c >index 9dee545..48ad7fa 100644 >--- a/drivers/block/xen-blkback/blkback.c >+++ b/drivers/block/xen-blkback/blkback.c >@@ -451,7 +451,8 @@ static void end_block_io_op(struct bio *bio, int >error) > * (which has the sectors we want, number of them, grant references, >etc), > * and transmute it to the block API to hand it over to the proper >block disk. > */ >-static int do_block_io_op(struct xen_blkif *blkif) >+static int >+__do_block_io_op(struct xen_blkif *blkif) > { > union blkif_back_rings *blk_rings = &blkif->blk_rings; > struct blkif_request req; >@@ -508,6 +509,23 @@ static int do_block_io_op(struct xen_blkif *blkif) > return more_to_do; > } > >+static int >+do_block_io_op(blkif_t *blkif) >+{ >+ blkif_back_rings_t *blk_rings = &blkif->blk_rings; >+ int more_to_do; >+ >+ do { >+ more_to_do = __do_block_io_op(blkif); >+ if (more_to_do) >+ break; >+ >+ RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); >+ } while (more_to_do); >+ >+ return more_to_do; >+} >+ > /* > * Transmutation of the ''struct blkif_request'' to a proper ''struct bio'' > * and call the ''submit_bio'' to pass it to the underlying storage. >@@ -698,7 +716,6 @@ static void make_response(struct xen_blkif *blkif, >u64 id, > struct blkif_response resp; > unsigned long flags; > union blkif_back_rings *blk_rings = &blkif->blk_rings; >- int more_to_do = 0; > int notify; > > resp.id = id; >@@ -725,22 +742,7 @@ static void make_response(struct xen_blkif *blkif, >u64 id, > } > blk_rings->common.rsp_prod_pvt++; > RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blk_rings->common, notify); >- if (blk_rings->common.rsp_prod_pvt == blk_rings->common.req_cons) { >- /* >- * Tail check for pending requests. Allows frontend to avoid >- * notifications if requests are already in flight (lower >- * overheads and promotes batching). >- */ >- RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); >- >- } else if (RING_HAS_UNCONSUMED_REQUESTS(&blk_rings->common)) { >- more_to_do = 1; >- } >- > spin_unlock_irqrestore(&blkif->blk_ring_lock, flags); >- >- if (more_to_do) >- blkif_notify_work(blkif); > if (notify) > notify_remote_via_irq(blkif->irq); > } >-- >1.7.4.1 >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2011-May-29 11:34 UTC
[Xen-devel] Re: [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
On Sun, 2011-05-29 at 04:09 -0400, Vincent, Pradeep wrote:> Opportunistically avoiding interrupts by checking for I/Os in the flight > doesn''t sound like a bad idea. I think the RING_HAS_UNCONSUMED_REQUESTS > call and what follows should be retained in ''make_response''.There''s not much room for opportunism left here. After FINAL_CHECK returning with !_work_to_do you''re going to receive an interrupt. Holding that notification off would kill performance.>From there on, still leaving a duplicate check around end_io has only aninfinitesimal chance to preempt (none to prevent) the event reception. Even if it ever happens, the chance of making a difference in time to actual thread wake is probably even smaller. I think it''s just overhead. If you disagree, this stuff is easy to prove or confute with event counting. Good luck :)> Also, should RING_FINAL_CHECK_FOR_REQUESTS be protected by blk_ring_lock ?Nope. The ring lock is only needed to sync rsp production. Specifically, make_response upon request completion colliding with make_response called from the backend thread (the error case in do_block_io_op). Should rather be named rsp_lock or so, it doesn''t lock anything except rsp_prod. Daniel> > - Pradeep Vincent > > > On 5/28/11 1:21 PM, "Daniel Stodden" <daniel.stodden@citrix.com> wrote: > > >Running RING_FINAL_CHECK_FOR_REQUESTS from make_response is a bad > >idea. It means that in-flight I/O is essentially blocking continued > >batches. This essentially kills throughput on frontends which unplug > >(or even just notify) early and rightfully assume addtional requests > >will be picked up on time, not synchronously. > > > >Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> > >--- > > drivers/block/xen-blkback/blkback.c | 36 > >++++++++++++++++++---------------- > > 1 files changed, 19 insertions(+), 17 deletions(-) > > > >diff --git a/drivers/block/xen-blkback/blkback.c > >b/drivers/block/xen-blkback/blkback.c > >index 9dee545..48ad7fa 100644 > >--- a/drivers/block/xen-blkback/blkback.c > >+++ b/drivers/block/xen-blkback/blkback.c > >@@ -451,7 +451,8 @@ static void end_block_io_op(struct bio *bio, int > >error) > > * (which has the sectors we want, number of them, grant references, > >etc), > > * and transmute it to the block API to hand it over to the proper > >block disk. > > */ > >-static int do_block_io_op(struct xen_blkif *blkif) > >+static int > >+__do_block_io_op(struct xen_blkif *blkif) > > { > > union blkif_back_rings *blk_rings = &blkif->blk_rings; > > struct blkif_request req; > >@@ -508,6 +509,23 @@ static int do_block_io_op(struct xen_blkif *blkif) > > return more_to_do; > > } > > > >+static int > >+do_block_io_op(blkif_t *blkif) > >+{ > >+ blkif_back_rings_t *blk_rings = &blkif->blk_rings; > >+ int more_to_do; > >+ > >+ do { > >+ more_to_do = __do_block_io_op(blkif); > >+ if (more_to_do) > >+ break; > >+ > >+ RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); > >+ } while (more_to_do); > >+ > >+ return more_to_do; > >+} > >+ > > /* > > * Transmutation of the ''struct blkif_request'' to a proper ''struct bio'' > > * and call the ''submit_bio'' to pass it to the underlying storage. 
> >@@ -698,7 +716,6 @@ static void make_response(struct xen_blkif *blkif, > >u64 id, > > struct blkif_response resp; > > unsigned long flags; > > union blkif_back_rings *blk_rings = &blkif->blk_rings; > >- int more_to_do = 0; > > int notify; > > > > resp.id = id; > >@@ -725,22 +742,7 @@ static void make_response(struct xen_blkif *blkif, > >u64 id, > > } > > blk_rings->common.rsp_prod_pvt++; > > RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blk_rings->common, notify); > >- if (blk_rings->common.rsp_prod_pvt == blk_rings->common.req_cons) { > >- /* > >- * Tail check for pending requests. Allows frontend to avoid > >- * notifications if requests are already in flight (lower > >- * overheads and promotes batching). > >- */ > >- RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); > >- > >- } else if (RING_HAS_UNCONSUMED_REQUESTS(&blk_rings->common)) { > >- more_to_do = 1; > >- } > >- > > spin_unlock_irqrestore(&blkif->blk_ring_lock, flags); > >- > >- if (more_to_do) > >- blkif_notify_work(blkif); > > if (notify) > > notify_remote_via_irq(blkif->irq); > > } > >-- > >1.7.4.1 > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dong, Eddie
2011-May-31 13:44 UTC
[Xen-devel] Fix wrong help message for parameter nestedhvm
A typo? The config file is using nestedhvm=x.

Thx, Eddie

Fix wrong help message for parameter nestedhvm.

Signed-off-by: Eddie Dong <eddie.dong@intel.com>

diff -r 6489452ca865 tools/libxl/libxl.idl
--- a/tools/libxl/libxl.idl	Tue May 31 18:11:35 2011 +0800
+++ b/tools/libxl/libxl.idl	Tue May 31 21:34:52 2011 +0800
@@ -170,7 +170,7 @@
         ("hpet", bool),
         ("vpt_align", bool),
         ("timer_mode", integer),
-        ("nested_hvm", bool),
+        ("nestedhvm", bool),
         ])),
     ("pv", "!%s", Struct(None, [("kernel", libxl_file_reference),
_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-May-31 16:08 UTC
Re: [Xen-devel] [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
On Sat, May 28, 2011 at 01:21:10PM -0700, Daniel Stodden wrote:> Running RING_FINAL_CHECK_FOR_REQUESTS from make_response is a bad > idea. It means that in-flight I/O is essentially blocking continued > batches. This essentially kills throughput on frontends which unplug > (or even just notify) early and rightfully assume addtional requestsYou have any perf numbers?> will be picked up on time, not synchronously. > > Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> > --- > drivers/block/xen-blkback/blkback.c | 36 ++++++++++++++++++---------------- > 1 files changed, 19 insertions(+), 17 deletions(-) > > diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c > index 9dee545..48ad7fa 100644 > --- a/drivers/block/xen-blkback/blkback.c > +++ b/drivers/block/xen-blkback/blkback.c > @@ -451,7 +451,8 @@ static void end_block_io_op(struct bio *bio, int error) > * (which has the sectors we want, number of them, grant references, etc), > * and transmute it to the block API to hand it over to the proper block disk. > */ > -static int do_block_io_op(struct xen_blkif *blkif) > +static int > +__do_block_io_op(struct xen_blkif *blkif) > { > union blkif_back_rings *blk_rings = &blkif->blk_rings; > struct blkif_request req; > @@ -508,6 +509,23 @@ static int do_block_io_op(struct xen_blkif *blkif) > return more_to_do; > } > > +static int > +do_block_io_op(blkif_t *blkif) > +{ > + blkif_back_rings_t *blk_rings = &blkif->blk_rings; > + int more_to_do; > + > + do { > + more_to_do = __do_block_io_op(blkif); > + if (more_to_do) > + break; > + > + RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); > + } while (more_to_do); > + > + return more_to_do; > +} > + > /* > * Transmutation of the ''struct blkif_request'' to a proper ''struct bio'' > * and call the ''submit_bio'' to pass it to the underlying storage. > @@ -698,7 +716,6 @@ static void make_response(struct xen_blkif *blkif, u64 id, > struct blkif_response resp; > unsigned long flags; > union blkif_back_rings *blk_rings = &blkif->blk_rings; > - int more_to_do = 0; > int notify; > > resp.id = id; > @@ -725,22 +742,7 @@ static void make_response(struct xen_blkif *blkif, u64 id, > } > blk_rings->common.rsp_prod_pvt++; > RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blk_rings->common, notify); > - if (blk_rings->common.rsp_prod_pvt == blk_rings->common.req_cons) { > - /* > - * Tail check for pending requests. Allows frontend to avoid > - * notifications if requests are already in flight (lower > - * overheads and promotes batching). > - */ > - RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); > - > - } else if (RING_HAS_UNCONSUMED_REQUESTS(&blk_rings->common)) { > - more_to_do = 1; > - } > - > spin_unlock_irqrestore(&blkif->blk_ring_lock, flags); > - > - if (more_to_do) > - blkif_notify_work(blkif); > if (notify) > notify_remote_via_irq(blkif->irq); > } > -- > 1.7.4.1 > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Campbell
2011-May-31 16:23 UTC
Re: [Xen-devel] Fix wrong help message for parameter nestedhvm
(please don't reply to random threads in order to start a new thread, it confuses threaded mailers.)

On Tue, 2011-05-31 at 14:44 +0100, Dong, Eddie wrote:
> A typo? The config file is using nestedhvm=x.

This is the name of the struct member, not the configuration file option nor a help message. Perhaps it would be nicer if they were the same but there is no requirement for that to be so.

See xl_cmdimpl.c:parse_config_data():

    if (!xlu_cfg_get_long (config, "nestedhvm", &l))
        b_info->u.hvm.nested_hvm = l;

I don't think you can have compiled this patch!

Ian.

_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Daniel Stodden
2011-May-31 16:30 UTC
Re: [Xen-devel] [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
On Tue, 2011-05-31 at 12:08 -0400, Konrad Rzeszutek Wilk wrote:> On Sat, May 28, 2011 at 01:21:10PM -0700, Daniel Stodden wrote: > > Running RING_FINAL_CHECK_FOR_REQUESTS from make_response is a bad > > idea. It means that in-flight I/O is essentially blocking continued > > batches. This essentially kills throughput on frontends which unplug > > (or even just notify) early and rightfully assume addtional requests > > You have any perf numbers?This was on iSCSI. Single stripe, not a big array. Case of interest was a simple linear dd, 1GB. Dom0 gets to around 70M/s on the raw device node. Guest before: dd if=/dev/zero of=/dev/xvdc bs=1M count=1024 1073741824 bytes (1.1 GB) copied, 24.5 s, 43.8 MB/s 1073741824 bytes (1.1 GB) copied, 23.9216 s, 44.9 MB/s 1073741824 bytes (1.1 GB) copied, 23.336 s, 46.0 MB/s Guest after: dd if=/dev/zero of=/dev/xvdc bs=1M count=1024 1073741824 bytes (1.1 GB) copied, 17.0105 s, 63.1 MB/s 1073741824 bytes (1.1 GB) copied, 17.7566 s, 60.5 MB/s 1073741824 bytes (1.1 GB) copied, 17.163 s, 62.6 MB/s This is not a silver bullet. With faster backends, there are more issues with queue depth and request assembly, but one main blocker is what Pradeep first described. Daniel> > will be picked up on time, not synchronously. > > > > Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> > > --- > > drivers/block/xen-blkback/blkback.c | 36 ++++++++++++++++++---------------- > > 1 files changed, 19 insertions(+), 17 deletions(-) > > > > diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c > > index 9dee545..48ad7fa 100644 > > --- a/drivers/block/xen-blkback/blkback.c > > +++ b/drivers/block/xen-blkback/blkback.c > > @@ -451,7 +451,8 @@ static void end_block_io_op(struct bio *bio, int error) > > * (which has the sectors we want, number of them, grant references, etc), > > * and transmute it to the block API to hand it over to the proper block disk. > > */ > > -static int do_block_io_op(struct xen_blkif *blkif) > > +static int > > +__do_block_io_op(struct xen_blkif *blkif) > > { > > union blkif_back_rings *blk_rings = &blkif->blk_rings; > > struct blkif_request req; > > @@ -508,6 +509,23 @@ static int do_block_io_op(struct xen_blkif *blkif) > > return more_to_do; > > } > > > > +static int > > +do_block_io_op(blkif_t *blkif) > > +{ > > + blkif_back_rings_t *blk_rings = &blkif->blk_rings; > > + int more_to_do; > > + > > + do { > > + more_to_do = __do_block_io_op(blkif); > > + if (more_to_do) > > + break; > > + > > + RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); > > + } while (more_to_do); > > + > > + return more_to_do; > > +} > > + > > /* > > * Transmutation of the ''struct blkif_request'' to a proper ''struct bio'' > > * and call the ''submit_bio'' to pass it to the underlying storage. > > @@ -698,7 +716,6 @@ static void make_response(struct xen_blkif *blkif, u64 id, > > struct blkif_response resp; > > unsigned long flags; > > union blkif_back_rings *blk_rings = &blkif->blk_rings; > > - int more_to_do = 0; > > int notify; > > > > resp.id = id; > > @@ -725,22 +742,7 @@ static void make_response(struct xen_blkif *blkif, u64 id, > > } > > blk_rings->common.rsp_prod_pvt++; > > RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blk_rings->common, notify); > > - if (blk_rings->common.rsp_prod_pvt == blk_rings->common.req_cons) { > > - /* > > - * Tail check for pending requests. Allows frontend to avoid > > - * notifications if requests are already in flight (lower > > - * overheads and promotes batching). 
> > - */ > > - RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); > > - > > - } else if (RING_HAS_UNCONSUMED_REQUESTS(&blk_rings->common)) { > > - more_to_do = 1; > > - } > > - > > spin_unlock_irqrestore(&blkif->blk_ring_lock, flags); > > - > > - if (more_to_do) > > - blkif_notify_work(blkif); > > if (notify) > > notify_remote_via_irq(blkif->irq); > > } > > -- > > 1.7.4.1 > > > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xensource.com > > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
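As a reading aid for where these numbers come from: the backend only dispatches when its thread is awake. The sketch below condenses the shape of xen_blkif_schedule() from xen-blkback of this period (freezer handling, VBD resizing and the pending-request accounting are omitted); it is an approximation for illustration, not the exact upstream code.

int xen_blkif_schedule(void *arg)
{
        struct xen_blkif *blkif = arg;

        while (!kthread_should_stop()) {
                /* Sleep until an event (or a completion path) sets
                 * waiting_reqs and wakes the thread. */
                wait_event_interruptible(blkif->wq,
                                blkif->waiting_reqs || kthread_should_stop());

                blkif->waiting_reqs = 0;
                smp_mb();       /* clear flag before re-checking the ring */

                if (do_block_io_op(blkif))
                        blkif->waiting_reqs = 1;
        }
        return 0;
}

If do_block_io_op() returns 0 without req_event having been re-armed, nothing wakes this thread until an in-flight bio completes - the stall the outer FINAL_CHECK loop in the patch is there to prevent.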
Vincent, Pradeep
2011-Jun-01 08:02 UTC
[Xen-devel] Re: [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
Re: Opportunistic check in make_response Looking through blkfront logic, req_prod is only updated in batches - so you are right that given the current blkfront design the opportunistic check in make_response doesn''t reduce interrupt rate and the benefit will be limited to the reduction of I/O latency by an amount less than the interrupt latency. Having said that, I think batching req_prod update delays the processing of the first I/O by blkback until all the I/Os in the batch have been added to the I/O ring - potentially increasing I/O latencies. While batche interrupt generation makes perfect sense, I don''t see a solid reason for req_prod not being updated when I/Os are processed by blkfront. Netfront does increment req_prod as soon as it processes each skb (but it might send an interrupt for each skb as well which isn''t great). I think it would make sense to enhance blkfront design to increment req_prod as soon as it processes an I/O while batching irq generation. When blkfront and blkback are busy processing continuous stream of I/O requests, it would be great if blkfront-blkback pipeline is able to process them without generating unnecessary interrupts while improving I/O latency performance. Thoughts ? Any historical context on why this might be bad ? If blkfront does increment req_prod on every I/O submission, I think opportunistic check in make_response would make sense since such a check could trigger blkback to start processing requests well ahead of the ''batch'' completion on blkfront side (just the check for RING_HAS_UNCONSUMED_REQUESTS should suffice - other parts of what you removed should go) Re: Locking I was reflecting Jan Beulich''s comment earlier in this thread. Like I said before (in this thread), the locking logic in blkback isn''t obvious from the code and the failure modes seem benign. If someone has good context on blkback locking strategy, I would love to learn. Also it would be very useful to add some comments around lock usage to the blkback code. Jan ?? - Pradeep Vincent On 5/29/11 4:34 AM, "Daniel Stodden" <daniel.stodden@citrix.com> wrote:>On Sun, 2011-05-29 at 04:09 -0400, Vincent, Pradeep wrote: >> Opportunistically avoiding interrupts by checking for I/Os in the flight >> doesn''t sound like a bad idea. I think the RING_HAS_UNCONSUMED_REQUESTS >> call and what follows should be retained in ''make_response''. > >There''s not much room for opportunism left here. After FINAL_CHECK >returning with !_work_to_do you''re going to receive an interrupt. >Holding that notification off would kill performance. > >From there on, still leaving a duplicate check around end_io has only an >infinitesimal chance to preempt (none to prevent) the event reception. >Even if it ever happens, the chance of making a difference in time to >actual thread wake is probably even smaller. > >I think it''s just overhead. If you disagree, this stuff is easy to prove >or confute with event counting. Good luck :) > >> Also, should RING_FINAL_CHECK_FOR_REQUESTS be protected by >>blk_ring_lock ? > >Nope. The ring lock is only needed to sync rsp production. Specifically, >make_response upon request completion colliding with make_response >called from the backend thread (the error case in do_block_io_op). > >Should rather be named rsp_lock or so, it doesn''t lock anything except >rsp_prod. > >Daniel > >> >> - Pradeep Vincent >> >> >> On 5/28/11 1:21 PM, "Daniel Stodden" <daniel.stodden@citrix.com> wrote: >> >> >Running RING_FINAL_CHECK_FOR_REQUESTS from make_response is a bad >> >idea. 
It means that in-flight I/O is essentially blocking continued >> >batches. This essentially kills throughput on frontends which unplug >> >(or even just notify) early and rightfully assume addtional requests >> >will be picked up on time, not synchronously. >> > >> >Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> >> >--- >> > drivers/block/xen-blkback/blkback.c | 36 >> >++++++++++++++++++---------------- >> > 1 files changed, 19 insertions(+), 17 deletions(-) >> > >> >diff --git a/drivers/block/xen-blkback/blkback.c >> >b/drivers/block/xen-blkback/blkback.c >> >index 9dee545..48ad7fa 100644 >> >--- a/drivers/block/xen-blkback/blkback.c >> >+++ b/drivers/block/xen-blkback/blkback.c >> >@@ -451,7 +451,8 @@ static void end_block_io_op(struct bio *bio, int >> >error) >> > * (which has the sectors we want, number of them, grant references, >> >etc), >> > * and transmute it to the block API to hand it over to the proper >> >block disk. >> > */ >> >-static int do_block_io_op(struct xen_blkif *blkif) >> >+static int >> >+__do_block_io_op(struct xen_blkif *blkif) >> > { >> > union blkif_back_rings *blk_rings = &blkif->blk_rings; >> > struct blkif_request req; >> >@@ -508,6 +509,23 @@ static int do_block_io_op(struct xen_blkif *blkif) >> > return more_to_do; >> > } >> > >> >+static int >> >+do_block_io_op(blkif_t *blkif) >> >+{ >> >+ blkif_back_rings_t *blk_rings = &blkif->blk_rings; >> >+ int more_to_do; >> >+ >> >+ do { >> >+ more_to_do = __do_block_io_op(blkif); >> >+ if (more_to_do) >> >+ break; >> >+ >> >+ RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); >> >+ } while (more_to_do); >> >+ >> >+ return more_to_do; >> >+} >> >+ >> > /* >> > * Transmutation of the ''struct blkif_request'' to a proper ''struct >>bio'' >> > * and call the ''submit_bio'' to pass it to the underlying storage. >> >@@ -698,7 +716,6 @@ static void make_response(struct xen_blkif *blkif, >> >u64 id, >> > struct blkif_response resp; >> > unsigned long flags; >> > union blkif_back_rings *blk_rings = &blkif->blk_rings; >> >- int more_to_do = 0; >> > int notify; >> > >> > resp.id = id; >> >@@ -725,22 +742,7 @@ static void make_response(struct xen_blkif *blkif, >> >u64 id, >> > } >> > blk_rings->common.rsp_prod_pvt++; >> > RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blk_rings->common, notify); >> >- if (blk_rings->common.rsp_prod_pvt == blk_rings->common.req_cons) >>{ >> >- /* >> >- * Tail check for pending requests. Allows frontend to avoid >> >- * notifications if requests are already in flight (lower >> >- * overheads and promotes batching). >> >- */ >> >- RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); >> >- >> >- } else if (RING_HAS_UNCONSUMED_REQUESTS(&blk_rings->common)) { >> >- more_to_do = 1; >> >- } >> >- >> > spin_unlock_irqrestore(&blkif->blk_ring_lock, flags); >> >- >> >- if (more_to_do) >> >- blkif_notify_work(blkif); >> > if (notify) >> > notify_remote_via_irq(blkif->irq); >> > } >> >-- >> >1.7.4.1 >> > >> > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
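The enhancement Pradeep sketches - publish req_prod after every queued request while still coalescing the actual event - could be prototyped roughly as below on the frontend side. The helper name and its placement are made up for illustration; only RING_PUSH_REQUESTS_AND_CHECK_NOTIFY and notify_remote_via_irq() are existing interfaces blkfront already uses.

/* Hypothetical helper: make the request just queued visible to blkback
 * immediately and report whether the backend (per req_event) wants a kick. */
static inline int blkif_push_one(struct blkfront_info *info)
{
        int notify;

        RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&info->ring, notify);
        return notify;
}

/* Sketch of use in the dispatch loop:
 *
 *      wants_irq |= blkif_push_one(info);         after each queued request
 *      ...
 *      if (wants_irq)
 *              notify_remote_via_irq(info->irq);  once, when the batch is done
 */

Whether the extra req_prod writes on the shared ring are worth the latency win is exactly the experimental question taken up in the follow-ups.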
Jan Beulich
2011-Jun-01 08:24 UTC
[Xen-devel] Re: [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
>>> On 01.06.11 at 10:02, "Vincent, Pradeep" <pradeepv@amazon.com> wrote:
> Re: Locking
>
> I was reflecting Jan Beulich's comment earlier in this thread. Like I said
> before (in this thread), the locking logic in blkback isn't obvious from
> the code and the failure modes seem benign. If someone has good context on
> blkback locking strategy, I would love to learn. Also it would be very
> useful to add some comments around lock usage to the blkback code.
>
> Jan ??

No, not really - I'm more of an observer on this and similar discussions, just trying to learn as much as I can. Now and then I do notice possible issues, but as the design and the reasons for the choices made aren't (afaik, and like you also say) documented, a lot of room for guessing is usually left.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Daniel Stodden
2011-Jun-01 17:49 UTC
[Xen-devel] Re: [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
On Wed, 2011-06-01 at 04:02 -0400, Vincent, Pradeep wrote:
> Re: Opportunistic check in make_response
>
> Looking through blkfront logic, req_prod is only updated in batches - so
> you are right that given the current blkfront design the opportunistic
> check in make_response doesn't reduce interrupt rate and the benefit will
> be limited to the reduction of I/O latency by an amount less than the
> interrupt latency.

Yup.

> Having said that, I think batching req_prod update delays the processing
> of the first I/O by blkback until all the I/Os in the batch have been
> added to the I/O ring - potentially increasing I/O latencies. While batched
> interrupt generation makes perfect sense, I don't see a solid reason for
> req_prod not being updated when I/Os are processed by blkfront. Netfront
> does increment req_prod as soon as it processes each skb (but it might
> send an interrupt for each skb as well which isn't great).
>
> I think it would make sense to enhance blkfront design to increment
> req_prod as soon as it processes an I/O while batching irq generation.
> When blkfront and blkback are busy processing continuous stream of I/O
> requests, it would be great if blkfront-blkback pipeline is able to
> process them without generating unnecessary interrupts while improving I/O
> latency performance.
>
> Thoughts ? Any historical context on why this might
> be bad ?

No, and it's a good idea, and your assumptions are imho correct.

The extreme case is a PUSH_REQUESTS_AND_CHECK_NOTIFY after each
request. Even in this case, the majority of interrupts should be held
off by looking at req_event. There are certainly variants, but I can
show you some results for what you're asking, because I happened to try
that just last week.

Next, consider adding a couple of event counters in the backend. And you
need the patch above, of course.

req_event: Frontend event received
rsp_event: Frontend notification sent
req_again: FINAL_CHECK indicated more_to_do.

boost-1, order=0:

(This is the unmodified version)

dd if=/dev/zero of=/dev/xvdc bs=1M count=1024
1073741824 bytes (1.1 GB) copied, 17.0105 s, 63.1 MB/s
1073741824 bytes (1.1 GB) copied, 17.7566 s, 60.5 MB/s
1073741824 bytes (1.1 GB) copied, 17.163 s, 62.6 MB/s

rsp_event 6759
req_event 6753
req_again 16

boost-2, order=0

(This was the aggressive one, one PUSH_NOTIFY per ring request).

dd if=/dev/zero of=/dev/xvdc bs=1M count=1024
1073741824 bytes (1.1 GB) copied, 17.3208 s, 62.0 MB/s
1073741824 bytes (1.1 GB) copied, 17.4851 s, 61.4 MB/s
1073741824 bytes (1.1 GB) copied, 17.7333 s, 60.5 MB/s

rsp_event 7122
req_event 7141
req_again 5497

So the result is that the event load even in the most aggressive case
will typically increase only moderately. Instead, the restored outer
loop in the dispatcher just starts to play out.

I'm not proposing this as the final solution; we might choose to be more
careful and limit event emission to some stride instead.

Don't be confused by the throughput values not going up; the problem I
had with the array (iSCSI in this case) just turned out to be elsewhere.
I'm pretty convinced there are workload/storage configurations which
benefit from that.

In the case at hand, increasing the ring size was way more productive.
At which point the queue depth multiplies as well. And I currently
expect that the longer it gets the more urgent the issue you describe
will be.

But I also think it needs some experiments and wants to be backed by
numbers.
Daniel> If blkfront does increment req_prod on every I/O submission, I think > opportunistic check in make_response would make sense since such a check > could trigger blkback to start processing requests well ahead of the > ''batch'' completion on blkfront side (just the check for > RING_HAS_UNCONSUMED_REQUESTS should suffice - other parts of what you > removed should go) > > > Re: Locking > > I was reflecting Jan Beulich''s comment earlier in this thread. Like I said > before (in this thread), the locking logic in blkback isn''t obvious from > the code and the failure modes seem benign. If someone has good context on > blkback locking strategy, I would love to learn. Also it would be very > useful to add some comments around lock usage to the blkback code. > > Jan ?? > > - Pradeep Vincent > > > On 5/29/11 4:34 AM, "Daniel Stodden" <daniel.stodden@citrix.com> wrote: > > >On Sun, 2011-05-29 at 04:09 -0400, Vincent, Pradeep wrote: > >> Opportunistically avoiding interrupts by checking for I/Os in the flight > >> doesn''t sound like a bad idea. I think the RING_HAS_UNCONSUMED_REQUESTS > >> call and what follows should be retained in ''make_response''. > > > >There''s not much room for opportunism left here. After FINAL_CHECK > >returning with !_work_to_do you''re going to receive an interrupt. > >Holding that notification off would kill performance. > > > >From there on, still leaving a duplicate check around end_io has only an > >infinitesimal chance to preempt (none to prevent) the event reception. > >Even if it ever happens, the chance of making a difference in time to > >actual thread wake is probably even smaller. > > > >I think it''s just overhead. If you disagree, this stuff is easy to prove > >or confute with event counting. Good luck :) > > > >> Also, should RING_FINAL_CHECK_FOR_REQUESTS be protected by > >>blk_ring_lock ? > > > >Nope. The ring lock is only needed to sync rsp production. Specifically, > >make_response upon request completion colliding with make_response > >called from the backend thread (the error case in do_block_io_op). > > > >Should rather be named rsp_lock or so, it doesn''t lock anything except > >rsp_prod. > > > >Daniel > > > >> > >> - Pradeep Vincent > >> > >> > >> On 5/28/11 1:21 PM, "Daniel Stodden" <daniel.stodden@citrix.com> wrote: > >> > >> >Running RING_FINAL_CHECK_FOR_REQUESTS from make_response is a bad > >> >idea. It means that in-flight I/O is essentially blocking continued > >> >batches. This essentially kills throughput on frontends which unplug > >> >(or even just notify) early and rightfully assume addtional requests > >> >will be picked up on time, not synchronously. > >> > > >> >Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> > >> >--- > >> > drivers/block/xen-blkback/blkback.c | 36 > >> >++++++++++++++++++---------------- > >> > 1 files changed, 19 insertions(+), 17 deletions(-) > >> > > >> >diff --git a/drivers/block/xen-blkback/blkback.c > >> >b/drivers/block/xen-blkback/blkback.c > >> >index 9dee545..48ad7fa 100644 > >> >--- a/drivers/block/xen-blkback/blkback.c > >> >+++ b/drivers/block/xen-blkback/blkback.c > >> >@@ -451,7 +451,8 @@ static void end_block_io_op(struct bio *bio, int > >> >error) > >> > * (which has the sectors we want, number of them, grant references, > >> >etc), > >> > * and transmute it to the block API to hand it over to the proper > >> >block disk. 
> >> > */ > >> >-static int do_block_io_op(struct xen_blkif *blkif) > >> >+static int > >> >+__do_block_io_op(struct xen_blkif *blkif) > >> > { > >> > union blkif_back_rings *blk_rings = &blkif->blk_rings; > >> > struct blkif_request req; > >> >@@ -508,6 +509,23 @@ static int do_block_io_op(struct xen_blkif *blkif) > >> > return more_to_do; > >> > } > >> > > >> >+static int > >> >+do_block_io_op(blkif_t *blkif) > >> >+{ > >> >+ blkif_back_rings_t *blk_rings = &blkif->blk_rings; > >> >+ int more_to_do; > >> >+ > >> >+ do { > >> >+ more_to_do = __do_block_io_op(blkif); > >> >+ if (more_to_do) > >> >+ break; > >> >+ > >> >+ RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); > >> >+ } while (more_to_do); > >> >+ > >> >+ return more_to_do; > >> >+} > >> >+ > >> > /* > >> > * Transmutation of the ''struct blkif_request'' to a proper ''struct > >>bio'' > >> > * and call the ''submit_bio'' to pass it to the underlying storage. > >> >@@ -698,7 +716,6 @@ static void make_response(struct xen_blkif *blkif, > >> >u64 id, > >> > struct blkif_response resp; > >> > unsigned long flags; > >> > union blkif_back_rings *blk_rings = &blkif->blk_rings; > >> >- int more_to_do = 0; > >> > int notify; > >> > > >> > resp.id = id; > >> >@@ -725,22 +742,7 @@ static void make_response(struct xen_blkif *blkif, > >> >u64 id, > >> > } > >> > blk_rings->common.rsp_prod_pvt++; > >> > RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blk_rings->common, notify); > >> >- if (blk_rings->common.rsp_prod_pvt == blk_rings->common.req_cons) > >>{ > >> >- /* > >> >- * Tail check for pending requests. Allows frontend to avoid > >> >- * notifications if requests are already in flight (lower > >> >- * overheads and promotes batching). > >> >- */ > >> >- RING_FINAL_CHECK_FOR_REQUESTS(&blk_rings->common, more_to_do); > >> >- > >> >- } else if (RING_HAS_UNCONSUMED_REQUESTS(&blk_rings->common)) { > >> >- more_to_do = 1; > >> >- } > >> >- > >> > spin_unlock_irqrestore(&blkif->blk_ring_lock, flags); > >> >- > >> >- if (more_to_do) > >> >- blkif_notify_work(blkif); > >> > if (notify) > >> > notify_remote_via_irq(blkif->irq); > >> > } > >> >-- > >> >1.7.4.1 > >> > > >> > > > > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
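The req_event/rsp_event/req_again figures above come from an instrumentation patch that is only attached in the follow-up mail and was written against XCP, so it is not reproduced here. A minimal sketch of counters like these - names kept from the mail, bump sites assumed rather than taken from that patch - might look like:

/* Assumed layout, for illustration only. */
struct blkif_ring_stats {
        unsigned long req_event;        /* events received from the frontend */
        unsigned long rsp_event;        /* notifications sent to the frontend */
        unsigned long req_again;        /* FINAL_CHECK still found work */
};

/* Plausible bump sites in xen-blkback:
 *   req_event - in the event-channel handler (xen_blkif_be_int)
 *   rsp_event - in make_response(), whenever notify ends up set
 *   req_again - in do_block_io_op(), when RING_FINAL_CHECK_FOR_REQUESTS
 *               reports more_to_do after the ring appeared drained
 */

Read that way, the boost-2 run shows the dispatcher loop absorbing most of the extra pushes (req_again jumps) while the event counts barely move, which is the point about event load staying moderate.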
Daniel Stodden
2011-Jun-01 18:07 UTC
[Xen-devel] Re: [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
I attached the event counting patch. It should make sense, but it was against XCP, so it may not apply all smoothly for you. If you really want to play with it, and manage to find more interesting storage/system/patch combinations than the one quoted below, that''d probaby be really useful. Daniel On Wed, 2011-06-01 at 13:49 -0400, Daniel Stodden wrote:> > > > I think it would make sense to enhance blkfront design to increment > > req_prod as soon as it processes an I/O while batching irq generation. > > When blkfront and blkback are busy processing continuous stream of I/O > > requests, it would be great if blkfront-blkback pipeline is able to > > process them without generating unnecessary interrupts while improving I/O > > latency performance. > > > > Thoughts ? Any historical context on why this might > > be bad ? > > No, and it''s a good idea, and your assumptions are imho correct. > > The extreme case is an PUSH_REQUESTS_AND_CHECK_NOTIFY after each > request. Even in this case, the majority of interrupts should be held > off by looking at req_event. There are certainly variants, but I can > show you some results for what you''re asking. because I happened to try > that just last week. > > Next consider adding a couple event counters in the backend. And you > need the patch above, of course. > > req_event: Frontend event received > rsp_event: Frontend notification sent > req_again: FINAL_CHECK indicated more_to_do.> boost-1, order=0: > > (This is the unmodified version) > > dd if=/dev/zero of=/dev/xvdc bs=1M count=1024 > 1073741824 bytes (1.1 GB) copied, 17.0105 s, 63.1 MB/s > 1073741824 bytes (1.1 GB) copied, 17.7566 s, 60.5 MB/s > 1073741824 bytes (1.1 GB) copied, 17.163 s, 62.6 MB/s > > rsp_event 6759 > req_event 6753 > req_again 16 > > boost-2, order=0 > > (This was the aggressive one, one PUSH_NOTIFY per ring request). > > dd if=/dev/zero of=/dev/xvdc bs=1M count=1024 > 1073741824 bytes (1.1 GB) copied, 17.3208 s, 62.0 MB/s > 1073741824 bytes (1.1 GB) copied, 17.4851 s, 61.4 MB/s > 1073741824 bytes (1.1 GB) copied, 17.7333 s, 60.5 MB/s > > rsp_event 7122 > req_event 7141 > req_again 5497 > > > So the result is that the event load even in the most aggressive case > will typically increase only moderately. Instead, the restored outer > loop in the dispatcher just starts to play out. > > I''m not proposing this as the final solution, we might chose to be more > careful and limit event emission to some stride instead. > > Don''t be confused by the throughput values not going up, the problem I > had with the array (iSCSI in this case) just turned out to be elsewhere. > I''m pretty convinced there are workload/storage configurations which > benefit from that. > > In the case at hand, increasing the ring size was way more productive. > At which point the queue depth multiplies as well. And I currently > expect that the longer it gets the more urgent the issue you describe > will be. > > But I also think it needs some experiments and wants to be backed by > numbers. > > Daniel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Jun-27 14:03 UTC
[Xen-devel] Re: [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
> In the case at hand, increasing the ring size was way more productive.
> At which point the queue depth multiplies as well. And I currently
> expect that the longer it gets the more urgent the issue you describe
> will be.

You wouldn't have patches for that somewhere tucked away? I am going over
the patches for 3.1 xen-blkback and was thinking to have them all queued up and
test them all at once..

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Daniel Stodden
2011-Jun-27 18:42 UTC
[Xen-devel] Re: [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
On Mon, 2011-06-27 at 10:03 -0400, Konrad Rzeszutek Wilk wrote:
> > In the case at hand, increasing the ring size was way more productive.
> > At which point the queue depth multiplies as well. And I currently
> > expect that the longer it gets the more urgent the issue you describe
> > will be.
>
> You wouldn't have patches for that somewhere tucked away? I am going over
> the patches for 3.1 xen-blkback and was thinking to have them all queued up and
> test them all at once..

I was going to send the kernel patch right after, just to discover that
xen-blkback lacks some of the synchronization items the original one was
based on. It's coming, but it's rather going to be a series.

Daniel

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Jun-27 19:13 UTC
Re: [Xen-devel] Re: [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
On Mon, Jun 27, 2011 at 11:42:28AM -0700, Daniel Stodden wrote:
> On Mon, 2011-06-27 at 10:03 -0400, Konrad Rzeszutek Wilk wrote:
> > > In the case at hand, increasing the ring size was way more productive.
> > > At which point the queue depth multiplies as well. And I currently
> > > expect that the longer it gets the more urgent the issue you describe
> > > will be.
> >
> > You wouldn't have patches for that somewhere tucked away? I am going over
> > the patches for 3.1 xen-blkback and was thinking to have them all queued up and
> > test them all at once..
>
> I was going to send the kernel patch right after, just to discover that
> xen-blkback lacks some of the synchronization items the original one was
> based on. It's coming, but it's rather going to be a series.

That is fine. Please also CC lkml when posting the series. Thanks!

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Daniel Stodden
2011-Jun-28 00:31 UTC
Re: [Xen-devel] Re: [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
On Mon, 2011-06-27 at 15:13 -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 27, 2011 at 11:42:28AM -0700, Daniel Stodden wrote:
> > On Mon, 2011-06-27 at 10:03 -0400, Konrad Rzeszutek Wilk wrote:
> > > > In the case at hand, increasing the ring size was way more productive.
> > > > At which point the queue depth multiplies as well. And I currently
> > > > expect that the longer it gets the more urgent the issue you describe
> > > > will be.
> > >
> > > You wouldn't have patches for that somewhere tucked away? I am going over
> > > the patches for 3.1 xen-blkback and was thinking to have them all queued up and
> > > test them all at once..
> >
> > I was going to send the kernel patch right after, just to discover that
> > xen-blkback lacks some of the synchronization items the original one was
> > based on. It's coming, but it's rather going to be a series.
>
> That is fine. Please also CC lkml when posting the series. Thanks!

That's a more interesting thing, actually: Do you plan to maintain this
stuff? Because xen-blkback presently has no dedicated MAINTAINERS entry,
iirc, so I guess it defaults to Jens.

It might indeed make more sense to collect tested batches, and submit
them as such.

Daniel

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Konrad Rzeszutek Wilk
2011-Jun-28 13:19 UTC
Re: [Xen-devel] Re: [PATCH] xen/blkback: Don't let in-flight requests defer pending ones.
On Mon, Jun 27, 2011 at 05:31:47PM -0700, Daniel Stodden wrote:
> On Mon, 2011-06-27 at 15:13 -0400, Konrad Rzeszutek Wilk wrote:
> > On Mon, Jun 27, 2011 at 11:42:28AM -0700, Daniel Stodden wrote:
> > > On Mon, 2011-06-27 at 10:03 -0400, Konrad Rzeszutek Wilk wrote:
> > > > > In the case at hand, increasing the ring size was way more productive.
> > > > > At which point the queue depth multiplies as well. And I currently
> > > > > expect that the longer it gets the more urgent the issue you describe
> > > > > will be.
> > > >
> > > > You wouldn't have patches for that somewhere tucked away? I am going over
> > > > the patches for 3.1 xen-blkback and was thinking to have them all queued up and
> > > > test them all at once..
> > >
> > > I was going to send the kernel patch right after, just to discover that
> > > xen-blkback lacks some of the synchronization items the original one was
> > > based on. It's coming, but it's rather going to be a series.
> >
> > That is fine. Please also CC lkml when posting the series. Thanks!
>
> That's a more interesting thing, actually: Do you plan to maintain this

That is my plan.

> stuff? Because xen-blkback presently has no dedicated MAINTAINERS entry,

You are right. I should send a patch to explicitly state it, even though this:

$ scripts/get_maintainer.pl -f drivers/block/xen-blkback/
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> (commit_signer:31/32=97%)
Laszlo Ersek <lersek@redhat.com> (commit_signer:2/32=6%)
Jan Beulich <jbeulich@novell.com> (commit_signer:2/32=6%)
linux-kernel@vger.kernel.org (open list)

Kind of makes me the top choice.

> iirc, so I guess it defaults to Jens.

Well, everything under drivers/block _has_ to eventually go through Jens. It can go first through xen-devel to make sure there is nothing bogus, and be reviewed here. And I can collect the patches, stick them in a branch, run through the Xen gauntlet tests and then ask Jens to GIT PULL them. It does not hurt to additionally go through LKML - more eyes the better.

> > It might indeed make more sense to collect tested batches, and submit
> > them as such.

<nods> So far I've:

xen/blkback: Don't let in-flight requests defer pending ones. (#stable/for-jens)

And I wouldn't mind putting some more there before I start cranking some tests.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel