Jeremy Fitzhardinge
2010-Jul-22 21:20 UTC
[Xen-devel] [PATCH RFC] xen/blkfront: use tagged queuing for barriers
When barriers are supported, then use QUEUE_ORDERED_TAG to tell the block
subsystem that it doesn't need to do anything else with the barriers.
Previously we used ORDERED_DRAIN which caused the block subsystem to
drain all pending IO before submitting the barrier, which would be
very expensive.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 92091a7..b61a021 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -418,10 +418,20 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
 static int xlvbd_barrier(struct blkfront_info *info)
 {
 	int err;
+	unsigned ordered = QUEUE_ORDERED_NONE;
 
-	err = blk_queue_ordered(info->rq,
-				info->feature_barrier ? QUEUE_ORDERED_DRAIN : QUEUE_ORDERED_NONE,
-				NULL);
+	/*
+	 * If we don't have barrier support, then there's really no
+	 * way to guarantee write ordering, so we really just have to
+	 * send writes to the backend and hope for the best.  If
+	 * barriers are supported, we don't really know what the
+	 * flushing semantics of the barrier are, so again, tag the
+	 * requests in order and hope for the best.
+	 */
+	if (info->feature_barrier)
+		ordered = QUEUE_ORDERED_TAG;
+
+	err = blk_queue_ordered(info->rq, ordered, NULL);
 
 	if (err)
 		return err;
@@ -508,8 +518,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
 	info->rq = gd->queue;
 	info->gd = gd;
 
-	if (info->feature_barrier)
-		xlvbd_barrier(info);
+	xlvbd_barrier(info);
 
 	if (vdisk_info & VDISK_READONLY)
 		set_disk_ro(gd, 1);
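As background for the ordering modes the patch switches between: a barrier reaches the driver as a bio tagged with BIO_RW_BARRIER, and the block layer enforces it according to the mode set with blk_queue_ordered() -- draining all pending IO for QUEUE_ORDERED_DRAIN, or forwarding an ordered/tagged request for QUEUE_ORDERED_TAG. The helper below is only a sketch (not code from the thread); submit_bio(), bio_alloc() and WRITE_BARRIER are real 2.6.3x-era interfaces, the helper itself is hypothetical.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/fs.h>

static void barrier_write_end_io(struct bio *bio, int error)
{
	/* For the sketch, just release the bio once it completes. */
	bio_put(bio);
}

/* Hypothetical helper: write one page as a barrier write. */
static void submit_barrier_write(struct block_device *bdev,
				 struct page *page, sector_t sector)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	bio->bi_bdev   = bdev;
	bio->bi_sector = sector;
	bio->bi_end_io = barrier_write_end_io;
	bio_add_page(bio, page, PAGE_SIZE, 0);

	/* WRITE_BARRIER is WRITE | (1 << BIO_RW_BARRIER) in this era. */
	submit_bio(WRITE_BARRIER, bio);
}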
Daniel Stodden
2010-Jul-28 11:06 UTC
[Xen-devel] Re: [PATCH RFC] xen/blkfront: use tagged queuing for barriers
Hey.

I tried to figure out some of the barrier stuff. I think it's becoming more clear to me now. But take everything below with a grain of salt. Sorry for having written a book about it. It's just because I'm not sure if I'm always reasoning correctly.

There's a summary at the bottom. Maybe skip over and just apply your patch, possibly with a more optimistic update to the comment, and with a third option in support of QUEUE_ORDERED_DRAIN.

In between, it seems to make sense to differentiate between blkback and blktap1/2 to understand what our disk model traditionally was/is, because until now frontends don't differentiate much. Except for maybe that feature-barrier flag. And below I gather why I think we might want to do something about it.

Blktap1
--------------------------------------------------------------------------

A blktap1 backend ring is quite simple to explain. It thinks it's a cacheless disk with no ordering guarantees. In kernel queueing terms that's a QUEUE_ORDERED_DRAIN.

 - We pass requests directly on to tapdisk.

 - Tapdisk drivers will scatter and reorder at will, especially VHD.

 - The RSP message is driven by iocb completion(s).

 - It fully relies on AIO/O_DIRECT to make sure the data made it actually down to the platter by that time.

I'm not aware of anything else left to a userspace app.

Blktap2
--------------------------------------------------------------------------

The blktap2 datapath was completely derived from blktap1.

The difference is that it's now hooked into a bdev. Which is quite important, because image consistency is now in the hands of the block layer.

It presently doesn't declare a barrier mode, which in my understanding evaluates to QUEUE_ORDERED_NONE.

My understanding of QUEUE_ORDERED_NONE semantics is that the blk code and filesystems will happily fill the queue and assume in-order completion. This is the mode of choice with either a queue depth of 1 or no reordering. Or a non-flushable cache, at which point you know you're already screwed, so you don't need to worry.

Blktap2 passes a frontend ring's worth of requests to userland, with the same semantics as blktap1: queue depth of 32, no serialization control. Which should actually be a QUEUE_ORDERED_DRAIN passed on to the block layer.

So I guess we better fix that.

Blkback
--------------------------------------------------------------------------

I had some trouble with this one. Blkback and -tap1 being somewhat the same kinda guy from the frontend perspective, I've always assumed the semantics should be (are) the same. Now I think that's quite wrong.

Blkback is sitting on a kernel device queue, and that's a slightly different beast, because the blkdev interface works SCSI-style, which means it's completely revolving around SCSI barriers.

On SCSI you issue a barrier into a request stream and ideally that directly maps to controller hardware, you're done.

For ATA with NCQ that apparently doesn't apply, but you still issue a barrier op. It will cause a cache flush (if given) to complete foregoing requests, then a forced unit access (if given) to complete the barrier'd write itself. Or a post-flush, again.

The point is you don't do that on every request separately, or performance somewhat suffers. :)

Let's assume blkback on a bdev of queue depth >1 with barriers enabled. And that we wanted to guarantee responses implying physical completion. We'd have to do at least something like the following: queue any batch we can get from the frontend, set a barrier bit on the last one, then ack every req only after the last one completed. Because in between, bio completion doesn't mean much. That might even be an option, I haven't tried. One could probably also play tricks based on the _FLUSH bit. Anyway, I don't see the blkback code presently doing that.

From the blkfront perspective, that would still be a weird-looking QUEUE_ORDERED_DRAIN. Because everything queued appears to complete at once. It didn't, of course, so you still have to drain where you want to be safe.

If we don't do that, then the frontend has to submit the barrier position.

I find this has some interesting implications. One is that the guest will effectively be utilizing a writeback cache, safely. The reason why I find this so exciting is because that cache can be potentially much larger than a ring, and it'd still be safely used by the guest. The other reason is I'm often made to wonder if blktap should get one.

To a Linux/blkfront queue, this is a QUEUE_ORDERED_TAG.

Blkif
--------------------------------------------------------------------------

I'm not entirely clear about the reason for BLKIF_OP_FLUSH_DISKCACHE, given the existence of BLKIF_OP_WRITE_BARRIER. Maybe someone can enlighten me.

The latter implies a flush, if one is necessary (otherwise it wouldn't be a barrier).

I could imagine drain/flush combinations executed to achieve barriers, or a flush as a slightly relaxed alternative, e.g. for queues which preserve order.

Otoh, Xen normally tries to be about interface idealization where appropriate. There is no point in feature-barrier==0 && feature-cache-flush==1 or vice versa, because a barrier would still have been realizable per drain/flush, and at least in the blkback case there'd be no additional cost in the backend if it just does that on behalf of the frontend. And it saves precious ring space.

Last, I'm not clear about how the meaning of feature-barrier got formulated in the header. To the frontend, it is not a "feature" which "may succeed". Rather, if set, it is a strong recommendation for guests who aim to eventually remount their filesystems r/w after a reboot.

Summary
--------------------------------------------------------------------------

I'm assuming most guest OS filesystem/block layer glue follows an ordering interface based on SCSI? Does anyone know if there are exceptions to that? I'd be really curious. [*]

Assuming the former is correct, I think we absolutely want interface idealization.

This leaves exactly 2.5 reasonable modes which frontends should prepare for:

 - feature-barrier == 0 -> QUEUE_ORDERED_NONE

   Disk is either a) non-caching/non-reordering or b) broken.

 - feature-barrier == 1 -> QUEUE_ORDERED_TAG

   Disk is reordering, quite possibly caching, but you won't need to know. You seriously want to issue ordered writes with BLKIF_OP_WRITE_BARRIER.

 [
 - !xenbus_exists(feature-barrier) -> QUEUE_ORDERED_DRAIN (?)

   This is either blktap1, or an outdated blkback (?). I think one would want to assume blktap1 (or parse otherend-id). With blkback there's probably not much you can do anyway. With blktap1 there's a chance you're doing the right thing.
 ]

I'd suggest to ignore/phase-out the caching bit, because it's no use wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE.

I'd suggest to fix the backends where people see fit.

In blktap2, which appears urgent to me, this is a one-liner for now, setting QUEUE_ORDERED_DRAIN on the bdev. Unless we discover some day we want to implement barriers in tapdisk. Which is not going to happen in July.

In good old blktap1, it would take a queue drain. I guess that would best be done in userspace, by forwarding BLKIF_OP_BARRIER. At which point we might just shift the blktap2 tapdev mode to QUEUE_ORDERED_TAG as well.

[*] One could imagine different hardware models for a ring-based Xen backend. As mentioned above, NCQ could be reasonable too. Essentially, one might just go and map queue ordering bits into a xenstore key and let the guest I/O stack figure it out. That doesn't map to bio.h, but it would map to blkdev.h. I'm also the wrong guy to ask if there'd be a benefit :). Definitely not for Linux guests, where it's only overhead in order to get the same full barrier op on the ring. But maybe other OSes?

On Thu, 2010-07-22 at 17:20 -0400, Jeremy Fitzhardinge wrote:
> When barriers are supported, then use QUEUE_ORDERED_TAG to tell the block subsystem that it doesn't need to do anything else with the barriers. Previously we used ORDERED_DRAIN which caused the block subsystem to drain all pending IO before submitting the barrier, which would be very expensive.
>
> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 92091a7..b61a021 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -418,10 +418,20 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
>  static int xlvbd_barrier(struct blkfront_info *info)
>  {
>  	int err;
> +	unsigned ordered = QUEUE_ORDERED_NONE;
>
> -	err = blk_queue_ordered(info->rq,
> -				info->feature_barrier ? QUEUE_ORDERED_DRAIN : QUEUE_ORDERED_NONE,
> -				NULL);
> +	/*
> +	 * If we don't have barrier support, then there's really no
> +	 * way to guarantee write ordering, so we really just have to
> +	 * send writes to the backend and hope for the best.  If
> +	 * barriers are supported, we don't really know what the
> +	 * flushing semantics of the barrier are, so again, tag the
> +	 * requests in order and hope for the best.
> +	 */
> +	if (info->feature_barrier)
> +		ordered = QUEUE_ORDERED_TAG;
> +
> +	err = blk_queue_ordered(info->rq, ordered, NULL);
>
>  	if (err)
>  		return err;
> @@ -508,8 +518,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
>  	info->rq = gd->queue;
>  	info->gd = gd;
>
> -	if (info->feature_barrier)
> -		xlvbd_barrier(info);
> +	xlvbd_barrier(info);
>
>  	if (vdisk_info & VDISK_READONLY)
>  		set_disk_ro(gd, 1);
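The blktap2 "one-liner" Daniel describes would look roughly like the sketch below; the surrounding helper is a stand-in, not an actual blktap2 function, but blk_queue_ordered() and QUEUE_ORDERED_DRAIN are the real kernel interface of that era.

#include <linux/blkdev.h>

/* Stand-in for the spot in blktap2's device setup where the tapdev's
 * request queue gets configured. */
static void blktap_device_set_ordering(struct request_queue *rq)
{
	/*
	 * Tapdisk acks a request only once AIO/O_DIRECT has put the
	 * data on the platter, but it reorders freely in between, so
	 * draining the queue is the right barrier discipline.
	 */
	blk_queue_ordered(rq, QUEUE_ORDERED_DRAIN, NULL);
}

This leaves tapdisk's completion semantics untouched; the block layer simply drains in-flight requests before letting a barrier through.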
Jeremy Fitzhardinge
2010-Jul-28 18:03 UTC
[Xen-devel] Re: [PATCH RFC] xen/blkfront: use tagged queuing for barriers
On 07/28/2010 04:06 AM, Daniel Stodden wrote:
> Hey.
>
> I tried to figure out some of the barrier stuff. I think it's becoming more clear to me now. But take everything below with a grain of salt. Sorry for having written a book about it. It's just because I'm not sure if I'm always reasoning correctly.

Thanks for the detailed response!

> There's a summary at the bottom. Maybe skip over and just apply your patch, possibly with a more optimistic update to the comment, and with a third option in support of QUEUE_ORDERED_DRAIN.
>
> In between, it seems to make sense to differentiate between blkback and blktap1/2 to understand what our disk model traditionally was/is, because until now frontends don't differentiate much. Except for maybe that feature-barrier flag. And below I gather why I think we might want to do something about it.
>
> Blktap1
> --------------------------------------------------------------------------
>
> A blktap1 backend ring is quite simple to explain. It thinks it's a cacheless disk with no ordering guarantees. In kernel queueing terms that's a QUEUE_ORDERED_DRAIN.

Yes, if it has no cache then we don't need to worry about delayed writes, and draining should be enough.

> - We pass requests directly on to tapdisk.
>
> - Tapdisk drivers will scatter and reorder at will, especially VHD.
>
> - The RSP message is driven by iocb completion(s).
>
> - It fully relies on AIO/O_DIRECT to make sure the data made it actually down to the platter by that time.
>
> I'm not aware of anything else left to a userspace app.
>
> Blktap2
> --------------------------------------------------------------------------
>
> The blktap2 datapath was completely derived from blktap1.
>
> The difference is that it's now hooked into a bdev. Which is quite important, because image consistency is now in the hands of the block layer.
>
> It presently doesn't declare a barrier mode, which in my understanding evaluates to QUEUE_ORDERED_NONE.
>
> My understanding of QUEUE_ORDERED_NONE semantics is that the blk code and filesystems will happily fill the queue and assume in-order completion. This is the mode of choice with either a queue depth of 1 or no reordering. Or a non-flushable cache, at which point you know you're already screwed, so you don't need to worry.
>
> Blktap2 passes a frontend ring's worth of requests to userland, with the same semantics as blktap1: queue depth of 32, no serialization control. Which should actually be a QUEUE_ORDERED_DRAIN passed on to the block layer.

So blktap2 doesn't pass barriers into userland? I guess userland doesn't really have a way to send barriers into the kernel except with f(data)sync or direct io.

> So I guess we better fix that.
>
> Blkback
> --------------------------------------------------------------------------
>
> I had some trouble with this one. Blkback and -tap1 being somewhat the same kinda guy from the frontend perspective, I've always assumed the semantics should be (are) the same. Now I think that's quite wrong.
>
> Blkback is sitting on a kernel device queue, and that's a slightly different beast, because the blkdev interface works SCSI-style, which means it's completely revolving around SCSI barriers.
>
> On SCSI you issue a barrier into a request stream and ideally that directly maps to controller hardware, you're done.
>
> For ATA with NCQ that apparently doesn't apply, but you still issue a barrier op. It will cause a cache flush (if given) to complete foregoing requests, then a forced unit access (if given) to complete the barrier'd write itself. Or a post-flush, again.
>
> The point is you don't do that on every request separately, or performance somewhat suffers. :)
>
> Let's assume blkback on a bdev of queue depth >1 with barriers enabled. And that we wanted to guarantee responses implying physical completion. We'd have to do at least something like the following: queue any batch we can get from the frontend, set a barrier bit on the last one, then ack every req only after the last one completed. Because in between, bio completion doesn't mean much. That might even be an option, I haven't tried. One could probably also play tricks based on the _FLUSH bit. Anyway, I don't see the blkback code presently doing that.

Couldn't blkback just send a barrier write down into the host's storage subsystem, and rely on it completing the bios in the right order?

> From the blkfront perspective, that would still be a weird-looking QUEUE_ORDERED_DRAIN. Because everything queued appears to complete at once. It didn't, of course, so you still have to drain where you want to be safe.
>
> If we don't do that, then the frontend has to submit the barrier position.

Not sure I follow this.

> I find this has some interesting implications. One is that the guest will effectively be utilizing a writeback cache, safely. The reason why I find this so exciting is because that cache can be potentially much larger than a ring, and it'd still be safely used by the guest. The other reason is I'm often made to wonder if blktap should get one.
>
> To a Linux/blkfront queue, this is a QUEUE_ORDERED_TAG.
>
> Blkif
> --------------------------------------------------------------------------
>
> I'm not entirely clear about the reason for BLKIF_OP_FLUSH_DISKCACHE, given the existence of BLKIF_OP_WRITE_BARRIER. Maybe someone can enlighten me.
>
> The latter implies a flush, if one is necessary (otherwise it wouldn't be a barrier).

Really? It's perfectly reasonable to define a barrier which prevents reordering but doesn't imply any synchronous write. If you have a filesystem which just wants to make sure the superblock is written after metadata updates, but doesn't really mean to push things out *now*, a non-flushing barrier makes complete sense. I found it odd that all the Linux documentation about barriers seems to imply a flush. Or is that just the conventional meaning of "barrier" in storage-world?

> I could imagine drain/flush combinations executed to achieve barriers, or a flush as a slightly relaxed alternative, e.g. for queues which preserve order.

Right.

> Otoh, Xen normally tries to be about interface idealization where appropriate. There is no point in feature-barrier==0 && feature-cache-flush==1 or vice versa, because a barrier would still have been realizable per drain/flush, and at least in the blkback case there'd be no additional cost in the backend if it just does that on behalf of the frontend. And it saves precious ring space.
>
> Last, I'm not clear about how the meaning of feature-barrier got formulated in the header. To the frontend, it is not a "feature" which "may succeed". Rather, if set, it is a strong recommendation for guests who aim to eventually remount their filesystems r/w after a reboot.
>
> Summary
> --------------------------------------------------------------------------
>
> I'm assuming most guest OS filesystem/block layer glue follows an ordering interface based on SCSI? Does anyone know if there are exceptions to that? I'd be really curious. [*]
>
> Assuming the former is correct, I think we absolutely want interface idealization.
>
> This leaves exactly 2.5 reasonable modes which frontends should prepare for:
>
> - feature-barrier == 0 -> QUEUE_ORDERED_NONE
>
>   Disk is either a) non-caching/non-reordering or b) broken.
>
> - feature-barrier == 1 -> QUEUE_ORDERED_TAG
>
>   Disk is reordering, quite possibly caching, but you won't need to know. You seriously want to issue ordered writes with BLKIF_OP_WRITE_BARRIER.

So does this mean that feature-barrier is actually not backwards compatible? If you use an old blkfront which doesn't know what to do with it, you end up with less reliable storage than before feature-barrier?

Perhaps the backend should keep writes synchronous until it sees a barrier coming from the guest, then it switches to caching/reordering/etc (and hopes the guest sends a barrier quickish).

> [
> - !xenbus_exists(feature-barrier) -> QUEUE_ORDERED_DRAIN (?)
>
>   This is either blktap1, or an outdated blkback (?). I think one would want to assume blktap1 (or parse otherend-id). With blkback there's probably not much you can do anyway. With blktap1 there's a chance you're doing the right thing.
> ]

See patch below (delta against previous one). Does that look OK?

> I'd suggest to ignore/phase-out the caching bit, because it's no use wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE.
>
> I'd suggest to fix the backends where people see fit. In blktap2, which appears urgent to me, this is a one-liner for now, setting QUEUE_ORDERED_DRAIN on the bdev. Unless we discover some day we want to implement barriers in tapdisk. Which is not going to happen in July.

OK. Is blkback OK as-is? And I don't care about blktap1, but I guess it's still the current product storage backend...

Thanks,
    J

From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Date: Wed, 28 Jul 2010 10:49:29 -0700
Subject: [PATCH] xen/blkfront: Use QUEUE_ORDERED_DRAIN for old backends

If there's no feature-barrier key in xenstore, then it means it's a fairly
old backend which does uncached in-order writes, which means ORDERED_DRAIN
is appropriate.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index eb28e1f..8cd5418 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -423,27 +423,22 @@ static int xlvbd_init_blk_queue(struct blkfront_info *info,
 static int xlvbd_barrier(struct blkfront_info *info)
 {
 	int err;
-	unsigned ordered = QUEUE_ORDERED_NONE;
+	const char *barrier;
 
-	/*
-	 * If we don't have barrier support, then there's really no
-	 * way to guarantee write ordering, so we really just have to
-	 * send writes to the backend and hope for the best.  If
-	 * barriers are supported, we don't really know what the
-	 * flushing semantics of the barrier are, so again, tag the
-	 * requests in order and hope for the best.
-	 */
-	if (info->feature_barrier)
-		ordered = QUEUE_ORDERED_TAG;
+	switch (info->feature_barrier) {
+	case QUEUE_ORDERED_DRAIN:	barrier = "enabled (drain)"; break;
+	case QUEUE_ORDERED_TAG:		barrier = "enabled (tag)"; break;
+	case QUEUE_ORDERED_NONE:	barrier = "disabled"; break;
+	default:			return -EINVAL;
+	}
 
-	err = blk_queue_ordered(info->rq, ordered, NULL);
+	err = blk_queue_ordered(info->rq, info->feature_barrier, NULL);
 
 	if (err)
 		return err;
 
 	printk(KERN_INFO "blkfront: %s: barriers %s\n",
-	       info->gd->disk_name,
-	       info->feature_barrier ? "enabled" : "disabled");
+	       info->gd->disk_name, barrier);
 	return 0;
 }
 
@@ -717,7 +712,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 				printk(KERN_WARNING "blkfront: %s: write barrier op failed\n",
 				       info->gd->disk_name);
 				error = -EOPNOTSUPP;
-				info->feature_barrier = 0;
+				info->feature_barrier = QUEUE_ORDERED_NONE;
 				xlvbd_barrier(info);
 			}
 			/* fall through */
@@ -1057,6 +1052,7 @@ static void blkfront_connect(struct blkfront_info *info)
 	unsigned long sector_size;
 	unsigned int binfo;
 	int err;
+	int barrier;
 
 	switch (info->connected) {
 	case BLKIF_STATE_CONNECTED:
@@ -1097,10 +1093,27 @@ static void blkfront_connect(struct blkfront_info *info)
 	}
 
 	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
-			    "feature-barrier", "%lu", &info->feature_barrier,
+			    "feature-barrier", "%lu", &barrier,
 			    NULL);
+
+	/*
+	 * If there's no "feature-barrier" defined, then it means
+	 * we're dealing with a very old backend which probably writes
+	 * in order with no cache; draining will do what needs to get
+	 * done.
+	 *
+	 * If there are barriers, then we can do full queued writes
+	 * with tagged barriers.
+	 *
+	 * If barriers are not supported, then there's not much we can
+	 * do, so just set ordering to NONE.
+	 */
 	if (err)
-		info->feature_barrier = 0;
+		info->feature_barrier = QUEUE_ORDERED_DRAIN;
+	else if (barrier)
+		info->feature_barrier = QUEUE_ORDERED_TAG;
+	else
+		info->feature_barrier = QUEUE_ORDERED_NONE;
 
 	err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
 	if (err) {
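The patch above consumes the key; the other half of the negotiation is the backend advertising it during its connect phase. A simplified sketch of that side is below, based on the behaviour described in this thread (start out advertising 1, clear it only after a barrier fails); the helper name and the abbreviated backend_info layout are stand-ins, while xenbus_printf() and xenbus_dev_fatal() are the real xenbus API.

#include <xen/xenbus.h>

/* Abbreviated for the sketch; the real backend_info has more fields. */
struct backend_info {
	struct xenbus_device *dev;
	/* ... */
};

static void connect_advertise_barrier(struct backend_info *be,
				      struct xenbus_transaction xbt)
{
	struct xenbus_device *dev = be->dev;
	int err;

	/* "feature-barrier" = 1 promises that BLKIF_OP_WRITE_BARRIER
	 * will do the right thing, so the frontend may queue and tag. */
	err = xenbus_printf(xbt, dev->nodename, "feature-barrier", "%d", 1);
	if (err)
		xenbus_dev_fatal(dev, err, "writing feature-barrier");
}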
Daniel Stodden
2010-Jul-28 21:52 UTC
[Xen-devel] Re: [PATCH RFC] xen/blkfront: use tagged queuing for barriers
Hey. Comments inline.

On Wed, 2010-07-28 at 14:03 -0400, Jeremy Fitzhardinge wrote:
> On 07/28/2010 04:06 AM, Daniel Stodden wrote:
> > Hey.
> >
> > I tried to figure out some of the barrier stuff. I think it's becoming more clear to me now. But take everything below with a grain of salt. Sorry for having written a book about it. It's just because I'm not sure if I'm always reasoning correctly.
>
> Thanks for the detailed response!
>
> > There's a summary at the bottom. Maybe skip over and just apply your patch, possibly with a more optimistic update to the comment, and with a third option in support of QUEUE_ORDERED_DRAIN.
> >
> > In between, it seems to make sense to differentiate between blkback and blktap1/2 to understand what our disk model traditionally was/is, because until now frontends don't differentiate much. Except for maybe that feature-barrier flag. And below I gather why I think we might want to do something about it.
> >
> > Blktap1
> > --------------------------------------------------------------------------
> >
> > A blktap1 backend ring is quite simple to explain. It thinks it's a cacheless disk with no ordering guarantees. In kernel queueing terms that's a QUEUE_ORDERED_DRAIN.
>
> Yes, if it has no cache then we don't need to worry about delayed writes, and draining should be enough.
>
> > - We pass requests directly on to tapdisk.
> >
> > - Tapdisk drivers will scatter and reorder at will, especially VHD.
> >
> > - The RSP message is driven by iocb completion(s).
> >
> > - It fully relies on AIO/O_DIRECT to make sure the data made it actually down to the platter by that time.
> >
> > I'm not aware of anything else left to a userspace app.
> >
> > Blktap2
> > --------------------------------------------------------------------------
> >
> > The blktap2 datapath was completely derived from blktap1.
> >
> > The difference is that it's now hooked into a bdev. Which is quite important, because image consistency is now in the hands of the block layer.
> >
> > It presently doesn't declare a barrier mode, which in my understanding evaluates to QUEUE_ORDERED_NONE.
> >
> > My understanding of QUEUE_ORDERED_NONE semantics is that the blk code and filesystems will happily fill the queue and assume in-order completion. This is the mode of choice with either a queue depth of 1 or no reordering. Or a non-flushable cache, at which point you know you're already screwed, so you don't need to worry.
> >
> > Blktap2 passes a frontend ring's worth of requests to userland, with the same semantics as blktap1: queue depth of 32, no serialization control. Which should actually be a QUEUE_ORDERED_DRAIN passed on to the block layer.
>
> So blktap2 doesn't pass barriers into userland? I guess userland doesn't really have a way to send barriers into the kernel except with f(data)sync or direct io.

Right. With buffered I/O the barrier is an f(data)sync. For direct-io, which is the most common case, we trust in the kernel.

I guess you're right regarding the buffered case, although it shouldn't really affect any present user, because we typically nowhere do that. There have been experiments though. There's probably also a potential use case where driver writers want to flush asynchronous I/O pending in some unspecified fancy filter module. (Thinking async replication...)

We'd have to set an enable bit (ioctl/sysfs) to selectively turn it on or off, because present taps won't cope well with unexpected cmds. That might be fine because there already is the DEVICE_CREATE ioctl. The problem is that ioctl args don't grow easily and we didn't tag versions (yet), so we might need something smarter.

Well, until then it's a QUEUE_ORDERED_DRAIN.

> > So I guess we better fix that.
> >
> > Blkback
> > --------------------------------------------------------------------------
> >
> > I had some trouble with this one. Blkback and -tap1 being somewhat the same kinda guy from the frontend perspective, I've always assumed the semantics should be (are) the same. Now I think that's quite wrong.
> >
> > Blkback is sitting on a kernel device queue, and that's a slightly different beast, because the blkdev interface works SCSI-style, which means it's completely revolving around SCSI barriers.
> >
> > On SCSI you issue a barrier into a request stream and ideally that directly maps to controller hardware, you're done.
> >
> > For ATA with NCQ that apparently doesn't apply, but you still issue a barrier op. It will cause a cache flush (if given) to complete foregoing requests, then a forced unit access (if given) to complete the barrier'd write itself. Or a post-flush, again.
> >
> > The point is you don't do that on every request separately, or performance somewhat suffers. :)
> >
> > Let's assume blkback on a bdev of queue depth >1 with barriers enabled. And that we wanted to guarantee responses implying physical completion. We'd have to do at least something like the following: queue any batch we can get from the frontend, set a barrier bit on the last one, then ack every req only after the last one completed. Because in between, bio completion doesn't mean much. That might even be an option, I haven't tried. One could probably also play tricks based on the _FLUSH bit. Anyway, I don't see the blkback code presently doing that.
>
> Couldn't blkback just send a barrier write down into the host's storage subsystem, and rely on it completing the bios in the right order?

Yes, and that's what I'm suggesting (further below). Sorry for being unclear.

> > From the blkfront perspective, that would still be a weird-looking QUEUE_ORDERED_DRAIN. Because everything queued appears to complete at once. It didn't, of course, so you still have to drain where you want to be safe.
> >
> > If we don't do that, then the frontend has to submit the barrier position.
>
> Not sure I follow this.

The above was rather a thought experiment answering the question of whether one would want to establish QUEUE_ORDERED_DRAIN semantics, which e.g. blktap1/2 implement, as a mode permissible to backends, and thereby required of frontends. Or even declare that as the single correct ordering mode for any backend. That's what we got in present xen-blkfront.

The reason is that, while I can only speak for XCP, the normal rule of thumb engraved in everybody's brain around here is 'frontends may/should assume that completion implies durability'. So that question seemed important.

The answer is that while it's somewhat possible, it doesn't seem like a particularly good idea. Not only because it's painful, but also because we'd be losing a couple of nice extensions I feel I could care about in the future. Such as loading more pending I/O into the backend without modifying the ring size. By adding a virtual writeback cache.

And the other way around, QUEUE_TAG seems easy to emulate. For blkback it's a no-brainer because, as you say, it matches the kernel modes.

> > I find this has some interesting implications. One is that the guest will effectively be utilizing a writeback cache, safely. The reason why I find this so exciting is because that cache can be potentially much larger than a ring, and it'd still be safely used by the guest. The other reason is I'm often made to wonder if blktap should get one.
> >
> > To a Linux/blkfront queue, this is a QUEUE_ORDERED_TAG.
> >
> > Blkif
> > --------------------------------------------------------------------------
> >
> > I'm not entirely clear about the reason for BLKIF_OP_FLUSH_DISKCACHE, given the existence of BLKIF_OP_WRITE_BARRIER. Maybe someone can enlighten me.
> >
> > The latter implies a flush, if one is necessary (otherwise it wouldn't be a barrier).
>
> Really? It's perfectly reasonable to define a barrier which prevents reordering but doesn't imply any synchronous write. If you have a filesystem which just wants to make sure the superblock is written after metadata updates, but doesn't really mean to push things out *now*, a non-flushing barrier makes complete sense.

If your average disk is caching, there is no non-flushing barrier. You cannot order a metadata write *after* any foregoing one if you submit it to a non-empty disk cache only with a post-flush, because the cache won't preserve order.

If you have a modern SCSI disk which supports an ordered write, the good news is that you can leave the problem to the hardware and keep queuing without explicit draining or flushing. I think I understood that correctly.

There's nothing in the kernel suggesting a BIO_RW_BARRIER with a bit dropping the durability requirement. So it must be a full flush. You're right that a non-flushing one might make sense to FSes, but then again that requires caches to track order. Or NVRAM right away.

Caching disks without hardware barriers in the above sense need to get flushed explicitly on the host before submitting the request. That's probably why my desktop's ATA drive has all those drain and flush bits set on its queue.

With barrier bits in the software queue that's fairly transparent to an FS which doesn't want to care. But I think it will want to. I'm not sure what the filesystem does in every detail, but with data=ordered the way to deal with it would be to gather as many data writes in the queue as reasonable before following up with a bunch of metadata updates.

So you gather and merge threads at the filesystem level to optimize the queue. This queue, which I only started studying very lately, right now looks very single-threaded to me. There's no partial ordering anywhere on the way to the hardware, although having that sounds like a neat idea to me. Again, I might be missing some bits, so I probably shouldn't even comment on that issue.

> I found it odd that all the Linux documentation about barriers seems to imply a flush. Or is that just the conventional meaning of "barrier" in storage-world?

Barrier in barrier.txt means a full barrier in every respect: no prior request may commit later, no later request may commit earlier. Plus (!) a barrier write completion is durable.

So you're right. Maybe also in wondering if a weaker model wouldn't be more elegant, but I'd expect it not to be done because it's too expensive in hardware. NVRAM is simpler because it works well in a legacy context. So you can call that fast and cacheless, and everything stays as is and just gets faster. And flash is going to be everywhere, anyway.

What I would personally find interesting would be partial ordering, to multi-thread the queue down to the controller. Because it appears to me that's what SATA with NCQ does best. You have 32 slots. A normal request may cache, or has the FUA bit set, so it's writing through. In either case, requests complete out-of-order.

Now imagine you have a whole lot of threads in your filesystem, performing independent updates. I guess that's the normal case. You order data, so in phase 1 you fill the disk cache with user data. Then flush. Once. Then you follow up with 32 metadata updates, all FUA, now simultaneously.

I think in the kernel that's presently not possible, because there's no partial ordering. One would probably want to add an I/O context allocated by each thread, as a key to bio submission, which then tracks what parent request the ordered ones relate to. Then aggressively merge those, bundling independent barrier writes at the tip of the queue.

I might be missing a couple of tiny details. Such as journaling and what not. :P

> > I could imagine drain/flush combinations executed to achieve barriers, or a flush as a slightly relaxed alternative, e.g. for queues which preserve order.
>
> Right.
>
> > Otoh, Xen normally tries to be about interface idealization where appropriate. There is no point in feature-barrier==0 && feature-cache-flush==1 or vice versa, because a barrier would still have been realizable per drain/flush, and at least in the blkback case there'd be no additional cost in the backend if it just does that on behalf of the frontend. And it saves precious ring space.
> >
> > Last, I'm not clear about how the meaning of feature-barrier got formulated in the header. To the frontend, it is not a "feature" which "may succeed". Rather, if set, it is a strong recommendation for guests who aim to eventually remount their filesystems r/w after a reboot.
> >
> > Summary
> > --------------------------------------------------------------------------
> >
> > I'm assuming most guest OS filesystem/block layer glue follows an ordering interface based on SCSI? Does anyone know if there are exceptions to that? I'd be really curious. [*]
> >
> > Assuming the former is correct, I think we absolutely want interface idealization.
> >
> > This leaves exactly 2.5 reasonable modes which frontends should prepare for:
> >
> > - feature-barrier == 0 -> QUEUE_ORDERED_NONE
> >
> >   Disk is either a) non-caching/non-reordering or b) broken.
> >
> > - feature-barrier == 1 -> QUEUE_ORDERED_TAG
> >
> >   Disk is reordering, quite possibly caching, but you won't need to know. You seriously want to issue ordered writes with BLKIF_OP_WRITE_BARRIER.
>
> So does this mean that feature-barrier is actually not backwards compatible? If you use an old blkfront which doesn't know what to do with it, you end up with less reliable storage than before feature-barrier?

The previous model in blkfront is the one you just replaced, right?

	err = blk_queue_ordered(info->rq,
				info->feature_barrier ? QUEUE_ORDERED_DRAIN : QUEUE_ORDERED_NONE,
				NULL);

Blkback writes feature-barrier=1 because the disk needs it, and the frontend is moving to NONE, i.e. assumes non-reordering. That's wrong and dangerous?

Blkback writes feature-barrier=0 and the disk moves to DRAIN. That's not wrong. It might even help avoiding the worst, because it adds latency. But otherwise it's a waste of time.

Blk*tap*1 writes no feature-barrier at all, iirc, and the frontend stays at NONE. That sounds wrong again.

That's already fairly broken.

We could fix blktap1 by making it write the bit complemented (i.e. 1). We cannot entirely fix blkback with barriers enabled, because there's nothing generating barriers in there.

But writing the bit inverted might help increasing the probability of getting away with it.

We could update blkif.h accordingly:

"""
#define BLKIF_OP_BARRIER ..

When feature-barrier is set, then barrier support is only a feature, and it's almost guaranteed to fail. If feature-barrier is zero, then barrier support is not just a feature but strictly required, assuming you care for your disk. If feature-barrier isn't set at all, we strongly recommend resorting to a queue drain instead, to improve the probability of getting your writes to order correctly.
"""

Then again, I guess pvops cares mostly for the blkback case where feature-barrier is normally set, because there's not many disks left with queue depth 1 or no reordering (?). We won't get this fixed, but the DRAIN might be better than nothing (?). Altogether not sure about this one.

I should also check with Paul what the XenServer PV drivers are assuming. [hereby CCd]

> Perhaps the backend should keep writes synchronous until it sees a barrier coming from the guest, then it switches to caching/reordering/etc (and hopes the guest sends a barrier quickish).

That won't help. The synchronous mode just means request completion and release, but not durability. Or do you mean flagging everything to barrier writes?

That sounds like a good correctness-preserving measure. I'm just kinda worried about rogue guests, a gratuitous barrier doesn't come cheap, does it?

> > [
> > - !xenbus_exists(feature-barrier) -> QUEUE_ORDERED_DRAIN (?)
> >
> >   This is either blktap1, or an outdated blkback (?). I think one would want to assume blktap1 (or parse otherend-id). With blkback there's probably not much you can do anyway. With blktap1 there's a chance you're doing the right thing.
> > ]
>
> See patch below (delta against previous one). Does that look OK?

Yes, this patch looks good to me.

Do you think it's worth asking Jens or someone else who's really good with this stuff about the reasoning so far?

I also have no idea if we're facing a regrettable performance difference. :}

> > I'd suggest to ignore/phase-out the caching bit, because it's no use wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE.
> >
> > I'd suggest to fix the backends where people see fit. In blktap2, which appears urgent to me, this is a one-liner for now, setting QUEUE_ORDERED_DRAIN on the bdev. Unless we discover some day we want to implement barriers in tapdisk. Which is not going to happen in July.
>
> OK. Is blkback OK as-is? And I don't care about blktap1, but I guess it's still the current product storage backend...

The blkback side is still beyond my understanding.

Blkback has only a few simple duties.

 * A q->ordered must always be reflected in feature-barrier.

   We start out with 1, without testing, then clear after failing.

   That looks okay. Should it rather test first? Because I'm pretty sure even those spurious request attempts generate alerting noise in the logs.

   Also, it seems the ordering mode may change (stuff like hdparm -W being called, maybe?). I should check if that needs improvement then.

   If that's the case, then the frontend should be watching. It could even be the case that this is too simplistic if you're switching to ordered > 0 while I/O is in-flight. As you pointed out below, that would have to make all requests synchronous until the frontend starts to ack that state change with some barrier request. I'm wondering about the shared performance impact then, in case that doesn't happen.

 * A barrier write on the ring should translate to a barrier write on the bio layer.

   I see it generating an empty bio. I don't see it setting a BIO_RW_BARRIER.

   I see nothing in blk-core which suggests to me that this translates to a barrier, but there is always hope I'm just mistaken. Help?

   I think a BLKIF_OP_WRITE_BARRIER should be allowed to carry data, even encouraged to do so, because ring space is precious and I think Linux would otherwise have to insert barriers before and after a kernel barrier write to stay simple. Right now it's apparently doing the right thing. Only that this path has never been taken.

   It should not be required to carry data.

   Should we phase out OP_FLUSH_DISKCACHE? Did anything actually ever implement it?

So I think just doing a quiet blkback update where necessary would stay compatible.

Thanks!
Daniel

> Thanks,
>     J
>
> From: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
> Date: Wed, 28 Jul 2010 10:49:29 -0700
> Subject: [PATCH] xen/blkfront: Use QUEUE_ORDERED_DRAIN for old backends
>
> If there's no feature-barrier key in xenstore, then it means it's a fairly old backend which does uncached in-order writes, which means ORDERED_DRAIN is appropriate.
>
> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index eb28e1f..8cd5418 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -423,27 +423,22 @@ static int xlvbd_init_blk_queue(struct blkfront_info *info,
>  static int xlvbd_barrier(struct blkfront_info *info)
>  {
>  	int err;
> -	unsigned ordered = QUEUE_ORDERED_NONE;
> +	const char *barrier;
>
> -	/*
> -	 * If we don't have barrier support, then there's really no
> -	 * way to guarantee write ordering, so we really just have to
> -	 * send writes to the backend and hope for the best.  If
> -	 * barriers are supported, we don't really know what the
> -	 * flushing semantics of the barrier are, so again, tag the
> -	 * requests in order and hope for the best.
> -	 */
> -	if (info->feature_barrier)
> -		ordered = QUEUE_ORDERED_TAG;
> +	switch (info->feature_barrier) {
> +	case QUEUE_ORDERED_DRAIN:	barrier = "enabled (drain)"; break;
> +	case QUEUE_ORDERED_TAG:		barrier = "enabled (tag)"; break;
> +	case QUEUE_ORDERED_NONE:	barrier = "disabled"; break;
> +	default:			return -EINVAL;
> +	}
>
> -	err = blk_queue_ordered(info->rq, ordered, NULL);
> +	err = blk_queue_ordered(info->rq, info->feature_barrier, NULL);
>
>  	if (err)
>  		return err;
>
>  	printk(KERN_INFO "blkfront: %s: barriers %s\n",
> -	       info->gd->disk_name,
> -	       info->feature_barrier ? "enabled" : "disabled");
> +	       info->gd->disk_name, barrier);
>  	return 0;
>  }
>
> @@ -717,7 +712,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>  				printk(KERN_WARNING "blkfront: %s: write barrier op failed\n",
>  				       info->gd->disk_name);
>  				error = -EOPNOTSUPP;
> -				info->feature_barrier = 0;
> +				info->feature_barrier = QUEUE_ORDERED_NONE;
>  				xlvbd_barrier(info);
>  			}
>  			/* fall through */
> @@ -1057,6 +1052,7 @@ static void blkfront_connect(struct blkfront_info *info)
>  	unsigned long sector_size;
>  	unsigned int binfo;
>  	int err;
> +	int barrier;
>
>  	switch (info->connected) {
>  	case BLKIF_STATE_CONNECTED:
> @@ -1097,10 +1093,27 @@ static void blkfront_connect(struct blkfront_info *info)
>  	}
>
>  	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> -			    "feature-barrier", "%lu", &info->feature_barrier,
> +			    "feature-barrier", "%lu", &barrier,
> 			    NULL);
> +
> +	/*
> +	 * If there's no "feature-barrier" defined, then it means
> +	 * we're dealing with a very old backend which probably writes
> +	 * in order with no cache; draining will do what needs to get
> +	 * done.
> +	 *
> +	 * If there are barriers, then we can do full queued writes
> +	 * with tagged barriers.
> +	 *
> +	 * If barriers are not supported, then there's not much we can
> +	 * do, so just set ordering to NONE.
> +	 */
>  	if (err)
> -		info->feature_barrier = 0;
> +		info->feature_barrier = QUEUE_ORDERED_DRAIN;
> +	else if (barrier)
> +		info->feature_barrier = QUEUE_ORDERED_TAG;
> +	else
> +		info->feature_barrier = QUEUE_ORDERED_NONE;
>
>  	err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
>  	if (err) {
Jeremy Fitzhardinge
2010-Jul-28 23:12 UTC
[Xen-devel] Re: [PATCH RFC] xen/blkfront: use tagged queuing for barriers
On 07/28/2010 02:52 PM, Daniel Stodden wrote:> If your average disk is caching, there is no non-flushing barrier. You > cannot order a metadata write *after* any foregoing one if you submit it > to a non-empty disk cache only with a post-flush, because the cache > won''t preserve order. > > If you have a modern SCSI disk which supports an ordered write, good > news is that you can leave the problem to the hardware and keep queuing > without explicit draining or flushing. I think I understood that > correctly.Same with NCQ for SATA? Or is that something else?> There''s noting in the kernel suggesting a BIO_RW_BARRIER with a bit > dropping the durability requirement. So it must be a full flush. You''re > right that a non-flushing one might make sense to FSes, but then again > that requires caches to track order. Or NVRAM right away.Yes. Given that the drive manufacturers seem keen on putting more smarts into the drive (deeper queues to allow drives to schedule, object storage, etc), then tracking write ordering in their delayed write buffers shouldn''t be too complex.> Caching disks without hardware barriers in the above sense need to get > flushed explicitly on the host before submitting the request. That''s > probably why my desktop''s ATA drive has all those drain and flush bits > set on its queue. > > With barrier bits in the software queue that''s fairly transparent to an > FS which doesn''t want to care. But I think it will want to. I''m not sure > what the filesystem does in every detail, but with data=ordered the way > to deal with it would be to gather as much data writes as reasonable > queue before following up with a bunch of metadata updates. > > So you gather and merge threads at the filesystem level to optimize the > queue. This queue, I only started studying it very lately, right now > looks very single-threaded to me. There''s no partial ordering anywhere > on the way to the hardware, although having that it sounds like a neat > idea to me. Again, I might be missing some bits, so I probably shouldn''t > even comment on that issue. > >> I found it odd that all the >> Linux documentation about barriers seem to imply a flush. Or is that >> just the conventional meaning of "barrier" in storage-world? > Barrier in barrier.txt means a full barrier in every respect: No prior > request may commit later, no later request may commit earlier. Plus (!) > a barrier write completion is durable. > > So you''re right. Maybe also in wondering if a weaker model wouldn''t be > more elegant, but I''d expect it not to be done because it''s too > expensive in hardware. NVRAM is simpler because it works well in a > legacy context. So you can call that fast and cacheless, and everything > stays as is and just gets faster. And flash is going to be everywhere, > anyway.Flash is still dependent on being able to reorder writes, particularly to make sure they get grouped in whatever way is good for the flash controller. But a weak, non-flushing barrier can be implemented as a flushing barrier if the hardware finds it convenient (or at any level); just because you can''t implement on all devices/paths doesn''t mean you can''t specify it conceptually.> What I would personally find interesting would be partial ordering, to > multi-thread the queue down to the controller.. Because it appears to me > that''s what SATA with NCQ does best. You have 32 slots. A normal request > may cache, or has the FUA bit set, so it''s writing through. In either > case, requests complete out-of-order. 
> > Now imagine you have a whole lot of threads in your filesystem, > performing independent updates. I guess that''s the normal case. You > order data, so in phase 1 you fill the disk cache with user data. Then > flush. Once. Then you follow up with 32 metadata updates, all FUA, now > simultaneously. > > I think in the kernel that''s presently not possible, because there''s no > partial ordering. One would probably want to add an I/O context > allocated by each thread, as a key to bio submission, which then tracks > what parent request the ordered ones related to. Then aggressively merge > those, bundling independent barrier writes at the tip of the queue. > > I might be missing a couple tiny details. Such as journaling and what > not. :PThat might make sense for a very tree-oriented filesystem like btrfs, though ultimately you need a total ordering to make sure the overall filesystem is in a consistent state.>>> Summary >>> -------------------------------------------------------------------------- >>> >>> I''m assuming most guest OS filesystem/block layer glue follows an >>> ordering interface based on SCSI? Does anyone know if there are >>> exceptions to that? I''d be really curious. [*] >>> >>> Assuming the former is correct, I think we absolutely want interface >>> idealization. >>> >>> This leaves exactly 2.5 reasonable modes which frontends should prepare >>> for: >>> >>> - feature-barrier == 0 -> QUEUE_ORDERED_NONE >>> >>> Disk is either a) non-caching/non-reordering or b) roken. >>> >>> - feature-barrier == 1 -> QUEUE_ORDERED_TAG >>> >>> Disk is reordering, quite possibly caching, but you won''t need to >>> know. You seriously want to issue ordered writes with >>> BLKIF_OP_WRITE_BARRIER. >> So does this mean that feature-barrier is actually not backwards >> compatible? If you use an old blkfront which doesn''t know what to do >> with it, you end up with less reliable storage than before feature-barrier? > The previous model in blkfront is the one you just replaced, right? > > err = blk_queue_ordered(info->rq, > info->feature_barrier ? QUEUE_ORDERED_DRAIN : QUEUE_ORDERED_NONE, > NULL); > > Blkback writes feature-barrier=1 because the disk needs it, and the > frontend is moving to NONE, i.e. assumes non-reordering. That''s wrong > and dangerous?No, I mean an *old* blkfront which doesn''t know about feature-barrier *at all*. It never looks at it, and never sends barriers. If blkback sets feature-barrier=1 and the blkfront ignores it, then you end up with no ordering or durability guarantees at all. Whereas if feature-barrier=0 or the blkback is pre-barrier, then the writes are synchronous and in-order. In other words, at the moment, its actually mandatory for a backend to implement barriers if feature-barrier=1.> Blkback writes feature-barrier=0 and the disk moves to DRAIN. That''s not > wrong. It might even help avoiding the worst, because it adds latency. > But otherwise it''s a waste of time. > > Blk*tap*1 writes no feature-barrier at all, iirc, and the frontend stays > at NONE. That sounds wrong again. > > That''s already fairly broken.Well, if we assume that blktap1 with no feature-barrier is doing synchronous writes (not necessarily ordered) then the patch to make it use DRAIN should be fine. If they''re ordered then NONE is better.> We could fix blktap1 by making it write the bit complemented (i.e. 1). > We cannot entirely fix blkback with barriers enabled, because there''s > nothing generating barriers in there. 
> > But writing the bit inverted might help increasing the probability of > getting away with it. > > We could update blkif.h accordingly: > > """ > #define BLKIF_OP_BARRIER .. > > When feature-barrier is set, then barrier support is only a feature, and > it''s almost guaranteed to fail. If feature-barrier is zero, then barrier > support is not just a feature but strictly required, assuming you care > for your disk. If feature-barrier isn''t set at all, we strongly > recommend resorting to a queue drain instead, to improve the probability > of getting your writes to order correctly. > """That just seems gratuitously confusing.> Then again, I guess pvops cares mostly for the blkback case where > feature barrier is normally set, because there''s not many disks left > with queue depth 1 or no reordering (?). We won''t get this fixed, but > the DRAIN might be better than nothing (?) Altogether not sure about > this one.I don''t think it has very much to do with the underlying disks. If you submit a bunch of IOs to the blk subsystem, then it will reorder them in the ioscheduler anyway, won''t it?> I should also check with Paul what the XenServer PV drivers are > assuming. [hereby CCd] > >> Perhaps the backend should keep writes synchronous until it sees a >> barrier coming from the guest, then it switches to >> caching/reordering/etc (and hope the guest sends a barrier quickish). > That won''t help. The synchronous mode just means request completion and > release, but not durability. Or do you mean flagging everything to > barrier writes?I mean that until the backend sees a barrier from the frontend, it does whatever it can to make sure that writes completed in order, and durable when they''re complete. So I guess that means sticking a barrier on every one.> That sounds like a good correctness-preserving measure. I''m just kinda > worried about rogue guests, a gratuitous barrier doesn''t come cheap, > does it?Rogue guest in what sense? I guess the question is how long does it take before a guest typically sends a write barrier down. I wonder if it requires using a barrier capable and enabled filesystem?>>> [ >>> - !xenbus_exists(feature-barrier) -> QUEUE_ORDERED_DRAIN (?) >>> >>> This is either blktap1, or an outdated blkback (?) I think one would >>> want to assume blktap1 (or parse otherend-id). With Blkback there''s >>> probably not much you can do anyway. With blktap1 there''s a chance >>> you''re doing the right thing. >>> ] >> See patch below (delta against previous one). Does that look OK? > Yes, this patch looks good to me. > > Do you think it''s worth asking Jens or someone else who''s really good > with this stuff about the reasoning so far?Not sure. I''m going to push these blkfront patches to Jens shortly; perhaps he''ll review them.> I also have no idea if we''re facing a regrettable performance > difference. :}TAG should always be an improvement over DRAIN, no?>>> I''d suggest to ignore/phase-out the caching bit, because it''s no use >>> wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE. >>> >>> I''d suggest to fix the backends where people see fit. In blktap2, >>> which appears urgent to me, this is a one liner for now, setting >>> QUEUE_ORDERED_DRAIN on the bdev. Unless we discover some day we want >>> to implement barriers in tapdisk. Which is not going to happen in >>> July. >> OK. Is blkback OK as-is? And I don''t care about blktap1, but I guess >> its still the current product storage backend... > The blkback side is still beyond my understanding. 
>
> Blkback has only a few simple duties.
>
>   * A q->ordered must always be reflected in feature-barrier.
>
>     We start out with 1, without testing, then clear after failing.
>
>     That looks okay. Should it rather test first? Because I''m
>     pretty sure even those spurious request attempts generate alerting
>     noise in the logs.
>
>     Also, judging from the code it seems the ordering mode
>     may change (stuff like hdparm -W called maybe?).

I think the choice of IO scheduler will have a bigger effect.

>     I should check if that needs improvement then.
>
>     If that''s the case, then the frontend should be
>     watching. Could even be the case that this is too simplistic
>     if you''re switching to ordered > 0 while I/O is in-flight.

I don''t think that should matter. If the backend understands barriers,
then the frontend should always be able to use barrier requests and
QUEUE_ORDERED_TAG internally. It should be up to the backend to make the
barriers work correctly, regardless of the underlying IO stack.

In other words, I''m not sure there''s any point in advertising
feature-barrier=0 unless the backend literally does not understand
BLKIF_OP_WRITE_BARRIER (in which case, it shouldn''t even mention
feature-barrier either). And conversely feature-barrier=1 doesn''t mean
"the disks have native barrier support", but "I will make
BLKIF_OP_WRITE_BARRIER do the right thing, so go ahead and use it to
your heart''s desire".

Perhaps this would be/have been better:

  * feature-barrier missing - backend doesn''t understand
    BLKIF_OP_WRITE_BARRIER at all
  * feature-barrier=0 - backend will complete all writes in order, and
    they''ll be durable once complete
  * feature-barrier=1 - backend can reorder writes at will, but
    they''ll be durable once complete. BLKIF_OP_WRITE_BARRIER required
    to guarantee any specific ordering
  * blkfront writes 1 to feature-barrier when it wants to use
    barriers. The first use of BLKIF_OP_WRITE_BARRIER also has the
    effect of writing 1 to feature-barrier

But it''s probably too late to retrofit now. But we can make the frontend
issue a "gratuitous" barrier early on to make sure the backend knows
what to expect later.

> As you pointed out below, that would have to make all requests
> synchronous until the frontend starts
> to ack that state change with some barrier request. I''m wondering
> about the shared performance impact then in case that doesn''t happen.

Probably not good. The IO path probably gets a lot of win out of being
given the freedom to reorder IOs. It also raises the question of guests
just deliberately barriering everything to cause a seek storm DoS (I
guess what you''re alluding to by "rogue guest" above). We need to look
at all the new(ish) IO controller stuff for managing QoS, etc.

>   * A barrier write should translate to a barrier write.
>
>     I see it generating an empty bio. I don''t see it setting a
>     BIO_RW_BARRIER.
>
>     I see nothing in blk-core which suggests to me that this translates
>     to a barrier, but there is always hope I''m just mistaken. Help?

In dispatch_rw_block_io():

	switch (req->operation) {
	case BLKIF_OP_READ:
		operation = READ;
		break;
	case BLKIF_OP_WRITE:
		operation = WRITE;
		break;
	case BLKIF_OP_WRITE_BARRIER:
		operation = WRITE_BARRIER;
		break;
	default:
		operation = 0; /* make gcc happy */
		BUG();
	}

where WRITE_BARRIER is

	#define WRITE_BARRIER	(WRITE | (1 << BIO_RW_BARRIER))

> I think a BLKIF_OP_WRITE_BARRIER should be allowed to carry
> data, even encouraged to do so, because ring space is precious
> and I think Linux would otherwise have to insert barriers before
> and after a kernel barrier write to stay simple. Right now
> it''s apparently doing the right thing. Only that this path has
> never been taken.
>
> It should not be required to carry data.

Seems reasonable. It isn''t clear to me whether blkfront will generate
WRITE_BARRIER with a payload, but it looks like it might already.

> Should we phase out OP_FLUSH_DISKCACHE? Did anything actually
> ever implement it?

Is that relatively new? It isn''t present in the blkif.h in the kernel,
which is probably a snapshot from a while ago.

> So I think just doing a quiet blkback update where necessary would stay
> compatible.

Got a patch?

Thanks,
	J

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
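For reference, the frontend-side mapping discussed in this exchange
(feature-barrier missing -> QUEUE_ORDERED_DRAIN, 0 -> QUEUE_ORDERED_NONE,
1 -> QUEUE_ORDERED_TAG) comes down to roughly the sketch below, written
against the 2.6.3x block API assumed throughout the thread. The helper
name xlvbd_pick_ordered() and the direct xenbus_gather() call are
illustrative, not a quote of the actual blkfront patch.

	/*
	 * Sketch only: derive the barrier mode from the backend
	 * feature-barrier key. Context would be blkfront_connect() or
	 * xlvbd_init_blk_queue() in drivers/block/xen-blkfront.c.
	 */
	static int xlvbd_pick_ordered(struct blkfront_info *info)
	{
		unsigned long barrier;
		unsigned int ordered;
		int err;

		err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
				    "feature-barrier", "%lu", &barrier, NULL);
		if (err)
			/* no key: old blkback or blktap1, drain to stay safe */
			ordered = QUEUE_ORDERED_DRAIN;
		else if (barrier)
			/* backend honours BLKIF_OP_WRITE_BARRIER: tag ordering */
			ordered = QUEUE_ORDERED_TAG;
		else
			/* backend claims in-order, durable completion */
			ordered = QUEUE_ORDERED_NONE;

		return blk_queue_ordered(info->rq, ordered, NULL);
	}

Whether a missing key should really mean a drain is exactly what the
thread is debating; only the three-way split itself is the point here.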
Daniel Stodden
2010-Jul-29 19:31 UTC
[Xen-devel] Re: [PATCH RFC] xen/blkfront: use tagged queuing for barriers
On Wed, 2010-07-28 at 19:12 -0400, Jeremy Fitzhardinge wrote:> On 07/28/2010 02:52 PM, Daniel Stodden wrote: > > If your average disk is caching, there is no non-flushing barrier. You > > cannot order a metadata write *after* any foregoing one if you submit it > > to a non-empty disk cache only with a post-flush, because the cache > > won''t preserve order. > > > > If you have a modern SCSI disk which supports an ordered write, good > > news is that you can leave the problem to the hardware and keep queuing > > without explicit draining or flushing. I think I understood that > > correctly. > > Same with NCQ for SATA? Or is that something else?Different, explained somewhere above/below. These are heavily reordering but don''t support barriers natively. Semantics somewhat similar to blktap, if it had a cache. The typical AHCI controller takes a queue drain, followed by a cache flush, followed by a single FUA write for the request itself.> > There''s noting in the kernel suggesting a BIO_RW_BARRIER with a bit > > dropping the durability requirement. So it must be a full flush. You''re > > right that a non-flushing one might make sense to FSes, but then again > > that requires caches to track order. Or NVRAM right away. > > Yes. Given that the drive manufacturers seem keen on putting more > smarts into the drive (deeper queues to allow drives to schedule, object > storage, etc), then tracking write ordering in their delayed write > buffers shouldn''t be too complex. > > > Caching disks without hardware barriers in the above sense need to get > > flushed explicitly on the host before submitting the request. That''s > > probably why my desktop''s ATA drive has all those drain and flush bits > > set on its queue. > > > > With barrier bits in the software queue that''s fairly transparent to an > > FS which doesn''t want to care. But I think it will want to. I''m not sure > > what the filesystem does in every detail, but with data=ordered the way > > to deal with it would be to gather as much data writes as reasonable > > queue before following up with a bunch of metadata updates. > > > > So you gather and merge threads at the filesystem level to optimize the > > queue. This queue, I only started studying it very lately, right now > > looks very single-threaded to me. There''s no partial ordering anywhere > > on the way to the hardware, although having that it sounds like a neat > > idea to me. Again, I might be missing some bits, so I probably shouldn''t > > even comment on that issue. > > > >> I found it odd that all the > >> Linux documentation about barriers seem to imply a flush. Or is that > >> just the conventional meaning of "barrier" in storage-world? > > Barrier in barrier.txt means a full barrier in every respect: No prior > > request may commit later, no later request may commit earlier. Plus (!) > > a barrier write completion is durable. > > > > So you''re right. Maybe also in wondering if a weaker model wouldn''t be > > more elegant, but I''d expect it not to be done because it''s too > > expensive in hardware. NVRAM is simpler because it works well in a > > legacy context. So you can call that fast and cacheless, and everything > > stays as is and just gets faster. And flash is going to be everywhere, > > anyway. > > Flash is still dependent on being able to reorder writes, particularly > to make sure they get grouped in whatever way is good for the flash > controller. 
> > But a weak, non-flushing barrier can be implemented as a flushing > barrier if the hardware finds it convenient (or at any level); just > because you can''t implement on all devices/paths doesn''t mean you can''t > specify it conceptually. > > > What I would personally find interesting would be partial ordering, to > > multi-thread the queue down to the controller.. Because it appears to me > > that''s what SATA with NCQ does best. You have 32 slots. A normal request > > may cache, or has the FUA bit set, so it''s writing through. In either > > case, requests complete out-of-order. > > > > Now imagine you have a whole lot of threads in your filesystem, > > performing independent updates. I guess that''s the normal case. You > > order data, so in phase 1 you fill the disk cache with user data. Then > > flush. Once. Then you follow up with 32 metadata updates, all FUA, now > > simultaneously. > > > > I think in the kernel that''s presently not possible, because there''s no > > partial ordering. One would probably want to add an I/O context > > allocated by each thread, as a key to bio submission, which then tracks > > what parent request the ordered ones related to. Then aggressively merge > > those, bundling independent barrier writes at the tip of the queue. > > > > I might be missing a couple tiny details. Such as journaling and what > > not. :P > > That might make sense for a very tree-oriented filesystem like btrfs, > though ultimately you need a total ordering to make sure the overall > filesystem is in a consistent state. > > >>> Summary > >>> -------------------------------------------------------------------------- > >>> > >>> I''m assuming most guest OS filesystem/block layer glue follows an > >>> ordering interface based on SCSI? Does anyone know if there are > >>> exceptions to that? I''d be really curious. [*] > >>> > >>> Assuming the former is correct, I think we absolutely want interface > >>> idealization. > >>> > >>> This leaves exactly 2.5 reasonable modes which frontends should prepare > >>> for: > >>> > >>> - feature-barrier == 0 -> QUEUE_ORDERED_NONE > >>> > >>> Disk is either a) non-caching/non-reordering or b) roken. > >>> > >>> - feature-barrier == 1 -> QUEUE_ORDERED_TAG > >>> > >>> Disk is reordering, quite possibly caching, but you won''t need to > >>> know. You seriously want to issue ordered writes with > >>> BLKIF_OP_WRITE_BARRIER. > >> So does this mean that feature-barrier is actually not backwards > >> compatible? If you use an old blkfront which doesn''t know what to do > >> with it, you end up with less reliable storage than before feature-barrier? > > The previous model in blkfront is the one you just replaced, right? > > > > err = blk_queue_ordered(info->rq, > > info->feature_barrier ? QUEUE_ORDERED_DRAIN : QUEUE_ORDERED_NONE, > > NULL); > > > > Blkback writes feature-barrier=1 because the disk needs it, and the > > frontend is moving to NONE, i.e. assumes non-reordering. That''s wrong > > and dangerous? > > No, I mean an *old* blkfront which doesn''t know about feature-barrier > *at all*. It never looks at it, and never sends barriers. If blkback > sets feature-barrier=1 and the blkfront ignores it, then you end up with > no ordering or durability guarantees at all. Whereas if > feature-barrier=0 or the blkback is pre-barrier, then the writes are > synchronous and in-order. > > In other words, at the moment, its actually mandatory for a backend to > implement barriers if feature-barrier=1. 
> > > Blkback writes feature-barrier=0 and the disk moves to DRAIN. That''s not > > wrong. It might even help avoiding the worst, because it adds latency. > > But otherwise it''s a waste of time. > > > > Blk*tap*1 writes no feature-barrier at all, iirc, and the frontend stays > > at NONE. That sounds wrong again. > > > > That''s already fairly broken. > > Well, if we assume that blktap1 with no feature-barrier is doing > synchronous writes (not necessarily ordered) then the patch to make it > use DRAIN should be fine. If they''re ordered then NONE is better.Yes, with your patch. I was referring to the old code. Wasn''t the logic above inverted? If *no* feature-barrier is set, the above resulted in NONE. And if the backend sets it to 1, the drain doesn''t buy you much. I really agree we will not want to be *that* counterintuitive to fix that. After all we got away with it.> > We could fix blktap1 by making it write the bit complemented (i.e. 1). > > We cannot entirely fix blkback with barriers enabled, because there''s > > nothing generating barriers in there. > > > > But writing the bit inverted might help increasing the probability of > > getting away with it. > > > > We could update blkif.h accordingly: > > > > """ > > #define BLKIF_OP_BARRIER .. > > > > When feature-barrier is set, then barrier support is only a feature, and > > it''s almost guaranteed to fail. If feature-barrier is zero, then barrier > > support is not just a feature but strictly required, assuming you care > > for your disk. If feature-barrier isn''t set at all, we strongly > > recommend resorting to a queue drain instead, to improve the probability > > of getting your writes to order correctly. > > """ > > That just seems gratuitously confusing.I wasn''t really serious about this. Just depicting the resulting insanity. :)> > Then again, I guess pvops cares mostly for the blkback case where > > feature barrier is normally set, because there''s not many disks left > > with queue depth 1 or no reordering (?). We won''t get this fixed, but > > the DRAIN might be better than nothing (?) Altogether not sure about > > this one. > > I don''t think it has very much to do with the underlying disks. If you > submit a bunch of IOs to the blk subsystem, then it will reorder them in > the ioscheduler anyway, won''t it? > > > I should also check with Paul what the XenServer PV drivers are > > assuming. [hereby CCd] > > > >> Perhaps the backend should keep writes synchronous until it sees a > >> barrier coming from the guest, then it switches to > >> caching/reordering/etc (and hope the guest sends a barrier quickish). > > That won''t help. The synchronous mode just means request completion and > > release, but not durability. Or do you mean flagging everything to > > barrier writes? > > I mean that until the backend sees a barrier from the frontend, it does > whatever it can to make sure that writes completed in order, and durable > when they''re complete. So I guess that means sticking a barrier on > every one. > > > That sounds like a good correctness-preserving measure. I''m just kinda > > worried about rogue guests, a gratuitous barrier doesn''t come cheap, > > does it? > > Rogue guest in what sense? I guess the question is how long does it > take before a guest typically sends a write barrier down. I wonder if > it requires using a barrier capable and enabled filesystem? > > >>> [ > >>> - !xenbus_exists(feature-barrier) -> QUEUE_ORDERED_DRAIN (?) > >>> > >>> This is either blktap1, or an outdated blkback (?) 
I think one would > >>> want to assume blktap1 (or parse otherend-id). With Blkback there''s > >>> probably not much you can do anyway. With blktap1 there''s a chance > >>> you''re doing the right thing. > >>> ] > >> See patch below (delta against previous one). Does that look OK? > > Yes, this patch looks good to me. > > > > Do you think it''s worth asking Jens or someone else who''s really good > > with this stuff about the reasoning so far? > > Not sure. I''m going to push these blkfront patches to Jens shortly; > perhaps he''ll review them. > > > I also have no idea if we''re facing a regrettable performance > > difference. :} > > TAG should always be an improvement over DRAIN, no? > > >>> I''d suggest to ignore/phase-out the caching bit, because it''s no use > >>> wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE. > >>> > >>> I''d suggest to fix the backends where people see fit. In blktap2, > >>> which appears urgent to me, this is a one liner for now, setting > >>> QUEUE_ORDERED_DRAIN on the bdev. Unless we discover some day we want > >>> to implement barriers in tapdisk. Which is not going to happen in > >>> July. > >> OK. Is blkback OK as-is? And I don''t care about blktap1, but I guess > >> its still the current product storage backend... > > The blkback side is still beyond my understanding. > > > > Blkback has only a few simple duties. > > > > * A q->orderd must always be reflected in feature-barrier. > > > > We start out with 1, without testing, then clear after failing. > > > > That looks okay. Should it rather test first? Because I''m > > pretty sure even those spurious request attempts generate alerting > > noise in the logs. > > > > Also, judging from it seems the ordering mode > > may change (stuff like hdparm -W called maybe?). > > I think the choice of IO scheduler will have a bigger effect.Could be, but I don''t see those talking anywhere. Presumably because the queue enforces the serialization where necessary, and the schedulers only get to see the reorderable sequences in between.> > I should check if that needs improvement then. > > > > If that''s the case, then the frontend should be > > watching. Could be even be the case that this is too simplistic > > if you''re switching to ordered> 0 while I/O is in-flight. > > I don''t think that should matter. If the backend understands barriers, > then the frontend should always be able to use barrier requests and > QUEUE_ORDER_TAG internally. It should be up to the backend to make the > barriers work correctly, regardless of the underlying IO stack. > > In other words, I''m not sure there''s any point in advertising > feature-barrier=0 unless the backend literally does not understand > BLKIF_OP_WRITE_BARRIER (in which case, it shouldn''t even mention > feature-barrier either).> And conversely feature-barrier=1 doesn''t mean > "the disks have native barrier support", but "I will make > BLKIF_OP_WRITE_BARRIER do the right thing, so go ahead and use it to > your heart''s desire". > > Perhaps this would be/have been better: > > * feature-barrier missing - backend doesn''t understand > BLKIF_OP_WRITE_BARRIER at all > * feature-barrier=0 - backend will complete all writes in order, and > they''ll be durable once complete > * feature-barrier=1 - backend can reorder writes at will, but > they''ll be durable once complete. BLKIF_OP_WRITE_BARRIER required > to guarantee any specific orderingThat''s the one discussed originally. 
Better not. Enforcing those semantics in blkback with what little
efficiency remains possible would have to emulate a queue drain and
insert a barrier request at least once per batch. That''s a lot of extra
code and synthetic deferrals in the backend, with no practical use case.
There seems to be no guest type worth taking seriously which would
depend on it. Let''s scratch that.

>   * blkfront writes 1 to feature-barrier when it wants to use
>     barriers. The first use of BLKIF_OP_WRITE_BARRIER also has the
>     effect of writing 1 to feature-barrier
>
> But it''s probably too late to retrofit now. But we can make the frontend
> issue a "gratuitous" barrier early on to make sure the backend knows
> what to expect later.
>
> > As you pointed out below, that would have to make all requests
> > synchronous until the frontend starts
> > to ack that state change with some barrier request. I''m wondering
> > about the shared performance impact then in case that doesn''t happen.
>
> Probably not good. The IO path probably gets a lot of win out of being
> given the freedom to reorder IOs. It also raises the question of guests
> just deliberately barriering everything to cause a seek storm DoS (I
> guess what you''re alluding to by "rogue guest" above). We need to look
> at all the new(ish) IO controller stuff for managing QoS, etc.
>
> >   * A barrier write should translate to a barrier write.
> >
> >     I see it generating an empty bio. I don''t see it setting a
> >     BIO_RW_BARRIER.
> >
> >     I see nothing in blk-core which suggests to me that this translates
> >     to a barrier, but there is always hope I''m just mistaken. Help?
>
> In dispatch_rw_block_io():
>
> 	switch (req->operation) {
> 	case BLKIF_OP_READ:
> 		operation = READ;
> 		break;
> 	case BLKIF_OP_WRITE:
> 		operation = WRITE;
> 		break;
> 	case BLKIF_OP_WRITE_BARRIER:
> 		operation = WRITE_BARRIER;
> 		break;
> 	default:
> 		operation = 0; /* make gcc happy */
> 		BUG();
> 	}
>
> where WRITE_BARRIER is
>
> 	#define WRITE_BARRIER	(WRITE | (1 << BIO_RW_BARRIER))

Aah, I''m blind. Yes, that makes much more sense. I stumbled over that
empty bio in there, but that''s just the case where there''s a barrier
but no data. Thank you.

> > I think a BLKIF_OP_WRITE_BARRIER should be allowed to carry
> > data, even encouraged to do so, because ring space is precious
> > and I think Linux would otherwise have to insert barriers before
> > and after a kernel barrier write to stay simple. Right now
> > it''s apparently doing the right thing. Only that this path has
> > never been taken.
> >
> > It should not be required to carry data.
>
> Seems reasonable. It isn''t clear to me whether blkfront will generate
> WRITE_BARRIER with a payload, but it looks like it might already.

Fully resolved above, thanks again. So Blkback looks okay to me.
Blkfront too, with your patch. Probably worth keeping in mind that both
ends might not have seen so much exercise in the past, if blkfront never
submitted any barrier ops before. Just in case we see regressions pop up.

> > Should we phase out OP_FLUSH_DISKCACHE? Did anything actually
> > ever implement it?
>
> Is that relatively new? It isn''t present in the blkif.h in the kernel,
> which is probably a snapshot from a while ago.

Oh, interesting. I was looking at the XCP kernel. Looks like we pulled
that in from SLES. It''s only in the header though, nowhere in the code
we''re running.

> > So I think just doing a quiet blkback update where necessary would stay
> > compatible.
>
> Got a patch?

Only changes left for blktap2 then, as far as I can see.
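The blktap2 one-liner referred to above would amount to something like
the following minimal sketch; the surrounding function is a placeholder
for wherever blktap2 sets up its request queue, not actual blktap2 code.

	/*
	 * Sketch of the blktap2 change discussed above: declare that the
	 * tapdisk-backed queue completes requests out of order and needs
	 * a drain for ordered writes. Placeholder function name.
	 */
	static int blktap_declare_ordering(struct request_queue *rq)
	{
		/*
		 * tapdisk reorders freely and relies on AIO/O_DIRECT
		 * completion for durability, so QUEUE_ORDERED_DRAIN
		 * matches its semantics.
		 */
		return blk_queue_ordered(rq, QUEUE_ORDERED_DRAIN, NULL);
	}

Later kernels replaced the ordered/barrier interface with FLUSH/FUA, so
treat this strictly as a sketch for the kernel generation under
discussion here.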
I should experiment a little with the performance difference, also in
the raw blkfront/blkback case.

Thanks!
Daniel

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel