Joe Jin
2008-Nov-04 02:50 UTC
[Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
Resend.

[patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.

This patchset adds O_DIRECT and O_SYNC support to the xenblk driver.

The xenblk driver currently uses a ring buffer between the backend and the
frontend, which improves I/O performance through caching. In some situations,
however, data needs to be written to disk as soon as possible, and the ring
buffer prevents data from being synced to disk immediately, even when the
file is opened with the O_DIRECT or O_SYNC flag. This leaves the guest OS
exposed to data loss on I/O, so syncing data to disk promptly when a file is
opened with O_DIRECT/O_SYNC is worthwhile for the xenblk driver.

Please review; comments are welcome.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
---
 drivers/xen/blkback/blkback.c    |  100 +++++++++++++++++++++++++++++++--------
 drivers/xen/blkfront/blkfront.c  |    4 +
 fs/buffer.c                      |   40 +++++++++++++++
 fs/ext3/inode.c                  |    6 ++
 fs/jbd/checkpoint.c              |    2
 fs/jbd/commit.c                  |    2
 include/linux/buffer_head.h      |   21 ++++++++
 include/xen/interface/io/blkif.h |   15 +++++
 mm/filemap.c                     |    2
 9 files changed, 173 insertions(+), 19 deletions(-)
Joe Jin
2008-Nov-04 02:52 UTC
[Xen-devel] [patch 1/6] xenblk: Add O_DIRECT and O_SYNC support - generic support.
[patch 1/6] xenblk: Add O_DIRECT and O_SYNC support - generic support.

Define some macros for sync write support.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
---
diff -r 2fb13b8cbe13 include/xen/interface/io/blkif.h
--- a/include/xen/interface/io/blkif.h  Thu Oct 30 13:34:43 2008 +0000
+++ b/include/xen/interface/io/blkif.h  Mon Nov 03 10:31:41 2008 +0800
@@ -63,6 +63,21 @@
  * create the "feature-barrier" node!
  */
 #define BLKIF_OP_WRITE_BARRIER     2
+/*
+ * For O_DIRECT, let the backend driver commit bios as soon as possible.
+ */
+#define BLKIF_OP_RW_SYNC_BITS 2
+#define BLKIF_OP_RW_SYNC      4
+#define set_rw_sync(req) set_bit(BLKIF_OP_RW_SYNC_BITS, \
+        (unsigned long *)&((req)->operation))
+#define BLKIF_OP(op) ((op) & ((1 << BLKIF_OP_RW_SYNC_BITS) - 1))
+#define req_is_sync(req) ((BLKIF_OP((req)->operation) == BLKIF_OP_WRITE) && \
+        ((req)->operation & (1 << BLKIF_OP_RW_SYNC_BITS)))
+
+/* Check whether a struct request is a WRITE carrying the REQ_RW_SYNC flag */
+#define blk_rw_sync_rq(req) (((req)->flags & REQ_RW) && \
+        ((req)->flags & REQ_RW_SYNC))
+
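A minimal sketch of how this encoding is meant to be used on each side of the
ring, assuming the 2.6.18 blkfront/blkback structures; fill_ring_request() and
decode_ring_request() below are illustrative wrappers, not functions from the
patch:

    /* frontend: fill in a ring request from a struct request */
    static void fill_ring_request(blkif_request_t *ring_req, struct request *req)
    {
        ring_req->operation = rq_data_dir(req) ?
            BLKIF_OP_WRITE : BLKIF_OP_READ;
        if (blk_rw_sync_rq(req))        /* WRITE carrying REQ_RW_SYNC */
            set_rw_sync(ring_req);      /* sets bit 2 of ->operation */
    }

    /* backend: decode the same field */
    static void decode_ring_request(blkif_request_t *req)
    {
        int op   = BLKIF_OP(req->operation);  /* low bits: READ/WRITE/BARRIER */
        int sync = req_is_sync(req);          /* bit 2: commit as soon as possible */

        /* ... build bios for 'op', OR in (1 << BIO_RW_SYNC) when 'sync' ... */
    }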
Joe Jin
2008-Nov-04 02:52 UTC
[Xen-devel] [patch 2/6] xenblk: Add O_DIRECT and O_SYNC support -- backend.
[patch 2/6] xenblk: Add O_DIRECT and O_SYNC support -- backend.

This patch adds backend support. If an I/O request is marked as a sync write,
wait for its bios to finish before returning the response to the frontend.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
---
diff -r 2fb13b8cbe13 drivers/xen/blkback/blkback.c
--- a/drivers/xen/blkback/blkback.c     Thu Oct 30 13:34:43 2008 +0000
+++ b/drivers/xen/blkback/blkback.c     Mon Nov 03 10:31:41 2008 +0800
@@ -77,12 +77,15 @@
     unsigned short operation;
     int status;
     struct list_head free_list;
+    wait_queue_head_t pr_bio_wait;
 } pending_req_t;
 
 static pending_req_t *pending_reqs;
 static struct list_head pending_free;
 static DEFINE_SPINLOCK(pending_free_lock);
 static DECLARE_WAIT_QUEUE_HEAD(pending_free_wq);
+
+static struct bio_set *_bios;
 
 #define BLKBACK_INVALID_HANDLE (~0)
 
@@ -123,6 +126,7 @@
     if (!list_empty(&pending_free)) {
         req = list_entry(pending_free.next, pending_req_t, free_list);
         list_del(&req->free_list);
+        init_waitqueue_head(&req->pr_bio_wait);
     }
     spin_unlock_irqrestore(&pending_free_lock, flags);
     return req;
@@ -199,6 +203,15 @@
     blkif->st_oo_req = 0;
 }
 
+static void make_response_and_free_req(pending_req_t *pending_req)
+{
+    fast_flush_area(pending_req);
+    make_response(pending_req->blkif, pending_req->id,
+              pending_req->operation, pending_req->status);
+    blkif_put(pending_req->blkif);
+    free_req(pending_req);
+}
+
 int blkif_schedule(void *arg)
 {
     blkif_t *blkif = arg;
@@ -248,7 +261,7 @@
 static void __end_block_io_op(pending_req_t *pending_req, int error)
 {
     /* An error fails the entire request. */
-    if ((pending_req->operation == BLKIF_OP_WRITE_BARRIER) &&
+    if ((BLKIF_OP(pending_req->operation) == BLKIF_OP_WRITE_BARRIER) &&
         (error == -EOPNOTSUPP)) {
         DPRINTK("blkback: write barrier op failed, not supported\n");
         blkback_barrier(XBT_NIL, pending_req->blkif->be, 0);
@@ -260,19 +273,20 @@
     }
 
     if (atomic_dec_and_test(&pending_req->pendcnt)) {
-        fast_flush_area(pending_req);
-        make_response(pending_req->blkif, pending_req->id,
-                  pending_req->operation, pending_req->status);
-        blkif_put(pending_req->blkif);
-        free_req(pending_req);
+        if (!req_is_sync(pending_req))
+            make_response_and_free_req(pending_req);
     }
 }
 
 static int end_block_io_op(struct bio *bio, unsigned int done, int error)
 {
+    pending_req_t *req = bio->bi_private;
+
     if (bio->bi_size != 0)
         return 1;
-    __end_block_io_op(bio->bi_private, error);
+    __end_block_io_op(req, error);
+    if (req_is_sync(req))
+        wake_up(&req->pr_bio_wait);
     bio_put(bio);
     return error;
 }
@@ -378,6 +392,23 @@
     return more_to_do;
 }
 
+static void blkif_bio_destructor(struct bio *bio)
+{
+    bio_free(bio, _bios);
+}
+
+#define __wait_pending_bio(wq, condition)                               \
+do {                                                                    \
+    DEFINE_WAIT(__pr_wait);                                             \
+    for (;;) {                                                          \
+        prepare_to_wait((wq), &__pr_wait, TASK_UNINTERRUPTIBLE);        \
+        if (condition)                                                  \
+            break;                                                      \
+        schedule();                                                     \
+    }                                                                   \
+    finish_wait((wq), &__pr_wait);                                      \
+} while (0)
+
 static void dispatch_rw_block_io(blkif_t *blkif,
                  blkif_request_t *req,
                  pending_req_t *pending_req)
@@ -392,8 +423,9 @@
     struct bio *bio = NULL, *biolist[BLKIF_MAX_SEGMENTS_PER_REQUEST];
     int ret, i, nbio = 0;
     int operation;
+    int sync = req_is_sync(req);
 
-    switch (req->operation) {
+    switch (BLKIF_OP(req->operation)) {
     case BLKIF_OP_READ:
         operation = READ;
         break;
@@ -407,6 +439,8 @@
         operation = 0; /* make gcc happy */
         BUG();
     }
+    if (sync)
+        operation |= (1 << BIO_RW_SYNC);
 
     /* Check that number of segments is sane. */
     nseg = req->nr_segments;
@@ -438,7 +472,7 @@
         preq.nr_sects += seg[i].nsec;
 
         flags = GNTMAP_host_map;
-        if (operation != READ)
+        if (BLKIF_OP(operation) != READ)
             flags |= GNTMAP_readonly;
         gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
                   req->seg[i].gref, blkif->domid);
@@ -469,9 +503,9 @@
     if (ret)
         goto fail_flush;
 
-    if (vbd_translate(&preq, blkif, operation) != 0) {
+    if (vbd_translate(&preq, blkif, BLKIF_OP(operation)) != 0) {
         DPRINTK("access denied: %s of [%llu,%llu] on dev=%04x\n",
-            operation == READ ? "read" : "write",
+            BLKIF_OP(operation) == READ ? "read" : "write",
             preq.sector_number,
             preq.sector_number + preq.nr_sects, preq.dev);
         goto fail_flush;
@@ -490,7 +524,16 @@
                  virt_to_page(vaddr(pending_req, i)),
                  seg[i].nsec << 9,
                  seg[i].buf & ~PAGE_MASK) == 0)) {
-            bio = biolist[nbio++] = bio_alloc(GFP_KERNEL, nseg-i);
+            /* submit the bio before allocating a new one */
+            if (sync && bio) {
+                plug_queue(blkif, bio);
+                if (nbio == 1)
+                    blkif_get(blkif);
+                atomic_inc(&pending_req->pendcnt);
+                submit_bio(operation, bio);
+            }
+            bio = biolist[nbio++] = bio_alloc_bioset(GFP_KERNEL,
+                                 nseg-i, _bios);
             if (unlikely(bio == NULL))
                 goto fail_put_bio;
 
@@ -498,6 +541,10 @@
             bio->bi_private = pending_req;
             bio->bi_end_io  = end_block_io_op;
             bio->bi_sector  = preq.sector_number;
+            bio->bi_rw = operation;
+            bio->bi_destructor = blkif_bio_destructor;
+            if (sync)
+                bio->bi_rw |= (1 << BIO_RW_SYNC);
         }
 
         preq.sector_number += seg[i].nsec;
@@ -516,15 +563,28 @@
     }
 
     plug_queue(blkif, bio);
-    atomic_set(&pending_req->pendcnt, nbio);
-    blkif_get(blkif);
-    for (i = 0; i < nbio; i++)
-        submit_bio(operation, biolist[i]);
+    if (sync) {
+        if (bio) {
+            if (nbio == 1)
+                blkif_get(blkif);
+            atomic_inc(&pending_req->pendcnt);
+            /* submit the last bio to the request queue */
+            submit_bio(operation, bio);
+        }
+        __wait_pending_bio(&pending_req->pr_bio_wait,
+                   atomic_read(&pending_req->pendcnt) == 0);
+        make_response_and_free_req(pending_req);
+    } else {
+        atomic_set(&pending_req->pendcnt, nbio);
+        blkif_get(blkif);
+        for (i = 0; i < nbio; i++)
+            submit_bio(operation, biolist[i]);
+    }
 
-    if (operation == READ)
+    if (BLKIF_OP(operation) == READ)
         blkif->st_rd_sect += preq.nr_sects;
-    else if (operation == WRITE || operation == WRITE_BARRIER)
+    else if (BLKIF_OP(operation) == WRITE ||
+         BLKIF_OP(operation) == WRITE_BARRIER)
         blkif->st_wr_sect += preq.nr_sects;
 
     return;
@@ -618,6 +678,10 @@
     if (!pending_reqs || !pending_grant_handles || !pending_pages)
         goto out_of_memory;
 
+    _bios = bioset_create(16, 16, 4);
+    if (!_bios)
+        goto out_of_memory;
+
     for (i = 0; i < mmap_pages; i++)
         pending_grant_handles[i] = BLKBACK_INVALID_HANDLE;
Joe Jin
2008-Nov-04 02:53 UTC
[Xen-devel] [patch 3/6] xenblk: Add O_DIRECT and O_SYNC support -- frontend.
[patch 3/6] xenblk: Add O_DIRECT and O_SYNC support -- frontend.

If a request is submitted with the REQ_RW_SYNC flag, pass the flag on to the
backend.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
---
diff -r 2fb13b8cbe13 drivers/xen/blkfront/blkfront.c
--- a/drivers/xen/blkfront/blkfront.c   Thu Oct 30 13:34:43 2008 +0000
+++ b/drivers/xen/blkfront/blkfront.c   Mon Nov 03 10:31:41 2008 +0800
@@ -621,6 +621,8 @@
         BLKIF_OP_WRITE : BLKIF_OP_READ;
     if (blk_barrier_rq(req))
         ring_req->operation = BLKIF_OP_WRITE_BARRIER;
+    if (blk_rw_sync_rq(req))
+        set_rw_sync(ring_req);
 
     ring_req->nr_segments = 0;
     rq_for_each_bio (bio, req) {
@@ -745,7 +747,7 @@
         ADD_ID_TO_FREELIST(info, id);
 
         uptodate = (bret->status == BLKIF_RSP_OKAY);
-        switch (bret->operation) {
+        switch (BLKIF_OP(bret->operation)) {
         case BLKIF_OP_WRITE_BARRIER:
             if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
                 printk("blkfront: %s: write barrier op failed\n",
Joe Jin
2008-Nov-04 02:54 UTC
[Xen-devel] [patch 4/6] xenblk: Add O_DIRECT and O_SYNC support -- generic support for O_SYNC.
[patch 4/6] xenblk: Add O_DIRECT and O_SYNC support -- generic support for O_SYNC.

Add a new buffer_head flag, BH_Sync_Write, for sync writes. If the inode
carries the sync flag (usually from a mount option) or the file is opened
with the O_SYNC flag, mark its buffer_head(s) with Sync_Write. When a buffer
is submitted, check the flag; if it is set, add WRITE_SYNC to the
submit_bio() rw flags, so the frontend passes the flag on to the backend.

diff -r 2fb13b8cbe13 include/linux/buffer_head.h
--- a/include/linux/buffer_head.h       Thu Oct 30 13:34:43 2008 +0000
+++ b/include/linux/buffer_head.h       Mon Nov 03 10:31:41 2008 +0800
@@ -27,6 +27,7 @@
     BH_New,         /* Disk mapping was newly created by get_block */
     BH_Async_Read,  /* Is under end_buffer_async_read I/O */
     BH_Async_Write, /* Is under end_buffer_async_write I/O */
+    BH_Sync_Write,  /* For xenblk O_SYNC write support */
     BH_Delay,       /* Buffer is not yet allocated on disk */
     BH_Boundary,    /* Block is followed by a discontiguity */
     BH_Write_EIO,   /* I/O error on write */
@@ -118,6 +119,7 @@
 BUFFER_FNS(Async_Read, async_read)
 BUFFER_FNS(Async_Write, async_write)
 BUFFER_FNS(Delay, delay)
+BUFFER_FNS(Sync_Write, sync_write)
 BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Ordered, ordered)
@@ -185,6 +187,25 @@
             sector_t bblock, unsigned blocksize);
 
 extern int buffer_heads_over_limit;
+
+#if defined(CONFIG_XEN) && (defined(CONFIG_XEN_BLKDEV_FRONTEND_MODULE) || defined(CONFIG_XEN_BLKDEV_FRONTEND))
+#define mark_sync_write_bh(bh) set_buffer_sync_write(bh)
+static inline void mark_sync_write_bhs(struct page *page)
+{
+    struct buffer_head *bh, *head;
+
+    BUG_ON(!page_has_buffers(page));
+    bh = head = page_buffers(page);
+    do {
+        set_buffer_sync_write(bh);
+        bh = bh->b_this_page;
+    } while (bh != head);
+}
+#else
+#define mark_sync_write_bh(bh) do { } while (0)
+#define mark_sync_write_bhs(page) do { } while (0)
+#endif
 
 /*
  * Generic address_space_operations implementations for buffer_head-backed
diff -r 2fb13b8cbe13 fs/buffer.c
--- a/fs/buffer.c       Thu Oct 30 13:34:43 2008 +0000
+++ b/fs/buffer.c       Mon Nov 03 10:31:41 2008 +0800
@@ -616,6 +616,38 @@
     bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
     local_irq_restore(flags);
     return;
+}
+
+static void end_buffer_sync_write(struct buffer_head *bh)
+{
+    unsigned long flags;
+
+    BUG_ON(!buffer_sync_write(bh));
+
+    local_irq_save(flags);
+    bit_spin_lock(BH_Uptodate_Lock, &bh->b_state);
+    clear_buffer_sync_write(bh);
+    bit_spin_unlock(BH_Uptodate_Lock, &bh->b_state);
+    local_irq_restore(flags);
+    return;
+}
+
+static int end_bio_bh_io_sync_write(struct bio *bio, unsigned int bytes_done, int err)
+{
+    struct buffer_head *bh = bio->bi_private;
+
+    if (bio->bi_size)
+        return 1;
+
+    if (err == -EOPNOTSUPP) {
+        set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+        set_bit(BH_Eopnotsupp, &bh->b_state);
+    }
+    end_buffer_sync_write(bh);
+    bh->b_end_io(bh, test_bit(BIO_UPTODATE, &bio->bi_flags));
+    bio_put(bio);
+    return 0;
 }
 
 /*
@@ -2840,6 +2872,13 @@
     bio->bi_end_io = end_bio_bh_io_sync;
     bio->bi_private = bh;
 
+    /* for xenblk O_SYNC support, let the frontend mark it as SYNC I/O,
+     * so the backend will submit it immediately */
+    if ((rw & WRITE) && buffer_sync_write(bh)) {
+        bio->bi_end_io = end_bio_bh_io_sync_write;
+        rw |= WRITE_SYNC;
+    }
+
     bio_get(bio);
     submit_bio(rw, bio);
 
@@ -2921,6 +2960,7 @@
     if (test_clear_buffer_dirty(bh)) {
         get_bh(bh);
         bh->b_end_io = end_buffer_write_sync;
+        mark_sync_write_bh(bh);
         ret = submit_bh(WRITE, bh);
         wait_on_buffer(bh);
         if (buffer_eopnotsupp(bh)) {
diff -r 2fb13b8cbe13 mm/filemap.c
--- a/mm/filemap.c      Thu Oct 30 13:34:43 2008 +0000
+++ b/mm/filemap.c      Mon Nov 03 10:31:41 2008 +0800
@@ -2145,6 +2145,8 @@
             vmtruncate(inode, isize);
             break;
         }
+        if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
+            mark_sync_write_bhs(page);
         if (likely(nr_segs == 1))
             copied = filemap_copy_from_user(page, offset,
                             buf, bytes);
Joe Jin
2008-Nov-04 02:54 UTC
[Xen-devel] [patch 5/6] xenblk: Add O_DIRECT and O_SYNC support -- for ext3.
[patch 5/6] xenblk: Add O_DIRECT and O_SYNC support -- for ext3.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
---
diff -r 2fb13b8cbe13 fs/ext3/inode.c
--- a/fs/ext3/inode.c   Thu Oct 30 13:34:43 2008 +0000
+++ b/fs/ext3/inode.c   Mon Nov 03 10:31:41 2008 +0800
@@ -1229,6 +1229,8 @@
         EXT3_I(inode)->i_disksize = new_i_size;
         ret = generic_commit_write(file, page, from, to);
     }
+    if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
+        handle->h_sync = 1;
     ret2 = ext3_journal_stop(handle);
     if (!ret)
         ret = ret2;
@@ -1252,6 +1254,8 @@
     else
         ret = generic_commit_write(file, page, from, to);
 
+    if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
+        handle->h_sync = 1;
     ret2 = ext3_journal_stop(handle);
     if (!ret)
         ret = ret2;
@@ -1285,6 +1289,8 @@
         if (!ret)
             ret = ret2;
     }
+    if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
+        handle->h_sync = 1;
     ret2 = ext3_journal_stop(handle);
     if (!ret)
         ret = ret2;
Joe Jin
2008-Nov-04 02:55 UTC
[Xen-devel] [patch 6/6] xenblk: Add O_DIRECT and O_SYNC support -- for jbd.
[patch 6/6] xenblk: Add O_DIRECT and O_SYNC support -- for jbd.

Journal data always needs to be synced to disk as soon as possible, so mark
all jbd write buffer_heads with BH_Sync_Write.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
---
diff -r 2fb13b8cbe13 fs/jbd/checkpoint.c
--- a/fs/jbd/checkpoint.c       Thu Oct 30 13:34:43 2008 +0000
+++ b/fs/jbd/checkpoint.c       Tue Nov 04 10:43:59 2008 +0800
@@ -211,6 +211,8 @@
 {
     int i;
 
+    for (i = 0; i < *batch_count; i++)
+        mark_sync_write_bh(bhs[i]);
     ll_rw_block(SWRITE, *batch_count, bhs);
     for (i = 0; i < *batch_count; i++) {
         struct buffer_head *bh = bhs[i];
diff -r 2fb13b8cbe13 fs/jbd/commit.c
--- a/fs/jbd/commit.c   Thu Oct 30 13:34:43 2008 +0000
+++ b/fs/jbd/commit.c   Tue Nov 04 10:43:59 2008 +0800
@@ -166,6 +166,7 @@
 
     for (i = 0; i < bufs; i++) {
         wbuf[i]->b_end_io = end_buffer_write_sync;
+        set_buffer_sync_write(wbuf[i]);
         /* We use-up our safety reference in submit_bh() */
         submit_bh(WRITE, wbuf[i]);
     }
@@ -631,6 +632,7 @@
             clear_buffer_dirty(bh);
             set_buffer_uptodate(bh);
             bh->b_end_io = journal_end_buffer_io_sync;
+            set_buffer_sync_write(bh);
             submit_bh(WRITE, bh);
         }
         cond_resched();
Keir Fraser
2008-Nov-04 07:24 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 4/11/08 02:50, "Joe Jin" <joe.jin@oracle.com> wrote:

> Resend.

I never saw this patchset before!

> [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
>
> This patchset adds O_DIRECT and O_SYNC support to the xenblk driver.
>
> The xenblk driver currently uses a ring buffer between the backend and the
> frontend, which improves I/O performance through caching. In some
> situations, however, data needs to be written to disk as soon as possible,
> and the ring buffer prevents data from being synced to disk immediately,
> even when the file is opened with the O_DIRECT or O_SYNC flag. This leaves
> the guest OS exposed to data loss on I/O, so syncing data to disk promptly
> when a file is opened with O_DIRECT/O_SYNC is worthwhile for the xenblk
> driver.
>
> Please review; comments are welcome.

How is this better than waiting in the domU (blkfront) domain to collect the
required asynchronous responses? Scattering changes into Linux filesystems is
a dead end unless you can get the changes accepted upstream via lkml.

 -- Keir
Joe Jin
2008-Nov-04 08:02 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 2008-11-04 07:24, Keir Fraser wrote:
> On 4/11/08 02:50, "Joe Jin" <joe.jin@oracle.com> wrote:
>
> > Resend.
>
> I never saw this patchset before!
>

I did send it earlier, but it could not be delivered to the mailing list
because I had not subscribed to the list :)

> > [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
> >
> > This patchset adds O_DIRECT and O_SYNC support to the xenblk driver.
> >
> > The xenblk driver currently uses a ring buffer between the backend and
> > the frontend, which improves I/O performance through caching. In some
> > situations, however, data needs to be written to disk as soon as
> > possible, and the ring buffer prevents data from being synced to disk
> > immediately, even when the file is opened with the O_DIRECT or O_SYNC
> > flag. This leaves the guest OS exposed to data loss on I/O, so syncing
> > data to disk promptly when a file is opened with O_DIRECT/O_SYNC is
> > worthwhile for the xenblk driver.
> >
> > Please review; comments are welcome.
>
> How is this better than waiting in the domU (blkfront) domain to collect
> the required asynchronous responses? Scattering changes into Linux
> filesystems is a dead end unless you can get the changes accepted upstream
> via lkml.
>

Do you mean implementing this feature just in domU (blkfront)?

For direct I/O the answer is yes, because the request is already marked when
a write is issued with the O_DIRECT flag. For O_SYNC, however, the semantics
are implemented at the filesystem layer, so I think it is impossible to add
O_SYNC support without touching filesystem code.

Also, I think the filesystem-related changes may be hard to get accepted
upstream, because this is a special issue that only concerns xenblk.

Thanks,
Joe
Keir Fraser
2008-Nov-04 08:13 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 4/11/08 08:02, "Joe Jin" <joe.jin@oracle.com> wrote:

>> How is this better than waiting in the domU (blkfront) domain to collect
>> the required asynchronous responses? Scattering changes into Linux
>> filesystems is a dead end unless you can get the changes accepted upstream
>> via lkml.
>>
>
> Do you mean implementing this feature just in domU (blkfront)?

Yes. If I understand what is being done.

> For direct I/O the answer is yes, because the request is already marked
> when a write is issued with the O_DIRECT flag. For O_SYNC, however, the
> semantics are implemented at the filesystem layer, so I think it is
> impossible to add O_SYNC support without touching filesystem code.

O_DIRECT has no meaning really across the xenblk interface. The intention is
that blkback talks to a raw block device below the buffer cache anyway.

For O_SYNC it can be implemented by pushing out all I/Os into the ring buffer
and then waiting for all the required responses. Why do you think that
doesn't work? Indeed it should work already if such logic exists in the
generic Linux block layer, since blkfront notifies that layer as responses
come in from blkback.

> Also, I think the filesystem-related changes may be hard to get accepted
> upstream, because this is a special issue that only concerns xenblk.

But xenblk (blkfront at least) is upstream now.

 -- Keir
Joe Jin
2008-Nov-05 03:22 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
> > For direct I/O the answer is yes, because the request is already marked
> > when a write is issued with the O_DIRECT flag. For O_SYNC, however, the
> > semantics are implemented at the filesystem layer, so I think it is
> > impossible to add O_SYNC support without touching filesystem code.
>
> O_DIRECT has no meaning really across the xenblk interface. The intention
> is that blkback talks to a raw block device below the buffer cache anyway.
>
> For O_SYNC it can be implemented by pushing out all I/Os into the ring
> buffer and then waiting for all the required responses. Why do you think
> that doesn't work? Indeed it should work already if such logic exists in
> the generic Linux block layer, since blkfront notifies that layer as
> responses come in from blkback.
>

Do you think that when the xenblk backend calls make_response() the request
has actually been committed to disk?

We did a simple test and found that without the BIO_RW_SYNC flag set, with
either the O_DIRECT or the O_SYNC flag, a lot of data is lost when the VM
crashes or there is a power outage, sometimes more than 1MB, which means the
VM is at high risk of data loss. With the BIO_RW_SYNC flag set when calling
submit_bio(), a file opened with O_DIRECT or O_SYNC syncs its data very well.

Thanks,
Joe
Keir Fraser
2008-Nov-05 07:39 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 5/11/08 03:22, "Joe Jin" <joe.jin@oracle.com> wrote:

>> O_DIRECT has no meaning really across the xenblk interface. The intention
>> is that blkback talks to a raw block device below the buffer cache anyway.
>>
>> For O_SYNC it can be implemented by pushing out all I/Os into the ring
>> buffer and then waiting for all the required responses. Why do you think
>> that doesn't work? Indeed it should work already if such logic exists in
>> the generic Linux block layer, since blkfront notifies that layer as
>> responses come in from blkback.
>>
> Do you think that when the xenblk backend calls make_response() the request
> has actually been committed to disk?
> We did a simple test and found that without the BIO_RW_SYNC flag set, with
> either the O_DIRECT or the O_SYNC flag, a lot of data is lost when the VM
> crashes or there is a power outage, sometimes more than 1MB, which means
> the VM is at high risk of data loss. With the BIO_RW_SYNC flag set when
> calling submit_bio(), a file opened with O_DIRECT or O_SYNC syncs its data
> very well.

What's the vbd type in this case: raw partition, lvm, qcow file, ...?

The existing BLKIF_OP_WRITE_BARRIER and BLKIF_OP_FLUSH_DISKCACHE should
suffice to implement O_SYNC on the blkfront side, I think. O_DIRECT doesn't
mean writes are synchronous to the platters -- it just means the buffer cache
is bypassed -- which should generally be the case on the blkback side always
anyway.

 -- Keir
Joe Jin
2008-Nov-05 08:16 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
>> Do you think that when the xenblk backend calls make_response() the
>> request has actually been committed to disk?
>> We did a simple test and found that without the BIO_RW_SYNC flag set, with
>> either the O_DIRECT or the O_SYNC flag, a lot of data is lost when the VM
>> crashes or there is a power outage, sometimes more than 1MB, which means
>> the VM is at high risk of data loss. With the BIO_RW_SYNC flag set when
>> calling submit_bio(), a file opened with O_DIRECT or O_SYNC syncs its data
>> very well.

> What's the vbd type in this case: raw partition, lvm, qcow file, ...?

Raw partition; on a crash/power outage, lots of data is lost :(

> The existing BLKIF_OP_WRITE_BARRIER and BLKIF_OP_FLUSH_DISKCACHE should
> suffice to implement O_SYNC on the blkfront side, I think. O_DIRECT doesn't
> mean writes are synchronous to the platters -- it just means the buffer
> cache is bypassed -- which should generally be the case on the blkback side
> always anyway.

However, the frontend driver cannot see the write's open flags at all. If we
could tell that a request came from a file opened with O_SYNC it would be
easy to handle without touching the filesystem layer.

O_DIRECT differs in more than just the buffer_head handling: if a read/write
is direct I/O, the request needs to be committed as soon as possible, which
means unplugging the request_queue when the request carries the O_DIRECT flag
(the request should be marked with the REQ_RW_SYNC flag).

Thanks,
Joe
Keir Fraser
2008-Nov-05 09:08 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 5/11/08 08:16, "Joe Jin" <joe.jin@oracle.com> wrote:

>> What's the vbd type in this case: raw partition, lvm, qcow file, ...?
>
> Raw partition; on a crash/power outage, lots of data is lost :(
>
>> The existing BLKIF_OP_WRITE_BARRIER and BLKIF_OP_FLUSH_DISKCACHE should
>> suffice to implement O_SYNC on the blkfront side, I think. O_DIRECT
>> doesn't mean writes are synchronous to the platters -- it just means the
>> buffer cache is bypassed -- which should generally be the case on the
>> blkback side always anyway.
>
> However, the frontend driver cannot see the write's open flags at all. If
> we could tell that a request came from a file opened with O_SYNC it would
> be easy to handle without touching the filesystem layer.

So what does O_SYNC mean to Linux then? If it's not passed down to the blkdev
layer then it can only mean that requests must be synchronously committed as
far as that layer. I would therefore imagine you could lose as much data
natively as you do running on Xen. Where are these megabytes of data sitting
when they get lost, and why does it not happen when running natively? If
O_SYNC is not waiting for completion responses from the block layer, that's a
limitation of Linux's generic block subsystem, isn't it?

 -- Keir

> O_DIRECT differs in more than just the buffer_head handling: if a
> read/write is direct I/O, the request needs to be committed as soon as
> possible, which means unplugging the request_queue when the request carries
> the O_DIRECT flag (the request should be marked with the REQ_RW_SYNC flag).
Keir Fraser
2008-Nov-05 09:43 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 5/11/08 09:37, "Jens Axboe" <jens.axboe@oracle.com> wrote:

>>> However, the frontend driver cannot see the write's open flags at all.
>>> If we could tell that a request came from a file opened with O_SYNC it
>>> would be easy to handle without touching the filesystem layer.
>>
>> So what does O_SYNC mean to Linux then? If it's not passed down to the
>> blkdev layer then it can only mean that requests must be synchronously
>> committed as far as that layer. I would therefore imagine you could lose
>> as much data natively as you do running on Xen. Where are these megabytes
>> of data sitting when they get lost, and why does it not happen when
>> running natively? If O_SYNC is not waiting for completion responses from
>> the block layer, that's a limitation of Linux's generic block subsystem,
>> isn't it?
>
> The block layer doesn't care, it's async by nature. What makes O_SYNC
> work is that callers will wait on the submitted writes afterwards, not
> returning a result until they have reached the drive (that's as far as
> that guarantee goes; beyond that you need flushes or barriers).
>
> So it is definitely NOT a limitation of the block layer, that's very
> much how it is designed.

The issue at hand is whether Xen's block drivers need to handle O_SYNC
specially in any way. It sounds like it is actually handled properly at a
higher level (e.g. individual filesystems?) and hence the patches proposed by
Oracle do not make much sense.

 -- Keir
Keir Fraser
2008-Nov-05 09:53 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 5/11/08 09:42, "Jens Axboe" <jens.axboe@oracle.com> wrote:

> On Wed, Nov 05 2008, Keir Fraser wrote:
>> O_DIRECT doesn't mean writes are synchronous to the platters -- it just
>> means the buffer cache is bypassed -- which should generally be the case
>> on the blkback side always anyway.
>
> Ehm, yes it does. When a write(2) returns for an fd that has been opened
> with O_DIRECT, that io WILL have gone to the disk. So it's as close to
> 'sync to platter' as we can be without fiddling with the write cache on
> the device; there's even talk of making it fully platter sync by
> switching on FUA (forced unit access) in the write command, which would
> make it always direct-to-platter regardless of the device cache setting.

Okay, I was going by the open() man page description, which only talks about
interactions with the buffer cache.

Even so, when the Xen block device is backed by a raw partition (Oracle's
test scenario), there should be nothing the Xen drivers need to do. They
generate their asynchronous response to the frontend only after the backend
gets the async callback from the backend domain's block-device layer.
Presumably something in the filesystem or generic blkdev layer in the
frontend VM will be making the write()ing process block until such responses
are gathered.

 -- Keir
Alan Cox
2008-Nov-05 11:26 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
> So what does O_SYNC mean to Linux then? If it's not passed down to the
> blkdev layer then it can only mean that requests must be synchronously

The block device layer speaks ordering barriers, not file flags. The O_SYNC
flag is passed to the file system, and the file system is responsible for
turning that into some type of ordering, ditto stuff like fdatasync(). The
exact behaviour is then configurable by the fs (e.g. ext3 has journalled and
non-journalled data modes).

In terms of data integrity, if you are relying on the order of physical media
writes internally in Xen you have already lost, as there is nothing in most
ATA specs requiring that a sequence of I/O operations occurs in the order in
which they go to the drive. The semantics of cache flushing versus other I/O
are defined, so you have ordering points you can impose and some control over
how long data may sit around before hitting disk.

Alan
Keir Fraser
2008-Nov-05 13:52 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 5/11/08 13:33, "Jens Axboe" <jens.axboe@oracle.com> wrote:

> If you only complete the request once you get a completion in, then
> there's no issue. I thought it did queuing, serves me well for not
> actually having the code handy to look at... There's still some merit to
> Joe's patch, he should have split #2 into two pieces - some of it is
> actually a real bug fix (you must not do bio_alloc() without having
> submitted the previous bio you allocated, and you should use a private
> bioset if you stack). This much is at least visible from just looking at
> the patch!

blkfront can queue multiple requests to blkback, each of which is a
scatter-gather list. Each request is turned into one or more bios by blkback,
and each is individually submit_bio()ed. Only when ->bi_end_io(0) has been
called for all bios corresponding to a request is a response queued for
collection by blkfront.

I wasn't aware of limitations on the number of times you can call
bio_alloc(). Perhaps we should arrange to submit_bio() more eagerly, rather
than batching up (we plug the queue beforehand, so there's no need to batch
up submit_bio() calls for scheduling purposes). Would that be better than a
bioset, or is a bioset easier, do you think?

> Does blkback propagate unplug events downwards?

Not sure what that means. blkback does its own limited queue plugging to try
to develop usefully schedulable batches of I/O. There's no concept of
plugging across the blkfront/blkback interface.

 -- Keir
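A condensed sketch of the completion counting Keir describes, loosely
following the blkback code shown in patch 2/6; grant mapping, the barrier
case and error paths are omitted, so this is an illustration rather than the
exact driver source:

    /* one pending_req_t tracks every bio built from one ring request */
    static void dispatch_one_request(blkif_t *blkif, blkif_request_t *req,
                                     pending_req_t *pending_req)
    {
        struct bio *biolist[BLKIF_MAX_SEGMENTS_PER_REQUEST];
        int operation = (req->operation == BLKIF_OP_READ) ? READ : WRITE;
        int i, nbio = 0;

        /* ... build one or more bios into biolist[], counting them in nbio ... */

        atomic_set(&pending_req->pendcnt, nbio);
        for (i = 0; i < nbio; i++)
            submit_bio(operation, biolist[i]);
    }

    /* called from each bio's ->bi_end_io handler */
    static void __end_block_io_op(pending_req_t *pending_req, int error)
    {
        if (error)
            pending_req->status = BLKIF_RSP_ERROR;

        /* the response reaches blkfront only after every bio has completed */
        if (atomic_dec_and_test(&pending_req->pendcnt))
            make_response(pending_req->blkif, pending_req->id,
                          pending_req->operation, pending_req->status);
    }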
Keir Fraser
2008-Nov-05 21:02 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 5/11/08 20:16, "Jens Axboe" <jens.axboe@oracle.com> wrote:

>>> Does blkback propagate unplug events downwards?
>>
>> Not sure what that means. blkback does its own limited queue plugging to
>> try to develop usefully schedulable batches of I/O. There's no concept of
>> plugging across the blkfront/blkback interface.
>
> So once you leave your queuing loop, you unplug the below device? Or
> just point me at the source...

The tree's at http://xenbits.xensource.com/linux-2.6.18-xen.hg

You can clone it or browse it there.

The medium- to long-term issue is just to make sure that Jeremy's port of
blkback into upstream Linux (which he's probably working on at the moment)
doesn't perpetuate the bio violations. But I guess we should fix 2.6.18 too,
since others still base their own Xen patchsets on it (e.g., for dom0
support).

Thanks,
Keir
Joe Jin
2008-Nov-06 08:13 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
> You don't need a bio_set if you are not queuing beneath someone else.
> But you do need to take care not to call bio_alloc() again without
> having submitted the previous allocation, that is a violation of the bio
> mempool and can lead to deadlocks. I think the batching you do looks
> pointless (again, only looking at the patch), you should submit each bio
> as soon as it is built instead of stashing it in an array for later
> submission. That also gets rid of your bio mempool violation.
>
> > > Does blkback propagate unplug events downwards?
> >
> > Not sure what that means. blkback does its own limited queue plugging to
> > try to develop usefully schedulable batches of I/O. There's no concept of
> > plugging across the blkfront/blkback interface.
>
> So once you leave your queuing loop, you unplug the below device? Or
> just point me at the source...
>

From the source code, once blkback has taken all the requests from the ring
buffer it unplugs the request_queue; by the time unplug_fn() is called there
may still be requests sitting in the request_queue waiting to be committed to
the device. I suspected the data loss was due to the device's cache, but when
the bio rw is set to WRITE_SYNC the data syncs to disk almost at once.

Thanks,
Joe
Joe Jin
2008-Nov-06 08:23 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 2008-11-06 09:17, Jens Axboe wrote:
> On Wed, Nov 05 2008, Keir Fraser wrote:
> > On 5/11/08 20:16, "Jens Axboe" <jens.axboe@oracle.com> wrote:
> >
> > >>> Does blkback propagate unplug events downwards?
> > >>
> > >> Not sure what that means. blkback does its own limited queue plugging
> > >> to try to develop usefully schedulable batches of I/O. There's no
> > >> concept of plugging across the blkfront/blkback interface.
> > >
> > > So once you leave your queuing loop, you unplug the below device? Or
> > > just point me at the source...
> >
> > The tree's at http://xenbits.xensource.com/linux-2.6.18-xen.hg
> >
> > You can clone it or browse it there.
> >
> > The medium- to long-term issue is just to make sure that Jeremy's port of
> > blkback into upstream Linux (which he's probably working on at the
> > moment) doesn't perpetuate the bio violations. But I guess we should fix
> > 2.6.18 too, since others still base their own Xen patchsets on it (e.g.,
> > for dom0 support).
>
> Ah thanks! Perhaps we can talk Joe into snipping the bio_alloc() changes
> out of his patch #2 - Joe, just the bio_alloc() and submit_bio() stuff,
> you don't have to include the bio_set changes.
>

Thanks, will get rid of it.

Joe
Keir Fraser
2008-Nov-06 08:44 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 6/11/08 08:17, "Jens Axboe" <jens.axboe@oracle.com> wrote:

>> The medium- to long-term issue is just to make sure that Jeremy's port of
>> blkback into upstream Linux (which he's probably working on at the moment)
>> doesn't perpetuate the bio violations. But I guess we should fix 2.6.18
>> too, since others still base their own Xen patchsets on it (e.g., for dom0
>> support).
>
> Ah thanks! Perhaps we can talk Joe into snipping the bio_alloc() changes
> out of his patch #2 - Joe, just the bio_alloc() and submit_bio() stuff,
> you don't have to include the bio_set changes.

But Joe's patch still queues up bios into an array, the same as ever?

 -- Keir
Keir Fraser
2008-Nov-06 09:26 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 6/11/08 08:51, "Jens Axboe" <jens.axboe@oracle.com> wrote:

> I guess he needs that for waiting on them, but it really should just do:
>
> struct bio *bio = NULL;
>
> loop {
>         if (bio) {
>                 submit_bio(bio, ...);
>                 bio = NULL;
>         }
>
>         bio = bio_alloc(...);
>         ...
> }
>
> if (bio)
>         submit_bio(bio, ...);
>
> and be done with it.

Yes, I can put together a patch for that easily enough, I think.

 -- Keir
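A self-contained illustration of the pattern Jens sketches above, in
2.6.18-era block-layer terms: a bio is submitted as soon as bio_add_page()
refuses another page, and only then is the next one allocated, so at most one
un-submitted bio is ever held from the mempool. write_pages() and
end_write_pages() are hypothetical names for this sketch, not part of the
patch, and completion handling is omitted:

    /* write 'nr_pages' full pages to 'bdev' starting at 'sector' */
    static void write_pages(struct block_device *bdev, sector_t sector,
                            struct page **pages, int nr_pages)
    {
        struct bio *bio = NULL;
        int i;

        for (i = 0; i < nr_pages; i++) {
            while (bio == NULL ||
                   bio_add_page(bio, pages[i], PAGE_SIZE, 0) == 0) {
                if (bio) {
                    /* ship the full bio before taking another
                     * allocation from the bio mempool */
                    submit_bio(WRITE, bio);
                    bio = NULL;
                }
                bio = bio_alloc(GFP_KERNEL, nr_pages - i);
                bio->bi_bdev   = bdev;
                bio->bi_sector = sector;
                bio->bi_end_io = end_write_pages; /* completion callback, not shown */
            }
            sector += PAGE_SIZE >> 9;
        }

        if (bio)
            submit_bio(WRITE, bio);
    }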
Keir Fraser
2008-Nov-06 10:33 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 6/11/08 09:29, "Jens Axboe" <jens.axboe@oracle.com> wrote:

>>> and be done with it.
>>
>> Yes, I can put together a patch for that easily enough, I think.
>
> Should be easy enough. Also be sure to defer any unplugging of the
> device until the very end, to still make a lot of room for merging and
> sorting.

http://xenbits.xensource.com/staging/linux-2.6.18-xen.hg c/s 723

 -- Keir
Joe Jin
2008-Nov-07 00:51 UTC
Re: [Xen-devel] [patch 0/6] xenblk: Add O_DIRECT and O_SYNC support.
On 2008-11-06 12:07, Jens Axboe wrote:
> On Thu, Nov 06 2008, Keir Fraser wrote:
> > On 6/11/08 09:29, "Jens Axboe" <jens.axboe@oracle.com> wrote:
> >
> > >>> and be done with it.
> > >>
> > >> Yes, I can put together a patch for that easily enough, I think.
> > >
> > > Should be easy enough. Also be sure to defer any unplugging of the
> > > device until the very end, to still make a lot of room for merging and
> > > sorting.
> >
> > http://xenbits.xensource.com/staging/linux-2.6.18-xen.hg c/s 723
>
> Looks fine, but you can also get rid of the fail_put_bio path and the
> check for a NULL return from bio_alloc(), since it'll never fail if you
> use GFP_KERNEL as the allocation mask. That's why you cannot nest
> allocations as well.
>

OK, I will test this patch. Thanks, all.

Joe