blktap devices attached to dom0 are liable to wedge during IO transfers.
The problem does not occur in typical usage scenarios (i.e., virtual
devices attached to guest domains); it is unique to the unanticipated
case in which virtual devices are attached to dom0.

The problem arises when processes in dom0 generate a large number of
dirty pages while writing to a block-attached device. Once the number
of dirty pages reaches a certain threshold, the dom0 kernel begins
throttling IO in balance_dirty_pages; processes traversing the buffered
IO path will block in this function until the number of dirty pages
decreases.

This is bad for the tapdisk process, which is responsible for servicing
IO requests from the blktap driver. The tapdisk process normally
performs direct IO, but if it writes to a hole in a sparse file, it
falls into the buffered IO path. If the tapdisk process blocks in
balance_dirty_pages, it will do so indefinitely, because it is the only
process that cleans the pages dirtied by the processes writing to the
virtual device. Thus dirty pages continue to amass in dom0 as IO is
performed on the virtual device, but none of them make it to the
physical devices because the tapdisk process is unable to service the
requests.

Note that when used as originally intended, blktap does not suffer from
this problem: when blktap devices are attached to guest domains,
performing IO on them dirties pages in the guest domain, not in dom0, so
the tapdisk process doesn't get throttled in balance_dirty_pages.

Attached is a patch that eschews the dom0 problem by exempting the
tapdisk process from blocking in balance_dirty_pages. tapdisk processes
servicing dom0-attached devices are granted special status using a
modified setpriority syscall; a check in balance_dirty_pages ensures
that such processes do not block indefinitely.

This is clearly a hacky solution; any suggestions for improvement are
welcome.
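
The wedge described above can be illustrated with a toy model. This is only a sketch and not kernel code: the struct, the function names, and the numbers (a threshold of 100 pages, 60 pages dirtied and 50 pages cleaned per round) are all hypothetical, and it assumes tapdisk is always on the buffered IO path. Both the writer and tapdisk stall once the dirty threshold is reached, unless tapdisk is exempted.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name and number here is hypothetical. */
struct toy_dom0 {
    int dirty_pages;      /* pages dirtied in dom0, not yet written out */
    int dirty_threshold;  /* level at which buffered writers are throttled */
    int pages_written;    /* pages tapdisk has pushed to the physical device */
};

/*
 * One round of IO. The writer dirties pages unless it is throttled.
 * tapdisk then cleans pages unless it is throttled too; an exempt
 * tapdisk is never throttled.
 */
static void toy_round(struct toy_dom0 *d, int dirtied, int cleaned,
                      bool tapdisk_exempt)
{
    /* Writer: blocks once the threshold has been reached. */
    if (d->dirty_pages < d->dirty_threshold)
        d->dirty_pages += dirtied;

    /* tapdisk: the only process that can reduce dirty_pages. */
    if (!tapdisk_exempt && d->dirty_pages >= d->dirty_threshold)
        return; /* blocked; nothing is cleaned this round */

    int n = cleaned < d->dirty_pages ? cleaned : d->dirty_pages;
    d->dirty_pages -= n;
    d->pages_written += n;
}

/* Run the model for `rounds` rounds and return the pages written out. */
static int toy_run(int rounds, bool tapdisk_exempt)
{
    struct toy_dom0 d = { .dirty_pages = 0, .dirty_threshold = 100,
                          .pages_written = 0 };
    for (int i = 0; i < rounds; i++)
        toy_round(&d, 60, 50, tapdisk_exempt);
    return d.pages_written;
}
```

With these numbers, toy_run(n, false) stops making progress after the fifth round no matter how large n is, while toy_run(n, true) keeps writing pages every round.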
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Brendan Cully
2007-Mar-21 16:59 UTC
Re: [Xen-devel] blktap wedges when block-attached to dom0
Any chance this will be refreshed for 2.6.18? I very much enjoy being
able to block-attach in domain 0, but am less enamoured of the frequent
hangs when I fsck those devices...

On Tuesday, 02 January 2007 at 17:37, jake wrote:
> blktap devices attached to dom0 are liable to wedge during IO transfers.
> The problem does not occur in typical usage scenarios (i.e., virtual
> devices attached to guest domains); it is unique to the unanticipated
> case in which virtual devices are attached to dom0.
>
> The problem arises when processes in dom0 generate a large number of
> dirty pages while writing to a block-attached device. Once the number
> of dirty pages reaches a certain threshold, the dom0 kernel begins
> throttling IO in balance_dirty_pages; processes traversing the buffered
> IO path will block in this function until the number of dirty pages
> decreases.
>
> This is bad for the tapdisk process, which is responsible for servicing
> IO requests from the blktap driver. The tapdisk process normally
> performs direct IO, but if it writes to a hole in a sparse file, it
> falls into the buffered IO path. If the tapdisk process blocks in
> balance_dirty_pages, it will do so indefinitely, because it is the only
> process that cleans the pages dirtied by the processes writing to the
> virtual device. Thus dirty pages continue to amass in dom0 as IO is
> performed on the virtual device, but none of them make it to the
> physical devices because the tapdisk process is unable to service the
> requests.
>
> Note that when used as originally intended, blktap does not suffer from
> this problem: when blktap devices are attached to guest domains,
> performing IO on them dirties pages in the guest domain, not in dom0, so
> the tapdisk process doesn't get throttled in balance_dirty_pages.
>
> Attached is a patch that eschews the dom0 problem by exempting the
> tapdisk process from blocking in balance_dirty_pages. tapdisk processes
> servicing dom0-attached devices are granted special status using a
> modified setpriority syscall; a check in balance_dirty_pages ensures
> that such processes do not block indefinitely.
>
> This is clearly a hacky solution; any suggestions for improvement are
> welcome.

> # HG changeset patch
> # User Jake Wires <jwires@xensource.com>
> # Date 1166551978 28800
> # Node ID 34c6a9a2983ae46fad5dbba7e4b49520fb639a8c
> # Parent df1e7ae878b4badf4e5555df12a1c4d233170fb9
> [BLKTAP] prevent tapdisk processes from blocking in balance_dirty_pages
>
> This patch mods the setpriority syscall to enable marking processes as special
> IO processes. IO processes are exempted from blocking in balance_dirty_pages.
> This patch is intended to avoid deadlocks when block-attaching a blktap VDI to
> dom0.
>
> diff -r df1e7ae878b4 -r 34c6a9a2983a patches/linux-2.6.16.33/series
> +++ b/patches/linux-2.6.16.33/series Tue Dec 19 10:12:58 2006 -0800
> @@ -5,6 +5,7 @@ git-4bfaaef01a1badb9e8ffb0c0a37cd2379008
>  git-4bfaaef01a1badb9e8ffb0c0a37cd2379008d21f.patch
>  linux-2.6.19-rc1-kexec-move_segment_code-x86_64.patch
>  blktap-aio-16_03_06.patch
> +blktap-ioprio.patch
>  device_bind.patch
>  fix-hz-suspend.patch
>  fix-ide-cd-pio-mode.patch
> diff -r df1e7ae878b4 -r 34c6a9a2983a tools/blktap/drivers/blktapctrl.c
> +++ b/tools/blktap/drivers/blktapctrl.c Tue Dec 19 10:12:58 2006 -0800
> @@ -51,6 +51,7 @@
>  #include <xs.h>
>  #include <printf.h>
>  #include <sys/time.h>
> +#include <sys/resource.h>
>  #include <syslog.h>
>
>  #include "blktaplib.h"
> @@ -535,6 +536,14 @@ int blktapctrl_new_blkif(blkif_t *blkif)
>              goto fail;
>          }
>
> +        /* exempt tapdisk from flushing when attached to dom0 */
> +        if (blkif->domid == 0)
> +            if (setpriority(PRIO_PROCESS,
> +                            blkif->tappid, PRIO_SPECIAL_IO)) {
> +                DPRINTF("Unable to prioritize tapdisk proc\n");
> +                goto fail;
> +            }
> +
>          /* Both of the following read and write calls will block up to
>           * max_timeout val*/
>          if (write_msg(blkif->fds[WRITE], CTLMSG_PARAMS, blkif, ptr)
> diff -r df1e7ae878b4 -r 34c6a9a2983a tools/blktap/lib/blktaplib.h
> +++ b/tools/blktap/lib/blktaplib.h Tue Dec 19 10:12:58 2006 -0800
> @@ -57,6 +57,8 @@
>  #define BLKTAP_QUERY_ALLOC_REQS 8
>  #define BLKTAP_IOCTL_FREEINTF 9
>  #define BLKTAP_IOCTL_PRINT_IDXS 100
> +
> +#define PRIO_SPECIAL_IO -9999
>
>  /* blktap switching modes: (Set with BLKTAP_IOCTL_SETMODE) */
>  #define BLKTAP_MODE_PASSTHROUGH 0x00000000 /* default */
> diff -r df1e7ae878b4 -r 34c6a9a2983a patches/linux-2.6.16.33/blktap-ioprio.patch
> +++ b/patches/linux-2.6.16.33/blktap-ioprio.patch Tue Dec 19 10:12:58 2006 -0800
> @@ -0,0 +1,81 @@
> +diff -pruN ../orig-linux-2.6.16.33/include/linux/sched.h ./include/linux/sched.h
> +--- ../orig-linux-2.6.16.33/include/linux/sched.h 2006-12-18 18:42:00.000000000 -0800
> ++++ ./include/linux/sched.h 2006-12-18 18:46:07.000000000 -0800
> +@@ -706,6 +706,7 @@ struct task_struct {
> +     prio_array_t *array;
> +
> +     unsigned short ioprio;
> ++    short special_prio;
> +
> +     unsigned long sleep_avg;
> +     unsigned long long timestamp, last_ran;
> +diff -pruN ../orig-linux-2.6.16.33/include/linux/resource.h ./include/linux/resource.h
> +--- ../orig-linux-2.6.16.33/include/linux/resource.h 2006-12-18 18:42:00.000000000 -0800
> ++++ ./include/linux/resource.h 2006-12-18 18:44:35.000000000 -0800
> +@@ -44,6 +44,7 @@ struct rlimit {
> +
> + #define PRIO_MIN (-20)
> + #define PRIO_MAX 20
> ++#define PRIO_SPECIAL_IO -9999
> +
> + #define PRIO_PROCESS 0
> + #define PRIO_PGRP 1
> +diff -pruN ../orig-linux-2.6.16.33/include/linux/init_task.h ./include/linux/init_task.h
> +--- ../orig-linux-2.6.16.33/include/linux/init_task.h 2006-12-18 18:42:00.000000000 -0800
> ++++ ./include/linux/init_task.h 2006-12-18 18:45:56.000000000 -0800
> +@@ -85,6 +85,7 @@ extern struct group_info init_groups;
> +     .lock_depth = -1, \
> +     .prio = MAX_PRIO-20, \
> +     .static_prio = MAX_PRIO-20, \
> ++    .special_prio = 0, \
> +     .policy = SCHED_NORMAL, \
> +     .cpus_allowed = CPU_MASK_ALL, \
> +     .mm = NULL, \
> +diff -pruN ../orig-linux-2.6.16.33/kernel/sys.c ./kernel/sys.c
> +--- ../orig-linux-2.6.16.33/kernel/sys.c 2006-12-18 18:42:00.000000000 -0800
> ++++ ./kernel/sys.c 2006-12-18 18:43:30.000000000 -0800
> +@@ -245,6 +245,11 @@ static int set_one_prio(struct task_stru
> +         error = -EPERM;
> +         goto out;
> +     }
> ++    if (niceval == PRIO_SPECIAL_IO) {
> ++        p->special_prio = PRIO_SPECIAL_IO;
> ++        error = 0;
> ++        goto out;
> ++    }
> +     if (niceval < task_nice(p) && !can_nice(p, niceval)) {
> +         error = -EACCES;
> +         goto out;
> +@@ -272,10 +277,15 @@ asmlinkage long sys_setpriority(int whic
> +
> +     /* normalize: avoid signed division (rounding problems) */
> +     error = -ESRCH;
> +-    if (niceval < -20)
> +-        niceval = -20;
> +-    if (niceval > 19)
> +-        niceval = 19;
> ++    if (niceval == PRIO_SPECIAL_IO) {
> ++        if (which != PRIO_PROCESS)
> ++            return -EINVAL;
> ++    } else {
> ++        if (niceval < -20)
> ++            niceval = -20;
> ++        if (niceval > 19)
> ++            niceval = 19;
> ++    }
> +
> +     read_lock(&tasklist_lock);
> +     switch (which) {
> +diff -pruN ../orig-linux-2.6.16.33/mm/page-writeback.c ./mm/page-writeback.c
> +--- ../orig-linux-2.6.16.33/mm/page-writeback.c 2006-12-19 10:03:59.000000000 -0800
> ++++ ./mm/page-writeback.c 2006-12-19 10:04:17.000000000 -0800
> +@@ -231,6 +231,9 @@ static void balance_dirty_pages(struct a
> +             pages_written += write_chunk - wbc.nr_to_write;
> +             if (pages_written >= write_chunk)
> +                 break; /* We've done our duty */
> ++            if (current->special_prio == PRIO_SPECIAL_IO)
> ++                break; /* Exempt IO processes */
> ++
> +         }
> +         blk_congestion_wait(WRITE, HZ/10);
> +     }
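
The argument handling that the quoted patch adds to sys_setpriority can be tried out in userspace. The following is a minimal sketch, not the kernel code itself: model_normalize_niceval is a hypothetical helper name, PRIO_SPECIAL_IO is defined locally with the value from the patch, and PRIO_PROCESS is taken from the standard <sys/resource.h>. It shows that the sentinel value bypasses the usual clamp to [-20, 19] and is only accepted for a single process.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

/* Sentinel value copied from the patch; not defined by stock headers. */
#define PRIO_SPECIAL_IO -9999

/*
 * Hypothetical userspace model of the niceval handling in the patched
 * sys_setpriority(). Returns 0 on success or -EINVAL, and may clamp
 * *niceval in place.
 */
static int model_normalize_niceval(int which, int *niceval)
{
    if (*niceval == PRIO_SPECIAL_IO) {
        /* The special marker may only target one process. */
        if (which != PRIO_PROCESS)
            return -EINVAL;
        return 0; /* passed through unclamped */
    }

    /* Ordinary nice values are clamped as in the unpatched kernel. */
    if (*niceval < -20)
        *niceval = -20;
    if (*niceval > 19)
        *niceval = 19;
    return 0;
}
```

On a kernel without the patch there is no such special case, so a value of -9999 would simply be clamped to -20 like any other out-of-range nice value.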