Hi,

Qemu-dm can be blocked in an fsync system call for over 3 seconds when dom0
is overloaded. This causes SMP Windows 2008 to crash with Bug Check 0x101,
which indicates that an expected clock interrupt on a secondary processor,
in a multi-processor system, was not received within the allocated interval.

It can easily be reproduced with the following modification:

diff -r 76c9cf11ce23 tools/ioemu/block-raw.c
--- a/tools/ioemu/block-raw.c	Fri Mar 21 09:45:34 2008 +0000
+++ b/tools/ioemu/block-raw.c	Mon Mar 24 16:28:16 2008 +0900
@@ -603,6 +603,11 @@ static void raw_flush(BlockDriverState *
 static void raw_flush(BlockDriverState *bs)
 {
     BDRVRawState *s = bs->opaque;
+#if 1 /* reproduce Windows Bug Check 0x101 */
+    extern int send_vcpu;
+    if (send_vcpu != 0)
+        sleep(4);
+#endif
     fsync(s->fd);
 }

An attached patch fixes it. However I think the root cause is
that a timer event can't interrupt an i/o emulation.
How should we fix it?

Thanks,
Kouya
Kouya Shimura, on Mon 24 Mar 2008 18:53:18 +0900, wrote:
> An attached patch fixes it. However I think the root cause is
> that a timer event can't interrupt an i/o emulation.

The way qemu is designed wouldn't permit that anyway.

> How should we fix it?

We need to change the behavior of the flush operation to make it
asynchronous, so that things can continue while the host OS is syncing,
and eventually the SCSI or IDE layer will report the completion of the
flush.

> +static void aio_fsync_cb(void *opague, int ret)
> +{
> +}
>  #endif
>
>  static void raw_close(BlockDriverState *bs)
> @@ -602,8 +606,20 @@ static int raw_create(const char *filena
>
>  static void raw_flush(BlockDriverState *bs)
>  {
> +#ifdef NO_AIO
>      BDRVRawState *s = bs->opaque;
>      fsync(s->fd);
> +#else
> +    RawAIOCB *acb;
> +
> +    acb = raw_aio_setup(bs, 0, NULL, 0, aio_fsync_cb, NULL);
> +    if (!acb)
> +        return;
> +    if (aio_fsync(O_SYNC, &acb->aiocb) < 0) {
> +        qemu_aio_release(acb);
> +        return;
> +    }
> +#endif
>  }

That's not correct: callers of bdrv_flush() assume that when it returns,
data _is_ on the disk. We need to change that assumption, so that your
code becomes correct (and reports asynchronous completion from
aio_fsync_cb).

Samuel
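As a standalone illustration of the idea (plain POSIX AIO, not qemu code, and
only a sketch): aio_fsync() queues the flush and returns immediately, so the
caller can keep running and observe completion later instead of blocking.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    (void)write(fd, "data\n", 5);

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;   /* we will poll instead */

    /* Queue the flush; this returns straight away. */
    if (aio_fsync(O_SYNC, &cb) < 0) { perror("aio_fsync"); return 1; }

    /* The caller (think: the emulated guest) keeps executing here while
     * the kernel syncs; completion is observed asynchronously. */
    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);

    printf("flush completed, status=%d\n", (int)aio_return(&cb));
    close(fd);
    return 0;
}
```

(Compile with -lrt on older glibc.)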
On 24/3/08 09:53, "Kouya Shimura" <kouya@jp.fujitsu.com> wrote:

> An attached patch fixes it. However I think the root cause is
> that a timer event can't interrupt an i/o emulation.
> How should we fix it?

The attached patch also makes the fsync asynchronous, which seems a bit
dangerous if we acknowledge the flush to the guest in the meantime. If
whatever happens on return from raw_flush() could be deferred to
aio_fsync_cb() then this might be a good fix.

In general timer events *cannot* interrupt i/o emulation, as we cannot
really take an interrupt in the middle of executing an instruction. That
would be a mess! However, in this case I guess the IDE/SCSI device model
could allow asynchronous reporting of flush completion to the guest, and
this new aspect of the device model would obviously tie into your patch,
being done on aio_fsync_cb().

 -- Keir
Keir Fraser writes:
> However in this case I guess the IDE/SCSI device model could allow
> asynchronous reporting of flush completion to the guest, and this new aspect
> of the device model would obviously tie into your patch, being done on
> aio_fsync_cb().

Here is a revised patch. Is this a good fix?

I'm not an expert on SCSI, so please excuse that the SCSI side is not
implemented yet. (I tried, but it doesn't work well.)

Thanks,
Kouya

Signed-off-by: Kouya Shimura <kouya@jp.fujitsu.com>
Certainly it looks better. What do you think, Samuel?

 -- Keir

On 25/3/08 10:54, "Kouya Shimura" <kouya@jp.fujitsu.com> wrote:

> Keir Fraser writes:
>> However in this case I guess the IDE/SCSI device model could allow
>> asynchronous reporting of flush completion to the guest, and this new aspect
>> of the device model would obviously tie into your patch, being done on
>> aio_fsync_cb().
>
> Here is a revised patch. Is this a good fix?
>
> I'm not an expert on SCSI, so please excuse that the SCSI side is not
> implemented yet. (I tried, but it doesn't work well.)
>
> Thanks,
> Kouya
>
> Signed-off-by: Kouya Shimura <kouya@jp.fujitsu.com>
>
> diff -r 76c9cf11ce23 tools/ioemu/block-raw.c
> --- a/tools/ioemu/block-raw.c	Fri Mar 21 09:45:34 2008 +0000
> +++ b/tools/ioemu/block-raw.c	Tue Mar 25 19:40:42 2008 +0900
> @@ -606,6 +606,18 @@ static void raw_flush(BlockDriverState *
>      fsync(s->fd);
>  }
>
> +void bdrv_aio_flush(BlockDriverState *bs,
> +                    BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    RawAIOCB *acb;
> +
> +    acb = raw_aio_setup(bs, 0, NULL, 0, cb, opaque);
> +    if (!acb)
> +        return;
> +    if (aio_fsync(O_SYNC, &acb->aiocb) < 0)
> +        qemu_aio_release(acb);
> +}
> +
>  BlockDriver bdrv_raw = {
>      "raw",
>      sizeof(BDRVRawState),
> diff -r 76c9cf11ce23 tools/ioemu/hw/ide.c
> --- a/tools/ioemu/hw/ide.c	Fri Mar 21 09:45:34 2008 +0000
> +++ b/tools/ioemu/hw/ide.c	Tue Mar 25 19:40:42 2008 +0900
> @@ -1738,6 +1738,13 @@ static void ide_clear_hob(IDEState *ide_
>      ide_if[1].select &= ~(1 << 7);
>  }
>
> +static void ide_flush_cb(void *opaque, int ret)
> +{
> +    IDEState *s = (IDEState *)opaque;
> +    s->status = READY_STAT;
> +    ide_set_irq(s);
> +}
> +
>  static void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val)
>  {
>      IDEState *ide_if = opaque;
> @@ -1976,10 +1983,13 @@ static void ide_ioport_write(void *opaqu
>          break;
>      case WIN_FLUSH_CACHE:
>      case WIN_FLUSH_CACHE_EXT:
> -        if (s->bs)
> -            bdrv_flush(s->bs);
> -        s->status = READY_STAT;
> -        ide_set_irq(s);
> +        if (s->bs) {
> +            bdrv_aio_flush(s->bs, ide_flush_cb, s);
> +            s->status = BUSY_STAT;
> +        } else {
> +            s->status = READY_STAT;
> +            ide_set_irq(s);
> +        }
>          break;
>      case WIN_IDLEIMMEDIATE:
>      case WIN_STANDBY:
> diff -r 76c9cf11ce23 tools/ioemu/vl.h
> --- a/tools/ioemu/vl.h	Fri Mar 21 09:45:34 2008 +0000
> +++ b/tools/ioemu/vl.h	Tue Mar 25 19:40:42 2008 +0900
> @@ -653,6 +653,8 @@ BlockDriverAIOCB *bdrv_aio_write(BlockDr
>                                   const uint8_t *buf, int nb_sectors,
>                                   BlockDriverCompletionFunc *cb, void *opaque);
>  void bdrv_aio_cancel(BlockDriverAIOCB *acb);
> +void bdrv_aio_flush(BlockDriverState *bs,
> +                    BlockDriverCompletionFunc *cb, void *opaque);
>
>  void qemu_aio_init(void);
>  void qemu_aio_poll(void);
Kouya Shimura writes ("Re: [Xen-devel] Windows Bug Check 0x101 issue"):
> Here is a revised patch. Is this a good fix?

Sadly not. There are two problems:

One is that the qemu block driver abstraction does not support
asynchronous flush. To fix this problem, it is necessary to add a new
entrypoint for block drivers (and a new fixup wrapper in block.c to deal
with drivers which only support the synchronous entrypoint).

The other is that the IDE flush necessarily blocks. The IDE command being
used here is an IO write which instructs the disk to flush and which the
guest operating system expects to complete only when the flush is done.
Windows is evidently expecting to successfully receive a timer interrupt
on another CPU while the first one is blocked on the IO operation - but
as I understand it our emulation and threading model doesn't permit this.
As a workaround, perhaps we should provide a way for the Xen administrator
to turn this operation into a nonblocking lazy flush request?

> +void bdrv_aio_flush(BlockDriverState *bs,
> +                    BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    RawAIOCB *acb;
> +
> +    acb = raw_aio_setup(bs, 0, NULL, 0, cb, opaque);
> +    if (!acb)
> +        return;
> +    if (aio_fsync(O_SYNC, &acb->aiocb) < 0)
> +        qemu_aio_release(acb);
> +}

This is quite wrong, I'm afraid. You absolutely must not call through to
raw_aio_* functions from block.c in this way. This will break for all
block backends other than raw.

Ian.
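For illustration, one way the first point could be addressed is to make the
asynchronous flush a per-driver entrypoint rather than a raw-only global; a
rough sketch, with the bdrv_aio_flush hook in BlockDriver assumed rather than
taken from the actual tree:

```c
/* Sketch only: turn the raw driver's flush into a driver-private hook,
 * so non-raw backends (qcow, vmdk, ...) are never routed into raw-only
 * code.  The bdrv_aio_flush member of BlockDriver is an assumed addition. */
static BlockDriverAIOCB *raw_aio_flush(BlockDriverState *bs,
                                       BlockDriverCompletionFunc *cb,
                                       void *opaque)
{
    RawAIOCB *acb;

    acb = raw_aio_setup(bs, 0, NULL, 0, cb, opaque);
    if (!acb)
        return NULL;
    if (aio_fsync(O_SYNC, &acb->aiocb) < 0) {
        qemu_aio_release(acb);
        return NULL;
    }
    return &acb->common;
}
```

A generic wrapper in block.c could then dispatch to this hook when the driver
provides it, and fall back to the synchronous flush otherwise.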
Ian Jackson, on Tue 25 Mar 2008 11:28:32 +0000, wrote:
> The other is that the IDE flush necessarily blocks.

What do you mean by that? In a real machine, the processor doesn't
block while the flush is being done, and the OS just expects to see an
irq some time after. In that regard his patch should work fine.

That said it can't be applied as is because of the other points you
raised, of course.

Samuel
Samuel Thibault writes:
> Ian Jackson, on Tue 25 Mar 2008 11:28:32 +0000, wrote:
> > The other is that the IDE flush necessarily blocks.
>
> What do you mean by that? In a real machine, the processor doesn't
> block while the flush is being done, and the OS just expects to see an
> irq some time after. In that regard his patch should work fine.
>
> That said it can't be applied as is because of the other points you
> raised, of course.

Anyway, I remade the patch as you pointed out. Is it enough?

To be honest, I'm skeptical about the necessity of the flush for an
*emulated* IDE disk, as opposed to a shared SCSI disk.

Signed-off-by: Kouya Shimura <kouya@jp.fujitsu.com>
Kouya Shimura, on Wed 26 Mar 2008 16:07:26 +0900, wrote:
> To be honest, I'm skeptical about the necessity of the flush for an
> *emulated* IDE disk, as opposed to a shared SCSI disk.

Emulation doesn't mean that we can afford losing the data. If the guest
wants to have data flushed, it is because it needs the data to get to the
platters, so we need to enforce that.

Samuel
> > +static void ide_flush_cb(void *opaque, int ret)
> > +{
> > +    IDEState *s = opaque;
> > +
> > +    s->status = READY_STAT;
> > +    ide_set_irq(s);
>
> You need to check the return value (ret) and set an appropriate IDE
> error status if the operation failed. ide_abort_command may be of
> some use.

Flush errors are a bit hard to emulate. On a flush failing you have to
provide back an error state indicating the LBA block number (LBA28 or
LBA48 according to which flush command) of the specific failing block.
Calling the cache flush again restarts the flushing process, having
dropped that block.

Alan
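For illustration only, a rough sketch of what reporting the failing LBA back
through the task-file registers might look like in qemu's IDE model; the
field names follow the ide.c code quoted in this thread, the LBA28 encoding
is an assumption, and - as noted later in the thread - fsync gives us no such
block number in the first place:

```c
/* Sketch: report a flush failure with the LBA of the failing sector, so
 * the guest can drop that block and re-issue the flush.  Not actual code
 * from the tree; the exact ATA-7 semantics would need checking. */
static void ide_flush_error(IDEState *s, int64_t bad_sector)
{
    s->status = READY_STAT | ERR_STAT;
    s->error  = ABRT_ERR;

    /* LBA28-style encoding of the failing sector in the task file. */
    s->select = (s->select & 0xf0) | ((bad_sector >> 24) & 0x0f);
    s->hcyl   = (bad_sector >> 16) & 0xff;
    s->lcyl   = (bad_sector >> 8) & 0xff;
    s->sector = bad_sector & 0xff;

    ide_set_irq(s);
}
```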
Kouya Shimura writes ("Re: [Xen-devel] Windows Bug Check 0x101 issue"):
> Samuel Thibault writes:
> > Ian Jackson, on Tue 25 Mar 2008 11:28:32 +0000, wrote:
> > > The other is that the IDE flush necessarily blocks.
> >
> > What do you mean by that? In a real machine, the processor doesn't
> > block while the flush is being done, and the OS just expects to see an
> > irq some time after. In that regard his patch should work fine.

I must have misread the IDE spec I was reading.

> > That said it can't be applied as is because of the other points you
> > raised, of course.
>
> Anyway, I remade the patch as you pointed out. Is it enough?

This one is much better, but I still have a comment ...

> To be honest, I'm skeptical about the necessity of the flush for an
> *emulated* IDE disk, as opposed to a shared SCSI disk.

I'm not sure I follow. (What do you mean by `shared SCSI disk'? We're
discussing the IDE driver here ...)

> +BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
> +                                 BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    BlockDriver *drv = bs->drv;
> +
> +    if (!drv)
> +        return NULL;
> +    if (!drv->bdrv_aio_flush)
> +        return NULL;
> +
> +    return drv->bdrv_aio_flush(bs, cb, opaque);
> +}

This function should emulate the asynchronous operation with the
synchronous one: so it should call the synchronous ->bdrv_flush and then
call the callback. That will make it work like bdrv_aio_{read,write}.
(Although there is, I think, no need to replicate the baroque
bdrv_aio_{read,write}_em emulation methods; that's an overly-complex way
to do things.)

If you do that then callers who get NULL from bdrv_aio_flush will know
that it is actually an error. Your current code, as well as needing the
non-aio emulation at each call site, will try the fallback after errors
as well as when aio support is merely lacking.

> +static void ide_flush_cb(void *opaque, int ret)
> +{
> +    IDEState *s = opaque;
> +
> +    s->status = READY_STAT;
> +    ide_set_irq(s);

You need to check the return value (ret) and set an appropriate IDE
error status if the operation failed. ide_abort_command may be of
some use.

Ian.
Ian Jackson writes ("Re: [Xen-devel] Windows Bug Check 0x101 issue"):
> If you do that then callers who get NULL from bdrv_aio_flush will know
> that it is actually an error.

Sorry, rereading the code I see that that's not quite right. The
bdrv_aio_* routines return NULL to indicate that the callback has already
been done; any error is given to the callback to handle. That's what this
one should do; that part of the code should look quite like the innards
of bdrv_aio_read_em.

Ian.
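A minimal sketch of the wrapper as described here, assuming a new
bdrv_aio_flush hook in the BlockDriver structure (names and error values are
illustrative, not the final patch):

```c
/* Sketch: generic bdrv_aio_flush() in block.c.  If the driver provides an
 * asynchronous flush entrypoint, use it; otherwise emulate it with the
 * synchronous flush and report completion through the callback, returning
 * NULL to mean "the callback has already been done". */
BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
                                 BlockDriverCompletionFunc *cb, void *opaque)
{
    BlockDriver *drv = bs->drv;
    int ret = 0;

    if (drv && drv->bdrv_aio_flush)
        return drv->bdrv_aio_flush(bs, cb, opaque);

    /* Synchronous fallback: flush now, then tell the caller it is done.
     * Errors (here just "no driver") are passed to the callback too. */
    if (drv)
        bdrv_flush(bs);
    else
        ret = -1;

    cb(opaque, ret);
    return NULL;
}
```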
Alan Cox writes ("Re: [Xen-devel] Windows Bug Check 0x101 issue"):
> Flush errors are a bit hard to emulate. On a flush failing you have to
> provide back an error state indicating the LBA block number (LBA28 or
> LBA48 according to which flush command) of the specific failing block.
> Calling the cache flush again restarts the flushing process, having
> dropped that block.

Hrm. I think it would be better to return an endless sequence of errors
than to return no error at all. Since we don't know which block failed,
we could report an error flushing block 0, and then an error flushing
block 1, and so on?

Ian.
> Hrm. I think it would be better to return an endless sequence of
> errors than to return no error at all. Since we don't know which
> block failed, we could report an error flushing block 0, and then an
> error flushing block 1, and so on?

On a raid volume that will give wrong results and incorrect recovery
behaviour. You need to report correct information. In addition, a lot of
software treats repeated errors on flush as a device failure after a
certain count.

If a write to the underlying emulated media fails you know at least which
write failed, and usually the page that failed (as you get a short write).

Alan
Alan Cox writes ("Re: [Xen-devel] Windows Bug Check 0x101 issue"):
> On a raid volume that will give wrong results and incorrect recovery
> behaviour. You need to report correct information. In addition, a lot of
> software treats repeated errors on flush as a device failure after a
> certain count.
>
> If a write to the underlying emulated media fails you know at least which
> write failed, and usually the page that failed (as you get a short write).

Well, no, because in this case we're calling f(data)sync. So we get no
information from the dom0 kernel about what went wrong.

You're right about _writes_ rather than flush requests, if the drive
write cache is turned off. I think that reporting errors in those cases
is already done correctly now, including details of the block offsets,
but I haven't double-checked.

Ian.
Ian Jackson writes:
> This one is much better, but I still have a comment ...

Thank you for the many comments. Actually, I'm not good at qemu.
How about the attached patch?

> > +static void ide_flush_cb(void *opaque, int ret)
> > +{
> > +    IDEState *s = opaque;
> > +
> > +    s->status = READY_STAT;
> > +    ide_set_irq(s);
>
> You need to check the return value (ret) and set an appropriate IDE
> error status if the operation failed. ide_abort_command may be of
> some use.

This patch uses ide_abort_command(). As for the failure case, it looks
like too much for me. I hope someone who is an expert on IDE fixes it.

Thanks,
Kouya

Signed-off-by: Kouya Shimura <kouya@jp.fujitsu.com>
On Thu, 27 Mar 2008 14:20:32 +0900 Kouya Shimura <kouya@jp.fujitsu.com> wrote:
> Ian Jackson writes:
> > This one is much better, but I still have a comment ...
>
> Thank you for the many comments. Actually, I'm not good at qemu.
> How about the attached patch?
>
> > > +static void ide_flush_cb(void *opaque, int ret)
> > > +{
> > > +    IDEState *s = opaque;
> > > +
> > > +    s->status = READY_STAT;
> > > +    ide_set_irq(s);
> >
> > You need to check the return value (ret) and set an appropriate IDE
> > error status if the operation failed. ide_abort_command may be of
> > some use.
>
> This patch uses ide_abort_command(). As for the failure case, it looks
> like too much for me. I hope someone who is an expert on IDE fixes it.

If a flush fails you can't just reply with an abrt: as I said earlier,
many OSes will then loop forever trying to finish the flush. FLUSH CACHE
has unusual error reporting behaviour (see ATA-7). An abort means a cache
flush failure and a request to retry for the further sectors, so if you
can't properly emulate reporting block numbers back then you need to
offline the virtual device.

Alan
> having decided to violate it, perhaps the right answer is to make the
> disk simply vanish: ignore all future command writes, zero all of the
> registers that the host is attempting to read, not issue an interrupt,
> and somehow discard the responses from any overlapped IO which is in
> flight.

I think it is safest - and if the underlying storage fails, something bad
has happened and anything else we do would make it worse.

> Is that what you meant by `offline the virtual device'?

Just leave the busy bit set forever; the host will get fed up of waiting,
reset, rinse, repeat a few times, and (except for older Linux) then
offline the device. Older Linux (i.e. drivers/ide) has problems coping
with failed drives so will carry on spewing, but will limp along ok.

Alan
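For illustration, a minimal sketch of how that "leave BUSY set forever"
behaviour might look in the ide_flush_cb callback from the earlier revised
patch; this is only an assumed shape, not the actual patch discussed below:

```c
static void ide_flush_cb(void *opaque, int ret)
{
    IDEState *s = opaque;

    if (ret != 0) {
        /* Unrecoverable flush error, and we have no failing LBA to
         * report: leave BUSY set and never raise the completion IRQ,
         * so the guest eventually resets and offlines the device. */
        s->status = BUSY_STAT;
        return;
    }

    s->status = READY_STAT;
    ide_set_irq(s);
}
```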
Alan Cox writes ("Re: [Xen-devel] Windows Bug Check 0x101 issue"):
> [...] if you can't properly
> emulate reporting block numbers back then you need to offline the virtual
> device.

I've been reading the ATA-7 spec and I'm afraid it's not clear to me what
options are open to the emulation in this situation. Setting DF (device
fault) is still no good, as the spec clearly indicates that we should
still have correct LBA offsets and retry behaviour.

I can't see an approach that isn't a violation of the ATA spec. So having
decided to violate it, perhaps the right answer is to make the disk
simply vanish: ignore all future command writes, zero all of the
registers that the host is attempting to read, not issue an interrupt,
and somehow discard the responses from any overlapped IO which is in
flight.

Is that what you meant by `offline the virtual device'?

Ian.
Alan Cox writes ("Re: [Xen-devel] Windows Bug Check 0x101 issue"):
> Just leave the busy bit set forever; the host will get fed up of waiting,
> reset, rinse, repeat a few times, and (except for older Linux) then
> offline the device. Older Linux (i.e. drivers/ide) has problems coping
> with failed drives so will carry on spewing, but will limp along ok.

Thanks for the advice. I think the patch below does this. I have compiled
it, but I don't have an easy setup for testing it, and there may be some
side-effects on bizarre setups with emulated ide channels with slaves but
no masters. Comments welcome.

Before we apply this: Kouya Shimura, could you test it and let us know
if it solves your original problem?

Thanks,
Ian.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Modified-by: Ian Jackson <ian.jackson@eu.citrix.com>
Signed-off-by: Kouya Shimura <kouya@jp.fujitsu.com>
Ian Jackson writes:
> Before we apply this: Kouya Shimura, could you test it and let us know
> if it solves your original problem?

Many thanks. It works fine for me.

The BSOD with Bug Check 0x101 can be reproduced by running the iozone
benchmark on dom0. I've tried 10 times and never seen the BSOD with this
patch; before, the BSOD was sure to happen every time.

Moreover, I tried the aio_fsync failure case using a debugger. After
manually injecting a failure (forcing "ret=1"), I got a BSOD of STOP
0x000000F4, which is probably what we expect.

In addition to preventing the Windows BSOD, this patch improves the
interactivity of Windows remarkably. Without this patch, I/O load on dom0
affects the responsiveness of a guest Windows (the mouse pointer gets
stuck sometimes).

Thanks,
Kouya
Andi Kleen writes ("Re: Windows Bug Check 0x101 issue"):
> It would be certainly possible to implement kernel interfaces
> that would allow discovering the error easily. I guess aio could
> pass it somewhere

In principle, yes. But like your other proposals, I don't think that's
really a practical answer.

If such a kernel interface appears then we should use it. In the
meantime, people who want this kind of error reporting and recovery
should turn the emulated drive's cache off.

Ian.