thr3ads.net - Xen devel - Re: driver domain crash and reconnect handling [Jan 2013]

Paul Durrant

2013-Jan-24 16:57 UTC

Re: driver domain crash and reconnect handling

> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: 24 January 2013 15:11
> To: Zoltan Kiss
> Cc: George Shuklin; xen-api@lists.xen.org; Ian Campbell; Paul Durrant; Dave
> Scott; 'xen-devel@lists.xen.org'
> Subject: Re: [Xen-devel] driver domain crash and reconnect handling
> 
> On 24/01/13 15:01, Zoltan Kiss wrote:
> > On 24/01/13 14:06, George Shuklin wrote:
> >> 24.01.2013 17:25, Paul Durrant wrote:
> >>>> Some notes about guest suspend during IO.
> >>>>
> >>>> I tested that way for storage reboot (pause all domains, reboot
> >>>> ISCSI storage and resume every domain). If the pause is short (less
> >>>> than 2 minutes), the guest can survive. If the pause is longer than 2
> >>>> minutes, guests waiting for IO completion detect an IO timeout
> >>>> after resuming and raise IO errors on virtual block devices
> >>>> (PV).
> >>>>
> >>> To be clear here: do you mean you *paused* and then unpaused the
> >>> VMs, or *suspended* and then resumed the VMs? I suspect you mean the
> >>> former.
> >>>
> >>>     Paul
> >> Pause, of course. My bad.
> >>
> > If you did a suspend, the frontend driver would flush out disk IO
> > operations before the suspend is reached, and therefore there won't be
> > anything to time out after resume. However, if the storage driver
> > domain just crashed, I guess the guest would crash at suspend. Maybe
> > we can try out something to save the ring buffer requests, and replay
> > them once the backend comes back (but before resuming the guest). But
> > I'm not sure whether the guest would handle the timeouts after the
> > resume first, or cancel them if the requests were successfully
> > responded to.
> >
> > Zoli
> 
> Perhaps I am making this harder, but might it be best to wait for a short
> while (15-30 seconds) for the device driver domain to come back, and if it
> takes longer than that, pause the VM.
> 
> This way, if the driver domain is fast to come back, all the guest notices is
> transitorily blocked IO, and if the driver domain is too slow (but does come
> back), all the guest might notice is a pause.
> 
> Ultimately, if the driver domain never comes back, then we are in a no
> worse position than currently.
>
What do you mean by 'come back' here? If you're talking about the
same driver domain then fair enough. If you're talking about a new instance
then pausing or not pausing the VM is immaterial. Unless the frontends are
prodded to connect to the new backends (remembering that the xenstore paths have
the domid baked into them) then IO will block forever. In general you're
going to need to go through a full suspend/resume of the frontend to achieve
this, unless we write new frontend code to directly notice the change in the
backend (and distinguish it from an unplug) and reconnect automatically.
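
To make the domid point concrete, here is a rough sketch (not real toolstack code) of how the paths line up. It assumes the usual xenstore layout, where the frontend directory holds 'backend' and 'backend-id' keys and the backend directory lives under the backend domain's own domid. The helper names and the domids in the example are made up for illustration, and the dict stands in for values that would really be read from xenstore.

```python
# Sketch only: plain string handling, no xenstore access.
# Helper names and example domids are hypothetical.

def frontend_path(frontend_domid, devid, kind="vbd"):
    """Frontend directory for one PV device of a guest."""
    return f"/local/domain/{frontend_domid}/device/{kind}/{devid}"


def backend_path(backend_domid, frontend_domid, devid, kind="vbd"):
    """Backend directory; note the backend domid is part of the path."""
    return f"/local/domain/{backend_domid}/backend/{kind}/{frontend_domid}/{devid}"


def is_stale(frontend_keys, live_backend_domid):
    """True if the frontend still points at a backend domid that is gone.

    frontend_keys stands in for the 'backend' and 'backend-id' values
    found in the frontend directory.
    """
    return int(frontend_keys["backend-id"]) != live_backend_domid


# Guest 12 with device 51712, originally served by driver domain 5.
keys = {
    "backend-id": "5",
    "backend": backend_path(5, 12, 51712),
}

# The driver domain is restarted as a new instance with domid 9.
new_backend = backend_path(9, 12, 51712)

print(keys["backend"])    # path the frontend is still watching
print(new_backend)        # path where the new backend actually lives
print(is_stale(keys, 9))  # frontend must be told to reconnect
```

Nothing in the frontend's view changes by itself when the new instance appears, which is why pausing alone does not help: something has to rewrite those keys and make the frontend go through the connect handshake again.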

  Paul
 
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
