Hi,

I don't see any obvious flush to disk taking place for vbds on the source host in XendCheckpoint.py before the domain is started on the new host. Is there a guarantee that all written data is on disk somewhere else, or is something needed?

Thanks,

John Byrne
It's slightly more than a flush that's required. The migration protocol needs to be extended so that execution on the target host doesn't start until all of the outstanding (i.e. issued by the backend) block requests have been either cancelled or acknowledged. This should be pretty straightforward, given that the backend driver refcounts a blkif's state based on pending requests and won't tear down the backend directory in xenstore until all the outstanding requests have cleared. All that is likely required is to have the migration code register watches on the backend vbd directories and wait for them to disappear before giving the all-clear to the new host.

We've talked about this enough to know how to fix it, but haven't had a chance to hack it up. (I think Julian has looked into the problem a bit for blktap, but not yet done a general fix.) Patches would certainly be welcome though. ;)

a.

On 7/31/06, John Byrne <john.l.byrne@hp.com> wrote:
> Is there a guarantee that all written data is on disk somewhere
> else or is something needed?
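[For concreteness, a minimal sketch of the wait described above. It polls rather than registering a xenstore watch, assumes the xenstore-ls command-line tool is available in dom0 and exits nonzero when the requested path does not exist, and assumes the usual backend layout of /local/domain/0/backend/vbd/<domid>; the function names are illustrative and not part of xend.]

    # Sketch: poll xenstore until blkback has torn down every vbd backend
    # directory for the migrating domain, then (and only then) allow the
    # migration code to give the all-clear to the new host.
    import os
    import subprocess
    import time

    DEVNULL = open(os.devnull, "w")

    def backend_vbds_gone(domid, backend_domid=0):
        """True once no vbd backend directory is left in xenstore for domid."""
        path = "/local/domain/%d/backend/vbd/%d" % (backend_domid, domid)
        rc = subprocess.call(["xenstore-ls", path],
                             stdout=DEVNULL, stderr=DEVNULL)
        return rc != 0  # assumed: xenstore-ls fails once the directory is gone

    def wait_for_vbd_teardown(domid, timeout=30.0, poll=0.1):
        """Block until the source host's blkback has released the domain's vbds."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if backend_vbds_gone(domid):
                return True
            time.sleep(poll)
        return False  # caller should not give the all-clear to the new host

A watch-based version would avoid the polling loop, but the ordering constraint is the same: the destination host must not be unpaused until this wait succeeds.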
It would be a bit ugly, but mostly straightforward, to watch for the destruction of the vbds (or all devices) after the destroyDomain() is done and then send an all-clear. (The last time I looked there wasn't a waitForDomainDestroy() anywhere, so it would probably be best to write one.) This would guarantee correctness, which is the most important thing.

The problem I see with that strategy is the effect on downtime during a live move. Ideally you'd like to start the vbd cleanup when the final suspend is done and hope to parallelize any final device operations with the final pass of the live move. How to do that, play nice with domain destruction on the normal path, and handle errors seems a lot less clear to me.

So, are you just ignoring the notion of minimizing downtime for the moment, or is there something I'm missing?

John

Andrew Warfield wrote:
> All that is likely required is to have the migration code register
> watches on the backend vbd directories, and wait for them to
> disappear before giving the all-clear to the new host.
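[A hedged sketch of the overlap being asked for here, showing only the ordering: kick off the device wait in a thread as soon as the final suspend completes, push the last pass of dirty pages, and join the wait before giving the all-clear. send_last_pass, give_all_clear and wait_for_vbd_teardown are hypothetical stand-ins for the real migration steps, not existing xend functions.]

    # Sketch: overlap vbd teardown on the source with the final
    # stop-and-copy pass so the wait adds little to the downtime.
    import threading

    def migrate_final_phase(domid, send_last_pass, give_all_clear,
                            wait_for_vbd_teardown):
        result = {}

        def waiter():
            result["clean"] = wait_for_vbd_teardown(domid)

        # Start waiting for the devices as soon as the domain is suspended...
        t = threading.Thread(target=waiter)
        t.start()

        # ...while the last round of dirty pages goes over the wire.
        send_last_pass()

        # Only give the destination the all-clear once both have finished.
        t.join()
        if result.get("clean"):
            give_all_clear()
        else:
            raise RuntimeError("vbd backends did not shut down cleanly")

Error handling on the normal destruction path is exactly the part this sketch glosses over, which is the open question in the message above.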
> So, are you just ignoring the notion of minimizing downtime for the
> moment or is there something I'm missing?

That's exactly what I'm suggesting. The current risk is a (very slim) write-after-write error case. Basically, you have a number of in-flight write requests on the original machine that are somewhere between the backend and the physical disk at the time of migration. Currently, you migrate and the shadow request ring reissues these on the new host -- which is the right thing to do given that requests are idempotent. The problem is that the original in-flight requests can still hit the disk some time later and cause problems. The WAW is if you write an update to a block that had an in-flight request immediately on arriving at the new host, and it then gets overwritten by the original request.

Note that for sane block devices this is extremely unlikely, as the aperture we are talking about is basically whatever is in the disk's request queue -- it's only really a problem for things like NFS+loopback and other instances of buffered I/O behind blockback (which is generally a really bad idea!) where you could see a large window of outstanding requests that haven't actually hit the disk. These situations probably need more than just waiting for blkback to clear pending reqs, as loopback will acknowledge requests before they hit the disk in some cases.

So, I think the short-term correctness-preserving approach is to (a) modify the migration process to add an interlock on block backends on the source physical machine to go to a closed state -- indicating that all the outstanding requests have cleared -- and (b) not to use loopback, or buffered I/O generally, behind blkback when you intend to do migration. The blktap code in the tree is much safer for this sort of thing and we're happy to sort out migration problems if/when they come up.

If this winds up adding a big overhead to migration switching time (I don't think it should; block shutdown can be parallelized with the stop-and-copy round of migration -- you'll be busy transferring all the dirty pages that you've queued for DMA anyway) we can probably speed it up. One option would be to look into whether the Linux block layer will let you abort submitted requests. Another would be to modify the block frontend driver to realize that it's just been migrated and queue all requests to blocks that were in its shadow ring until it receives notification that those writes have cleared from the original host. As you point out -- these are probably best left as a second step. ;)

I'd be interested to know if anyone on the list is solving this sort of thing already using some sort of storage fencing fanciness to just sink any pending requests on the original host after migration has happened.

a.
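[A sketch of the interlock in (a), under assumptions similar to the earlier sketch: the xenstore-list and xenstore-read CLI tools are available, the backends live under /local/domain/0/backend/vbd/<domid>, and the xenbus Closed state is the value 6. The helper names are illustrative, not xend's.]

    # Sketch: hold the all-clear to the destination host until every vbd
    # backend for the migrating domain reports the xenbus Closed state,
    # i.e. blkback has finished or cancelled all outstanding requests.
    import subprocess
    import time

    XENBUS_CLOSED = "6"  # assumed value of XenbusStateClosed

    def vbd_devids(domid):
        """Device ids of the vbd backends the source host still holds for domid."""
        path = "/local/domain/0/backend/vbd/%d" % domid
        try:
            out = subprocess.check_output(["xenstore-list", path])
        except subprocess.CalledProcessError:
            return []  # directory already gone: nothing left to wait for
        return out.decode().split()

    def all_vbds_closed(domid):
        """True when every remaining vbd backend reports the Closed state."""
        for devid in vbd_devids(domid):
            state_path = "/local/domain/0/backend/vbd/%d/%s/state" % (domid, devid)
            try:
                state = subprocess.check_output(["xenstore-read", state_path])
            except subprocess.CalledProcessError:
                continue  # state node removed: treat as closed
            if state.decode().strip() != XENBUS_CLOSED:
                return False
        return True

    def block_interlock(domid, timeout=30.0):
        """Return True only when it is safe to give the destination the all-clear."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if all_vbds_closed(domid):
                return True
            time.sleep(0.1)
        return False

As the message notes, this only covers what blkback itself has seen; with loopback or other buffered I/O behind the backend, reaching Closed still does not guarantee the data is on stable storage.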
I've got a patch in our tree that does (basically) what John is describing. The exact bug we hit was that "xm shutdown -w vm" did not wait until the vbds were cleared out before returning. So now I wait until the backend/vbd nodes go away before returning.

This could probably be done more cleanly with watches, and should be abstracted out to be sure it applies equally to migration, and so forth. But for the sake of discussion, the patch is attached.

-Charles

>>> On Mon, Jul 31, 2006 at 4:26 PM, in message <44CE83B1.1090605@hp.com>, John Byrne <john.l.byrne@hp.com> wrote:
> It would be a bit ugly, but mostly straightforward to watch for the
> destruction of the vbds (or all devices) after the destroyDomain() is
> done and then send an all-clear. (The last time I looked there wasn't
> a waitForDomainDestroy() anywhere, so it would probably be best to
> write one.) This would guarantee correctness: which is the most
> important thing.