We are seeing a disk corruption problem when migrating a VM between two nodes that are both active writers of a shared storage block device. The corruption appears to be caused by a lack of synchronization between the migration source and destination regarding outstanding block write requests.

The failing scenario is as follows:

1) The VM has block write A in progress on the source node X at the time it is being migrated.

2) The blkfront driver requeues A on the destination node Y after migration. Request A gets completed immediately, because the shared storage already has a request in flight to the same block (from X), so it ignores the new request.

3) New block write request A' is made from Y, now that the VM is running, to the same block number as A. Request A' gets completed immediately for the same reason as in #2.

The corruption we are seeing is that the block contains the data A, not A' as the VM expects. The problem is that the shared storage doesn't guarantee the outcome of the concurrent writes X->A and Y->A. It is choosing to ignore and immediately complete the second request, which I understand is one of the acceptable strategies for managing concurrent writes to the same block. That behavior is fine when the redundant request A is being ignored, but when the new request A' occurs, we get corruption.

The problem only shows up under heavy disk load (e.g. the Bonnie benchmark) while migrating, so most users probably haven't seen it. If I understand this correctly, though, it could affect anyone using shared block storage with dual active writers and live migration. When we run with a single active writer and then move the active writer to the destination node, all outstanding requests get flushed in the background and we don't see this problem.

The blkfront xenbus_driver doesn't have a "suspend" method. I was going to add one to flush the outstanding requests from the migration source to fix the problem (rough sketch in the P.S. below). Or maybe we can cancel all outstanding I/O requests to eliminate the concurrency between the two nodes. Does the Linux block I/O interface allow the canceling of requests?

Anyone else seeing this problem? Any other ideas for solutions?

Thanks,
Jeff
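P.S. For concreteness, here is the rough shape of the suspend hook I have in mind. This is only a sketch: blkfront_suspend() itself, blkif_stop_issuing(), blkif_requests_outstanding() and the drain_wq field are placeholder names I made up for illustration, not existing blkfront code.

#include <linux/wait.h>
#include <xen/xenbus.h>

/* Placeholder helpers -- these do not exist in blkfront today. */
static void blkif_stop_issuing(struct blkfront_info *info);
static int blkif_requests_outstanding(struct blkfront_info *info);

static int blkfront_suspend(struct xenbus_device *dev)
{
        struct blkfront_info *info = dev->dev.driver_data;

        /* Stop putting new requests on the shared ring. */
        blkif_stop_issuing(info);

        /*
         * Wait until the backend has responded to every request already
         * issued.  drain_wq would be a new wait queue added to
         * blkfront_info, woken from the response handler.
         */
        wait_event(info->drain_wq, blkif_requests_outstanding(info) == 0);

        return 0;
}

The idea is that the .suspend field of the blkfront xenbus_driver would point at this, so the save path drains the ring on the source before the domain is resumed on the destination.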
Ian Pratt
2006-Aug-21 19:58 UTC
RE: [Xen-devel] Shared disk corruption caused by migration
> The blkfront xenbus_driver doesn't have a "suspend" method. I was going to
> add one to flush the outstanding requests from the migration source to fix
> the problem. Or maybe we can cancel all outstanding I/O requests to
> eliminate the concurrency between the two nodes. Does the Linux block I/O
> interface allow the canceling of requests?
>
> Anyone else seeing this problem? Any other ideas for solutions?

There's already work in progress on this.

The simplest thing to do is to wait until the backend queues are empty before signalling the destination host to unpause the relocated domain. However, this would add to migration downtime. It would be nice if we could quickly cancel the IOs queued at the original host, but Linux doesn't have a good mechanism for this.

For targets that support fencing it's possible to quickly and synchronously fence the original host. For other targets, we need to be a bit cunning to minimize downtime: we can actually start running the VM on the destination host before we've had the 'all queues empty' message from the source host. We just have to be careful to make sure that we don't issue any writes to blocks that also potentially still have writes pending on them in the source host. If such a write occurs, we have to stall issuing of the write until we receive the 'all queues empty' from the source host. However, such conflicting writes are actually pretty unusual, so the majority of relocations won't incur the stall.

Stay tuned for a patch.

Ian
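P.S. To make the conflict-tracking idea a little more concrete, here is a rough sketch of the destination-side bookkeeping. All names below (pending_conflict_state, sectors_overlap_pending() and so on) are illustrative only and are not taken from the forthcoming patch.

#include <linux/rbtree.h>
#include <linux/types.h>
#include <linux/wait.h>

/*
 * Sectors that still had writes in flight on the source when the domain
 * was resumed here, populated from the migration metadata.
 */
struct pending_conflict_state {
        struct rb_root          pending;        /* sectors pending on the source */
        bool                    source_drained; /* 'all queues empty' received?  */
        wait_queue_head_t       drain_wq;
};

/* Placeholder: returns true if [sector, sector + nr) overlaps 'pending'. */
static bool sectors_overlap_pending(struct pending_conflict_state *st,
                                    sector_t sector, unsigned int nr);

/* Called on the destination's write issue path (locking omitted). */
static void maybe_stall_write(struct pending_conflict_state *st,
                              sector_t sector, unsigned int nr)
{
        if (st->source_drained)
                return;         /* source already flushed: common fast path */

        if (!sectors_overlap_pending(st, sector, nr))
                return;         /* no conflict with the source: issue now */

        /* Conflicting write: stall until the source reports its queues empty. */
        wait_event(st->drain_wq, st->source_drained);
}

/* Called when the 'all queues empty' message arrives from the source. */
static void source_queues_empty(struct pending_conflict_state *st)
{
        st->source_drained = true;
        wake_up_all(&st->drain_wq);
}

Since the overlap check almost never fires, the common case pays only the cost of the lookup, which is why most relocations should see no added downtime.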