Alex Bligh
2013-Mar-18 12:18 UTC
[PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

Due to what is almost certainly a kernel bug, writes with O_DIRECT may continue to reference the page after the write has been marked as completed, particularly in the case of TCP retransmit. In other scenarios, this "merely" risks data corruption on the write, but with Xen, pages from domU are only transiently mapped into dom0's memory, resulting in kernel panics when they are subsequently accessed.

This brings PV devices in line with emulated devices. Removing O_DIRECT is safe as barrier operations are now correctly passed through.

See:
http://lists.xen.org/archives/html/xen-devel/2012-12/msg01154.html
for more details.

This patch has already been applied to the xenbits.org qemu-upstream repository:
http://xenbits.xen.org/gitweb/?p=qemu-upstream-unstable.git;a=commit;h=f3903bbac78a81fcbce1350cdce860764a62783a
Clearly it should go into qemu's own repository.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Alex Bligh <alex@alex.org.uk>
---
 hw/xen_disk.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hw/xen_disk.c b/hw/xen_disk.c
index a402ac8..14f8723 100644
--- a/hw/xen_disk.c
+++ b/hw/xen_disk.c
@@ -603,7 +603,7 @@ static int blk_init(struct XenDevice *xendev)
     }
 
     /* read-only ? */
-    qflags = BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
+    qflags = BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
     if (strcmp(blkdev->mode, "w") == 0) {
         qflags |= BDRV_O_RDWR;
     } else {
--
1.7.4.1
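For readers less familiar with the QEMU block layer, the practical effect of the one-line change is roughly the following. This is a simplified, illustrative sketch of how the BDRV_O_* flags relate to open(2) flags and flush behaviour; it is not the actual QEMU code (which lives in block/raw-posix.c and the device models) and the helper name is made up:

```c
#include <fcntl.h>

/* Illustrative sketch only -- not the real QEMU translation code.
 * BDRV_O_* are QEMU's internal block-layer flags. */
static int bdrv_flags_to_open_flags(int bdrv_flags)
{
    int open_flags = (bdrv_flags & BDRV_O_RDWR) ? O_RDWR : O_RDONLY;

    if (bdrv_flags & BDRV_O_NOCACHE) {
        /* Bypass the host page cache.  This is the flag the patch drops:
         * with O_DIRECT, the kernel may still reference the guest page
         * (e.g. for a TCP retransmit of an NFS write) after the I/O has
         * been reported complete, by which time dom0 has unmapped it. */
        open_flags |= O_DIRECT;
    }

    /* BDRV_O_CACHE_WB keeps writeback semantics: writes may sit in the
     * page cache, and durability relies on the guest's flush/barrier
     * requests being turned into fsync()/fdatasync() -- which xen_disk
     * now passes through correctly. */

    return open_flags;
}
```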
Stefan Hajnoczi
2013-Mar-18 13:03 UTC
Re: [Qemu-devel] [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On Mon, Mar 18, 2013 at 12:18:43PM +0000, Alex Bligh wrote:
> Due to what is almost certainly a kernel bug, writes with O_DIRECT may continue to reference the page after the write has been marked as completed, particularly in the case of TCP retransmit. In other scenarios, this "merely" risks data corruption on the write, but with Xen, pages from domU are only transiently mapped into dom0's memory, resulting in kernel panics when they are subsequently accessed.
>
> This brings PV devices in line with emulated devices. Removing O_DIRECT is safe as barrier operations are now correctly passed through.
>
> See:
> http://lists.xen.org/archives/html/xen-devel/2012-12/msg01154.html
> for more details.

From the mailing list discussion it appears that this patch is a workaround - using the dom0 page cache to avoid the failed host kernel paging request, which is caused by the true bug.

Has any progress been made at understanding the true problem?

> This patch has already been applied to the xenbits.org qemu-upstream repository:
> http://xenbits.xen.org/gitweb/?p=qemu-upstream-unstable.git;a=commit;h=f3903bbac78a81fcbce1350cdce860764a62783a
> Clearly it should go into qemu's own repository.
>
> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Signed-off-by: Alex Bligh <alex@alex.org.uk>
> ---
>  hw/xen_disk.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/hw/xen_disk.c b/hw/xen_disk.c
> index a402ac8..14f8723 100644
> --- a/hw/xen_disk.c
> +++ b/hw/xen_disk.c
> @@ -603,7 +603,7 @@ static int blk_init(struct XenDevice *xendev)
>      }
>
>      /* read-only ? */
> -    qflags = BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
> +    qflags = BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
>      if (strcmp(blkdev->mode, "w") == 0) {
>          qflags |= BDRV_O_RDWR;
>      } else {
> --
> 1.7.4.1
Alex Bligh
2013-Mar-18 13:19 UTC
Re: [Qemu-devel] [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

Stefan,

--On 18 March 2013 14:03:49 +0100 Stefan Hajnoczi <stefanha@gmail.com> wrote:

> From the mailing list discussion it appears that this patch is a workaround - using the dom0 page cache to avoid the failed host kernel paging request, which is caused by the true bug.
>
> Has any progress been made at understanding the true problem?

It certainly is a workaround. My understanding is that ANY write with O_DIRECT turned on can write out data that was written to the page after the O_DIRECT write was marked as complete, if a tcp retransmit (and various other skb related things) happens.

This thread is shorter than the one on xen-devel if you want to follow the history and the explanation:
http://comments.gmane.org/gmane.linux.nfs/54325

xen is particularly affected as the page in question is paged out of dom0 when the access happens. However, anything using O_DIRECT I/O to any form of network device (NFS, iSCSI, DRBD) is by my analysis vulnerable to writing corrupt data.

Mel Gorman kindly forward ported (but not to tip) Ian Campbell's fragment tracking patch, and I sent it to netdev here:
http://marc.info/?l=linux-netdev&m=135912467817630

Given this was originally raised as an issue in 2008, and probably has been an issue 'forever', I think it would be fair to say there has not been an enormous amount of interest in fixing the underlying problem.

--
Alex Bligh
Paolo Bonzini
2013-Mar-18 13:32 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 18/03/2013 13:18, Alex Bligh wrote:
> Due to what is almost certainly a kernel bug, writes with O_DIRECT may continue to reference the page after the write has been marked as completed, particularly in the case of TCP retransmit. In other scenarios, this "merely" risks data corruption on the write, but with Xen, pages from domU are only transiently mapped into dom0's memory, resulting in kernel panics when they are subsequently accessed.
>
> This brings PV devices in line with emulated devices. Removing O_DIRECT is safe as barrier operations are now correctly passed through.
>
> See:
> http://lists.xen.org/archives/html/xen-devel/2012-12/msg01154.html
> for more details.
>
> This patch has already been applied to the xenbits.org qemu-upstream repository:
> http://xenbits.xen.org/gitweb/?p=qemu-upstream-unstable.git;a=commit;h=f3903bbac78a81fcbce1350cdce860764a62783a
> Clearly it should go into qemu's own repository.
>
> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Signed-off-by: Alex Bligh <alex@alex.org.uk>
> ---
>  hw/xen_disk.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/hw/xen_disk.c b/hw/xen_disk.c
> index a402ac8..14f8723 100644
> --- a/hw/xen_disk.c
> +++ b/hw/xen_disk.c
> @@ -603,7 +603,7 @@ static int blk_init(struct XenDevice *xendev)
>      }
>
>      /* read-only ? */
> -    qflags = BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
> +    qflags = BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;
>      if (strcmp(blkdev->mode, "w") == 0) {
>          qflags |= BDRV_O_RDWR;
>      } else {

How does migration work with this change? The target may read stale data from the page cache. Fix the kernel bug please.

Paolo
Alex Bligh
2013-Mar-18 13:49 UTC
Re: [Qemu-devel] [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

--On 18 March 2013 14:32:23 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

> How does migration work with this change? The target may read stale data from the page cache. Fix the kernel bug please.

Presumably the same way as if writeback caching is selected. I presume that must fsync() / fdatasync() all the data to disk, and a barrier will produce one of those.

It would be great to fix the kernel bug (and I have submitted code), but the fix is pretty intrusive (see the link I posted) and there appears to be little interest in taking it forward. Certainly my kernel hacking skills are not adequate to the task.

The current position is that booting a Xen domU which does disk I/O (Ubuntu cloud image used as the test case) with an NFS root crashes dom0 absolutely repeatably, and kills all other guests. Unless and until there is a kernel fix for that, Xen is in essence unusable with HVM and network based disk backend. So we need a workaround in the meantime which doesn't require a kernel fix.

--
Alex Bligh
Paolo Bonzini
2013-Mar-18 14:05 UTC
Re: [Qemu-devel] [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 18/03/2013 14:49, Alex Bligh wrote:
>
> --On 18 March 2013 14:32:23 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> How does migration work with this change? The target may read stale data from the page cache. Fix the kernel bug please.
>
> Presumably the same way as if writeback caching is selected. I presume that must fsync() / fdatasync() all the data to disk, and a barrier will produce one of those.

No, that's done already. The source does an fsync/fdatasync before terminating the migration.

The problem is that the target's page cache might host image data from a previous run. If you do not use O_DIRECT, it will not see the changes made on the source.

> It would be great to fix the kernel bug (and I have submitted code), but the fix is pretty intrusive (see the link I posted) and there appears to be little interest in taking it forward. Certainly my kernel hacking skills are not adequate to the task.
>
> The current position is that booting a Xen domU which does disk I/O (Ubuntu cloud image used as the test case) with an NFS root crashes dom0 absolutely repeatably, and kills all other guests. Unless and until there is a kernel fix for that, Xen is in essence unusable with HVM and network based disk backend. So we need a workaround in the meantime which doesn't require a kernel fix.

If you want to have this patch, you need to detect the bug and only do the hack if the bug is detected. Plus, disable migration when the hack is in use.

Also:

1) why does blkback not have the bug?

2) does it also affect virtio disks (or perhaps AHCI too)? I think Stefano experimented with virtio in Xen. If it does, then you're working around the problem in the wrong place.

Paolo
Alex Bligh
2013-Mar-18 14:30 UTC
Re: [Qemu-devel] [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

Paolo,

--On 18 March 2013 15:05:08 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

>> Presumably the same way as if writeback caching is selected. I presume that must fsync() / fdatasync() all the data to disk, and a barrier will produce one of those.
>
> No, that's done already. The source does an fsync/fdatasync before terminating the migration.
>
> The problem is that the target's page cache might host image data from a previous run. If you do not use O_DIRECT, it will not see the changes made on the source.

I was under the impression that with cache=writeback, qemu doesn't use O_DIRECT, in which case why isn't there the same danger under kvm, i.e. that the target page cache contains data from a previous run?

>> It would be great to fix the kernel bug (and I have submitted code), but the fix is pretty intrusive (see the link I posted) and there appears to be little interest in taking it forward. Certainly my kernel hacking skills are not adequate to the task.
>>
>> The current position is that booting a Xen domU which does disk I/O (Ubuntu cloud image used as the test case) with an NFS root crashes dom0 absolutely repeatably, and kills all other guests. Unless and until there is a kernel fix for that, Xen is in essence unusable with HVM and network based disk backend. So we need a workaround in the meantime which doesn't require a kernel fix.
>
> If you want to have this patch, you need to detect the bug and only do the hack if the bug is detected. Plus, disable migration when the hack is in use.

I originally suggested having this as an option (detecting it live and non-destructively is practically impossible - suggestions welcome), but xen-devel felt it should just be changed.

My original preference was for xl to process cache= type options (so those using a local file system known to be safe could use O_DIRECT still), but that requires a change to xenstore, was not popular, and is probably too intrusive. I patched it the way the xen-devel folks wanted.

Disabling migration seems a bit excessive when migration isn't disabled with cache=unsafe (AFAIK), and the alternative (using O_DIRECT) is far far more unsafe (one tcp retransmit and your system is dead).

> 1) why does blkback not have the bug?
>
> 2) does it also affect virtio disks (or perhaps AHCI too)? I think Stefano experimented with virtio in Xen. If it does, then you're working around the problem in the wrong place.

I believe it affects PV disks and not emulated disks as emulated disks under Xen do not use O_DIRECT (despite migration apparently working notwithstanding your comment above). Stefano did ack the patch, and for a one line change it's been through a pretty extensive discussion on xen-devel ...

I've no idea what else it affects. I'd suggest it also affects kvm, save that the kvm 'bad' will be writing the wrong data, not hosing the whole machine.

--
Alex Bligh
Paolo Bonzini
2013-Mar-18 14:49 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 18/03/2013 15:30, Alex Bligh wrote:
> Paolo,
>
> --On 18 March 2013 15:05:08 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>>> Presumably the same way as if writeback caching is selected. I presume that must fsync() / fdatasync() all the data to disk, and a barrier will produce one of those.
>>
>> No, that's done already. The source does an fsync/fdatasync before terminating the migration.
>>
>> The problem is that the target's page cache might host image data from a previous run. If you do not use O_DIRECT, it will not see the changes made on the source.
>
> I was under the impression that with cache=writeback, qemu doesn't use O_DIRECT, in which case why isn't there the same danger under kvm, i.e. that the target page cache contains data from a previous run?

KVM in fact only supports migration using cache=none. This does not apply of course if you're using cache-coherent storage, such as rbd or gluster; or if you're using one of the userspace backends that bypass the page cache, like NBD or libiscsi.

> Disabling migration seems a bit excessive when migration isn't disabled with cache=unsafe (AFAIK)

It is not QEMU's task. There are cases where the cache= options are unnecessary or ignored. But indeed libvirt warns (or blocks, I don't remember) in this case.

> , and the alternative (using O_DIRECT) is far far more unsafe (one tcp retransmit and your system is dead).
>
>> 1) why does blkback not have the bug?
>>
>> 2) does it also affect virtio disks (or perhaps AHCI too)? I think Stefano experimented with virtio in Xen. If it does, then you're working around the problem in the wrong place.
>
> I believe it affects PV disks and not emulated disks as emulated disks under Xen do not use O_DIRECT (despite migration apparently working notwithstanding your comment above).

If libxl does migration without O_DIRECT, then that's a bug in libxl. What about blkback? IIRC it uses bios, so it also bypasses the page cache.

> Stefano did ack the patch, and for a one line change it's been through a pretty extensive discussion on xen-devel ...

It may be a one-line change, but it completely changes the paths that I/O goes through. Apparently the discussion was not enough.

> I've no idea what else it affects. I'd suggest it also affects kvm, save that the kvm 'bad' will be writing the wrong data, not hosing the whole machine.
Alex Bligh
2013-Mar-18 15:40 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

Paolo,

--On 18 March 2013 15:49:48 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

>>> The problem is that the target's page cache might host image data from a previous run. If you do not use O_DIRECT, it will not see the changes made on the source.
>>
>> I was under the impression that with cache=writeback, qemu doesn't use O_DIRECT, in which case why isn't there the same danger under kvm, i.e. that the target page cache contains data from a previous run?
>
> KVM in fact only supports migration using cache=none. This does not apply of course if you're using cache-coherent storage, such as rbd or gluster; or if you're using one of the userspace backends that bypass the page cache, like NBD or libiscsi.

Well right now emulated devices under xen with qemu-upstream dm do not use O_DIRECT. So that's going to hit this problem anyway.

Let me understand the problem a little more. Let's say a VM has a disk backed by NFS. It runs on node A, at which point pages are introduced to the page cache. It then migrates to node B, with node A flushing its dirty pages to disk before the migration completed, but not emptying the page cache completely. It's then migrated back to node A, in which case I think you are saying that node A may still hold the page cache entries from prior to the first migration.

However, node A has closed and reopened the file. Doesn't NFSv4 notice that the file has changed at this point and invalidate all page cache entries? Else (re)opening a file on one NFS client that has been changed on another would never work. I know this is unpredictable if you haven't closed the file, but in this instance A has closed the file.

>> Disabling migration seems a bit excessive when migration isn't disabled with cache=unsafe (AFAIK)
>
> It is not QEMU's task. There are cases where the cache= options are unnecessary or ignored. But indeed libvirt warns (or blocks, I don't remember) in this case.

Well we are kvm users too, and I understand that kvm (as opposed to libvirt) does not error or warn if you live migrate with cache=writeback.

I've no problem if xl or libvirt or whatever error or warn. My usage is API based, rather than xl / libvirt based.

> If libxl does migration without O_DIRECT, then that's a bug in libxl. What about blkback? IIRC it uses bios, so it also bypasses the page cache.

Possibly a bug in xl rather than libxl, but as no emulated devices use O_DIRECT, that bug is already there, and isn't in QEMU.

>> Stefano did ack the patch, and for a one line change it's been through a pretty extensive discussion on xen-devel ...
>
> It may be a one-line change, but it completely changes the paths that I/O goes through. Apparently the discussion was not enough.

What would you suggest?

--
Alex Bligh
Paolo Bonzini
2013-Mar-18 16:19 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 18/03/2013 16:40, Alex Bligh wrote:
> Paolo,
>
> --On 18 March 2013 15:49:48 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>>>> The problem is that the target's page cache might host image data from a previous run.

I remembered this incorrectly, sorry. It's not from a previous run, it's from the beginning of this run. See http://wiki.qemu.org/Migration/Storage for more information.

A VM has a disk backed by NFS. It runs on node A, at which point pages are introduced to the page cache. It then migrates to node B, which entails starting the VM on node B while it is still running on node A. Closing has yet to happen on node A, but the file is already open on node B; anything that is cached on node B will never be invalidated.

Thus, any changes done to the disk on node A during migration may not become visible on node B.

>>> Disabling migration seems a bit excessive when migration isn't disabled with cache=unsafe (AFAIK)
>>
>> It is not QEMU's task. There are cases where the cache= options are unnecessary or ignored. But indeed libvirt warns (or blocks, I don't remember) in this case.
>
> Well we are kvm users too, and I understand that kvm (as opposed to libvirt) does not error or warn if you live migrate with cache=writeback.
>
> I've no problem if xl or libvirt or whatever error or warn. My usage is API based, rather than xl / libvirt based.

What makes libvirt not an API (just like libxl)?

>> If libxl does migration without O_DIRECT, then that's a bug in libxl. What about blkback? IIRC it uses bios, so it also bypasses the page cache.
>
> Possibly a bug in xl rather than libxl, but as no emulated devices use O_DIRECT, that bug is already there, and isn't in QEMU.

blkback is the in-kernel PV device, it's not an emulated device.

>>> Stefano did ack the patch, and for a one line change it's been through a pretty extensive discussion on xen-devel ...
>>
>> It may be a one-line change, but it completely changes the paths that I/O goes through. Apparently the discussion was not enough.
>
> What would you suggest?

Nothing except fixing the bug in the kernel. No one has yet explained why blkback is not susceptible to the same bug.

Paolo
Alex Bligh
2013-Mar-18 16:53 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

Paolo,

--On 18 March 2013 17:19:14 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

> I remembered this incorrectly, sorry. It's not from a previous run, it's from the beginning of this run. See http://wiki.qemu.org/Migration/Storage for more information.
>
> A VM has a disk backed by NFS. It runs on node A, at which point pages are introduced to the page cache. It then migrates to node B, which entails starting the VM on node B while it is still running on node A. Closing has yet to happen on node A, but the file is already open on node B; anything that is cached on node B will never be invalidated.
>
> Thus, any changes done to the disk on node A during migration may not become visible on node B.

This might be a difference between Xen and KVM. On Xen migration is made to a server in a paused state, and it's only unpaused when the migration to B is complete. There's a sort of extra handshake at the end.

I believe what's happening is that libxl_domain_suspend when called with LIBXL_SUSPEND_LIVE will do a final fsync()/fdatasync() at the end, then await a migrate_receiver_ready message, and only when that has been received will it send a migrate_permission_to_go message which unpauses the domain. Before that, I don't believe the disk is read (I may be wrong about that). The sending code is in migrate_domain() in xl_cmdimpl.c, and the receiving code is in migrate_receive() (same file).

On xen at least, I don't think the VM is ever started on node B whilst it is still running on node A.

>> I've no problem if xl or libvirt or whatever error or warn. My usage is API based, rather than xl / libvirt based.
>
> What makes libvirt not an API (just like libxl)?

Nothing, just I'm using the QMP API and the libxl API. I'm just saying whether libvirt or xl warn or error makes no difference to me.

>>> If libxl does migration without O_DIRECT, then that's a bug in libxl. What about blkback? IIRC it uses bios, so it also bypasses the page cache.
>>
>> Possibly a bug in xl rather than libxl, but as no emulated devices use O_DIRECT, that bug is already there, and isn't in QEMU.
>
> blkback is the in-kernel PV device, it's not an emulated device.

I mean that an emulated device will already not use O_DIRECT. So if you are right about live migrate being unsafe without O_DIRECT, it's already unsafe for emulated devices.

>>>> Stefano did ack the patch, and for a one line change it's been through a pretty extensive discussion on xen-devel ...
>>>
>>> It may be a one-line change, but it completely changes the paths that I/O goes through. Apparently the discussion was not enough.
>>
>> What would you suggest?
>
> Nothing except fixing the bug in the kernel.

I have already posted patches for that, as Ian Campbell did in 2008, but no one seems particularly interested. Be my guest in trying to get them adopted. That's quite obviously the long term solution.

In the mean time, however, there is a need to run Xen on kernels with long term support. Not being able to run Xen in a stable manner is not an acceptable position.

> No one has yet explained why blkback is not susceptible to the same bug.

I would guess it will be if it uses O_DIRECT or whatever the in-kernel equivalent is, unless it's doing a copy of the guest pages prior to the write being marked as complete. I can't claim to be familiar with blkback, but I presume this would require a similar fix elsewhere.

--
Alex Bligh
George Dunlap
2013-Mar-18 17:38 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 18/03/13 16:53, Alex Bligh wrote:
> Paolo,
>
> --On 18 March 2013 17:19:14 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> I remembered this incorrectly, sorry. It's not from a previous run, it's from the beginning of this run. See http://wiki.qemu.org/Migration/Storage for more information.
>>
>> A VM has a disk backed by NFS. It runs on node A, at which point pages are introduced to the page cache. It then migrates to node B, which entails starting the VM on node B while it is still running on node A. Closing has yet to happen on node A, but the file is already open on node B; anything that is cached on node B will never be invalidated.
>>
>> Thus, any changes done to the disk on node A during migration may not become visible on node B.
>
> This might be a difference between Xen and KVM. On Xen migration is made to a server in a paused state, and it's only unpaused when the migration to B is complete. There's a sort of extra handshake at the end.

I think what you mean is that all the memory is handled by Xen and the toolstack, not by qemu. The qemu state is sent as the very last thing, after all of the memory, and therefore (you are arguing) that qemu is not started, and the files cannot be opened, until after the migration is nearly complete, and certainly until after the file is closed on the sending side.

(In KVM this isn't the case because I assume it's qemu that handles the memory transition, and so it is likely to open the files on the receiving side when the migration starts.)

If so I think that should answer Paolo's argument -- but let me go through and verify that first. It will have to wait until tomorrow, however. :-)

-George
Alex Bligh
2013-Mar-18 17:47 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

George,

--On 18 March 2013 17:38:54 +0000 George Dunlap <george.dunlap@eu.citrix.com> wrote:

> I think what you mean is that all the memory is handled by Xen and the toolstack, not by qemu. The qemu state is sent as the very last thing, after all of the memory, and therefore (you are arguing) that qemu is not started, and the files cannot be opened, until after the migration is nearly complete, and certainly until after the file is closed on the sending side.

That is my understanding. Thank you for putting it better than I did :-)

--
Alex Bligh
Paolo Bonzini
2013-Mar-18 18:00 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 18/03/2013 18:38, George Dunlap wrote:
>>>
>> This might be a difference between Xen and KVM. On Xen migration is made to a server in a paused state, and it's only unpaused when the migration to B is complete. There's a sort of extra handshake at the end.
>
> I think what you mean is that all the memory is handled by Xen and the toolstack, not by qemu. The qemu state is sent as the very last thing, after all of the memory, and therefore (you are arguing) that qemu is not started, and the files cannot be opened, until after the migration is nearly complete, and certainly until after the file is closed on the sending side.

That would be quite dangerous. Files aren't closed until after QEMU exits; at this point whatever problem you have launching QEMU on the destination would be unrecoverable.

Even for successful migration, it would also be bad for downtime (QEMU isn't exactly lightning-fast to start). And even if failure weren't catastrophic, it would be a pity to transfer a few gigs of memory and then find out that QEMU isn't present in the destination. :)

Still, it's more than possible that I've forgotten something about Xen's management of QEMU.

> (In KVM this isn't the case because I assume it's qemu that handles the memory transition, and so it is likely to open the files on the receiving side when the migration starts.)

Yup.

Paolo
George Dunlap
2013-Mar-19 10:06 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On Mon, Mar 18, 2013 at 6:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 18/03/2013 18:38, George Dunlap wrote:
>>>>>
>>>> This might be a difference between Xen and KVM. On Xen migration is made to a server in a paused state, and it's only unpaused when the migration to B is complete. There's a sort of extra handshake at the end.
>>>
>>> I think what you mean is that all the memory is handled by Xen and the toolstack, not by qemu. The qemu state is sent as the very last thing, after all of the memory, and therefore (you are arguing) that qemu is not started, and the files cannot be opened, until after the migration is nearly complete, and certainly until after the file is closed on the sending side.
>>
>> That would be quite dangerous. Files aren't closed until after QEMU exits; at this point whatever problem you have launching QEMU on the destination would be unrecoverable.

But if I understand your concern correctly, you were concerned about the following scenario:

R1. Receiver qemu opens file
R2. Something causes receiver kernel to cache parts of file (maybe optimistic read-ahead)
S1. Sender qemu writes to file
S2. Sender qemu does final flush
S3. Sender qemu closes file
R3. Receiver reads stale blocks from cache

Even supposing that Xen doesn't actually shut down qemu until it is started on the remote side, as long as the file isn't opened by qemu until after S2, we should be safe, right? It would look like this:

S1. Sender qemu writes to file
S2. Sender qemu does final flush
R1. Receiver qemu opens file
R2. Receiver kernel caches file
S3. Sender qemu closes file

This is all assuming that:
1. The barrier operations / write flush are effective at getting the data back on to the NFS server
2. The receiver qemu doesn't open the file until after the last flush by the sender.

Number 1 has been tested by Alex I believe, and is mentioned in the changeset log; so if #2 is true, then we should be safe. I'll try to verify that today.

> Even for successful migration, it would also be bad for downtime (QEMU isn't exactly lightning-fast to start). And even if failure weren't catastrophic, it would be a pity to transfer a few gigs of memory and then find out that QEMU isn't present in the destination. :)

Well, if qemu isn't present at the destination, that's definitely user error. :-) In any case, I know that the migration can resume if it fails, so I suspect that the qemu is just paused on the sending side until the migration is known to complete. As long as the last write was flushed to the NFS server before the receiver opens the file, we should be safe.

> Still, it's more than possible that I've forgotten something about Xen's management of QEMU.

And unfortunately I am not intimately familiar with that codepath; it just happens that I'm the last person to have to dig into that code and fix something. :-)

-George
Paolo Bonzini
2013-Mar-19 10:43 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 19/03/2013 11:06, George Dunlap wrote:
> On Mon, Mar 18, 2013 at 6:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 18/03/2013 18:38, George Dunlap wrote:
>>>>>>
>>>>> This might be a difference between Xen and KVM. On Xen migration is made to a server in a paused state, and it's only unpaused when the migration to B is complete. There's a sort of extra handshake at the end.
>>>>
>>>> I think what you mean is that all the memory is handled by Xen and the toolstack, not by qemu. The qemu state is sent as the very last thing, after all of the memory, and therefore (you are arguing) that qemu is not started, and the files cannot be opened, until after the migration is nearly complete, and certainly until after the file is closed on the sending side.
>>>
>>> That would be quite dangerous. Files aren't closed until after QEMU exits; at this point whatever problem you have launching QEMU on the destination would be unrecoverable.
>
> But if I understand your concern correctly, you were concerned about the following scenario:
> R1. Receiver qemu opens file
> R2. Something causes receiver kernel to cache parts of file (maybe optimistic read-ahead)

For some image formats, metadata is cached inside QEMU on startup. There is a callback to invalidate QEMU's cache at the end of migration, but that does not extend to the page cache.

> S1. Sender qemu writes to file
> S2. Sender qemu does final flush
> S3. Sender qemu closes file
> R3. Receiver reads stale blocks from cache
>
> Even supposing that Xen doesn't actually shut down qemu until it is started on the remote side, as long as the file isn't opened by qemu until after S2, we should be safe, right? It would look like this:
>
> S1. Sender qemu writes to file
> S2. Sender qemu does final flush
> R1. Receiver qemu opens file
> R2. Receiver kernel caches file
> S3. Sender qemu closes file
>
> This is all assuming that:
> 1. The barrier operations / write flush are effective at getting the data back on to the NFS server
> 2. The receiver qemu doesn't open the file until after the last flush by the sender.
>
> Number 1 has been tested by Alex I believe, and is mentioned in the changeset log; so if #2 is true, then we should be safe. I'll try to verify that today.

Thanks.

>> Even for successful migration, it would also be bad for downtime (QEMU isn't exactly lightning-fast to start). And even if failure weren't catastrophic, it would be a pity to transfer a few gigs of memory and then find out that QEMU isn't present in the destination. :)
>
> Well, if qemu isn't present at the destination, that's definitely user error. :-) In any case, I know that the migration can resume if it fails, so I suspect that the qemu is just paused on the sending side until the migration is known to complete. As long as the last write was flushed to the NFS server before the receiver opens the file, we should be safe.

Note that the close really must happen before the next open. Otherwise the file metadata might not be up-to-date on the destination, too.

Paolo
George Dunlap
2013-Mar-19 10:51 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 03/19/2013 10:43 AM, Paolo Bonzini wrote:
>>> Even for successful migration, it would also be bad for downtime (QEMU isn't exactly lightning-fast to start). And even if failure weren't catastrophic, it would be a pity to transfer a few gigs of memory and then find out that QEMU isn't present in the destination. :)
>>
>> Well, if qemu isn't present at the destination, that's definitely user error. :-) In any case, I know that the migration can resume if it fails, so I suspect that the qemu is just paused on the sending side until the migration is known to complete. As long as the last write was flushed to the NFS server before the receiver opens the file, we should be safe.
>
> Note that the close really must happen before the next open. Otherwise the file metadata might not be up-to-date on the destination, too.

By "file metadata" I assume you mean "metadata about the virtual disk within the file", not "metadata about the file within the filesystem", right? That's good to know, I'll keep that in mind.

Even if it's true that at the moment qemu doesn't write the file metadata until it closes the file, that just means we'd have to add a hook to the callback to save qemu state, to sync the metadata at that point, right? (This may in fact already be done; I'm more or less completely unfamiliar with the qemu side of save/restore.)

-George
Paolo Bonzini
2013-Mar-19 11:14 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 19/03/2013 11:51, George Dunlap wrote:
> On 03/19/2013 10:43 AM, Paolo Bonzini wrote:
>>>> Even for successful migration, it would also be bad for downtime (QEMU isn't exactly lightning-fast to start). And even if failure weren't catastrophic, it would be a pity to transfer a few gigs of memory and then find out that QEMU isn't present in the destination. :)
>>>
>>> Well, if qemu isn't present at the destination, that's definitely user error. :-) In any case, I know that the migration can resume if it fails, so I suspect that the qemu is just paused on the sending side until the migration is known to complete. As long as the last write was flushed to the NFS server before the receiver opens the file, we should be safe.
>>
>> Note that the close really must happen before the next open. Otherwise the file metadata might not be up-to-date on the destination, too.
>
> By "file metadata" I assume you mean "metadata about the virtual disk within the file", not "metadata about the file within the filesystem", right? That's good to know, I'll keep that in mind.

Actually especially the former (I'm calling them respectively "image metadata" and "file metadata"). File metadata could also be a problem, but I think it might just work except in cases like on-line resizing during migration.

> Even if it's true that at the moment qemu doesn't write the file metadata until it closes the file, that just means we'd have to add a hook to the callback to save qemu state, to sync the metadata at that point, right?

Unfortunately no. The problem is in the loading side's kernel, on which you do not have any control. If the loading side doesn't use O_DIRECT, any attempt to invalidate the metadata in userspace or on the source is futile, because there is no way to invalidate the page cache's copy of that metadata.

Paolo
George Dunlap
2013-Mar-19 11:21 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 03/19/2013 11:14 AM, Paolo Bonzini wrote:
> On 19/03/2013 11:51, George Dunlap wrote:
>> On 03/19/2013 10:43 AM, Paolo Bonzini wrote:
>>>>> Even for successful migration, it would also be bad for downtime (QEMU isn't exactly lightning-fast to start). And even if failure weren't catastrophic, it would be a pity to transfer a few gigs of memory and then find out that QEMU isn't present in the destination. :)
>>>>
>>>> Well, if qemu isn't present at the destination, that's definitely user error. :-) In any case, I know that the migration can resume if it fails, so I suspect that the qemu is just paused on the sending side until the migration is known to complete. As long as the last write was flushed to the NFS server before the receiver opens the file, we should be safe.
>>>
>>> Note that the close really must happen before the next open. Otherwise the file metadata might not be up-to-date on the destination, too.
>>
>> By "file metadata" I assume you mean "metadata about the virtual disk within the file", not "metadata about the file within the filesystem", right? That's good to know, I'll keep that in mind.
>
> Actually especially the former (I'm calling them respectively "image metadata" and "file metadata"). File metadata could also be a problem, but I think it might just work except in cases like on-line resizing during migration.
>
>> Even if it's true that at the moment qemu doesn't write the file metadata until it closes the file, that just means we'd have to add a hook to the callback to save qemu state, to sync the metadata at that point, right?
>
> Unfortunately no. The problem is in the loading side's kernel, on which you do not have any control. If the loading side doesn't use O_DIRECT, any attempt to invalidate the metadata in userspace or on the source is futile, because there is no way to invalidate the page cache's copy of that metadata.

Yes, I meant "the only further thing we would have to do". The entire discussion relies on the assumption that the receiving side doesn't open the file until after the sending side has issued the qemu state save. So as long as both the virtual blocks and the image metadata have been synced to the NFS server at that point, we should be all right. If at the moment the image metadata is *not* synced at that point, it seems like we should be able to make it so relatively easily.

-George
Alex Bligh
2013-Mar-19 11:44 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

--On 19 March 2013 12:14:36 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

> Unfortunately no. The problem is in the loading side's kernel, on which you do not have any control. If the loading side doesn't use O_DIRECT, any attempt to invalidate the metadata in userspace or on the source is futile, because there is no way to invalidate the page cache's copy of that metadata.

If you closed the file and reopened it, then on NFS this would have this effect I believe, as it needs to check whether other clients have modified the file to give open/close coherency.

However, we are rather assuming the file isn't even open on the receiving side at the point it is closed by the sender, which is what George is investigating. If this isn't true, we have a problem anyway with (e.g.) emulated devices which don't use O_DIRECT anyway.

And I had thought (I may be wrong) using O_DIRECT does not guarantee no read caching with NFS; O_DIRECT merely guarantees the page cache is not used under Linux and isn't defined under POSIX:
http://www.spinics.net/lists/linux-nfs/msg17472.html

If it were just a write caching issue, we could use O_DSYNC instead of O_DIRECT, which would at least ensure the copy from userspace.

--
Alex Bligh
Paolo Bonzini
2013-Mar-19 11:49 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 19/03/2013 12:44, Alex Bligh wrote:
> If this isn't true, we have a problem anyway with (e.g.) emulated devices which don't use O_DIRECT anyway.

Yes, though that would be a libxl bug, not a QEMU bug.

> And I had thought (I may be wrong) using O_DIRECT does not guarantee no read caching with NFS; O_DIRECT merely guarantees the page cache is not used under Linux and isn't defined under POSIX:
> http://www.spinics.net/lists/linux-nfs/msg17472.html

Read caching on the server is fine, because it is the same server that was used for the writes. O_DIRECT bypasses the client's page cache, and that's enough for our purposes.

> If it were just a write caching issue, we could use O_DSYNC instead of O_DIRECT, which would at least ensure the copy from userspace.

O_DSYNC is not necessary. We do issue the appropriate fsync/fdatasync. What O_DSYNC does is add an implicit fdatasync after every write, basically.

Paolo
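As a side note for readers following the O_DSYNC point, the difference being described is roughly the following. This is a sketch of the POSIX semantics only, with hypothetical helper names; it is not QEMU code:

```c
#include <fcntl.h>
#include <unistd.h>

/* Two ways to get data pushed out of the client promptly.  Neither
 * bypasses the client's page cache the way O_DIRECT does; they only
 * control when dirty data is written back. */

/* (a) O_DSYNC: every write() carries an implicit fdatasync(). */
static int open_dsync(const char *path)
{
    return open(path, O_RDWR | O_DSYNC);
}

/* (b) Plain open plus explicit flushes: write normally, and call
 * fdatasync() only when the guest issues a flush/barrier request.
 * This is the behaviour the patched xen_disk relies on. */
static int flush_on_barrier(int fd)
{
    return fdatasync(fd);
}
```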
George Dunlap
2013-Mar-19 15:12 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On Tue, Mar 19, 2013 at 11:21 AM, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 03/19/2013 11:14 AM, Paolo Bonzini wrote:
>>
>> On 19/03/2013 11:51, George Dunlap wrote:
>>>
>>> On 03/19/2013 10:43 AM, Paolo Bonzini wrote:
>>>>>>
>>>>>> Even for successful migration, it would also be bad for downtime (QEMU isn't exactly lightning-fast to start). And even if failure weren't catastrophic, it would be a pity to transfer a few gigs of memory and then find out that QEMU isn't present in the destination. :)
>>>>>
>>>>> Well, if qemu isn't present at the destination, that's definitely user error. :-) In any case, I know that the migration can resume if it fails, so I suspect that the qemu is just paused on the sending side until the migration is known to complete. As long as the last write was flushed to the NFS server before the receiver opens the file, we should be safe.
>>>>
>>>> Note that the close really must happen before the next open. Otherwise the file metadata might not be up-to-date on the destination, too.
>>>
>>> By "file metadata" I assume you mean "metadata about the virtual disk within the file", not "metadata about the file within the filesystem", right? That's good to know, I'll keep that in mind.
>>
>> Actually especially the former (I'm calling them respectively "image metadata" and "file metadata"). File metadata could also be a problem, but I think it might just work except in cases like on-line resizing during migration.
>>
>>> Even if it's true that at the moment qemu doesn't write the file metadata until it closes the file, that just means we'd have to add a hook to the callback to save qemu state, to sync the metadata at that point, right?
>>
>> Unfortunately no. The problem is in the loading side's kernel, on which you do not have any control. If the loading side doesn't use O_DIRECT, any attempt to invalidate the metadata in userspace or on the source is futile, because there is no way to invalidate the page cache's copy of that metadata.
>
> Yes, I meant "the only further thing we would have to do". The entire discussion relies on the assumption that the receiving side doesn't open the file until after the sending side has issued the qemu state save. So as long as both the virtual blocks and the image metadata have been synced to the NFS server at that point, we should be all right. If at the moment the image metadata is *not* synced at that point, it seems like we should be able to make it so relatively easily.

I've just had a chat with Stefano, and it turns out I was a bit confused -- this change has nothing to do with qemu running as a device model, but only as qemu running as a PV back-end for PV guests. So the question of when in the save/restore process the qemu is started is moot.

For posterity's sake, however, here is what I have found about qemu as a device model and Xen migration:

* qemu on the receiving side gets the name of the qemu save file from the command-line arguments; so it cannot be started until after qemu on the sending side has been paused, and the file sent across the wire.

* qemu on the sending side does not exit until it is determined that the receiver is ready to begin. So it is still running when qemu on the receiving side starts.

* I didn't determine whether the commands to stop and save state cause a disk image metadata sync.

-George
Stefano Stabellini
2013-Mar-19 15:13 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On Tue, 19 Mar 2013, Paolo Bonzini wrote:
> On 19/03/2013 11:06, George Dunlap wrote:
> > On Mon, Mar 18, 2013 at 6:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> On 18/03/2013 18:38, George Dunlap wrote:
> >>>>>
> >>>> This might be a difference between Xen and KVM. On Xen migration is made to a server in a paused state, and it's only unpaused when the migration to B is complete. There's a sort of extra handshake at the end.
> >>>
> >>> I think what you mean is that all the memory is handled by Xen and the toolstack, not by qemu. The qemu state is sent as the very last thing, after all of the memory, and therefore (you are arguing) that qemu is not started, and the files cannot be opened, until after the migration is nearly complete, and certainly until after the file is closed on the sending side.
> >>
> >> That would be quite dangerous. Files aren't closed until after QEMU exits; at this point whatever problem you have launching QEMU on the destination would be unrecoverable.
> >
> > But if I understand your concern correctly, you were concerned about the following scenario:
> > R1. Receiver qemu opens file
> > R2. Something causes receiver kernel to cache parts of file (maybe optimistic read-ahead)
>
> For some image formats, metadata is cached inside QEMU on startup. There is a callback to invalidate QEMU's cache at the end of migration, but that does not extend to the page cache.

[...]

> > Well, if qemu isn't present at the destination, that's definitely user error. :-) In any case, I know that the migration can resume if it fails, so I suspect that the qemu is just paused on the sending side until the migration is known to complete. As long as the last write was flushed to the NFS server before the receiver opens the file, we should be safe.
>
> Note that the close really must happen before the next open. Otherwise the file metadata might not be up-to-date on the destination, too.

First of all, I want to thank Paolo for spending time on this, because it's a very delicate issue. I didn't spot this possible race before. If we make a mistake now, in 6 months time a very poor soul will be spending weeks debugging a difficult, not very reproducible, bug.

Also, just as a clarification for the audience at home, it's either safe or it's not: if it's not safe, we certainly can't have this patch go in. If it is safe, then it should go in.

This patch only impacts the PV backend in QEMU, not the IDE interface. PV frontends and backends always disconnect and reconnect during save/restore. So we can be *certain* that bdrv_close at the sender side is called before the new connection is established at the receiver side.

Unfortunately the QEMU PV backend opens the disk during initialization, so before the actual connection is established (blk_init).

Therefore I think that the current change is not safe, but it is pretty easy to make it safe. You just need to move the call to blk_open from blk_init to blk_connect. Actually you can move most of blk_init to blk_connect, you just need to keep the 4 xenstore_write_be_int at the end of the function.
George Dunlap
2013-Mar-19 15:29 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On Tue, Mar 19, 2013 at 3:12 PM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>
> I've just had a chat with Stefano, and it turns out I was a bit confused -- this change has nothing to do with qemu running as a device model, but only as qemu running as a PV back-end for PV guests. So the question of when in the save/restore process the qemu is started is moot.

Alex, the title of the changelog should certainly be more specific, so people aren't confused. Maybe something like the following?

"Disable use of O_DIRECT when acting as a Xen PV backend"

And emphasizing in the changelog that this has no effect on qemu when acting as a device model for Xen?

-George
Paolo Bonzini
2013-Mar-19 16:53 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 19/03/2013 16:13, Stefano Stabellini wrote:
> This patch only impacts the PV backend in QEMU, not the IDE interface. PV frontends and backends always disconnect and reconnect during save/restore. So we can be *certain* that bdrv_close at the sender side is called before the new connection is established at the receiver side.

From the little I remember about Xen, they go to XenbusStateClosed on the source before invoking the suspend hypercall? And then restart on the destination?

If so, that sounds right. Thanks for finding an elegant solution.

> Unfortunately the QEMU PV backend opens the disk during initialization, so before the actual connection is established (blk_init).
>
> Therefore I think that the current change is not safe, but it is pretty easy to make it safe. You just need to move the call to blk_open from blk_init to blk_connect.

Paolo
Stefano Stabellini
2013-Mar-19 17:03 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On Tue, 19 Mar 2013, Paolo Bonzini wrote:
> On 19/03/2013 16:13, Stefano Stabellini wrote:
> > This patch only impacts the PV backend in QEMU, not the IDE interface. PV frontends and backends always disconnect and reconnect during save/restore. So we can be *certain* that bdrv_close at the sender side is called before the new connection is established at the receiver side.
>
> From the little I remember about Xen, they go to XenbusStateClosed on the source before invoking the suspend hypercall? And then restart on the destination?

That's right

> If so, that sounds right. Thanks for finding an elegant solution.

Thank you :)
Alex Bligh
2013-Mar-19 19:15 UTC
Re: [Qemu-devel] [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

Stefano, George,

> This patch only impacts the PV backend in QEMU, not the IDE interface. PV frontends and backends always disconnect and reconnect during save/restore. So we can be *certain* that bdrv_close at the sender side is called before the new connection is established at the receiver side.
>
> Unfortunately the QEMU PV backend opens the disk during initialization, so before the actual connection is established (blk_init).
>
> Therefore I think that the current change is not safe, but it is pretty easy to make it safe. You just need to move the call to blk_open from blk_init to blk_connect. Actually you can move most of blk_init to blk_connect, you just need to keep the 4 xenstore_write_be_int at the end of the function.

Stefano: do you want me to come up with a patch to do this?

--On 19 March 2013 15:29:36 +0000 George Dunlap <George.Dunlap@eu.citrix.com> wrote:

> Alex, the title of the changelog should certainly be more specific, so people aren't confused. Maybe something like the following?
>
> "Disable use of O_DIRECT when acting as a Xen PV backend"
>
> And emphasizing in the changelog that this has no effect on qemu when acting as a device model for Xen?

Sure, I can do that.

However, my understanding (possibly incorrect) is that when qemu is acting as a device model for xen (i.e. emulated disks) it doesn't use O_DIRECT anyway. I am aware this is a different problem!

--
Alex Bligh
Alex Bligh
2013-Mar-20 08:33 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

Stefano,

--On 19 March 2013 15:13:29 +0000 Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:

> Therefore I think that the current change is not safe, but it is pretty easy to make it safe. You just need to move the call to blk_open from blk_init to blk_connect. Actually you can move most of blk_init to blk_connect, you just need to keep the 4 xenstore_write_be_int at the end of the function.

The final of these 4 writes in blk_init is:

    xenstore_write_be_int(&blkdev->xendev, "sectors",
                          blkdev->file_size / blkdev->file_blk);

and blkdev->file_size comes from bdrv_getlength(blkdev->bs), and that requires (I think) bdrv_open to have been called. Can these 4 writes move safely to blk_connect? Or can we write the sectors count as (say) 0 here and fix it later in blk_connect? The remainder look ok.

--
Alex Bligh
Paolo Bonzini
2013-Mar-20 09:26 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.

On 20/03/2013 09:33, Alex Bligh wrote:
> Stefano,
>
> --On 19 March 2013 15:13:29 +0000 Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
>
>> Therefore I think that the current change is not safe, but it is pretty easy to make it safe. You just need to move the call to blk_open from blk_init to blk_connect. Actually you can move most of blk_init to blk_connect, you just need to keep the 4 xenstore_write_be_int at the end of the function.
>
> The final of these 4 writes in blk_init is:
>
>     xenstore_write_be_int(&blkdev->xendev, "sectors",
>                           blkdev->file_size / blkdev->file_blk);
>
> and blkdev->file_size comes from bdrv_getlength(blkdev->bs), and that requires (I think) bdrv_open to have been called. Can these 4 writes move safely to blk_connect? Or can we write the sectors count as (say) 0 here and fix it later in blk_connect? The remainder look ok.

I think so. blkfront reads "sectors" when QEMU moves to XenbusStateConnected, in blkfront_connect. blk_connect is called from xen_be_try_initialise, which moves to XenbusStateConnected on success.

So, QEMU's blk_connect will always be called before blkfront's blkfront_connect.

Paolo
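Putting Stefano's suggestion and the ordering Paolo describes together, the follow-up patch would look roughly like the sketch below. This is only an illustrative sketch under the assumptions discussed in the thread; the helper name blk_open_image is hypothetical and the real xen_disk.c code differs in detail:

```c
/* Illustrative sketch only -- not the actual xen_disk.c patch. */

static int blk_init(struct XenDevice *xendev)
{
    struct XenBlkDev *blkdev = container_of(xendev, struct XenBlkDev, xendev);

    /* ...parse mode/type/params from xenstore as before... */

    /* Advertise backend features only; the image is NOT opened here
     * any more.  "sectors" is deferred to blk_connect because it
     * needs the file size. */
    xenstore_write_be_int(&blkdev->xendev, "feature-flush-cache", 1);
    xenstore_write_be_int(&blkdev->xendev, "info", blkdev->info);
    return 0;
}

static int blk_connect(struct XenDevice *xendev)
{
    struct XenBlkDev *blkdev = container_of(xendev, struct XenBlkDev, xendev);
    int qflags = BDRV_O_CACHE_WB | BDRV_O_NATIVE_AIO;  /* no BDRV_O_NOCACHE */

    if (strcmp(blkdev->mode, "w") == 0) {
        qflags |= BDRV_O_RDWR;
    }

    /* Open the image only now: the backend on the migration source has
     * already disconnected and closed it by the time the frontend
     * reconnects here, so no stale page cache data can be seen. */
    if (blk_open_image(blkdev, qflags) < 0) {   /* hypothetical helper */
        return -1;
    }

    blkdev->file_size = bdrv_getlength(blkdev->bs);
    xenstore_write_be_int(&blkdev->xendev, "sectors",
                          blkdev->file_size / blkdev->file_blk);

    /* ...map the ring and bind the event channel as before.
     * blkfront only reads "sectors" once the backend has moved to
     * XenbusStateConnected, which happens after blk_connect succeeds. */
    return 0;
}
```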
Stefano Stabellini
2013-Mar-20 10:24 UTC
Re: [Qemu-devel] [PATCHv3] QEMU(upstream): Disable xen''s use of O_DIRECT by default as it results in crashes.
On Tue, 19 Mar 2013, Alex Bligh wrote:
> Stefano, George,
>
> > This patch only impacts the PV backend in QEMU, not the IDE interface.
> > PV frontends and backends always disconnect and reconnect during
> > save/restore.
> > So we can be *certain* that bdrv_close at the sender side is called
> > before the new connection is established at the receiver side.
> >
> > Unfortunately the QEMU PV backend opens the disk during initialization,
> > so before the actual connection is established (blk_init).
> >
> > Therefore I think that the current change is not safe, but it is pretty
> > easy to make it safe.
> > You just need to move the call to blk_open from blk_init to blk_connect.
> > Actually you can move most of blk_init to blk_connect, you just need to
> > keep the 4 xenstore_write_be_int at the end of the function.
>
> Stefano: do you want me to come up with a patch to do this?

Yes, please.

> --On 19 March 2013 15:29:36 +0000 George Dunlap
> <George.Dunlap@eu.citrix.com> wrote:
>
> > Alex, the title of the changelog should certainly be more specific, so
> > people aren't confused. Maybe something like the following?
> >
> > "Disable use of O_DIRECT when acting as a Xen PV backend"
> >
> > And emphasizing in the changelog that this has no effect on qemu when
> > acting as a device model for Xen?
>
> Sure, I can do that.
>
> However, my understanding (possibly incorrect) is that when qemu is
> acting as a device model for xen (i.e. emulated disks) it doesn't
> use O_DIRECT anyway. I am aware this is a different problem!

That is true. My guess is that nobody really migrates HVM guests without
PV drivers installed (it's not even possible on XenServer but xl lets you
do that if you want to). When the PV drivers initialize at boot time,
the IDE disk is closed. Therefore we wouldn't have this problem.

Maybe we should prevent HVM guest migration without PV drivers with xl
too. Ian, what do you think?
George Dunlap
2013-Mar-20 10:37 UTC
Re: [Qemu-devel] [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.
On 20/03/13 10:24, Stefano Stabellini wrote:
> That is true. My guess is that nobody really migrates HVM guests without
> PV drivers installed (it's not even possible on XenServer but xl lets you
> do that if you want to). When the PV drivers initialize at boot time,
> the IDE disk is closed. Therefore we wouldn't have this problem.
>
> Maybe we should prevent HVM guest migration without PV drivers with xl
> too. Ian, what do you think?

We should be able to make it safe if we just make sure that qemu does a
metadata sync / FS sync when we ask qemu to write the save file.

 -George
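One way that suggestion might be realised is sketched below, purely for illustration. The hook name flush_disks_before_save and its placement are assumptions, not something proposed in the thread; the premise is that QEMU's existing block-layer primitives bdrv_drain_all() and bdrv_flush_all() are the right tools for pushing dirty data out before the save file is written:

    /*
     * Sketch only: flush guest-visible disk state before the save file is
     * written, so a cached (non-O_DIRECT) backend cannot leave dirty data
     * in the dom0 page cache across a migration.
     */
    static void flush_disks_before_save(void)
    {
        bdrv_drain_all();   /* wait for in-flight requests to complete */
        bdrv_flush_all();   /* push dirty data and metadata to stable storage */
    }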
Paolo Bonzini
2013-Mar-20 11:08 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.
On 20/03/2013 11:37, George Dunlap wrote:
>> That is true. My guess is that nobody really migrates HVM guests without
>> PV drivers installed (it's not even possible on XenServer but xl lets you
>> do that if you want to). When the PV drivers initialize at boot time,
>> the IDE disk is closed. Therefore we wouldn't have this problem.
>>
>> Maybe we should prevent HVM guest migration without PV drivers with xl
>> too. Ian, what do you think?
>
> We should be able to make it safe if we just make sure that qemu does a
> metadata sync / FS sync when we ask qemu to write the save file.

To make it safe, just use cache=none. There are downsides (for example
tmpfs does not support it), but perhaps you can use a global option or
environment variable to toggle the behavior.

Is xl still using the driver:subdriver:/path,readonly syntax? That's
really inflexible compared to libvirt's XML...

Paolo
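For context on what cache=none means at the host level: in QEMU's raw POSIX file driver it corresponds to opening the image with O_DIRECT, roughly as in the sketch below. This is a simplified illustration of the flag translation, not an exact excerpt of block/raw-posix.c:

    /*
     * Simplified sketch: cache=none makes the block layer set
     * BDRV_O_NOCACHE, and the raw POSIX driver turns that into O_DIRECT
     * on the file descriptor it opens, bypassing the dom0 page cache.
     */
    int open_flags = O_RDWR;
    if (bdrv_flags & BDRV_O_NOCACHE) {
        open_flags |= O_DIRECT;
    }
    fd = qemu_open(filename, open_flags, 0644);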
Alex Bligh
2013-Mar-20 11:20 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.
--On 20 March 2013 12:08:19 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote:

> To make it safe, just use cache=none. There are downsides (for example
> tmpfs does not support it), but perhaps you can use a global option or
> environment variable to toggle the behavior.

If you don't use cache=none, it is unsafe because of the page cache
issue. If you use cache=none, it uses O_DIRECT, and it is unsafe because
of the kernel issue.

My preference would be to leave this can of worms to the end user. I'd
rather take the risk on the first (as I'd rather domU broke than dom0)
but that's because I always use network drives. Some documentation would
be useful.

--
Alex Bligh
David Scott
2013-Mar-20 11:57 UTC
Re: [Qemu-devel] [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.
Hi,

On 20/03/13 10:24, Stefano Stabellini wrote:
> That is true. My guess is that nobody really migrates HVM guests without
> PV drivers installed (it's not even possible on XenServer but xl lets you
> do that if you want to). When the PV drivers initialize at boot time,
> the IDE disk is closed. Therefore we wouldn't have this problem.
>
> Maybe we should prevent HVM guest migration without PV drivers with xl
> too. Ian, what do you think?

Although it's not normally possible on XenServer to migrate HVM guests
without PV drivers, a lot of people override the check and do it anyway.
It happens a lot in cloud scenarios where the host (XenServer) admin is
different from the guest admin (the end customer) and it's impractical
for the host admin to force the guest admin to load the drivers.

We do our best to recommend, encourage and help people to install HVM PV
drivers but we have to tolerate those who ignore our advice :/

Cheers,
Dave
Stefano Stabellini
2013-Mar-29 17:19 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.
On Wed, 20 Mar 2013, Paolo Bonzini wrote:
> On 20/03/2013 09:33, Alex Bligh wrote:
> > Stefano,
> >
> > --On 19 March 2013 15:13:29 +0000 Stefano Stabellini
> > <stefano.stabellini@eu.citrix.com> wrote:
> >
> >> Therefore I think that the current change is not safe, but it is pretty
> >> easy to make it safe.
> >> You just need to move the call to blk_open from blk_init to blk_connect.
> >> Actually you can move most of blk_init to blk_connect, you just need to
> >> keep the 4 xenstore_write_be_int at the end of the function.
> >
> > The final of these 4 writes in blk_init is:
> >
> >     xenstore_write_be_int(&blkdev->xendev, "sectors",
> >                           blkdev->file_size / blkdev->file_blk);
> >
> > and blkdev->file_size comes from bdrv_getlength(blkdev->bs), and that
> > requires (I think) bdrv_open to have been called. Can these
> > 4 writes move safely to blk_connect? Or can we write the sectors
> > count as (say) 0 here and fix it later in blk_connect? The remainder
> > look ok.
>
> I think so. blkfront reads "sectors" when QEMU moves to
> XenbusStateConnected, in blkfront_connect.
>
> blk_connect is called from xen_be_try_initialise, which moves to
> XenbusStateConnected on success. So, QEMU's blk_connect will always be
> called before blkfront's blkfront_connect.

Alex, do you have any updates on this patch?
Alex Bligh
2013-Mar-31 19:53 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.
Stefano,

--On 29 March 2013 17:19:26 +0000 Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:

>> I think so. blkfront reads "sectors" when QEMU moves to
>> XenbusStateConnected, in blkfront_connect.
>>
>> blk_connect is called from xen_be_try_initialise, which moves to
>> XenbusStateConnected on success. So, QEMU's blk_connect will always be
>> called before blkfront's blkfront_connect.
>
> Alex, do you have any updates on this patch?

Sorry, I've been snowed under with other stuff.

I got to the speed bump below, then didn't get back to it after Paolo's
reply. I'll see if I can get some time tomorrow.

--
Alex Bligh

---------- Forwarded Message ----------
Date: 20 March 2013 10:26:21 +0100
From: Paolo Bonzini <pbonzini@redhat.com>
To: Alex Bligh <alex@alex.org.uk>
CC: Stefano Stabellini <stefano.stabellini@eu.citrix.com>, George Dunlap
<George.Dunlap@eu.citrix.com>, Ian Campbell <Ian.Campbell@citrix.com>,
Ian Jackson <Ian.Jackson@eu.citrix.com>, qemu-devel@nongnu.org,
xen-devel <xen-devel@lists.xen.org>, Anthony Liguori <anthony@codemonkey.ws>
Subject: Re: [Xen-devel] [PATCHv3] QEMU(upstream): Disable xen's use of
O_DIRECT by default as it results in crashes.

On 20/03/2013 09:33, Alex Bligh wrote:
> Stefano,
>
> --On 19 March 2013 15:13:29 +0000 Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
>
>> Therefore I think that the current change is not safe, but it is pretty
>> easy to make it safe.
>> You just need to move the call to blk_open from blk_init to blk_connect.
>> Actually you can move most of blk_init to blk_connect, you just need to
>> keep the 4 xenstore_write_be_int at the end of the function.
>
> The final of these 4 writes in blk_init is:
>
>     xenstore_write_be_int(&blkdev->xendev, "sectors",
>                           blkdev->file_size / blkdev->file_blk);
>
> and blkdev->file_size comes from bdrv_getlength(blkdev->bs), and that
> requires (I think) bdrv_open to have been called. Can these
> 4 writes move safely to blk_connect? Or can we write the sectors
> count as (say) 0 here and fix it later in blk_connect? The remainder
> look ok.

I think so. blkfront reads "sectors" when QEMU moves to
XenbusStateConnected, in blkfront_connect.

blk_connect is called from xen_be_try_initialise, which moves to
XenbusStateConnected on success. So, QEMU's blk_connect will always be
called before blkfront's blkfront_connect.

Paolo
---------- End Forwarded Message ----------
Alex Bligh
2013-Apr-01 16:35 UTC
Re: [PATCHv3] QEMU(upstream): Disable xen's use of O_DIRECT by default as it results in crashes.
Stefano,

--On 31 March 2013 19:53:28 +0000 Alex Bligh <alex@alex.org.uk> wrote:

>> Alex, do you have any updates on this patch?
>
> Sorry, I've been snowed under with other stuff.
>
> I got to the speed bump below, then didn't get back to it after Paolo's
> reply. I'll see if I can get some time tomorrow.

Just sent under the subject:

  [PATCH] [RFC] Xen PV backend: Move call to bdrv_new from blk_init to
  blk_connect

Compile tested only at this stage as I'd like some feedback on whether
this is the sort of change you meant.

Unfortunately the git diff isn't very readable as it has decided I didn't
move a pile of code down from blk_init to blk_connect, but rather moved
the declaration of blk_connect up above the code concerned ...

--
Alex Bligh
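As a general aside on the diff readability problem (not something raised in the thread, and untested against this particular patch): git can sometimes attribute moved blocks more sensibly with the patience matching algorithm, for example:

    git diff --patience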