--On 28 June 2013 12:17:43 +0800 Joe Jin <joe.jin@oracle.com> wrote:

> Find a similar issue
> http://www.gossamer-threads.com/lists/xen/devel/265611 So copied to Xen
> developer as well.

I thought this sounded familiar. I haven't got the start of this thread,
but what version of Xen are you running and what device model? If before
4.3, there is a page lifetime bug in the kernel (not the Xen code) which
can affect anything where the guest accesses the host's block stack and
that in turn accesses the networking stack (it may in fact be wider than
that). So, e.g., a domU on iSCSI will do it. It tends to get triggered by
a TCP retransmit or (on NFS) the RPC equivalent. Essentially the block
operation is considered complete, returning through Xen and freeing the
grant table entry, and yet something in the kernel (e.g. a TCP retransmit)
can still access the data. The nature of the bug is extensively discussed
in that thread - you'll also find a reference to a thread on linux-nfs
which concludes it isn't an NFS problem, and even some patches to fix it
in the kernel by adding reference counting.

A workaround is to turn off O_DIRECT use by Xen, as that ensures the
pages are copied. Xen 4.3 does this by default.

I believe fixes for this are in 4.3 and 4.2.2 if using the qemu upstream
DM. Note these aren't real fixes, just a workaround of a kernel bug.

To fix on a local build of Xen you will need something like this:
https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
and something like this (NB: obviously insert your own git repo and
commit numbers):
https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca

Also note those fixes are (technically) unsafe for live migration unless
there is an ordering change made in qemu's block open call.

Of course this might be something completely different.

-- 
Alex Bligh
--On 30 June 2013 10:13:35 +0100 Alex Bligh <alex@alex.org.uk> wrote:

> The nature of the bug is extensively discussed in that thread - you'll
> also find a reference to a thread on linux-nfs which concludes it isn't
> an NFS problem, and even some patches to fix it in the kernel by adding
> reference counting.

Some more links for anyone interested in fixing the kernel bug:

http://lists.xen.org/archives/html/xen-devel/2013-01/msg01618.html
http://www.spinics.net/lists/linux-nfs/msg34913.html
http://www.spinics.net/lists/netdev/msg224106.html

-- 
Alex Bligh
On 06/30/13 17:13, Alex Bligh wrote:
> --On 28 June 2013 12:17:43 +0800 Joe Jin <joe.jin@oracle.com> wrote:
>
>> Find a similar issue
>> http://www.gossamer-threads.com/lists/xen/devel/265611 So copied to Xen
>> developer as well.
>
> I thought this sounded familiar. I haven't got the start of this
> thread, but what version of Xen are you running and what device
> model? If before 4.3, there is a page lifetime bug in the kernel
> (not the Xen code) which can affect anything where the guest accesses
> the host's block stack and that in turn accesses the networking
> stack (it may in fact be wider than that). So, e.g., a domU on
> iSCSI will do it. It tends to get triggered by a TCP retransmit
> or (on NFS) the RPC equivalent. Essentially the block operation
> is considered complete, returning through Xen and freeing the
> grant table entry, and yet something in the kernel (e.g. a TCP
> retransmit) can still access the data. The nature of the bug
> is extensively discussed in that thread - you'll also find
> a reference to a thread on linux-nfs which concludes it
> isn't an NFS problem, and even some patches to fix it in the
> kernel by adding reference counting.

Do you know if there is a fix for the above? So far we also suspect the
grant page is being unmapped too early; we are using 4.1 stable during
our test.

> A workaround is to turn off O_DIRECT use by Xen, as that ensures
> the pages are copied. Xen 4.3 does this by default.
>
> I believe fixes for this are in 4.3 and 4.2.2 if using the
> qemu upstream DM. Note these aren't real fixes, just a workaround
> of a kernel bug.

The guest is PVM, and the disk model is xvbd; the guest config file is
as below:

vif = ['mac=00:21:f6:00:00:01,bridge=c0a80b00']
OVM_simple_name = 'Guest#1'
disk = ['file:/OVS/Repositories/0004fb000003000091e9eae94d1e907c/VirtualDisks/0004fb0000120000f78799dad800ef47.img,xvda,w', 'phy:/dev/mapper/360060e8010141870058b415700000002,xvdb,w', 'phy:/dev/mapper/360060e8010141870058b415700000003,xvdc,w']
bootargs = ''
uuid = '0004fb00-0006-0000-2b00-77a4766001ed'
on_reboot = 'restart'
cpu_weight = 27500
OVM_os_type = 'Oracle Linux 5'
cpu_cap = 0
maxvcpus = 8
OVM_high_availability = False
memory = 4096
OVM_description = ''
on_poweroff = 'destroy'
on_crash = 'restart'
bootloader = '/usr/bin/pygrub'
guest_os_type = 'linux'
name = '0004fb00000600002b0077a4766001ed'
vfb = ['type=vnc,vncunused=1,vnclisten=127.0.0.1,keymap=en-us']
vcpus = 8
OVM_cpu_compat_group = ''
OVM_domain_type = 'xen_pvm'

> To fix on a local build of Xen you will need something like this:
> https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
> and something like this (NB: obviously insert your own git
> repo and commit numbers):
> https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca

I think this is only for PVHVM/HVM?

Thanks,
Joe

> Also note those fixes are (technically) unsafe for live migration
> unless there is an ordering change made in qemu's block open
> call.
>
> Of course this might be something completely different.
On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote:
> > A workaround is to turn off O_DIRECT use by Xen, as that ensures
> > the pages are copied. Xen 4.3 does this by default.
> >
> > I believe fixes for this are in 4.3 and 4.2.2 if using the
> > qemu upstream DM. Note these aren't real fixes, just a workaround
> > of a kernel bug.
>
> The guest is PVM, and the disk model is xvbd; the guest config file is
> as below:

Do you know which disk backend? The workaround Alex refers to went into
qdisk, but I think blkback could still suffer from a variant of the
retransmit issue if you run it over iSCSI.

> > To fix on a local build of Xen you will need something like this:
> > https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
> > and something like this (NB: obviously insert your own git
> > repo and commit numbers):
> > https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca
>
> I think this is only for PVHVM/HVM?

No, the underlying issue affects any PV device which is run over a
network protocol (NFS, iSCSI etc). In effect a delayed retransmit can
cross over the delayed ack and cause I/O to be completed while
retransmits are pending, such as is described in
http://www.spinics.net/lists/linux-nfs/msg34913.html (the original NFS
variant). The problem is that because Xen PV drivers often unmap the
page on I/O completion, you get a crash (page fault) on the retransmit.

The issue also affects native, but in that case the symptom is "just" a
corrupt packet on the wire. I tried to address this with my "skb
destructor" series but unfortunately I got bogged down in the details,
then I had to take time out to look into some other stuff and never
managed to get back to it. I'd be very grateful if someone could pick up
that work (Alex gave some useful references in another reply to this
thread).

Some PV disk backends (e.g. blktap2) have worked around this by using
grant copy instead of grant map; others (e.g. qdisk) have disabled
O_DIRECT so that the pages are copied into the dom0 page cache and
transmitted from there.

We were discussing recently the possibility of mapping all ballooned-out
pages to a single read-only scratch page instead of leaving them empty
in the page tables; this would cause the Xen case to revert to the
native case. I think Thanos was going to take a look into this.

Ian.
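The grant-map versus grant-copy trade-off described above can be shown in
a minimal sketch. This is illustrative only: issue_block_io() and the two
submit_* helpers are assumptions standing in for the real blkback/blktap2
request paths, not their actual code.

/*
 * Hedged sketch, not real backend code. With grant *mapping*, the block
 * I/O and any later TCP retransmit both reference the guest's own page,
 * so tearing the grant down at I/O completion is unsafe. With grant
 * *copying*, the backend works on its own page and the guest page's
 * lifetime is decoupled from the network stack.
 */
#include <string.h>

#define PAGE_SIZE 4096

/* Stand-in for submitting a request to the block layer (assumption). */
static void issue_block_io(void *data, size_t len) { (void)data; (void)len; }

/* Grant-map style: zero copy, but the guest page must outlive retransmits. */
static void submit_mapped(void *mapped_guest_page)
{
    issue_block_io(mapped_guest_page, PAGE_SIZE);
    /* Unmapping here, at "completion", is exactly the bug under
     * discussion: a pending TCP retransmit may still touch this page. */
}

/* Grant-copy style: one memcpy per request, but no page-lifetime problem. */
static void submit_copied(void *mapped_guest_page, void *backend_page)
{
    memcpy(backend_page, mapped_guest_page, PAGE_SIZE);
    issue_block_io(backend_page, PAGE_SIZE);
    /* The grant can be released as soon as the copy has been taken. */
}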
Joe,

> Do you know if there is a fix for the above? So far we also suspect the
> grant page is being unmapped too early; we are using 4.1 stable during
> our test.

A true fix? No, but I posted a patch set (see a later email for a link)
that you could forward port. The workaround is:

>> A workaround is to turn off O_DIRECT use by Xen, as that ensures
>> the pages are copied. Xen 4.3 does this by default.
>>
>> I believe fixes for this are in 4.3 and 4.2.2 if using the
>> qemu upstream DM. Note these aren't real fixes, just a workaround
>> of a kernel bug.
>
> The guest is PVM, and the disk model is xvbd; the guest config file is
> as below:

...

> I think this is only for PVHVM/HVM?

I don't have much experience outside pvhvm/hvm, but I believe it should
work for any device. Testing was simple - just find all (*) the
references to O_DIRECT in your device model and remove them!

(*) = you could be less lazy than me and find the right ones. I am
guessing it will be the same ones though.

-- 
Alex Bligh
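What "remove the references to O_DIRECT" amounts to can be sketched as
follows. This is illustrative only, not the actual qemu patch;
open_backing_file() is a hypothetical helper name.

/*
 * Hedged sketch: wherever the device model opens its backing image with
 * O_DIRECT, dropping the flag makes writes go through the dom0 page
 * cache, so the granted guest page is copied rather than handed directly
 * to the iSCSI/NFS transmit path.
 */
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>

int open_backing_file(const char *path)    /* hypothetical helper */
{
    /* Before (zero copy, exposed to the page-lifetime bug):
     *     return open(path, O_RDWR | O_DIRECT);
     */

    /* Workaround: buffered I/O; data is copied into the page cache first. */
    return open(path, O_RDWR);
}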
On 07/01/13 16:11, Ian Campbell wrote:
> On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote:
>>> A workaround is to turn off O_DIRECT use by Xen, as that ensures
>>> the pages are copied. Xen 4.3 does this by default.
>>>
>>> I believe fixes for this are in 4.3 and 4.2.2 if using the
>>> qemu upstream DM. Note these aren't real fixes, just a workaround
>>> of a kernel bug.
>>
>> The guest is PVM, and the disk model is xvbd; the guest config file is
>> as below:
>
> Do you know which disk backend? The workaround Alex refers to went into
> qdisk, but I think blkback could still suffer from a variant of the
> retransmit issue if you run it over iSCSI.

The backend is xen-blkback on iSCSI storage.

>>> To fix on a local build of Xen you will need something like this:
>>> https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
>>> and something like this (NB: obviously insert your own git
>>> repo and commit numbers):
>>> https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca
>>
>> I think this is only for PVHVM/HVM?
>
> No, the underlying issue affects any PV device which is run over a
> network protocol (NFS, iSCSI etc). In effect a delayed retransmit can
> cross over the delayed ack and cause I/O to be completed while
> retransmits are pending, such as is described in
> http://www.spinics.net/lists/linux-nfs/msg34913.html (the original NFS
> variant). The problem is that because Xen PV drivers often unmap the
> page on I/O completion, you get a crash (page fault) on the retransmit.

To prevent iSCSI's sendpage() from reusing the page we disabled
scatter-gather (sg) on the NIC, and in our tests the panic went away.
This also confirms the page is being unmapped by the grant system; the
symptom is the same as the NFS panic.

> The issue also affects native, but in that case the symptom is "just" a
> corrupt packet on the wire. I tried to address this with my "skb
> destructor" series but unfortunately I got bogged down in the details,
> then I had to take time out to look into some other stuff and never
> managed to get back to it. I'd be very grateful if someone could pick up
> that work (Alex gave some useful references in another reply to this
> thread).
>
> Some PV disk backends (e.g. blktap2) have worked around this by using
> grant copy instead of grant map; others (e.g. qdisk) have disabled
> O_DIRECT so that the pages are copied into the dom0 page cache and
> transmitted from there.

That workaround is much the same as disabling sg on the NIC: with sg
disabled, sendpage() makes its own copy of the data rather than reusing
the page.

Thanks,
Joe

> We were discussing recently the possibility of mapping all ballooned-out
> pages to a single read-only scratch page instead of leaving them empty
> in the page tables; this would cause the Xen case to revert to the
> native case. I think Thanos was going to take a look into this.
>
> Ian.
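For reference, disabling scatter-gather as described above is what
"ethtool -K eth0 sg off" does. A rough userspace equivalent via the
legacy ETHTOOL_SSG ioctl is sketched below; the interface name "eth0" is
an assumption.

/*
 * With scatter-gather disabled, the TCP stack copies the data into the
 * skb (roughly, sendpage() falls back to an ordinary copying send)
 * instead of keeping references to the caller's pages, which is why the
 * panic described above disappears.
 */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ethtool_value ev = { .cmd = ETHTOOL_SSG, .data = 0 }; /* 0 = off */
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* assumed NIC name */
    ifr.ifr_data = (char *)&ev;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("ETHTOOL_SSG");
        return 1;
    }
    close(fd);
    return 0;
}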
On 07/01/13 16:11, Ian Campbell wrote:
> On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote:
>>> A workaround is to turn off O_DIRECT use by Xen, as that ensures
>>> the pages are copied. Xen 4.3 does this by default.
>>>
>>> I believe fixes for this are in 4.3 and 4.2.2 if using the
>>> qemu upstream DM. Note these aren't real fixes, just a workaround
>>> of a kernel bug.
>>
>> The guest is PVM, and the disk model is xvbd; the guest config file is
>> as below:
>
> Do you know which disk backend? The workaround Alex refers to went into
> qdisk, but I think blkback could still suffer from a variant of the
> retransmit issue if you run it over iSCSI.
>
>>> To fix on a local build of Xen you will need something like this:
>>> https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
>>> and something like this (NB: obviously insert your own git
>>> repo and commit numbers):
>>> https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca
>>
>> I think this is only for PVHVM/HVM?
>
> No, the underlying issue affects any PV device which is run over a
> network protocol (NFS, iSCSI etc). In effect a delayed retransmit can
> cross over the delayed ack and cause I/O to be completed while
> retransmits are pending, such as is described in
> http://www.spinics.net/lists/linux-nfs/msg34913.html (the original NFS
> variant). The problem is that because Xen PV drivers often unmap the
> page on I/O completion, you get a crash (page fault) on the retransmit.

Could we handle it by remembering the grant page's refcount at map time,
and at unmap time checking whether the refcount is still the same as it
was at mapping? This change would be limited to xen-blkback.

Another way would be to add a new page flag such as PG_send: when
sendpage() is called, set the bit, and when the page is put, clear the
bit. Then xen-blkback could wait on the page queue.

Thanks,
Joe

> The issue also affects native, but in that case the symptom is "just" a
> corrupt packet on the wire. I tried to address this with my "skb
> destructor" series but unfortunately I got bogged down in the details,
> then I had to take time out to look into some other stuff and never
> managed to get back to it. I'd be very grateful if someone could pick up
> that work (Alex gave some useful references in another reply to this
> thread).
>
> Some PV disk backends (e.g. blktap2) have worked around this by using
> grant copy instead of grant map; others (e.g. qdisk) have disabled
> O_DIRECT so that the pages are copied into the dom0 page cache and
> transmitted from there.
>
> We were discussing recently the possibility of mapping all ballooned-out
> pages to a single read-only scratch page instead of leaving them empty
> in the page tables; this would cause the Xen case to revert to the
> native case. I think Thanos was going to take a look into this.
>
> Ian.

-- 
Oracle <http://www.oracle.com>
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing
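A minimal userspace model of the first proposal, to make the idea
concrete; the names and structure are assumptions, not xen-blkback code,
and Ian's reply below points out where the scheme breaks down.

/*
 * Model only: snapshot the page's reference count when the grant is
 * mapped, and treat the grant as safe to unmap only once the count has
 * dropped back to that snapshot, i.e. once no other user (such as a
 * pending TCP retransmit) still holds a reference.
 */
#include <stdatomic.h>
#include <stdbool.h>

struct granted_page {
    atomic_int refcount;      /* models the struct page refcount */
    int        refs_at_map;   /* snapshot taken at grant-map time */
};

static void grant_map(struct granted_page *p)
{
    p->refs_at_map = atomic_load(&p->refcount);
}

static bool safe_to_unmap(const struct granted_page *p)
{
    /* Defer the unmap while anything beyond the original mapping still
     * holds a reference to the page. */
    return atomic_load(&p->refcount) <= p->refs_at_map;
}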
On Thu, 2013-07-04 at 16:55 +0800, Joe Jin wrote:
> On 07/01/13 16:11, Ian Campbell wrote:
> > On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote:
> > > > A workaround is to turn off O_DIRECT use by Xen, as that ensures
> > > > the pages are copied. Xen 4.3 does this by default.
> > > >
> > > > I believe fixes for this are in 4.3 and 4.2.2 if using the
> > > > qemu upstream DM. Note these aren't real fixes, just a workaround
> > > > of a kernel bug.
> > >
> > > The guest is PVM, and the disk model is xvbd; the guest config file
> > > is as below:
> >
> > Do you know which disk backend? The workaround Alex refers to went
> > into qdisk, but I think blkback could still suffer from a variant of
> > the retransmit issue if you run it over iSCSI.
> >
> > > > To fix on a local build of Xen you will need something like this:
> > > > https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
> > > > and something like this (NB: obviously insert your own git
> > > > repo and commit numbers):
> > > > https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca
> > >
> > > I think this is only for PVHVM/HVM?
> >
> > No, the underlying issue affects any PV device which is run over a
> > network protocol (NFS, iSCSI etc). In effect a delayed retransmit can
> > cross over the delayed ack and cause I/O to be completed while
> > retransmits are pending, such as is described in
> > http://www.spinics.net/lists/linux-nfs/msg34913.html (the original NFS
> > variant). The problem is that because Xen PV drivers often unmap the
> > page on I/O completion, you get a crash (page fault) on the retransmit.
>
> Could we handle it by remembering the grant page's refcount at map time,
> and at unmap time checking whether the refcount is still the same as it
> was at mapping? This change would be limited to xen-blkback.
>
> Another way would be to add a new page flag such as PG_send: when
> sendpage() is called, set the bit, and when the page is put, clear the
> bit. Then xen-blkback could wait on the page queue.

These schemes don't work when you have multiple simultaneous I/Os
referencing the same underlying page.

> Thanks,
> Joe
>
> > The issue also affects native, but in that case the symptom is "just"
> > a corrupt packet on the wire. I tried to address this with my "skb
> > destructor" series but unfortunately I got bogged down in the details,
> > then I had to take time out to look into some other stuff and never
> > managed to get back to it. I'd be very grateful if someone could pick
> > up that work (Alex gave some useful references in another reply to
> > this thread).
> >
> > Some PV disk backends (e.g. blktap2) have worked around this by using
> > grant copy instead of grant map; others (e.g. qdisk) have disabled
> > O_DIRECT so that the pages are copied into the dom0 page cache and
> > transmitted from there.
> >
> > We were discussing recently the possibility of mapping all
> > ballooned-out pages to a single read-only scratch page instead of
> > leaving them empty in the page tables; this would cause the Xen case
> > to revert to the native case. I think Thanos was going to take a look
> > into this.
> >
> > Ian.
On Thu, 2013-07-04 at 09:59 +0100, Ian Campbell wrote:
> On Thu, 2013-07-04 at 16:55 +0800, Joe Jin wrote:
> >
> > Another way would be to add a new page flag such as PG_send: when
> > sendpage() is called, set the bit, and when the page is put, clear the
> > bit. Then xen-blkback could wait on the page queue.
>
> These schemes don't work when you have multiple simultaneous I/Os
> referencing the same underlying page.

So this is a page property, yet the patches I saw tried to address this
problem by adding networking stuff (destructors) to the skbs.

Given that a page refcount can be transferred between entities, say using
the splice() system call, I do not really understand why the fix would
involve networking only.

Let's try to fix it properly, or else we must disable zero copies because
they are not reliable.

Why doesn't sendfile() have this problem, while vmsplice()+splice() does?

As soon as a page fragment reference is taken somewhere, the only way to
properly reuse the page is to rely on put_page() and the page being
freed.

Adding workarounds in the TCP stack to always copy the page fragments in
case of a retransmit is a partial solution, as the remote peer could be
malicious and send the ACK _before_ the page content is actually read by
the NIC.

So if we rely on networking stacks to give the signal for page reuse, we
can have a major security issue.
On Thu, 2013-07-04 at 02:34 -0700, Eric Dumazet wrote:
> On Thu, 2013-07-04 at 09:59 +0100, Ian Campbell wrote:
> > On Thu, 2013-07-04 at 16:55 +0800, Joe Jin wrote:
> > >
> > > Another way would be to add a new page flag such as PG_send: when
> > > sendpage() is called, set the bit, and when the page is put, clear
> > > the bit. Then xen-blkback could wait on the page queue.
> >
> > These schemes don't work when you have multiple simultaneous I/Os
> > referencing the same underlying page.
>
> So this is a page property, yet the patches I saw tried to address this
> problem by adding networking stuff (destructors) to the skbs.
>
> Given that a page refcount can be transferred between entities, say
> using the splice() system call, I do not really understand why the fix
> would involve networking only.
>
> Let's try to fix it properly, or else we must disable zero copies
> because they are not reliable.
>
> Why doesn't sendfile() have this problem, while vmsplice()+splice()
> does?

Might just be that no one has observed it with vmsplice()+splice()? Most
of the time this happens silently and you'll probably never notice; it's
just the behaviour of Xen which escalates the issue into one you can see.

> As soon as a page fragment reference is taken somewhere, the only way
> to properly reuse the page is to rely on put_page() and the page being
> freed.

Xen's out-of-tree netback used to fix this with a destructor callback on
page free, but that was a core mm patch in the hot memory-free path which
wasn't popular, and it doesn't solve anything for the non-Xen instances
of this issue.

> Adding workarounds in the TCP stack to always copy the page fragments
> in case of a retransmit is a partial solution, as the remote peer could
> be malicious and send the ACK _before_ the page content is actually
> read by the NIC.
>
> So if we rely on networking stacks to give the signal for page reuse,
> we can have a major security issue.

If you ignore the Xen case and consider just the native case, then the
issue isn't page reuse in the sense of the page getting mapped into
another process; it's the same page in the same process, but the process
has written something new to the buffer, e.g.

	memset(buf, 0xaa, 4096);
	write(fd, buf, 4096);
	memset(buf, 0x55, 4096);

(where fd is O_DIRECT on NFS) can result in 0x55 being seen on the wire
in the TCP retransmit.

If the retransmit is at the RPC layer then you get a resend of the NFS
write RPC, but the XDR sequence stuff catches that case (I think, memory
is fuzzy). If the retransmit is at the TCP level then the TCP
sequence/ack will cause the receiver to ignore the corrupt version, but
if you replace the second memset with write_critical_secret_key(buf),
then you have an information leak.

Ian.
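Spelled out, the example above expands to something like the program
below. The mount path is an assumption, and the buffer must be aligned
for O_DIRECT; on an affected O_DIRECT-over-NFS (or iSCSI) setup a TCP
retransmit that fires after the second memset can put the new buffer
contents on the wire even though write() has already returned.

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, 4096, 4096))      /* O_DIRECT needs alignment */
        return 1;

    int fd = open("/mnt/nfs/testfile", O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        return 1;

    memset(buf, 0xaa, 4096);
    if (write(fd, buf, 4096) != 4096)   /* "completes" here... */
        return 1;
    memset(buf, 0x55, 4096);            /* ...but a retransmit may send this */

    close(fd);
    free(buf);
    return 0;
}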
On Thu, 2013-07-04 at 10:52 +0100, Ian Campbell wrote:
> Might just be that no one has observed it with vmsplice()+splice()? Most
> of the time this happens silently and you'll probably never notice; it's
> just the behaviour of Xen which escalates the issue into one you can
> see.

The point I wanted to make is that nobody can seriously use vmsplice()
unless the memory is never reused by the application, or the application
doesn't care about the security implications, because an application has
no way to know when it's safe to reuse the area for another purpose.

[ Unless it uses the obscure and complex pagemap stuff
  (Documentation/vm/pagemap.txt), but that is not asynchronous signalling
  and not pluggable into epoll()/poll()/select(). ]

> Xen's out-of-tree netback used to fix this with a destructor callback on
> page free, but that was a core mm patch in the hot memory-free path
> which wasn't popular, and it doesn't solve anything for the non-Xen
> instances of this issue.

It _is_ a core mm patch which is needed, if we ever want to fix this
problem.

It looks like a typical COW issue to me.

If the page content is written while there is still a reference on this
page, we should allocate a new page and copy the previous content.

And this has little to do with networking.
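A minimal userspace model of the copy-on-write behaviour being suggested
here; the names and structure are assumptions, not a real mm patch.

/*
 * Model only: before the owner modifies a page that something else (a
 * pending transmit, a splice() reader) still references, hand the owner
 * a fresh copy and leave the original untouched for the remaining users.
 */
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct buf_page {
    atomic_int refs;          /* models the struct page refcount */
    char       data[4096];
};

static struct buf_page *prepare_for_write(struct buf_page *p)
{
    if (atomic_load(&p->refs) == 1)
        return p;                         /* sole owner: write in place */

    struct buf_page *fresh = malloc(sizeof(*fresh));
    if (!fresh)
        return NULL;
    memcpy(fresh->data, p->data, sizeof(p->data));
    atomic_init(&fresh->refs, 1);
    atomic_fetch_sub(&p->refs, 1);        /* old page lives on for the NIC */
    return fresh;
}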
--On 4 July 2013 03:12:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> It looks like a typical COW issue to me.
>
> If the page content is written while there is still a reference on this
> page, we should allocate a new page and copy the previous content.
>
> And this has little to do with networking.

I suspect this would get more attention if we could make Ian's case below
trigger (a) outside Xen, (b) outside networking.

> memset(buf, 0xaa, 4096);
> write(fd, buf, 4096);
> memset(buf, 0x55, 4096);
>
> (where fd is O_DIRECT on NFS) can result in 0x55 being seen on the wire
> in the TCP retransmit.

We know this should fail using O_DIRECT+NFS. We've had reports suggesting
it fails with O_DIRECT+iSCSI; however, that has been with a kernel panic
(under Xen) rather than data corruption as per the above. Historical
trawling suggests this is an issue with DRBD as well (see Ian's original
thread from the mists of time). I don't quite understand why we aren't
seeing corruption with standard ATA devices + O_DIRECT and no Xen
involved at all.

My memory is a bit misty on this, but I had thought the reason why this
would NOT be solved simply by O_DIRECT taking a reference to the page was
that the O_DIRECT I/O completes (and thus the reference would be freed
up) before the networking stack has actually finished with the page. If
the O_DIRECT I/O did not complete until the page was actually finished
with, we wouldn't see the problem in the first place. I may be completely
off base here.

-- 
Alex Bligh