Hi all,
these are Konrad's and my notes (mostly Konrad's) on possible improvements of the PV network protocol, taken at the Hackathon.

A) Network bandwidth: multipage rings
The maximum amount of outstanding data a ring can have is 896KB (64KB of data uses 18 slots out of 256; 256 / 18 = 14, 14 * 64KB = 896KB). This can be expanded by using multiple pages for the ring. This would benefit NFS and bulk data transfer (such as netperf data).

B) Producer and consumer indexes are on the same cache line
On present hardware that means the reader and writer will compete for the same cacheline, causing a ping-pong between sockets. This can be solved by having a feature-split-indexes (or better name) where the req_prod and req_event tuple is kept apart from the rsp_prod and rsp_event tuple. This would entail using 128 bytes at the start of the ring - one cacheline for each tuple.

C) Cache alignment of requests
The fix is to make the request structures more cache-aligned. For networking that means making each request 16 bytes, and for block 64 bytes. Since it does not shrink the structure but just expands it, it could be called feature-align-slot.

E) Multiqueue (request-feature-multiqueue)
It means creating many TX and RX rings for each vif.

F) Don't gnt_copy all of the requests
Instead don't touch them and let the Xen IOMMU create appropriate entries. This would require the DMA API in dom0 to be aware of whether the grant has been done and, if not (so FOREIGN, aka no m2p_override), to do the hypercall that tells the hypervisor that this grant is going to be used by a specific PCI device. This would create the IOMMU entry in Xen.

G) On TX side, do persistent grant mapping
This would only be done on the frontend -> backend path. That means that we could exhaust the initial domain's memory.

H) Affinity of the frontend and backend being on the same NUMA node
This touches upon the discussion about NUMA and having PV guests be aware of memory layout. It also means that each backend kthread needs to be on a different NUMA node.

I) Separate request and response rings for TX and RX

J) Map the whole physical memory of the machine in dom0
If mapping/unmapping or copying slows us down, could we just keep the whole physical memory of the machine mapped in dom0 (with corresponding IOMMU entries)? At that point the frontend could just pass mfn numbers to the backend, and the backend would already have them mapped.
From a security perspective it doesn't change anything when running the backend in dom0, because dom0 is already capable of mapping random pages of any guest. QEMU instances do that all the time.
But it would take away one of the benefits of deploying driver domains: we wouldn't be able to run the backends at a lower privilege level. However, it might still be worth considering as an option. The backend is still trusted and protected from the frontend, but the frontend wouldn't be protected from the backend.
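
The arithmetic in item A can be written out as a small calculation. This is only a sketch: the 256-slot and 18-slot figures are taken from the notes, while the function name and the assumption that the slot count grows linearly with the number of ring pages are illustrative and not part of any agreed protocol.

```python
# Sketch of the arithmetic in item A. All names are illustrative.
SLOTS_PER_PAGE = 256         # slots in a single-page ring (figure from the notes)
SLOTS_PER_64K_PACKET = 18    # slots used by 64KB of data (figure from the notes)
PACKET_BYTES = 64 * 1024


def max_outstanding_bytes(ring_pages: int) -> int:
    """Upper bound on in-flight data for a ring spanning `ring_pages` pages.

    Assumption: the slot count scales linearly with the number of pages.
    """
    if ring_pages < 1:
        raise ValueError("a ring needs at least one page")
    slots = SLOTS_PER_PAGE * ring_pages
    packets_in_flight = slots // SLOTS_PER_64K_PACKET
    return packets_in_flight * PACKET_BYTES


if __name__ == "__main__":
    for pages in (1, 2, 4):
        print(pages, "page(s):", max_outstanding_bytes(pages) // 1024, "KB")
```

Under these assumptions the single-page case gives 14 * 64KB = 896KB, and a four-page ring would allow 56 such packets in flight.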
On Mon, May 20, 2013 at 3:08 PM, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> [...]
> J) Map the whole physical memory of the machine in dom0
> If mapping/unmapping or copying slows us down, could we just keep the whole physical memory of the machine mapped in dom0 (with corresponding IOMMU entries)?
> At that point the frontend could just pass mfn numbers to the backend, and the backend would already have them mapped.
> From a security perspective it doesn't change anything when running the backend in dom0, because dom0 is already capable of mapping random pages of any guest. QEMU instances do that all the time.
> But it would take away one of the benefits of deploying driver domains: we wouldn't be able to run the backends at a lower privilege level.
> However, it might still be worth considering as an option. The backend is still trusted and protected from the frontend, but the frontend wouldn't be protected from the backend.

What's missing from this was my side of the discussion:

I was saying that if TLB flushes from grant-unmap are indeed the problem, then maybe we could have the *front-end* in charge of requesting a TLB flush for its pages. The strict TLB flushing is there to protect a front-end from rogue back-ends reading sensitive data; if the front-end were willing to just not use the pages for a short amount of time, and issue a flush say every second or so, that would reduce the TLB flushes greatly while maintaining the safety advantages of driver domains.

-George
On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> Hi all,
> these are Konrad's and my notes (mostly Konrad's) on possible improvements of the PV network protocol, taken at the Hackathon.

Just for completeness, these items are future work items. I'm now upstreaming my queues to lay a baseline for these items, which include:

1. split event channels support (generally useful)
2. netback global page pool (prerequisite for 1:1 model)
3. kthread + NAPI 1:1 model (prerequisite for multiqueue)

> A) Network bandwidth: multipage rings
> [...]

This is in my queue as well. It's a generic change in the xenbus interface which can benefit not only network but also block devices.

> [...]
> J) Map the whole physical memory of the machine in dom0
> [...]

I think Dom0 mapping all machine memory is a good starting point. As for the driver domain, can we not have a driver domain map all of its target's machine memory? What's the security implication here?

Wei.
On Mon, May 20, 2013 at 03:49:32PM +0100, George Dunlap wrote:
[...]
> > J) Map the whole physical memory of the machine in dom0
> > [...]
>
> What's missing from this was my side of the discussion:
>
> I was saying that if TLB flushes from grant-unmap are indeed the problem, then maybe we could have the *front-end* in charge of requesting a TLB flush for its pages. The strict TLB flushing is there to protect a front-end from rogue back-ends reading sensitive data; if the front-end were willing to just not use the pages for a short amount of time, and issue a flush say every second or so, that would reduce the TLB flushes greatly while maintaining the safety advantages of driver domains.

I'm not sure I get what you mean here. Are you saying DomU flushes Dom0's TLB entries?

Wei.
On 2013-5-20 10:08, Stefano Stabellini wrote:
> Hi all,
> these are Konrad's and my notes (mostly Konrad's) on possible improvements of the PV network protocol, taken at the Hackathon.
>
> [...]
>
> G) On TX side, do persistent grant mapping
> This would only be done on the frontend -> backend path. That means that we could exhaust the initial domain's memory.

I wrote a persistent grant mapping patch for both the TX and RX sides a while ago; I could keep the TX side persistent and optimize it.

Thanks
Annie
On Mon, 2013-05-20 at 19:33 +0100, Wei Liu wrote:
> On Mon, May 20, 2013 at 03:49:32PM +0100, George Dunlap wrote:
> [...]
> > I was saying that if TLB flushes from grant-unmap are indeed the problem, then maybe we could have the *front-end* in charge of requesting a TLB flush for its pages. The strict TLB flushing is there to protect a front-end from rogue back-ends reading sensitive data; if the front-end were willing to just not use the pages for a short amount of time, and issue a flush say every second or so, that would reduce the TLB flushes greatly while maintaining the safety advantages of driver domains.
>
> I'm not sure I get what you mean here. Are you saying DomU flushes Dom0's TLB entries?

The gnt_unmap made by dom0 needs to flush the TLB of any physical processor which may have seen the mapping, which means approximately all dom0 vcpus.
On 05/21/2013 09:22 AM, Ian Campbell wrote:
> On Mon, 2013-05-20 at 19:33 +0100, Wei Liu wrote:
>> [...]
>> I'm not sure I get what you mean here. Are you saying DomU flushes Dom0's TLB entries?
>
> The gnt_unmap made by dom0 needs to flush the TLB of any physical processor which may have seen the mapping, which means approximately all dom0 vcpus.

That's what I was getting at. It's Xen that does any actual TLB flushes, and for now the "promise" to the front-end is that the page is safe from the backend* after the transaction is done. But it would be nicer if we could batch these flushes to happen once every few hundred milliseconds, or even once a second. If we allowed the front-end to choose a new interface that said "the page is safe from the backend once you have made this hypercall", then guests could choose the "window" size based on their own parameters.

* Remember that the point of grant maps isn't to allow *dom0* access to the guests; dom0 already has all the access it needs. It's to allow driver domains access to the guests.

-George
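
The batching idea above can be modelled in a few lines. This is a toy sketch, not Xen code: `DeferredFlushPool`, its method names and the `flush_hypercall` callback are all hypothetical, and the callback merely stands in for the proposed "the page is safe once you have made this hypercall" interface, which does not exist today.

```python
class DeferredFlushPool:
    """Toy model of a front-end that batches TLB-flush requests.

    `flush_hypercall` is a stand-in for the proposed (hypothetical) hypercall.
    """

    def __init__(self, flush_hypercall, window_seconds=1.0):
        self._flush = flush_hypercall
        self._window = window_seconds
        self._quarantine = []    # pages the backend has finished with, not yet flushed
        self._free = []          # pages known to be safe from the backend
        self._last_flush = 0.0

    def release(self, page, now):
        """The transaction on `page` is done; hold the page until the next flush."""
        self._quarantine.append(page)
        if now - self._last_flush >= self._window:
            self.flush(now)

    def flush(self, now):
        """A single flush covers every quarantined page."""
        if self._quarantine:
            self._flush()
            self._free.extend(self._quarantine)
            self._quarantine.clear()
        self._last_flush = now

    def take_free_page(self):
        """Return a page that is safe to reuse, or None if none is available."""
        return self._free.pop() if self._free else None
```

With a one-second window, a hundred pages released within the same window cost one flush instead of a hundred; the price is that those pages cannot be reused until the window closes.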
On Mon, 2013-05-20 at 19:31 +0100, Wei Liu wrote:
> On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> [...]
> > J) Map the whole physical memory of the machine in dom0
> > [...]
> > From a security perspective it doesn't change anything when running the backend in dom0, because dom0 is already capable of mapping random pages of any guest. QEMU instances do that all the time.

Actually there are mechanisms in place to remove this privilege from dom0; specifically, there is an XSM class (terminology?) for non-migratable domains which effectively equates to exactly this restriction. Of course you need stub qemu too.

> > But it would take away one of the benefits of deploying driver domains: we wouldn't be able to run the backends at a lower privilege level.
> > However, it might still be worth considering as an option. The backend is still trusted and protected from the frontend, but the frontend wouldn't be protected from the backend.
>
> I think Dom0 mapping all machine memory is a good starting point. As for the driver domain, can we not have a driver domain map all of its target's machine memory? What's the security implication here?

It gives the driver domain an enormous amount of privilege which it doesn't require and which it could use to compromise the integrity of the system (i.e. to snoop any guest's memory and extract "secrets"). It reduces our security/isolation story to "effectively equivalent to KVM", and this isolation is one of the big selling points for Xen.

I don't think we should go down this path either for dom0 or driver domains, and I am absolutely positive that there are other approaches we should be investigating before we even start to consider it.

George's idea of not flushing at unmap time, with co-operation from the frontend to not reuse the pages until it has batched up a bigger flush, seems like an interesting one to look into. By choosing the sizes and times correctly it may even be that by the time domU wants to reuse the page the TLB has already been flushed for some other reason (context switch etc.) and the hypervisor can elide the expense. There are probably mechanisms in the guest kernels which allow us to hold on to memory but still provide a memory pressure hook so we can flush immediately instead of OOMing.

Ian.
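
One way to picture the "already flushed for some other reason" case is a flush counter that is compared at reuse time. The sketch below is a simplification under stated assumptions: it uses a single global epoch where a real hypervisor would track this per physical CPU, and `FlushEpochTracker` and its methods are hypothetical names, not an existing interface.

```python
class FlushEpochTracker:
    """Toy model: skip the explicit flush if one already happened since unmap."""

    def __init__(self):
        self.epoch = 0              # bumped on every TLB flush, whatever the cause
        self.explicit_flushes = 0   # flushes we had to pay for ourselves
        self._unmapped_at = {}      # page -> epoch observed at unmap time

    def note_flush(self):
        """Any flush (context switch, another batch, ...) advances the epoch."""
        self.epoch += 1

    def unmap(self, page):
        """Record when the backend's mapping of `page` was torn down."""
        self._unmapped_at[page] = self.epoch

    def make_reusable(self, page):
        """Flush only if no flush has happened since `page` was unmapped."""
        if self._unmapped_at.pop(page) == self.epoch:
            self.explicit_flushes += 1
            self.note_flush()
```

A page that sits idle across an unrelated flush becomes reusable for free; only a page reused immediately after unmap pays for an explicit flush.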
At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > J) Map the whole physical memory of the machine in dom0
> > [...]
>
> I think Dom0 mapping all machine memory is a good starting point.

I _strongly_ disagree. The opportunity for disaggregation and reduction of privilege in backends is probably Xen's biggest technical advantage and we should not be taking any backward steps there.

> As for the driver domain, can we not have a driver domain map all of its target's machine memory? What's the security implication here?

If, say, a network driver domain is compromised, it's the difference between intercepting network traffic and total control of the OS.

It's probably worth reading some of the Xen papers about this stuff, if you haven't already:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.6391
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.229.3708

Tim.
On Tue, May 21, 2013 at 10:26:00AM +0100, Tim Deegan wrote:
> At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > J) Map the whole physical memory of the machine in dom0
> > > [...]
> >
> > I think Dom0 mapping all machine memory is a good starting point.
>
> I _strongly_ disagree. The opportunity for disaggregation and reduction of privilege in backends is probably Xen's biggest technical advantage and we should not be taking any backward steps there.

I agree with you that disaggregation and reduction of privilege is Xen's biggest technical advantage.

Just to make it clear, this idea was summarized from a discussion among George, Stefano and me on the way back from the hackathon. We want to see if things like mapping / unmapping incur a heavy performance penalty. As it is now really hard to identify the real performance bottleneck, we would like to have some quick hack to see how things work.

> > As for the driver domain, can we not have a driver domain map all of its target's machine memory? What's the security implication here?
>
> If, say, a network driver domain is compromised, it's the difference between intercepting network traffic and total control of the OS.
>
> It's probably worth reading some of the Xen papers about this stuff, if you haven't already:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.6391
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.229.3708

Thanks Tim. I read them before. :-)

We're just talking about some experimental things here, not something that is set in stone and must be done in the future.

Wei.

> Tim.
On Tue, May 21, 2013 at 10:26 AM, Tim Deegan <tim@xen.org> wrote:
> At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
>> On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
>> > J) Map the whole physical memory of the machine in dom0
>> > [...]
>>
>> I think Dom0 mapping all machine memory is a good starting point.
>
> I _strongly_ disagree. The opportunity for disaggregation and reduction of privilege in backends is probably Xen's biggest technical advantage and we should not be taking any backward steps there.

I think Wei meant, "A good point to start the investigation". If having all the memory mapped doesn't give any performance advantage, then a more complicated interface to avoid TLB flushes is most likely a waste of time. If it does, then we can try to see if we can find a way to get performance without giving up security.

-George
On Tue, May 21, 2013 at 11:01:51AM +0100, George Dunlap wrote:
> On Tue, May 21, 2013 at 10:26 AM, Tim Deegan <tim@xen.org> wrote:
> > [...]
> > I _strongly_ disagree. The opportunity for disaggregation and reduction of privilege in backends is probably Xen's biggest technical advantage and we should not be taking any backward steps there.
>
> I think Wei meant, "A good point to start the investigation".

Yes that's what I meant. :-)

Wei.
At 10:39 +0100 on 21 May (1369132774), Wei Liu wrote:
> On Tue, May 21, 2013 at 10:26:00AM +0100, Tim Deegan wrote:
> > [...]
> > I _strongly_ disagree. The opportunity for disaggregation and reduction of privilege in backends is probably Xen's biggest technical advantage and we should not be taking any backward steps there.
>
> I agree with you that disaggregation and reduction of privilege is Xen's biggest technical advantage.
>
> Just to make it clear, this idea was summarized from a discussion among George, Stefano and me on the way back from the hackathon. We want to see if things like mapping / unmapping incur a heavy performance penalty.

Ah, I see. :) As an experiment to measure the overheads it's obviously a Good Thing. I thought you were considering it as a _solution_ to the perf problem!

Cheers,

Tim.
On Tue, 21 May 2013, Tim Deegan wrote:
> At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > J) Map the whole physical memory of the machine in dom0
> > > [...]
> >
> > I think Dom0 mapping all machine memory is a good starting point.
>
> I _strongly_ disagree. The opportunity for disaggregation and reduction of privilege in backends is probably Xen's biggest technical advantage and we should not be taking any backward steps there.

While I agree with you, as a matter of fact the vast majority of Xen installations today do not use driver domains. That hasn't stopped them from enjoying Xen so far. Moreover, the frontend/backend interface remains narrow and difficult to exploit; it's not a fully emulated interface (AHCI / virtio). The backend is still protected from the frontend.

Having the backend running non-privileged is a great bonus and certainly required on a product that allows the user to install third-party driver domains. However, if the driver domains are "trusted" then I think they can also be trusted with a full memory map. After all, it has been the case for all XenServer, OVM and SLES releases so far AFAIK.

A hypothetical future Xen release could offer either increased security (driver domains) or increased IO performance (backends with a full physical memory map) and give the user a choice between the two. I am pretty sure that a non-negligible number of people would make the conscious choice to go for the performance option. Why should we be the ones to force security down their throats? After all, it's all about what the users want from the project.

Obviously in an ideal world we would be able to offer both at the same time, and maybe George's proposal is exactly what is going to achieve that. But I was describing the case that requires us to make a choice.
Konrad Rzeszutek Wilk
2013-May-21 12:52 UTC
Re: [Hackathon minutes] PV network improvements
On Tue, May 21, 2013 at 11:51:03AM +0100, Stefano Stabellini wrote:> On Tue, 21 May 2013, Tim Deegan wrote: > > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote: > > > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote: > > > > J) Map the whole physical memory of the machine in dom0 > > > > If mapping/unmapping or copying slows us down, could we just keep the > > > > whole physical memory of the machine mapped in dom0 (with corresponding > > > > IOMMU entries)? > > > > At that point the frontend could just pass mfn numbers to the backend, > > > > and the backend would already have them mapped. > > > > >From a security perspective it doesn''t change anything when running > > > > the backend in dom0, because dom0 is already capable of mapping random > > > > pages of any guests. QEMU instances do that all the time. > > > > But it would take away one of the benefits of deploying driver domains: > > > > we wouldn''t be able to run the backends at a lower privilege level. > > > > However it might still be worth considering as an option? The backend is > > > > still trusted and protected from the frontend, but the frontend wouldn''t > > > > be protected from the backend. > > > > > > > > > > I think Dom0 mapping all machine memory is a good starting point. > > > > I _strongly_ disagree. The opportunity for disaggregation and reduction > > of privilege in backends is probably Xen''s biggest techical advantage > > and we should not be taking any backward steps there. > > While I agree with you, as a matter of fact the vast majority of Xen > installations today do not use driver domains. That didn''t stop them > from enjoying Xen so far. Moreover the frontend/backend interface > remains narrow and difficult to exploit, it''s not a fully emulated > interface (AHCI / virtio). The backend is still protected from the > frontend. 
> Having the backend running non-privileged is a great bonus
> and certainly required on a product that allows the user to install
> third party driver domains. However if the driver domains are "trusted"
> then I think they can also be trusted with a full memory map. After all
> it has been the case for all XenServer, OVM and SLES releases so far
> AFAIK.
>
> An hypothetic future Xen release could offer both increased security
> (driver domains) or increased IO performances (backends with a full
> physical memory map) and give the user a choice between the two. I am
> pretty sure that a non-negligible amount of people would make the
> conscious choice to go for the performance option.
> Why should we be the ones to force security down their throats?
> After all it's all about what the users want from the project.
>
> Obviously in an ideal world we would be able to offer both at the same
> time, and maybe George's proposal is exactly what is going to achieve
> that. But I was describing the case that requires us to make a choice.

CC-ing Mukesh here as driver domains have some relevance to PVH work.
Please also CC Malcolm here (I don't have his email).

I would say that perhaps a better option is to do both - as in retain
the security architecture Xen has _and_ also provide increased IO performance.

Concurrently, everybody is also looking at both backend and frontend having a
persistent pool of grants. This means we set up a "window" from either
backend -> frontend or vice versa that persists. Said "window" is bolted
for the lifetime of the guest. For networking, the kernel stack already
copies the pages from user space into the kernel, and copying
within the kernel to specific pages mostly uses the CPU cache. We need to
exploit that and also make sure that the path is not interrupted.
The grant_mapping on the TX side also looks like a nice path - we just have to
make sure that the networking API doesn't try to free the page once the TX
has been done (and this is where Ian's skb destructor would be beneficial).

For block it is a bit different, as aio's are mapped from kernel to
user space. But the neat thing there is that there is no need to inspect
the data when giving it to the DMA device (the exception is DIF/DIX, which
needs to calculate checksums). That is, unless one needs to do the
xen_biovec_phys_mergeable check (to see if the next page is contiguous and
if so add new bio's and copy the data in).

But with PVH and PVHVM driver domains, and also piggybacking on the work
that Malcolm is doing (Xen IOMMU), we can skip that check (as the PFNs
for the guest would look contiguous).

In essence we can do a lot:
 1). Not copy or map grants if we detect that they are going to
     a DMA device.
 2). The 1) above + also use the Xen IOMMU to take care of setting the
     proper EPT entries for the pages that we need. This could be done
     as part of a grant_copy or grant_light_mapping in the hypervisor. This
     is for the case where we MUST copy those pages into the other domain
     (say the Ethernet header). Either a copy is done or a light mapping
     (because the moment the device does the DMA operation on the granted
     page we might as well remove the mapping, hence the "light" or
     maybe "expiring" grant).
 3). The 2) above + Intel QuickData (a DMA engine that uses the same
     L3 cache that PCI devices use) to keep the copied pages in the L3.
     This has the benefit that when the PCI device is instructed to
     fetch the data, it would do it from the L3 cache and be incredibly quick.
     This would be using the grant_copy, but instead of the hypervisor
     doing it, it instructs the Intel QuickData chipset to do it. This would
     require some form of asynchronous grant_copy mechanism.
 4). Variants of the above.
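
To make the contiguity point for the block path concrete, here is a minimal sketch in Python (not kernel code). It assumes a toy pfn-to-frame table and two hypothetical helpers, machine_contiguous and count_dma_segments, which are illustrations only and not the real xen_biovec_phys_mergeable implementation. It shows why adjacent guest pages may not be mergeable for DMA when the device sees scattered machine frames, and why the check becomes trivial if an IOMMU presents the guest's PFNs to the device as-is.

```python
from typing import Dict, List


def machine_contiguous(p2m: Dict[int, int], pfn_a: int, pfn_b: int) -> bool:
    """Return True if the frame behind pfn_b directly follows the frame behind pfn_a.

    p2m is a toy pfn -> device-visible frame table (an assumption for this sketch).
    """
    return p2m[pfn_b] == p2m[pfn_a] + 1


def count_dma_segments(p2m: Dict[int, int], pfns: List[int]) -> int:
    """Count the DMA segments needed for a run of guest pages.

    Two adjacent pages share a segment only if their device-visible
    frames are contiguous; otherwise a new segment has to be started.
    """
    if not pfns:
        return 0
    segments = 1
    for prev, cur in zip(pfns, pfns[1:]):
        if not machine_contiguous(p2m, prev, cur):
            segments += 1
    return segments


# PV-style layout (made-up numbers): guest PFNs map to scattered machine frames.
pv_p2m = {0: 0x1000, 1: 0x1001, 2: 0x2400, 3: 0x0830}

# IOMMU-translated layout (assumption): the device addresses guest PFNs directly,
# so frames that are adjacent for the guest are adjacent for the device too.
iommu_p2m = {pfn: pfn for pfn in pv_p2m}

request = [0, 1, 2, 3]
pv_segments = count_dma_segments(pv_p2m, request)
iommu_segments = count_dma_segments(iommu_p2m, request)
print(f"PV layout: {pv_segments} segments, IOMMU layout: {iommu_segments} segment(s)")
```

With the scattered table the four-page request splits wherever the machine frames stop being adjacent, while the identity table always yields a single segment, which is the sense in which the check can be skipped.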
On Tue, 21 May 2013, Konrad Rzeszutek Wilk wrote:
> On Tue, May 21, 2013 at 11:51:03AM +0100, Stefano Stabellini wrote:
> > On Tue, 21 May 2013, Tim Deegan wrote:
> > > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > > > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > > > J) Map the whole physical memory of the machine in dom0
> > > > > If mapping/unmapping or copying slows us down, could we just keep the
> > > > > whole physical memory of the machine mapped in dom0 (with corresponding
> > > > > IOMMU entries)?
> > > > > At that point the frontend could just pass mfn numbers to the backend,
> > > > > and the backend would already have them mapped.
> > > > > From a security perspective it doesn't change anything when running
> > > > > the backend in dom0, because dom0 is already capable of mapping random
> > > > > pages of any guests. QEMU instances do that all the time.
> > > > > But it would take away one of the benefits of deploying driver domains:
> > > > > we wouldn't be able to run the backends at a lower privilege level.
> > > > > However it might still be worth considering as an option? The backend is
> > > > > still trusted and protected from the frontend, but the frontend wouldn't
> > > > > be protected from the backend.
> > > > >
> > > >
> > > > I think Dom0 mapping all machine memory is a good starting point.
> > >
> > > I _strongly_ disagree. The opportunity for disaggregation and reduction
> > > of privilege in backends is probably Xen's biggest technical advantage
> > > and we should not be taking any backward steps there.
> >
> > While I agree with you, as a matter of fact the vast majority of Xen
> > installations today do not use driver domains. That didn't stop them
> > from enjoying Xen so far. Moreover the frontend/backend interface
> > remains narrow and difficult to exploit, it's not a fully emulated
> > interface (AHCI / virtio). The backend is still protected from the
> > frontend.
> > Having the backend running non-privileged is a great bonus
> > and certainly required on a product that allows the user to install
> > third party driver domains. However if the driver domains are "trusted"
> > then I think they can also be trusted with a full memory map. After all
> > it has been the case for all XenServer, OVM and SLES releases so far
> > AFAIK.
> >
> > An hypothetic future Xen release could offer both increased security
> > (driver domains) or increased IO performances (backends with a full
> > physical memory map) and give the user a choice between the two. I am
> > pretty sure that a non-negligible amount of people would make the
> > conscious choice to go for the performance option.
> > Why should we be the ones to force security down their throats?
> > After all it's all about what the users want from the project.
> >
> > Obviously in an ideal world we would be able to offer both at the same
> > time, and maybe George's proposal is exactly what is going to achieve
> > that. But I was describing the case that requires us to make a choice.
>
> CC-ing Mukesh here as driver domains have some relevance to PVH work.
> Please also CC Malcolm here (I don't have his email).
>
> I would say that perhaps a better option is to do both - as in retain
> the security architecture Xen has _and_ also provide increased IO performance.

Of course that is the best option.
However, I think that we should know exactly what the level of performance would be if we had all the memory mapped in the backend domain all the time. It would be very useful for understanding what we need to optimize. It might turn out that the difference is not that much, and we need to optimize something else. Or it might turn out that the difference is huge even after all the optimizations you listed below.

> Concurrently everybody is also looking at both backend and frontend having a
> persistent pool of grants.
> This means we do setup an "window" from either
> backend -> frontend or vice-versa that persists. Said "window" is bolted
> for the life-time of the guest. For networking the kernel stack already
> copies the pages from the user-space in the kernel and copying
> in the kernel to specific pages is mostly using the CPU cache. We need to
> exploit that and also make sure that the path is not interrupted.
> The grant_mapping on the TX side also looks a nice path - just have to
> make sure that the networking API don't try to free the page once the TX
> has been done (and this is where Ian's skb deconstructor would be beneficial).
>
> For block it is a bit different as aio's are mapped from kernel to
> user-space. But the neat thing there is that there is no need to inspect
> the data - when giving it to the DMA device (the exception is DIF/DIX which
> need calculate checksums). That is unless one needs to do the
> xen_biovec_phys_mergeable (to check if the next page is contiguous and
> if so add new bio's and copy the data in).
>
> But with PVH and PVHVM driver domains, and also piggybacking on the work
> that Malcolm is doing (Xen IOMMU), we can skip that check. (As the PFNs
> for the guest would look contiguous).
>
> In essence we can do a lot:
> 1). not copying or mapping grants if we detect that they are going to
>     a DMA device.
> 2). The 1) above + also use the Xen IOMMU to take care of setting the
>     proper EPT entries for the pages that we need. This could be done
>     as part of a grant_copy or grant_light_mapping in the hypervisor. This is
>     a case were we MUST copy those pages in the other domain (say the
>     Ethernet header). Whether a copy is done or a light mapping
>     (b/c the moment the device does the DMA operation on the granted
>     page we might as well remove the mapping. Hence the "light" or
>     maybe "expiring" grant.
> 3). The 2) above + Intel QuickData (a DMA engine that uses the same
>     L3 cache that PCI devices use) to keep the copied pages in the L3.
>     This has the benefit that when the PCI device is instructed to
>     fetch the data, it would do it from the L3 cache and be incredibly quick.
>     This would be using the grant_copy, but instead of the hypervisor
>     doing it, it instructs the Intel QuickData chipset to do it. Would
>     require some form of asynchronous grant_copy mechanism.
> 4). Variants of the above.
At 11:51 +0100 on 21 May (1369137063), Stefano Stabellini wrote:
> On Tue, 21 May 2013, Tim Deegan wrote:
> > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > > However it might still be worth considering as an option? The backend is
> > > > still trusted and protected from the frontend, but the frontend wouldn't
> > > > be protected from the backend.
> > > >
> > >
> > > I think Dom0 mapping all machine memory is a good starting point.
> >
> > I _strongly_ disagree. The opportunity for disaggregation and reduction
> > of privilege in backends is probably Xen's biggest technical advantage
> > and we should not be taking any backward steps there.
>
> While I agree with you, as a matter of fact the vast majority of Xen
> installations today do not use driver domains.

Sure, and that's a bad thing, right?

> However if the driver domains are "trusted"
> then I think they can also be trusted with a full memory map. After all
> it has been the case for all XenServer, OVM and SLES releases so far
> AFAIK.

...and that's a bad thing, right? :)

> Obviously in an ideal world we would be able to offer both at the same
> time, and maybe George's proposal is exactly what is going to achieve
> that. But I was describing the case that requires us to make a choice.

Righto. I don't think we need to worry about that yet. You're all smart engineers, and I've heard a bunch of good ideas flying around that address the costs of mapping and unmapping in backends.

Cheers,

Tim.
On Tue, 21 May 2013, Tim Deegan wrote:
> At 11:51 +0100 on 21 May (1369137063), Stefano Stabellini wrote:
> > On Tue, 21 May 2013, Tim Deegan wrote:
> > > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > > > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > > > However it might still be worth considering as an option? The backend is
> > > > > still trusted and protected from the frontend, but the frontend wouldn't
> > > > > be protected from the backend.
> > > > >
> > > >
> > > > I think Dom0 mapping all machine memory is a good starting point.
> > >
> > > I _strongly_ disagree. The opportunity for disaggregation and reduction
> > > of privilege in backends is probably Xen's biggest technical advantage
> > > and we should not be taking any backward steps there.
> >
> > While I agree with you, as a matter of fact the vast majority of Xen
> > installations today do not use driver domains.
>
> Sure, and that's a bad thing, right?
>
> > However if the driver domains are "trusted"
> > then I think they can also be trusted with a full memory map. After all
> > it has been the case for all XenServer, OVM and SLES releases so far
> > AFAIK.
>
> ...and that's a bad thing, right? :)

It's a good thing: even though it could be better, our users don't seem to mind. :)

> > Obviously in an ideal world we would be able to offer both at the same
> > time, and maybe George's proposal is exactly what is going to achieve
> > that. But I was describing the case that requires us to make a choice.
>
> Righto. I don't think we need to worry about that yet. You're all
> smart engineers, and I've heard a bunch of good ideas flying around that
> address the costs of mapping and unmapping in backends.

Right. I would consider the performance of "backend with all the memory mapped" as the limit we should try to achieve even without having all the memory mapped.
But if it turns out that we are very far from it, we might want to consider allowing it as an option in the meantime.
At 17:58 +0100 on 21 May (1369159125), Stefano Stabellini wrote:
> On Tue, 21 May 2013, Tim Deegan wrote:
> > At 11:51 +0100 on 21 May (1369137063), Stefano Stabellini wrote:
> > > Obviously in an ideal world we would be able to offer both at the same
> > > time, and maybe George's proposal is exactly what is going to achieve
> > > that. But I was describing the case that requires us to make a choice.
> >
> > Righto. I don't think we need to worry about that yet. You're all
> > smart engineers, and I've heard a bunch of good ideas flying around that
> > address the costs of mapping and unmapping in backends.
>
> Right. I would consider the performance of "backend with all the memory
> mapped" as the limit we should try to achieve even without having all
> the memory mapped.

Yes, absolutely.

> But if it turns out that we are very far from it, we
> might want to consider allowing it as an option in the meantime.

Understood, and I still strongly disagree.

Tim.
On Tue, 2013-05-21 at 17:58 +0100, Stefano Stabellini wrote:
> On Tue, 21 May 2013, Tim Deegan wrote:
> > At 11:51 +0100 on 21 May (1369137063), Stefano Stabellini wrote:
> > > On Tue, 21 May 2013, Tim Deegan wrote:
> > > > At 19:31 +0100 on 20 May (1369078279), Wei Liu wrote:
> > > > > On Mon, May 20, 2013 at 03:08:05PM +0100, Stefano Stabellini wrote:
> > > > > > However it might still be worth considering as an option? The backend is
> > > > > > still trusted and protected from the frontend, but the frontend wouldn't
> > > > > > be protected from the backend.
> > > > > >
> > > > >
> > > > > I think Dom0 mapping all machine memory is a good starting point.
> > > >
> > > > I _strongly_ disagree. The opportunity for disaggregation and reduction
> > > > of privilege in backends is probably Xen's biggest technical advantage
> > > > and we should not be taking any backward steps there.
> > >
> > > While I agree with you, as a matter of fact the vast majority of Xen
> > > installations today do not use driver domains.
> >
> > Sure, and that's a bad thing, right?
> >
> > > However if the driver domains are "trusted"
> > > then I think they can also be trusted with a full memory map. After all
> > > it has been the case for all XenServer, OVM and SLES releases so far
> > > AFAIK.
> >
> > ...and that's a bad thing, right? :)
>
> It's a good thing: even though it could be better our users don't seem
> to mind. :)

At least in the case of XenServer they are, as you know, actively moving towards disaggregation. Other users such as XenClient, Qubes OS, NSA etc. already make use of disaggregation to a greater or lesser extent.

For the distros I think the lack of disaggregation is mostly our fault as upstream for not making it easier to achieve, rather than a lack of desire on the part of users.

> > > Obviously in an ideal world we would be able to offer both at the same
> > > time, and maybe George's proposal is exactly what is going to achieve
> > > that. But I was describing the case that requires us to make a choice.
> >
> > Righto. I don't think we need to worry about that yet. You're all
> > smart engineers, and I've heard a bunch of good ideas flying around that
> > address the costs of mapping and unmapping in backends.
>
> Right. I would consider the performance of "backend with all the memory
> mapped" as the limit we should try to achieve even without having all
> the memory mapped. But if it turns out that we are very far from it, we
> might want to consider allowing it as an option in the meantime.

I think it is incredibly premature to even consider making this an option, or anything other than a useful datapoint for developers.

Ian.