Greetings!

The Xen project caught my attention on LKML discussing hypervisors, so I took a look at Xen and read the README, where it says:

    This install tree contains source for a Linux 2.6 guest

This immediately turned me off, as I hoped Xen would be a bit more transparent, by simply exposing native hw tunneled thru some multiplexed Xen-patched host-kernel driver.

I may be missing something, but why should the Xen design require the guest to be patched?

Thanks!

--
Al

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
On 7/8/06 4:01 pm, "Al Boldi" <a1426z@gawab.com> wrote:
> The Xen project caught my attention on LKML discussing hypervisors, so I
> took a look at Xen and read the README, where it says:
>
>     This install tree contains source for a Linux 2.6 guest
>
> This immediately turned me off, as I hoped Xen would be a bit more
> transparent, by simply exposing native hw tunneled thru some multiplexed
> Xen-patched host-kernel driver.
>
> I may be missing something, but why should the Xen design require the
> guest to be patched?

You can run fully-virtualised guests on VT-x and AMD-V hardware these days.

--
Keir
Harry Butterworth
2006-Aug-08 09:17 UTC
Re: [Xen-devel] Questioning the Xen Design of the VMM
On Mon, 2006-08-07 at 18:01 +0300, Al Boldi wrote:
> Greetings!
>
> The Xen project caught my attention on LKML discussing hypervisors, so I
> took a look at Xen and read the README, where it says:
>
>     This install tree contains source for a Linux 2.6 guest
>
> This immediately turned me off, as I hoped Xen would be a bit more
> transparent, by simply exposing native hw tunneled thru some multiplexed
> Xen-patched host-kernel driver.
>
> I may be missing something, but why should the Xen design require the
> guest to be patched?

Xen runs with high performance without binary translation on hardware without virtualization support. This requires patching the guest. With hardware virtualization support Xen can run the guest unmodified.
Petersson, Mats
2006-Aug-08 09:20 UTC
RE: [Xen-devel] Questioning the Xen Design of the VMM
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Al Boldi
> Sent: 07 August 2006 16:01
> To: xen-devel@lists.xensource.com
> Subject: [Xen-devel] Questioning the Xen Design of the VMM
>
> This immediately turned me off, as I hoped Xen would be a bit more
> transparent, by simply exposing native hw tunneled thru some multiplexed
> Xen-patched host-kernel driver.

The actual hardware isn't exposed to the guest at all [unless you explicitly ask for it in the configuration]. There are drivers that are virtual versions of the real hardware, but there is no way that the guest OS is ever touching any network card or hard disk, unless you've explicitly configured it so - and then it uses a driver that is the native driver [with some minor modifications to deal with the virtualization - those modifications are generally in header files (at least for well-behaved drivers)].

On the other hand, to reduce the size of the actual hypervisor (VMM), the approach of Xen is to use Linux as a driver domain (commonly combined with the management "domain", Dom0). This means that the Xen hypervisor itself can be driver-less, but of course it also relies on having another OS on top of itself to make up for this. Currently Linux is the only available option for a driver domain, but there's nothing in the interface between Xen and the driver domain that says it HAS to be so - it's just much easier to do with a well-known, open-source, driver-rich kernel than with a closed-source or driver-poor kernel...

> I may be missing something, but why should the Xen design require the
> guest to be patched?

There are two flavours of Xen guests:

Para-virtual guests.
Those are patched kernels, and have (in past versions of Xen) been implemented for Linux 2.4, Linux 2.6, Windows, <some version of> BSD and perhaps other versions that I don't know of. Current Xen is "Linux only" supplied with the Xen kernel. Other kernels are being worked on.

HVM guests. These are fully virtualized guests, where the guest contains the same binary as you would use on a non-virtual system. You can run Windows or Linux, or most other OS's on this. It does require "new" hardware that has virtualization support in hardware (AMD's AMD-V (SVM) or Intel VT) to use this flavour of guest, though, so the older model is still maintained.

I hope this is of use to you.

Please feel free to ask any further questions...

--
Mats
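[To make Mats's earlier point concrete - that real hardware only reaches a guest "if you explicitly ask for it in the configuration" - here is a hypothetical Xen 3.x-era xm guest config. The file paths, volume and bridge names are invented for illustration; xm configs are plain Python assignments.]

```python
# Hypothetical xm guest config (names and paths invented for illustration).
kernel = "/boot/vmlinuz-2.6-xenU"     # paravirtual guest kernel
memory = 256
name = "demo-guest"

# Virtual devices: the guest only ever sees a virtual block device (xvda)
# and a virtual NIC, both serviced by the driver domain (Dom0 here).
disk = ["phy:/dev/vg0/demo-guest,xvda,w"]
vif = ["bridge=xenbr0"]

# Only an explicit pass-through line like the one below would expose a
# piece of real hardware (a PCI device) directly to the guest:
# pci = ["01:00.0"]
```

Without a pci = line, every I/O request from the guest goes through the virtual front-end driver to the driver domain, never to the hardware itself.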
Petersson, Mats wrote:
> Al Boldi wrote:
> > I hoped Xen would be a bit more transparent, by simply exposing native
> > hw tunneled thru some multiplexed Xen-patched host-kernel driver.
>
> On the other hand, to reduce the size of the actual hypervisor (VMM),
> the approach of Xen is to use Linux as a driver domain (commonly
> combined with the management "domain", Dom0). This means that the Xen
> hypervisor itself can be driver-less, but of course it also relies on
> having another OS on top of itself to make up for this. Currently Linux
> is the only available option for a driver domain, but there's nothing in
> the interface between Xen and the driver domain that says it HAS to be
> so - it's just much easier to do with a well-known, open-source,
> driver-rich kernel than with a closed-source or driver-poor kernel...

Ok, you are probably describing the state of the host kernel, which I agree needs to be patched for performance reasons.

> > I may be missing something, but why should the Xen design require the
> > guest to be patched?
>
> There are two flavours of Xen guests:
> Para-virtual guests. Those are patched kernels, and have (in past
> versions of Xen) been implemented for Linux 2.4, Linux 2.6, Windows,
> <some version of> BSD and perhaps other versions that I don't know of.
> Current Xen is "Linux only" supplied with the Xen kernel. Other kernels
> are being worked on.

This is the part I am questioning.

> HVM guests. These are fully virtualized guests, where the guest contains
> the same binary as you would use on a non-virtual system. You can run
> Windows or Linux, or most other OS's on this. It does require "new"
> hardware that has virtualization support in hardware (AMD's AMD-V (SVM)
> or Intel VT) to use this flavour of guest though, so the older model is
> still maintained.

So HVM solves the problem, but why can't this layer be implemented in software?
I'm sure there can't be a performance issue, as this virtualization doesn't occur on the physical resource level, but is (should be) rather implemented as some sort of a multiplexed routing algorithm, I think :)

> I hope this is of use to you.
>
> Please feel free to ask any further questions...

Thanks a lot for your detailed response!

--
Al
Petersson, Mats
2006-Aug-08 15:07 UTC
RE: [Xen-devel] Questioning the Xen Design of the VMM
> -----Original Message-----
> From: Al Boldi [mailto:a1426z@gawab.com]
> Sent: 08 August 2006 15:10
> To: Petersson, Mats
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM
>
> Ok, you are probably describing the state of the host kernel, which I
> agree needs to be patched for performance reasons.

Yes, but you could have more than one driver domain, each isolated in all aspects from the other driver domains (host-kernel implies, to me, that it's also the management of the other domains). Why would you want more than one driver domain? For separation, of course...

1. Competing Company A and Company B are sharing the same hardware - you don't want Company A to have even the remotest chance of seeing any data that belongs to B, or the other way around, so you definitely want them to be separated in as many ways as possible.

2. Let's assume that someone finds a way to "hack" into a system by sending some particular pattern on the network (TCP/IP to a particular port, causing a buffer overflow, seems to have been popular on Windows at least).
If you have multiple driver domains, you would only get ONE domain broken (into) by this approach - of course, if the attack is widespread it would still break all ports, but if it's targeted towards one particular domain, the others will survive [let's say one of your client companies is hit with a targeted attack - the other companies will then be unaffected].

> > > I may be missing something, but why should the Xen design require
> > > the guest to be patched?
> >
> > There are two flavours of Xen guests:
> > Para-virtual guests. [snip]
>
> This is the part I am questioning.

The main reason to use a para-virtual kernel is that it performs better than the fully virtualized version.

> > HVM guests. These are fully virtualized guests, where the guest
> > contains the same binary as you would use on a non-virtual system.
> > [snip]
>
> So HVM solves the problem, but why can't this layer be implemented in
> software?

It CAN, and has been done. It is, however, a little bit difficult to cover some of the "strange" corner cases, as the x86 processor wasn't really designed to handle virtualization natively [until these extensions were added]. This is why you end up with binary translation in VMware, for example.
For example, let's say that we use the method of "ring compression" (which is when the guest OS is moved from Ring 0 [full privileges] to Ring 1 [less than full privileges]), and the hypervisor wants to have full control of the interrupt flag:

    some_function:
        ...
        pushf       // Save interrupt flag.
        cli         // Disable interrupts.
        ...
        ...
        ...
        popf        // Restore interrupt flag.
        ...

In Ring 0, all this works just fine - but of course, we don't know that the guest OS tried to disable interrupts, so we have to change something. In Ring 1, the guest can't disable interrupts, so the CLI instruction can be intercepted. Great. But pushf/popf is a valid instruction in all four rings - it just doesn't change the interrupt-enable flag in the flags register if you're not allowed to use the CLI/STI instructions! So, that means that interrupts are disabled forever after [until an STI instruction gets found by chance, at least].

And if the next bit of code is:

        mov someaddress, eax    // someaddress is updated by an interrupt!
    $1:
        cmp someaddress, eax    // Check it...
        jz $1

then we'd very likely never get out of there, since the actual interrupt causing someaddress to change is believed by the VMM to be disabled.

There is no real way to make popf trap [other than supplying it with invalid arguments in virtual-8086 mode, which isn't really a practical thing to do here!].

Another problem is "hidden bits" in registers. Let's say this:

        mov cr0, eax
        mov eax, ecx
        or  $1, eax
        mov eax, cr0
        mov $0x10, eax
        mov eax, fs
        mov ecx, cr0

        mov $0xF000000, eax
        mov $10000, ecx
    $1:
        mov $0, fs:eax
        add $4, eax
        dec ecx
        jnz $1

Let's now say that we have an interrupt that the hypervisor would handle in the loop in the above code. The hypervisor itself uses FS for some special purpose, and thus needs to save/restore the FS register.
When it returns, the system will crash (GP fault) because the FS register limit is 0xFFFF (64KB) and eax is greater than the limit - but the limit of FS was set to 0xFFFFFFFF before we took the interrupt... Incorrect behaviour like this is terribly difficult to deal with, and there really isn't any good way to solve these issues [other than not allowing the code to run when it does "funny" things like this - or to perform the necessary code in "translation mode", i.e. emulate each instruction -> slow(ish)].

> I'm sure there can't be a performance issue, as this virtualization
> doesn't occur on the physical resource level, but is (should be) rather
> implemented as some sort of a multiplexed routing algorithm, I think :)

I'm not entirely sure what this statement is trying to say, but as I understand the situation, performance is entirely the reason why the Xen paravirtual model was implemented - all other VMMs are slower [although it's often hard to prove that, since for example VMware has the rule that they have to give permission before publishing benchmarks of their product, and of course that permission would only be given in cases where there is some benefit to them].

One of the obvious reasons for para-virtual being better than full virtualization is that it can be used in a "batched" mode. Let's say we have some code that does this:

    ...
    p = malloc(2000 * 4096);
    ...

Let's then say that the guts of malloc ends up in something like this:

    map_pages_to_user(...)
    {
        for (v = random_virtual_address, p = start_page; p < end_page;
             p++, v += 4096)
            map_one_page_to_user(p, v);
    }

In full virtualization, we have no way to understand that someone is mapping 2000 pages for the same user process in one guest; we'd just see writes to the page table, one page at a time.

In the para-virtual case, we could do something like:

    map_pages_to_user(...)
    {
        hypervisor_map_pages_to_user(current_process, start_page,
                                     end_page, random_virtual_address);
    }

Now, the hypervisor knows "the full story" and can map all those pages in one go - much quicker, I would say. There's still more work than in the native case, but it's much closer to the native case.

> > I hope this is of use to you.
> >
> > Please feel free to ask any further questions...
>
> Thanks a lot for your detailed response!
Steven Rostedt
2006-Aug-08 16:39 UTC
Re: [Xen-devel] Questioning the Xen Design of the VMM
Mats, thanks for the examples of where the hypervisor needs to know what the x86 guest is doing, but the guest doesn't behave as the hypervisor expects. I've just recently started working with Xen, but my background has been more with other architectures than x86. I understand all that you explained, but one thing: see below. (I'm posting to the list so that others can learn too ;)

Petersson, Mats wrote:

[ snipped a lot of good info ]

> Another problem is "hidden bits" in registers.
> [snip - the CR0/FS example]
>
> Let's now say that we have an interrupt that the hypervisor would handle
> in the loop in the above code. The hypervisor itself uses FS for some
> special purpose, and thus needs to save/restore the FS register. When it
> returns, the system will crash (GP fault) because the FS register limit
> is 0xFFFF (64KB) and eax is greater than the limit - but the limit of FS
> was set to 0xFFFFFFFF before we took the interrupt...

The above I'm confused on. In x86, the hypervisor can't store the fs register fully before returning from the interrupt?? You stated that the fs register limit was 0xffffffff before the interrupt, but ends up being 0xffff afterwards. As I mentioned, I'm just learning the internals of x86, so my comprehension of x86 segment registers is still a little fuzzy.

Could you explain further here?

Thanks,

-- Steve
Petersson, Mats
2006-Aug-08 17:14 UTC
RE: [Xen-devel] Questioning the Xen Design of the VMM
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of
> Steven Rostedt
> Sent: 08 August 2006 17:39
> To: Petersson, Mats
> Cc: Al Boldi; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM
>
> Mats, thanks for the examples of where the hypervisor needs to know
> what the x86 guest is doing. [snip]
>
> The above I'm confused on. In x86, the hypervisor can't store the fs
> register fully before returning from the interrupt??
> You stated that the fs register limit was 0xffffffff before the
> interrupt, but ends up being 0xffff afterwards. As I mentioned, I'm
> just learning the internals of x86, so my comprehension of x86 segment
> registers is still a little fuzzy.
>
> Could you explain further here?

Sure. This code snippet enters protected mode (bit 0 of CR0) and sets up FS from the Global Descriptor Table. The visible part of FS (16 bits) gets set to the value 0x10, and the limit is set to whatever happens to be in the descriptor table. I didn't actually specify what that value is, but rather implied that the value for the limit is (0xFFFFF << 12 | 0xFFF) (i.e. the limit field is 2^20 - 1 and the granularity bit is set to 1 -> multiply by 4096 and set the lower bits to one).

As we leave protected mode, the contents of FS are still maintained, including the 80 bits of hidden information (limit, base and attributes). However, if we then take an interrupt (or otherwise need to save/restore FS), we'd lose all the hidden bits, and restoring FS later would need to figure out "how it got loaded" to make sure its hidden parts are re-loaded.

It's unlikely that you'd see this scenario in Xen, since Xen works on para-virtual kernels [unless we've got virtualization hardware, in which case the hypervisor CAN see the internal parts of FS (or any other segment register)].

Another tricky situation is:

    GDT[5] = { base = 0x1000, limit = 0x1000, attr = <something> }
    FS = GDT[5];
    CLI();
    GDT[5] = { base = 0x2000, limit = 0x1000, attr = <something> }
    ...
    ...
    ...
    FS = GDT[5];
    STI();

Now, whilst this tricky code is unreliable on real hardware too (if interrupts were enabled), if you have a situation where the guest cannot accept interrupts but the hypervisor can, it would break if the code with ... in it were to take an interrupt, because we'd have lost the value of FS (we'd reload the NEW value of GDT[5] at the end of the interrupt, assuming it saves FS).
Hidden parts of segment registers are one of the "security features" of the 286 architecture, but they also create some pretty interesting scenarios for us programmers...

--
Mats
Steven Rostedt
2006-Aug-08 18:22 UTC
Re: [Xen-devel] Questioning the Xen Design of the VMM
Mats, thanks for the reply.

Petersson, Mats wrote:
> Hidden parts of segment registers are one of the "security features" of
> the 286 architecture, but they also create some pretty interesting
> scenarios for us programmers...

The missing part in my mind was that I didn't know that the segment registers can't be completely read. A colleague of mine told me a little more about them. Yuck!

Thanks for the nice write-ups though.

-- Steve
Daniel Stodden
2006-Aug-09 12:49 UTC
Re: [Xen-devel] Questioning the Xen Design of the VMM
On Tue, 2006-08-08 at 17:10 +0300, Al Boldi wrote:
> > There are two flavours of Xen guests:
> > Para-virtual guests. [snip]
>
> This is the part I am questioning.
>
> > HVM guests. These are fully virtualized guests, where the guest
> > contains the same binary as you would use on a non-virtual system.
> > [snip]
>
> So HVM solves the problem, but why can't this layer be implemented in
> software?

the short answer at the cpu level is "because of the arcane nature of the x86 architecture" :/ it can be done, but it requires mechanisms xen developers currently do not and wouldn't be willing to apply. non-paravirtualized guests may perform operations which on bare x86 hardware are hard/impossible to track. one way to work around this would be patching guest code segments before executing them. that's where systems like e.g. vmware come into play.

xen-style paravirtualization at the cpu level basically resolves that efficiently by teaching the guest system not to use the critical stuff, but be aware of the vmm to do it instead.

once the cpu problem has been solved, you'd need to emulate hardware resources an unmodified guest system attempts to drive. that again takes additional cycles. elimination of the peripheral hardware interfaces by putting the I/O layers on top of an abstract low-level path into the VMM is one of the reasons why xen is faster than others.
many systems do this quite successfully, even for 'non-modified' guests like e.g. windows, by installing dedicated, virtualization-aware drivers once the base installation went ok.

> I'm sure there can't be a performance issue, as this virtualization
> doesn't occur on the physical resource level, but is (should be) rather
> implemented as some sort of a multiplexed routing algorithm, I think :)

few device classes support resource sharing in that manner efficiently. peripheral devices in commodity platforms are inherently single-hosted and won't support unfiltered access by multiple driver instances in several guests. from the vmm perspective, it always boils down to emulating the device, however with varying degrees of complexity regarding the translation of guest requests to physical access. it depends. ide, afaik, is known to work comparatively well. an example of an area where it's getting more sportive would be network adapters.

this is basically the whole problem when building virtualization layers for cots platforms: the device/driver landscape spreads to infinity :) since you'll have a hard time driving any possible combination by yourself, you need something else to do it. one solution are hosted vmms, running on top of an existing operating system. a second solution is what xen does: offload drivers to a modified guest system which can then carry the I/O load from the additional, nonprivileged guests as well.

regards,
daniel

--
Daniel Stodden
LRR - Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München
D-85748 Garching
http://www.lrr.in.tum.de/~stodden  mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33 3D80 457E 82AE B0D8 735B
Petersson, Mats wrote:
> > > Al Boldi wrote:
> > > > I may be missing something, but why should the Xen design require
> > > > the guest to be patched?
>
> The main reason to use a para-virtual kernel is that it performs better
> than the fully virtualized version.
>
> > So HVM solves the problem, but why can't this layer be implemented in
> > software?
>
> It CAN, and has been done.

You mean full virtualization using binary translation in software? My understanding was that HVM implies full virtualization without the need for binary translation in software.

> It is, however, a little bit difficult to cover some of the "strange"
> corner cases, as the x86 processor wasn't really designed to handle
> virtualization natively [until these extensions were added].

You mean the AMD-V/Intel VT extensions? If so, then these extensions don't actively participate in the act of virtualization, but rather fix some x86-arch shortcomings that make it easier for software (i.e. Xen) to virtualize, thus circumventing the need to do binary translation. Is this a correct reading?

> This is why you end up with binary translation in VMware, for example.
> For example, let's say that we use the method of "ring compression"
> [snip - the pushf/cli/popf example]
> So, that means that interrupts are disabled forever after [until an STI
> instruction gets found by chance, at least].
> [snip - the someaddress polling loop and the hidden-bits FS example:
> incorrect behaviour like this is terribly difficult to deal with, other
> than emulating each instruction -> slow(ish)]

Or introduce the AMD-V/Intel VT extensions?

> > I'm sure there can't be a performance issue, as this virtualization
> > doesn't occur on the physical resource level, but is (should be)
> > rather implemented as some sort of a multiplexed routing algorithm,
> > I think :)
>
> I'm not entirely sure what this statement is trying to say, but as I
> understand the situation, performance is entirely the reason why the
> Xen paravirtual model was implemented - all other VMMs are slower.
> [snip]
>
> One of the obvious reasons for para-virtual being better than full
> virtualization is that it can be used in a "batched" mode.
> [snip - the map_pages_to_user example]
>
> Now, the hypervisor knows "the full story" and can map all those pages
> in one go - much quicker, I would say. There's still more work than in
> the native case, but it's much closer to the native case.

Sure, but wouldn't this be for the price of losing guest-OS transparency?

Thanks!
-- Al _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
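The batched-mapping example quoted above can be made concrete by counting guest-to-hypervisor transitions. The sketch below is not Xen code - the function names and the one-transition-per-operation cost model are invented for illustration - but it shows where the para-virtual win comes from when mapping 2000 pages:

```c
#include <assert.h>

/* Hypothetical cost model: every guest->hypervisor transition
 * (page fault intercept or explicit hypercall) costs the same
 * fixed overhead, so we just count transitions. */
static long transitions;

/* Fully-virtualised style: each page-table write is trapped
 * and handled individually by the VMM. */
static long map_pages_trapped(int npages)
{
    transitions = 0;
    for (int p = 0; p < npages; p++)
        transitions++;          /* one intercept per page-table write */
    return transitions;
}

/* Para-virtual style: the guest tells the hypervisor "the full
 * story" in one explicit, batched hypercall. */
static long map_pages_batched(int npages)
{
    (void)npages;               /* range is described in the one call */
    transitions = 1;
    return transitions;
}
```

With malloc(2000 * 4096) as in the example, the trapped path pays 2000 transitions against 1 for the batched path; the fixed per-transition cost is exactly what the para-virtual interface amortises.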
Petersson, Mats
2006-Aug-09 13:28 UTC
RE: [Xen-devel] Questioning the Xen Design of the VMM
> -----Original Message----- > From: Al Boldi [mailto:a1426z@gawab.com] > Sent: 09 August 2006 13:53 > To: Petersson, Mats > Cc: xen-devel@lists.xensource.com > Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM > > Petersson, Mats wrote: > > > > Al Boldi wrote: > > > > > I maybe missing something, but why should the Xen-design > > > > > require the guest to be patched? > > > > The main reason to use a para-virtual kernel that it performs better > > than the fully virtualized version. > > > > > So HVM solves the problem, but why can''t this layer be > implemented in > > > software? > > > > It CAN, and has been done. > > You mean full virtualization using binary translation in software?Yes, exactly - or other types of "full virtualiziation" using software - I haven''t made a complete inventory of "different technologies used for virtualization on x86", so I can''t really say - my job with AMD and Xen is to implement into Xen the parts that support the AMD virtualization, not understand the entire VM architecture available in the world...> > My understanding was, that HVM implies full virtualization > without the need > for binary translation in software.Yes, that''s generally correct. In very detail, there are some variants of execution where this is broken, but that''s the obscure corner cases, rather than the normal behaviour. In particular, Intel''s VT doesn''t support running real-mode inside a virtual machine, so if the guest is run in real-mode, it requires some forms of "emulation" (actually, the current solution uses a VM86 mode of the processor, and it''s then only having to emulate the opcodes that fault when run in VM86 mode). 
There are some things that we (AMD) didn't get perfectly right either, and as such could be improved...

> > It is however, a little bit difficult to
> > cover some of the "strange" corner cases, as the x86 processor wasn't
> > really designed to handle virtualization natively [until these
> > extensions were added].
>
> You mean AMDV/IntelVT extensions?

Yes.

> If so, then these extensions don't actively participate in the act of
> virtualization, but rather fix some x86-arch shortcomings, that make it
> easier for software (i.e. Xen) to virtualize, thus circumventing the need
> to do binary translation. Is this a correct reading?

Not sure what your exact meaning is here. What do you mean by "actively participate in the act of virtualization"? Please clarify, and exemplify an architecture where the hardware is ACTIVELY taking part in the virtualization - do you mean a hardware implementation of a hypervisor? [As, again, I haven't spent an awful lot of time trying to understand how/what can and can't be done in other architectures - as far as I understand it, both AMD's and Intel's virtualization technologies are fairly close "copies" of IBM's original implementation on the 360 series machines, so I expect that what can/can't be done in that is what can/can't be done in the x86 world.]

I do agree that it removes the need for binary translation and emulation, and makes the writing of the software to manage the VMs easier. It also helps in the sense that it allows more selective intercepts than, for example, ring compression (where all protected instructions are "faulting", whether it's actually necessary for the hypervisor to intercept or not - for example, it's completely useless for the hypervisor to know when the guest reads or writes CR2 - but CR2 is a protected register, so it's going to get intercepted by a ring-compressed kernel), so fewer intercepts.
It''s also more easy to determine the actual intercept reason in a virtualization enhanced processor, since it gives an "exitcode" to indicate the reason for the "exit" back to the hypervisor.> > > This is why you end up with binary translation > > in VMWare for example. For example, let''s say that we use > the method of > > "ring compression" (which is when the guest-OS is moved from Ring 0 > > [full privileges] to Ring 1 [less than full privileges]), and the > > hypervisor wants to have full control of interrupt flags: > > > > some_function: > > ... > > pushf // Save interrupt flag. > > cli // Disable interrupts > > ... > > ... > > ... > > popf // Restore interrupt flag. > > ... > > > > In Ring 0, all this works just fine - but of course, we > don''t know that > > the guest-OS tried to disable interrupts, so we have to change > > something. In Ring 1, the guest can''t disable interrupts, so the CLI > > instruction can be intercepted. Great. But pushf/popf is a valid > > instruction in all four rings - it just doesn''t change the interrupt > > enable flag in the flags register if you''re not allowed to use the > > CLI/STI instructions! So, that means that interrupts are disabled > > forever after [until an STI instruction gets found by > chance, at least]. > > > > > > And if the next bit of code is: > > > > mov someaddress, eax // someaddress is > > updated by an interrupt! > > $1: > > cmp someaddress, eax // Check it... > > jz $1 > > > > Then we''d very likely never get out of there, since the > actual interrupt > > causing someaddress to change is believed by the VMM to be disabled. > > > > There is no real way to make popf trap [other than supplying it with > > invalid arguments in virtual 8086 mode, which isn''t really > a practical > > thing to do here!] > > > > Another problem is "hidden bits" in registers. 
> > > > Let''s say this: > > > > mov cr0, eax > > mov eax, ecx > > or $1, eax > > mov eax, cr0 > > mov $0x10, eax > > mov eax, fs > > mov ecx, cr0 > > > > mov $0xF000000, eax > > mov $10000, ecx > > $1: > > mov $0, fs:eax > > add $4, eax > > dec ecx > > jnz $1 > > > > Let''s now say that we have an interrupt that the hypervisor > would handle > > in the loop in the above code. The hypervisor itself uses > FS for some > > special purpose, and thus needs to save/restore the FS > register. When it > > returns, the system will crash (GP fault) because the FS > register limit > > is 0xFFFF (64KB) and eax is greater than the limit - but > the limit of FS > > was set to 0xFFFFFFFF before we took the interrupt... Incorrect > > behaviour like this is terribly difficult to deal with, and > there really > > isn''t any good way to solve these issues [other than not > allowing the > > code to run when it does "funny" things like this - or to > perform the > > necessary code in "translation mode" - i.e. emulate each > instruction -> > > slow(ish)]. > > Or introduce AMDV/IntelVT extensions? > > > > I''m sure there can''t be a performance issue, as this > > > virtualization doesn''t > > > occur on the physical resource level, but is (should be) > > > rather implemented > > > as some sort of a multiplexed routing algorithm, I think :) > > > > I''m not entirely sure what this statement is trying to say, but as I > > understand the situation, performance is entirely the > reason why the Xen > > paravirtual model was implemented - all other VMM''s are > slower [although > > it''s often hard to prove that, since for example Vmware > have the rule > > that they have to give permission before publishing > benchmarks of their > > product, and of course that permission would only be given in cases > > where there is some benefit to them]. > > > > One of the obvious reasons for para-virtual being better than full > > virtualization is that it can be used in a "batched" mode. 
> > Let's say we
> > have some code that does this:
> >
> > ...
> > p = malloc(2000 * 4096);
> > ...
> >
> > Let's then say that the guts of malloc ends up in something like this:
> >
> > map_pages_to_user(...)
> > {
> >     for(v = random_virtual_address, p = start_page; p < end_page;
> >         p++, v += 4096)
> >         map_one_page_to_user(p, v);
> > }
> >
> > In full virtualization, we have no way to understand that someone is
> > mapping 2000 pages to the same user-process in one guest, we'd just see
> > writes to the page-table one page at a time.
> >
> > In the para-virtual case, we could do something like:
> > map_pages_to_user(...)
> > {
> >     hypervisor_map_pages_to_user(current_process, start_page, end_page,
> >                                  random_virtual_address);
> > }
> >
> > Now, the hypervisor knows "the full story" and can map all those pages
> > in one go - much quicker, I would say. There's still more work than in
> > the native case, but it's much closer to the native case.
>
> Sure, but wouldn't this be for the price of losing guest-OS transparency?

Life is full of compromises between one ideal solution and another. In an ideal world, virtualization wouldn't cost anything, but it does.

Losing guest-OS transparency when the guest-OS is open-source isn't really a big issue, in my opinion. However, if you haven't got source code readily available, it becomes a big issue - since without source code, it gets much harder to make the necessary modifications (probably to the extent that it's actually IMPOSSIBLE to make them in a sane and reliable manner).

There is no doubt that para-virtualization is one viable solution to the virtualization problem, but it's not the ONLY solution. Each user has a choice: recompile and get performance, or run unmodified code at lower performance.

-- Mats

> Thanks!
>
> --
> Al
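The pushf/popf failure mode under ring compression discussed earlier in the thread can be modelled in a few lines. This is only an illustrative model of the x86 rule that POPF at CPL > 0 silently drops the IF change instead of faulting - the struct and function names are invented, not real VMM code:

```c
#include <assert.h>
#include <stdbool.h>

struct vcpu {
    int  cpl;    /* privilege level: 0 = real kernel, 1 = deprivileged guest */
    bool iflag;  /* the actual interrupt-enable flag */
};

/* CLI is a privileged instruction: at CPL > 0 it faults, so the
 * VMM gets an intercept and can virtualise the disable. */
static bool cli_traps(const struct vcpu *v)
{
    return v->cpl > 0;
}

/* POPF is legal at every CPL; at CPL > 0 the IF bit in the popped
 * value is silently ignored - no fault, so no intercept, and the
 * VMM never learns the guest meant to re-enable interrupts. */
static void popf(struct vcpu *v, bool popped_if)
{
    if (v->cpl == 0)
        v->iflag = popped_if;
    /* else: the IF change is dropped on the floor */
}
```

Running the guest kernel at ring 1 thus makes CLI visible to the VMM but loses the matching POPF, which is the inconsistency that forces binary translation (or the hardware extensions) in the first place.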
Daniel Stodden
2006-Aug-10 11:20 UTC
Re: [Xen-devel] Questioning the Xen Design of the VMM
On Wed, 2006-08-09 at 15:53 +0300, Al Boldi wrote:> Petersson, Mats wrote: > > > > Al Boldi wrote: > > > > > I maybe missing something, but why should the Xen-design > > > > > require the guest to be patched? > > > > The main reason to use a para-virtual kernel that it performs better > > than the fully virtualized version. > > > > > So HVM solves the problem, but why can''t this layer be implemented in > > > software? > > > > It CAN, and has been done. > > You mean full virtualization using binary translation in software? > > My understanding was, that HVM implies full virtualization without the need > for binary translation in software. > > > It is however, a little bit difficult to > > cover some of the "strange" corner cases, as the x86 processor wasn''t > > really designed to handle virtualization natively [until these > > extensions where added]. > > You mean AMDV/IntelVT extensions? > > If so, then these extensions don''t actively participate in the act of > virtualization, but rather fix some x86-arch shortcomings, that make it > easier for software (i.e. Xen) to virtualize, thus circumventing the need to > do binary translation. Is this a correct reading?they fix the issues, removing the general need for binary translation, but go well beyond that as well. a comparatively simple example of where it goes beyond are privilege levels. basic system virtualization would just move the guest kernel to a nonprivileged level to maintain control in the vmm. so you''d have the hypervisor in supervisor mode (that''s why it''s called a hypervisor), and both guest kernel and applications in user mode [1]. [should note that xen makes a difference here, using x86 privilege levels which are more complex]. what vtx does is keeping the privilege rings in protected mode untouched by the virtualization features. instead, two whole new modes are added: ''vmx root'' and ''vmx non-root''. the former applies to the vmm, the latter to the guests. 
_both_ of these basically implement the protected mode as it used to be. so hardware virtualization won''t have to muck around with the regular privilege system. one example where this is particularly useful are hosted vmms, e.g. vmware workstation. imagine a natively-running operating system and a machine monitor running on top of (or integrated with) that. the system would run in vmx-root mode. regular application processes there in ring3 as they used to. additionally, one may start guest systems on top of the vmm, which again are implemented on top a regular x86 protected mode, but in non-root mode. all of the above - can be functionally achieved _efficiently_ without hardware extensions like vmx - but ONLY as long as the privilege architecture supports virtualization - x86 does NOT [2] the pushf/popf outlined is an example of where the problems are - binary translation is a way to do it anyway, but does not count as ''efficient''. with vmx - efficient virtualization is achieved. - some things just get additional flexibility. related reading: [1] popek & goldberg: Formal Requirements for Virtualizable Third Generation Architectures.pdf, 1974 (!) [2] robin & irvine: Analysis of the Intel Pentium''s Ability to Support a Secure Virtual Machine Monitor.pdf, 2000 both should be available from the web if you dig around long enough. :)> > This is why you end up with binary translation > > in VMWare for example. For example, let''s say that we use the method of > > "ring compression" (which is when the guest-OS is moved from Ring 0 > > [full privileges] to Ring 1 [less than full privileges]), and the > > hypervisor wants to have full control of interrupt flags: > > > > some_function: > > ... > > pushf // Save interrupt flag. 
> > cli // Disable interrupts
> > ...

regards,
daniel

--
Daniel Stodden
LRR - Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München, D-85748 Garching
http://www.lrr.in.tum.de/~stodden mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33 3D80 457E 82AE B0D8 735B
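The "exitcode" mechanism Mats mentioned fits naturally with the root/non-root split described above: the hypervisor runs the guest in non-root mode until an intercepted event occurs, then regains control together with a code saying why. A schematic dispatch (the exit reasons and action names below are invented; real SVM/VT-x encodings and VMCB/VMCS layouts differ):

```c
#include <assert.h>

/* Invented exit reasons - stand-ins for the hardware-reported code. */
enum exit_reason { EXIT_CR_ACCESS, EXIT_IO, EXIT_HLT, EXIT_SHUTDOWN };

/* What the VMM decides to do about an exit. */
enum vmm_action { ACTION_EMULATE, ACTION_DEVICE_MODEL,
                  ACTION_DESCHEDULE, ACTION_DESTROY };

/* The core of the VMM event loop: instead of decoding the guest's
 * instruction stream by hand, dispatch on the reported exit reason. */
static enum vmm_action handle_exit(enum exit_reason why)
{
    switch (why) {
    case EXIT_CR_ACCESS: return ACTION_EMULATE;      /* emulate the CR move */
    case EXIT_IO:        return ACTION_DEVICE_MODEL; /* hand to the device model */
    case EXIT_HLT:       return ACTION_DESCHEDULE;   /* vcpu idle: run another */
    default:             return ACTION_DESTROY;      /* tear the guest down */
    }
}
```

The point of the selective-intercept argument earlier in the thread is that only the cases the hypervisor actually cares about ever reach this loop.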
Petersson, Mats wrote:> > Al Boldi wrote: > > You mean AMDV/IntelVT extensions? > > Yes. > > > If so, then these extensions don''t actively participate in the act of > > virtualization, but rather fix some x86-arch shortcomings, that make it > > easier for software (i.e. Xen) to virtualize, thus circumventing the need > > to do binary translation. Is this a correct reading? > > Not sure what your exact meaning is here. > > What do you mean by "actively participate in the act of virtualization".Is there any logic involved, that does some kind of a translation/control? It seems not. Daniel Stodden wrote:> > they fix the issues, removing the general need for binary translation, > but go well beyond that as well. > > a comparatively simple example of where it goes beyond are privilege > levels. basic system virtualization would just move the guest kernel to > a nonprivileged level to maintain control in the vmm. so you''d have the > hypervisor in supervisor mode (that''s why it''s called a hypervisor), and > both guest kernel and applications in user mode [1]. [should note that > xen makes a difference here, using x86 privilege levels which are more > complex]. > > what vtx does is keeping the privilege rings in protected mode untouched > by the virtualization features. instead, two whole new modes are added: > ''vmx root'' and ''vmx non-root''. the former applies to the vmm, the latter > to the guests. _both_ of these basically implement the protected mode as > it used to be. so hardware virtualization won''t have to muck around with > the regular privilege system. > > one example where this is particularly useful are hosted vmms, e.g. > vmware workstation. imagine a natively-running operating system and a > machine monitor running on top of (or integrated with) that. the system > would run in vmx-root mode. regular application processes there in ring3 > as they used to. 
> additionally, one may start guest systems on top of the
> vmm, which again are implemented on top of a regular x86 protected mode,
> but in non-root mode.
>
> all of the above
> - can be functionally achieved _efficiently_ without hardware
>   extensions like vmx
> - but ONLY as long as the privilege architecture supports
>   virtualization - x86 does NOT [2]
>   the pushf/popf outlined is an example of where the problems are
> - binary translation is a way to do it anyway, but does not count
>   as 'efficient'.
>
> with vmx
> - efficient virtualization is achieved.
> - some things just get additional flexibility.

So VMX doesn't really virtualize anything, but rather enables software to perform virtualization more efficiently.

Petersson, Mats wrote:
> There is no doubt that para-virtualization is one viable solution to the
> virtualization problem, but it's not the ONLY solution. Each user has a
> choice: Recompile and get performance, or run unmodified code at lower
> performance.

Agreed, but how much lower performance are we talking about in an HVM vs para-virtualized scenario?

Thanks!

--
Al
Daniel Stodden wrote:> On Tue, 2006-08-08 at 17:10 +0300, Al Boldi wrote: > > So HVM solves the problem, but why can''t this layer be implemented in > > software? > > the short answer at the cpu level is "because of the arcane nature of > the x86 architecture" :/Which AMDV/IntelVT supposedly solves?> once the cpu problem has been solved, you''d need to emulate hardware > resources an unmodified guest system attempts to drive. that again takes > additional cycles. elimination of the peripheral hardware interfaces by > putting the I/O layers on top of an abstract low-level path into the VMM > is one of the reasons why xen is faster than others. many systems do > this quite successfully, even for ''non-modified'' guests like e.g. > windows, by installing dedicated, virtualization aware drivers once the > base installation went ok.You mean "virtualization aware" drivers in the guest-OS? Wouldn''t this amount to a form of patching?> > I''m sure there can''t be a performance issue, as this virtualization > > doesn''t occur on the physical resource level, but is (should be) rather > > implemented as some sort of a multiplexed routing algorithm, I think :) > > few device classes support resource sharing in that manner efficiently. > peripheral devices in commodity platforms are inherently single-hosted > and won''t support unfiltered access by multiple driver instances in > several guests.Would this be due to the inability of the peripheral to switch contexts fast enough? If so, how about a "AMDV/IntelVT" for peripherals?> from the vmm perspective, it always boils down to emulating the device. > howerver, with varying degrees of complexity regarding the translation > of guest requests to physical access. it depends. ide, afaik is known to > work comparatively well.Probably because IDE follows a well defined API?> an example of an area where it''s getting more > sportive would be network adapters. 
> this is basically the whole problem when building virtualization layers
> for cots platforms: the device/driver landscape spreads to infinity :)
> since you'll have a hard time driving any possible combination by
> yourself, you need something else to do it. one solution are hosted
> vmms, running on top of an existing operating system. a second solution
> is what xen does: offload drivers to a modified guest system which can
> then carry the I/O load from the additional, nonprivileged guests as
> well.

Agreed; so let me rephrase the dilemma like this:

The PC platform was never intended to be used in a virtualizing scenario, and therefore does not contain the infrastructure to support this kind of a scenario efficiently, but this could easily be rectified by introducing simple extensions, akin to AMDV/IntelVT, on all levels of the PC hardware.

Is this a correct reading?

If so, has this been considered in the Xen design, so as to accommodate any future hwV/VT/VMX extensions easily and quickly?

Thanks for your input!

--
Al
Petersson, Mats
2006-Aug-10 15:42 UTC
RE: [Xen-devel] Questioning the Xen Design of the VMM
> -----Original Message-----
> From: Al Boldi [mailto:a1426z@gawab.com]
> Sent: 10 August 2006 15:55
> To: Petersson, Mats; Daniel Stodden
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM
>
> Petersson, Mats wrote:
> > > Al Boldi wrote:
> > > You mean AMDV/IntelVT extensions?
> >
> > Yes.
> >
> > > If so, then these extensions don't actively participate in the act of
> > > virtualization, but rather fix some x86-arch shortcomings, that make it
> > > easier for software (i.e. Xen) to virtualize, thus circumventing the
> > > need to do binary translation. Is this a correct reading?
> >
> > Not sure what your exact meaning is here.
> >
> > What do you mean by "actively participate in the act of virtualization".
>
> Is there any logic involved, that does some kind of a translation/control?
>
> It seems not.

AMD has announced a feature called "Nested page tables", which will allow the translation of page-table lookups, essentially adding another layer of address translation, so we can give the guest a "physical" map of [0..256MB], whilst we're actually giving it some (completely) random set of physical pages that it actually gets to use. This is not available in the current generation of chips, but it will be in the next... I believe that Intel has at least publicly stated that they have a similar solution in the pipeline.

We (AMD) have also publicly talked about IOMMU, which will help hardware virtualization. I'll make more comments on that in reply to your other posting.

So, in the current generation, shadow page tables are used: the actual page table used by the guest is write-protected, and when write-faults occur, we replace the data written by the guest with a translated value in a second page table, which the guest never sees, but the processor uses to translate the memory accesses.
It''s a fair bit more work, but the guest is entirely unaware of the REAL PHYSICAL address it lives at.> > Daniel Stodden wrote: > > > > they fix the issues, removing the general need for binary > translation, > > but go well beyond that as well. > > > > a comparatively simple example of where it goes beyond are privilege > > levels. basic system virtualization would just move the > guest kernel to > > a nonprivileged level to maintain control in the vmm. so > you''d have the > > hypervisor in supervisor mode (that''s why it''s called a > hypervisor), and > > both guest kernel and applications in user mode [1]. > [should note that > > xen makes a difference here, using x86 privilege levels > which are more > > complex]. > > > > what vtx does is keeping the privilege rings in protected > mode untouched > > by the virtualization features. instead, two whole new > modes are added: > > ''vmx root'' and ''vmx non-root''. the former applies to the > vmm, the latter > > to the guests. _both_ of these basically implement the > protected mode as > > it used to be. so hardware virtualization won''t have to > muck around with > > the regular privilege system. > > > > one example where this is particularly useful are hosted vmms, e.g. > > vmware workstation. imagine a natively-running operating > system and a > > machine monitor running on top of (or integrated with) > that. the system > > would run in vmx-root mode. regular application processes > there in ring3 > > as they used to. additionally, one may start guest systems > on top of the > > vmm, which again are implemented on top a regular x86 > protected mode, > > but in non-root mode. 
> > all of the above
> > - can be functionally achieved _efficiently_ without hardware
> >   extensions like vmx
> > - but ONLY as long as the privilege architecture supports
> >   virtualization - x86 does NOT [2]
> >   the pushf/popf outlined is an example of where the problems are
> > - binary translation is a way to do it anyway, but does not count
> >   as 'efficient'.
> >
> > with vmx
> > - efficient virtualization is achieved.
> > - some things just get additional flexibility.
>
> So VMX doesn't really virtualize anything, but rather enables software to
> perform virtualization more efficiently.

Yes.

> Petersson, Mats wrote:
> > There is no doubt that para-virtualization is one viable solution to the
> > virtualization problem, but it's not the ONLY solution. Each user has a
> > choice: Recompile and get performance, or run unmodified code at lower
> > performance.
>
> Agreed, but how much lower performance are we talking about in an HVM vs
> para-virtualized scenario?

Unfortunately, this is not a trivial question to answer, since it depends very much on what amount of hardware access is involved in the system. I'm sure cases can be conceived that are 10x slower, and other cases where you get 98-99.9% of the original performance in the virtual machine. Para-virtual is supposed to be around 95-98% of the native solution - but again, the exact figures depend on the workload - pathological cases can probably be found.

A large percentage of any slowdown from HVM is caused by the way hardware is emulated - using qemu-dm to model the virtual hardware. If you have a disk benchmark, it's quite feasible that the native machine has 10x or so the throughput of the HVM system.

-- Mats

> Thanks!
>
> --
> Al
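Mats's description of shadow paging - write-protect the guest's page table, catch the write fault, and install a translated entry in a hidden table the hardware actually walks - can be sketched as follows. The flat arrays and the guest-frame-to-machine-frame map are simplified stand-ins for the real multi-level structures:

```c
#include <assert.h>

#define NPAGES 16

/* guest "physical" frame -> real machine frame, chosen by the VMM */
static int g2m[NPAGES];
/* the page table the guest writes (write-protected by the VMM) */
static int guest_pt[NPAGES];
/* the shadow table the MMU actually walks; the guest never sees it */
static int shadow_pt[NPAGES];

/* Fault handler invoked when the guest writes a PTE: let the write
 * through to the guest's table so it sees what it expects, then
 * mirror it into the shadow with the frame number translated. */
static void shadow_write_fault(int slot, int guest_frame)
{
    guest_pt[slot]  = guest_frame;       /* what the guest believes */
    shadow_pt[slot] = g2m[guest_frame];  /* what the hardware uses */
}
```

Nested page tables move the g2m translation step into hardware, which is why they remove most of this fault-and-fix machinery.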
Daniel Stodden
2006-Aug-10 15:53 UTC
Re: [Xen-devel] Questioning the Xen Design of the VMM
On Thu, 2006-08-10 at 17:57 +0300, Al Boldi wrote:> > > So HVM solves the problem, but why can''t this layer be implemented in > > > software? > > > > the short answer at the cpu level is "because of the arcane nature of > > the x86 architecture" :/ > > Which AMDV/IntelVT supposedly solves?regarding the virtualization issue, yes.> > once the cpu problem has been solved, you''d need to emulate hardware > > resources an unmodified guest system attempts to drive. that again takes > > additional cycles. elimination of the peripheral hardware interfaces by > > putting the I/O layers on top of an abstract low-level path into the VMM > > is one of the reasons why xen is faster than others. many systems do > > this quite successfully, even for ''non-modified'' guests like e.g. > > windows, by installing dedicated, virtualization aware drivers once the > > base installation went ok. > > You mean "virtualization aware" drivers in the guest-OS? Wouldn''t this > amount to a form of patching?yes, strictly speaking it is a modification. but one based upon usually well-defined interfaces, and it does not require parsing opcodes and patching code segments. otoh, one which obviously needs to be reiterated for any additional guest os family.> > > I''m sure there can''t be a performance issue, as this virtualization > > > doesn''t occur on the physical resource level, but is (should be) rather > > > implemented as some sort of a multiplexed routing algorithm, I think :) > > > > few device classes support resource sharing in that manner efficiently. > > peripheral devices in commodity platforms are inherently single-hosted > > and won''t support unfiltered access by multiple driver instances in > > several guests. > > Would this be due to the inability of the peripheral to switch contexts fast > enough?maybe. more important: commodity peripherals typically wouldn''t sufficiently implement security and isolation. 
you certainly won't 'route' arbitrary block I/O from a guest system to your disk controller without further investigation and translation. it may gladly overwrite your host partition or whatever resource you granted elsewhere.

> If so, how about a "AMDV/IntelVT" for peripherals?

good idea, and actually practical. unfortunately, this is where it's getting expensive.

> > from the vmm perspective, it always boils down to emulating the device.
> > however, with varying degrees of complexity regarding the translation
> > of guest requests to physical access. it depends. ide, afaik is known to
> > work comparatively well.
>
> Probably because IDE follows a well defined API?

yes. however, i'm not an ide guy.

> > an example of an area where it's getting more
> > sportive would be network adapters.
> >
> > this is basically the whole problem when building virtualization layers
> > for cots platforms: the device/driver landscape spreads to infinity :)
> > since you'll have a hard time driving any possible combination by
> > yourself, you need something else to do it. one solution are hosted
> > vmms, running on top of an existing operating system. a second solution
> > is what xen does: offload drivers to a modified guest system which can
> > then carry the I/O load from the additional, nonprivileged guests as
> > well.
>
> Agreed; so let me rephrase the dilemma like this:
> The PC platform was never intended to be used in a virtualizing scenario,
> and therefore does not contain the infrastructure to support this kind of
> a scenario efficiently, but this could easily be rectified by introducing
> simple extensions, akin to AMDV/IntelVT, on all levels of the PC hardware.
>
> Is this a correct reading?

yes, with restrictions. at this point in time, correct, but not from an economical standpoint. the whole "virtualization renaissance" we've been experiencing for the last 3 years or so builds upon the fact that PC hardware has become 1. terribly powerful, compared to what the workloads most software systems run actually require, and 2. remained comparatively cheap, as it always used to. if you start to redesign the I/O system, you're likely to raise the cost of the overall system. I/O virtualization down to the device level may come, but like with processor prices, it's all an "economy of scale".

hardware-assisted virtualization at various places in the architecture, however, including I/O, is a well-understood topic. may i again point you to some reading matter in that area:

nair/smith: virtual machines. http://www.amazon.de/gp/product/1558609105/028-2651277-1478934?v=glance&n=52044011

excellent textbook on many aspects of system virtualization, including those covered by this conversation so far.

> If so, has this been considered in the Xen design, so as to accommodate
> any future hwV/VT/VMX extensions easily and quickly?

vmx is all about processor virtualization. additional topics would include memory virtualization (required, and available in the form of regular virtual memory; but might see additional improvements) and I/O virtualization. i see no reasons why those could not be supported by xen, as they are subsystems which have been backed in a portable and scalable fashion in the operating system landscape for many years now. so the topic of how to accommodate changes in that area is not particularly new.

regards,
daniel

--
Daniel Stodden
LRR - Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München, D-85748 Garching
http://www.lrr.in.tum.de/~stodden mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33 3D80 457E 82AE B0D8 735B
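The "investigation and translation" step Daniel mentions for block I/O amounts to bounds-checking a guest request against the extent the guest was actually granted, then rebasing it into absolute disk sectors. A minimal sketch - the extent layout and the function name are invented for illustration, not the Xen block interface:

```c
#include <assert.h>
#include <stdbool.h>

/* The slice of the physical disk granted to one guest. */
struct extent {
    long start;   /* first absolute sector the guest may touch */
    long nsect;   /* number of sectors in the grant */
};

/* Translate a guest-relative request to an absolute sector, or
 * refuse it: without this check a guest request could address the
 * host partition directly. Returns false if out of bounds. */
static bool translate_blkreq(const struct extent *e,
                             long gsector, long count, long *abs_sector)
{
    if (gsector < 0 || count <= 0 || gsector + count > e->nsect)
        return false;
    *abs_sector = e->start + gsector;
    return true;
}
```

The check is cheap; the reason it must live in the privileged domain rather than the guest is exactly the isolation argument made above.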
Petersson, Mats
2006-Aug-10 16:34 UTC
RE: [Xen-devel] Questioning the Xen Design of the VMM
> -----Original Message----- > From: xen-devel-bounces@lists.xensource.com > [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of > Daniel Stodden > Sent: 10 August 2006 16:54 > To: Al Boldi > Cc: xen-devel@lists.xensource.com > Subject: Re: [Xen-devel] Questioning the Xen Design of the VMM > > On Thu, 2006-08-10 at 17:57 +0300, Al Boldi wrote: > > > > > So HVM solves the problem, but why can''t this layer be > implemented in > > > > software? > > > > > > the short answer at the cpu level is "because of the > arcane nature of > > > the x86 architecture" :/ > > > > Which AMDV/IntelVT supposedly solves? > > regarding the virtualization issue, yes. > > > > once the cpu problem has been solved, you''d need to > emulate hardware > > > resources an unmodified guest system attempts to drive. > that again takes > > > additional cycles. elimination of the peripheral hardware > interfaces by > > > putting the I/O layers on top of an abstract low-level > path into the VMM > > > is one of the reasons why xen is faster than others. many > systems do > > > this quite successfully, even for ''non-modified'' guests like e.g. > > > windows, by installing dedicated, virtualization aware > drivers once the > > > base installation went ok. > > > > You mean "virtualization aware" drivers in the guest-OS? > Wouldn''t this > > amount to a form of patching? > > yes, strictly speaking it is a modification. but one based > upon usually > well-defined interfaces, and it does not require parsing opcodes and > patching code segments.Exactly. There''s a big difference between applying patches to the existing binary, and adding new code by instaling a driver once the system is running. Compare for example that when you install Windows, it may not know how to drive nVidia''s or ATI''s latest graphics card, but you can install a new driver for it. 
You could, alternatively, perhaps patch the existing nVidia driver to make it work for the latest card, but most people prefer to grab the latest driver from www.nvidia.com or so...

In this case, we're installing a driver that can talk via a defined interface to the hypervisor, and by doing so allow us to get "fast" disk access, network access or even graphics.

> otoh, one which obviously needs to be reiterated for any additional
> guest os family.
>
> > > > I'm sure there can't be a performance issue, as this virtualization
> > > > doesn't occur on the physical resource level, but is (should be)
> > > > rather implemented as some sort of a multiplexed routing algorithm,
> > > > I think :)
> > >
> > > few device classes support resource sharing in that manner efficiently.
> > > peripheral devices in commodity platforms are inherently single-hosted
> > > and won't support unfiltered access by multiple driver instances in
> > > several guests.
> >
> > Would this be due to the inability of the peripheral to switch contexts
> > fast enough?
>
> maybe. more important: commodity peripherals typically wouldn't
> sufficiently implement security and isolation. you certainly won't
> 'route' arbitrary block I/O from a guest system to your disk controller
> without further investigation and translation. it may gladly overwrite
> your host partition or whatever resource you granted elsewhere.

Context-switching is only part of the problem, as Daniel says. IOMMU is a technology that is coming in future products from AMD (and I'm sure Intel are working on such products as well; IBM already have a chipset in production for some of the PowerPC and x86-based servers). 
This will solve address translation, but it won't solve problems with sharing devices - that will require either some form of context-switching (which may be acceptable for some devices) or hardware changes to allow multi-porting within the device, with multiple ports providing a separated interface. Or, if applicable to the device, a context-switch of the device.

However, context-switching of external devices is DIFFICULT for several reasons, one being: it's not always possible to read the "context" of a device... Many devices have write-only fields, and other types of "can't read it back" behaviour. For example, in an IDE controller, if the system just issued a non-DMA transfer of a sector, waited for READY to come back from the IDE controller, and started writing bytes to the IDE interface, it can't stop writing bytes until it has reached the correct number as per what the interface expects (usually 512 bytes). There is also, AFAIK, no way to tell how many bytes are left to write (or read, in the case of opposite-direction transfers). This is obviously "braindead" hardware, but it just so happens that much of the PC hardware, even in modern varieties, is pretty much "braindead" - i.e. it has no more intelligence in the device than absolutely necessary. I'm not sure how easy it is to interrogate the status of a DMA transfer, as I've never really dealt much with those.

Another complexity in context-switching devices is that it really can't be done on-the-fly, but must be implemented on an "on-demand" basis [anything else would be FAR too slow - we don't want to do that many operations over the PCI bus that often].

> > If so, how about a "AMDV/IntelVT" for peripherals?
>
> good idea, and actually practical. 
> unfortunately, this is where it's getting expensive.

IOMMU isn't particularly expensive, but multiple ports within a device can get pretty complicated - and are only really suitable for higher-end devices in the first place.

> > > from the vmm perspective, it always boils down to emulating the
> > > device. however, with varying degrees of complexity regarding the
> > > translation of guest requests to physical access. it depends. ide,
> > > afaik, is known to work comparatively well.
> >
> > Probably because IDE follows a well defined API?
>
> yes. however, i'm not an ide guy.

I'm strictly not an IDE guy either, but I do know a fair bit about it, as I've written a bunch of test-code that uses the IDE interface to exercise the Xen-HVM/SVM code-paths involved with IO operations. The IDE interface is pretty straightforward and simple, so it is easy to emulate for that reason. Other devices may have more complex interfaces that are harder to write emulation code for.

> > > an example of an area where it's getting more sportive would be
> > > network adapters.
> > >
> > > this is basically the whole problem when building virtualization
> > > layers for cots platforms: the device/driver landscape spreads to
> > > infinity :) since you'll have a hard time driving any possible
> > > combination by yourself, you need something else to do it. one
> > > solution are hosted vmms, running on top of an existing operating
> > > system. a second solution is what xen does: offload drivers to a
> > > modified guest system which can then carry the I/O load from the
> > > additional, nonprivileged guests as well. 
> > Agreed; so let me rephrase the dilemma like this:
> > The PC platform was never intended to be used in a virtualizing
> > scenario, and therefore does not contain the infrastructure to support
> > this kind of a scenario efficiently, but this could easily be rectified
> > by introducing simple extensions, akin to AMDV/IntelVT, on all levels
> > of the PC hardware.
> >
> > Is this a correct reading?
>
> yes, with restrictions. at this point in time, correct not from an
> economical standpoint. the whole "virtualization renaissance" we've
> been experiencing for the last 3 years or so builds upon the fact that
> PC hardware has
>
> 1. become terribly powerful, compared to what the workloads most
> software systems run actually require.
>
> 2. remained comparatively cheap, as it always used to.
>
> if you start to redesign the I/O system, you're likely to raise the
> cost of the overall system.
>
> I/O virtualization down to the device level may come, but as with
> processor prices, it's all an "economy of scale".
>
> hardware-assisted virtualization at various places in the
> architecture, however, including I/O, is a well-understood topic as
> well.
>
> may i again point you to some reading matter in that area:
>
> nair/smith: virtual machines.
> http://www.amazon.de/gp/product/1558609105/028-2651277-1478934?v=glance&n=52044011
>
> excellent textbook on many aspects of system virtualization, including
> those covered by this conversation so far.
>
> > If so, has this been considered in the Xen design, so as to
> > accommodate any future hwV/VT/VMX extensions easily and quickly?
>
> vmx is all about processor virtualization. additional topics would
> include memory virtualization (required, and available in the form of
> regular virtual memory; but might see additional improvements) and I/O
> virtualization. i see no reason why those could not be supported by
> xen, 
> as they are subsystems which have been handled in a portable and
> scalable fashion in the operating system landscape for many years now.
> so the topic of how to accommodate changes in that area is not
> particularly new.
>
> regards,
> daniel
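The rigid PIO protocol Mats describes above is also what makes the device-model side simple: once the guest issues a sector write, the emulator just has to keep accepting data-port writes until a full 512 bytes have arrived, with no way for either side to ask how many are left. A toy sketch of that (this is an illustration only, not Xen's actual qemu-based device model; all names are made up):

```python
# Toy sketch of a VMM-side IDE PIO write emulation. After a WRITE SECTOR
# command, the guest performs exactly 256 16-bit data-port writes
# (512 bytes); the transfer cannot be cut short or queried mid-way,
# mirroring the "braindead" behaviour of the real interface.

SECTOR_WORDS = 256  # 512 bytes per sector, moved as 16-bit words

class ToyIdeEmulator:
    def __init__(self):
        self.sectors = {}    # sector number -> list of 256 data words
        self.current = None  # sector with a write in flight, or None
        self.buffer = []

    def issue_write(self, sector):
        """Guest wrote a WRITE SECTOR command to the command register."""
        self.current = sector
        self.buffer = []

    def data_port_write(self, word):
        """Guest writes one 16-bit word to the data port (trapped by the VMM)."""
        if self.current is None:
            raise RuntimeError("data written with no command in flight")
        self.buffer.append(word & 0xFFFF)
        # Only a complete sector ends the transfer; the emulator cannot
        # terminate it early, just like the real hardware.
        if len(self.buffer) == SECTOR_WORDS:
            self.sectors[self.current] = list(self.buffer)
            self.current = None

emu = ToyIdeEmulator()
emu.issue_write(sector=7)
for i in range(SECTOR_WORDS):
    emu.data_port_write(i)
assert 7 in emu.sectors and len(emu.sectors[7]) == SECTOR_WORDS
```

The interesting part for the context-switching discussion is the `current`/`buffer` state: it exists only inside the emulator, whereas on a physical controller the equivalent state is not readable back, which is exactly why switching a real device between guests mid-transfer is so hard.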
Daniel Stodden
2006-Aug-10 18:07 UTC
RE: [Xen-devel] Questioning the Xen Design of the VMM
On Thu, 2006-08-10 at 18:34 +0200, Petersson, Mats wrote:
> Context-switching is only part of the problem, as Daniel says.
> IOMMU is a technology that is coming in future products from AMD
> (and I'm sure Intel are working on such products as well;
> IBM already have a chipset in production for some of the PowerPC
> and x86-based servers).

i didn't have a look yet at the papers from amd, but it may be of
interest that the PCI interfaces (1998, maybe even earlier) built by sun
for their ultrasparc processors already implemented such a beast. al,
docs on the bridge should be available from sun online, if you're
interested in such things.

the basic idea being virtualization of the I/O address space, this
feature is quite cool even if you don't give a single thought to system
virtualization (sun probably didn't at that point). getting your hands
on contiguous, dma-able memory areas can be a permanent headache in os
and device driver design if your peripheral bus sees physical memory
untranslated. put a translation table in between and upstream
transactions become a non-issue, without offloading any additional
logic into the peripheral bus interface.

mats, i suppose amd's iommu solves this as well?

regards,
daniel

--
Daniel Stodden
LRR - Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik der TU München  D-85748 Garching
http://www.lrr.in.tum.de/~stodden  mailto:stodden@cs.tum.edu
PGP Fingerprint: F5A4 1575 4C56 E26A 0B33 3D80 457E 82AE B0D8 735B
Petersson, Mats
2006-Aug-11 08:41 UTC
RE: [Xen-devel] Questioning the Xen Design of the VMM
> -----Original Message-----
> From: Daniel Stodden [mailto:stodden@cs.tum.edu]
> Sent: 10 August 2006 19:08
> To: Al Boldi; Petersson@nmail.informatik.tu-muenchen.de; Petersson, Mats
> Cc: xen-devel@lists.xensource.com
> Subject: RE: [Xen-devel] Questioning the Xen Design of the VMM
>
> On Thu, 2006-08-10 at 18:34 +0200, Petersson, Mats wrote:
> > Context-switching is only part of the problem, as Daniel says.
> > IOMMU is a technology that is coming in future products from AMD
> > (and I'm sure Intel are working on such products as well;
> > IBM already have a chipset in production for some of the PowerPC
> > and x86-based servers).
>
> i didn't have a look yet at the papers from amd, but it may be of
> interest that the PCI interfaces (1998, maybe even earlier) built by
> sun for their ultrasparc processors already implemented such a beast.
> al, docs on the bridge should be available from sun online, if you're
> interested in such things.
>
> the basic idea being virtualization of the I/O address space, this
> feature is quite cool even if you don't give a single thought to
> system virtualization (sun probably didn't at that point). getting
> your hands on contiguous, dma-able memory areas can be a permanent
> headache in os and device driver design if your peripheral bus sees
> physical memory untranslated. put a translation table in between and
> upstream transactions become a non-issue, without offloading any
> additional logic into the peripheral bus interface.
>
> mats, i suppose amd's iommu solves this as well?

Yes, of course [see note below]. The only thing it doesn't solve is if the OS decides to swap the pages out - so there still needs to be a call to say "lock this area into memory, don't allow it to move or be swapped out" - but that's trivial compared to "make sure this [large] block of memory is contiguous so that it can be transferred to the hard-disk as one transfer". 
Of course, modern devices cope with this by using scatter/gather technology...

Note: It does somewhat depend on how you implement the software that controls the IOMMU, and how you deal with memory allocation above and below this layer. Since the idea of the IOMMU is to translate guest physical addresses to machine physical addresses when used in conjunction with a VMM, it doesn't necessarily help driver-writers as such, because all it does is present the guest OS and the physical device with "the same view" of physical memory. Let's say we give a guest-OS a mapping of 0..256MB that on the machine physical level isn't contiguous; the guest's physical view would still be contiguous [aside from the regular PC hardware holes, of course] - but the OS would still have to give contiguous regions to the hardware [assuming the HW hasn't got scatter/gather], since the guest doesn't have control over the IOMMU itself. Just like nested paging gives the guest its own level of paging on top of an already virtual address, the IOMMU gives the guest a "virtual" PCI-space that matches its guest-physical view.

So, let's make a trivial example [using a contiguous machine physical range - which may not be the case in real life]:

    Guest       Machine
    0..256MB    256..512MB

The IOMMU would then map the 256MB of guest to the relevant machine physical addresses. In a driver, we are given the address 0x12345000, and 12K (three pages), as a buffer for a PCI device. The driver will do a virt_to_phys() call to the OS, which gives it an address in the 0..256MB range, say 0x1005000 - this address can then be given to the PCI device, and the IOMMU will translate it. But if the page 0x12346000 isn't mapped to the next guest-physical address (0x1006000), then you'd still have to deal with that in some way [presumably by allocating a new buffer with a "please make this contiguous" flag and copying the data, or by sending the data in 4KB chunks]. 
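The trivial example above can be sketched in code. This is a toy model only (illustrative 4KB-page table, made-up helper names; it does not reflect AMD's actual IOMMU table format or programming interface):

```python
# Toy model of the IOMMU translation Mats describes: guest-physical
# pages 0..256MB are mapped onto machine-physical 256..512MB, so a DMA
# address handed out by the guest's virt_to_phys() is redirected
# transparently on its way to memory.

PAGE = 0x1000  # 4KB pages

# One table entry per guest page frame; contiguous on the machine side
# here for simplicity, though it need not be in real life.
GUEST_PAGES = (256 << 20) // PAGE
iommu_table = {gpfn: gpfn + GUEST_PAGES for gpfn in range(GUEST_PAGES)}

def iommu_translate(guest_phys):
    """Translate a guest-physical DMA address to machine-physical."""
    mpfn = iommu_table[guest_phys // PAGE]
    return mpfn * PAGE + (guest_phys % PAGE)

# The driver got guest-physical 0x1005000 from virt_to_phys(); when the
# device issues DMA with that address, the IOMMU redirects it upstream:
assert iommu_translate(0x1005000) == 0x1005000 + (256 << 20)

# But the buffer is only usable as a single transfer (without
# scatter/gather) if it is contiguous in *guest-physical* space --
# the guest cannot reprogram the IOMMU itself to paper over gaps:
def guest_contiguous(pages):
    return all(b == a + PAGE for a, b in zip(pages, pages[1:]))

assert guest_contiguous([0x1005000, 0x1006000, 0x1007000])
assert not guest_contiguous([0x1005000, 0x1009000, 0x100A000])
```

The two assertions at the end restate the point of the example: the translation itself is free and invisible to the guest, while the contiguity problem stays exactly where it was.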
I hope that's clear - it's rather confusing to think about all these things, because there are several levels of translation, which makes life pretty complicated. At least the IOMMU mapping should be pretty static.

--
Mats