thr3ads.net - Linux Virtualization - huh startup_ipi

If this information is useful, please help other people find it:
Share via:

Eric W. Biederman

2007-Apr-28 00:16 UTC

huh startup_ipi_hook?

The current paravirt startup_ipi hook for vmware 
commit: ae5da273fe3352febd38658d8d34484cbcfb3423
is quite frankly ridiculous.

In the middle of wake_up_secondary_cpu:
We have:
      /*
       * Paravirt / VMI wants a startup IPI hook here to set up the
       * target processor state.
       */
      startup_ipi_hook(phys_apicid, (unsigned long) start_secondary,
                       (unsigned long) stack_start.esp);

As far as I can tell from reading this there is a completely
different mechanism in place to start for a secondary processor.
Which seems sane.

What doesn't seem sane is bothering to run the rest of the code
for sending an INIT message to a secondary processor.  It certainly
does not feel general at all.

I think we should be intercepting this startup call at a higher level,
where we can just say:  Start secondary cpu with this stack
and with this esp.  Or something like that.

So conceptually I think the concept makes sense but implementation
wise I think what is currently present is totally ridiculous.

Eric

Jeremy Fitzhardinge

2007-Apr-28 00:23 UTC

head link

huh startup_ipi_hook?

Eric W. Biederman wrote:> So conceptually I think the concept makes sense but implementation
> wise I think what is currently present is totally ridiculous.
I tend to agree. For Xen I added smp_ops as an adjunct to paravirt_ops,
which is basically the interface defined in linux/smp.h:

struct smp_ops
{
	void (*smp_prepare_boot_cpu)(void);
	void (*smp_prepare_cpus)(unsigned max_cpus);
	int (*cpu_up)(unsigned cpu);
	void (*smp_cpus_done)(unsigned max_cpus);

	void (*smp_send_stop)(void);
	void (*smp_send_reschedule)(int cpu);
	int (*smp_call_function_mask)(cpumask_t mask,
				      void (*func)(void *info), void *info,
				      int wait);
};


This is a fairly close match to Xen's requirements. Certainly, anything
APIC-related is useless for us, since there's no APIC emulation going on.

I won't speak for Zach, but his counter-argument is generally along the
lines of "we can just make use of the existing code with a couple of
little hooks near the bottom". But I wonder if the existing genapic
interface can be used (or extended) to cover these cases without having
needing to have APIC-level interfaces in paravirt_ops.

Are you reviewing -mm? That's basically OK, but there's newer stuff in
Andi's patch queue.

J

Andi Kleen

2007-Apr-28 01:48 UTC

head link

huh startup_ipi_hook?

> 
> I think we should be intercepting this startup call at a higher level,
> where we can just say:  Start secondary cpu with this stack
> and with this esp.  Or something like that.
The reason VMI tends to thing lower level is that most of its 
interfaces are like real hardware and of the code
would be just copied then if it did high level interfaces.

I suspect this specific hook could only be made better if 
smpboot.c was refactored into more pieces and callable, but I doubt
that would be a big overall improvement.

-Andi

Zachary Amsden

2007-Apr-30 13:33 UTC

head link

huh startup_ipi_hook?

Eric W. Biederman wrote:> The current paravirt startup_ipi hook for vmware 
> commit: ae5da273fe3352febd38658d8d34484cbcfb3423
> is quite frankly ridiculous.
>
> In the middle of wake_up_secondary_cpu:
> We have:
>       /*
>        * Paravirt / VMI wants a startup IPI hook here to set up the
>        * target processor state.
>        */
>       startup_ipi_hook(phys_apicid, (unsigned long) start_secondary,
>                        (unsigned long) stack_start.esp);
>
> As far as I can tell from reading this there is a completely
> different mechanism in place to start for a secondary processor.
> Which seems sane.
>   
It is not completely different.  The startup mechanism is the same; the 
startup state is not.
> What doesn't seem sane is bothering to run the rest of the code
> for sending an INIT message to a secondary processor.  It certainly
> does not feel general at all.
>   
We need some wakeup mechanism to launch the APs; we already implement 
INIT and STARTUP IPIs for the non-paravirt case and the startup IPI is a 
good match to the wakeup we need, both in the the Linux code and the 
hypervisor.
> I think we should be intercepting this startup call at a higher level,
> where we can just say:  Start secondary cpu with this stack
> and with this esp.  Or something like that.
>
> So conceptually I think the concept makes sense but implementation
> wise I think what is currently present is totally ridiculous.
>   
A heathen notion, conceivably, but not, I hope, an unenlightened one.

We have to support two methods of booting on the same hardware.  
Traditional booting does standard SMP startup, which means the BIOS has 
put CPUs into a real mode wait loop (basically, cli;hlt, wait for INIT 
IPI).  We have to emulate traditional booting; you might not be booting 
a paravirt kernel.

Now here is where problems begin.  BSP enters paravirt mode.  It 
switches paravirt-ops over to warm and fuzzy hypercalls.  APs have no 
idea about this.  In fact, they cannot be switched into paravirt mode 
yet because not only might the BSP be running a UP kernel, which could 
crash or reboot, but more importantly, they have no code to run.  
Unfortunately, they can not run real mode code either.  Once the BSP is 
up and running paravirt style, the binary translator which we use to run 
privileged code has been hobbled at the knees.  This is an 
implementation artifact, certainly, and one that is mostly fixed now, 
but suffice it to say that interactions between CPUs in paravirt and 
non-paravirt mode are currently unsupported at best and unreliable at worst.

To get out of this real mode loop and into paravirt mode, we have to 
switch on the APs at some point.  There are major problems lurking 
here.  To follow points so far:

1) We can't start all CPUs at time-zero in paravirt mode; we might load 
any kernel, paravirt or non para
2) At the time when we are bringing up APs, BSP is in paravirt mode and 
APs are halted in real mode
3) We can't run paravirt mode code on APs without properly initialized 
segment registers for code, stack and data.
4) The i386 architecture provides no way to initialize GDTR or segment 
state on AP prior to a startup IPI.
5) We can't run real mode code on APs to go through the boot trampoline 
and initialize GDTR because of mixed mode problems.

To solve this, we modify the startup IPI to carry additional 
information; it takes almost a full state map and allows the startup IPI 
to initialize the protected mode register settings to any value the OS 
might want.  This is what startup_ipi_hook does - it tells the 
hypervisor the initial state to place the AP in when it receives a 
startup IPI.  It is the most general startup mechanism you can possibly 
have, and allows you to solve the above combination of constraints on 
any protected mode operating system.

We use it to bypass head.S completely, setting control registers and 
segments and jumping directly into paravirtualized protected mode on the 
APs at the C code entry point.  It is arguably cleaner than having some 
real mode trampoline system.

So yes, we have a very different entry method, and it carries the burden 
of maintaining a list of register and segments that the initial CPU 
state should look like on the APs.  Is it easy to break?  Yes.  Jeremy 
broke it at least twice already when reworking per-cpu state.  Did it 
affect his code in any way?  No.  And that is _good_.

Could we hack head.S into a thousand points of light and contort it so 
that both protected mode and real mode entry took the same path, running 
on some default assumed segment state provided by the hypervisor?  
Certainly.  Would this make life easier for you to have new entry points 
popping up all along head.S that all have to do these initial state 
manipulations in slightly different yet co-dependent ways?

No, the best long term solution is to fix the constraint that introduced 
the problem; drop condition 5 above, and make VMI / paravirt entry on 
APs start in real mode, just like the standard hardware, and make it 
follow the regular code in head.S.  Once we get up to C code, it is a 
simple matter to call out to the paravirt-ops code and do the same thing 
that the BSP did to get into paravirt mode, and there are no more 
odd-looking hacks hanging on the wall.  But it is a long term solution, 
not something that is feasible currently.

So that is why it is good that breakage here did not stop Jeremy from 
improving the native kernel with per-cpu data segments.  There is a 
deficiency on our end that did not impede his progress, and the burden 
of maintaining code which you (rightfully) feel is ridiculous is limited 
to those who have it.  That's why I'm listed as a maintainer for the 
code, because it is not maintenance free, but certainly we would like it 
to be hassle free for everyone else.

Zach

tatpratishedhaartham ekatattva abhyasah
"Adherence to single-minded effort prevents these impediments"

Apparently Analagous Threads

Search for more possibly parallel threads

Linux Virtualization - Apr 2007 - huh startup_ipi_hook?

huh startup_ipi_hook?

huh startup_ipi_hook?

huh startup_ipi_hook?

huh startup_ipi_hook?

Apparently Analagous Threads