Fix cpu offline bug
At current xen, when cpu offline, cpu0 will wait the 1st cpu offline;
However, if offline 2nd, 3rd, ... cpu, cpu0 will not wait it.
This patch is used to fix the bug.
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
diff -r d1631540bcc4 xen/arch/x86/smpboot.c
--- a/xen/arch/x86/smpboot.c	Tue Jan 18 17:23:24 2011 +0000
+++ b/xen/arch/x86/smpboot.c	Sat Feb 12 03:48:09 2011 +0800
@@ -78,7 +78,8 @@ static enum cpu_state {
     CPU_STATE_INIT,     /* master -> slave: Early bringup phase 1 */
     CPU_STATE_CALLOUT,  /* master -> slave: Early bringup phase 2 */
     CPU_STATE_CALLIN,   /* slave -> master: Completed phase 2 */
-    CPU_STATE_ONLINE    /* master -> slave: Go fully online now. */
+    CPU_STATE_ONLINE,   /* master -> slave: Go fully online now. */
+    CPU_STATE_STAY      /* after slave dead, global cpu_state stay here */
 } cpu_state;
 #define set_cpu_state(state) do { mb(); cpu_state = (state); } while (0)
 
@@ -867,6 +868,8 @@ void __cpu_die(unsigned int cpu)
         if ( (++i % 10) == 0 )
             printk(KERN_ERR "CPU %u still not dead...\n", cpu);
     }
+
+    set_cpu_state(CPU_STATE_STAY);
 }
 
 int cpu_add(uint32_t apic_id, uint32_t acpi_id, uint32_t pxm)
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Good fix of a nasty bug! I think I will refactor this a bit and do it slightly differently, but the (simple) principle will of course be the same. Thanks, Keir On 04/03/2011 10:22, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:> Fix cpu offline bug > > At current xen, when cpu offline, cpu0 will wait the 1st cpu offline; > However, if offline 2nd, 3rd, ... cpu, cpu0 will not wait it. > This patch is used to fix the bug. > > Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com> > > diff -r d1631540bcc4 xen/arch/x86/smpboot.c > --- a/xen/arch/x86/smpboot.c Tue Jan 18 17:23:24 2011 +0000 > +++ b/xen/arch/x86/smpboot.c Sat Feb 12 03:48:09 2011 +0800 > @@ -78,7 +78,8 @@ static enum cpu_state { > CPU_STATE_INIT, /* master -> slave: Early bringup phase 1 */ > CPU_STATE_CALLOUT, /* master -> slave: Early bringup phase 2 */ > CPU_STATE_CALLIN, /* slave -> master: Completed phase 2 */ > - CPU_STATE_ONLINE /* master -> slave: Go fully online now. */ > + CPU_STATE_ONLINE, /* master -> slave: Go fully online now. */ > + CPU_STATE_STAY /* after slave dead, global cpu_state stay here */ > } cpu_state; > #define set_cpu_state(state) do { mb(); cpu_state = (state); } while (0) > > @@ -867,6 +868,8 @@ void __cpu_die(unsigned int cpu) > if ( (++i % 10) == 0 ) > printk(KERN_ERR "CPU %u still not dead...\n", cpu); > } > + > + set_cpu_state(CPU_STATE_STAY); > } > > int cpu_add(uint32_t apic_id, uint32_t acpi_id, uint32_t pxm)_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 04/03/2011 10:22, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:> Fix cpu offline bug > > At current xen, when cpu offline, cpu0 will wait the 1st cpu offline; > However, if offline 2nd, 3rd, ... cpu, cpu0 will not wait it. > This patch is used to fix the bug.Alternative fix applied tio unstable and 4.1-testing. The fix below doesn''t account for the fact that cpu_state can also == CPU_STATE_DEAD after an (unsuccessful) CPU online operation. -- Keir> Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com> > > diff -r d1631540bcc4 xen/arch/x86/smpboot.c > --- a/xen/arch/x86/smpboot.c Tue Jan 18 17:23:24 2011 +0000 > +++ b/xen/arch/x86/smpboot.c Sat Feb 12 03:48:09 2011 +0800 > @@ -78,7 +78,8 @@ static enum cpu_state { > CPU_STATE_INIT, /* master -> slave: Early bringup phase 1 */ > CPU_STATE_CALLOUT, /* master -> slave: Early bringup phase 2 */ > CPU_STATE_CALLIN, /* slave -> master: Completed phase 2 */ > - CPU_STATE_ONLINE /* master -> slave: Go fully online now. */ > + CPU_STATE_ONLINE, /* master -> slave: Go fully online now. */ > + CPU_STATE_STAY /* after slave dead, global cpu_state stay here */ > } cpu_state; > #define set_cpu_state(state) do { mb(); cpu_state = (state); } while (0) > > @@ -867,6 +868,8 @@ void __cpu_die(unsigned int cpu) > if ( (++i % 10) == 0 ) > printk(KERN_ERR "CPU %u still not dead...\n", cpu); > } > + > + set_cpu_state(CPU_STATE_STAY); > } > > int cpu_add(uint32_t apic_id, uint32_t acpi_id, uint32_t pxm)_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir,
Thanks for comments and update!
However, I think we still need update it a little :)
============================Fix cpu offline bug
Remove BUG_ON since it''s of some risk, considering the small time
window:
cpu0 read (cpu_state) --> offlining cpu write (cpu_state) --> cpu0 read
(cpu_state)
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
diff -r 1ecb840bea17 xen/arch/x86/smpboot.c
--- a/xen/arch/x86/smpboot.c    Sat Mar 05 23:30:29 2022 +0800
+++ b/xen/arch/x86/smpboot.c    Sun Mar 06 01:21:34 2022 +0800
@@ -864,7 +864,6 @@ void __cpu_die(unsigned int cpu)
     while ( cpu_state != CPU_STATE_DEAD )
     {
-        BUG_ON(cpu_state != CPU_STATE_DYING);
         mdelay(100);
         cpu_relax();
         process_pending_softirqs();
============================
Thanks,
Jinsong
Keir Fraser wrote:> On 04/03/2011 10:22, "Liu, Jinsong" <jinsong.liu@intel.com>
wrote:
> 
>> Fix cpu offline bug
>> 
>> At current xen, when cpu offline, cpu0 will wait the 1st cpu offline;
>> However, if offline 2nd, 3rd, ... cpu, cpu0 will not wait it.
>> This patch is used to fix the bug.
> 
> Alternative fix applied tio unstable and 4.1-testing. The fix below
> doesn''t account for the fact that cpu_state can also =>
CPU_STATE_DEAD after an (unsuccessful) CPU online operation.
> 
>  -- Keir
> 
>> Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
>> 
>> diff -r d1631540bcc4 xen/arch/x86/smpboot.c
>> --- a/xen/arch/x86/smpboot.c Tue Jan 18 17:23:24 2011 +0000
>> +++ b/xen/arch/x86/smpboot.c Sat Feb 12 03:48:09 2011 +0800
>> @@ -78,7 +78,8 @@ static enum cpu_state {
>>      CPU_STATE_INIT,     /* master -> slave: Early bringup phase 1
*/
>>      CPU_STATE_CALLOUT,  /* master -> slave: Early bringup phase 2
*/
>>      CPU_STATE_CALLIN,   /* slave -> master: Completed phase 2 */
>> -    CPU_STATE_ONLINE    /* master -> slave: Go fully online now. */
>> +    CPU_STATE_ONLINE,   /* master -> slave: Go fully online now. */
>> +    CPU_STATE_STAY      /* after slave dead, global cpu_state stay
>>  here */  } cpu_state; #define set_cpu_state(state) do { mb();
>> cpu_state = (state); } while (0) 
>> 
>> @@ -867,6 +868,8 @@ void __cpu_die(unsigned int cpu)
>>          if ( (++i % 10) == 0 )
>>              printk(KERN_ERR "CPU %u still not dead...\n",
cpu);
>> } +
>> +    set_cpu_state(CPU_STATE_STAY);
>>  }
>> 
>>  int cpu_add(uint32_t apic_id, uint32_t acpi_id, uint32_t pxm)
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Urk, good point! Thanks, Keir On 05/03/2011 15:10, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:> Keir, > > Thanks for comments and update! > However, I think we still need update it a little :) > > ============================> Fix cpu offline bug > > Remove BUG_ON since it''s of some risk, considering the small time window: > cpu0 read (cpu_state) --> offlining cpu write (cpu_state) --> cpu0 read > (cpu_state) > > Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com> > > diff -r 1ecb840bea17 xen/arch/x86/smpboot.c > --- a/xen/arch/x86/smpboot.c Sat Mar 05 23:30:29 2022 +0800 > +++ b/xen/arch/x86/smpboot.c Sun Mar 06 01:21:34 2022 +0800 > @@ -864,7 +864,6 @@ void __cpu_die(unsigned int cpu) > > while ( cpu_state != CPU_STATE_DEAD ) > { > - BUG_ON(cpu_state != CPU_STATE_DYING); > mdelay(100); > cpu_relax(); > process_pending_softirqs(); > ============================> > Thanks, > Jinsong > > > > Keir Fraser wrote: >> On 04/03/2011 10:22, "Liu, Jinsong" <jinsong.liu@intel.com> wrote: >> >>> Fix cpu offline bug >>> >>> At current xen, when cpu offline, cpu0 will wait the 1st cpu offline; >>> However, if offline 2nd, 3rd, ... cpu, cpu0 will not wait it. >>> This patch is used to fix the bug. >> >> Alternative fix applied tio unstable and 4.1-testing. The fix below >> doesn''t account for the fact that cpu_state can also =>> CPU_STATE_DEAD after an (unsuccessful) CPU online operation. >> >> -- Keir >> >>> Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com> >>> >>> diff -r d1631540bcc4 xen/arch/x86/smpboot.c >>> --- a/xen/arch/x86/smpboot.c Tue Jan 18 17:23:24 2011 +0000 >>> +++ b/xen/arch/x86/smpboot.c Sat Feb 12 03:48:09 2011 +0800 >>> @@ -78,7 +78,8 @@ static enum cpu_state { >>> CPU_STATE_INIT, /* master -> slave: Early bringup phase 1 */ >>> CPU_STATE_CALLOUT, /* master -> slave: Early bringup phase 2 */ >>> CPU_STATE_CALLIN, /* slave -> master: Completed phase 2 */ >>> - CPU_STATE_ONLINE /* master -> slave: Go fully online now. */ >>> + CPU_STATE_ONLINE, /* master -> slave: Go fully online now. */ >>> + CPU_STATE_STAY /* after slave dead, global cpu_state stay >>> here */ } cpu_state; #define set_cpu_state(state) do { mb(); >>> cpu_state = (state); } while (0) >>> >>> @@ -867,6 +868,8 @@ void __cpu_die(unsigned int cpu) >>> if ( (++i % 10) == 0 ) >>> printk(KERN_ERR "CPU %u still not dead...\n", cpu); >>> } + >>> + set_cpu_state(CPU_STATE_STAY); >>> } >>> >>> int cpu_add(uint32_t apic_id, uint32_t acpi_id, uint32_t pxm) >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> On 05.03.11, Keir Fraser <keir.xen@gmail.com> wrote: >Urk, good point!But your fix still allows what the comment says it is supposed to avoid - simply introducing a variable doesn''t help afaict, you also need to insert a barrier() to avoid a second read (or "cpu_state" would need to be volatile). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan, I''m not quite clear your meaning. Why and where need to insert barrier, or volatile cpu_state? Thanks, Jinsong Jan Beulich wrote:>>>> On 05.03.11, Keir Fraser <keir.xen@gmail.com> wrote: Urk, good >>>> point! > > But your fix still allows what the comment says it is supposed to > avoid - simply introducing a variable doesn''t help afaict, you also > need to insert a barrier() to avoid a second read (or "cpu_state" > would need to be volatile). > > Jan > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
>>> On 08.03.11 at 10:47, "Liu, Jinsong" <jinsong.liu@intel.com> wrote: > Jan, I''m not quite clear your meaning. > Why and where need to insert barrier, or volatile cpu_state?while ( (seen_state = cpu_state) != CPU_STATE_DEAD ) { + barrier(); BUG_ON(seen_state != CPU_STATE_DYING); mdelay(100); cpu_relax(); Without this, the compiler is free to eliminate "seen_state" in favor of reading "cpu_state" twice (irrespective of the optimizer very likely trying to do exactly the opposite). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
On 08/03/2011 10:14, "Jan Beulich" <JBeulich@novell.com> wrote:>>>> On 08.03.11 at 10:47, "Liu, Jinsong" <jinsong.liu@intel.com> wrote: >> Jan, I''m not quite clear your meaning. >> Why and where need to insert barrier, or volatile cpu_state? > > while ( (seen_state = cpu_state) != CPU_STATE_DEAD ) > { > + barrier(); > BUG_ON(seen_state != CPU_STATE_DYING); > mdelay(100); > cpu_relax(); > > Without this, the compiler is free to eliminate "seen_state" in favor > of reading "cpu_state" twice (irrespective of the optimizer very > likely trying to do exactly the opposite).Haha, you''re getting paranoid Jan! What you describe is impossible -- there is no way that C semantics allow a local variable''s value to change after it has affected program behaviour (if it hasn''t escaped local scope by having its address taken for example). If that could happen then we''d have plenty of other code that doesn''t work properly either, I''m sure. In this case, if a second read from cpu_state did occur then we could end up observing seen_state==CPU_STATE_DEAD, in a code block that is predicated on this not being true! -- Keir> Jan >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel