Hi,

when creating a bigger (> 50 GB) HVM guest with maxmem > memory we get
softlockups from time to time:

kernel: [  802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]

I tracked this down to the call of xc_domain_set_pod_target() and further
p2m_pod_set_mem_target().

Unfortunately I can check this only with xen-4.2.2 as I don't have a machine
with enough memory for current hypervisors. But it seems the code is nearly
the same.

My suggestion would be to do the 'pod set target' in the function
xc_domain_set_pod_target() in chunks of maybe 1GB to give the dom0 scheduler
a chance to run. As this is not performance critical it should not be a
problem.

I can reproduce this with SLES11-SP3 (Linux 3.0.101) and xen-4.2.2:

# cat dummy
name = "DummyOS"
memory = 10000
maxmem = 12000
builder = 'hvm'

# echo 1 > /proc/sys/kernel/watchdog_thresh
# xm create -c dummy

This leads to a kernel message:

kernel: [ 5019.958089] BUG: soft lockup - CPU#4 stuck for 3s! [xend:20854]

Any comments are welcome. Thanks.

Dietmar.

-- 
Company details: http://ts.fujitsu.com/imprint.html
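For illustration, a minimal sketch of the chunking Dietmar suggests, written
here as a wrapper around the existing xen-4.2 libxc call rather than inside
xc_domain_set_pod_target() itself, just to keep it short. The helper name and
the 1 GiB step size are invented for the example, and it assumes the caller
(xend/libxl) knows the target it set previously; untested.

#include <stdint.h>
#include <xenctrl.h>

/* 1 GiB worth of 4 KiB pages per hypercall, so that each call does a
 * bounded amount of work and dom0 can schedule in between. */
#define POD_CHUNK_PAGES (1UL << 18)

/*
 * Hypothetical wrapper: walk from the previously set PoD target to the
 * new one in chunks instead of issuing one potentially very
 * long-running domctl.
 */
static int set_pod_target_chunked(xc_interface *xch, uint32_t domid,
                                  uint64_t old_target, uint64_t new_target)
{
    uint64_t cur = old_target;

    while (cur != new_target)
    {
        uint64_t tot, cache, entries;
        uint64_t step = (new_target > cur) ? new_target - cur
                                           : cur - new_target;
        int rc;

        if (step > POD_CHUNK_PAGES)
            step = POD_CHUNK_PAGES;
        cur = (new_target > cur) ? cur + step : cur - step;

        rc = xc_domain_set_pod_target(xch, domid, cur,
                                      &tot, &cache, &entries);
        if (rc < 0)
            return rc;
    }

    return 0;
}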
>>> On 05.12.13 at 14:55, Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> wrote:
> when creating a bigger (> 50 GB) HVM guest with maxmem > memory we get
> softlockups from time to time.
>
> kernel: [  802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
>
> I tracked this down to the call of xc_domain_set_pod_target() and further
> p2m_pod_set_mem_target().
>
> Unfortunately I can check this only with xen-4.2.2 as I don't have a machine
> with enough memory for current hypervisors. But it seems the code is nearly
> the same.
>
> My suggestion would be to do the 'pod set target' in the function
> xc_domain_set_pod_target() in chunks of maybe 1GB to give the dom0 scheduler
> a chance to run.
> As this is not performance critical it should not be a problem.

This is a broader problem: there are more long-running hypercalls than
just the one setting the PoD target. While a kernel built with
CONFIG_PREEMPT ought to have no issue with this (as the
hypervisor-internal preemption will always exit back to the guest,
thus allowing interrupts to be processed) as long as such hypercalls
aren't invoked with preemption disabled, non-preemptable kernels (the
suggested default for servers) have - afaict - no way to deal with this.

However, as long as interrupts and softirqs can get serviced by the
kernel (which they can as long as they weren't disabled upon invocation
of the hypercall), this may also be a mostly cosmetic problem (in that
the soft lockup is being reported) as long as no real-time-like
guarantees are required (which, if they were, would be sort of
contradictory to the kernel being non-preemptable). That is, other
tasks may get starved for some time, but OS health shouldn't be
impacted.

Hence I wonder whether it wouldn't make sense to simply suppress the
soft lockup detection at least across privcmd-invoked hypercalls -
Cc-ing upstream Linux maintainers to see if they have an opinion or
thoughts towards a proper solution.

Jan
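For illustration, one naive shape such suppression could take (all names and
placement here are invented for the sketch; only touch_softlockup_watchdog(),
the per-CPU accessors and __xen_evtchn_do_upcall() are existing kernel
interfaces). The idea is to mark the CPU while a privcmd hypercall is in
flight and let the event-channel upcall, which does still run while the
hypervisor keeps preempting and restarting the hypercall, keep that CPU's
soft lockup watchdog quiet. This only silences the report; it does not make
the task schedulable.

/* Sketch only -- hypothetical helpers, not an accepted patch. */
#include <linux/percpu.h>
#include <linux/sched.h>	/* touch_softlockup_watchdog() */

static DEFINE_PER_CPU(bool, privcmd_hcall_in_flight);

/*
 * To be called from privcmd_ioctl_hypercall() immediately before/after
 * privcmd_call().  Note: only meaningful on non-CONFIG_PREEMPT kernels,
 * where the task cannot migrate between begin and end.
 */
static inline void privcmd_hcall_begin(void)
{
	__this_cpu_write(privcmd_hcall_in_flight, true);
}

static inline void privcmd_hcall_end(void)
{
	__this_cpu_write(privcmd_hcall_in_flight, false);
}

/*
 * To be called from __xen_evtchn_do_upcall(): while a privcmd hypercall
 * is known to be in flight on this CPU, keep bumping the soft lockup
 * watchdog timestamp so it does not report a false positive.
 */
static inline void privcmd_hcall_check_watchdog(void)
{
	if (__this_cpu_read(privcmd_hcall_in_flight))
		touch_softlockup_watchdog();
}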
On 06/12/13 10:00, Jan Beulich wrote:
>>>> On 05.12.13 at 14:55, Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> wrote:
>> when creating a bigger (> 50 GB) HVM guest with maxmem > memory we get
>> softlockups from time to time.
>>
>> kernel: [  802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
>>
>> I tracked this down to the call of xc_domain_set_pod_target() and further
>> p2m_pod_set_mem_target().
>>
>> Unfortunately I can check this only with xen-4.2.2 as I don't have a machine
>> with enough memory for current hypervisors. But it seems the code is nearly
>> the same.
>>
>> My suggestion would be to do the 'pod set target' in the function
>> xc_domain_set_pod_target() in chunks of maybe 1GB to give the dom0 scheduler
>> a chance to run.
>> As this is not performance critical it should not be a problem.
>
> This is a broader problem: there are more long-running hypercalls than
> just the one setting the PoD target. While a kernel built with
> CONFIG_PREEMPT ought to have no issue with this (as the
> hypervisor-internal preemption will always exit back to the guest,
> thus allowing interrupts to be processed) as long as such hypercalls
> aren't invoked with preemption disabled, non-preemptable kernels (the
> suggested default for servers) have - afaict - no way to deal with this.
>
> However, as long as interrupts and softirqs can get serviced by the
> kernel (which they can as long as they weren't disabled upon invocation
> of the hypercall), this may also be a mostly cosmetic problem (in that
> the soft lockup is being reported) as long as no real-time-like
> guarantees are required (which, if they were, would be sort of
> contradictory to the kernel being non-preemptable). That is, other
> tasks may get starved for some time, but OS health shouldn't be
> impacted.
>
> Hence I wonder whether it wouldn't make sense to simply suppress the
> soft lockup detection at least across privcmd-invoked hypercalls -
> Cc-ing upstream Linux maintainers to see if they have an opinion or
> thoughts towards a proper solution.

We do not want to disable the soft lockup detection here as it has found
a bug. We can't have tasks that are unschedulable for minutes as it
would only take a handful of such tasks to hose the system.

We should put an explicit preemption point in. This will fix it for the
CONFIG_PREEMPT_VOLUNTARY case, which I think is the most common
configuration. Or perhaps this should even be a cond_resched() call to
fix it for fully non-preemptible kernels as well.

David
>>> On 06.12.13 at 12:07, David Vrabel <david.vrabel@citrix.com> wrote:
> We do not want to disable the soft lockup detection here as it has found
> a bug. We can't have tasks that are unschedulable for minutes as it
> would only take a handful of such tasks to hose the system.

My understanding is that the soft lockup detection is what its name
says - a mechanism to find cases where the kernel software locked
up. Yet that's not the case with long-running hypercalls.

> We should put an explicit preemption point in. This will fix it for the
> CONFIG_PREEMPT_VOLUNTARY case, which I think is the most common
> configuration. Or perhaps this should even be a cond_resched() call to
> fix it for fully non-preemptible kernels as well.

How do you imagine doing this? When the hypervisor preempts a
hypercall, all the kernel gets to see is that it drops back into the
hypercall page, such that the next thing to happen would be
re-execution of the hypercall. You can't call anything at that point;
all that can get run here are interrupts (i.e. event upcalls). Or do
you suggest calling cond_resched() from within
__xen_evtchn_do_upcall()?

And even if you do - how certain is it that what gets its continuation
deferred won't interfere with other things the kernel wants to do?
If you did it that way, you'd cover all hypercalls at once, not just
those coming through privcmd, and hence you could end up with partially
completed multicalls or other forms of batching, plus you'd need to
deal with possibly active lazy modes.

Jan
On 06/12/13 11:30, Jan Beulich wrote:
>>>> On 06.12.13 at 12:07, David Vrabel <david.vrabel@citrix.com> wrote:
>> We do not want to disable the soft lockup detection here as it has found
>> a bug. We can't have tasks that are unschedulable for minutes as it
>> would only take a handful of such tasks to hose the system.
>
> My understanding is that the soft lockup detection is what its name
> says - a mechanism to find cases where the kernel software locked
> up. Yet that's not the case with long-running hypercalls.

Well ok, it's not a lockup in the kernel, but it's still a task that
cannot be descheduled for minutes of wallclock time. This is still a
bug that needs to be fixed.

>> We should put an explicit preemption point in. This will fix it for the
>> CONFIG_PREEMPT_VOLUNTARY case, which I think is the most common
>> configuration. Or perhaps this should even be a cond_resched() call to
>> fix it for fully non-preemptible kernels as well.
>
> How do you imagine doing this? When the hypervisor preempts a
> hypercall, all the kernel gets to see is that it drops back into the
> hypercall page, such that the next thing to happen would be
> re-execution of the hypercall. You can't call anything at that point;
> all that can get run here are interrupts (i.e. event upcalls). Or do
> you suggest calling cond_resched() from within
> __xen_evtchn_do_upcall()?

I've not looked at how.

> And even if you do - how certain is it that what gets its continuation
> deferred won't interfere with other things the kernel wants to do?
> If you did it that way, you'd cover all hypercalls at once, not just
> those coming through privcmd, and hence you could end up with partially
> completed multicalls or other forms of batching, plus you'd need to
> deal with possibly active lazy modes.

I would only do this for hypercalls issued by the privcmd driver.

David
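For what it's worth, a rough sketch of the shape such a privcmd-only
preemption point might take, reusing the kind of per-CPU "privcmd hypercall
in flight" marker sketched earlier. All names are invented; the hard part
Jan points out - that this could only run on the interrupt-return path back
into the interrupted hypercall, not from within the event handler itself -
is only captured in the comment. Because only privcmd-marked hypercalls are
affected, multicalls and lazy modes issued elsewhere in the kernel would be
left alone.

/* Sketch only -- hypothetical, not an accepted patch. */
#include <linux/percpu.h>
#include <linux/sched.h>

DECLARE_PER_CPU(bool, privcmd_hcall_in_flight);

/*
 * Would have to be invoked on the interrupt-return path, after the event
 * upcall has been handled but before dropping back into the hypercall
 * page to restart the hypercall - i.e. at a point where scheduling is
 * legal, not from within __xen_evtchn_do_upcall() itself.
 */
void xen_maybe_resched_hcall(void)
{
	if (!__this_cpu_read(privcmd_hcall_in_flight) || !need_resched())
		return;

	/* Clear the marker: we may be rescheduled onto a different CPU. */
	__this_cpu_write(privcmd_hcall_in_flight, false);
	cond_resched();
	__this_cpu_write(privcmd_hcall_in_flight, true);
}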
On Friday, 06 December 2013 at 12:00:02, David Vrabel wrote:
> On 06/12/13 11:30, Jan Beulich wrote:
> >>>> On 06.12.13 at 12:07, David Vrabel <david.vrabel@citrix.com> wrote:
> >> We do not want to disable the soft lockup detection here as it has found
> >> a bug. We can't have tasks that are unschedulable for minutes as it
> >> would only take a handful of such tasks to hose the system.
> >
> > My understanding is that the soft lockup detection is what its name
> > says - a mechanism to find cases where the kernel software locked
> > up. Yet that's not the case with long-running hypercalls.
>
> Well ok, it's not a lockup in the kernel, but it's still a task that
> cannot be descheduled for minutes of wallclock time. This is still a
> bug that needs to be fixed.
>
> >> We should put an explicit preemption point in. This will fix it for the
> >> CONFIG_PREEMPT_VOLUNTARY case, which I think is the most common
> >> configuration. Or perhaps this should even be a cond_resched() call to
> >> fix it for fully non-preemptible kernels as well.
> >
> > How do you imagine doing this? When the hypervisor preempts a
> > hypercall, all the kernel gets to see is that it drops back into the
> > hypercall page, such that the next thing to happen would be
> > re-execution of the hypercall. You can't call anything at that point;
> > all that can get run here are interrupts (i.e. event upcalls). Or do
> > you suggest calling cond_resched() from within
> > __xen_evtchn_do_upcall()?
>
> I've not looked at how.
>
> > And even if you do - how certain is it that what gets its continuation
> > deferred won't interfere with other things the kernel wants to do?
> > If you did it that way, you'd cover all hypercalls at once, not just
> > those coming through privcmd, and hence you could end up with partially
> > completed multicalls or other forms of batching, plus you'd need to
> > deal with possibly active lazy modes.
>
> I would only do this for hypercalls issued by the privcmd driver.

But I also got soft lockups when unmapping a bigger chunk of guest memory
(our BS2000 OS) in the dom0 kernel via vunmap(). In the end this calls
HYPERVISOR_update_va_mapping(), which may take a very long time.
From a kernel module I found no solution to split the virtual address area
so as to be able to call schedule(), because the needed kernel functions are
not exported for use by modules. The only possible solution was to turn off
the soft lockup detection.

Dietmar.

>
> David
>

-- 
Company details: http://ts.fujitsu.com/imprint.html
On 12/06/2013 07:00 AM, David Vrabel wrote:
> On 06/12/13 11:30, Jan Beulich wrote:
>>>>> On 06.12.13 at 12:07, David Vrabel <david.vrabel@citrix.com> wrote:
>>> We do not want to disable the soft lockup detection here as it has found
>>> a bug. We can't have tasks that are unschedulable for minutes as it
>>> would only take a handful of such tasks to hose the system.
>> My understanding is that the soft lockup detection is what its name
>> says - a mechanism to find cases where the kernel software locked
>> up. Yet that's not the case with long-running hypercalls.
> Well ok, it's not a lockup in the kernel, but it's still a task that
> cannot be descheduled for minutes of wallclock time. This is still a
> bug that needs to be fixed.
>
>>> We should put an explicit preemption point in. This will fix it for the
>>> CONFIG_PREEMPT_VOLUNTARY case, which I think is the most common
>>> configuration. Or perhaps this should even be a cond_resched() call to
>>> fix it for fully non-preemptible kernels as well.
>> How do you imagine doing this? When the hypervisor preempts a
>> hypercall, all the kernel gets to see is that it drops back into the
>> hypercall page, such that the next thing to happen would be
>> re-execution of the hypercall. You can't call anything at that point;
>> all that can get run here are interrupts (i.e. event upcalls). Or do
>> you suggest calling cond_resched() from within
>> __xen_evtchn_do_upcall()?
> I've not looked at how.

KVM has a hook (kvm_check_and_clear_guest_paused()) into the watchdog code
to prevent it from reporting false positives (for a different reason,
though). If we claim that the soft lockup mechanism is only meant to detect
Linux kernel problems and not long-running hypervisor code, then perhaps we
can make this hook a bit more generic.

We would still need to think about what may happen if we are stuck in the
hypervisor for an abnormally long time. Maybe such a Xen hook could still
return false when such cases are detected.

-boris

>
>> And even if you do - how certain is it that what gets its continuation
>> deferred won't interfere with other things the kernel wants to do?
>> If you did it that way, you'd cover all hypercalls at once, not just
>> those coming through privcmd, and hence you could end up with partially
>> completed multicalls or other forms of batching, plus you'd need to
>> deal with possibly active lazy modes.
> I would only do this for hypercalls issued by the privcmd driver.
>
> David
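A hedged sketch of what generalizing that hook might look like around
kernel/watchdog.c; the hook name and the helper are invented for the example,
while kvm_check_and_clear_guest_paused() is the existing KVM precedent Boris
mentions.

/* Sketch only -- hypothetical generalization, not an accepted patch. */
#include <linux/types.h>
#include <linux/compiler.h>
#include <linux/kvm_para.h>	/* kvm_check_and_clear_guest_paused() */

bool hypervisor_check_and_clear_stall(void);

/*
 * Hypothetical hypervisor hook: returns true if the apparent stall is the
 * hypervisor's doing (e.g. a long-running, hypervisor-preempted privcmd
 * hypercall) rather than a kernel bug.  A Xen implementation could report
 * the per-CPU "privcmd hypercall in flight" state, and could still return
 * false if the stall exceeds some much larger cut-off, per Boris's caveat.
 */
bool __weak hypervisor_check_and_clear_stall(void)
{
	return false;
}

/* Would be consulted in watchdog_timer_fn(), next to the existing KVM check,
 * before a soft lockup is reported. */
static bool soft_lockup_is_false_positive(void)
{
	if (kvm_check_and_clear_guest_paused())
		return true;
	if (hypervisor_check_and_clear_stall())
		return true;
	return false;
}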
On 06/12/13 13:52, Dietmar Hahn wrote:
> On Friday, 06 December 2013 at 12:00:02, David Vrabel wrote:
>> On 06/12/13 11:30, Jan Beulich wrote:
>>>>>> On 06.12.13 at 12:07, David Vrabel <david.vrabel@citrix.com> wrote:
>>>> We do not want to disable the soft lockup detection here as it has found
>>>> a bug. We can't have tasks that are unschedulable for minutes as it
>>>> would only take a handful of such tasks to hose the system.
>>>
>>> My understanding is that the soft lockup detection is what its name
>>> says - a mechanism to find cases where the kernel software locked
>>> up. Yet that's not the case with long-running hypercalls.
>>
>> Well ok, it's not a lockup in the kernel, but it's still a task that
>> cannot be descheduled for minutes of wallclock time. This is still a
>> bug that needs to be fixed.
>>
>>>> We should put an explicit preemption point in. This will fix it for the
>>>> CONFIG_PREEMPT_VOLUNTARY case, which I think is the most common
>>>> configuration. Or perhaps this should even be a cond_resched() call to
>>>> fix it for fully non-preemptible kernels as well.
>>>
>>> How do you imagine doing this? When the hypervisor preempts a
>>> hypercall, all the kernel gets to see is that it drops back into the
>>> hypercall page, such that the next thing to happen would be
>>> re-execution of the hypercall. You can't call anything at that point;
>>> all that can get run here are interrupts (i.e. event upcalls). Or do
>>> you suggest calling cond_resched() from within
>>> __xen_evtchn_do_upcall()?
>>
>> I've not looked at how.
>>
>>> And even if you do - how certain is it that what gets its continuation
>>> deferred won't interfere with other things the kernel wants to do?
>>> If you did it that way, you'd cover all hypercalls at once, not just
>>> those coming through privcmd, and hence you could end up with partially
>>> completed multicalls or other forms of batching, plus you'd need to
>>> deal with possibly active lazy modes.
>>
>> I would only do this for hypercalls issued by the privcmd driver.
>
> But I also got soft lockups when unmapping a bigger chunk of guest memory
> (our BS2000 OS) in the dom0 kernel via vunmap(). In the end this calls
> HYPERVISOR_update_va_mapping(), which may take a very long time.
> From a kernel module I found no solution to split the virtual address area
> so as to be able to call schedule(), because the needed kernel functions are
> not exported for use by modules. The only possible solution was to turn off
> the soft lockup detection.

vunmap() does a hypercall per page, since it goes through
ptep_get_and_clear(), so there are no long-running hypercalls here.

zap_pmd_range() (which is used for munmap()) already has appropriate
cond_resched() calls after every zap_pte_range(), so I think a
cond_resched() call needs to be added to vunmap_pmd_range() as well.

David
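For reference, the pmd-level loop David refers to looks roughly like this in
mm/vmalloc.c of that era; the added cond_resched() line is only the
suggestion from the mail, not an accepted patch, and it assumes this path is
never reached from atomic context.

static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
{
	pmd_t *pmd;
	unsigned long next;

	pmd = pmd_offset(pud, addr);
	do {
		next = pmd_addr_end(addr, end);
		if (pmd_none_or_clear_bad(pmd))
			continue;
		vunmap_pte_range(pmd, addr, next);
		/* Suggested preemption point, mirroring zap_pmd_range(). */
		cond_resched();
	} while (pmd++, addr = next, addr != end);
}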