On 29/03/2022 20:25, Nir Soffer wrote:
> On Wed, Mar 16, 2022 at 1:55 PM lejeczek <peljasz at yahoo.co.uk> wrote:
>>
>>
>> On 15/03/2022 11:21, Daniel P. Berrangé wrote:
>>> On Tue, Mar 15, 2022 at 10:39:50AM +0000, lejeczek wrote:
>>>> Hi guys.
>>>>
>>>> Without explicitly, manually using watchdog device for a VM, the VM (centOS
>>>> 8 Stream 4.18.0-365.el8.x86_64) shows '/dev/watchdog' exists.
>>>> To double check - 'dumpxml' does not show any such device - what kind of a
>>>> 'watchdog' that is?
>>> The kernel can always provide a pure software watchdog IIRC. It can be
>>> useful if a userspace app wants a watchdog. The limitation is that it
>>> relies on the kernel remaining functional, as there's no hardware
>>> backing it up.
>>>
>>> Regards,
>>> Daniel
>> On a related note - with 'i6300esb' watchdog which I tested
>> and I believe is working.
>> I get often in my VMs from 'dmesg':
>> ...
>> watchdog: BUG: soft lockup - CPU#0 stuck for xxxs! [swapper/0:0]
>> rcu: INFO: rcu_sched self-detected stall on CPU
>> ...
>> This above is from Ubuntu and CentOS alike, and when this
>> happens, the console via VNC responds until the first 'Enter',
>> then becomes non-responsive.
>> This happens after VM(s) was migrated between hosts, but
>> anyway..
>> I do not see what I expected from 'watchdog' - there is no
>> action whatsoever, which should be 'reset'. VM remains in
>> such 'frozen' state forever.
>>
>> any & all shared thoughts much appreciated.
>> L.
> You need to run some userspace tool that will open the watchdog
> device, and pet it periodically, telling the kernel that userspace is
> alive.
>
> If this tool will stop petting the watchdog, maybe because of a soft lockup
> or other trouble, the watchdog device will reset the VM.
>
> watchdog(8) may be the tool you need.
>
> See also
> https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.rst
>
> Nir
>
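To make Nir's description concrete, here is a minimal sketch of such a
userspace "petter" in Python. It assumes what the kernel watchdog API
document linked above describes: any write to the device counts as a
keepalive, and writing the character 'V' right before close asks the
driver to disarm ("magic close"). The function name pet_watchdog and
its parameters are made up for illustration; this is not how
watchdog(8) is implemented.

```python
import time


def pet_watchdog(device="/dev/watchdog", interval=10.0, pings=3, disarm=True):
    """Open a watchdog device, send `pings` keepalives, return how many were sent.

    Assumptions (see the kernel watchdog API document):
      - opening the device arms the watchdog,
      - any write is treated as a keepalive ping,
      - writing b"V" just before close requests a "magic close" (disarm),
        which only works if the driver supports it and nowayout is off.
    """
    sent = 0
    # buffering=0 so that every write reaches the driver immediately
    with open(device, "wb", buffering=0) as wd:
        for _ in range(pings):
            wd.write(b"\0")      # keepalive ping
            sent += 1
            time.sleep(interval)  # must stay well below the device timeout
        if disarm:
            wd.write(b"V")       # ask the driver to disarm before closing
    return sent
```

Run as root against the real device, e.g. pet_watchdog("/dev/watchdog",
interval=10), with the interval kept below the watchdog timeout. With
disarm=False the device is closed without the magic character, so the
timer should keep running and the configured action should fire once
the timeout expires.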
I do not think that the 'i6300esb' watchdog works under those
soft lockups; whether it's the qemu or the OS end I cannot say.
With:
    <watchdog model='i6300esb' action='reset'/>
in the domain XML, the OS sees:
-> $ llr /dev/watchdog*
crw-------. 1 root root  10, 130 Apr  5 16:59 /dev/watchdog
crw-------. 1 root root 248,   0 Apr  5 16:59 /dev/watchdog0
crw-------. 1 root root 248,   1 Apr  5 16:59 /dev/watchdog1
and
-> $ wdctl
Device:        /dev/watchdog
Identity:      i6300ESB timer [version 0]
Timeout:       30 seconds
Pre-timeout:    0 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
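Since the guest exposes two numbered devices, it may help to see which
driver sits behind each one. Below is a small Python sketch that reads
the per-device attributes from sysfs. It assumes the guest kernel was
built with watchdog sysfs support, so that files such as 'identity',
'state' and 'timeout' exist under /sys/class/watchdog/watchdogN; on
kernels without it the inner dictionaries simply come back empty. The
function name list_watchdogs is made up for illustration.

```python
from pathlib import Path


def list_watchdogs(sysfs_root="/sys/class/watchdog"):
    """Return {device_name: {attribute: value}} for every registered watchdog.

    Assumption: the kernel exposes 'identity', 'state' and 'timeout' in
    sysfs. Attributes that are missing are skipped rather than treated
    as an error.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        info = {}
        for attr in ("identity", "state", "timeout"):
            attr_file = dev / attr
            if attr_file.is_file():
                info[attr] = attr_file.read_text().strip()
        result[dev.name] = info
    return result
```

Printing list_watchdogs() in the guest should show one entry per
/dev/watchdogN. If one identity is the i6300ESB timer and the other is
something else, the second device may be the kernel's software watchdog
that Daniel mentioned, though I have not verified that.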
If the HW watchdog worked, then 'i6300esb' should reset
the VM when nothing is pinging the watchdog - I read that it's
possible to exit the 'software' watchdog without causing the HW
watchdog to take action. I do not know if that's what is happening
here when I just 'systemctl stop watchdog'.
In '/etc/watchdog.conf' I do not point to any specific
device, which I believe makes watchdogd do its thing.
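That belief is worth double-checking: as far as I can tell from
watchdog.conf(5), leaving 'watchdog-device' unset disables keepalive
support rather than selecting a default device, in which case the
daemon would never open /dev/watchdog at all. A small Python sketch to
check what the config actually sets; the helper name
configured_watchdog_device is made up, and the parsing is a simplified
assumption about the "key = value" format with '#' comments.

```python
def configured_watchdog_device(conf_text):
    """Return the value of 'watchdog-device' from watchdog.conf text, or None.

    Simplified parsing (an assumption, not the daemon's own parser):
    everything after '#' is a comment, settings are 'key = value'.
    """
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "watchdog-device":
            return value or None
    return None
```

For example, configured_watchdog_device(open("/etc/watchdog.conf").read())
returning None would mean no device is configured, so stopping the
daemon could not be expected to trigger the HW watchdog.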
Simple test:
-> $ cat >> /dev/watchdog
& pressing 'Enter' twice
does invoke the 'reset' action, and if I am to believe 'wdctl', that
is the HW watchdog working. But!...
The main issue I have is those "soft lockups" where the VM's OS
becomes frozen but nothing comes from the watchdog, no action -
although, while the VM is in such a frozen state, the host shows
high CPU usage for the VM.
I do not do anything fancy, so I really wonder if what I see is
that rare.
Soft lockups occur, I think, usually (though I cannot say
exclusively) during or after VM live migration.
thanks, L.