Ihar Hrachyshka
2019-Aug-21 20:37 UTC
[libvirt-users] RLIMIT_MEMLOCK in container environment
Hi all,

KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes
API resources. In this setup, libvirtd runs inside an unprivileged
pod, with some host mounts / capabilities added to the pod as needed
by libvirtd and other services.

One of the capabilities libvirtd requires for successful startup
inside a pod is SYS_RESOURCE. This capability is used to adjust the
RLIMIT_MEMLOCK ulimit value depending on the devices attached to the
managed guest, both on startup and during hotplug. AFAIU the need to
lock the memory is to avoid pages being pushed out from RAM into swap.

In the KubeVirt world, several of libvirtd's assumptions do not apply:

1. In Kubernetes environments, swap is usually disabled. (For example,
   the official kubeadm deployment tool won't even initialize a
   cluster until you disable it.) This is documented in many places,
   e.g.:
   https://docs.platform9.com/kubernetes/disabling-swap-kubernetes-node/
   (note: while these are vendor docs, it is nevertheless a well-known
   community recommendation).

2. Hotplug is not supported. A domain definition is stable through its
   whole lifetime.

We are working on a series of patches that would remove the need for
the SYS_RESOURCE capability from the pod running libvirtd:
https://github.com/kubevirt/kubevirt/pull/2584

We achieve this by having another, *privileged* component set
RLIMIT_MEMLOCK on the libvirtd process using the prlimit() syscall,
with a value higher than the final value libvirtd passes to
setrlimit(). (The Linux kernel allows lowering the value without the
capability; a rough sketch of the mechanism follows this message.)
Since the formula used to calculate the actual MEMLOCK value is
embedded in libvirt and is not simple to reproduce outside of it, we
pick the upper limit for the libvirtd process quite conservatively,
even though ideally we would use the exact same value libvirtd does.
The estimation code is here:
https://github.com/kubevirt/kubevirt/pull/2584/files#diff-6edccf5f0d11c09e7025d4fae3fa6dc6

While this solution works, it has some drawbacks:

1. the value we use for prlimit() is not exactly equal to the final
   value used by libvirtd;
2. we are doing all this work in an environment that, with swap
   disabled, is not prone to the issues memory locking guards against
   in the first place.

I believe we would benefit from one of the following features on the
libvirt side (or both):

a) expose the memory lock value calculated by libvirtd through the
   libvirt API so that we can use it when calling prlimit() on the
   libvirtd process;
b) allow disabling setrlimit() calls via a libvirtd config file knob
   or the domain definition.

Do you think it would be acceptable to have one of these enhancements
in libvirtd, or perhaps both, for degenerate cases like KubeVirt?

Thanks for your attention,
Ihar
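[For illustration, a minimal Go sketch of the two-step mechanism
described above: a privileged helper raises RLIMIT_MEMLOCK on the
libvirtd process via prlimit(2), after which libvirtd itself can lower
the limit with setrlimit(2) without CAP_SYS_RESOURCE. This is not the
actual KubeVirt code from the PR; the pid and limit values are
placeholders.]

    // Hypothetical sketch, not the actual KubeVirt implementation.
    package main

    import (
            "fmt"

            "golang.org/x/sys/unix"
    )

    // raiseMemlockLimit runs in the privileged component. It raises
    // RLIMIT_MEMLOCK on the (unprivileged) libvirtd process; calling
    // prlimit(2) on another process requires CAP_SYS_RESOURCE in the
    // caller. libvirtd can later lower the limit to its internally
    // computed value with setrlimit(2), which the kernel permits
    // without the capability.
    func raiseMemlockLimit(libvirtdPid int, limitBytes uint64) error {
            lim := unix.Rlimit{Cur: limitBytes, Max: limitBytes}
            if err := unix.Prlimit(libvirtdPid, unix.RLIMIT_MEMLOCK, &lim, nil); err != nil {
                    return fmt.Errorf("raising RLIMIT_MEMLOCK on pid %d: %v", libvirtdPid, err)
            }
            return nil
    }

    func main() {
            // Placeholder values: the upper bound is chosen conservatively
            // (e.g. guest RAM plus fixed headroom), since libvirt's exact
            // formula is internal to libvirt.
            const libvirtdPid = 12345
            const limitBytes = 9 << 30 // e.g. 8 GiB guest + 1 GiB headroom
            if err := raiseMemlockLimit(libvirtdPid, limitBytes); err != nil {
                    fmt.Println(err)
            }
    }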
Daniel P. Berrangé
2019-Aug-22 09:24 UTC
Re: [libvirt-users] RLIMIT_MEMLOCK in container environment
On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
> Hi all,
>
> KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes
> API resources. In this case, libvirtd is running inside an
> unprivileged pod, with some host mounts / capabilities added to the
> pod, needed by libvirtd and other services.
>
> One of the capabilities libvirtd requires for successful startup
> inside a pod is SYS_RESOURCE. This capability is used to adjust
> RLIMIT_MEMLOCK ulimit value depending on devices attached to the
> managed guest, both on startup and during hotplug. AFAIU the need to
> lock the memory is to avoid pages being pushed out from RAM into swap.

Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's
something in the XML that requires it - one of

 - hard limit memory value is present
 - host PCI device passthrough is requested
 - memory is locked into RAM

(illustrative domain XML snippets for each case follow this message)

which of these are you actually using ?

Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
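[For reference, the three triggers Daniel lists correspond to domain
XML along these lines; the values are illustrative, not from an
actual KubeVirt domain.]

    <!-- hard limit memory value is present -->
    <memtune>
      <hard_limit unit='KiB'>9437184</hard_limit>
    </memtune>

    <!-- host PCI device passthrough is requested -->
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x02' slot='0x10' function='0x1'/>
      </source>
    </hostdev>

    <!-- memory is locked into RAM -->
    <memoryBacking>
      <locked/>
    </memoryBacking>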
Ihar Hrachyshka
2019-Aug-22 14:56 UTC
Re: [libvirt-users] RLIMIT_MEMLOCK in container environment
On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
> > Hi all,
> >
> > KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes
> > API resources. In this case, libvirtd is running inside an
> > unprivileged pod, with some host mounts / capabilities added to the
> > pod, needed by libvirtd and other services.
> >
> > One of the capabilities libvirtd requires for successful startup
> > inside a pod is SYS_RESOURCE. This capability is used to adjust
> > RLIMIT_MEMLOCK ulimit value depending on devices attached to the
> > managed guest, both on startup and during hotplug. AFAIU the need to
> > lock the memory is to avoid pages being pushed out from RAM into swap.
>
> Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's
> something in the XML that requires it - one of

You are right, sorry. We add SYS_RESOURCE only for particular domains.

>  - hard limit memory value is present
>  - host PCI device passthrough is requested

We are using passthrough to pass SR-IOV NIC VFs into guests. We also
plan to do the same for GPUs in the near future. (An illustrative VF
example follows this message.)

>  - memory is locked into RAM
>
> which of these are you actually using ?
>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
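[For context, SR-IOV VF passthrough of the kind Ihar describes is
typically expressed in the domain XML as a hostdev-backed interface;
this is the VFIO device assignment that makes libvirt raise the
memlock limit. The PCI address of the VF is host-specific and shown
here only for illustration.]

    <interface type='hostdev' managed='yes'>
      <source>
        <address type='pci' domain='0x0000' bus='0x02' slot='0x10' function='0x1'/>
      </source>
    </interface>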