Ihar Hrachyshka
2019-Aug-22 14:56 UTC
Re: [libvirt-users] RLIMIT_MEMLOCK in container environment
On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
> > Hi all,
> >
> > KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes
> > API resources. In this case, libvirtd is running inside an
> > unprivileged pod, with some host mounts / capabilities added to the
> > pod, needed by libvirtd and other services.
> >
> > One of the capabilities libvirtd requires for successful startup
> > inside a pod is SYS_RESOURCE. This capability is used to adjust
> > RLIMIT_MEMLOCK ulimit value depending on devices attached to the
> > managed guest, both on startup and during hotplug. AFAIU the need to
> > lock the memory is to avoid pages being pushed out from RAM into swap.
>
> Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's
> something in the XML that requires it - one of

You are right, sorry. We add SYS_RESOURCE only for particular domains.

> - hard limit memory value is present
> - host PCI device passthrough is requested

We are using passthrough to pass SR-IOV NIC VFs into guests. We also
plan to do the same for GPUs in the near future.

> - memory is locked into RAM
>
> which of these are you actually using ?
>
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
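For reference, the domain XML fragments corresponding to that list look
roughly like this (a sketch - the values and PCI address are placeholders):

    <!-- hard limit memory value is present -->
    <memtune>
      <hard_limit unit='GiB'>5</hard_limit>
    </memtune>

    <!-- host PCI device passthrough is requested -->
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x03' slot='0x00' function='0x2'/>
      </source>
    </hostdev>

    <!-- memory is locked into RAM -->
    <memoryBacking>
      <locked/>
    </memoryBacking>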
Laine Stump
2019-Aug-22 19:01 UTC
Re: [libvirt-users] RLIMIT_MEMLOCK in container environment
On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
> On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>>
>> On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
>>> Hi all,
>>>
>>> KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes
>>> API resources. In this case, libvirtd is running inside an
>>> unprivileged pod, with some host mounts / capabilities added to the
>>> pod, needed by libvirtd and other services.
>>>
>>> One of the capabilities libvirtd requires for successful startup
>>> inside a pod is SYS_RESOURCE. This capability is used to adjust
>>> RLIMIT_MEMLOCK ulimit value depending on devices attached to the
>>> managed guest, both on startup and during hotplug. AFAIU the need to
>>> lock the memory is to avoid pages being pushed out from RAM into swap.

I recall successfully testing GPU assignment from an unprivileged
libvirtd several years ago by setting a high enough ulimit for the uid
used to run libvirtd in advance. I think we check if the current
setting is high enough, and don't try to set it unless we think we
need to.

If I understand you correctly, you're saying that in your case it's
okay for the memlock limit to be lower than we try to set it to,
because swap is disabled anyway, is that correct?

>>
>> Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's
>> something in the XML that requires it - one of
>
> You are right, sorry. We add SYS_RESOURCE only for particular domains.
>
>>
>> - hard limit memory value is present
>> - host PCI device passthrough is requested
>
> We are using passthrough

(If you want to make Alex happy, use the term "VFIO device assignment"
rather than passthrough :-).)

> to pass SR-IOV NIC VFs into guests. We also
> plan to do the same for GPUs in the near future.

>>> I believe we would benefit from one of the following features on
>>> libvirt side (or both):
>>>
>>> a) expose the memory lock value calculated by libvirtd through
>>> libvirt ABI so that we can use it when calling prlimit() on libvirtd
>>> process;
>>> b) allow to disable setrlimit() calls via libvirtd config file knob
>>> or domain definition.

(b) sounds much more reasonable, as long as qemu doesn't complain (I
don't know whether or not it checks).

Slightly related to this - I'm currently working on patches to avoid
making any ioctl calls that would fail in an unprivileged libvirtd
when using tap/macvtap devices. ATM, I'm doing this by adding an
attribute "unmanaged='yes'" to the interface <target> element. The
idea is that if someone sets unmanaged='yes', they're stating that the
caller (i.e. kubevirt) is responsible for all device setup, and that
libvirt should just use it without further setup. A similar approach
could be applied to hostdev devices - if unmanaged is set, we assume
that the caller has done everything to make the associated device
usable.

(Of course this all makes me realize the inanity of adding a <target
dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have
<hostdev managed='yes'> and <interface type='hostdev' managed='yes'>.
So to prevent setting the locklimit for hostdev, would we make a new
setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I
*hate* trying to make config consistent :-/)

(alternately, we could just automatically fail the attempt to set the
lock limit in a graceful manner and allow the guest to continue)

BTW, I'm guessing that you use <hostdev> to assign the SR-IOV VFs
rather than <interface type='hostdev'>, correct? The latter would
require that you have enough capabilities to set MAC addresses on the
VFs (that's the entire point of using <interface type='hostdev'>
instead of plain <hostdev>).
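For reference, the difference looks roughly like this in domain XML (a
sketch - the PCI address and MAC are placeholders):

    <!-- plain <hostdev>: libvirt assigns the VF but never touches its MAC -->
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x03' slot='0x00' function='0x2'/>
      </source>
    </hostdev>

    <!-- <interface type='hostdev'>: libvirt additionally sets the MAC on
         the VF via its PF, which requires extra privileges -->
    <interface type='hostdev' managed='yes'>
      <source>
        <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x2'/>
      </source>
      <mac address='52:54:00:6d:90:02'/>
    </interface>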
Ihar Hrachyshka
2019-Aug-22 20:39 UTC
Re: [libvirt-users] RLIMIT_MEMLOCK in container environment
On Thu, Aug 22, 2019 at 12:01 PM Laine Stump <laine@redhat.com> wrote:
>
> On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
> > On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>
> >> On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
> >>> Hi all,
> >>>
> >>> KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes
> >>> API resources. In this case, libvirtd is running inside an
> >>> unprivileged pod, with some host mounts / capabilities added to the
> >>> pod, needed by libvirtd and other services.
> >>>
> >>> One of the capabilities libvirtd requires for successful startup
> >>> inside a pod is SYS_RESOURCE. This capability is used to adjust
> >>> RLIMIT_MEMLOCK ulimit value depending on devices attached to the
> >>> managed guest, both on startup and during hotplug. AFAIU the need to
> >>> lock the memory is to avoid pages being pushed out from RAM into swap.
>
> I recall successfully testing GPU assignment from an unprivileged
> libvirtd several years ago by setting a high enough ulimit for the uid
> used to run libvirtd in advance. I think we check if the current
> setting is high enough, and don't try to set it unless we think we
> need to.

The PR I linked to in the original email does just that: it starts
libvirtd; then, if the domain is going to use VFIO, sets the ulimit of
the libvirtd process to the VM memory size + 1 GiB (mimicking libvirt
code) + 256 MiB (to stay conservative) using the prlimit() syscall;
then defines the domain.
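Roughly, the logic is the following (a sketch in Go, not the actual PR
code - the function name and slack constants are illustrative, and
raising another process's hard limit requires CAP_SYS_RESOURCE):

    package main

    import (
        "fmt"
        "os"
        "strconv"

        "golang.org/x/sys/unix"
    )

    const (
        giB uint64 = 1 << 30
        miB uint64 = 1 << 20
    )

    // raiseMemlockLimit bumps RLIMIT_MEMLOCK of the libvirtd process
    // before a VFIO domain is defined: VM memory + 1 GiB (mimicking
    // libvirt) + 256 MiB of slack.
    func raiseMemlockLimit(libvirtdPid int, vmMemoryBytes uint64) error {
        want := vmMemoryBytes + giB + 256*miB

        var cur unix.Rlimit
        if err := unix.Prlimit(libvirtdPid, unix.RLIMIT_MEMLOCK, nil, &cur); err != nil {
            return fmt.Errorf("reading RLIMIT_MEMLOCK of pid %d: %w", libvirtdPid, err)
        }
        // Never lower a limit that is already high enough.
        if cur.Cur >= want && cur.Max >= want {
            return nil
        }
        lim := unix.Rlimit{Cur: want, Max: want}
        return unix.Prlimit(libvirtdPid, unix.RLIMIT_MEMLOCK, &lim, nil)
    }

    func main() {
        pid, err := strconv.Atoi(os.Args[1])
        if err != nil {
            fmt.Fprintln(os.Stderr, "usage: memlock <libvirtd-pid>")
            os.Exit(1)
        }
        if err := raiseMemlockLimit(pid, 4*giB); err != nil { // e.g. a 4 GiB guest
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }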
> If I understand you correctly, you're saying that in your case it's
> okay for the memlock limit to be lower than we try to set it to,
> because swap is disabled anyway, is that correct?

I'm honestly not exactly sure why we need to set the limit, but I
assume it's because of swap. I could be totally confused on that part,
though.

> >>
> >> Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's
> >> something in the XML that requires it - one of
> >
> > You are right, sorry. We add SYS_RESOURCE only for particular domains.
> >
> >>
> >> - hard limit memory value is present
> >> - host PCI device passthrough is requested
> >
> > We are using passthrough
>
> (If you want to make Alex happy, use the term "VFIO device assignment"
> rather than passthrough :-).)

Not sure who Alex is, but I'll try to make everyone happy! :)

> > to pass SR-IOV NIC VFs into guests. We also
> > plan to do the same for GPUs in the near future.
>
> >>> I believe we would benefit from one of the following features on
> >>> libvirt side (or both):
> >>>
> >>> a) expose the memory lock value calculated by libvirtd through
> >>> libvirt ABI so that we can use it when calling prlimit() on libvirtd
> >>> process;
> >>> b) allow to disable setrlimit() calls via libvirtd config file knob
> >>> or domain definition.
>
> (b) sounds much more reasonable, as long as qemu doesn't complain (I
> don't know whether or not it checks)
>
> Slightly related to this - I'm currently working on patches to avoid
> making any ioctl calls that would fail in an unprivileged libvirtd
> when using tap/macvtap devices. ATM, I'm doing this by adding an
> attribute "unmanaged='yes'" to the interface <target> element. The
> idea is that if someone sets unmanaged='yes', they're stating that the
> caller (i.e. kubevirt) is responsible for all device setup, and that
> libvirt should just use it without further setup. A similar approach
> could be applied to hostdev devices - if unmanaged is set, we assume
> that the caller has done everything to make the associated device
> usable.
>
> (Of course this all makes me realize the inanity of adding a <target
> dev='blah' unmanaged='yes'/> for interfaces when hostdevs already have
> <hostdev managed='yes'> and <interface type='hostdev' managed='yes'>.
> So to prevent setting the locklimit for hostdev, would we make a new
> setting like <hostdev managed='no-never-not-even-a-tiny-bit'>? Sigh. I
> *hate* trying to make config consistent :-/)
>
> (alternately, we could just automatically fail the attempt to set the
> lock limit in a graceful manner and allow the guest to continue)

If that's something maintainers feel good about, I am all for it,
since it simplifies the implementation.

> BTW, I'm guessing that you use <hostdev> to assign the SR-IOV VFs
> rather than <interface type='hostdev'>, correct? The latter would
> require that you have enough capabilities to set MAC addresses on the
> VFs (that's the entire point of using <interface type='hostdev'>
> instead of plain <hostdev>)

Yes, we use <hostdev> exactly because <interface> sets the MAC
address: in the kubevirt scenario, the container running libvirtd has
its own network namespace and doesn't have access to the PF to set the
VF MAC address on. Instead, we rely on a CNI plugin running in the
root namespace context to configure the VF interface as needed. (I
contributed custom MAC support to the SR-IOV CNI plugin very
recently.)

Ihar