Edward and I have had a multi-day private conversation in IRC on the
topic of this mail. I was planning to update this thread with an email,
but forgot until now :-/
On 9/24/20 10:54 AM, Daniel P. Berrangé wrote:
> On Mon, Sep 21, 2020 at 06:04:36PM +0300, Edward Haas wrote:
>> The PCI addresses appearing in the domxml are not the same as the ones
>> mapped/detected in the VM itself. I compared the domxml on the host
>> and the lspci output in the VM while the VM runs.
> Can you clarify what you are comparing here?
>
> The PCI slot / function in the libvirt XML should match, but the "bus"
> number in libvirt XML is just an index referencing the <controller>
> element in the libvirt XML. So the "bus" number won't directly match
> what's reported in the guest OS. If you want to correlate, you need
> to look at the <address> on the <controller> to translate the libvirt
> "bus" number.
Right. The bus number that is visible in the guest is 100% controlled by
the device firmware (and possibly the guest OS?), and there is no way
for qemu to explicitly set it, and thus no way for libvirt to guarantee
that the bus number in the libvirt XML will be what is seen in the guest
OS; the bus number in the XML only has meaning within the XML - you can
find which controller a device is connected to by looking for the PCI
controller that has the same "index" as the device's "bus".
>
>> This occurs only when SRIOV is defined, messing up also the other
>> "regular" vnics.
>> Somehow, everything comes up and runs (with the SRIOV interface as
>> well) on the first boot (even though the PCI addresses are not in
>> sync), but additional boots cause the VM to mess up the interfaces
>> (not all are detected).
Actually we looked at this offline, and the "messing up" that's
occurring is not due to any change in PCI address from one boot to the
next. The entire problem is caused by the guest OS using traditional
"eth0" and "eth1" netdev names, and making the incorrect assumption
that those names are stable from one boot to the next. In fact, it is a
long-known problem that, due to a race between kernel code initializing
devices and user processes giving them names, the ordering of ethN
device names can change from one boot to the next *even with completely
identical hardware and no configuration changes*. Here is a good
description of that problem, and of systemd's solution to it
("predictable network device names"):
https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
Edward's inquiry was initiated by this bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1874096
You can see in the "first boot" and "second boot" ifconfig output that
one ethernet device has the "altname" enp2s1 during both runs, and
another keeps the altname enp3s0; these names are given by systemd's
"predictable network device name" algorithm (which bases the netdev
name on the PCI address of the device - e.g. enp2s1 is the device at
PCI bus 2, slot 1). But the race between kernel and userspace causes
the "ethN" names to be assigned differently during one boot and the
next.
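(For reference, you can see both kinds of names from inside the guest
with iproute2 - a rough illustration, the device and addresses here are
made up:

```
$ ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP mode DEFAULT
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    altname enp2s1
```

The "altname" stays tied to the PCI address, while "eth0" may point at a
different device on the next boot.)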
In order to have predictable netdev names, the OS image needs to stop
setting net.ifnames=0 on the kernel command line. If they like, they can
give their own more descriptive names to the devices (methods are
described in the above systemd document), but they need to stop relying
on ethN device names.
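For example, one way to pin a descriptive name is a systemd .link file
(just a sketch, assuming a systemd-based guest; the file name, MAC
address, and the name "sriov0" are placeholders, not anything from the
bug report):

```
# /etc/systemd/network/70-sriov-vf.link   (hypothetical file name)
[Match]
# match the device by its MAC address (placeholder value)
MACAddress=52:54:00:12:34:56

[Link]
# the name the guest will see instead of a racy ethN name
Name=sriov0
```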
(Note that this experience did uncover another bug in libvirt, which
*might* contribute to the racy code flip-flopping from boot to boot, but
still isn't the root cause of the problem. In this case libvirtd is
running privileged, but inside a container, and the container doesn't
have full access to the devices' PCI config data in sysfs (if you run
"lspci -v" inside the container, you'll notice "Capabilities: <access
denied>"). One result of this is that libvirt mistakenly determines the
VF is a conventional PCI device (not PCIe), so it auto-adds a
pcie-to-pci-bridge and plugs the VF into that controller. I'm guessing
that makes device initialization take slightly longer or something,
changing the results of the race. I'm looking into changing the test
for PCIe vs. conventional PCI, but again, that isn't the real problem
here.)
>> This is how the domxml hostdev section looks like:
>> ```
>> <hostdev mode='subsystem' type='pci' managed='yes'>
>>   <driver name='vfio'/>
>>   <source>
>>     <address domain='0x0000' bus='0x3b' slot='0x0a' function='0x4'/>
>>   </source>
>>   <alias name='hostdev0'/>
>>   <address type='pci' domain='0x0000' bus='0x06' slot='0x01' function='0x0'/>
>> </hostdev>
>> ```
>>
>> Is there something we are missing or we misconfigured?
>> Tested with 6.0.0-16.fc31
>>
>> My second question is: Can libvirt avoid accessing the PF (as we do
>> not need mac and other options).
> I'm not sure, probably a question for Laine.
The entire point of <interface type='hostdev'> is to be able to set the
MAC address (and optionally the vlan tag) of a VF when assigning it to a
guest, and the only way to set those is via the PF. If you use plain
<hostdev>, then libvirt has no idea that the device is a VF, so it
doesn't look for or try to access its PF.
So, you're doing the right thing - since your container has no access to
the PF, you need to set the MAC address / vlan tag outside the container
(via the PF), and then use <hostdev> (which doesn't do anything related
to PF devices).
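(For completeness, this is roughly what the <interface type='hostdev'>
form looks like when you *do* want libvirt to program the MAC/vlan via
the PF - a sketch only; the MAC address and vlan tag are placeholder
values, and the source address is copied from your <hostdev> above:

```
<interface type='hostdev' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address type='pci' domain='0x0000' bus='0x3b' slot='0x0a' function='0x4'/>
  </source>
  <mac address='52:54:00:12:34:56'/>
  <vlan>
    <tag id='42'/>
  </vlan>
</interface>
```

But again, that only works when libvirtd can reach the VF's PF, which
isn't the case in your container setup.)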