Jiri Denemark
2016-Jan-14 09:51 UTC
Re: [Libguestfs] [libvirt] Quantifying libvirt errors in launching the libguestfs appliance
On Wed, Jan 13, 2016 at 16:25:14 +0100, Martin Kletzander wrote:
> On Wed, Jan 13, 2016 at 10:18:42AM +0000, Richard W.M. Jones wrote:
> > As people may know, we frequently encounter errors caused by libvirt
> > when running the libguestfs appliance.
> >
> > I wanted to find out exactly how frequently these happen and classify
> > the errors, so I ran the 'virt-df' tool overnight 1700 times. This
> > tool runs several parallel qemu:///session libvirt connections, each
> > creating a short-lived appliance guest.
> >
> > Note that I have added Cole's patch to fix
> > https://bugzilla.redhat.com/1271183 ("XML-RPC error : Cannot write
> > data: Transport endpoint is not connected").
> >
> > Results:
> >
> > The test failed 538 times (32% of the time), which is pretty dismal.
> > To be fair, virt-df is aggressive about how it launches parallel
> > libvirt connections. Most other virt-* tools use only a single
> > libvirt connection and are consequently more reliable.
> >
> > Of the failures, 518 (96%) were of the form:
> >
> >   process exited while connecting to monitor: qemu: could not load
> >   kernel '/home/rjones/d/libguestfs/tmp/.guestfs-1000/appliance.d/kernel':
> >   Permission denied
> >
> > which is https://bugzilla.redhat.com/921135 or maybe
> > https://bugzilla.redhat.com/1269975. It's not clear to me whether
> > these bugs have different causes, but if they do then we're
> > potentially seeing a mix of both, since my test has no way to
> > distinguish them.
>
> It looks to me like the same problem, and the same problem we have
> talked about a bunch of times without, apparently, reaching a
> conclusion.
>
> For each of the kernels, libvirt labels them (with both DAC and
> SELinux labels), then proceeds to launch qemu. If this is done in
> parallel, the race is pretty obvious. Could you remind me why you
> couldn't use <seclabel model='none'/> or <seclabel relabel='no'/> or
> something else that would mitigate this? If we cannot use those, then
> we need to implement the <seclabel/> element for kernel and initrd.

Hmm, can't we just label kernel and initrd files the same way we label
<shareable/> disk images, i.e. give them a non-exclusive label so that
all QEMU processes can access them, and avoid removing the label once a
domain disappears?

Jirka
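As a point of reference, the overnight run described above can be
approximated with a loop like the one below. The disk image paths, the
use of two -a options to get parallel appliance launches, and the
matched error string are assumptions based on the description, not
Rich's exact harness:

# Run virt-df repeatedly against the libvirt backend and classify the
# failures. LIBGUESTFS_BACKEND=libvirt selects the libvirt backend;
# /var/tmp/test[12].img are placeholder scratch images.
export LIBGUESTFS_BACKEND=libvirt

fail=0; kernel_perm=0
for i in $(seq 1 1700); do
    if ! out=$(virt-df -a /var/tmp/test1.img -a /var/tmp/test2.img 2>&1); then
        fail=$((fail + 1))
        case $out in
            *"could not load kernel"*"Permission denied"*)
                kernel_perm=$((kernel_perm + 1)) ;;
        esac
    fi
done
echo "failed $fail/1700 runs, $kernel_perm kernel 'Permission denied' errors"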
Daniel P. Berrange
2016-Jan-14 10:12 UTC
Re: [Libguestfs] [libvirt] Quantifying libvirt errors in launching the libguestfs appliance
On Thu, Jan 14, 2016 at 10:51:47AM +0100, Jiri Denemark wrote:
> On Wed, Jan 13, 2016 at 16:25:14 +0100, Martin Kletzander wrote:
> > On Wed, Jan 13, 2016 at 10:18:42AM +0000, Richard W.M. Jones wrote:
> > > [...]
> >
> > For each of the kernels, libvirt labels them (with both DAC and
> > SELinux labels), then proceeds to launch qemu. If this is done in
> > parallel, the race is pretty obvious. Could you remind me why you
> > couldn't use <seclabel model='none'/> or <seclabel relabel='no'/> or
> > something else that would mitigate this? If we cannot use those, then
> > we need to implement the <seclabel/> element for kernel and initrd.
>
> Hmm, can't we just label kernel and initrd files the same way we label
> <shareable/> disk images, i.e. give them a non-exclusive label so that
> all QEMU processes can access them, and avoid removing the label once
> a domain disappears?

We actually should treat it in the same way as <readonly/> disks, and
give it a shared read-only label. And indeed we *do* that.

The difference comes in the restore step, where we blow away the
read-only label and put it back to the original. For disks we never
restore readonly/shared labels, but for kernels we do. If we just kill
the restore step for kernels too, we should be fine AFAICT.

Regards,
Daniel

--
|: http://berrange.com  -o-  http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org :|
|: http://autobuild.org  -o-  http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org  -o-  http://live.gnome.org/gtk-vnc :|
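The labeling lifecycle Dan describes can be watched from the outside
while guests start and exit. A small sketch, with the kernel path taken
from the error message earlier in the thread:

# Poll the DAC owner and SELinux context of the appliance kernel. With
# parallel guests, a label restore from one exiting guest can race
# another guest's startup and produce exactly the EACCES seen above.
KERNEL=$HOME/d/libguestfs/tmp/.guestfs-1000/appliance.d/kernel
while sleep 0.2; do
    stat -c '%U:%G %a' "$KERNEL"   # DAC label: owner, group, mode
    ls -Z "$KERNEL"                # SELinux context
done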
Richard W.M. Jones
2016-Jan-14 10:24 UTC
Re: [Libguestfs] [libvirt] Quantifying libvirt errors in launching the libguestfs appliance
On Thu, Jan 14, 2016 at 10:12:30AM +0000, Daniel P. Berrange wrote:
> The difference comes in the restore step, where we blow away the
> read-only label and put it back to the original. For disks we never
> restore readonly/shared labels, but for kernels we do. If we just
> kill the restore step for kernels too, we should be fine AFAICT.

Works for me - I can try a patch, or if you can point me at the code I
should comment out, I'll do that.

Rich.

--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
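For anyone else hunting for it: assuming the restore logic lives with
the rest of the labeling code in libvirt's security drivers
(src/security/ in libvirt.git; exact function names vary between
releases), this search, run from a libvirt checkout, should surface the
relevant call sites:

# Call sites where the security drivers touch the kernel/initrd paths;
# the restore-side matches are the candidates to skip.
git grep -nE 'os\.(kernel|initrd)' src/security/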
Cole Robinson
2016-Jan-14 15:35 UTC
Re: [Libguestfs] [libvirt] Quantifying libvirt errors in launching the libguestfs appliance
On 01/14/2016 05:12 AM, Daniel P. Berrange wrote:
> On Thu, Jan 14, 2016 at 10:51:47AM +0100, Jiri Denemark wrote:
>> On Wed, Jan 13, 2016 at 16:25:14 +0100, Martin Kletzander wrote:
>>> [...]
>>
>> Hmm, can't we just label kernel and initrd files the same way we label
>> <shareable/> disk images, i.e. give them a non-exclusive label so that
>> all QEMU processes can access them, and avoid removing the label once
>> a domain disappears?
>
> We actually should treat it in the same way as <readonly/> disks,
> and give it a shared read-only label. And indeed we *do* that.
>
> The difference comes in the restore step, where we blow away the
> read-only label and put it back to the original. For disks we never
> restore readonly/shared labels, but for kernels we do. If we just
> kill the restore step for kernels too, we should be fine AFAICT.

Indeed, I forgot we don't restore labels on readonly/shareable disks;
kernel/initrd should certainly match that.

- Cole
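For context, the two XML constructs being aligned here, direct kernel
boot versus a <readonly/> disk, look like this in a qemu:///session
guest. The paths and names below are illustrative, not Rich's actual
appliance configuration:

# A guest that boots a shared kernel/initrd directly and attaches a
# read-only disk. libvirt gives the <readonly/> disk a shared read-only
# label and never restores it; the fix under discussion makes
# <kernel>/<initrd> behave the same way.
cat > appliance-test.xml <<'EOF'
<domain type='kvm'>
  <name>appliance-test</name>
  <memory unit='MiB'>512</memory>
  <os>
    <type arch='x86_64'>hvm</type>
    <kernel>/home/test/appliance.d/kernel</kernel>
    <initrd>/home/test/appliance.d/initrd</initrd>
  </os>
  <devices>
    <disk type='file' device='cdrom'>
      <source file='/home/test/data.iso'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
    </disk>
  </devices>
</domain>
EOF
virsh -c qemu:///session define appliance-test.xml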