Richard W.M. Jones
2020-Jul-11 08:18 UTC
[Libguestfs] nbdkit / exposing disk images in containers
KubeVirt is a custom resource (a kind of plugin) for Kubernetes which
adds support for running virtual machines.  As part of this they have
the same problems as everyone else of how to import large disk images
into the system for pets, templates, etc.

As part of the project they've defined a format for embedding a disk
image into a container (unclear why?  perhaps so these can be
distributed using the existing container registry systems?):

https://github.com/kubevirt/containerized-data-importer/blob/master/doc/image-from-registry.md

An example of such a disk-in-a-container is here:

https://hub.docker.com/r/kubevirt/fedora-cloud-container-disk-demo

We've been asked if we can help with tools to efficiently import these
disk images, and I have suggested a few things with nbdkit and have
written a couple of filters (tar, gzip) to support this.

This email is my thoughts on further development work in this area.

----------------------------------------------------------------------

(1) Accessing the disk image directly from the Docker Hub.

When you get down to it, what this actually is:

 * There is a disk image in qcow2 format.

 * It is embedded as "./disk/downloaded" in a gzip-compressed tar
   file.  (This is a container with a single layer).

 * This tarball is uploaded to (in this case) the Docker Hub and can
   be accessed over a URL.  The URL can be constructed using a few
   json requests.

 * The URL is served by nginx and this supports HTTP range requests.

I encapsulated all of this in the attached script.  This is an
existence proof that it is possible to access the image with nbdkit.

One problem is that the auth token only lasts for a limited time
(seems to be 5 minutes in my test), and it doesn't automatically renew
as you download the layer, so if the download takes longer than 5
minutes you'll suddenly get unrecoverable authorization failures.

There seem to be two possible ways to solve this:

 (a) Write a new nbdkit-container-plugin which does the authorization
     (essentially hiding most of the details in the attached script
     from the user).  It could deal with renewing the key as
     required.

 (b) Modify nbdkit-curl-plugin so the user could provide a script for
     renewing authorization.  This would expose the full gory details
     to the end user, but on the other hand might be useful in other
     situations that require authorization.

(2) nbdkit-tar-filter exportname and listing files.

This has already been covered by an email from Nir Soffer, so I'll
simply link to that:

https://lists.gnu.org/archive/html/qemu-discuss/2020-06/msg00058.html

It basically requires a fairly simple change to nbdkit-tar-filter to
map the tar filenames into export names, and a deeper change to nbdkit
core server to allow listing all export names.  The end result would
be that an NBD client could query the list of files [ie exports] in
the tarball and choose one to download.

(3) gzip & tar require full downloads - why not “docker/podman save/export”?

Stepping back to get the bigger picture: Because the OCI standard uses
gzip for compression (https://stackoverflow.com/a/9213826), and
because the tar index is interspersed with the tar data, you always
need to download the whole container layer before you can access the
disk image inside.  Currently nbdkit-gzip-filter hides this from the
end user, but it's still downloading the whole thing to a temporary
file.  There's no way round that unless OCI can be persuaded to use a
better format.

But docker/podman already has a way to export container layers,
ie. the save and export commands.  These also have the advantage that
they cache the downloaded layers between runs.  So why aren't we
using that?

In this world, nbdkit-container-plugin would simply use docker/podman
save (or export?) to grab the container as a tar file, and we would
use the tar filter as above to expose the contents as an NBD endpoint
for further consumption.  IOW:

  nbdkit container docker.io/kubevirt/fedora-cloud-container-disk-demo \
         --filter=tar tar-entry=./downloaded/disk

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

Attachment: get-it.sh

#!/bin/bash -

set -e

image=kubevirt/fedora-cloud-container-disk-demo
nbdkit=$HOME/d/nbdkit/nbdkit

TOKEN=\
"$(curl \
--silent \
--request 'GET' \
"https://auth.docker.io/token?service=registry.docker.io&scope=repository:$image:pull" \
| jq -r '.token' \
)"
echo TOKEN=$TOKEN

BLOBSUM=\
"$(curl \
--silent \
--request 'GET' \
--header "Authorization: Bearer $TOKEN" \
"https://registry-1.docker.io/v2/$image/manifests/latest" \
| jq -r '.fsLayers[].blobSum'
)"
echo BLOBSUM=$BLOBSUM

URL="https://registry-1.docker.io/v2/$image/blobs/$BLOBSUM"

# Run nbdkit.
$nbdkit -f -v \
        curl "$URL" header="Authorization: Bearer $TOKEN" \
        --filter=tar --filter=gzip tar-entry=./disk/downloaded
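
The token-expiry problem described under (1) could be handled, in either option (a) or (b), by a helper that requests a fresh token shortly before the old one runs out. The following is only a sketch in Python, not existing nbdkit code. It assumes the token endpoint's JSON reply carries an `expires_in` field next to the `token` field used by get-it.sh; if it does not, the sketch falls back to the roughly 5 minutes observed in testing. The names `TokenCache`, `fetch_token` and `margin` are invented for illustration.

```python
import json
import time
import urllib.request

# Same token endpoint that get-it.sh queries with curl.
AUTH_URL = ("https://auth.docker.io/token"
            "?service=registry.docker.io&scope=repository:{image}:pull")


def fetch_token(image):
    """Request a pull token; returns (token, lifetime in seconds)."""
    with urllib.request.urlopen(AUTH_URL.format(image=image)) as r:
        reply = json.load(r)
    # 'expires_in' is an assumption; default to the ~5 minutes seen in testing.
    return reply["token"], reply.get("expires_in", 300)


class TokenCache:
    """Hands out an Authorization header, renewing the token before expiry."""

    def __init__(self, image, fetch=fetch_token, clock=time.monotonic,
                 margin=30):
        self.image = image
        self.fetch = fetch      # injectable so it can be replaced in tests
        self.clock = clock
        self.margin = margin    # renew this many seconds before expiry
        self.token = None
        self.deadline = 0.0

    def header(self):
        # Fetch on first use, and again once the renewal deadline has passed.
        if self.token is None or self.clock() >= self.deadline:
            self.token, lifetime = self.fetch(self.image)
            self.deadline = self.clock() + lifetime - self.margin
        return "Authorization: Bearer " + self.token
```

A caller would ask for `cache.header()` before every range request instead of computing the header once at startup, so a download lasting longer than the token lifetime keeps working.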
Nir Soffer
2020-Jul-12 20:16 UTC
Re: [Libguestfs] nbdkit / exposing disk images in containers
On Sat, Jul 11, 2020 at 11:18 AM Richard W.M. Jones <rjones@redhat.com> wrote:
>
> KubeVirt is a custom resource (a kind of plugin) for Kubernetes which
> adds support for running virtual machines.  As part of this they have
> the same problems as everyone else of how to import large disk images
> into the system for pets, templates, etc.
>
> As part of the project they've defined a format for embedding a disk
> image into a container (unclear why?  perhaps so these can be
> distributed using the existing container registry systems?):
>
> https://github.com/kubevirt/containerized-data-importer/blob/master/doc/image-from-registry.md
>
> An example of such a disk-in-a-container is here:
>
> https://hub.docker.com/r/kubevirt/fedora-cloud-container-disk-demo
>
> We've been asked if we can help with tools to efficiently import these
> disk images, and I have suggested a few things with nbdkit and have
> written a couple of filters (tar, gzip) to support this.

I don't think the gzip filter matches nbdkit very well. Having to
decompress the entire disk before you can serve it does not sound right.

> This email is my thoughts on further development work in this area.
>
> ----------------------------------------------------------------------
>
> (1) Accessing the disk image directly from the Docker Hub.
>
> When you get down to it, what this actually is:
>
>  * There is a disk image in qcow2 format.
>
>  * It is embedded as "./disk/downloaded" in a gzip-compressed tar
>    file.  (This is a container with a single layer).
>
>  * This tarball is uploaded to (in this case) the Docker Hub and can
>    be accessed over a URL.  The URL can be constructed using a few
>    json requests.
>
>  * The URL is served by nginx and this supports HTTP range requests.
>
> I encapsulated all of this in the attached script.  This is an
> existence proof that it is possible to access the image with nbdkit.
>
> One problem is that the auth token only lasts for a limited time
> (seems to be 5 minutes in my test), and it doesn't automatically renew
> as you download the layer, so if the download takes longer than 5
> minutes you'll suddenly get unrecoverable authorization failures.
>
> There seem to be two possible ways to solve this:
>
>  (a) Write a new nbdkit-container-plugin which does the authorization
>      (essentially hiding most of the details in the attached script
>      from the user).  It could deal with renewing the key as
>      required.
>
>  (b) Modify nbdkit-curl-plugin so the user could provide a script for
>      renewing authorization.  This would expose the full gory details
>      to the end user, but on the other hand might be useful in other
>      situations that require authorization.

docker/podman already solved this, why should nbdkit solve it again?

Do you get timeouts while you download the image with a single request?

> (2) nbdkit-tar-filter exportname and listing files.
>
> This has already been covered by an email from Nir Soffer, so I'll
> simply link to that:
>
> https://lists.gnu.org/archive/html/qemu-discuss/2020-06/msg00058.html
>
> It basically requires a fairly simple change to nbdkit-tar-filter to
> map the tar filenames into export names, and a deeper change to nbdkit
> core server to allow listing all export names.  The end result would
> be that an NBD client could query the list of files [ie exports] in
> the tarball and choose one to download.

We know the tar member name upfront, so why do we need to list the contents?

> (3) gzip & tar require full downloads - why not “docker/podman save/export”?

This looks like a better direction.

The nice thing about embedding the disk in the container image is being
able to use existing infrastructure (docker, quay) to host the images,
and to transfer them to the hosts. We don't need to write any code for
this.

Even better, we have automatic caching on the host by docker/podman, so
we have to pull the image from the registry only once on every host.
Then we can access the local cache.

> Stepping back to get the bigger picture: Because the OCI standard uses
> gzip for compression (https://stackoverflow.com/a/9213826), and
> because the tar index is interspersed with the tar data, you always
> need to download the whole container layer before you can access the
> disk image inside.

You need to download most of the tar, but you don't need to keep the tar
in a temporary file. For example, in Python you can open a tarfile over
the HTTP response object in streaming mode with transparent
decompression ("r|*"), and stream the disk contents from the tar without
a temporary file.

    import shutil
    import sys
    import tarfile

    # response: an already-open HTTP response for the layer blob
    with tarfile.open(mode="r|*", fileobj=response) as tar:
        for member in tar:
            if member.name == "./disk/downloaded":
                with tar.extractfile(member) as f:
                    shutil.copyfileobj(f, sys.stdout.buffer)
                sys.exit(0)

I think this is what the cdi import code does, and it is the most
efficient way to copy the disk directly from the registry with the
current format.

> Currently nbdkit-gzip-filter hides this from the
> end user, but it's still downloading the whole thing to a temporary
> file.  There's no way round that unless OCI can be persuaded to use a
> better format.

The way is to use the container image downloaded by podman/docker.

> But docker/podman already has a way to export container layers,
> ie. the save and export commands.  These also have the advantage that
> it will cache the downloaded layers between runs.  So why aren't we
> using that?
>
> In this world, nbdkit-container-plugin would simply use docker/podman
> save (or export?) to grab the container as a tar file, and we would
> use the tar filter as above to expose the contents as an NBD endpoint
> for further consumption.  IOW:
>
>   nbdkit container docker.io/kubevirt/fedora-cloud-container-disk-demo \
>          --filter=tar tar-entry=./downloaded/disk

This will work but there are 2 issues:

1. podman save/export copy the tar locally. This is pretty fast for the
   example image but copying the tar and deleting it seems wasteful.

2. If we have the tar locally, why not use qemu-img directly? We can
   find the offset of the disk inside the tar and use:

$ time podman save --format oci-dir -o demo-oci docker.io/kubevirt/fedora-cloud-container-disk-demo

real    0m2.795s
user    0m2.011s
sys     0m0.878s

$ time qemu-img convert -O raw 'json:{"file": {"driver": "raw", "offset": 1536, "file": {"driver": "file", "filename": "demo-oci/8162f3eda33d5a87df56e969dcd9777523bd53278a0701b2e53b93c33c01853e"}}}' out.raw

real    0m1.036s
user    0m3.237s
sys     0m1.326s

But I think we have a better way - using a self-extracting-disk
container. Start a container with a disk image, and run qemu-img inside
this container to convert the disk to the target PV.

It can work like this:

1. We create a base image - this will be used for all disk container
   images.

$ cat Dockerfile.kubevirt-img
FROM alpine
RUN apk add qemu-img

$ podman build -t kubevirt-img -f Dockerfile.kubevirt-img .
...

You can pull this from quay.io/nirsof/kubevirt-img.

2. Create a disk container image, based on the base image

$ cat Dockerfile.kubevirt-fedora-cloud-disk
FROM quay.io/nirsof/kubevirt-img
COPY disk.qcow2 /disk.qcow2
CMD ["qemu-img", "convert", "-p", "-f", "qcow2", "-O", "raw", "/disk.qcow2", "/target/disk.img"]

$ podman build -t kubevirt-fedora-cloud-disk -f Dockerfile.kubevirt-fedora-cloud-disk .
...

This container is a little larger, but the common layer with qemu-img
and its dependencies is shared between all disk container images. In
this example it adds only 25 MiB.

You can pull this from quay.io/nirsof/kubevirt-fedora-cloud-disk.

With this we can create a copy of the disk using:

$ time podman run --volume ./:/target:Z --rm -it quay.io/nirsof/kubevirt-fedora-cloud-disk
Trying to pull quay.io/nirsof/kubevirt-fedora-cloud-disk...
Getting image source signatures
Copying blob 0d9094d70e9c skipped: already exists
Copying blob a3ed95caeb02 done
Copying blob a3ed95caeb02 done
Copying blob 18717781bd09 done
Copying blob fe5cd0d8bf32 done
Writing manifest to image destination
Storing signatures
    (100.00/100%)

real    0m59.800s
user    0m8.988s
sys     0m7.437s

$ ls -lhs disk.img
728M -rw-r--r--. 1 nsoffer nsoffer 4.0G Jul 12 21:45 disk.img

$ podman images | grep fedora-cloud
quay.io/nirsof/kubevirt-fedora-cloud-disk            latest  097ef06b6d71  About an hour ago  326 MB
docker.io/kubevirt/fedora-cloud-container-disk-demo  latest  6494830c6dc7  50 years ago       303 MB

The next time we run this we get the container from the cache:

$ time podman run --volume ./:/target:Z --rm -it quay.io/nirsof/kubevirt-fedora-cloud-disk
    (100.00/100%)

real    0m2.244s
user    0m0.070s
sys     0m0.253s

Nir
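
The "find the offset of the disk inside the tar" step in issue 2 above can be done with Python's tarfile module, which records where each member's data begins. This is only a sketch, assuming the saved layer is an uncompressed tar; the function names `member_offset` and `qemu_img_source` are made up for illustration. The JSON it builds has the same shape as the qemu-img argument shown above.

```python
import json
import tarfile


def member_offset(tar_path, name):
    """Return (offset, size) of the data of member `name`.

    Mode "r:" refuses compressed input, since a byte offset is only
    meaningful inside an uncompressed tar.
    """
    with tarfile.open(tar_path, mode="r:") as tar:
        info = tar.getmember(name)
        return info.offset_data, info.size


def qemu_img_source(tar_path, name):
    """Build a 'json:' source for qemu-img pointing at the embedded disk."""
    offset, _size = member_offset(tar_path, name)
    return "json:" + json.dumps({
        "file": {
            "driver": "raw",
            "offset": offset,
            "file": {"driver": "file", "filename": tar_path},
        }
    })
```

The returned string would be passed to qemu-img as the source, for example `qemu-img convert -O raw "$SRC" out.raw`, so the disk is read in place and no second copy of it is extracted.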
Richard W.M. Jones
2020-Jul-13 09:37 UTC
Re: [Libguestfs] nbdkit / exposing disk images in containers
On Sun, Jul 12, 2020 at 11:16:01PM +0300, Nir Soffer wrote:
> On Sat, Jul 11, 2020 at 11:18 AM Richard W.M. Jones <rjones@redhat.com> wrote:
> >
> > KubeVirt is a custom resource (a kind of plugin) for Kubernetes which
> > adds support for running virtual machines.  As part of this they have
> > the same problems as everyone else of how to import large disk images
> > into the system for pets, templates, etc.
> >
> > As part of the project they've defined a format for embedding a disk
> > image into a container (unclear why?  perhaps so these can be
> > distributed using the existing container registry systems?):
> >
> > https://github.com/kubevirt/containerized-data-importer/blob/master/doc/image-from-registry.md
> >
> > An example of such a disk-in-a-container is here:
> >
> > https://hub.docker.com/r/kubevirt/fedora-cloud-container-disk-demo
> >
> > We've been asked if we can help with tools to efficiently import these
> > disk images, and I have suggested a few things with nbdkit and have
> > written a couple of filters (tar, gzip) to support this.
>
> I don't think gzip filter matches nbdkit very well. Having to decompress the
> entire disk before you can serve it does not sound right.

We do have existing plugins -- I'm thinking of
http://libguestfs.org/nbdkit-iso-plugin.1.html -- which are merely
convenient wrappers around what you could do with a bit of shell
scripting.  You might question why we have them at all, but a reason
is that it just makes things simpler for the end user.  They don't
have to worry about how to do the download and cleaning up the
temporary file afterwards.

So while you're right that the gzip filter isn't a good fit with
nbdkit for quite annoying technical reasons, I still think it's worth
having it.  We're not forcing people to use it or preventing them from
using alternatives.

> > This email is my thoughts on further development work in this area.
> >
> > ----------------------------------------------------------------------
> >
> > (1) Accessing the disk image directly from the Docker Hub.
> >
> > When you get down to it, what this actually is:
> >
> >  * There is a disk image in qcow2 format.
> >
> >  * It is embedded as "./disk/downloaded" in a gzip-compressed tar
> >    file.  (This is a container with a single layer).
> >
> >  * This tarball is uploaded to (in this case) the Docker Hub and can
> >    be accessed over a URL.  The URL can be constructed using a few
> >    json requests.
> >
> >  * The URL is served by nginx and this supports HTTP range requests.
> >
> > I encapsulated all of this in the attached script.  This is an
> > existence proof that it is possible to access the image with nbdkit.
> >
> > One problem is that the auth token only lasts for a limited time
> > (seems to be 5 minutes in my test), and it doesn't automatically renew
> > as you download the layer, so if the download takes longer than 5
> > minutes you'll suddenly get unrecoverable authorization failures.
> >
> > There seem to be two possible ways to solve this:
> >
> >  (a) Write a new nbdkit-container-plugin which does the authorization
> >      (essentially hiding most of the details in the attached script
> >      from the user).  It could deal with renewing the key as
> >      required.
> >
> >  (b) Modify nbdkit-curl-plugin so the user could provide a script for
> >      renewing authorization.  This would expose the full gory details
> >      to the end user, but on the other hand might be useful in other
> >      situations that require authorization.
>
> docker/podman already solved this, why should nbdkit solve it again?

Right, exactly my thoughts and the reason for (3) below.

> Do you get timeouts while you download the image with a single request?

Do you mean a single massive curl request?  I didn't try.

You get a 401 authorization failure if you make a request more than
~5 minutes after the token was issued.  Unlike VMware's and RHV's
disk-over-web services, the token doesn't automatically extend when a
request is made.

> > (2) nbdkit-tar-filter exportname and listing files.
> >
> > This has already been covered by an email from Nir Soffer, so I'll
> > simply link to that:
> >
> > https://lists.gnu.org/archive/html/qemu-discuss/2020-06/msg00058.html
> >
> > It basically requires a fairly simple change to nbdkit-tar-filter to
> > map the tar filenames into export names, and a deeper change to nbdkit
> > core server to allow listing all export names.  The end result would
> > be that an NBD client could query the list of files [ie exports] in
> > the tarball and choose one to download.
>
> We know the tar member name upfront, so why do we need to list the contents?

AIUI we don't necessarily know the name up front.  It might not always
be ./downloaded/disk.  (I might be wrong on this.)

> > (3) gzip & tar require full downloads - why not “docker/podman save/export”?
>
> This looks like a better direction.
>
> The nice thing about embedding the disk in the container image is being able
> to use existing infrastructure (docker, quay) to host the images, and
> to transfer them to the hosts. We don't need to write any code for this.
>
> Even better, we have automatic caching on the host by docker/podman, so we
> have to pull the image from the registry only once on every host. Then we can
> access the local cache.
>
> > Stepping back to get the bigger picture: Because the OCI standard uses
> > gzip for compression (https://stackoverflow.com/a/9213826), and
> > because the tar index is interspersed with the tar data, you always
> > need to download the whole container layer before you can access the
> > disk image inside.
>
> You need to download most of the tar, but you don't need to keep the tar
> in a temporary file. For example, in Python you can open a tarfile over
> the HTTP response object in streaming mode with transparent decompression
> ("r|*"), and stream the disk contents from the tar without a temporary file.
>
>     with tarfile.open(mode="r|*", fileobj=response) as tar:
>         for member in tar:
>             if member.name == "./disk/downloaded":
>                 with tar.extractfile(member) as f:
>                     shutil.copyfileobj(f, sys.stdout.buffer)
>                 sys.exit(0)
>
> I think this is what cdi import code does, and is the most efficient way
> to copy the disk directly from the registry with the current format.
>
> > Currently nbdkit-gzip-filter hides this from the
> > end user, but it's still downloading the whole thing to a temporary
> > file.  There's no way round that unless OCI can be persuaded to use a
> > better format.
>
> The way is to use the container image downloaded by podman/docker.
>
> > But docker/podman already has a way to export container layers,
> > ie. the save and export commands.  These also have the advantage that
> > it will cache the downloaded layers between runs.  So why aren't we
> > using that?
> >
> > In this world, nbdkit-container-plugin would simply use docker/podman
> > save (or export?) to grab the container as a tar file, and we would
> > use the tar filter as above to expose the contents as an NBD endpoint
> > for further consumption.  IOW:
> >
> >   nbdkit container docker.io/kubevirt/fedora-cloud-container-disk-demo \
> >          --filter=tar tar-entry=./downloaded/disk
>
> This will work but there are 2 issues:
>
> 1. podman save/export copy the tar locally. This is pretty fast for the example
>    image but copying the tar and deleting it seems wasteful.
>
> 2. If we have the tar locally, why not use qemu-img directly? We can find the
>    offset of the disk inside the tar and use:
>
> $ time podman save --format oci-dir -o demo-oci
> docker.io/kubevirt/fedora-cloud-container-disk-demo
>
> real    0m2.795s
> user    0m2.011s
> sys     0m0.878s
>
> $ time qemu-img convert -O raw 'json:{"file": {"driver": "raw",
> "offset": 1536, "file": {"driver": "file", "filename":
> "demo-oci/8162f3eda33d5a87df56e969dcd9777523bd53278a0701b2e53b93c33c01853e"}}}'
> out.raw
>
> real    0m1.036s
> user    0m3.237s
> sys     0m1.326s
>
> But I think we have a better way - using a self-extracting-disk
> container. Start a container with a disk image, and run qemu-img
> inside this container to convert the disk to the target PV.
>
> It can work like this:
>
> 1. We create a base image - this will be used for all disk container images.
>
> $ cat Dockerfile.kubevirt-img
> FROM alpine
> RUN apk add qemu-img
>
> $ podman build -t kubevirt-img -f Dockerfile.kubevirt-img .
> ...
>
> You can pull this from quay.io/nirsof/kubevirt-img.
>
> 2. Create a disk container image, based on the base image
>
> $ cat Dockerfile.kubevirt-fedora-cloud-disk
> FROM quay.io/nirsof/kubevirt-img
> COPY disk.qcow2 /disk.qcow2
> CMD ["qemu-img", "convert", "-p", "-f", "qcow2", "-O", "raw",
> "/disk.qcow2", "/target/disk.img"]
>
> $ podman build -t kubevirt-fedora-cloud-disk -f
> Dockerfile.kubevirt-fedora-cloud-disk .
> ...
>
> This container is a little larger, but the common layer with qemu-img
> and its dependencies is shared between all disk container images.
> In this example it adds only 25 MiB.
>
> You can pull this from quay.io/nirsof/kubevirt-fedora-cloud-disk.
>
> With this we can create a copy of the disk using:
>
> $ time podman run --volume ./:/target:Z --rm -it
> quay.io/nirsof/kubevirt-fedora-cloud-disk
> Trying to pull quay.io/nirsof/kubevirt-fedora-cloud-disk...
> Getting image source signatures
> Copying blob 0d9094d70e9c skipped: already exists
> Copying blob a3ed95caeb02 done
> Copying blob a3ed95caeb02 done
> Copying blob 18717781bd09 done
> Copying blob fe5cd0d8bf32 done
> Writing manifest to image destination
> Storing signatures
>     (100.00/100%)
>
> real    0m59.800s
> user    0m8.988s
> sys     0m7.437s
>
> $ ls -lhs disk.img
> 728M -rw-r--r--. 1 nsoffer nsoffer 4.0G Jul 12 21:45 disk.img
>
> $ podman images | grep fedora-cloud
> quay.io/nirsof/kubevirt-fedora-cloud-disk            latest  097ef06b6d71  About an hour ago  326 MB
> docker.io/kubevirt/fedora-cloud-container-disk-demo  latest  6494830c6dc7  50 years ago       303 MB
>
> The next time we run this we get the container from the cache:
>
> $ time podman run --volume ./:/target:Z --rm -it
> quay.io/nirsof/kubevirt-fedora-cloud-disk
>     (100.00/100%)
>
> real    0m2.244s
> user    0m0.070s
> sys     0m0.253s

Interesting, yes.  Although I guess this involves recreating all of
these disk-in-a-container images?

Also I'd be a bit concerned from a security angle: We're turning dumb
data into a self-extracting executable program.  (I love shar(1) too,
but unfortunately it's not compatible with the world we live in).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
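
If the member name really is not known up front, as suggested in the reply to (2) above, a client needs some way to list the files in the layer before choosing one. The sketch below uses Python's tarfile module and is not how nbdkit-tar-filter works. Sorting by size, on the guess that the disk image is the largest regular file, is purely an assumed heuristic, and the function name `list_disk_candidates` is invented for illustration.

```python
import tarfile


def list_disk_candidates(tar_path):
    """Return (name, size) for every regular file in the tar, largest first.

    Mode "r:*" lets tarfile detect gzip or no compression by itself.
    """
    with tarfile.open(tar_path, mode="r:*") as tar:
        files = [(m.name, m.size) for m in tar if m.isfile()]
    return sorted(files, key=lambda f: f[1], reverse=True)
```

Each returned name corresponds to what would become an export name if the tar filter mapped tar filenames to NBD exports.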