thr3ads.net - Libguestfs - [Libguestfs] nbdkit / exposing disk images in containers [Jul 2020]

Richard W.M. Jones

2020-Jul-11 08:18 UTC

[Libguestfs] nbdkit / exposing disk images in containers

KubeVirt is a custom resource (a kind of plugin) for Kubernetes which
adds support for running virtual machines.  As part of this they have
the same problems as everyone else of how to import large disk images
into the system for pets, templates, etc.

As part of the project they've defined a format for embedding a disk
image into a container (unclear why?  perhaps so these can be
distributed using the existing container registry systems?):

 
github.com/kubevirt/containerized-data-importer/blob/master/doc/image-from-registry.md

An example of such a disk-in-a-container is here:

  hub.docker.com/r/kubevirt/fedora-cloud-container-disk-demo

We've been asked if we can help with tools to efficiently import these
disk images, and I have suggested a few things with nbdkit and have
written a couple of filters (tar, gzip) to support this.

This email is my thoughts on further development work in this area.

----------------------------------------------------------------------

(1) Accessing the disk image directly from the Docker Hub.

When you get down to it, what this actually is:

  * There is a disk image in qcow2 format.

  * It is embedded as "./disk/downloaded" in a gzip-compressed tar
    file.  (This is a container with a single layer).

  * This tarball is uploaded to (in this case) the Docker Hub and can
    be accessed over a URL.  The URL can be constructed using a few
    json requests.

  * The URL is served by nginx and this supports HTTP range requests.

I encapsulated all of this in the attached script.  This is an
existence proof that it is possible to access the image with nbdkit.
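
As a rough illustration of the "few json requests" step, here is a minimal Python sketch of how the URLs fit together. It mirrors the attached script but performs no network access. The https:// scheme, the function names and the sample manifest in the usage note are assumptions for illustration; the `fsLayers`/`blobSum` fields are the ones the script reads with jq.

```python
import json

# Assumption: the registry endpoints are reached over https.
AUTH = "https://auth.docker.io/token"
REGISTRY = "https://registry-1.docker.io"


def token_url(image):
    """URL that returns a JSON document with a short-lived pull token."""
    return f"{AUTH}?service=registry.docker.io&scope=repository:{image}:pull"


def manifest_url(image, tag="latest"):
    """URL of the image manifest (fetched with the bearer token)."""
    return f"{REGISTRY}/v2/{image}/manifests/{tag}"


def layer_blob_urls(image, manifest_text):
    """Build one blob URL per layer listed in the manifest's fsLayers."""
    manifest = json.loads(manifest_text)
    return [
        f"{REGISTRY}/v2/{image}/blobs/{layer['blobSum']}"
        for layer in manifest["fsLayers"]
    ]
```

For a single-layer image, `layer_blob_urls(image, manifest_text)` returns a one-element list, and that element is the URL handed to the curl plugin.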

One problem is that the auth token only lasts for a limited time
(seems to be 5 minutes in my test), and it doesn't automatically renew
as you download the layer, so if the download takes longer than 5
minutes you'll suddenly get unrecoverable authorization failures.

There seem to be two possible ways to solve this:

  (a) Write a new nbdkit-container-plugin which does the authorization
      (essentially hiding most of the details in the attached script
      from the user).  It could deal with renewing the key as
      required.

  (b) Modify nbdkit-curl-plugin so the user could provide a script for
      renewing authorization.  This would expose the full gory details
      to the end user, but on the other hand might be useful in other
      situations that require authorization.
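
Either option comes down to the same small piece of logic: remember when the token was issued and fetch a new one shortly before it expires. The following Python sketch shows that logic only. It is not nbdkit code; the class name, the 300 second lifetime and the 30 second safety margin are assumptions, and `fetch` stands in for whatever actually requests a token (a plugin's own HTTP call in (a), a user-supplied script in (b)).

```python
import time


class RenewingToken:
    """Cache a bearer token and refresh it shortly before it expires."""

    def __init__(self, fetch, lifetime=300.0, margin=30.0, clock=time.monotonic):
        self._fetch = fetch        # callable returning a fresh token string
        self._lifetime = lifetime  # assumed validity period in seconds
        self._margin = margin      # renew this many seconds before expiry
        self._clock = clock
        self._token = None
        self._expires = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires - self._margin:
            self._token = self._fetch()
            self._expires = now + self._lifetime
        return self._token

    def header(self):
        """Header line in the form the attached script passes to curl."""
        return f"Authorization: Bearer {self.get()}"
```

Calling `header()` before every range request would keep a long download from running into an expired token, at the cost of one extra token request every few minutes.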


(2) nbdkit-tar-filter exportname and listing files.

This has already been covered by an email from Nir Soffer, so I'll
simply link to that:

lists.gnu.org/archive/html/qemu-discuss/2020-06/msg00058.html

It basically requires a fairly simple change to nbdkit-tar-filter to
map the tar filenames into export names, and a deeper change to nbdkit
core server to allow listing all export names.  The end result would
be that an NBD client could query the list of files [ie exports] in
the tarball and choose one to download.
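
To make the idea concrete, here is a small Python sketch of the mapping such a filter would need: each regular file in an uncompressed tar becomes a candidate export name, together with the offset and size of its data. This only illustrates the idea using Python's tarfile module, it is not how nbdkit-tar-filter is implemented, and `list_exports` is a hypothetical name.

```python
import io
import tarfile


def list_exports(fileobj):
    """Map each regular file in an uncompressed tar to (data offset, size)."""
    exports = {}
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                exports[member.name] = (member.offset_data, member.size)
    return exports
```

An NBD server holding this table could answer a "list exports" request with the keys, and serve a chosen export by reading `size` bytes starting at `offset` in the tar.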


(3) gzip & tar require full downloads - why not “docker/podman save/export”?

Stepping back to get the bigger picture: Because the OCI standard uses
gzip for compression (stackoverflow.com/a/9213826), and
because the tar index is interspersed with the tar data, you always
need to download the whole container layer before you can access the
disk image inside.  Currently nbdkit-gzip-filter hides this from the
end user, but it's still downloading the whole thing to a temporary
file.  There's no way round that unless OCI can be persuaded to use a
better format.

But docker/podman already has a way to export container layers,
ie. the save and export commands.  These also have the advantage that
it will cache the downloaded layers between runs.  So why aren't we
using that?

In this world, nbdkit-container-plugin would simply use docker/podman
save (or export?) to grab the container as a tar file, and we would
use the tar filter as above to expose the contents as an NBD endpoint
for further consumption.  IOW:

  nbdkit container docker.io/kubevirt/fedora-cloud-container-disk-demo \
         --filter=tar tar-entry=./disk/downloaded

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat people.redhat.com/~rjones
Read my programming and virtualization blog: rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
fedoraproject.org/wiki/MinGW

Attachment: get-it.sh

#!/bin/bash -

set -e

image=kubevirt/fedora-cloud-container-disk-demo
nbdkit=$HOME/d/nbdkit/nbdkit

TOKEN=\
"$(curl \
--silent \
--header 'GET' \
"auth.docker.io/token?service=registry.docker.io&scope=repository:$image:pull" \
| jq -r '.token' \
)"
echo TOKEN=$TOKEN

BLOBSUM=\
"$(curl \
--silent \
--request 'GET' \
--header "Authorization: Bearer $TOKEN" \
"registry-1.docker.io/v2/$image/manifests/latest" \
| jq -r '.fsLayers[].blobSum'
)"
echo BLOBSUM=$BLOBSUM

URL="registry-1.docker.io/v2/$image/blobs/$BLOBSUM"

# Run nbdkit.
$nbdkit -f -v \
        curl "$URL" header="Authorization: Bearer $TOKEN" \
        --filter=tar --filter=gzip tar-entry=./disk/downloaded

Nir Soffer

2020-Jul-12 20:16 UTC

Re: [Libguestfs] nbdkit / exposing disk images in containers

On Sat, Jul 11, 2020 at 11:18 AM Richard W.M. Jones <rjones@redhat.com> wrote:
>
> KubeVirt is a custom resource (a kind of plugin) for Kubernetes which
> adds support for running virtual machines.  As part of this they have
> the same problems as everyone else of how to import large disk images
> into the system for pets, templates, etc.
>
> As part of the project they've defined a format for embedding a disk
> image into a container (unclear why?  perhaps so these can be
> distributed using the existing container registry systems?):
>
>  
github.com/kubevirt/containerized-data-importer/blob/master/doc/image-from-registry.md
>
> An example of such a disk-in-a-container is here:
>
>   hub.docker.com/r/kubevirt/fedora-cloud-container-disk-demo
>
> We've been asked if we can help with tools to efficiently import these
> disk images, and I have suggested a few things with nbdkit and have
> written a couple of filters (tar, gzip) to support this.

I don't think gzip filter matches nbdkit very well. Having to decompress the
entire disk before you can serve it does not sound right.

> This email is my thoughts on further development work in this area.
>
> ----------------------------------------------------------------------
>
> (1) Accessing the disk image directly from the Docker Hub.
>
> When you get down to it, what this actually is:
>
>   * There is a disk image in qcow2 format.
>
>   * It is embedded as "./disk/downloaded" in a gzip-compressed tar
>     file.  (This is a container with a single layer).
>
>   * This tarball is uploaded to (in this case) the Docker Hub and can
>     be accessed over a URL.  The URL can be constructed using a few
>     json requests.
>
>   * The URL is served by nginx and this supports HTTP range requests.
>
> I encapsulated all of this in the attached script.  This is an
> existence proof that it is possible to access the image with nbdkit.
>
> One problem is that the auth token only lasts for a limited time
> (seems to be 5 minutes in my test), and it doesn't automatically renew
> as you download the layer, so if the download takes longer than 5
> minutes you'll suddenly get unrecoverable authorization failures.
>
> There seem to be two possible ways to solve this:
>
>   (a) Write a new nbdkit-container-plugin which does the authorization
>       (essentially hiding most of the details in the attached script
>       from the user).  It could deal with renewing the key as
>       required.
>
>   (b) Modify nbdkit-curl-plugin so the user could provide a script for
>       renewing authorization.  This would expose the full gory details
>       to the end user, but on the other hand might be useful in other
>       situations that require authorization.

docker/podman already solved this, why should nbdkit solve it again?

Do you get timeouts while you download the image with a single request?

> (2) nbdkit-tar-filter exportname and listing files.
>
> This has already been covered by an email from Nir Soffer, so I'll
> simply link to that:
>
> lists.gnu.org/archive/html/qemu-discuss/2020-06/msg00058.html
>
> It basically requires a fairly simple change to nbdkit-tar-filter to
> map the tar filenames into export names, and a deeper change to nbdkit
> core server to allow listing all export names.  The end result would
> be that an NBD client could query the list of files [ie exports] in
> the tarball and choose one to download.

We know the tar member name upfront, so why do we need to list the contents?

> (3) gzip & tar require full downloads - why not “docker/podman save/export”?

This looks like a better direction.

The nice thing about embedding the disk in the container image is being able
to use existing infrastructure (docker, quay) to host the images, and to
transfer them to the hosts. We don't need to write any code for this.

Even better, we have automatic caching on the host by docker/podman, so we
have to pull the image from the registry only once on every host. Then we can
access the local cache.

> Stepping back to get the bigger picture: Because the OCI standard uses
> gzip for compression (stackoverflow.com/a/9213826), and
> because the tar index is interspersed with the tar data, you always
> need to download the whole container layer before you can access the
> disk image inside.

You need to download most of the tar, but you don't need to keep the tar
in a temporary file. For example, in Python you can open a tarfile over
the HTTP response object in streaming mode with transparent decompression
("r|*"), and stream the disk contents from the tar without a temporary
file.

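    import shutil
    import sys
    import tarfile

    # "response" is the HTTP response object for the layer blob.
    with tarfile.open(mode="r|*", fileobj=response) as tar:
        for member in tar:
            if member.name == "./disk/downloaded":
                with tar.extractfile(member) as f:
                    shutil.copyfileobj(f, sys.stdout.buffer)
                    sys.exit(0)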

I think this is what cdi import code does, and is the most efficient way
to copy the disk directly from the registry with the current format.
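
The same idea can be written as a self-contained function that is easy to try without a registry. This is only a sketch of the streaming approach described above: `stream_member` is a hypothetical name, and any file-like object with a read() method (an HTTP response, a pipe, an in-memory buffer) can be passed as `fileobj`.

```python
import io
import shutil
import tarfile


def stream_member(fileobj, name, out):
    """Copy one tar member to `out`, reading `fileobj` strictly forwards.

    Returns True if the member was found, False otherwise.
    """
    # "r|*" opens a non-seekable stream and detects the compression itself.
    with tarfile.open(mode="r|*", fileobj=fileobj) as tar:
        for member in tar:
            if member.name == name:
                with tar.extractfile(member) as f:
                    shutil.copyfileobj(f, out)
                return True
    return False
```

Because the stream is never rewound, everything before the wanted member is still read and decompressed, but nothing is written to a temporary file.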
> Currently nbdkit-gzip-filter hides this from the
> end user, but it's still downloading the whole thing to a temporary
> file.  There's no way round that unless OCI can be persuaded to use a
> better format.

The way is to use the container image downloaded by podman/docker.

> But docker/podman already has a way to export container layers,
> ie. the save and export commands.  These also have the advantage that
> it will cache the downloaded layers between runs.  So why aren't we
> using that?
>
> In this world, nbdkit-container-plugin would simply use docker/podman
> save (or export?) to grab the container as a tar file, and we would
> use the tar filter as above to expose the contents as an NBD endpoint
> for further consumption.  IOW:
>
>   nbdkit container docker.io/kubevirt/fedora-cloud-container-disk-demo \
>          --filter=tar tar-entry=./disk/downloaded

This will work but there are 2 issues:

1. podman save/export copies the tar locally. This is pretty fast for the
example image, but copying the tar and then deleting it seems wasteful.

2. If we have the tar locally, why not use qemu-img directly? We can find the
offset of the disk inside the tar and use:

$ time podman save --format oci-dir -o demo-oci \
    docker.io/kubevirt/fedora-cloud-container-disk-demo

real 0m2.795s
user 0m2.011s
sys 0m0.878s

$ time qemu-img convert -O raw \
    'json:{"file": {"driver": "raw", "offset": 1536, "file": {"driver": "file", "filename": "demo-oci/8162f3eda33d5a87df56e969dcd9777523bd53278a0701b2e53b93c33c01853e"}}}' \
    out.raw

real 0m1.036s
user 0m3.237s
sys 0m1.326s
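
The offset (1536 above) does not have to be worked out by hand. As a sketch, assuming the layer is an uncompressed tar on local disk, Python's tarfile module can report where the member's data starts, and the same json: filename can be generated from it. `qemu_json_spec` is a hypothetical helper name, not part of qemu or nbdkit.

```python
import json
import tarfile


def qemu_json_spec(tar_path, member_name):
    """Build a qemu 'json:' filename pointing at a file inside a local tar."""
    with tarfile.open(tar_path, mode="r:") as tar:
        offset = tar.getmember(member_name).offset_data
    spec = {
        "file": {
            "driver": "raw",
            "offset": offset,
            "file": {"driver": "file", "filename": tar_path},
        }
    }
    return "json:" + json.dumps(spec)
```

The returned string can be passed to qemu-img convert in place of the hand-written json: argument.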

But I think we have a better way: using a self-extracting disk container.
Start a container with a disk image, and run qemu-img inside this container
to convert the disk to the target PV.

It can work like this:

1. We create a base image - this will be used for all disk container images.

$ cat Dockerfile.kubevirt-img
FROM alpine
RUN apk add qemu-img

$ podman build -t kubevirt-img -f Dockerfile.kubevirt-img .
...

You can pull this from quay.io/nirsof/kubevirt-img.

2. Create a disk container image, based on the base image

$ cat Dockerfile.kubevirt-fedora-cloud-disk
FROM quay.io/nirsof/kubevirt-img
COPY disk.qcow2 /disk.qcow2
CMD ["qemu-img", "convert", "-p", "-f", "qcow2", "-O", "raw", "/disk.qcow2", "/target/disk.img"]

$ podman build -t kubevirt-fedora-cloud-disk -f Dockerfile.kubevirt-fedora-cloud-disk .
...

This container is a little larger, but the common layer with qemu-img
and its dependencies is shared between all disk container images. In this
example it adds only 25 MiB.

You can pull this from quay.io/nirsof/kubevirt-fedora-cloud-disk.

With this we can create a copy of the disk using:

$ time podman run --volume ./:/target:Z --rm -it quay.io/nirsof/kubevirt-fedora-cloud-disk
Trying to pull quay.io/nirsof/kubevirt-fedora-cloud-disk...
Getting image source signatures
Copying blob 0d9094d70e9c skipped: already exists
Copying blob a3ed95caeb02 done
Copying blob a3ed95caeb02 done
Copying blob 18717781bd09 done
Copying blob fe5cd0d8bf32 done
Writing manifest to image destination
Storing signatures
    (100.00/100%)

real 0m59.800s
user 0m8.988s
sys 0m7.437s

$ ls -lhs disk.img
728M -rw-r--r--. 1 nsoffer nsoffer 4.0G Jul 12 21:45 disk.img

$ podman images | grep fedora-cloud
quay.io/nirsof/kubevirt-fedora-cloud-disk             latest   097ef06b6d71   About an hour ago   326 MB
docker.io/kubevirt/fedora-cloud-container-disk-demo   latest   6494830c6dc7   50 years ago        303 MB

The next time we run this we get the container from the cache:

$ time podman run --volume ./:/target:Z --rm -it quay.io/nirsof/kubevirt-fedora-cloud-disk
    (100.00/100%)

real 0m2.244s
user 0m0.070s
sys 0m0.253s

Nir

Richard W.M. Jones

2020-Jul-13 09:37 UTC

Re: [Libguestfs] nbdkit / exposing disk images in containers

On Sun, Jul 12, 2020 at 11:16:01PM +0300, Nir Soffer wrote:
> On Sat, Jul 11, 2020 at 11:18 AM Richard W.M. Jones <rjones@redhat.com> wrote:
> >
> > KubeVirt is a custom resource (a kind of plugin) for Kubernetes which
> > adds support for running virtual machines.  As part of this they have
> > the same problems as everyone else of how to import large disk images
> > into the system for pets, templates, etc.
> >
> > As part of the project they've defined a format for embedding a disk
> > image into a container (unclear why?  perhaps so these can be
> > distributed using the existing container registry systems?):
> >
> >  
github.com/kubevirt/containerized-data-importer/blob/master/doc/image-from-registry.md
> >
> > An example of such a disk-in-a-container is here:
> >
> >   hub.docker.com/r/kubevirt/fedora-cloud-container-disk-demo
> >
> > We've been asked if we can help with tools to efficiently import these
> > disk images, and I have suggested a few things with nbdkit and have
> > written a couple of filters (tar, gzip) to support this.
> 
> I don't think gzip filter matches nbdkit very well. Having to decompress the
> entire disk before you can serve it does not sound right.

We do have existing plugins -- I'm thinking of
libguestfs.org/nbdkit-iso-plugin.1.html -- which are merely
convenient wrappers around what you could do with a bit of shell
scripting.  You might question why we have them at all, but a reason
is that it just makes things simpler for the end user.  They don't
have to worry about how to do the download and cleaning up the
temporary file afterwards.

So while you're right that the gzip filter isn't a good fit with
nbdkit for quite annoying technical reasons, I still think it's worth
having it.  We're not forcing people to use it or preventing them from
using alternatives.

> > This email is my thoughts on further development work in this area.
> >
> > ----------------------------------------------------------------------
> >
> > (1) Accessing the disk image directly from the Docker Hub.
> >
> > When you get down to it, what this actually is:
> >
> >   * There is a disk image in qcow2 format.
> >
> >   * It is embedded as "./disk/downloaded" in a gzip-compressed tar
> >     file.  (This is a container with a single layer).
> >
> >   * This tarball is uploaded to (in this case) the Docker Hub and can
> >     be accessed over a URL.  The URL can be constructed using a few
> >     json requests.
> >
> >   * The URL is served by nginx and this supports HTTP range requests.
> >
> > I encapsulated all of this in the attached script.  This is an
> > existence proof that it is possible to access the image with nbdkit.
> >
> > One problem is that the auth token only lasts for a limited time
> > (seems to be 5 minutes in my test), and it doesn't automatically renew
> > as you download the layer, so if the download takes longer than 5
> > minutes you'll suddenly get unrecoverable authorization failures.
> >
> > There seem to be two possible ways to solve this:
> >
> >   (a) Write a new nbdkit-container-plugin which does the authorization
> >       (essentially hiding most of the details in the attached script
> >       from the user).  It could deal with renewing the key as
> >       required.
> >
> >   (b) Modify nbdkit-curl-plugin so the user could provide a script for
> >       renewing authorization.  This would expose the full gory details
> >       to the end user, but on the other hand might be useful in other
> >       situations that require authorization.
> 
> docker/podman already solved this, why should nbdkit solve it again?

Right, exactly my thoughts and the reason for (3) below.

> Do you get timeouts while you download the image with a single request?

Do you mean a single massive curl request?  I didn't try.  You get a
401 authorization failure if you make a request more than ~5 minutes
after the token was issued.  Unlike VMware's and RHV's disk-over-web
services, the token doesn't automatically extend when a request is made.

> > (2) nbdkit-tar-filter exportname and listing files.
> >
> > This has already been covered by an email from Nir Soffer, so I'll
> > simply link to that:
> >
> > lists.gnu.org/archive/html/qemu-discuss/2020-06/msg00058.html
> >
> > It basically requires a fairly simple change to nbdkit-tar-filter to
> > map the tar filenames into export names, and a deeper change to nbdkit
> > core server to allow listing all export names.  The end result would
> > be that an NBD client could query the list of files [ie exports] in
> > the tarball and choose one to download.
> 
> We know the tar member name upfront, so why do we need to list the contents?

AIUI we don't necessarily know the name up front.  It might not always
be ./disk/downloaded.  (I might be wrong on this.)

> > (3) gzip & tar require full downloads - why not “docker/podman save/export”?
> 
> This looks like a better direction.
> 
> The nice thing about embedding the disk in the container image is being able
> to use existing infrastructure (docker, quay) to host the images, and to
> transfer them to the hosts. We don't need to write any code for this.
>
> Even better, we have automatic caching on the host by docker/podman, so we
> have to pull the image from the registry only once on every host. Then we can
> access the local cache.
> 
> > Stepping back to get the bigger picture: Because the OCI standard uses
> > gzip for compression (stackoverflow.com/a/9213826), and
> > because the tar index is interspersed with the tar data, you always
> > need to download the whole container layer before you can access the
> > disk image inside.
> 
> You need to download most of the tar, but you don't need to keep the tar
> in a temporary file. For example, in Python you can open a tarfile over
> the HTTP response object in streaming mode with transparent decompression
> ("r|*"), and stream the disk contents from the tar without a temporary file.
> 
>     with tarfile.open(mode="r|*", fileobj=response) as tar:
>         for member in tar:
>             if member.name == "./disk/downloaded":
>                 with tar.extractfile(member) as f:
>                     shutil.copyfileobj(f, sys.stdout.buffer)
>                     sys.exit(0)
> 
> I think this is what cdi import code does, and is the most efficient way
> to copy the disk directly from the registry with the current format.
> 
> > Currently nbdkit-gzip-filter hides this from the
> > end user, but it's still downloading the whole thing to a temporary
> > file.  There's no way round that unless OCI can be persuaded to use a
> > better format.
> 
> The way is to use the container image downloaded by podman/docker.
> 
> > But docker/podman already has a way to export container layers,
> > ie. the save and export commands.  These also have the advantage that
> > it will cache the downloaded layers between runs.  So why aren't we
> > using that?
> >
> > In this world, nbdkit-container-plugin would simply use docker/podman
> > save (or export?) to grab the container as a tar file, and we would
> > use the tar filter as above to expose the contents as an NBD endpoint
> > for further consumption.  IOW:
> >
> >   nbdkit container docker.io/kubevirt/fedora-cloud-container-disk-demo \
> >          --filter=tar tar-entry=./disk/downloaded
> 
> This will work but there are 2 issues:
> 
> 1. podman save/export copy the tar locally. This is pretty fast for the
> example image but copying the tar and deleting it seems wasteful.
>
> 2. If we have the tar locally, why not use qemu-img directly? we can find
> the offset of the disk inside the tar and use:
> 
> $ time podman save --format oci-dir -o demo-oci \
>     docker.io/kubevirt/fedora-cloud-container-disk-demo
> 
> real 0m2.795s
> user 0m2.011s
> sys 0m0.878s
> 
> $ time qemu-img convert -O raw \
>     'json:{"file": {"driver": "raw", "offset": 1536, "file": {"driver": "file", "filename": "demo-oci/8162f3eda33d5a87df56e969dcd9777523bd53278a0701b2e53b93c33c01853e"}}}' \
>     out.raw
> 
> real 0m1.036s
> user 0m3.237s
> sys 0m1.326s
>
> But I think we have a better way: using a self-extracting disk container.
> Start a container with a disk image, and run qemu-img inside this container
> to convert the disk to the target PV.
> 
> It can work like this:
> 
> 1. We create a base image - this will be used for all disk container images.
> 
> $ cat Dockerfile.kubevirt-img
> FROM alpine
> RUN apk add qemu-img
> 
> $ podman build -t kubevirt-img -f Dockerfile.kubevirt-img .
> ...
> 
> You can pull this from quay.io/nirsof/kubevirt-img.
> 
> 2. Create a disk container image, based on the base image
> 
> $ cat Dockerfile.kubevirt-fedora-cloud-disk
> FROM quay.io/nirsof/kubevirt-img
> COPY disk.qcow2 /disk.qcow2
> CMD ["qemu-img", "convert", "-p", "-f", "qcow2", "-O", "raw", "/disk.qcow2", "/target/disk.img"]
> 
> $ podman build -t kubevirt-fedora-cloud-disk -f Dockerfile.kubevirt-fedora-cloud-disk .
> ...
> 
> This container is a little larger, but the common layer with qemu-img
> and its dependencies is shared between all disk container images. In this
> example it adds only 25 MiB.
> 
> You can pull this from quay.io/nirsof/kubevirt-fedora-cloud-disk.
> 
> With this we can create a copy of the disk using:
> 
> $ time podman run --volume ./:/target:Z --rm -it quay.io/nirsof/kubevirt-fedora-cloud-disk
> Trying to pull quay.io/nirsof/kubevirt-fedora-cloud-disk...
> Getting image source signatures
> Copying blob 0d9094d70e9c skipped: already exists
> Copying blob a3ed95caeb02 done
> Copying blob a3ed95caeb02 done
> Copying blob 18717781bd09 done
> Copying blob fe5cd0d8bf32 done
> Writing manifest to image destination
> Storing signatures
>     (100.00/100%)
> 
> real 0m59.800s
> user 0m8.988s
> sys 0m7.437s
> 
> $ ls -lhs disk.img
> 728M -rw-r--r--. 1 nsoffer nsoffer 4.0G Jul 12 21:45 disk.img
> 
> $ podman images | grep fedora-cloud
> quay.io/nirsof/kubevirt-fedora-cloud-disk             latest   097ef06b6d71   About an hour ago   326 MB
> docker.io/kubevirt/fedora-cloud-container-disk-demo   latest   6494830c6dc7   50 years ago        303 MB
> 
> The next time we run this we get the container from the cache:
> 
> $ time podman run --volume ./:/target:Z --rm -it quay.io/nirsof/kubevirt-fedora-cloud-disk
>     (100.00/100%)
> 
> real 0m2.244s
> user 0m0.070s
> sys 0m0.253s

Interesting, yes.  Although I guess this involves recreating all of
these disk-in-a-container images?  Also I'd be a bit concerned from a
security angle: We're turning dumb data into a self-extracting
executable program.  (I love shar(1) too, but unfortunately it's not
compatible with the world we live in).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat people.redhat.com/~rjones
Read my programming and virtualization blog: rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
libguestfs.org/virt-builder.1.html
