[Who wants to be on CC here? I took several names from the posters
to the last lkml and klibc list threads: Olaf Hering, H. Peter Anvin,
Roman Zippel, Jeff Bailey, Aaron Griffin, Gerd Hoffmann, Milton Miller,
Andi Kleen, Jeff Garzik. Since I'm not subscribed I only see the From
unless I am Cc'd].
On Mon Jul 10 21:48:34 PDT 2006, Olaf Hering wrote:
> On Tue, Jun 27, 2006 at 03:12:53PM +0200, Roman Zippel wrote:
>
> > So anyone who likes to see klibc merged, because it will solve some kind
> > of problem for him, please speak up now. Without this information it's
> > hard to judge whether we're going to solve the right problems.
>
> I do not want to see kinit merged.
> It (the merge into linux-2.6.XY) doesn't solve any real problem in the
> long term.
> Instead, make a kinit distribution. Let it install itself into an obvious
> location on the development box (/usr/lib/kinit/* or whatever), remove all
> code behind prepare_namespace() and put a disclaimer into the Linux 2.6.XY
> release notes stating where to grab and build a kinit binary:
> make && sudo make install
> It can even provide its own CONFIG_INITRAMFS_SOURCE file, so that would
> be the only required change to the used .config.
I proposed that we not merge but do delete over a week ago so
I'm somewhat in agreement. However, I don't think the requirements
for /init are quite as static as you suggest. At least not for
the whole of initramfs.
http://www.ussg.iu.edu/hypermail/linux/kernel/0606.3/1737.html
And I think as long as we have CONFIG_INITRAMFS_SOURCE that
replaces the list we will break; the only way out is to do
feature or content checks.
> The rationale is that there are essentially 2 kinds of consumers:
... of today's prepare_namespace.
> One is the kind that builds static kernels and uses no initrd of any kind.
> For those people, the code and interfaces behind prepare_namespace() have
> not changed in a long time.
> They will install that kinit binary once and it will continue to work with
> kernels from 2.6.6 and later, when "/init" support was merged. Or rather
> from 2.6.1x when CONFIG_INITRAMFS_SOURCE was introduced.
>
> The other group is the one that uses some sort of initrd (loop mount or cpio),
> created with tools from their distribution.
> Again, they should install that kinit binary as well because kinit emulates
> the loop mount handling of /initrd.image. This is for older distributions
> that still create a loop mounted initrd.
So we agree that the feature "kinit" aka old prepare_namespace has
been quite static / stable, and for the legacy uses it probably
doesn't need to change much.
> A distribution that uses a cpio archive (SuSE does that since 2.6.5 in SLES9,
> and since 2.6.13 in 10.0) has no need for the kinit. mkinitrd creates a
> suitable cpio archive and the contained "/init" executable does all the
> hard work (as stated in the comment in init/main.c:init())
So there are tools that build an initrd image that does all that
is necessary today.
> In earlier mails you stated that having kinit/klibc in the kernel sources
> would make it easier to keep up with interface changes.
> What interface changes did you have in mind, and can you name any relevant
> interface changes that were made after 2.6.0 which would break an external
> kinit?
I can't speak for hpa, but I do have some ideas. I'll list a
few of them later in this message. Yes, they would break any
external INITRAMFS_SOURCE.
> 24-klibc-basic-build-infrastructure.patch forces the klibc build, even for
> setups where it is not required. CONFIG_KLIBC, if it ever gets merged, should
> be optional. usr/initramfs.default as example has a hard requirement.
> If CONFIG_KLIBC is off, only /dev, /dev/console and /root should be part of
> the cpio archive in the .init.ramfs ELF section. The executables come from
> the cpio initrd.
This is one of the first things I think could or even should
change. Why are these special? Do you not put them in your
initramfs cpio? Why not?
The only reason I see to have /dev is /dev/console, and the only
reason for that is so that you can printk the warning that you
could not open it. We mkdir and rmdir /proc and /sys, and even
/old, so why is /root special?
I'd like to move unpack_to_rootfs to userspace. Why? Because
I'd like to have userspace be able to fetch cpios and unpack
them, and if I have that in the initramfs, then that means I
don't need a second copy in the kernel. I prototyped this
idea back at 2.4.18 or so. Unpack includes a private copy of
zlib, as does the initrd and floppy loaders code. That
means 12k twice, plus my userspace zlibs (yes, I know it's init).
This also gives me the option of deferring part of the unpack,
and making a copy of the archive to assist in the preparation of a
kexec reboot.
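To give an idea of how little a userspace unpacker needs for the
archive framing itself, here is a rough sketch of parsing one "newc"
cpio member header (the format initramfs uses). This is not the
kernel's unpack_to_rootfs nor my old prototype; the struct and the
function name newc_parse are made up for illustration, and
decompression and actually creating the files are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NEWC_HDR_LEN 110  /* 6-byte magic + 13 fields of 8 hex digits */

/* Hypothetical view of one archive member; pointers refer into the buffer. */
struct newc_entry {
    uint32_t mode;
    uint32_t filesize;
    uint32_t namesize;          /* includes the trailing NUL */
    const char *name;
    const unsigned char *data;
};

/* Decode one 8-digit ASCII hex field; returns 0 on success. */
static int hex8(const unsigned char *p, uint32_t *out)
{
    uint32_t v = 0;
    for (int i = 0; i < 8; i++) {
        unsigned char c = p[i];
        uint32_t d;
        if (c >= '0' && c <= '9') d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return -1;
        v = (v << 4) | d;
    }
    *out = v;
    return 0;
}

static size_t align4(size_t n) { return (n + 3) & ~(size_t)3; }

/*
 * Parse the member starting at buf + off. Returns the offset of the
 * next member, or 0 on a malformed or truncated header.
 * Assumes the archive starts at a 4-byte aligned offset 0.
 */
size_t newc_parse(const unsigned char *buf, size_t len, size_t off,
                  struct newc_entry *e)
{
    if (off > len || len - off < NEWC_HDR_LEN)
        return 0;
    const unsigned char *h = buf + off;
    if (memcmp(h, "070701", 6) != 0)
        return 0;
    /* Field order: ino, mode, uid, gid, nlink, mtime, filesize, ... namesize, check */
    if (hex8(h + 6 + 8 * 1, &e->mode) ||
        hex8(h + 6 + 8 * 6, &e->filesize) ||
        hex8(h + 6 + 8 * 11, &e->namesize))
        return 0;

    size_t name_off = off + NEWC_HDR_LEN;
    if (e->namesize == 0 || e->namesize > len - name_off)
        return 0;
    if (buf[name_off + e->namesize - 1] != '\0')
        return 0;

    /* Header plus name is padded to 4 bytes, and so is the file data. */
    size_t data_off = align4(name_off + e->namesize);
    if (data_off > len || e->filesize > len - data_off)
        return 0;

    e->name = (const char *)buf + name_off;
    e->data = buf + data_off;
    return align4(data_off + e->filesize);
}
```

A caller would loop on the returned offset until it sees the member
named "TRAILER!!!", creating files, directories and nodes from mode,
name and data as it goes.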
You say there is a chicken and egg problem. I agree, but it
can be easily solved. I think with today's libfs it will be
even less code than I wrote, just a couple of file operations.
As far as I can tell, we need a writable file system or an
immutable one with a directory that exposes (1) the initramfs
section, (2) the initrd, if one was loaded, and (3) an mmap'able file
containing the unpack command. Optionally, (4) a library or
two for it to be a shared executable. This program can create
/dev, /dev/tty, /dev/console, /root, /sys, /proc, and whatever
else it wants (what, unpack to a separate ramfs or tmpfs so you
can just do unmount or pivot_root instead of nuke-fs? Too simple).
The hard parts I see are (1) protecting the lifetime of the
initramfs section, since it is part of __init, (2) putting back deferred
free_initrd behaviour, and (3) getting the binary to be mmap()able for
binfmt_* if needed. Possible rules for (1) are that the files are
revoked (as in they disappear and we sigkill any task with an
open file) when we are ready to free __init. Or we just make
them return -EIO. Userspace can copy the contents if it wants to preserve
the file. For (2) I think none of the code called is marked init;
it didn't use to be, so inode ops that free on unlink / release sounds
good. For (3) we can either splice the pages or just say we have
one kernel copy loop. The latter may take less space in the load
image binary at the cost of a memcpy of the app (and shared library)
and the space overhead therein.
I would see the unpack flow as: rootfs is mounted with simple_fill_super,
which takes a list of names, file operations, and modes. The kernel
forks and execs unpack on initramfs.img (the initramfs section).
When it is done it execs unpack on initrd.img. The
kernel waits for this thread to complete before continuing with
initcalls, similar to linuxrc. After initcalls, the kernel execs
/init as it does today. Depending on needs, the initramfs section
unpack may have more features (different binary format, etc) than
the built-in kernel one. Hmm... some issues for directories.
Now all of this says we need a way of incorporating a target built
userspace binary program into the kernel build. It says we need
to be open to having new programs added. It probably points out
we have a "who owns initfs namespace" issue, and probably issues
that the shared library would like to be in /lib but simple_fill_super
doesn't allow for any directories, let alone populated ones.
> I once looked briefly for a patch that would introduce something like
> CONFIG_OBSOLETE_PREPARE_NAMESPACE which will make all code behind
> prepare_namespace() optional. For SuSE, this dead code just wastes build time,
> and it increases the vmlinux file size.
I have just such a patch in my minimal kernel. It moved the globals and
removed do-mounts from the list. Of course, my cpio always had /init,
even if it was the main app.
> While looking for ROOT_DEV users,
> it wasn't obvious to me what requirements mtd has, so I postponed that attempt.
> I'm sure ROOT_DEV can go as well in your patch named
> 29-remove-in-kernel-root-mounting-code.patch
> All can be done outside the kernel code, and there is the root=
> cmdline option.
First of all, /proc/sys/real-root-device is needed as a sysctl
for initrd backwards compatibility. I know, you said ROOT_DEV.
Lots of architectures have code like several PowerPC boards: if
initrd was loaded by boot loader then choose ram0 else if nfs
support was compiled then choose it else choose hda2.
Sparc seems to have even more options. Looking at the patch again,
it looks like they had the equivalent of rdev and ramdisk_size
encoded as on x86 (the patch doesn't show how they are obtained),
plus searching openfirmware for "was I loaded via network, if so add
corresponding ip= config". Which can certainly be done in userspace.
However, x86 rdev support (setting a word in the old boot sector)
and ramdisk size/flag support was removed "because they were
obsolete".
(That I disagree with, either remove from sparc and x86, or neither).
We can say "there is a new way to specify these defaults: create
this file or script".
While I believe all this root and ramdisk default selection code could
be moved to user space, the list will grow long and complex and will
entail figuring out what the platform is again. It would grow as one big
unmaintainable hook that every platform will want to update, vs the
scalable if duplicated "the board support file can set the default
boot device" that we have today. Or, we just say "add this option /
config file to your initramfs build" or write a small program to
mimic the conditions.
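Such a small program, for the PowerPC-style fallback chain mentioned
above (initrd loaded, else NFS, else hda2), might be no more than the
following. The struct boot_facts and the function default_root are
hypothetical names for this sketch, and letting an explicit root=
override everything is my assumption; how the facts are gathered on a
given platform is exactly the part that does not scale.

```c
#include <stddef.h>

/* Hypothetical summary of what early userspace learned about the boot. */
struct boot_facts {
    int initrd_loaded;         /* boot loader handed us an initrd */
    int nfs_root_built_in;     /* kernel was built with NFS root support */
    const char *cmdline_root;  /* value of root= if given, else NULL */
};

/*
 * Pick the default root device: an explicit root= wins (assumption),
 * then the ram0 / nfs / hda2 order described for several PowerPC boards.
 */
const char *default_root(const struct boot_facts *f)
{
    if (f->cmdline_root != NULL)
        return f->cmdline_root;
    if (f->initrd_loaded)
        return "/dev/ram0";
    if (f->nfs_root_built_in)
        return "/dev/nfs";
    return "/dev/hda2";
}
```

Every board with different rules would need its own variant of this
function or a config file that feeds it, which is the maintenance
trade-off against the per-board defaults we have today.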
> As others have stated in this thread, the code behind prepare_namespace() is
> very simple. It doesn't know anything about lvm etc, nor about mount by
> filesystem UUID/LABEL nor does it know how to deal properly with new
> technologies like iSCSI, evms, persistent storage device names, usb-storage,
> sbp2 or async device probing.
> Should all that knowledge end up in the kernel source one day?
I think it's agreed the answer to "in kernel" is no. "In kernel sources"
is a bit less clear.
> An external kinit for such features sounds more logical.
> If there are no plans to add these features, that just means that kinit
> is real static functionality-wise. So having it as external binary makes
> even more sense, given the amount of build time spent for every new
> kernel.
I don't think it makes sense to say the kinit feature list will
not change over time. Just look at how the platform support chooses the root.
Well, maybe this shows that initramfs needs to be customizable.
> Not really related to kinit per se:
> There is also not much distribution specific inside initramfs (beside the file
> locations to actually create a suitable cpio archive). There is still the
> boundary between "/init" and the real init on the final root filesystem.
> Otherwise static kernels would not be usable on these distributions.
> I'm sure a mkinitrd that works for every distribution out there is possible.
>
Ok now I'm a bit confused. First you say /kinit shouldn't be
merged, then you say the build for the whole contents can be
standardized.
And I think you are just considering "traditional" installations.
My hand off to the real init? That would be immediately after
the cpio is unpacked and the built in kernel device drivers are
configured. And yes, it is a normal distro sysv type init.
It's a symlink to /bin/init.
While I'm sure my "embedded" usage is atypical, I'm sure your
build script will not handle it. And I think there are other
users who could benefit from it.
I've been working for years helping to administer an environment
where having a repeatable starting point is more important than
being able to store persistent state. For ninety percent of my
users, a quick reboot into a system that isn't corrupt more than makes up
for limited personalization. They get to do whatever they want,
but can always quickly reboot back to the known good environment.
No "oops the fs is corrupt, reinstall".
Our solution has been to install a file system then make a
cpio.gz. At first we patched the incbin in usr to add our
minimal filesystem cpio. A minimal customization script
was read from a fixed location in memory. Within a week of
initramfs looking at initrd, we changed our platform code
and boot loader to use the initrd mechanism. Within a few
weeks the loader had support for loading multiple cpios.
We were then able to separate frequently used but large
packages from the base usability install. We now start with
a normal distribution install into a chroot. We strip most
documentation and make a cpio.gz. For most of our users,
"installing" a system is editing a copy of a shell script to
setup the network, then kick off our loader giving it the list
of cpios (and optionally a custom kernel). The loader
loads the shell script just like the rest of the files
but it magically becomes a cpio thanks to the truncated
cpio that gets placed before it and padding file after it.
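For the curious, the "truncated cpio" trick relies on the newc format
being just a header, the name, and then the raw file bytes. A rough
sketch of generating such a header-only fragment follows. This is not
our loader's code; the function name newc_header and the fixed
ino/uid/gid/mtime values are assumptions for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NEWC_HDR_LEN 110  /* 6-byte magic + 13 fields of 8 hex digits */

/*
 * Write a newc header plus NUL-terminated, 4-byte padded name into out.
 * The file data is NOT written: whatever the loader places directly
 * after these bytes becomes the member's contents.
 * Returns the number of bytes written, or 0 if out is too small.
 */
size_t newc_header(char *out, size_t outlen, const char *name,
                   unsigned int mode, unsigned int filesize)
{
    size_t namesize = strlen(name) + 1;               /* count the NUL */
    size_t total = (NEWC_HDR_LEN + namesize + 3) & ~(size_t)3;

    if (total > outlen)
        return 0;
    memset(out, 0, total);

    /* Field order: ino, mode, uid, gid, nlink, mtime, filesize,
     * devmajor, devminor, rdevmajor, rdevminor, namesize, check. */
    snprintf(out, outlen,
             "070701%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X%08X",
             1u, mode, 0u, 0u, 1u, 0u, filesize,
             0u, 0u, 0u, 0u, (unsigned int)namesize, 0u);

    /* The name overwrites the NUL that snprintf left after the header. */
    memcpy(out + NEWC_HDR_LEN, name, namesize);
    return total;
}
```

The data that follows has to be padded to a 4-byte boundary before the
next header, which is one way to read the job of the padding file
placed after the script.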
Those users who need persistent storage make a filesystem
and mount it as needed. Sometimes they do this as purely
a user data disk, sometimes a whole chroot. Others use the
network filesystems to store their data or results. Users
have complete control of their machine, we just provide a
standard starting point. We don't worry about local user
exploits, as everyone has root access to their machine. So
running out of ramfs is fine for us (ok, we have one patch:
we say there is always free space so the distribution package
installers do their thing).
Now our situation of starting from a "known checkpoint" for
every boot may sound specialized. But I would imagine that
there are cluster users where this might sound appealing for
compute nodes. And I'm sure there are embedded people who
would say "I would love to boot from external image that I
knew was good". I'm thinking here of coprocessor boards.
Stuffing a couple of files from the host should be easy
compared to adding and managing all the storage locally and
providing the recovery path. They may even say "I have
limited resources, so here is a list of files. I always
need X, you can choose up to 6 from A-K depending on what
you plan to do."
For another example, I note that the Linksys NSLU2 (embedded
linux appliance sold for putting usb disk(s) on a network)
comes from the factory as a kernel and initrd loaded from
flash. The advanced users say "no, make that a jffs2 so
it's easy to update" then say "be careful there, if you fill
up your root you can not boot." Well, having a fs to fetch
pages from does free up significant amounts of ram. For me,
knowing I can boot again is more important than easy updates.
I'll stop by saying that I need to take some time and customize
mine before I will get the use I want from it.
Back to the topic:
I am convinced we should remove most of the in-kernel code that
is properly done in userspace. The file system mounting
appears to be that. I'm not convinced on the disk partitions,
that seems to be kernel, but you can configure all of them out.
I am arguing for removing the unpack the archive code.
Do I think there is a place for kinit and klibc? Absolutely. I
do appreciate the work to get it to this point. There are some
good inventions in making the code small but common. I can
foresee saying "you can't remove that without the feature being
merged into the kinit distribution" will keep it updated.
Should this code be in the kernel tree? I'm not sure. It's
handy to say "it just works". Especially as the distros
go to fancier initramfs. If we say "you must claim to provide
these features" at build, then if it doesn't work it's the
fault of the tool that claimed that feature.
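A feature check of that kind could be as simple as comparing two
lists of names at build time. A minimal sketch, with the function name
and the feature strings invented purely for illustration (nothing like
this exists in kbuild today):

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the first required feature the initramfs tool did not claim,
 * or NULL if every required feature is claimed.
 */
const char *first_missing_feature(const char *const *required, size_t nreq,
                                  const char *const *claimed, size_t nclaimed)
{
    for (size_t i = 0; i < nreq; i++) {
        int found = 0;
        for (size_t j = 0; j < nclaimed; j++) {
            if (strcmp(required[i], claimed[j]) == 0) {
                found = 1;
                break;
            }
        }
        if (!found)
            return required[i];
    }
    return NULL;
}
```

The build would refuse (or warn) when this returns non-NULL, for
example if the kernel requires a hypothetical "mount-root" feature and
the vendor's content list only claims "unpack-initrd".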
I'm not convinced the build needs to be integrated. The
distros seem to want it separate. From a kbuild sequence,
rebuilding klibc means I might get a new hash, so I have to run
make on anything I want dynamically linked. I should be
able to use klcc with configure if that is how the package is
normally built, but that is not kbuild. It ties dissimilar
build systems together. And it still doesn't solve the "you
now need this too" problem when using a vendor supplied
content list. We just need to have an easy "you can get
it from here until your vendor catches up" point that is
easy to compile.
milton