Dariusz Michaluk
2014-Feb-26 16:24 UTC
[libvirt-users] [libvirt] LXC, user namespaces and systemd
Hi!

My colleagues at Samsung and I are trying to run systemd in a Linux container. I saw that others are experimenting in this area, so I would like to present the results of my work and tests; perhaps they will be helpful to others. As a starting point I used the guide written by Daniel:

https://www.berrange.com/posts/2013/08/12/running-a-full-fedora-os-inside-a-libvirt-lxc-guest/

After many attempts, I managed to run systemd. Let's move to specifics.

1. Host configuration, Fedora 20

- kernel 3.14 with NAMESPACES, UTS_NS, IPC_NS, USER_NS, PID_NS, NET_NS enabled in the kernel config
  I used kernel-3.14.0-0.rc2.git0.1.fc21.i686.rpm downloaded from
  https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide

- libvirtd (libvirt) 1.2.2
  I built libvirt from the git sources; it is important that the source contains commit
  6fb42d7cdc57da453691d043d6b9bf23e2bae15e
  (patch from Richard Weinberger, "Ensure systemd cgroup ownership is delegated to container with userns")

2. Container configuration

- set up the Fedora environment

# yum -y --releasever=20 --nogpg --installroot=/var/lib/libvirt/filesystems/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal openssh-server procps-ng
# echo "pts/0" >> /var/lib/libvirt/filesystems/mycontainer/etc/securetty
# chroot /var/lib/libvirt/filesystems/mycontainer /bin/passwd root

- In the final solution I want to map root inside the container to a normal user on the host.
So let's create a user (on the host):

# useradd foo -u 666
# id foo
uid=666(foo) gid=1001(foo) groups=1001(foo)
# chown -R foo:foo /var/lib/libvirt/filesystems/mycontainer

- enable the user namespace (user mapping setup); here is my full libvirt config file

# cat /etc/libvirt/lxc/container.xml
<domain type='lxc'>
  <name>mycontainer</name>
  <uuid>d750af59-6082-437c-b860-922e76b46410</uuid>
  <memory unit='KiB'>819200</memory>
  <currentMemory unit='KiB'>819200</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <os>
    <type arch='i686'>exe</type>
    <init>/sbin/init</init>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <idmap>
    <uid start='0' target='666' count='1000'/>
    <gid start='0' target='1001' count='1000'/>
  </idmap>
  <devices>
    <emulator>/usr/libexec/libvirt_lxc</emulator>
    <filesystem type='mount' accessmode='passthrough'>
      <source dir='/var/lib/libvirt/filesystems/mycontainer'/>
      <target dir='/'/>
    </filesystem>
    <interface type='network'>
      <mac address='00:16:3e:34:a2:dd'/>
      <source network='default'/>
    </interface>
    <console type='pty'>
      <target type='lxc' port='0'/>
    </console>
  </devices>
</domain>

3. Start container

# virsh --connect lxc:/// define /etc/libvirt/lxc/container.xml
# virsh --connect lxc:/// start mycontainer --console

If all login attempts are rejected, boot the host machine with audit=0:

# vi /etc/default/grub
GRUB_CMDLINE_LINUX=" [...] audit=0 [...]"
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot

4. Problems and solutions

a) "Cannot add dependency job for unit display-manager.service, ignoring: Unit display-manager.service failed to load: No such file or directory."
Delete or comment out the line "Wants=display-manager.service":

# cat /usr/lib/systemd/system/default.target
[Unit]
Description=Graphical Interface
Documentation=man:systemd.special(7)
Requires=multi-user.target
After=multi-user.target
Conflicts=rescue.target
#Wants=display-manager.service
AllowIsolate=yes

[Install]
Alias=default.target

b)
[FAILED] Failed to mount Huge Pages File System.
See 'systemctl status dev-hugepages.mount' for details.
[FAILED] Failed to mount Configuration File System.
See 'systemctl status sys-kernel-config.mount' for details.
[FAILED] Failed to mount Debug File System.
See 'systemctl status sys-kernel-debug.mount' for details.
[FAILED] Failed to mount FUSE Control File System.
See 'systemctl status sys-fs-fuse-connections.mount' for details.

Based on the explanation Daniel gave:

"When a syscall requires CAP_SYS_ADMIN, for example, the kernel will either use capable(CAP_SYS_ADMIN) which only succeeds in the host, or ns_capable(CAP_SYS_ADMIN) which is allowed to succeed in the container. Different filesystems have differing restrictions, but at this time the vast majority of filesystems require that capable(CAP_SYS_ADMIN) succeed and thus you can only mount them in the host."

and on the discussion about "allow some kernel filesystems to be mounted in a user namespace" at

http://comments.gmane.org/gmane.linux.kernel/1525998

I decided to disable mounting these filesystems:

# systemctl mask dev-hugepages.mount
ln -s '/dev/null' '/etc/systemd/system/dev-hugepages.mount'
# systemctl mask sys-kernel-config.mount
ln -s '/dev/null' '/etc/systemd/system/sys-kernel-config.mount'
# systemctl mask sys-kernel-debug.mount
ln -s '/dev/null' '/etc/systemd/system/sys-kernel-debug.mount'
# systemctl mask sys-fs-fuse-connections.mount
ln -s '/dev/null' '/etc/systemd/system/sys-fs-fuse-connections.mount'

c)
[FAILED] Failed to start D-Bus System Message Bus.
See 'systemctl status dbus.service' for details.
Feb 26 09:26:12 localhost.localdomain systemd[1]: Starting D-Bus System Message Bus...
Feb 26 09:26:12 localhost.localdomain systemd[20]: Failed at step OOM_ADJUST spawning /bin/dbus-daemon: Permission denied

# echo -900 > /proc/20/oom_score_adj
/proc/20/oom_score_adj: Permission denied
# ls -l /proc/20/oom_score_adj
-rw-r--r--. 1 65534 65534 0 Feb 26 10:28 /proc/20/oom_score_adj

According to the kernel documentation, the local root user inside a user namespace (in the guest) cannot set the OOM score adjustment to an arbitrary value. Besides CAP_SYS_RESOURCE, this also requires full root privileges. Note that the file above is listed as owned by 65534, the overflow ID shown for owners that are not mapped into the container.

To disable the OOM adjustment, delete or comment out the line "OOMScoreAdjust=-900":

# cat /usr/lib/systemd/system/dbus.service
[Unit]
Description=D-Bus System Message Bus
Requires=dbus.socket
After=syslog.target

[Service]
ExecStart=/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
ExecReload=/bin/dbus-send --print-reply --system --type=method_call --dest=org.freedesktop.DBus / org.freedesktop.DBus.ReloadConfig
#OOMScoreAdjust=-900

5. Final systemd start

# virsh --connect lxc:/// start mycontainer --console
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'lxc-libvirt'.

Welcome to Fedora 20 (Heisenbug)!

Failed to install release agent, ignoring: No such file or directory
[  OK  ] Reached target Remote File Systems.
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Created slice System Slice.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Reached target Slices.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Reached target Paths.
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Listening on Journal Socket.
         Mounting POSIX Message Queue File System...
         Starting Journal Service...
[  OK  ] Started Journal Service.
         Starting Create static device nodes in /dev...
[  OK  ] Reached target Swap.
         Mounting Temporary Directory...
         Starting Load/Save Random Seed...
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Started Create static device nodes in /dev.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Mounted Temporary Directory.
[  OK  ] Reached target Local File Systems.
         Starting Trigger Flushing of Journal to Persistent Storage...
         Starting Recreate Volatile Files and Directories...
[  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
[  OK  ] Started Recreate Volatile Files and Directories.
         Starting Update UTMP about System Reboot/Shutdown...
[  OK  ] Started Update UTMP about System Reboot/Shutdown.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Timers.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting OpenSSH server daemon...
         Starting Permit User Sessions...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting Login Service...
[  OK  ] Started OpenSSH server daemon.
[  OK  ] Started Permit User Sessions.
         Starting Console Getty...
[  OK  ] Started Console Getty.
[  OK  ] Reached target Login Prompts.
         Starting Cleanup of Temporary Directories...
[  OK  ] Started Cleanup of Temporary Directories.
[  OK  ] Started Login Service.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.14.0-0.rc2.git0.1.fc21.i686 on an i686 (console)

localhost login: root
Password:
Last login: Wed Feb 26 09:26:21 on pts/0
-bash-4.2#

- verifying which namespaces are in use

inside the container
# ls -l /proc/self/ns/
ipc -> ipc:[4026532341]
mnt -> mnt:[4026532338]
net -> net:[4026532344]
pid -> pid:[4026532342]
user -> user:[4026532337]
uts -> uts:[4026532339]

outside the container
$ ls -l /proc/self/ns/
ipc -> ipc:[4026531839]
mnt -> mnt:[4026531840]
net -> net:[4026531956]
pid -> pid:[4026531836]
user -> user:[4026531837]
uts -> uts:[4026531838]

I know that no one likes to read long emails, but most of this one is config and logs. I will be grateful for comments and suggestions.

Regards.
--
Dariusz Michaluk
Samsung R&D Institute Poland
Samsung Electronics
d.michaluk@samsung.com
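The namespace check at the end of the message above can be automated. What follows is only a minimal sketch, assuming the two listings are pasted in as text in the trimmed "name -> type:[inode]" form shown in the email; the function names parse_ns_listing and shared_namespaces are made up for this illustration. Two processes are in the same namespace exactly when the inode numbers match, so an empty result means the container shares none of the six namespaces with the host.

```python
import re

# Matches the "ipc -> ipc:[4026532341]" part of an `ls -l /proc/self/ns/` line.
NS_LINK = re.compile(r"(\w+) -> \w+:\[(\d+)\]")


def parse_ns_listing(text):
    """Return {namespace name: inode number} for every link found in text."""
    found = {}
    for line in text.splitlines():
        match = NS_LINK.search(line)
        if match:
            found[match.group(1)] = int(match.group(2))
    return found


def shared_namespaces(inside, outside):
    """Names of namespaces whose inode is identical in both listings."""
    return sorted(name for name in inside if inside[name] == outside.get(name))


# Listings copied from the email above.
INSIDE = """
ipc -> ipc:[4026532341]
mnt -> mnt:[4026532338]
net -> net:[4026532344]
pid -> pid:[4026532342]
user -> user:[4026532337]
uts -> uts:[4026532339]
"""

OUTSIDE = """
ipc -> ipc:[4026531839]
mnt -> mnt:[4026531840]
net -> net:[4026531956]
pid -> pid:[4026531836]
user -> user:[4026531837]
uts -> uts:[4026531838]
"""

inside = parse_ns_listing(INSIDE)
outside = parse_ns_listing(OUTSIDE)
print("shared with host:", shared_namespaces(inside, outside))
```

If "user" showed up in the shared list, the idmap from the domain XML would not be in effect for that process.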
Stephan Sachse
2014-Feb-26 16:59 UTC
Re: [libvirt-users] [libvirt] LXC, user namespaces and systemd
> # chown -R foo:foo /var/lib/libvirt/filesystems/mycontainer

you must "shift" the uids for the container 0 -> 666, 1 -> 667, 2 -> 668. there is a tool for this: uidmapshift

some tools may not work, because of the missing file capabilities. chown removes all file capabilities! try ping as user inside the container. (missing file cap cap_net_admin,cap_net_raw)

/stephan

--
Software is like sex, it's better when it's free!
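The "shift" Stephan describes is plain arithmetic on the values of the idmap element from the domain XML (start, target, count). Here is a minimal sketch of that arithmetic only; shift_id is a hypothetical helper written for this note, not code taken from uidmapshift, whose implementation was not checked.

```python
def shift_id(container_id, target, count, start=0):
    """Host ID that a container ID maps to under <uid start target count>.

    Raises ValueError when the container ID lies outside the mapped range,
    because such an ID has no counterpart on the host.
    """
    offset = container_id - start
    if not 0 <= offset < count:
        raise ValueError(f"ID {container_id} is outside the mapped range")
    return target + offset


# Values from <uid start='0' target='666' count='1000'/> in the first email.
for container_uid in (0, 1, 2, 81):
    print(container_uid, "->", shift_id(container_uid, target=666, count=1000))
```

A tool that prepares a root filesystem would apply this per file to the owner and, with the gid mapping, to the group, instead of setting one owner for the whole tree.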
Kashyap Chamarthy
2014-Feb-27 10:43 UTC
Re: [libvirt-users] [libvirt] LXC, user namespaces and systemd
On Wed, Feb 26, 2014 at 05:24:03PM +0100, Dariusz Michaluk wrote:

[. . .]

> If all login attempts are rejected, please boot host machine with audit=0
>
> # vi /etc/default/grub
> GRUB_CMDLINE_LINUX=" [...] audit=0 [...]"

IIUC, this is no longer needed with systemd 209 and above. I just did a quick test[1] with

  systemd-210-2.fc21.x86_64
  3.14.0-0.rc4.git0.1.fc21.x86_64

and audit subsystem enabled:

  $ auditctl -s
  AUDIT_STATUS: enabled=1 flag=1 pid=816 rate_limit=0 backlog_limit=320 lost=0 backlog=0

I can at least boot into my old systemd-nspawn container just fine. Yet to test with libvirt-lxc.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=966807#c14

--
/kashyap
Dariusz Michaluk
2014-Feb-27 14:07 UTC
Re: [libvirt-users] [libvirt] LXC, user namespaces and systemd
On 26.02.2014 17:59, Stephan Sachse wrote:
>> # chown -R foo:foo /var/lib/libvirt/filesystems/mycontainer
>
> you must "shift" the uids for the container 0 -> 666, 1 -> 667, 2 ->
> 668. there is a tool for this: uidmapshift

I prepared two containers: for the first I used chown, for the second uidmapshift. Here are the results.

./uidmapshift -r /var/lib/libvirt/filesystems/mycontainer
UIDs 666 - 666
GIDs 1001 - 2000

foo 28919 28917 0 14:42 ? 00:00:00 /sbin/init
747 28950 28919 0 14:42 ? 00:00:00 /bin/dbus-daemon

./uidmapshift -r /var/lib/libvirt/filesystems/test
UIDs 888 - 1776
GIDs 1002 - 2001

foo1 29298 29296 0 14:45 ? 00:00:00 /sbin/init
969 29329 29298 0 14:45 ? 00:00:00 /bin/dbus-daemon

As you can see, root is mapped to the foo or foo1 user, and the dbus user is mapped to 747 (uid=81(dbus) + uid=666(foo)) or 969 (uid=81(dbus) + uid=888(foo1)). The mapping looks correct. Why use uidmapshift? It still performs chown. Could you explain more?

> some tools may not work, because of the missing file capabilities.
> chown removes all file capabilities! try ping as user inside the
> container. (missing file cap cap_net_admin,cap_net_raw)

# getcap /usr/bin/ping
# ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.077 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.066 ms
^C
--- localhost ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.066/0.071/0.077/0.010 ms

Yes, you are right: chown removed the capabilities, but ping still works properly.

--
Dariusz Michaluk
Samsung R&D Institute Poland
Samsung Electronics
d.michaluk@samsung.com
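One way to look at the "why use uidmapshift?" question is to map on-disk owners back into the container. The sketch below assumes the uid mapping from the first email (start 0, target 666, count 1000) and two made-up files, one owned by root (0) and one by dbus (81) in the unmodified image; the file names and the helper to_container_id are hypothetical. It models only file ownership as seen from inside, not the process UIDs shown in the ps output above.

```python
OVERFLOW_ID = 65534  # default overflow UID reported for unmapped owners


def to_container_id(host_id, target, count, start=0):
    """Container-side view of a host ID under <uid start target count>."""
    offset = host_id - target
    if 0 <= offset < count:
        return start + offset
    return OVERFLOW_ID


# Hypothetical owners inside the freshly installed image, before any chown.
image_owners = {"file-owned-by-root": 0, "file-owned-by-dbus": 81}

# "chown -R foo:foo" gives every file the single host UID 666.
after_chown = {name: 666 for name in image_owners}

# A per-file shift keeps the differences: new owner = old owner + 666.
after_shift = {name: uid + 666 for name, uid in image_owners.items()}

seen_after_chown = {n: to_container_id(u, 666, 1000) for n, u in after_chown.items()}
seen_after_shift = {n: to_container_id(u, 666, 1000) for n, u in after_shift.items()}

print("inside, after chown -R:", seen_after_chown)
print("inside, after shifting:", seen_after_shift)
```

Under these assumptions the recursive chown makes every file appear root-owned inside the container, which is consistent with the "UIDs 666 - 666" range reported for the first tree, while the shifted tree keeps distinct owners.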
Dariusz Michaluk
2014-Feb-27 14:15 UTC
Re: [libvirt-users] [libvirt] LXC, user namespaces and systemd
On 27.02.2014 11:43, Kashyap Chamarthy wrote:
> IIUC, this is no longer needed with systemd 209 and above. I just did a
> quick test[1] with
>
> systemd-210-2.fc21.x86_64
> 3.14.0-0.rc4.git0.1.fc21.x86_64
>
> and audit subsystem enabled:
>
> $ auditctl -s
> AUDIT_STATUS: enabled=1 flag=1 pid=816 rate_limit=0 backlog_limit=320 lost=0 backlog=0
>
> I can at-least boot into my old systemd-nspawn container just fine. Yet
> to test with libvirt-lxc.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=966807#c14

I was using systemd v208; systemd v209 was released a week ago, but thank you for the information ;)

--
Dariusz Michaluk
Samsung R&D Institute Poland
Samsung Electronics
d.michaluk@samsung.com
Dariusz Michaluk
2014-Mar-03 14:52 UTC
Re: [libvirt-users] [libvirt] LXC, user namespaces and systemd
Hi.

Another week, another experiment ;) I was trying to run a systemd user session for a non-root user, for example darek (uid=1000); the operation failed with this error:

systemd[26]: pam_unix(systemd-user:session): session opened for user darek by (uid=0)
systemd[1]: Started Login Service.
systemd[26]: Failed to create root cgroup hierarchy: Permission denied
systemd[26]: Failed to allocate manager object: Permission denied
systemd[29]: pam_unix(systemd-user:session): session closed for user darek

The cgroup hierarchy for the machine looks as follows:

├─machine.slice
│ └─machine-lxc\x2dmycontainer.scope
│   ├─17303 /usr/libexec/libvirt_lxc --name mycontainer --console 22 --security=selinux --handshake 25 --background
│   └─machine.slice
│     └─machine-lxc\x2dmycontainer.scope
│       ├─17306 /usr/lib/systemd/systemd
│       ├─machine.slice
│       │ └─machine-lxc\x2dmycontainer.scope
│       │   └─user.slice
│       │     └─user-0.slice
│       │       └─user@0.service
│       │         └─17400 /usr/lib/systemd/systemd --user
│       ├─system.slice
│       │ ├─systemd-logind.service
│       │ │ └─17373 /usr/lib/systemd/systemd-logind
│       │ ├─dbus.service
│       │ │ └─17372 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
│       │ ├─sshd.service
│       │ │ └─17379 /usr/sbin/sshd -D
│       │ └─systemd-journald.service
│       │   └─17348 /usr/lib/systemd/systemd-journald
│       └─user.slice
│         └─user-0.slice
│           ├─session-c1.scope
│           │ ├─17377 login -- root
│           │ └─17413 -bash
│           └─user@0.service
│             └─17412 (sd-pam)

Then I repeated the test using systemd-nspawn, and the operation was successful.
systemd[25]: pam_unix(systemd-user:session): session opened for user darek by (uid=0)

In this case the cgroup hierarchy is somewhat different, as shown below:

├─machine.slice
│ └─machine-mycontainer.scope
│   ├─17054 /usr/lib/systemd/systemd
│   ├─system.slice
│   │ ├─systemd-logind.service
│   │ │ └─17099 /usr/lib/systemd/systemd-logind
│   │ ├─dbus.service
│   │ │ └─17098 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
│   │ ├─sshd.service
│   │ │ └─17103 /usr/sbin/sshd -D
│   │ └─systemd-journald.service
│   │   └─17069 /usr/lib/systemd/systemd-journald
│   └─user.slice
│     ├─user-0.slice
│     │ ├─session-55.scope
│     │ │ ├─17110 login -- root
│     │ │ └─17160 -bash
│     │ └─user@0.service
│     │   ├─17147 /usr/lib/systemd/systemd --user
│     │   └─17155 (sd-pam)
│     └─user-1000.slice
│       └─user@1000.service
│         ├─17109 /usr/lib/systemd/systemd --user
│         └─17116 (sd-pam)

It looks like libvirt creates a wrong cgroup hierarchy (compared with http://libvirt.org/cgroups.html). What do you think?

Regards.
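To see where a single process sits without reading the whole tree, its /proc/PID/cgroup file can be parsed; each line there has the form "hierarchy-ID:controller-list:path". This is a minimal sketch: parse_proc_cgroup and read_cgroups are hypothetical helper names, and the SAMPLE line is invented for illustration, not captured from either setup above.

```python
def parse_proc_cgroup(text):
    """Return {controller-list: cgroup path} from /proc/PID/cgroup content."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only; the path itself may contain ':'.
        _hierarchy_id, controllers, path = line.split(":", 2)
        entries[controllers] = path
    return entries


def read_cgroups(pid="self"):
    """Parse the cgroup membership of a live process (Linux only)."""
    with open(f"/proc/{pid}/cgroup") as handle:
        return parse_proc_cgroup(handle.read())


# Invented sample line; raw string so that \x2d stays literal.
SAMPLE = r"1:name=systemd:/machine.slice/machine-lxc\x2dmycontainer.scope"

print(parse_proc_cgroup(SAMPLE)["name=systemd"])
```

Calling read_cgroups(17306) on the host, with the PID of the container's systemd taken from the tree above, would show the path that systemd was actually placed in.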
Daniel P. Berrange
2014-Mar-03 15:26 UTC
Re: [libvirt-users] [libvirt] LXC, user namespaces and systemd
On Mon, Mar 03, 2014 at 03:52:01PM +0100, Dariusz Michaluk wrote:
> Hi.
>
> Another week, another experiment ;) I was trying to run systemd user
> session for non-root user, for example darek (uid=1000), operation
> failed with error:
>
> systemd[26]: pam_unix(systemd-user:session): session opened for user
> darek by (uid=0)
> systemd[1]: Started Login Service.
> systemd[26]: Failed to create root cgroup hierarchy: Permission denied
> systemd[26]: Failed to allocate manager object: Permission denied
> systemd[29]: pam_unix(systemd-user:session): session closed for user darek
>
> The Cgroup hierarchy for the machine looks as follows:
>
> ├─machine.slice
> │ └─machine-lxc\x2dmycontainer.scope
> │   ├─17303 /usr/libexec/libvirt_lxc --name mycontainer --console 22
> --security=selinux --handshake 25 --background
> │   └─machine.slice
> │     └─machine-lxc\x2dmycontainer.scope
> │       ├─17306 /usr/lib/systemd/systemd
> │       ├─machine.slice
> │       │ └─machine-lxc\x2dmycontainer.scope

That looks really bizarre. The same two directory names nested over and over again. I can't reproduce this kind of thing on my own host. Libvirt only ever creates the first two levels, as expected:

  /sys/fs/cgroup/systemd/machine.slice
  /sys/fs/cgroup/systemd/machine.slice/machine-lxc\x2dmycontainer.scope

The fact that the libvirt_lxc process itself ends up in the right place suggests that this isn't libvirt, but rather that something else is creating these extra levels and moving systemd into them.

Regards,
Daniel
--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org       -o-    http://virt-manager.org :|
|: http://autobuild.org     -o-    http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o-   http://live.gnome.org/gtk-vnc :|
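The repeated nesting Daniel points out can be counted mechanically. Here is a minimal sketch, assuming the cgroup paths are read off the tree as posted; count_leading_repeats is a hypothetical helper, and the two example paths are transcribed from that tree rather than taken from a live system.

```python
def count_leading_repeats(cgroup_path, unit):
    """How many times the segment list `unit` repeats at the start of a path."""
    if not unit:
        raise ValueError("unit must contain at least one segment")
    parts = [segment for segment in cgroup_path.split("/") if segment]
    size = len(unit)
    repeats = 0
    while parts[repeats * size:(repeats + 1) * size] == unit:
        repeats += 1
    return repeats


# The two levels libvirt is expected to create (raw strings keep \x2d literal).
EXPECTED = ["machine.slice", r"machine-lxc\x2dmycontainer.scope"]
PAIR = "/" + "/".join(EXPECTED)

# Paths transcribed from the posted tree.
libvirt_lxc_path = PAIR                                    # PID 17303
systemd_user_path = PAIR * 3 + "/user.slice/user-0.slice/user@0.service"  # PID 17400

print("libvirt_lxc:", count_leading_repeats(libvirt_lxc_path, EXPECTED))
print("systemd --user:", count_leading_repeats(systemd_user_path, EXPECTED))
```

A count of 1 matches the layout libvirt is described as creating; anything higher marks a process that was moved into one of the extra levels.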