Peter Steele
2016-Jun-04 04:56 UTC
[libvirt-users] Host loses network connectivity when starting containers
I have hit a problem running libvirt-based containers on a CentOS 7 host, with the extra wrinkle that my host is an EC2 instance in AWS. Ultimately everything works as advertised, and I can launch instances that host multiple libvirt LXC containers without problems, with one caveat: about one time in ten when the containers are started, the instance loses all network connectivity. This can even happen on instances that host only a single container, so it's not related to the number of containers being run. Once this happens, the only way to fix the problem is to reboot the instance and hope that the containers start successfully this time.

In my troubleshooting I've found that the problem does not occur with containers defined with the linuxcontainers.org flavor of LXC. I've also discovered that if I configure my containers to use libvirt's default isolated bridge device virbr0, the loss of network connectivity does not appear to happen when the containers are started. Specifically, I ran a test that repeatedly started and stopped a single container configured with virbr0, and the test ran for a long time without an issue. When I switched the container to use my fully configured bridge device br0, the start/stop test usually hung within a few iterations: my ssh session into the instance would hang and I'd be unable to reconnect. Unfortunately AWS does not provide a console back door into an instance, so troubleshooting has been difficult. I've checked the system logs after one of these hangs, though, and there are no errors reported anywhere that I can see.

I do not see this problem when running on real hardware. I also do not see the problem when running under other virtual environments, including VMware, KVM, and VirtualBox. My guess is that it is a bug in libvirt, since containers defined with the linuxcontainers.org LXC framework do not cause this issue. There seems to be an AWS component to the problem as well, though, since I've only seen this happen on EC2-based instances.

Is anyone familiar with this problem? Does anyone have any suggestions for how I might resolve it? Note that I have tried a recent version of libvirt (1.3.2-1) and it behaves the same as the stock CentOS version of libvirt (1.2.17-13).

Peter
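For anyone wanting to reproduce this, the start/stop test described above might look roughly like the following Python sketch. The original test script was not posted, so everything here is an assumption: the function name, the iteration count, the pause length, and the use of `virsh` with the `lxc:///` connection URI to drive the container.

```python
import subprocess
import time


def cycle_container(name, iterations=100, pause=5.0, run=subprocess.run):
    """Start and stop one libvirt LXC container repeatedly.

    Returns the number of completed start/stop cycles. `run` is
    injectable so the loop can be exercised without libvirt installed.
    All names and defaults are hypothetical, not from the original post.
    """
    virsh = ["virsh", "-c", "lxc:///"]
    completed = 0
    for _ in range(iterations):
        # check=True makes subprocess.run raise if virsh exits non-zero.
        run(virsh + ["start", name], check=True)
        time.sleep(pause)
        # "destroy" stops the running container; it does not undefine it.
        run(virsh + ["destroy", name], check=True)
        time.sleep(pause)
        completed += 1
    return completed
```

Calling `cycle_container("my-container")` from an ssh session (the container name is a placeholder) would run the loop. If the host drops off the network as described, the session running the loop would simply hang rather than report an error.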
Peter Steele
2016-Jun-14 16:07 UTC
Re: [libvirt-users] Host loses network connectivity when starting containers
If anyone is curious about this, the cause turned out to be related to the way bridging works and how AWS deals with ARP table updates. The underlying problem is described here:

http://backreference.org/2010/07/28/linux-bridge-mac-addresses-and-dynamic-ports/

The solution was to add the line

MACADDR=<mac-addr>

to the ifcfg-br0 file used to define the bridge interface the containers reference. The MAC address should be set to the MAC address of the interface hosting the bridge (e.g. eth0). With this line in place, the bridge uses the same MAC address as its host interface and it never changes. Without this entry, the bridge's MAC address can change whenever a container is started, to match that of the container just started (a Linux bridge with no fixed address takes the lowest MAC address among its attached ports). This can cause a flood of ARPs, and that was causing issues for the AWS networking infrastructure.

Peter

On 06/03/2016 09:56 PM, Peter Steele wrote:
> I have hit a problem running libvirt based containers in a CentOS 7
> based host, with the extra wrinkle that my host is an EC2 instance in
> AWS. Ultimately everything works as advertised, and I can launch
> instances that host multiple libvirt lxc containers without problems,
> with one caveat: About one time in ten when the containers are
> started, the instance loses all network connectivity. This can even
> happen on instances that host only a single container, so it's not
> related to the number of containers that are being run. Once this
> happens, the only way to fix the problem is to reboot the instance and
> hope that on the reboot, the containers will start successfully this
> time.
>
> In troubleshooting efforts that I've done I've found that the problem
> does not occur with containers defined with the linuxcontainers.org
> flavor of lxc. I've also discovered that if I configure my containers
> to use libvirt's default isolated bridge device virbr0, this loss of
> network connectivity does not appear to happen when the containers are
> started. Specifically, I ran a test that repeatedly started/stopped a
> single container configured with virbr0 and the test ran for a long
> time without an issue. When I switched the container to use my fully
> configured bridge device br0, the start/stop test usually hung up
> within a few iterations. Meaning that my ssh session into the instance
> would hang and I'd be unable to reconnect to the instance.
> Unfortunately AWS does not provide a console back door into an
> instance so troubleshooting has been difficult. I've checked the
> system logs though after one of these hangs occurs and there are no
> errors reported anywhere that I can see.
>
> I do not see this problem when running on real hardware. I also do not
> see the problem when running under other virtual environments,
> including VMware, KVM, and VirtualBox. My guess is that it is a bug in
> libvirt (since containers defined with the linuxcontainers.org lxc
> framework do not cause this issue.) There seems to be an AWS component
> to the problem as well though since I've only seen this happen in EC2
> based instances.
>
> Is anyone familiar with this problem? Does anyone have any suggestions
> how I might resolve it? Note that I have tried a recent version of
> libvirt (1.3.2-1) and it behaves the same as the stock CentOS version
> of libvirt (1.2.17-13).
>
> Peter
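The MACADDR change described in the reply above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names `read_mac` and `set_macaddr` are made up for illustration, the MAC is read from the kernel's sysfs entry for the interface, and the ifcfg path in the usage comment is the usual CentOS 7 location rather than something stated in the thread.

```python
import re
from pathlib import Path


def read_mac(iface, sysfs_root="/sys/class/net"):
    """Return the MAC address the kernel reports for a network interface."""
    return Path(sysfs_root, iface, "address").read_text().strip()


def set_macaddr(ifcfg_text, mac):
    """Return ifcfg text with exactly one MACADDR= line, set to `mac`.

    Any existing MACADDR= line is dropped and the new one is appended,
    so running this twice does not duplicate the entry.
    """
    kept = [line for line in ifcfg_text.splitlines()
            if not re.match(r"\s*MACADDR=", line)]
    kept.append(f"MACADDR={mac}")
    return "\n".join(kept) + "\n"


# Hypothetical usage (path and interface name are assumptions):
#   path = Path("/etc/sysconfig/network-scripts/ifcfg-br0")
#   path.write_text(set_macaddr(path.read_text(), read_mac("eth0")))
```

The edit is kept as a pure text transformation so it can be checked against a copy of the file before touching the real one. The change would presumably only take effect once the bridge interface is brought up again.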