Benjamin Weaver
2011-Aug-18 09:06 UTC
[Xen-users] Debian Squeeze server running 2 Ubuntu (Lucid) vms blows up under large NFS network load
I am running Debian Squeeze with Xen 4.0 and am running a stress test. I have created 2 virtual machines, each with 512 MB of memory and 8 GB in size. I have made one of the VMs an NFS server, sharing out a large file (4.6 GB). The other virtual machine is an NFS client. The stress test consists of passing that big file back and forth via an mv command executed on the client, which moves the file between the NFS share directory and a local directory (a sketch of the loop is at the end of this message). The virtual machines are stored on a remote SAN, connected to by iSCSI and formatted with OCFS2.

It is true I have had better luck with some Ethernet cards than others. One of the boxes, running Intel 1000Mb cards (one to the SAN/OCFS2, one to the outside world), runs the VMs and the stress test without problems.

But the other box, running a 1000Mb Realtek NIC to the outside world and a 100Mb Realtek NIC to the SAN, fails. The 100Mb NIC was dropping packets to the SAN, so I changed the SAN NIC to the 1000Mb Realtek. Now I do not drop packets (aside from a handful on the 2 vif interfaces at startup).

And yet, about 1 out of 2 times I attempt to mv the file from the local directory of the NFS client VM to the NFS share, the box running the VMs reboots. It leaves no logs, and seldom even any messages on the screen. It just blanks out and the next thing I know it is rebooting.

I have tried manipulating the size of the MTU, without success. I have noticed that all--or nearly all--the reboots occur when I attempt to mv the file BACK INTO the NFS shared directory.

I began testing with tcpdump, and noticed that a large number of packets go over with correct checksums; then, after a packet of unusually long length, all show incorrect checksums (but I am new to tcpdump and may not be interpreting the output correctly).

Any ideas why the host machine is rebooting, and how this could be fixed? Could changing the size of the ring buffer make a difference? I read about this on a couple of web pages.
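For reference, the test loop on the client is essentially the following (paths and names are illustrative, not the real ones):

    # run on the NFS client VM; /mnt/share is the NFS mount,
    # /data is a local directory, bigfile is the 4.6 GB file
    while true; do
        mv /data/bigfile /mnt/share/bigfile   # this direction is the one that tends to reboot the host
        mv /mnt/share/bigfile /data/bigfile
    done

and the tcpdump capture was along these lines ("nfsserver" stands in for the server VM's address):

    # verbose capture with checksum verification on the client's interface
    tcpdump -i eth0 -vv host nfsserver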
Andrew
2011-Aug-18 10:15 UTC
Re: [Xen-users] Debian Squeeze server running 2 Ubuntu (Lucid) vms blows up under large NFS network load
Hi Benjamin,

I had a somewhat similar situation earlier this year (VM host freezing in my situation). The end fix was to use an Intel server-grade gigabit NIC. Perfect since then. The other NICs interrupt the CPU like crazy (once for each packet), and when I started moving more than about 50,000 packets/sec it would lock up. I have had the Intel NIC moving 60-70K packets/sec without issues.

Cheers,
Andrew
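P.S. If you want to see how hard a NIC is interrupting the CPU, watching /proc/interrupts is a quick check (interface names will differ on your box):

    # per-CPU interrupt counts for the ethernet devices, refreshed every
    # second with deltas highlighted; a cheap NIC under load climbs fast here
    watch -d -n1 'grep -i eth /proc/interrupts'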
Benjamin Weaver
2011-Aug-19 16:42 UTC
Re: [Xen-users] Debian Squeeze server running 2 Ubuntu (Lucid) vms blows up under large NFS network load
Andrew,

This is music to my ears: I have been struggling with this for a while. To understand the problem, I have a few questions.

(1) Are Intel server NICs better owing to their interrupt throttle control? They apparently offer better interrupt control than other NICs. This is suggested by the output of modinfo -p on the Intel e1000 module, in contrast to the Realtek module, 8139too: the Intel driver lists the RxIntDelay and TxIntDelay parameters, whereas nothing of this kind is available for the Realtek 8139.

(2) Might a workaround to NIC replacement consist of adjusting any one of several system buffers or other parameters that govern packet handling? I am thinking of adjusting either (a) Linux system buffer sizes via sysctl.conf and TCP window sizes, or (b) ifconfig parameters--packet queue length (txqueuelen) or MTU?

I have adjusted all these parameters, including enlarging the system buffers wmem_max and rmem_max, but so far have not been able to get a fix on the server crashes through these. (The sorts of settings I tried are sketched below.)
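The settings I tried were along these lines (the values are examples, not recommendations):

    # /etc/sysctl.conf -- raise the socket buffer ceilings
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216

    # interface-level knobs ("eth0" stands in for the actual interface)
    ifconfig eth0 txqueuelen 2000
    ifconfig eth0 mtu 1500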
Andrew
2011-Aug-19 22:27 UTC
Re: [Xen-users] Debian Squeeze server running 2 Ubuntu (Lucid) vms blows up under large NFS network load
Hi Benjamin,

1) Yes, I ended up tuning the interrupt throttling on the Intel NIC, as well as a few other parameters. Even before the tuning it seemed to make a difference (maybe some other default configuration assisted), but at the same time, I like the general quality of the card, and the exceptional Linux support. I found this article useful (with some further changes): http://x443.wordpress.com/2011/03/18/tuning-intel-pro1000-family-nics-drivers-parameters-for-maximal-throughput-sk25921/

2) I can't remember if I tried this with the card that was giving me grief, but I believe I might have (I went through a lot of options). A lot of the cheaper cards don't use MSI interrupts (and are generally messy); it was amazing to see the CPU load drop after putting the Intel card in.

Cheers,
Andrew
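P.S. On the e1000 driver the usual knob is the InterruptThrottleRate module option; the value below is just an example, tune to taste:

    # list the tunables the driver exposes
    modinfo -p e1000

    # /etc/modprobe.d/e1000.conf -- cap the card at ~8000 interrupts/sec
    options e1000 InterruptThrottleRate=8000

    # check whether a given card is actually using MSI/MSI-X
    lspci -vv | grep -i msi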