Hello James,

I am doing quite rigorous torture tests with Xen and GPLPV. Let me first repeat the test setup:

Use Xen 4.1.1 and kernel 2.6.32.36 (commit ae333e9). Configure 2 HVMs called VM1 and VM2 as follows (per HVM): 2 VCPUs, 2 virtual disks, 1024 MB RAM, viridian=1.

Install Windows 2008 R2 SP1. Do every installation twice, once per VM; never clone. Install GPLPV, iometer 2006.07.27, prime95 26.6 x64, ActiveState Perl 5.12.4 x64, wget for Windows and the attached perl script.

Run iometer with 2 workers, both on the separate second virtual disk, queue depth 4 per worker, access specification "All in one". Run the prime95 torture test with "In-place large FFTs". On VM1 use the task manager to set affinity to VCPU2, on VM2 set affinity to VCPU1. Run the perl script to fetch a good mix of some large (50-500 MB) and many small (a few KB) files from a high-performance FTP server on the LAN (I use vsftpd).

This generates quite some load, as vmstat shows:

virt5620 ~ # vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa
 0  0      0 723408   6860  33860    0    0 82113 82132 22503 30252  2 12 84  0
 0  0      0 723408   6860  33860    0    0 80117 82913 23109 30776  1 13 83  0
 4  0      0 723408   6860  33860    0    0 92555 87013 28411 33283  2 12 84  0
 4  0      0 723408   6860  33860    0    0 82678 85775 26228 31739  1 13 83  0
 5  0      0 723408   6860  33860    0    0 82252 84837 24180 29723  1 14 82  0

With GPLPV 0.11.0.308 it worked perfectly and with very good performance for over 9 days, but then when I wanted to monitor the status, I was no longer able to connect via remote desktop. When examining the file system of the HVMs I found that somehow even the prime95 processes had stopped.

Any ideas? Could c/s 948 make any difference? Network worked perfectly for 9 days, so I ask myself if the count of c/s 948 is used at all?

Regards
Andreas

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
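For anyone who wants to compare the load between test runs, the vmstat rows quoted in the message above can be averaged with a few lines of Python. This is only a minimal sketch and not part of the original test setup: the function names are made up for illustration, and the column order is taken from the vmstat header shown in the message.

```python
from statistics import mean

# Column order as printed in the vmstat header of the message.
VMSTAT_COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
                  "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]

# The five data rows quoted in the message.
SAMPLE = """\
0 0 0 723408 6860 33860 0 0 82113 82132 22503 30252 2 12 84 0
0 0 0 723408 6860 33860 0 0 80117 82913 23109 30776 1 13 83 0
4 0 0 723408 6860 33860 0 0 92555 87013 28411 33283 2 12 84 0
4 0 0 723408 6860 33860 0 0 82678 85775 26228 31739 1 13 83 0
5 0 0 723408 6860 33860 0 0 82252 84837 24180 29723 1 14 82 0
"""


def parse_vmstat_rows(text):
    """Turn whitespace-separated vmstat data rows into dicts keyed by column."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip header lines and anything that is not a complete numeric row.
        if len(fields) != len(VMSTAT_COLUMNS):
            continue
        if not all(field.isdigit() for field in fields):
            continue
        rows.append(dict(zip(VMSTAT_COLUMNS, map(int, fields))))
    return rows


def column_means(rows, columns=("bi", "bo", "in", "cs")):
    """Average the selected columns over all parsed rows."""
    return {col: mean(row[col] for row in rows) for col in columns}


if __name__ == "__main__":
    for name, value in column_means(parse_vmstat_rows(SAMPLE)).items():
        print(f"{name}: {value:.0f}")
```

Feeding it the full output of a longer `vmstat 1` capture works the same way, since header lines are skipped automatically.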
> With GPLPV 0.11.0.308 it worked perfectly and with very good performance
> for over 9 days but then when I wanted to monitor the status, I was no
> longer able to connect via remote desktop. When examining the file
> system of the HVMs I found that somehow even the prime95 processes had
> stopped.
>
> Any ideas? Could c/s 948 make any difference? Network worked perfectly
> for 9 days, so I ask myself if the count of c/s 948 is used at all?

There is some combination that I can't reproduce that seems to cause a problem when that count isn't passed in correctly. So it is a bug, but I'm not sure if it causes the problems you are seeing.

I can give you a link to a build with that fix applied if you want to test further.

Thanks

James
>> With GPLPV 0.11.0.308 it worked perfectly and with very good performance
>> for over 9 days but then when I wanted to monitor the status, I was no
>> longer able to connect via remote desktop. When examining the file
>> system of the HVMs I found that somehow even the prime95 processes had
>> stopped.
>> Any ideas? Could c/s 948 make any difference? Network worked perfectly
>> for 9 days, so I ask myself if the count of c/s 948 is used at all?
> There is some combination that I can't reproduce that seems to cause a
> problem when that count isn't passed in correctly. So it is a bug, but
> I'm not sure if it causes the problems you are seeing.

What exactly is the problem you encountered?

> I can give you a link to a build with that fix applied if you want to
> test further.

Thanks. I have plans to test the 0.11.0.312 version. I have a complete build system here and a kernel-mode enabled signing certificate, so I can use that for my tests.

Regards
Andreas
>>> With GPLPV 0.11.0.308 it worked perfectly and with very good performance
>>> for over 9 days but then when I wanted to monitor the status, I was no
>>> longer able to connect via remote desktop. When examining the file
>>> system of the HVMs I found that somehow even the prime95 processes
>>> had stopped.
>>> Any ideas? Could c/s 948 make any difference? Network worked perfectly
>>> for 9 days, so I ask myself if the count of c/s 948 is used at all?
>> There is some combination that I can't reproduce that seems to cause a
>> problem when that count isn't passed in correctly. So it is a bug, but
>> I'm not sure if it causes the problems you are seeing.
>
> What exactly is the problem you encountered?
>
>> I can give you a link to a build with that fix applied if you want to
>> test further.
>
> Thanks. I have plans to test the 0.11.0.312 version. I have a complete
> build system here and a kernel-mode enabled signing certificate so I can
> use that for my tests.

I actually looked back through the changes and there is still a fix to come - basically xennet allocates new buffers when it needs them, but never frees them again, so if there is a really big burst of traffic it could end up taking all the available memory. That could cause the problem you are seeing.

I'm a bit busy with some other things at the moment but I hope to have a fix by the end of the week.

James
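To illustrate the allocation pattern described in the message above (buffers allocated on demand during a burst and never released afterwards), here is a toy free-list pool in Python. It is a sketch of the general idea only and is not the xennet code: the class, the `max_idle` cap and the 4096-byte buffer size are all hypothetical and chosen purely for illustration.

```python
class BufferPool:
    """Toy free-list pool: grows on demand, trims back to `max_idle` on release."""

    def __init__(self, buffer_size=4096, max_idle=4):
        self.buffer_size = buffer_size
        self.max_idle = max_idle  # hypothetical cap on cached free buffers
        self._free = []           # buffers that exist but are currently unused
        self.allocated = 0        # buffers currently alive (in use or cached)

    def get(self):
        # Reuse a cached buffer if possible, otherwise allocate a new one.
        if self._free:
            return self._free.pop()
        self.allocated += 1
        return bytearray(self.buffer_size)

    def put(self, buf):
        # Cache up to `max_idle` buffers; anything beyond that is dropped so
        # its memory can be reclaimed. Without this branch the pool would stay
        # at its burst-time size forever.
        if len(self._free) < self.max_idle:
            self._free.append(buf)
        else:
            self.allocated -= 1


if __name__ == "__main__":
    pool = BufferPool(max_idle=4)
    burst = [pool.get() for _ in range(100)]  # simulated traffic burst
    print("during burst:", pool.allocated)
    for buf in burst:
        pool.put(buf)
    print("after burst:", pool.allocated)
```

Setting `max_idle` very high reproduces the behaviour described in the message: the pool never shrinks after a burst.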
Andreas Kinzler
2011-Sep-12 21:39 UTC
Re: [Xen-devel] RE: Stability report GPLPV 0.11.0.308
On 06.09.2011 07:21, James Harper wrote:
> I actually looked back through the changes and there is still a fix to
> come - basically xennet allocates new buffers when it needs them, but
> never frees them again so if there is a really big burst of traffic it
> could end up taking all the available memory. That could cause the
> problem you are seeing. I'm a bit busy with some other things at the
> moment but I hope to have a fix by the end of the week.

I am currently running the torture test on 0.11.0.312 and have not seen any increase in kernel memory usage (uptime is over 7 days now with quite extreme load on net/disk). Anything new about your fix?

Regards
Andreas
Andreas Kinzler
2011-Sep-19 11:35 UTC
Re: [Xen-devel] RE: Stability report GPLPV 0.11.0.308
Hello James,

the torture test of GPLPV 0.11.0.312 failed (as 0.11.0.308 did). What really puzzles me is that the uptime was 9-10 days for both VMs (as with 0.11.0.308). One could think that there is something special about an uptime of 9-10 days. There is no noticeable malfunction in dom0, and while the domUs were running they looked perfectly fine. Really odd.

Regards
Andreas
> Hello James,
>
> the torture test of GPLPV 0.11.0.312 failed (as 0.11.0.308 did). What
> really puzzles me is that the uptime was 9-10 days for both VMs (as in
> 0.11.0.308). One could think that there is something about the uptime of
> 9-10 days. There is no noticeable malfunction in dom0 and while the
> domUs were running they looked perfectly. Really odd.

I haven't tested it well, but the latest should be a bit more stable. Could you try one of the following from http://www.meadowcourt.org/private:

gplpv_Vista2008x64_0.11.0.265.msi
gplpv_2000_0.11.0.322_debug.msi
gplpv_XP_0.11.0.322_debug.msi
gplpv_2003x32_0.11.0.322_debug.msi
gplpv_2003x64_0.11.0.322_debug.msi
gplpv_Vista2008x32_0.11.0.322_debug.msi
gplpv_Vista2008x64_0.11.0.322_debug.msi
gplpv_2000_0.11.0.322.msi
gplpv_XP_0.11.0.322.msi
gplpv_2003x32_0.11.0.322.msi
gplpv_2003x64_0.11.0.322.msi
gplpv_Vista2008x32_0.11.0.322.msi
gplpv_Vista2008x64_0.11.0.322.msi

thanks

James
Andreas Kinzler
2011-Sep-22 09:49 UTC
Re: [Xen-devel] RE: Stability report GPLPV 0.11.0.308
Hello James,

> I haven't tested it well, but the latest should be a bit more stable.
> Could you try one of the following from
> http://www.meadowcourt.org/private

Gives me an HTTP 403. I'd really prefer to fetch it from your mercurial repo, since I don't want to use testsigning and would rather use our real certificate.

Regards
Andreas
> Hello James,
>
>> I haven't tested it well, but the latest should be a bit more stable.
>> Could you try one of the following from
>> http://www.meadowcourt.org/private
>
> Gives me an HTTP 403.

You need to append the filename to the url - no browsing on that directory.

> I'd really prefer to fetch it from your mercurial repo
> since I don't want to use testsigning and use our real certificate
> instead.

My laptop broke and I haven't gotten mercurial set up yet. I'll push it as soon as I can.

James
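As a small illustration of "append the filename to the url", the sketch below builds the download address from the directory URL and one of the filenames listed earlier in the thread. The helper name is made up; the one real pitfall it guards against is that `urllib.parse.urljoin` replaces the last path segment when the base does not end in a slash.

```python
from urllib.parse import urljoin

# Directory URL as given in the thread (directory listing itself returns 403).
BASE = "http://www.meadowcourt.org/private"


def download_url(filename, base=BASE):
    """Return base + "/" + filename, regardless of a trailing slash on base."""
    # Without the trailing slash, urljoin would drop the "private" segment.
    return urljoin(base.rstrip("/") + "/", filename)


if __name__ == "__main__":
    print(download_url("gplpv_Vista2008x64_0.11.0.322.msi"))
```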
Andreas Kinzler
2011-Sep-23 20:57 UTC
Re: [Xen-devel] RE: Stability report GPLPV 0.11.0.308
Hello James,

I did take a look at your commit 950 and I think there are 3 typos (see my patch).

Anyway, I don't think that memory problems are causing the stability issues and (as some kind of "proof") I did not notice any increase in kernel memory usage during the uptime of the VMs.

Actually I am not really sure if xennet is even the problem, since in none of the crash scenarios was there anything in the Windows event log. In my last test I was able to enter my password via VNC (it did not log in, though) - this should have written some entries to the security log, which it did not. So I assume that xenvbd was dead too (killed by xennet) or is actually the real reason for the stability problems.

Regards
Andreas
> I did take a look at your commit 950 and I think there are 3 typos (see
> my patch).

Thanks for that. I guess the length parameter isn't being verified, or I would have crashed...

> Anyway, I don't think that memory problems are causing the stability
> issues and (as some kind of "proof") I did not notice any increase in
> kernel memory usage during uptime of the VMs.
>
> Actually I am not really sure if xennet is even the problem since in
> none of the crash scenarios there was something in the Windows event
> log. In my last test I was able to enter my password via VNC (it did not
> login though) - this should have written some entries to the security
> log which it did not. So I assume that xenvbd was dead too (killed by
> xennet) or is actually the real reason for the stability problems.

I'm just setting up a few new servers and on one of them a Windows 2008R2 machine hung and there were lots of xenvbd errors in the logs. The underlying block device for that DomU is iSCSI and I assumed the problem was there (I was doing a lot of testing on the SAN at the time) but maybe not.

No evidence of problems in /var/log/xen/qemu-dm-<domu name>.log?

James
Andreas Kinzler
2011-Sep-26 14:44 UTC
Re: [Xen-devel] RE: Stability report GPLPV 0.11.0.308
On 24.09.2011 01:49, James Harper wrote:
> I'm just setting up a few new servers and on one of them a Windows
> 2008R2 machine hung and there were lots of xenvbd errors in the logs.
> The underlying block device for that DomU is iSCSI and I assumed the
> problem was there (I was doing a lot of testing on the SAN at the time)
> but maybe not.

Hmmm, do you have any more ideas? Is there anything I could do to help you debug?

> No evidence of problems in /var/log/xen/qemu-dm-<domu name>.log?

I did not see anything special.

Regards
Andreas
> On 24.09.2011 01:49, James Harper wrote:
>> I'm just setting up a few new servers and on one of them a Windows
>> 2008R2 machine hung and there were lots of xenvbd errors in the logs.
>> The underlying block device for that DomU is iSCSI and I assumed the
>> problem was there (I was doing a lot of testing on the SAN at the
>> time) but maybe not.
>
> Hmmm, do you have any more ideas? Anything how I could help you
> debugging?

You could look for all the failure paths (eg can't allocate buffer etc) and put debug prints there. I'm almost in a position to be able to start testing a bit better now.

James
> On 24.09.2011 01:49, James Harper wrote:
>> I'm just setting up a few new servers and on one of them a Windows
>> 2008R2 machine hung and there were lots of xenvbd errors in the logs.
>> The underlying block device for that DomU is iSCSI and I assumed the
>> problem was there (I was doing a lot of testing on the SAN at the
>> time) but maybe not.
>
> Hmmm, do you have any more ideas? Anything how I could help you
> debugging?
>
>> No evidence of problems in /var/log/xen/qemu-dm-<domu name>.log?
>
> I did not see anything special.

I just had a crash under heavy testing:

12961573357740:
12961573357740: *** Assertion failed: pi.curr_mdl
12961573357740: *** Source File: c:\projects\win-pvdrivers.hg\xennet\xennet6_tx.c, line 308
12961573357740:

Can you have a look in your logs for anything like this? I'm curious as to whether we are chasing the same problem or a different one. I haven't checked yet to make sure I am running the latest code on that server so it could just be something stupid...

James
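Searching a large log for entries like the one quoted above is easy to script. The following is a minimal sketch that assumes the log uses exactly the line format shown in the message (a numeric stamp, a colon, then `*** Assertion failed:` followed by a `*** Source File:` line); where the debug output ends up on a given setup is not stated here, so the caller is expected to supply the log text. All names are made up for illustration.

```python
import re

# Patterns derived from the log excerpt quoted in the message.
ASSERT_RE = re.compile(r"^(?P<stamp>\d+): \*\*\* Assertion failed: (?P<expr>.+)$")
SOURCE_RE = re.compile(r"^\d+: \*\*\* Source File: (?P<path>.+), line (?P<line>\d+)$")

SAMPLE_LOG = r"""12961573357740:
12961573357740: *** Assertion failed: pi.curr_mdl
12961573357740: *** Source File: c:\projects\win-pvdrivers.hg\xennet\xennet6_tx.c, line 308
12961573357740:
"""


def find_assertions(log_text):
    """Return one dict per assertion failure, with source file/line if present."""
    hits = []
    pending = None  # assertion still waiting for its "Source File" line
    for raw in log_text.splitlines():
        line = raw.rstrip()
        match = ASSERT_RE.match(line)
        if match:
            pending = {"stamp": int(match["stamp"]), "expr": match["expr"],
                       "file": None, "line": None}
            hits.append(pending)
            continue
        match = SOURCE_RE.match(line)
        if match and pending is not None:
            pending["file"] = match["path"]
            pending["line"] = int(match["line"])
            pending = None
    return hits


if __name__ == "__main__":
    for hit in find_assertions(SAMPLE_LOG):
        print(f'{hit["expr"]} at {hit["file"]}:{hit["line"]}')
```

To scan a real file, read it with `open(path, errors="replace").read()` and pass the result to `find_assertions`.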
Andreas Kinzler
2011-Sep-30 09:17 UTC
Re: [Xen-devel] RE: Stability report GPLPV 0.11.0.308
Hello James,

> 12961573357740: *** Assertion failed: pi.curr_mdl
> 12961573357740: *** Source File:
> c:\projects\win-pvdrivers.hg\xennet\xennet6_tx.c, line 308

I took this report about a problem you had with "tx" as a reason to modify my tests and make them mostly tx-based, while previously they were mostly rx-based. Tests have been running for 2d 18h now - no problems so far.

I wanted to tell you about one interesting observation. In my tests I did two runs with modified xenvbd drivers. In run #1 I switched to the scsiport driver of 0.11.0.312 and this made one domU crash after one day, while with the 0.11.0.312 storport version I always had more than 9 days (as I reported earlier). In run #2 I forward-ported xenvbd from 0.11.0.213 (which is totally stable on our systems) and again one domU crashed after one day. This is really interesting and leads me to two thoughts:

1) xennet has some problem, but still why does scsiport vs. storport make a difference then?

2) perhaps there is some new bug outside xennet and outside xenvbd (some infrastructure thing: event handling, PCI, ...) and this is the real reason.

> Can you have a look in your logs for anything like this? I'm curious as
> to if we are chasing the same problem or a different one.

I am not running kernel debugging so far (have played with it though). So I cannot say.

Regards
Andreas
> 1) xennet has some problem, but still why does scsiport vs. storport
> make a difference then?

If I'm running over the end of a buffer then anything goes...

> 2) perhaps there is some new bug outside xennet and outside xenvbd (some
> infrastructure thing: event handling, PCI, ...) and this is the real
> reason.

Could be.

>> Can you have a look in your logs for anything like this? I'm curious
>> as to if we are chasing the same problem or a different one.
>
> I am not running kernel debugging so far (have played with it though).
> So I cannot say.

You only need to be running the debug version of the drivers for this to be logged. Also, are you running with the driver verifier enabled? That can help catch bugs and also allows you to notice memory leaks a bit better.

One other thing to try is running with the checked build of windows. I've used the checked kernel and hal under 2003 and it picked up one error - it basically just checks parameters etc everywhere it can to make sure you aren't doing anything you shouldn't. I seem to remember the checked NDIS driver threw a fit though, because xen was presenting 4 CPUs but the cpuid registers said it was a single cpu with 4 cores, so there are little things like that that can be troublesome with the checked builds...

James
>> 12961573357740: *** Assertion failed: pi.curr_mdl
>> 12961573357740: *** Source File:
>> c:\projects\win-pvdrivers.hg\xennet\xennet6_tx.c, line 308
>
> I took this report about a problem you had with "tx" to modify my tests
> and to make them mostly tx-based while previously they were mostly
> rx-based. Tests are running for 2d 18h now - no problems so far.
>
> I wanted to tell you about one interesting observation. In my tests I
> did two runs with modified xenvbd drivers. In run #1 I switched to the
> scsiport driver of 0.11.0.312 and this made one domU crash after one day
> while with 0.11.0.312 storport version I always had more than 9 days (as
> I reported earlier). In run #2 I forward-ported xenvbd from 0.11.0.213
> (which is totally stable on our systems) and again one domU crashed
> after one day. This is really interesting and leads me to two thoughts:

Actually one other thing you could try is simply using the Windows 2003 version of the drivers. That uses ndis5 and scsiport instead of ndis6 and storport. If that worked we could try running with ndis5 + storport and see if that works okay. As long as they are from the same patchlevel it shouldn't matter if you use one compiled for windows 2008 and one for windows 2003 (it's possible that it might matter but I can't think of anything).

James
> Hello James,
>
>> 12961573357740: *** Assertion failed: pi.curr_mdl
>> 12961573357740: *** Source File:
>> c:\projects\win-pvdrivers.hg\xennet\xennet6_tx.c, line 308
>
> I took this report about a problem you had with "tx" to modify my tests
> and to make them mostly tx-based while previously they were mostly
> rx-based. Tests are running for 2d 18h now - no problems so far.

I'm not quite sure but I think my testing was in the rx direction just prior to the crash. I think the testing finished and then this crash happened a few minutes later. I think it also happened when the machine hadn't been up for very long too... I haven't been able to reproduce it since though, which is frustrating.

James
Andreas Kinzler
2011-Oct-01 09:28 UTC
Re: [Xen-devel] RE: Stability report GPLPV 0.11.0.308
Hello,

> What version of Xen are you testing with? I'm using the latest xen 4.1

I am using Xen 4.1.1 official.

> from hg and this morning I couldn't log into my server via RDP because
> the date had advanced about 2 months. I was testing it hard enough that
> it lost network connectivity for a bit so I'm wondering if that had
> something to do with it... have you seen anything like that during
> intensive testing?

Yes, there might be additional problems that are caused by time issues, but that does not explain why the Windows log does not mention my login attempt (see earlier mail). I have one production system with Xen 4.1.1 and GPLPV 0.11.0.213 which has an uptime of more than 100 days.

Regards
Andreas
Andreas Kinzler
2011-Oct-01 11:10 UTC
Re: [Xen-devel] RE: Stability report GPLPV 0.11.0.308
Hello James,

> Actually one other thing you could try is simply using the Windows 2003
> version of the drivers. That uses ndis5 and scsiport instead of ndis6
> and storport. If that worked we could try running with ndis5 + storport
> and see if that works okay. As long as they are from the same patchlevel
> it shouldn't matter if you use one compiled for windows 2008 and one for
> windows 2003 (it's possible that it might matter but I can't think of
> anything).

After 3d:18h I stopped my tx-based test with no problems so far. My conclusion: the switch from rx-based to tx-based did not change anything.

I have now compiled 0.11.0.312 with scsiport and ndis5 (and patched the .inf file, deleting the [XenGplPv.NT$ARCH$.6.0] section). The test is now running.

Regards
Andreas
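For repeat builds, the manual .inf edit mentioned above (removing the [XenGplPv.NT$ARCH$.6.0] section) could be scripted. The sketch below only shows the section-dropping step and rests on several assumptions: the sample INF content is entirely made up, the helper name is hypothetical, and the real file may reference the section elsewhere, which this code does not touch.

```python
def drop_inf_section(inf_text, section):
    """Return inf_text without the named [section] header and its body."""
    header = f"[{section}]".lower()
    kept = []
    skipping = False
    for line in inf_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Every section header either starts or ends the skipped region.
            skipping = stripped.lower() == header
        if not skipping:
            kept.append(line)
    return "".join(kept)


# Made-up INF fragment for demonstration; only the section names follow
# the message, the entries are placeholders.
SAMPLE_INF = """\
[XenGplPv.NT$ARCH$]
%Example.DRVDESC%=Example_Inst, XEN_EXAMPLE

[XenGplPv.NT$ARCH$.6.0]
%Example6.DRVDESC%=Example6_Inst, XEN_EXAMPLE

[Strings]
Example.DRVDESC="example"
"""

if __name__ == "__main__":
    print(drop_inf_section(SAMPLE_INF, "XenGplPv.NT$ARCH$.6.0"))
```

After such an edit the driver package still has to be re-signed, since the .inf content has changed.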
Andreas Kinzler
2011-Oct-10 16:07 UTC
Re: [Xen-devel] RE: Stability report GPLPV 0.11.0.308
Hello James,

>> Actually one other thing you could try is simply using the Windows
>> 2003 version of the drivers. That uses ndis5 and scsiport instead
>> of ndis6 and storport. If that worked we could try running with
>> ndis5 + storport and see if that works okay. As long as they are
>> from the same patchlevel it shouldn't matter if you use one
>> compiled for windows 2008 and one for windows 2003 (it's possible
>> that it might matter but I can't think of anything).
> I now compiled 0.11.0.312 with scsiport and ndis5 (and patched the
> .inf file, deleted the [XenGplPv.NT$ARCH$.6.0] section). Test is now
> running.

Crashed after 1-2 days, but I found that ndis5 of 0.11.0.213 has major differences from ndis5 of 0.11.0.312, so I am not sure what that means.

The whole reason I am doing all the testing is that the net performance of 0.11.0.213 is not good enough, while 0.11.0.312 has near-native performance on gigabit links - but even the ndis5 driver of 0.11.0.312 has very good performance, so it does not seem to be an NDIS 6 improvement.

Any news on your side?

Regards
Andreas