Hello James,

I am still running tests 7 days a week on two test systems. Results are quite discouraging though. After experiencing crash after crash I wanted to test whether the configuration I called "stable" (Xen 4.0.1, GPLPV 0.11.0.213, dom0 kernel 2.6.32.18-pvops0-ak3) was indeed stable. But even that config crashed when running my torture test. It is stable on our production systems - running other workloads, of course.

> One thing I thought of... virtualisation gives an interesting
> opportunity to exaggerate race conditions. If you have 8 vCPUs in a
> DomU but only let one or two physical CPUs service those 8 vCPUs, then
> it can give rise to race conditions which could only rarely be seen
> (or never be seen) in normal operation. It's awful for performance, but
> if you could try that and see if it gives rise to crashes a bit
> more frequently, it might help us track down the problem.

What exactly is the config you are talking about, in terms of the Xen/dom0 command line? In terms of domU config files?

As always, I monitor your mercurial repo ;-) How would you see the relationship of commits 952+953 to our problem? 952 seems to affect LSO in some way, since LsoV1TransmitComplete.TcpPayload is finally wrong (could it be negative, since tx_length is smaller than the fixed tx_length?). What about 953?

One more thought: as mentioned earlier, crashes often occurred after an uptime of 9-10 days, and these crashes occurred too consistently to be a "by chance" event. In my torture tests I am NOT USING a Windows NTP service (I use the Meinberg NTP daemon on Windows). But in production I do. Can you see any possible impact here?

Regards
Andreas
> Hello James,
>
> I am still running tests 7 days a week on two test systems. Results are quite
> discouraging though. After experiencing crash after crash I wanted to test if
> the configuration I called "stable" (Xen 4.0.1, GPLPV 0.11.0.213, dom0 kernel
> 2.6.32.18-pvops0-ak3) was stable indeed. But even that config crashed when
> running my torture test. It is stable on our production systems - running
> other workloads of course.

What crash are you getting these days? Is it the same one as you used to get?

> > One thing I thought of... virtualisation gives an interesting
> > opportunity to exaggerate race conditions. If you have 8 vCPUs in a
> > DomU but only let one or two physical CPUs service those 8 vCPUs, then
> > it can give rise to race conditions which could only rarely be seen
> > (or never be seen) in normal operation. It's awful for performance, but
> > if you could try that and see if it gives rise to crashes a bit
> > more frequently, it might help us track down the problem.
>
> What exactly is the config you are talking about in terms of Xen/dom0
> command line? In terms of domU config files?

I don't remember the exact syntax, but if you specify vcpus=4 and only let the DomU run on one physical CPU it might trip up more often, if the problem is caused by a race. If the problem is an arithmetic error in xennet then it won't help.

> As always, I monitor your mercurial repo ;-) How would you see the
> relationship of commits 952+953 to our problem? 952 seems to affect LSO in
> some way since LsoV1TransmitComplete.TcpPayload is finally wrong (could it
> be negative since tx_length is smaller than the fixed tx_length?). What about
> 953?

Not sure.

> One more thought: As mentioned earlier crashes often occurred after an
> uptime of 9-10 days and these crashes occurred too consistently to be a "by
> chance" event. In my torture tests I am NOT USING a Windows NTP service (I
> use the Meinberg NTP daemon on Windows). But on production I do. Can
> you see any possible impact here?
It's certainly more likely for a stray UDP packet to cause an upset, I guess. As the packets pass through a Linux firewall (iptables in Dom0) it's more likely that errant TCP packets will be dropped there.

Do you have a crash dump against 0.11.0.323?

James
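[For reference, since James doesn't recall the exact syntax: a domU config fragment along these lines would confine several vCPUs to one physical CPU. This is a sketch only - the pinned CPU number is an assumption, adjust for your host:]

```
# domU config file: 4 virtual CPUs, all restricted to physical CPU 0,
# so the vCPUs must time-share one pCPU and races get exaggerated
vcpus = 4
cpus = "0"
```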
On 29.11.2011 00:16, James Harper wrote:
>> I am still running tests 7 days a week on two test systems. Results are quite
>> discouraging though. After experiencing crash after crash I wanted to test if
>> the configuration I called "stable" (Xen 4.0.1, GPLPV 0.11.0.213, dom0 kernel
>> 2.6.32.18-pvops0-ak3) was stable indeed. But even that config crashed when
>> running my torture test. It is stable on our production systems - running
>> other workloads of course.
> What crash are you getting these days? Is it the same one as you used to
> get?

Yes, still exactly the same crashes.

Good news: I think I have found the bug. Since I am not really a Xen or Windows kernel developer I cannot say for sure, but here is what I found:

When the domU hung I ran xentop and found out that the number of vbd read requests was a number like 0x7FFFzzzz in hex, which led me to a thesis: GPLPV crashes as soon as the number of disk requests reaches 2^32. On my hardware with 5000 IOPS this is reached in

2^32 / 5000 IOPS / 3600 sec-per-hour / 24 hours-per-day = 9.94 days

And there we go: those are the 9-10 days I was always seeing.

I studied the source code of blkback/blktap/aio and found nothing. But in GPLPV and its use of the ring macros I found suspicious code in every version of GPLPV I ever used:

while (more_to_do)
{
  rp = xvdd->ring.sring->rsp_prod;
  KeMemoryBarrier();
  for (i = xvdd->ring.rsp_cons; i < rp; i++)
  {
    rep = XenVbd_GetResponse(xvdd, i);

If now rp is 10, for example, and xvdd->ring.rsp_cons is 0xFFFFFFF7, then the for loop is skipped, responses are not delivered and we see the hang.

Regards
Andreas
> On 29.11.2011 00:16, James Harper wrote:
> >> I am still running tests 7 days a week on two test systems. Results
> >> are quite discouraging though. After experiencing crash after crash I
> >> wanted to test if the configuration I called "stable" (Xen 4.0.1,
> >> GPLPV 0.11.0.213, dom0 kernel
> >> 2.6.32.18-pvops0-ak3) was stable indeed. But even that config crashed
> >> when running my torture test. It is stable on our production systems
> >> - running other workloads of course.
> > What crash are you getting these days? Is it the same one as you used
> > to get?
>
> Yes, still exactly the same crashes.
>
> Good news: I think I have found the bug. Since I am not really a Xen or
> Windows kernel developer I cannot say for sure, but here is what I found:
>
> When the domU hung I ran xentop and found out that the number of vbd read
> requests was a number like 0x7FFFzzzz in hex, which led me to a thesis:
> GPLPV crashes as soon as the number of disk requests reaches 2^32. On my
> hardware with 5000 IOPS this is reached in
> 2^32 / 5000 IOPS / 3600 sec-per-hour / 24 hours-per-day = 9.94 days
> And there we go: those are the 9-10 days I was always seeing.
>
> I studied the source code of blkback/blktap/aio and found nothing. But in
> GPLPV and its use of the ring macros I found suspicious code in every
> version of GPLPV I ever used:
>
> while (more_to_do)
> {
>   rp = xvdd->ring.sring->rsp_prod;
>   KeMemoryBarrier();
>   for (i = xvdd->ring.rsp_cons; i < rp; i++)
>   {
>     rep = XenVbd_GetResponse(xvdd, i);
>
> If now rp is 10, for example, and xvdd->ring.rsp_cons is 0xFFFFFFF7, then
> the for loop is skipped, responses are not delivered and we see the hang.

Good work! I'm impressed :)

I'll get straight on that... I must have gone wrong somewhere very early on in development.

James
2012/1/31 Vasiliy Tolstov <v.tolstov@selfip.ru>:
> 2012/1/31 James Harper <james.harper@bendigoit.com.au>:
>>> Sorry for bumping an old thread, but where can I find the latest signed
>>> drivers that contain all fixes? =) http://www.meadowcourt.org/downloads/
>>> says that the latest version was uploaded on Sunday, 10 July 2011...
>>
>> http://www.meadowcourt.org/private/<filename>
>>
>> where <filename> is one of:
>>
>> gplpv_2000_0.11.0.357_debug.msi
>> gplpv_XP_0.11.0.357_debug.msi
>> gplpv_2003x32_0.11.0.357_debug.msi
>> gplpv_2003x64_0.11.0.357_debug.msi
>> gplpv_Vista2008x32_0.11.0.357_debug.msi
>> gplpv_Vista2008x64_0.11.0.357_debug.msi
>> gplpv_2000_0.11.0.357.msi
>> gplpv_XP_0.11.0.357.msi
>> gplpv_2003x32_0.11.0.357.msi
>> gplpv_2003x64_0.11.0.357.msi
>> gplpv_Vista2008x32_0.11.0.357.msi
>> gplpv_Vista2008x64_0.11.0.357.msi
>>
>> james

I ran some simple tests: Windows does not BSOD and gets good network speed (download is about 70-80 Mb/s, upload ~40 Mb/s), but now I get very poor disk performance =( I don't have any test results right now, but six months ago a Windows 2008 install took about 30 min; now it takes 1 hour. I use a self-made WinPE image with the Xen GPL PV drivers integrated.

--
Vasiliy Tolstov, Clodo.ru
e-mail: v.tolstov@selfip.ru
jabber: vase@selfip.ru