Hello James,

I am still running tests 7 days a week on two test systems. Results are quite discouraging though. After experiencing crash after crash I wanted to test whether the configuration I called "stable" (Xen 4.0.1, GPLPV 0.11.0.213, dom0 kernel 2.6.32.18-pvops0-ak3) was indeed stable. But even that config crashed when running my torture test. It is stable on our production systems - running other workloads, of course.

> One thing I thought of... virtualisation gives an interesting
> opportunity to exaggerate race conditions. If you have 8 vCPUs in a
> DomU but only let one or two physical CPUs service those 8 vCPUs, then
> it can give rise to race conditions which could only rarely be seen
> (or never be seen) in normal operation. It's awful for performance, but
> if you could try that and see if it gives rise to crashes a bit
> more frequently, it might help us track down the problem.

What exactly is the config you are talking about, in terms of the Xen/dom0 command line? In terms of domU config files?

As always, I monitor your mercurial repo ;-) How would you see the relationship of commits 952+953 to our problem? 952 seems to affect LSO in some way, since LsoV1TransmitComplete.TcpPayload is finally wrong (could it be negative, since tx_length is smaller than the fixed tx_length?). What about 953?

One more thought: as mentioned earlier, crashes often occurred after an uptime of 9-10 days, and these crashes occurred too consistently to be a "by chance" event. In my torture tests I am NOT USING a Windows NTP service (I use the Meinberg NTP daemon on Windows). But in production I do. Can you see any possible impact here?

Regards
Andreas
> Hello James,
>
> I am still running tests 7 days a week on two test systems. Results are quite
> discouraging though. After experiencing crash after crash I wanted to test if
> the configuration I called "stable" (Xen 4.0.1, GPLPV 0.11.0.213, dom0 kernel
> 2.6.32.18-pvops0-ak3) was stable indeed. But even that config crashed when
> running my torture test. It is stable on our production systems - running
> other workloads of course.

What crash are you getting these days? Is it the same one as you used to get?

> > One thing I thought of... virtualisation gives an interesting
> > opportunity to exaggerate race conditions. If you have 8 vCPUs in a
> > DomU but only let one or two physical CPUs service those 8 vCPUs, then
> > it can give rise to race conditions which could only rarely be seen
> > (or never be seen) in normal operation. It's awful for performance, but
> > if you could try that and see if it gives rise to crashes a bit
> > more frequently, it might help us track down the problem.
>
> What exactly is the config you are talking about in terms of Xen/dom0
> command line? In terms of domU config files?

I don't remember the exact syntax, but if you specify vcpus=4 and only let the DomU run on one physical CPU it might trip up more often, if the problem is caused by a race. If the problem is an arithmetic error in xennet then it won't help.

> As always, I monitor your mercurial repo ;-) How would you see the
> relationship of commits 952+953 to our problem? 952 seems to affect LSO in
> some way since LsoV1TransmitComplete.TcpPayload is finally wrong (could it
> be negative since tx_length is smaller than the fixed tx_length?). What about
> 953?

Not sure.

> One more thought: As mentioned earlier crashes often occurred after an
> uptime of 9-10 days and these crashes occurred too consistently to be a "by
> chance" event. In my torture tests I am NOT USING a Windows NTP service (I
> use the Meinberg NTP daemon on Windows). But on production I do. Can
> you see any possible impact here?
It's certainly more likely for a stray UDP packet to cause an upset, I guess. As the packets pass through a Linux firewall (iptables in Dom0) it's more likely that errant TCP packets will be dropped there.

Do you have a crash dump against 0.11.0.323?

James
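[For reference, since James doesn't recall the exact syntax: a domU config fragment along these lines would confine several vCPUs to one physical CPU. This is a sketch only - the pinned CPU number is an assumption, adjust for your host:]

```
# domU config file: 4 virtual CPUs, all restricted to physical CPU 0,
# so the vCPUs must time-share one pCPU and races get exaggerated
vcpus = 4
cpus = "0"
```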
On 29.11.2011 00:16, James Harper wrote:
>> I am still running tests 7 days a week on two test systems. Results are quite
>> discouraging though. After experiencing crash after crash I wanted to test if
>> the configuration I called "stable" (Xen 4.0.1, GPLPV 0.11.0.213, dom0 kernel
>> 2.6.32.18-pvops0-ak3) was stable indeed. But even that config crashed when
>> running my torture test. It is stable on our production systems - running
>> other workloads of course.
> What crash are you getting these days? Is it the same one as you used to
> get?

Yes, still exactly the same crashes.

Good news: I think I have found the bug. Since I am not really a Xen or Windows kernel developer I cannot say for sure, but here is what I found:

When the domU hung I ran xentop and found out that the number of vbd read requests was a number like 0x7FFFzzzz in hex, which led me to a thesis: GPLPV crashes as soon as the number of disk requests reaches 2^32. On my hardware with 5000 IOPS this is reached in

2^32 / 5000 IOPS / 3600 sec-per-hour / 24 hours-per-day = 9.94 days

And there we go: those are the 9-10 days I was always seeing.

I studied the source code of blkback/blktap/aio and found nothing. But in GPLPV and its use of the ring macros I found suspicious code in every version of GPLPV I ever used:

while (more_to_do)
{
  rp = xvdd->ring.sring->rsp_prod;
  KeMemoryBarrier();
  for (i = xvdd->ring.rsp_cons; i < rp; i++)
  {
    rep = XenVbd_GetResponse(xvdd, i);

If now rp is 10, for example, and xvdd->ring.rsp_cons is 0xFFFFFFF7, then the for loop is skipped, responses are not delivered and we see the hang.

Regards
Andreas
> On 29.11.2011 00:16, James Harper wrote:
> >> I am still running tests 7 days a week on two test systems. Results
> >> are quite discouraging though. After experiencing crash after crash I
> >> wanted to test if the configuration I called "stable" (Xen 4.0.1,
> >> GPLPV 0.11.0.213, dom0 kernel
> >> 2.6.32.18-pvops0-ak3) was stable indeed. But even that config crashed
> >> when running my torture test. It is stable on our production systems
> >> - running other workloads of course.
> > What crash are you getting these days? Is it the same one as you used
> > to get?
>
> Yes, still exactly the same crashes.
>
> Good news: I think I have found the bug. Since I am not really a Xen or
> Windows kernel developer I cannot say for sure, but here is what I found:
>
> When the domU hung I ran xentop and found out that the number of vbd read
> requests was a number like 0x7FFFzzzz in hex, which led me to a thesis:
> GPLPV crashes as soon as the number of disk requests reaches 2^32. On my
> hardware with 5000 IOPS this is reached in
> 2^32 / 5000 IOPS / 3600 sec-per-hour / 24 hours-per-day = 9.94 days
> And there we go: those are the 9-10 days I was always seeing.
>
> I studied the source code of blkback/blktap/aio and found nothing. But in
> GPLPV and its use of the ring macros I found suspicious code in every
> version of GPLPV I ever used:
>
> while (more_to_do)
> {
>   rp = xvdd->ring.sring->rsp_prod;
>   KeMemoryBarrier();
>   for (i = xvdd->ring.rsp_cons; i < rp; i++)
>   {
>     rep = XenVbd_GetResponse(xvdd, i);
>
> If now rp is 10, for example, and xvdd->ring.rsp_cons is 0xFFFFFFF7, then
> the for loop is skipped, responses are not delivered and we see the hang.

Good work! I'm impressed :)

I'll get straight on that... I must have gone wrong somewhere very early on in development.

James
2012/1/31 Vasiliy Tolstov <v.tolstov@selfip.ru>:
> 2012/1/31 James Harper <james.harper@bendigoit.com.au>:
>>> Sorry for bumping an old thread, but where can I find the latest signed
>>> drivers that contain all fixes? =) http://www.meadowcourt.org/downloads/
>>> says that the latest version was uploaded on Sunday, 10 July 2011...
>>
>> http://www.meadowcourt.org/private/<filename>
>>
>> where <filename> is one of:
>>
>> gplpv_2000_0.11.0.357_debug.msi
>> gplpv_XP_0.11.0.357_debug.msi
>> gplpv_2003x32_0.11.0.357_debug.msi
>> gplpv_2003x64_0.11.0.357_debug.msi
>> gplpv_Vista2008x32_0.11.0.357_debug.msi
>> gplpv_Vista2008x64_0.11.0.357_debug.msi
>> gplpv_2000_0.11.0.357.msi
>> gplpv_XP_0.11.0.357.msi
>> gplpv_2003x32_0.11.0.357.msi
>> gplpv_2003x64_0.11.0.357.msi
>> gplpv_Vista2008x32_0.11.0.357.msi
>> gplpv_Vista2008x64_0.11.0.357.msi
>>
>> james

I ran some simple tests: Windows does not BSOD and gets good network speed (download is about 70-80 Mb/s, upload ~40 Mb/s), but now I get very poor disk performance =( I don't have any test results right now, but six months ago a Windows 2008 install took about 30 min; now it takes 1 hour. I use a self-made WinPE image with the Xen GPL PV drivers integrated.

--
Vasiliy Tolstov, Clodo.ru
e-mail: v.tolstov@selfip.ru
jabber: vase@selfip.ru