Eric Tessler
2007-Jul-13 01:19 UTC
[Xen-users] XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
We have found a critical problem with the Xen 3.1 release (for those who are running 15-20 VMs on a single server). We are using the official Xen 3.1 release on a rackable server (Dual-Core AMD Opteron, 8GB RAM).

The problem we are seeing is that vifs intermittently fail to work properly in VMs after we create around 15-17 VMs on our server (all running at the same time, created one by one). Sometimes we can create up to 40 VMs without a problem; other times vifs begin to fail on the 15th-17th VM (each VM has 4 vifs, 1 block device, and 64MB of memory). We see the following error message on the VM's (domU's) console:

"XENBUS: Timeout connecting to device: device/vif/3 (state 6)"

At the same time in dom0, we see the following error message in /var/log/messages:

"vif vif-16-3: 1 mapping shared-frames 2310/2311 port 11"

(The error message above means that netif_map failed for some reason in XenBus.)

If we repeat this exact same test using Xen 3.0.4, we never have any problems: all vifs in all VMs work correctly. This problem must be specific to Xen 3.1.

I have searched the web and this list and have not been able to find out whether anyone else has observed this problem or whether a fix already exists (if there is a fix, please post information about it here). If there is no fix yet, I will be looking into this bug myself; any pointers on where to concentrate my debugging efforts would be appreciated (I don't know the Xen code that well).

One other strange note about this issue: if we leave the failed VM alone, we can actually create another VM without any problem (its vifs come up correctly). Afterwards, we can destroy and re-create the VM that used to fail, and it now boots without any problems (its vif comes up correctly). This smells like a race condition in the Xen code (and it suggests the failure is not due to low resources or the like).

Any help on this issue would be greatly appreciated.

Thank you,

Eric
Keir Fraser
2007-Jul-13 06:29 UTC
[Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
Can you try the 3.0.4 domU kernel against the 3.1 dom0 kernel, and vice versa? Also, turn on debug tracing in Xen (boot options 'loglvl=all guest_loglvl=all') and see what appears at the end of 'xm dmesg'.

 -- Keir

On 13/7/07 02:19, "Eric Tessler" <maiden1134@yahoo.com> wrote:

> At the same time in dom0, we see the following error message in
> /var/log/messages:
>
> "vif vif-16-3: 1 mapping shared-frames 2310/2311 port 11"
>
> (The error message above means that netif_map failed for some reason in
> XenBus.)
>
> If we repeat this exact same test using Xen 3.0.4, we never have any problems.
> All vifs in all VMs work correctly. This problem must be specific to Xen 3.1.
Eric Tessler
2007-Jul-14 02:32 UTC
Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
I was able to get some debugging in on this problem and here is what I have found.

I re-ran my test with the Xen debug options enabled as Keir suggested (I also put some debug output in netif_map and map_frontend_pages to find out exactly what was failing). The 16th VM's vif timed out again and here is what I saw in the dmesg log:

(XEN) grant_table.c:557:d1 Expanding dom (1) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d2 Expanding dom (2) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d3 Expanding dom (3) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d4 Expanding dom (4) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d5 Expanding dom (5) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d6 Expanding dom (6) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d7 Expanding dom (7) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d8 Expanding dom (8) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d9 Expanding dom (9) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d10 Expanding dom (10) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d11 Expanding dom (11) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d12 Expanding dom (12) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d13 Expanding dom (13) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d14 Expanding dom (14) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d15 Expanding dom (15) grant table from (4) to (5) frames.
(XEN) grant_table.c:557:d16 Expanding dom (16) grant table from (4) to (5) frames.
(XEN) mm.c:2605:d0 Could not find L1 PTE for address d1400000

You can see from the above that the first 15 VMs are OK, and the 16th VM fails with the last error message from mm.c. I attempted to trace upwards what exactly was failing, so I enabled debug output in linux-2.6-xen-sparse/drivers/xen/netback/interface.c (this is where netif_map() is located). I then observed the following output in /var/log/messages when the 16th VM's vif timed out:

(map_frontend_pages:227) Gnttab failure mapping rx_ring_ref!
(netif_map:274) map frontend pages failed   [I added this debug output]
vif vif-16-3: 1 mapping shared-frames 2310/2311 port 11

The error message from mm.c in the dmesg log is coming from the function create_grant_va_mapping (a call to guest_map_l1e() is returning NULL).

In summary, it looks like the mapping of the RX shared-memory ring is failing (the TX mapping succeeds; it always fails on the mapping of the RX ring). Another interesting note is that the address dumped in the dmesg log is always the same: d1400000 (I saw the failure about 10 times today and the address never changed).

Also, at Keir's suggestion, I tried the Xen 3.0.4 kernel (2.6.16.33) in my 16th VM; it failed the same way. The only difference is that instead of expanding the grant table from 4 to 5 frames, it was expanded from 4 to 16 frames:

(XEN) grant_table.c:557:d18 Expanding dom (18) grant table from (4) to (16) frames.
(XEN) mm.c:2605:d0 Could not find L1 PTE for address d1400000

I believe the following stack trace represents the failure path (starting from within XenBus, traced by hand):

connect_rings               linux-2.6-xen-sparse/drivers/xen/netback/xenbus.c
netif_map                   linux-2.6-xen-sparse/drivers/xen/netback/interface.c
map_frontend_pages          linux-2.6-xen-sparse/drivers/xen/netback/interface.c
__gnttab_map_grant_ref      (hypercall) xen/common/grant_table.c
create_grant_host_mapping   xen/arch/x86/mm.c
create_grant_va_mapping     xen/arch/x86/mm.c
guest_map_l1e               xen/arch/x86/mm.c  (this is the function that is ultimately failing)

Any clue as to what is causing this failure or how to fix it? Is there any other debug info I can provide that would help in resolving this issue? I have some free time tomorrow to debug this, but I need some direction; this is an area of Xen I don't understand very well.

I am also thinking about downloading the Xen unstable tree and trying that to see if the problem exists there as well.

Thanks,

Eric

Keir Fraser <keir@xensource.com> wrote:

> Can you try the 3.0.4 domU kernel against the 3.1 dom0 kernel, and vice versa?
> Also, turn on debug tracing in Xen (boot options 'loglvl=all guest_loglvl=all')
> and see what appears at the end of 'xm dmesg'.
>
> -- Keir
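[For readers following the trace: below is a simplified sketch of the failure point, modelled on the call chain and log messages Eric lists above. The function names and the error string come from the trace; the body is illustrative, not the verbatim Xen 3.1 source.]

    /* Sketch of the tail of the grant-mapping path in xen/arch/x86/mm.c.
     * guest_map_l1e() walks the currently-running page tables to find the
     * L1 entry covering the given virtual address. If dom0's page tables
     * have no L1 table present for that address yet, it returns NULL and
     * the grant mapping fails with the "Could not find L1 PTE" message. */
    static int create_grant_va_mapping(unsigned long va, l1_pgentry_t nl1e,
                                       struct vcpu *v)
    {
        unsigned long gl1mfn;
        l1_pgentry_t *pl1e = guest_map_l1e(v, va, &gl1mfn);

        if ( pl1e == NULL )
        {
            gdprintk(XENLOG_WARNING,
                     "Could not find L1 PTE for address %lx\n", va);
            return GNTST_general_error;
        }
        /* ... otherwise write nl1e into *pl1e, flush, and unmap ... */
        return GNTST_okay;
    }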
Keir Fraser
2007-Jul-14 06:43 UTC
Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
What dom0 kernel image are you running? It looks like vmalloc_sync_all(), called from alloc_vm_area(), has not caused the PTE that will map the RX ring to be made present in the currently-running page tables. The code looks okay on inspection, though.

 -- Keir

On 14/7/07 03:32, "Eric Tessler" <maiden1134@yahoo.com> wrote:

> Also, at Keir's suggestion, I tried the Xen 3.0.4 kernel (2.6.16.33) in my
> 16th VM; it failed the same way. The only difference is that instead of
> expanding the grant table from 4 to 5 frames, it was expanded from 4 to 16
> frames:
>
> (XEN) grant_table.c:557:d18 Expanding dom (18) grant table from (4) to (16) frames.
> (XEN) mm.c:2605:d0 Could not find L1 PTE for address d1400000
>
> I believe the following stack trace represents the failure path (starting
> from within XenBus, traced by hand):
>
> connect_rings               linux-2.6-xen-sparse/drivers/xen/netback/xenbus.c
> netif_map                   linux-2.6-xen-sparse/drivers/xen/netback/interface.c
> map_frontend_pages          linux-2.6-xen-sparse/drivers/xen/netback/interface.c
> __gnttab_map_grant_ref      (hypercall) xen/common/grant_table.c
> create_grant_host_mapping   xen/arch/x86/mm.c
> create_grant_va_mapping     xen/arch/x86/mm.c
> guest_map_l1e               xen/arch/x86/mm.c
>
> Any clue as to what is causing this failure or how to fix it? Is there any
> other debug info I can provide that would help in resolving this issue?
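[For context, the dom0-side allocation Keir refers to looks roughly like the sketch below, simplified from the linux-2.6-xen-sparse helper: netback reserves a kernel virtual-address range for each shared ring with alloc_vm_area(), and vmalloc_sync_all() is supposed to make the new page-table entries visible in every set of page tables before Xen maps the frontend's grant into that range.]

    #include <linux/vmalloc.h>

    /* Sketch of alloc_vm_area() as used by netback for the TX/RX rings
     * (simplified). */
    struct vm_struct *alloc_vm_area(unsigned long size)
    {
        struct vm_struct *area;

        area = get_vm_area(size, VM_IOREMAP);
        if (area == NULL)
            return NULL;

        /* Ensure page tables are constructed for this region of kernel
         * virtual address space and propagated to all page tables. If
         * this synchronisation is incomplete, Xen's guest_map_l1e()
         * finds no L1 PTE for the ring address, which is exactly the
         * failure seen above. */
        vmalloc_sync_all();

        return area;
    }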
Keir Fraser
2007-Jul-14 09:01 UTC
Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
Now fixed in the staging tree. The patch (for your dom0 kernel) is also attached to this email. Thanks for your help in tracking this one down!

 -- Keir

On 14/7/07 07:43, "Keir Fraser" <keir@xensource.com> wrote:

> What dom0 kernel image are you running? It looks like vmalloc_sync_all(),
> called from alloc_vm_area(), has not caused the PTE that will map the RX
> ring to be made present in the currently-running page tables. The code
> looks okay on inspection, though.
>
> -- Keir
Eric Tessler
2007-Jul-15 06:15 UTC
Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
I applied the patch and rebuilt Xen - this did indeed resolve the problem. My test can now create 40 VMs without any failures. I will leave the test running for a few days to make sure.

Thanks for the help,

Eric

Keir Fraser <keir@xensource.com> wrote:

> Now fixed in the staging tree. The patch (for your dom0 kernel) is also
> attached to this email. Thanks for your help in tracking this one down!
>
> -- Keir
Thomas Ronner
2007-Jul-24 11:06 UTC
Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
Hi Keir,

Keir Fraser wrote:

> Now fixed in the staging tree. The patch (for your dom0 kernel) is also
> attached to this email.

I have a similar problem with vbds instead of vifs:

(domU:)
XENBUS: Timeout connecting to device: device/vbd/2049 (state 6)
XENBUS: Timeout connecting to device: device/vbd/2052 (state 6)
XENBUS: Timeout connecting to device: device/vbd/2050 (state 6)
XENBUS: Timeout connecting to device: device/vbd/2051 (state 6)

Does your patch also fix this (in theory)? This is a production machine so I'm somewhat reluctant to try things before knowing what they do. I'll attach the full domU output below. This is using a custom kernel without modules (I hate having to deploy modules in all domUs) and kernel-level IP autoconfiguration (I like having this info in the Xen config file).

There are other domUs on this machine with similar configs having no problem at all.

Regards, Thomas

---8<--[ domU output ]------------------------------------------
[root@diana ~]# xm create vechtstreek_test -c
Using config file "/etc/xen/vechtstreek_test".
Started domain vechtstreek_test
Linux version 2.6.18-tr01 (root@diana.zoo.cs.uu.nl) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)) #2 SMP Fri Jul 20 12:14:40 CEST 2007
BIOS-provided physical RAM map:
 Xen: 0000000000000000 - 0000000010800000 (usable)
0MB HIGHMEM available.
264MB LOWMEM available.
NX (Execute Disable) protection: active
Allocating PCI resources starting at 20000000 (gap: 10800000:ef800000)
Detected 3200.282 MHz processor.
Built 1 zonelists.  Total pages: 67584
Kernel command line: root=/dev/sda1 ro ip=131.211.84.207:1.2.3.4:131.211.84.193:255.255.255.192:vechtstreek_test:eth0:off
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 2048 (order: 11, 8192 bytes)
Xen reported: 3200.112 MHz processor.
Console: colour dummy device 80x25
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Software IO TLB disabled
vmalloc area: d1000000-f51fe000, maxmem 2d7fe000
Memory: 251648k/270336k available (3953k kernel code, 10220k reserved, 1648k data, 216k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 6403.14 BogoMIPS (lpj=32015708)
Security Framework v1.0.0 initialized
Capability LSM initialized
Mount-cache hash table entries: 512
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 2048K
Checking 'hlt' instruction... OK.
SMP alternatives: switching to UP code
Freeing SMP alternatives: 20k freed
Brought up 1 CPUs
migration_cost=0
checking if image is initramfs... it is
Freeing initrd memory: 588k freed
NET: Registered protocol family 16
Brought up 1 CPUs
xen_mem: Initialising balloon driver.
SCSI subsystem initialized
NET: Registered protocol family 2
IP route cache hash table entries: 4096 (order: 2, 16384 bytes)
TCP established hash table entries: 16384 (order: 5, 131072 bytes)
TCP bind hash table entries: 8192 (order: 4, 65536 bytes)
TCP: Hash tables configured (established 16384 bind 8192)
TCP reno registered
audit: initializing netlink socket (disabled)
audit(1185274517.008:1): initialized
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
NTFS driver 2.1.27 [Flags: R/O].
fuse init (API version 7.7)
OCFS2 1.3.3
OCFS2 Node Manager 1.3.3
OCFS2 DLM 1.3.3
OCFS2 DLMFS 1.3.3
OCFS2 User DLM kernel interface loaded
seclvl: seclvl_init: seclvl: Failure registering with the kernel.
seclvl: seclvl_init: seclvl: Failure registering with primary security module.
seclvl: Error during initialization: rc = [-22]
Initializing Cryptographic API
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered (default)
rtc: IRQ 8 is not free.
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
loop: loaded (max 8 devices)
nbd: registered device at major 43
tun: Universal TUN/TAP device driver, 1.6
tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
Xen virtual console successfully installed as tty1
Event-channel device installed.
netfront: Initialising virtual ethernet driver.
Loading iSCSI transport class v1.1-646.
iscsi: registered transport (tcp)
register_blkdev: cannot get major 8 for sd
vbd vbd-2049: 19 xlvbd_add at /local/domain/0/backend/vbd/17/2049
i8042.c: No controller found.
mice: PS/2 mouse device common for all mice
register_blkdev: cannot get major 8 for sd
vbd vbd-2049: 19 xlvbd_add at /local/domain/0/backend/vbd/17/2049
device-mapper: ioctl: 4.7.0-ioctl (2006-06-24) initialised: dm-devel@redhat.com
device-mapper: multipath: version 1.0.4 loaded
device-mapper: multipath round-robin: version 1.0.0 loaded
register_blkdev: cannot get major 8 for sd
vbd vbd-2052: 19 xlvbd_add at /local/domain/0/backend/vbd/17/2052
dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-2)
netem: version 1.2
u32 classifier
 Performance counters on
 OLD policer on
Netfilter messages via NETLINK v0.30.
IPv4 over IPv4 tunneling driver
register_blkdev: cannot get major 8 for sd
vbd vbd-2052: 19 xlvbd_add at /local/domain/0/backend/vbd/17/2052
GRE over IPv4 tunneling driver
ip_conntrack version 2.4 (2112 buckets, 16896 max) - 228 bytes per conntrack
register_blkdev: cannot get major 8 for sd
vbd vbd-2050: 19 xlvbd_add at /local/domain/0/backend/vbd/17/2050
register_blkdev: cannot get major 8 for sd
vbd vbd-2050: 19 xlvbd_add at /local/domain/0/backend/vbd/17/2050
register_blkdev: cannot get major 8 for sd
vbd vbd-2051: 19 xlvbd_add at /local/domain/0/backend/vbd/17/2051
register_blkdev: cannot get major 8 for sd
vbd vbd-2051: 19 xlvbd_add at /local/domain/0/backend/vbd/17/2051
netfront: device eth0 has copying receive path.
ctnetlink v0.90: registering with nfnetlink.
ip_conntrack_pptp version 3.1 loaded
ip_nat_pptp version 3.0 loaded
ip_tables: (C) 2000-2006 Netfilter Core Team
ClusterIP Version 0.8 loaded successfully
arp_tables: (C) 2002 David S. Miller
IPVS: Registered protocols (TCP, UDP, AH, ESP)
IPVS: Connection hash table configured (size=4096, memory=32Kbytes)
IPVS: ipvs loaded.
IPVS: [rr] scheduler registered.
IPVS: [wrr] scheduler registered.
IPVS: [lc] scheduler registered.
IPVS: [wlc] scheduler registered.
IPVS: [lblc] scheduler registered.
IPVS: [lblcr] scheduler registered.
IPVS: [dh] scheduler registered.
IPVS: [sh] scheduler registered.
IPVS: [sed] scheduler registered.
IPVS: [nq] scheduler registered.
IPVS: ftp: loaded support on port[0] = 21
TCP bic registered
TCP cubic registered
TCP westwood registered
TCP highspeed registered
TCP hybla registered
TCP htcp registered
TCP vegas registered
TCP veno registered
TCP scalable registered
TCP lp registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
IPv6 over IPv4 tunneling driver
ip6_tables: (C) 2000-2006 Netfilter Core Team
NET: Registered protocol family 17
NET: Registered protocol family 15
Bridge firewalling registered
Ebtables v2.0 registered
ebt_ulog: not logging via ulog since somebody else already registered for PF_BRIDGE
802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com>
All bugs added by David S. Miller <davem@redhat.com>
ieee80211: 802.11 data/management/control stack, git-1.1.13
ieee80211: Copyright (C) 2004-2005 Intel Corporation <jketreno@linux.intel.com>
Using IPI No-Shortcut mode
XENBUS: Timeout connecting to device: device/vbd/2049 (state 6)
XENBUS: Timeout connecting to device: device/vbd/2052 (state 6)
XENBUS: Timeout connecting to device: device/vbd/2050 (state 6)
XENBUS: Timeout connecting to device: device/vbd/2051 (state 6)
XENBUS: Device with no driver: device/console/0
IP-Config: Complete:
      device=eth0, addr=131.211.84.207, mask=255.255.255.192, gw=131.211.84.193,
      host=vechtstreek_test, domain=, nis-domain=(none),
      bootserver=1.2.3.4, rootserver=1.2.3.4, rootpath
Freeing unused kernel memory: 216k freed
Red Hat nash version 4.1.18 starting
Mounted /proc filesystem
Mounting sysfs
Creating /dev
Starting udev
Creating root device
Mounting root filesystem
mount: error 6 mounting ext3
mount: error 2 mounting none
Switching to new root
switchroot: mount failed: 22
umount /initrd/dev failed: 2
Kernel panic - not syncing: Attempted to kill init!
--------------------------------------------------------

---8<--[ /etc/xen/vechtstreek_test ]--------------------
kernel = "/boot/vmlinux-stripped"
ramdisk = "/boot/initrd-xenU-tr01"
memory = 256
name = "vechtstreek_test"
vif = [ 'mac=00:00:6C:00:00:0D' ]
disk = [ 'phy:sata/vechtstreek_root,sda1,w',
         'phy:sata/vechtstreek_swap,sda4,w',
         'phy:sata/vechtstreek_var,sda2,w',
         'phy:sata/vechtstreek_home,sda3,w' ]
ip="131.211.84.207"
netmask="255.255.255.192"
gateway="131.211.84.193"
hostname="vechtstreek_test"
root = "/dev/sda1 ro"
--------------------------------------------------------
Keir Fraser
2007-Jul-24 11:46 UTC
Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
Hi Thomas,

This problem is entirely different. The problem is visible earlier in your console output: the Xen block-device driver is unable to acquire the device-number space for SCSI devices (sda, sdb, etc.). Hence it is failing to initialise the vbd connections to the backend and is ending up in state 6 (which is XenbusStateClosed).

The solutions you have are:

1. Do not build the generic SCSI subsystem into your dom0 kernels. It is this subsystem which (quite reasonably) is allocating the sd* number space to the exclusion of the Xen block-device driver.

2. Call your devices hd* instead of sd* (i.e., hijack the IDE device numbers instead of the SCSI ones), or even use the xvd* number space, which is exclusively reserved for Xen VBDs.

Hope this helps,
Keir

On 24/7/07 12:06, "Thomas Ronner" <thomas@cs.uu.nl> wrote:

> I have a similar problem with vbds instead of vifs:
>
> (domU:)
> XENBUS: Timeout connecting to device: device/vbd/2049 (state 6)
> XENBUS: Timeout connecting to device: device/vbd/2052 (state 6)
> XENBUS: Timeout connecting to device: device/vbd/2050 (state 6)
> XENBUS: Timeout connecting to device: device/vbd/2051 (state 6)
>
> Does your patch also fix this (in theory)? This is a production machine
> so I'm somewhat reluctant to try things before knowing what they do.
> I'll attach the full domU output below.
>
> There are other domUs on this machine with similar configs having no
> problem at all.
>
> [...]
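[The repeated "register_blkdev: cannot get major 8 for sd" lines in Thomas's log correspond to a collision like the one sketched below. This is illustrative only: the helper name is hypothetical, but register_blkdev() and SCSI_DISK0_MAJOR (major 8) are the real kernel interfaces. The config-side alternative Keir mentions is just a rename in the disk lines, e.g. 'phy:sata/vechtstreek_root,xvda1,w' instead of ',sda1,w'.]

    #include <linux/fs.h>      /* register_blkdev() */
    #include <linux/major.h>   /* SCSI_DISK0_MAJOR == 8 */

    /* Hypothetical sketch of the collision: the in-kernel SCSI disk
     * driver registers major 8 first, so blkfront's attempt to register
     * the same major for a vbd named sda* fails, the device never
     * connects, and xenbus is left in state 6 (XenbusStateClosed). */
    static int xlvbd_register_sd_major(void)
    {
        int err = register_blkdev(SCSI_DISK0_MAJOR, "sd");
        if (err) {
            printk(KERN_WARNING
                   "register_blkdev: cannot get major %d for sd\n",
                   SCSI_DISK0_MAJOR);
            return err;  /* caller falls back or gives up on this vbd */
        }
        return 0;
    }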
Thomas Ronner
2007-Jul-24 12:34 UTC
Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
Hi Keir,

Thanks for your quick reply!

Keir Fraser wrote:

> This problem is entirely different. The problem is visible earlier in your
> console output: the Xen block-device driver is unable to acquire the
> device-number space for SCSI devices (sda, sdb, etc.). Hence it is failing to
> initialise the vbd connections to the backend and is ending up in state 6
> (which is XenbusStateClosed).

I don't understand. Which Xen block-device driver is unable to? The frontend or the backend? This never happened on Xen 2, at least I never encountered it.

> The solutions you have are:
> 1. Do not build the generic SCSI subsystem into your dom0 kernels. It is
> this subsystem which (quite reasonably) is allocating the sd* number space
> to the exclusion of the Xen block-device driver.

This is not possible, as the physical machine has SCSI disks and a SATA disk (which also uses the SCSI subsystem).

> 2. Call your devices hd* instead of sd* (i.e., hijack the IDE device
> numbers instead of the SCSI ones), or even use the xvd* number space, which
> is exclusively reserved for Xen VBDs.

I tried hd*, which works. I'm used to making sd* devices as there used to be some Xen version (I forget which one) that was more stable when using sd* devices in domUs.

> Hope this helps,
> Keir

Thanks, Thomas
Keir Fraser
2007-Jul-24 13:26 UTC
Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
>> This problem is entirely different. The problem is visible earlier in your
>> console output: the Xen block-device driver is unable to acquire the
>> device-number space for SCSI devices (sda, sdb, etc.). Hence it is failing
>> to initialise the vbd connections to the backend and is ending up in state
>> 6 (which is XenbusStateClosed).
>
> I don't understand. Which Xen block-device driver is unable to? The
> frontend or the backend? This never happened on Xen 2, at least I never
> encountered it.

The driver in domU. If you never saw this problem with Xen 2, that's because the domU kernels you used at that time did not have the normal SCSI subsystem compiled into them.

>> The solutions you have are:
>> 1. Do not build the generic SCSI subsystem into your dom0 kernels. It is
>> this subsystem which (quite reasonably) is allocating the sd* number space
>> to the exclusion of the Xen block-device driver.
>
> This is not possible, as the physical machine has SCSI disks and a SATA
> disk (which also uses the SCSI subsystem).

Sorry, that was a typo. I meant you should not build it into your *domU* kernels. It is of course fine to have ordinary SCSI compiled into dom0.

>> 2. Call your devices hd* instead of sd* (i.e., hijack the IDE device
>> numbers instead of the SCSI ones), or even use the xvd* number space,
>> which is exclusively reserved for Xen VBDs.
>
> I tried hd*, which works. I'm used to making sd* devices as there used
> to be some Xen version (I forget which one) that was more stable when
> using sd* devices in domUs.

Sounds unlikely to me; sd* and hd* are just names. Anyhow, if you stop compiling SCSI into your domU kernel then you can continue to use sd* names for your VBDs.

 -- Keir
Thomas Ronner
2007-Jul-25 08:12 UTC
Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
Keir Fraser wrote:

> The driver in domU. If you never saw this problem with Xen 2, that's
> because the domU kernels you used at that time did not have the normal
> SCSI subsystem compiled into them.

Other domUs with similar configs and the same kernel run fine. It was only when I started the 7th domU that the problems began. But I'll try your suggestion of compiling a domU kernel without SCSI. I wonder what I was thinking including it in the first place.

Thanks, Thomas
Keir Fraser
2007-Jul-25 08:15 UTC
Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
On 25/7/07 09:12, "Thomas Ronner" <thomas@cs.uu.nl> wrote:

> Other domUs with similar configs and the same kernel run fine. It was
> only when I started the 7th domU that the problems began.

I have to be skeptical about that. If you run the same kernel and basically the same configuration, this problem should occur deterministically for all such domains. So something must be different in domUs 1 through 6!

 -- Keir
Thomas Ronner
2007-Jul-25 09:55 UTC
Re: [Xen-users] Re: XEN 3.1: critical bug: vif init failure after creating 15-17 VMs (XENBUS: Timeout connecting to device: device/vif)
Hi Keir,

Keir Fraser wrote:

> I have to be skeptical about that. If you run the same kernel and basically
> the same configuration, this problem should occur deterministically for all
> such domains. So something must be different in domUs 1 through 6!

No. I copied the config file in /etc/xen from a working config every time I created a new domU: change names and IPs, and go. Block devices for the different domUs are LVM-backed ext3 file systems which I created by unpacking a tar - the same tar every time.

But... there is some spooky shit going on here. I did

xm shutdown wikitest
xm create wikitest -c

(wikitest is another domain, one of the 6 mentioned earlier.) It didn't come up; it behaved the same as the other domain (the 'vechtstreek_test' domain I originally reported about). It boots fine with the newer kernel without SCSI support. Yet it started fine a couple of days ago using the kernel with SCSI support built in; I'm 100% certain about that.

Thomas