Han, Zhu
2006-May-10 06:26 UTC
[Xen-devel] analyze for the P1 bug 593(xensource bug tracker)
Hi, all!

Our QA team submitted bug 593 to the xensource bug tracker one month ago, and it was raised to P1 several days ago, so I have spent some time this week tracing it. Here is what I have found:

1) The bug is hard to reproduce on most of the platforms we own, especially the UP boxes. The platform on which we hit the bug and can reproduce it reliably is Paxville, which has 4 physical CPUs, each with 2 cores and 2 hyperthreads.

2) The root cause of this problem is that "losetup -d /dev/loop*" can fail, with rather low probability. "losetup -d /dev/loop*" is invoked by /etc/xen/scripts/block when the script processes the remove action. If we exhaust all the loop devices, the VMX domain cannot be initialized properly; that's why XEND complains "Error: Device creation failed for domain ****". However, if we remove the loop device manually, everything goes OK!

3) "losetup -d /dev/loop" fails because kernel/drivers/block/loop.c returns EBUSY for the LOOP_CLR_FD ioctl operation. The probable cause is that someone else still had the loop device open when we tried to delete it (see the demonstration sketch after this message).

4) The program holding the loop device open could be the VBD device driver. It opens the loop device in vbd_create() through open_by_devnum(), and closes the handle in vbd_free(), which is called by a schedulable work item, free_blkif(). Is that true? If so, the problem could arise from a race condition between the work item and the hotplug script: when the xenbus driver is notified by the xenstore thread that the frontend device has been destroyed, it removes the backend device and related resources, and then notifies the hotplug subsystem of the remove action. Because the code closing the loop device's handle and the script deleting the loop device can run concurrently, the script can fail when it tries to delete the loop device.

My questions are:
1) Does this possible race condition exist?
2) Why is the code closing the loop device deferred to a separate work item instead of all the work being done directly in blkback_remove()? Could some operation in free_blkif() block? Which one?

Since I'm really a newbie to this field, any tips and comments will be appreciated!
Thanks a lot!

Best Regards,
hanzhu
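For reference, here is a minimal userspace demonstration of the EBUSY behaviour described in point 3, assuming root, a free /dev/loop0 and 2.6-era loop.c semantics (newer kernels may instead mark the device for auto-clear and return success). The backing file name and the second "holder" descriptor are illustrative stand-ins for the handle blkback keeps open until free_blkif() runs:

#include <errno.h>
#include <fcntl.h>
#include <linux/loop.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int backing = open("/tmp/backing.img", O_RDWR | O_CREAT, 0600);
    if (backing < 0 || ftruncate(backing, 1 << 20) < 0) {
        perror("backing file");
        return 1;
    }

    int loop = open("/dev/loop0", O_RDWR);
    if (loop < 0 || ioctl(loop, LOOP_SET_FD, backing) < 0) {
        perror("attach /dev/loop0");
        return 1;
    }

    /* Second handle: plays the role of the VBD driver, which opened
     * the device in vbd_create() and has not yet run vbd_free(). */
    int holder = open("/dev/loop0", O_RDWR);

    /* "losetup -d" boils down to LOOP_CLR_FD; with the extra holder
     * still open, 2.6-era kernels refuse with EBUSY. */
    if (ioctl(loop, LOOP_CLR_FD, 0) < 0)
        printf("LOOP_CLR_FD failed: %s\n", strerror(errno));

    close(holder);  /* the deferred vbd_free() finally runs ... */

    if (ioctl(loop, LOOP_CLR_FD, 0) == 0)
        printf("LOOP_CLR_FD succeeded once the holder closed\n");

    close(loop);
    close(backing);
    return 0;
}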
Keir Fraser
2006-May-10 07:13 UTC
Re: [Xen-devel] analyze for the P1 bug 593(xensource bug tracker)
On 10 May 2006, at 07:26, Han, Zhu wrote:

> My questions are:
> 1) Does this possible race condition exist?

It certainly sounds plausible to me!

> 2) Why is the code closing the loop device deferred to a separate work
> item instead of all the work being done directly in blkback_remove()?
> Could some operation in free_blkif() block? Which one?

Several are unsafe in interrupt context, for example:
 * unbind_from_irqhandler calls free_irq, which can do procfs work
 * vbd_free calls blkdev_put, which takes a semaphore and probably does a whole load of blocking operations
 * free_vm_area calls remove_vm_area, which acquires a rw spinlock that is not interrupt safe

Correctness dictates that we should withhold the upcall to user space until the deferred operations are complete. Perhaps you could try doing a wait_for_completion() in blkback_remove, immediately after the blkif_put(). This would then block until kicked by free_blkif(). I may try to put some code together myself if I have time.

I suspect netback has a similar issue also (although of course the remove operation tends to be non-critical for net devices, so it won't usually matter!).

 -- Keir
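A minimal sketch of the completion-based fix described above (a sketch, not the actual patch): the blkif_t layout, the INIT_WORK wiring (not shown) and the xenbus accessors are assumptions based on the pre-2.6.20 workqueue API; only blkback_remove(), free_blkif() and blkif_put() are names taken from this thread, and the sketch assumes this blkif_put() drops the final reference:

#include <linux/completion.h>
#include <linux/workqueue.h>
#include <xen/xenbus.h>

typedef struct blkif_st {
    atomic_t           refcnt;
    struct work_struct free_work;   /* runs free_blkif() */
    struct completion  freed;       /* signalled by free_blkif() */
    /* ... vbd handle, irq, ring area, etc. ... */
} blkif_t;

static void free_blkif(void *arg)   /* deferred work item */
{
    blkif_t *blkif = arg;

    /* unbind_from_irqhandler(), vbd_free(), free_vm_area(), ...
     * i.e. all the blocking operations listed above. */

    complete(&blkif->freed);        /* kick the waiter */
}

static void blkif_put(blkif_t *blkif)
{
    if (atomic_dec_and_test(&blkif->refcnt))
        schedule_work(&blkif->free_work);   /* defer teardown */
}

static int blkback_remove(struct xenbus_device *dev)
{
    blkif_t *blkif = dev->dev.driver_data;

    init_completion(&blkif->freed);
    blkif_put(blkif);               /* assumed to drop the last ref */

    /* Withhold the return (and hence the hotplug upcall) until the
     * deferred teardown has really closed the loop device, so the
     * script's "losetup -d" cannot race it. */
    wait_for_completion(&blkif->freed);
    return 0;
}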
Keir Fraser
2006-May-10 12:29 UTC
Re: [Xen-devel] Re: analyze for the P1 bug 593(xensource bug tracker)
On 10 May 2006, at 14:00, han wrote:

> Your patch works quite well. We have created and destroyed the VMX
> domain more than 500 times, and everything goes OK! I believe the
> patch solves the race condition! [...]

Thanks for testing the patch. I've extended it to fix the netback driver too, and cleaned things up a little. I've applied my revised patch to both the -unstable and -testing trees.

 -- Keir
han
2006-May-10 13:00 UTC
[Xen-devel] Re: analyze for the P1 bug 593(xensource bug tracker)
Hi, Keir!

Your patch works quite well. We have created and destroyed the VMX domain more than 500 times, and everything goes OK! I believe the patch solves the race condition! You may want to put the corresponding fixes for VBD and VNIF together and send them to the mailing list; we can help you test them.

I prefer the wait_event and wakeup approach; it is clearer and more straightforward, just as you said! :-) (A sketch of that variant follows after this message.)

BTW: I'm out of the office right now, so I can't send the patch back to you. That's also why I switched to another mailbox to send this mail.

Thanks a lot for your help!

Best Regards,
hanzhu

Han, Zhu wrote:
> -----Original Message-----
> From: Han, Zhu
> Sent: 10 May 2006, 14:27
> To: Yu, Ke; 'xen-devel@lists.xensource.com'
> Cc: Helix-vmm
> Subject: analyze for the P1 bug 593 (xensource bug tracker)
>
> [original analysis quoted in full; see the first message in this thread]
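For comparison, a minimal sketch of the wait_event()/wake_up() variant mentioned above, under the same assumptions as the earlier completion sketch (hypothetical blkif_t layout; only blkback_remove(), free_blkif() and blkif_put() come from the thread, and the wait queue is assumed to be initialized at blkif creation time):

#include <linux/sched.h>
#include <linux/wait.h>
#include <xen/xenbus.h>

typedef struct blkif_st {
    atomic_t          refcnt;
    int               freed;              /* set when teardown is done */
    wait_queue_head_t waiting_to_free;    /* init at blkif creation */
    /* ... */
} blkif_t;

static void free_blkif(void *arg)         /* deferred work item */
{
    blkif_t *blkif = arg;

    /* ... unbind_from_irqhandler(), vbd_free(), free_vm_area() ... */

    blkif->freed = 1;
    wake_up(&blkif->waiting_to_free);     /* release the remover */
}

static int blkback_remove(struct xenbus_device *dev)
{
    blkif_t *blkif = dev->dev.driver_data;

    blkif_put(blkif);                     /* may schedule free_blkif() */

    /* Sleep until the deferred work has finished, so the hotplug
     * script only runs after the loop device handle is closed. */
    wait_event(blkif->waiting_to_free, blkif->freed);
    return 0;
}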