When running two scripts for hvm save/restore iterations at the same time, I got this message after around 150 iterations (combined between the two scripts). One of the processes is now stuck on "saving"...

kobject_add failed for tap0 with -EEXIST, don't try to register things with the same name in the same directory.

Call Trace:
 [<ffffffff802e9af5>] kobject_add+0x16e/0x199
 [<ffffffff8034522b>] class_device_add+0xaf/0x441
 [<ffffffff802e97c0>] kobject_get+0x12/0x17
 [<ffffffff80394a39>] register_netdevice+0x23b/0x30b
 [<ffffffff882df972>] :tun:tun_chr_ioctl+0x272/0x4db
 [<ffffffff8027e6f0>] do_filp_open+0x2d/0x3d
 [<ffffffff80291621>] do_ioctl+0x55/0x6b
 [<ffffffff80291889>] vfs_ioctl+0x252/0x26b
 [<ffffffff802918fb>] sys_ioctl+0x59/0x7a
 [<ffffffff8020a640>] tracesys+0xa7/0xb3

--
Mats

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Looks like a dom0 kernel bug?

 -- Keir

On 3/5/07 16:39, "Petersson, Mats" <Mats.Petersson@amd.com> wrote:

> kobject_add failed for tap0 with -EEXIST, don't try to register things
> with the same name in the same directory.
> -----Original Message-----
> From: Keir Fraser [mailto:keir@xensource.com]
> Sent: 03 May 2007 17:41
> To: Petersson, Mats; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] "kobject add failed"
>
> Looks like a dom0 kernel bug?

Yes, that's what I think too...

I haven't seen it before. Any idea why this would happen now, and not before?

Why would it happen only when doing two save/restore sessions (of different domains, of course) on the same machine (which I have done before, but not that recently)?

I got a second backtrace like that. I've since tried to avoid it by removing the (unnecessary) vif= in the setup of the simple-guest (it's got no code to deal with network devices anyway).

--
Mats
On 3/5/07 17:45, "Petersson, Mats" <Mats.Petersson@amd.com> wrote:

> I haven't seen it before. Any idea why this would happen now, and not
> before?

Because now you are doing two save/restores in a loop at the same time.

> Why would it happen only when doing two save/restore sessions (of
> different domains of course) on the same machine (which I have done
> before - but not that recently).

It looks like there might be a race in drivers/net/tun.c:tun_set_iff(). Two invocations of ioctl(TUNSETIFF) can both resolve "tap%d" to "tap0" (because both observe that tap0 is not registered). The second one to execute register_netdevice() then bugs out because the interface already exists!

However, the invocation of tun_set_iff() is wrapped in rtnl_lock()/unlock() so should be concurrency safe. Still, this is where I would concentrate my search if I were you.

 -- Keir
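The interleaving suspected above can be replayed in a small userspace sketch. This is not the tun driver code: `struct registry`, `resolve_name()`, `register_name()` and `replay_race()` are hypothetical names invented for illustration, and the two "callers" are simply run in the problematic order on one thread, assuming that name resolution and registration are two separate steps with nothing serializing them.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define MAX_IFS  8
#define NAME_LEN 16

/* Hypothetical stand-in for the kernel's table of registered interfaces. */
struct registry {
    char names[MAX_IFS][NAME_LEN];
    int count;
};

static int is_registered(const struct registry *r, const char *name)
{
    for (int i = 0; i < r->count; i++)
        if (strcmp(r->names[i], name) == 0)
            return 1;
    return 0;
}

/* Step 1: expand "<prefix>%d" to the first name that is not yet registered. */
static int resolve_name(const struct registry *r, const char *prefix,
                        char *out, size_t len)
{
    for (int i = 0; i < MAX_IFS; i++) {
        snprintf(out, len, "%s%d", prefix, i);
        if (!is_registered(r, out))
            return 0;
    }
    return -ENFILE;
}

/* Step 2: register the name; a duplicate is rejected with -EEXIST. */
static int register_name(struct registry *r, const char *name)
{
    if (is_registered(r, name))
        return -EEXIST;
    if (r->count == MAX_IFS)
        return -ENFILE;
    snprintf(r->names[r->count++], NAME_LEN, "%s", name);
    return 0;
}

/*
 * Replays the suspected ordering: both callers resolve their name before
 * either one registers. Returns the result seen by the second caller.
 */
static int replay_race(void)
{
    struct registry r = { .count = 0 };
    char a[NAME_LEN], b[NAME_LEN];

    resolve_name(&r, "tap", a, sizeof(a));  /* caller A sees tap0 as free */
    resolve_name(&r, "tap", b, sizeof(b));  /* caller B sees tap0 as free too */
    assert(strcmp(a, b) == 0);              /* both picked the same name */

    register_name(&r, a);                   /* A registers tap0 */
    return register_name(&r, b);            /* B collides with A */
}
```

Calling `replay_race()` returns `-EEXIST` for the second caller, which matches the error in the report. Whether the real driver can actually reach this ordering despite rtnl_lock() is exactly the open question in this thread.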
> -----Original Message-----
> From: Keir Fraser [mailto:keir@xensource.com]
> Sent: 03 May 2007 18:03
> To: Petersson, Mats; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] "kobject add failed"
>
> It looks like there might be a race in drivers/net/tun.c:tun_set_iff().
> Two invocations of ioctl(TUNSETIFF) can both resolve "tap%d" to "tap0"
> (because both observe that tap0 is not registered). The second one to
> execute register_netdevice() then bugs out because the interface
> already exists!

Ok, I'll try to track that down - not sure I'm at home figuring how to fix it, but at least I can probably prove that it is re-entrancy in the function or not. It will have to wait until Wednesday tho', as I'm off work tomorrow and Tuesday (and for those not familiar with UK holidays, Monday is a "bank-holiday", meaning "public holiday").

> However, the invocation of tun_set_iff() is wrapped in
> rtnl_lock()/unlock() so should be concurrency safe. Still, this is
> where I would concentrate my search if I were you.

Indeed, the tun_set_iff() is protected by rtnl_lock/unlock()... Any reason to believe that doesn't work?

--
Mats
On 3/5/07 18:19, "Petersson, Mats" <Mats.Petersson@amd.com> wrote:

>> However, the invocation of tun_set_iff() is wrapped in
>> rtnl_lock()/unlock() so should be concurrency safe. Still, this is
>> where I would concentrate my search if I were you.
>
> Indeed, the tun_set_iff() is protected by rtnl_lock/unlock()... Any
> reason to believe that doesn't work?

Your crash report. :-)

However tun_set_iff() is not in the oops backtrace... Perhaps I'm wrong about which ioctl is being executed, or the path taken through the Linux kernel.

However, it definitely does look like a lack of locking around picking a device name and registering it. I'm fairly confident that this will turn out to be the problem. You might want to add some tracing to the kernel to find out what paths you take during qemu-dm startup (this might cause the race not to happen any more, but you can remove the tracing once you know which functions you should be staring at).

 -- Keir
> -----Original Message-----
> From: Keir Fraser [mailto:keir@xensource.com]
> Sent: 03 May 2007 18:27
> To: Petersson, Mats; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] "kobject add failed"
>
> However tun_set_iff() is not in the oops backtrace... Perhaps I'm wrong
> about which ioctl is being executed, or the path taken through the
> Linux kernel.

According to my dump of the tun.o object file, "tun_set_iff" is inlined into tun_chr_ioctl(), so it's no real surprise it's not in the call-stack...

> However, it definitely does look like a lack of locking around picking
> a device name and registering it. I'm fairly confident that this will
> turn out to be the problem. You might want to add some tracing to the
> kernel to find out what paths you take during qemu-dm startup (this
> might cause the race not to happen any more, but you can remove the
> tracing once you know which functions you should be staring at).

I'll look at it on Wednesday.

--
Mats
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of
> Petersson, Mats
> Sent: 03 May 2007 18:37
> To: Keir Fraser; xen-devel@lists.xensource.com
> Subject: RE: [Xen-devel] "kobject add failed"
>
> According to my dump of the tun.o object file, "tun_set_iff" is inlined
> into tun_chr_ioctl(), so it's no real surprise it's not in the
> call-stack...

Just to get back to this rather old subject (I got side-tracked with the "stuff left in xenstore" problem last week), here's the comment for register_netdevice:

/**
 *      register_netdevice - register a network device
 *      @dev: device to register
 *
 *      Take a completed network device structure and add it to the kernel
 *      interfaces. A %NETDEV_REGISTER message is sent to the netdev notifier
 *      chain. 0 is returned on success. A negative errno code is returned
 *      on a failure to set up the device, or if the name is a duplicate.
 *
 *      Callers must hold the rtnl semaphore. You may want
 *      register_netdev() instead of this.
 *
 *      BUGS:
 *      The locking appears insufficient to guarantee two parallel registers
 *      will not get the same name.
 */

So it seems like there's a problem using the mutex-locking to prevent a name-collision.

I don't know if I should pursue this... I presume that if it was real trivial to fix, it would have been fixed already... ;-)

--
Mats
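For contrast with the BUGS note, the general principle of keeping name selection and registration inside one critical section can be sketched in userspace C. This is an assumption-laden illustration and not a proposed kernel fix: the name table, `table_lock` and `alloc_and_register()` are hypothetical, and a pthread mutex merely stands in for whatever lock would have to cover both steps.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MAX_IFS  8
#define NAME_LEN 16

/* Hypothetical stand-ins: a table of names and the single lock guarding it. */
static char names[MAX_IFS][NAME_LEN];
static int count;
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold table_lock. */
static int is_registered(const char *name)
{
    for (int i = 0; i < count; i++)
        if (strcmp(names[i], name) == 0)
            return 1;
    return 0;
}

/*
 * Pick the first free "<prefix><n>" and register it without dropping the
 * lock in between, so no other caller can observe the name as free after
 * it has been chosen. Returns 0 and fills `out` on success, or -ENFILE
 * when the table is full.
 */
static int alloc_and_register(const char *prefix, char *out, size_t len)
{
    int ret = -ENFILE;

    assert(out != NULL && len > 0);

    pthread_mutex_lock(&table_lock);
    for (int i = 0; i < MAX_IFS && count < MAX_IFS; i++) {
        snprintf(out, len, "%s%d", prefix, i);
        if (!is_registered(out)) {
            /* Choose and register under the same lock hold. */
            snprintf(names[count++], NAME_LEN, "%s", out);
            ret = 0;
            break;
        }
    }
    pthread_mutex_unlock(&table_lock);
    return ret;
}
```

Two successive calls with the prefix "tap" yield "tap0" and then "tap1". The point of the sketch is only that the check and the insert share one lock hold; it says nothing about where the gap actually is in the kernel path discussed here.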
Petersson, Mats
2007-May-15 12:49 UTC
RE: [Xen-devel][PATCH][RFC] "kobject add failed" Suggested workaround.
> -----Original Message-----
> From: Petersson, Mats
> Sent: 14 May 2007 17:37
> To: Petersson, Mats; Keir Fraser; xen-devel@lists.xensource.com
> Subject: RE: [Xen-devel] "kobject add failed"
>
> So it seems like there's a problem using the mutex-locking to
> prevent a name-collision.
>
> I don't know if I should pursue this... I presume that if it
> was real trivial to fix, it would have been fixed already... ;-)

Would a patch like the attached be a valid workaround for this problem?

--
Mats