thr3ads.net - Nut upsdev - [Nut-upsdev] a nasty kernel oops [Jan 2011]

If this information is useful, please help other people find it:
Share via:

Alfred Ganz

2011-Jan-12 15:08 UTC

[Nut-upsdev] a nasty kernel oops

Ladies and Gentlemen,

I have been trying to sort out a nasty kernel oops for which I would
like to ask for some advice. I don't actually think that this is a nut
problem, although it is triggered by upsdrvctl or usbhid-ups. I rather
suspect the USB library or the associated kernel code.
Here is the configuration information:
 * system: Centos-5.5
 * kernel: 2.6.18-194.32.1.el5, latest vanilla for Centos-5.5
 * nut: nut-2.4.3-1, a build provided by Arnaud last fall
 * libusb: libusb-0.1.12-5.1, the latest for this system
 * the UPS: 
	device.mfr: APC
	device.model: Back-UPS ES 650
	device.serial: QB0514232934  
	device.type: ups
	ups.firmware: 818.w1.D
	ups.mfr.date: 2005/08/10
	ups.productid: 0002
   Sorry Arnaud!
 * the driver:
	driver.name: usbhid-ups
	driver.parameter.pollfreq: 30
	driver.parameter.pollinterval: 1
	driver.parameter.port: auto
	driver.version: 2.4.3
	driver.version.data: APC HID 0.95
	driver.version.internal: 0.34
The problem:
 * under certain circumstances system boot fails with a kernel oops 
   during the upsdrvctl/usbhid-ups phase of the startup script. 
   Note that we have *not* reached the upsd part of the startup script, 
	and the startup script of the printer package (cups) would be 
	reached much later.
 * the boot process works fine if another USB device, the USB printer, 
   is powered up. Unfortunately, I would like to keep it powered off 
   most of the time if I could do so.
   Note, the printer is not the only other USB device on the system,
	but the number of active USB devices is being changed.
 * Once the system is completely booted, there is no problem starting
   and stopping the ups using the startup script.
 * the problem appeared late in the lifetime of Centos-5, the system
   has shut down due to power loss and restarted with the printer off
   before the arrival of the problem.
 * the problem has probably appeared with the transition from Centos-5.4 
   (kernel-2.6.18-164.15.1.el5) to Centos-5.5 (kernel-2.6.18-194.3.1.el5),
   although I noticed it much later
 * libusb has been the same since 2008/04/22, I don't believe the problem
   appeared that long ago.
 * the problem happens with the locally rebuilt (with no changes)
   nut-2.2.0-6.1 from Fedora 8, with the nut-2.4.3-1 provided by
   Arnaud, and with a locally rebuilt version of nut-2.4.3-1 using
   a spec file provided by Arnaud.
Here is what I have tried:
 * I have started the system for a crash dump, and below is an extract
   from the crash log.
 * I have tried to recreate the problem on a virtual machine, but it
   doesn't happen there
 * I have introduced a delay before the start of the ups startup script
   to make sure that the suspected timing problem doesn't happen earlier.
 * I have hacked the startup script to invoke usbhid-ups directly with
   the proper device parameter and -D, and the problem goes away
 * I have used the hacked startup script to invoke usbhid-ups directly 
   with the proper device parameter but *no* -D, and we have a crash!
So I am now left with the observation that we have a timing problem
somewhere among usbhid-ups, libusb, and the kernel, and I have no
idea how to further narrow things down. 

I would be glad to do some more tests, but I would need your help with
this.

Thanks for any advice, AG

----------------------------------------------------------------------------
BUG: unable to handle kernel paging request at virtual address 0000190c
 printing eip:
c05954c6
*pde = 74ecb067
Oops: 0000 [#1]
SMP 
....
CPU:    0
EIP:    0060:[<c05954c6>]    Not tainted VLI
EFLAGS: 00010206   (2.6.18-194.26.1.el5 #1) 
EIP is at hid_close+0x2/0x1f
eax: 00000000   ebx: d216bd20   ecx: f7fff080   edx: 00000000
esi: d21ebc00   edi: c06b3470   ebp: 00000000   esp: f768ad34
ds: 007b   es: 007b   ss: 0068
Process usbhid-ups (pid: 2437, ti=f768a000 task=f7682000 task.ti=f768a000)
Stack: c05976fd d20fe000 c0595668 d21ebc00 c06b3440 c058e1c5 d21ebc8c d21ebc14 
       c055e859 d21ebc14 d21ebc14 d21ebc00 c055ea91 d21ebc00 c0588351 ffffffc3 
       00000000 c0590cc0 f75ee340 00000000 00005516 00000000 c00c5512 d2173400 
Call Trace:
 [<c05976fd>] hiddev_disconnect+0x3f/0x5e
 [<c0595668>] hid_disconnect+0x81/0xbf
 [<c058e1c5>] usb_unbind_interface+0x34/0x6a
 [<c055e859>] __device_release_driver+0x7d/0xbb
 [<c055ea91>] device_release_driver+0x1c/0x2b
 [<c0588351>] usb_driver_release_interface+0x38/0x60
 [<c0590cc0>] proc_ioctl_default+0x10d/0x1d0
 [<c0591ffd>] usbdev_ioctl+0x1027/0x10de
 [<c048b06b>] __d_lookup+0x98/0xdb
 [<c04c7eff>] inode_has_perm+0x54/0x5c
 [<c04c78ab>] avc_has_perm+0x3c/0x46
 [<c04c7eff>] inode_has_perm+0x54/0x5c
 [<c0464c32>] __handle_mm_fault+0x463/0xaac
 [<c0486290>] do_ioctl+0x47/0x5d
 [<c04867f9>] vfs_ioctl+0x47b/0x4d3
 [<c0486899>] sys_ioctl+0x48/0x5f
 [<c0404f17>] syscall_call+0x7/0xb
 ======================Code: 00 05 b0 01 86 83 b4 0c 00 00 fb 8d 83 54 0c 00 00
e8 4c 87 e9 ff 8b 83 a8 0c 00 00 e8 7d 74 ff ff 31 c0 5b c3 31 d2 eb be 89 c2
<8b> 80 0c 19 00 00 48 85 c0 89 82 0c 19 00 00 75 0b 8b 82 a8 0c
EIP: [<c05954c6>] hid_close+0x2/0x1f SS:ESP 0068:f768ad34
----------------------------------------------------------------------------

-- 
 ----------------------------------------------------------------------
   Alfred Ganz					alfred-ganz:at:agci.com
   AG Consulting, Inc.				(203) 624-9667
   440 Prospect Street # 11
   New Haven, CT 06511
 ----------------------------------------------------------------------

Knight, Dave

2011-Jan-12 16:16 UTC

head link

[Nut-upsdev] a nasty kernel oops

My guess would be that there is some miscommunication between driver levels,
perhaps a result of running a new nut with an old kernel (latest is 2.6-37)?

That the presence of the "-D" flag appears to suppress the error:

 * I have hacked the startup script to invoke usbhid-ups directly
with>   the proper device parameter and -D, and the problem goes away
>  * I have used the hacked startup script to invoke usbhid-ups directly
>   with the proper device parameter but *no* -D, and we have a crash!
>
suggests a place to start, possibly looking for a device/interface timing
issue.

Dave

On Wed, Jan 12, 2011 at 10:08 AM, Alfred Ganz
<alfred-ganz+nut at agci.com<alfred-ganz%2Bnut at
agci.com>> wrote:
> Ladies and Gentlemen,
>
> I have been trying to sort out a nasty kernel oops for which I would
> like to ask for some advice. I don't actually think that this is a nut
> problem, although it is triggered by upsdrvctl or usbhid-ups. I rather
> suspect the USB library or the associated kernel code.
> Here is the configuration information:
>  * system: Centos-5.5
>  * kernel: 2.6.18-194.32.1.el5, latest vanilla for Centos-5.5
>  * nut: nut-2.4.3-1, a build provided by Arnaud last fall
>  * libusb: libusb-0.1.12-5.1, the latest for this system
>  * the UPS:
>        device.mfr: APC
>        device.model: Back-UPS ES 650
>        device.serial: QB0514232934
>        device.type: ups
>        ups.firmware: 818.w1.D
>        ups.mfr.date: 2005/08/10
>        ups.productid: 0002
>   Sorry Arnaud!
>  * the driver:
>        driver.name: usbhid-ups
>        driver.parameter.pollfreq: 30
>        driver.parameter.pollinterval: 1
>        driver.parameter.port: auto
>        driver.version: 2.4.3
>        driver.version.data: APC HID 0.95
>        driver.version.internal: 0.34
> The problem:
>  * under certain circumstances system boot fails with a kernel oops
>   during the upsdrvctl/usbhid-ups phase of the startup script.
>   Note that we have *not* reached the upsd part of the startup script,
>        and the startup script of the printer package (cups) would be
>        reached much later.
>  * the boot process works fine if another USB device, the USB printer,
>   is powered up. Unfortunately, I would like to keep it powered off
>   most of the time if I could do so.
>   Note, the printer is not the only other USB device on the system,
>        but the number of active USB devices is being changed.
>  * Once the system is completely booted, there is no problem starting
>   and stopping the ups using the startup script.
>  * the problem appeared late in the lifetime of Centos-5, the system
>   has shut down due to power loss and restarted with the printer off
>   before the arrival of the problem.
>  * the problem has probably appeared with the transition from Centos-5.4
>   (kernel-2.6.18-164.15.1.el5) to Centos-5.5 (kernel-2.6.18-194.3.1.el5),
>   although I noticed it much later
>  * libusb has been the same since 2008/04/22, I don't believe the
problem
>   appeared that long ago.
>  * the problem happens with the locally rebuilt (with no changes)
>   nut-2.2.0-6.1 from Fedora 8, with the nut-2.4.3-1 provided by
>   Arnaud, and with a locally rebuilt version of nut-2.4.3-1 using
>   a spec file provided by Arnaud.
> Here is what I have tried:
>  * I have started the system for a crash dump, and below is an extract
>   from the crash log.
>  * I have tried to recreate the problem on a virtual machine, but it
>   doesn't happen there
>  * I have introduced a delay before the start of the ups startup script
>   to make sure that the suspected timing problem doesn't happen
earlier.
>  * I have hacked the startup script to invoke usbhid-ups directly with
>   the proper device parameter and -D, and the problem goes away
>  * I have used the hacked startup script to invoke usbhid-ups directly
>   with the proper device parameter but *no* -D, and we have a crash!
> So I am now left with the observation that we have a timing problem
> somewhere among usbhid-ups, libusb, and the kernel, and I have no
> idea how to further narrow things down.
>
> I would be glad to do some more tests, but I would need your help with
> this.
>
> Thanks for any advice, AG
>
>
>
----------------------------------------------------------------------------
> BUG: unable to handle kernel paging request at virtual address 0000190c
>  printing eip:
> c05954c6
> *pde = 74ecb067
> Oops: 0000 [#1]
> SMP
> ....
> CPU:    0
> EIP:    0060:[<c05954c6>]    Not tainted VLI
> EFLAGS: 00010206   (2.6.18-194.26.1.el5 #1)
> EIP is at hid_close+0x2/0x1f
> eax: 00000000   ebx: d216bd20   ecx: f7fff080   edx: 00000000
> esi: d21ebc00   edi: c06b3470   ebp: 00000000   esp: f768ad34
> ds: 007b   es: 007b   ss: 0068
> Process usbhid-ups (pid: 2437, ti=f768a000 task=f7682000 task.ti=f768a000)
> Stack: c05976fd d20fe000 c0595668 d21ebc00 c06b3440 c058e1c5 d21ebc8c
> d21ebc14
>       c055e859 d21ebc14 d21ebc14 d21ebc00 c055ea91 d21ebc00 c0588351
> ffffffc3
>       00000000 c0590cc0 f75ee340 00000000 00005516 00000000 c00c5512
> d2173400
> Call Trace:
>  [<c05976fd>] hiddev_disconnect+0x3f/0x5e
>  [<c0595668>] hid_disconnect+0x81/0xbf
>  [<c058e1c5>] usb_unbind_interface+0x34/0x6a
>  [<c055e859>] __device_release_driver+0x7d/0xbb
>  [<c055ea91>] device_release_driver+0x1c/0x2b
>  [<c0588351>] usb_driver_release_interface+0x38/0x60
>  [<c0590cc0>] proc_ioctl_default+0x10d/0x1d0
>  [<c0591ffd>] usbdev_ioctl+0x1027/0x10de
>  [<c048b06b>] __d_lookup+0x98/0xdb
>  [<c04c7eff>] inode_has_perm+0x54/0x5c
>  [<c04c78ab>] avc_has_perm+0x3c/0x46
>  [<c04c7eff>] inode_has_perm+0x54/0x5c
>  [<c0464c32>] __handle_mm_fault+0x463/0xaac
>  [<c0486290>] do_ioctl+0x47/0x5d
>  [<c04867f9>] vfs_ioctl+0x47b/0x4d3
>  [<c0486899>] sys_ioctl+0x48/0x5f
>  [<c0404f17>] syscall_call+0x7/0xb
>  ======================> Code: 00 05 b0 01 86 83 b4 0c 00 00 fb 8d 83 54
0c 00 00 e8 4c 87 e9 ff 8b
> 83 a8 0c 00 00 e8 7d 74 ff ff 31 c0 5b c3 31 d2 eb be 89 c2 <8b> 80
0c 19 00
> 00 48 85 c0 89 82 0c 19 00 00 75 0b 8b 82 a8 0c
> EIP: [<c05954c6>] hid_close+0x2/0x1f SS:ESP 0068:f768ad34
>
>
----------------------------------------------------------------------------
>
> --
>  ----------------------------------------------------------------------
>   Alfred Ganz                                  alfred-ganz:at:agci.com
>   AG Consulting, Inc.                          (203) 624-9667
>   440 Prospect Street # 11
>   New Haven, CT 06511
>  ----------------------------------------------------------------------
>


-- 
"Ridicule is man's most potent weapon.  It?s hard to counterattack
ridicule,
and it infuriates
 the opposition, which then reacts to your advantage."

                                                              Saul Alinsky,
Marxist, Obama mentor
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.alioth.debian.org/pipermail/nut-upsdev/attachments/20110112/6b780000/attachment.htm>

Charles Lepple

2011-Jan-13 03:24 UTC

head link

[Nut-upsdev] a nasty kernel oops

On Jan 12, 2011, at 10:08 AM, Alfred Ganz wrote:
> EIP is at hid_close+0x2/0x1f
> eax: 00000000   ebx: d216bd20   ecx: f7fff080   edx: 00000000
> esi: d21ebc00   edi: c06b3470   ebp: 00000000   esp: f768ad34
> ds: 007b   es: 007b   ss: 0068
> Process usbhid-ups (pid: 2437, ti=f768a000 task=f7682000  
> task.ti=f768a000)
> Stack: c05976fd d20fe000 c0595668 d21ebc00 c06b3440 c058e1c5  
> d21ebc8c d21ebc14
>       c055e859 d21ebc14 d21ebc14 d21ebc00 c055ea91 d21ebc00 c0588351  
> ffffffc3
>       00000000 c0590cc0 f75ee340 00000000 00005516 00000000 c00c5512  
> d2173400
> Call Trace:
> [<c05976fd>] hiddev_disconnect+0x3f/0x5e
> [<c0595668>] hid_disconnect+0x81/0xbf
> [<c058e1c5>] usb_unbind_interface+0x34/0x6a
> [<c055e859>] __device_release_driver+0x7d/0xbb
> [<c055ea91>] device_release_driver+0x1c/0x2b
> [<c0588351>] usb_driver_release_interface+0x38/0x60...

I suspect that the problem goes away once you have booted because the  
kernel HID driver has been detached from the UPS once already.

Perhaps I am misreading your description, but have you tried booting  
without any USB devices, plugging the UPS in later (maybe once the  
system has quiesced), then restarting the NUT init scripts?

It also might work better to disconnect the kernel HID driver before  
starting usbhid-ups. If you have libhid, it comes with an example  
program (libhid-detach-device) that detaches the kernel driver from  
the first interface of a USB device. If not, it's just a handful of  
libusb calls, and we can put together a test program to do that before  
usbhid-ups gets to it.

Alfred Ganz

2011-Jan-14 00:54 UTC

head link

[Nut-upsdev] a nasty kernel oops

Charles,
	From: Charles Lepple <clepple at gmail.com>
	Subject: Re: [Nut-upsdev] a nasty kernel oops
	Date: Wed, 12 Jan 2011 22:24:42 -0500
	
	I suspect that the problem goes away once you have booted because the  
	kernel HID driver has been detached from the UPS once already.
	
	Perhaps I am misreading your description, but have you tried booting  
	without any USB devices, plugging the UPS in later (maybe once the  
	system has quiesced), then restarting the NUT init scripts?
I have not booted the system with all USB devices unplugged. What I
have done is booting with the UPS startscript disabled, and then run
the startscript without trouble. Unfortunately I don't have a spare 
system, and so these tests always involve significant disruptions. 
I will attempt to play around along these lines as well as with your
second suggestion at the next opportunity.
	
	It also might work better to disconnect the kernel HID driver before  
	starting usbhid-ups. If you have libhid, it comes with an example  
	program (libhid-detach-device) that detaches the kernel driver from  
	the first interface of a USB device. If not, it's just a handful of  
	libusb calls, and we can put together a test program to do that before  
	usbhid-ups gets to it.
I didn't see an installable rpm package in any of the archives for my 
current system, but I have been able to hack the one from Fedora 12, 
and libhid-detach-device works as expected. After disconnecting and then 
reconnecting the USB cable of the UPS, the kernel reports:
	hiddev96: USB HID v1.10 Device [APC Back-UPS ES 650 FW:818.w1.D \
					USB FW:w1] on usb-0000:00:1d.0-1
and "lsusb -v" shows the expected short output with: 
	Report Descriptors: ** UNAVAILABLE **, 
after running "libhid-detach-device" it shows the rather lengthy
Report
Descriptors.

First observation, on the running system the startup script works fine
with the HID driver attached or detached. I will make some boot tests
as soon as possible!

Thanks for your suggestions, AG

-- 
 ----------------------------------------------------------------------
   Alfred Ganz					alfred-ganz:at:agci.com
   AG Consulting, Inc.				(203) 624-9667
   440 Prospect Street # 11
   New Haven, CT 06511
 ----------------------------------------------------------------------

Alfred Ganz

2011-Jan-15 02:31 UTC

head link

[Nut-upsdev] a nasty kernel oops

Charles,

Here is some more insight into my problem.
 * I am now able to get a crash on a virtual machine, so life has
   become a bit easier
 * disabling the UPS, then immediately after re-enabling it, the first 
   libhid-detach-device fails after about 10 sec, with:
	hid_force_open failed with return code 7.
   i.e. no device has been found.
   Following the first libhid-detach-device immediately by several more,
   they all fail the same way, but without another 10 sec delay.
   Finally, adding a sleep 1 followed by another libhid-detach-device
   will succeed.
 * disabling the UPS, then waiting 20 sec after re-enabling it, the 
   first libhid-detach-device will succeed.
   Note, I wasn't able to reduce this delay significantly, so it seems
	that the total delay can be smaller when doing the above
	failing operations.
 * The same behavior occurs when using lsusb -d instead of the above
   libhid-detach-device.
 * usbhid-ups crashes if the last preceeding libhid-detach-device fails,
   but it will not fail if there is a successful libhid-detach-device
   preceeding it, or if there is a longer inactive delay.
Unfortunately, the timing is for the virtual machine, and I don't expect
things to be similar on the real machine, not to speak of the boot
context with other devices present.

As you suspected, it looks like usbhid-ups crashes if things have not 
reached quiescence or some other kind of availability. However, I have 
no idea how at boot time adding an active USB device can achieve this 
(or maybe achieve it much more quickly).

It would of course be nice to make usbhid-ups have a builtin method for
detecting such a state and at the same time be able to detect the 
absence of the device in question. However, I think the appropriate
thing is to determine such a method outside of usbhid-ups first. If at
all possible, I would prefer to do this with some shell script, but if
push comes to shuff, I might have to resort to some C code as well.

Any advice on what might work would of course be much appreciated. 

Thanks, AG

P.S. What happened to the mail server at lists.alioth.debian.org

-- 
 ----------------------------------------------------------------------
   Alfred Ganz					alfred-ganz:at:agci.com
   AG Consulting, Inc.				(203) 624-9667
   440 Prospect Street # 11
   New Haven, CT 06511
 ----------------------------------------------------------------------

Nut upsdev - Jan 2011 - a nasty kernel oops

[Nut-upsdev] a nasty kernel oops

[Nut-upsdev] a nasty kernel oops

[Nut-upsdev] a nasty kernel oops

[Nut-upsdev] a nasty kernel oops

[Nut-upsdev] a nasty kernel oops