Ladies and Gentlemen, I have been trying to sort out a nasty kernel oops for which I would like to ask for some advice. I don't actually think that this is a nut problem, although it is triggered by upsdrvctl or usbhid-ups. I rather suspect the USB library or the associated kernel code. Here is the configuration information: * system: Centos-5.5 * kernel: 2.6.18-194.32.1.el5, latest vanilla for Centos-5.5 * nut: nut-2.4.3-1, a build provided by Arnaud last fall * libusb: libusb-0.1.12-5.1, the latest for this system * the UPS: device.mfr: APC device.model: Back-UPS ES 650 device.serial: QB0514232934 device.type: ups ups.firmware: 818.w1.D ups.mfr.date: 2005/08/10 ups.productid: 0002 Sorry Arnaud! * the driver: driver.name: usbhid-ups driver.parameter.pollfreq: 30 driver.parameter.pollinterval: 1 driver.parameter.port: auto driver.version: 2.4.3 driver.version.data: APC HID 0.95 driver.version.internal: 0.34 The problem: * under certain circumstances system boot fails with a kernel oops during the upsdrvctl/usbhid-ups phase of the startup script. Note that we have *not* reached the upsd part of the startup script, and the startup script of the printer package (cups) would be reached much later. * the boot process works fine if another USB device, the USB printer, is powered up. Unfortunately, I would like to keep it powered off most of the time if I could do so. Note, the printer is not the only other USB device on the system, but the number of active USB devices is being changed. * Once the system is completely booted, there is no problem starting and stopping the ups using the startup script. * the problem appeared late in the lifetime of Centos-5, the system has shut down due to power loss and restarted with the printer off before the arrival of the problem. * the problem has probably appeared with the transition from Centos-5.4 (kernel-2.6.18-164.15.1.el5) to Centos-5.5 (kernel-2.6.18-194.3.1.el5), although I noticed it much later * libusb has been the same since 2008/04/22, I don't believe the problem appeared that long ago. * the problem happens with the locally rebuilt (with no changes) nut-2.2.0-6.1 from Fedora 8, with the nut-2.4.3-1 provided by Arnaud, and with a locally rebuilt version of nut-2.4.3-1 using a spec file provided by Arnaud. Here is what I have tried: * I have started the system for a crash dump, and below is an extract from the crash log. * I have tried to recreate the problem on a virtual machine, but it doesn't happen there * I have introduced a delay before the start of the ups startup script to make sure that the suspected timing problem doesn't happen earlier. * I have hacked the startup script to invoke usbhid-ups directly with the proper device parameter and -D, and the problem goes away * I have used the hacked startup script to invoke usbhid-ups directly with the proper device parameter but *no* -D, and we have a crash! So I am now left with the observation that we have a timing problem somewhere among usbhid-ups, libusb, and the kernel, and I have no idea how to further narrow things down. I would be glad to do some more tests, but I would need your help with this. Thanks for any advice, AG ---------------------------------------------------------------------------- BUG: unable to handle kernel paging request at virtual address 0000190c printing eip: c05954c6 *pde = 74ecb067 Oops: 0000 [#1] SMP .... CPU: 0 EIP: 0060:[<c05954c6>] Not tainted VLI EFLAGS: 00010206 (2.6.18-194.26.1.el5 #1) EIP is at hid_close+0x2/0x1f eax: 00000000 ebx: d216bd20 ecx: f7fff080 edx: 00000000 esi: d21ebc00 edi: c06b3470 ebp: 00000000 esp: f768ad34 ds: 007b es: 007b ss: 0068 Process usbhid-ups (pid: 2437, ti=f768a000 task=f7682000 task.ti=f768a000) Stack: c05976fd d20fe000 c0595668 d21ebc00 c06b3440 c058e1c5 d21ebc8c d21ebc14 c055e859 d21ebc14 d21ebc14 d21ebc00 c055ea91 d21ebc00 c0588351 ffffffc3 00000000 c0590cc0 f75ee340 00000000 00005516 00000000 c00c5512 d2173400 Call Trace: [<c05976fd>] hiddev_disconnect+0x3f/0x5e [<c0595668>] hid_disconnect+0x81/0xbf [<c058e1c5>] usb_unbind_interface+0x34/0x6a [<c055e859>] __device_release_driver+0x7d/0xbb [<c055ea91>] device_release_driver+0x1c/0x2b [<c0588351>] usb_driver_release_interface+0x38/0x60 [<c0590cc0>] proc_ioctl_default+0x10d/0x1d0 [<c0591ffd>] usbdev_ioctl+0x1027/0x10de [<c048b06b>] __d_lookup+0x98/0xdb [<c04c7eff>] inode_has_perm+0x54/0x5c [<c04c78ab>] avc_has_perm+0x3c/0x46 [<c04c7eff>] inode_has_perm+0x54/0x5c [<c0464c32>] __handle_mm_fault+0x463/0xaac [<c0486290>] do_ioctl+0x47/0x5d [<c04867f9>] vfs_ioctl+0x47b/0x4d3 [<c0486899>] sys_ioctl+0x48/0x5f [<c0404f17>] syscall_call+0x7/0xb ======================Code: 00 05 b0 01 86 83 b4 0c 00 00 fb 8d 83 54 0c 00 00 e8 4c 87 e9 ff 8b 83 a8 0c 00 00 e8 7d 74 ff ff 31 c0 5b c3 31 d2 eb be 89 c2 <8b> 80 0c 19 00 00 48 85 c0 89 82 0c 19 00 00 75 0b 8b 82 a8 0c EIP: [<c05954c6>] hid_close+0x2/0x1f SS:ESP 0068:f768ad34 ---------------------------------------------------------------------------- -- ---------------------------------------------------------------------- Alfred Ganz alfred-ganz:at:agci.com AG Consulting, Inc. (203) 624-9667 440 Prospect Street # 11 New Haven, CT 06511 ----------------------------------------------------------------------
My guess would be that there is some miscommunication between driver levels, perhaps a result of running a new nut with an old kernel (latest is 2.6-37)? That the presence of the "-D" flag appears to suppress the error: * I have hacked the startup script to invoke usbhid-ups directly with> the proper device parameter and -D, and the problem goes away > * I have used the hacked startup script to invoke usbhid-ups directly > with the proper device parameter but *no* -D, and we have a crash! >suggests a place to start, possibly looking for a device/interface timing issue. Dave On Wed, Jan 12, 2011 at 10:08 AM, Alfred Ganz <alfred-ganz+nut at agci.com<alfred-ganz%2Bnut at agci.com>> wrote:> Ladies and Gentlemen, > > I have been trying to sort out a nasty kernel oops for which I would > like to ask for some advice. I don't actually think that this is a nut > problem, although it is triggered by upsdrvctl or usbhid-ups. I rather > suspect the USB library or the associated kernel code. > Here is the configuration information: > * system: Centos-5.5 > * kernel: 2.6.18-194.32.1.el5, latest vanilla for Centos-5.5 > * nut: nut-2.4.3-1, a build provided by Arnaud last fall > * libusb: libusb-0.1.12-5.1, the latest for this system > * the UPS: > device.mfr: APC > device.model: Back-UPS ES 650 > device.serial: QB0514232934 > device.type: ups > ups.firmware: 818.w1.D > ups.mfr.date: 2005/08/10 > ups.productid: 0002 > Sorry Arnaud! > * the driver: > driver.name: usbhid-ups > driver.parameter.pollfreq: 30 > driver.parameter.pollinterval: 1 > driver.parameter.port: auto > driver.version: 2.4.3 > driver.version.data: APC HID 0.95 > driver.version.internal: 0.34 > The problem: > * under certain circumstances system boot fails with a kernel oops > during the upsdrvctl/usbhid-ups phase of the startup script. > Note that we have *not* reached the upsd part of the startup script, > and the startup script of the printer package (cups) would be > reached much later. > * the boot process works fine if another USB device, the USB printer, > is powered up. Unfortunately, I would like to keep it powered off > most of the time if I could do so. > Note, the printer is not the only other USB device on the system, > but the number of active USB devices is being changed. > * Once the system is completely booted, there is no problem starting > and stopping the ups using the startup script. > * the problem appeared late in the lifetime of Centos-5, the system > has shut down due to power loss and restarted with the printer off > before the arrival of the problem. > * the problem has probably appeared with the transition from Centos-5.4 > (kernel-2.6.18-164.15.1.el5) to Centos-5.5 (kernel-2.6.18-194.3.1.el5), > although I noticed it much later > * libusb has been the same since 2008/04/22, I don't believe the problem > appeared that long ago. > * the problem happens with the locally rebuilt (with no changes) > nut-2.2.0-6.1 from Fedora 8, with the nut-2.4.3-1 provided by > Arnaud, and with a locally rebuilt version of nut-2.4.3-1 using > a spec file provided by Arnaud. > Here is what I have tried: > * I have started the system for a crash dump, and below is an extract > from the crash log. > * I have tried to recreate the problem on a virtual machine, but it > doesn't happen there > * I have introduced a delay before the start of the ups startup script > to make sure that the suspected timing problem doesn't happen earlier. > * I have hacked the startup script to invoke usbhid-ups directly with > the proper device parameter and -D, and the problem goes away > * I have used the hacked startup script to invoke usbhid-ups directly > with the proper device parameter but *no* -D, and we have a crash! > So I am now left with the observation that we have a timing problem > somewhere among usbhid-ups, libusb, and the kernel, and I have no > idea how to further narrow things down. > > I would be glad to do some more tests, but I would need your help with > this. > > Thanks for any advice, AG > > > ---------------------------------------------------------------------------- > BUG: unable to handle kernel paging request at virtual address 0000190c > printing eip: > c05954c6 > *pde = 74ecb067 > Oops: 0000 [#1] > SMP > .... > CPU: 0 > EIP: 0060:[<c05954c6>] Not tainted VLI > EFLAGS: 00010206 (2.6.18-194.26.1.el5 #1) > EIP is at hid_close+0x2/0x1f > eax: 00000000 ebx: d216bd20 ecx: f7fff080 edx: 00000000 > esi: d21ebc00 edi: c06b3470 ebp: 00000000 esp: f768ad34 > ds: 007b es: 007b ss: 0068 > Process usbhid-ups (pid: 2437, ti=f768a000 task=f7682000 task.ti=f768a000) > Stack: c05976fd d20fe000 c0595668 d21ebc00 c06b3440 c058e1c5 d21ebc8c > d21ebc14 > c055e859 d21ebc14 d21ebc14 d21ebc00 c055ea91 d21ebc00 c0588351 > ffffffc3 > 00000000 c0590cc0 f75ee340 00000000 00005516 00000000 c00c5512 > d2173400 > Call Trace: > [<c05976fd>] hiddev_disconnect+0x3f/0x5e > [<c0595668>] hid_disconnect+0x81/0xbf > [<c058e1c5>] usb_unbind_interface+0x34/0x6a > [<c055e859>] __device_release_driver+0x7d/0xbb > [<c055ea91>] device_release_driver+0x1c/0x2b > [<c0588351>] usb_driver_release_interface+0x38/0x60 > [<c0590cc0>] proc_ioctl_default+0x10d/0x1d0 > [<c0591ffd>] usbdev_ioctl+0x1027/0x10de > [<c048b06b>] __d_lookup+0x98/0xdb > [<c04c7eff>] inode_has_perm+0x54/0x5c > [<c04c78ab>] avc_has_perm+0x3c/0x46 > [<c04c7eff>] inode_has_perm+0x54/0x5c > [<c0464c32>] __handle_mm_fault+0x463/0xaac > [<c0486290>] do_ioctl+0x47/0x5d > [<c04867f9>] vfs_ioctl+0x47b/0x4d3 > [<c0486899>] sys_ioctl+0x48/0x5f > [<c0404f17>] syscall_call+0x7/0xb > ======================> Code: 00 05 b0 01 86 83 b4 0c 00 00 fb 8d 83 54 0c 00 00 e8 4c 87 e9 ff 8b > 83 a8 0c 00 00 e8 7d 74 ff ff 31 c0 5b c3 31 d2 eb be 89 c2 <8b> 80 0c 19 00 > 00 48 85 c0 89 82 0c 19 00 00 75 0b 8b 82 a8 0c > EIP: [<c05954c6>] hid_close+0x2/0x1f SS:ESP 0068:f768ad34 > > ---------------------------------------------------------------------------- > > -- > ---------------------------------------------------------------------- > Alfred Ganz alfred-ganz:at:agci.com > AG Consulting, Inc. (203) 624-9667 > 440 Prospect Street # 11 > New Haven, CT 06511 > ---------------------------------------------------------------------- >-- "Ridicule is man's most potent weapon. It?s hard to counterattack ridicule, and it infuriates the opposition, which then reacts to your advantage." Saul Alinsky, Marxist, Obama mentor -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.alioth.debian.org/pipermail/nut-upsdev/attachments/20110112/6b780000/attachment.htm>
On Jan 12, 2011, at 10:08 AM, Alfred Ganz wrote:> EIP is at hid_close+0x2/0x1f > eax: 00000000 ebx: d216bd20 ecx: f7fff080 edx: 00000000 > esi: d21ebc00 edi: c06b3470 ebp: 00000000 esp: f768ad34 > ds: 007b es: 007b ss: 0068 > Process usbhid-ups (pid: 2437, ti=f768a000 task=f7682000 > task.ti=f768a000) > Stack: c05976fd d20fe000 c0595668 d21ebc00 c06b3440 c058e1c5 > d21ebc8c d21ebc14 > c055e859 d21ebc14 d21ebc14 d21ebc00 c055ea91 d21ebc00 c0588351 > ffffffc3 > 00000000 c0590cc0 f75ee340 00000000 00005516 00000000 c00c5512 > d2173400 > Call Trace: > [<c05976fd>] hiddev_disconnect+0x3f/0x5e > [<c0595668>] hid_disconnect+0x81/0xbf > [<c058e1c5>] usb_unbind_interface+0x34/0x6a > [<c055e859>] __device_release_driver+0x7d/0xbb > [<c055ea91>] device_release_driver+0x1c/0x2b > [<c0588351>] usb_driver_release_interface+0x38/0x60... I suspect that the problem goes away once you have booted because the kernel HID driver has been detached from the UPS once already. Perhaps I am misreading your description, but have you tried booting without any USB devices, plugging the UPS in later (maybe once the system has quiesced), then restarting the NUT init scripts? It also might work better to disconnect the kernel HID driver before starting usbhid-ups. If you have libhid, it comes with an example program (libhid-detach-device) that detaches the kernel driver from the first interface of a USB device. If not, it's just a handful of libusb calls, and we can put together a test program to do that before usbhid-ups gets to it.
Charles, From: Charles Lepple <clepple at gmail.com> Subject: Re: [Nut-upsdev] a nasty kernel oops Date: Wed, 12 Jan 2011 22:24:42 -0500 I suspect that the problem goes away once you have booted because the kernel HID driver has been detached from the UPS once already. Perhaps I am misreading your description, but have you tried booting without any USB devices, plugging the UPS in later (maybe once the system has quiesced), then restarting the NUT init scripts? I have not booted the system with all USB devices unplugged. What I have done is booting with the UPS startscript disabled, and then run the startscript without trouble. Unfortunately I don't have a spare system, and so these tests always involve significant disruptions. I will attempt to play around along these lines as well as with your second suggestion at the next opportunity. It also might work better to disconnect the kernel HID driver before starting usbhid-ups. If you have libhid, it comes with an example program (libhid-detach-device) that detaches the kernel driver from the first interface of a USB device. If not, it's just a handful of libusb calls, and we can put together a test program to do that before usbhid-ups gets to it. I didn't see an installable rpm package in any of the archives for my current system, but I have been able to hack the one from Fedora 12, and libhid-detach-device works as expected. After disconnecting and then reconnecting the USB cable of the UPS, the kernel reports: hiddev96: USB HID v1.10 Device [APC Back-UPS ES 650 FW:818.w1.D \ USB FW:w1] on usb-0000:00:1d.0-1 and "lsusb -v" shows the expected short output with: Report Descriptors: ** UNAVAILABLE **, after running "libhid-detach-device" it shows the rather lengthy Report Descriptors. First observation, on the running system the startup script works fine with the HID driver attached or detached. I will make some boot tests as soon as possible! Thanks for your suggestions, AG -- ---------------------------------------------------------------------- Alfred Ganz alfred-ganz:at:agci.com AG Consulting, Inc. (203) 624-9667 440 Prospect Street # 11 New Haven, CT 06511 ----------------------------------------------------------------------
Charles, Here is some more insight into my problem. * I am now able to get a crash on a virtual machine, so life has become a bit easier * disabling the UPS, then immediately after re-enabling it, the first libhid-detach-device fails after about 10 sec, with: hid_force_open failed with return code 7. i.e. no device has been found. Following the first libhid-detach-device immediately by several more, they all fail the same way, but without another 10 sec delay. Finally, adding a sleep 1 followed by another libhid-detach-device will succeed. * disabling the UPS, then waiting 20 sec after re-enabling it, the first libhid-detach-device will succeed. Note, I wasn't able to reduce this delay significantly, so it seems that the total delay can be smaller when doing the above failing operations. * The same behavior occurs when using lsusb -d instead of the above libhid-detach-device. * usbhid-ups crashes if the last preceeding libhid-detach-device fails, but it will not fail if there is a successful libhid-detach-device preceeding it, or if there is a longer inactive delay. Unfortunately, the timing is for the virtual machine, and I don't expect things to be similar on the real machine, not to speak of the boot context with other devices present. As you suspected, it looks like usbhid-ups crashes if things have not reached quiescence or some other kind of availability. However, I have no idea how at boot time adding an active USB device can achieve this (or maybe achieve it much more quickly). It would of course be nice to make usbhid-ups have a builtin method for detecting such a state and at the same time be able to detect the absence of the device in question. However, I think the appropriate thing is to determine such a method outside of usbhid-ups first. If at all possible, I would prefer to do this with some shell script, but if push comes to shuff, I might have to resort to some C code as well. Any advice on what might work would of course be much appreciated. Thanks, AG P.S. What happened to the mail server at lists.alioth.debian.org -- ---------------------------------------------------------------------- Alfred Ganz alfred-ganz:at:agci.com AG Consulting, Inc. (203) 624-9667 440 Prospect Street # 11 New Haven, CT 06511 ----------------------------------------------------------------------