Paul Whittaker <paul.whittaker <at> drisq.com> writes:> > > Hi Arnaud, > Great to hear this is an isolated case; hopefully it means we've > found a bug we can help solve. > On 27/11/12 22:50, Arnaud Quette wrote:please send the log here, incompressed form.> are the photos taken with a still cam? > please put these on your website (or any other public > storage) and point the link > > > > For each of the four crashes I've got syslog excerpts (which > include some messages from just before and just after the crash) > and photos of the screen.? I couldn't take the whole screen at > once clearly, so I took several photos that overlap.? The photos > were pretty bad (low end smartphone - sorry) but they should be > legible; I've treated them to improve clarity and image size. > Crashes 1 and 2 used the same (fairly basic) setup.? After crash 2 > I added the 'pollonly' driver flag, and after crash 3 I removed > that and was running the driver with -D instead. > Crash 1 - A.png > B.png > (syslog.1.gz attached) > Crash 2 - A.png > B.png C.png D.png > (syslog.2.gz attached) > Crash 3 - A.png > B.png > (syslog.3.gz attached) > Crash 4 - A.png > B.png C.png > (syslog.4.gz attached) > There should be most kernel details you need in the screenshots.? > This is running the latest nut-server package from Ubuntu Precise > 64-bit (version 2.6.3-1ubuntu1.1 at present) on a Dell PowerEdge > R210 II. > > a dump file would bepreferable:https://help.ubuntu.com/12.04/serverguide/kernel-crash-dump.html> > > > I'll see what I can do; this is our main development server, so > I'll probably be limited to weekends. > I'm not very familiar with what data crash dumps record, but do > know > that I can only post one if I'm sure it doesn't contain company > file data, encryption keys, VM state, etc. (for obvious reasons).? > Since I know in advance I'll be taking a crash dump, are there any > steps I can take steps to > avoid such data?? If not, I can still run tools on the dump > locally (out of hours) and send backtraces or other output you > might need. > Let me know if any other details, config files, etc., would help. > Cheers, > > > Paul. > > > PS I've replied only to the list; do you prefer that I reply to > you and CC: the list instead? > > > Attachment (syslog.1.gz): application/x-gzip, 919 bytes > Attachment (syslog.2.gz): application/x-gzip, 1669 bytes > Attachment (syslog.3.gz): application/x-gzip, 2707 bytes > Attachment (syslog.4.gz): application/x-gzip, 769 bytes > > _______________________________________________ > Nut-upsuser mailing list > Nut-upsuser <at> lists.alioth.debian.org > http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/nut-upsuserI found this thread on a google search and just wanted to add that I seem to have encountered a very similar problem: general protection faults once every 1-3 days after installing nut for my CyberPower 1000PFCLCD. Using Ubuntu 12.04 LTS 64-bit, stock kernel 3.5.0-34-generic. ASRock H77M motherboard, Core i3-3225 CPU. A syslog dump is below in case it may be useful. This one's a bit special in that there seems to be 3 successive GPFs. I don't follow this list so please contact me separately if I can be helpful. Jun 30 10:29:24 ns1 kernel: [90099.127943] general protection fault: 0000 [#1] SMP Jun 30 10:29:24 ns1 kernel: [90099.127984] CPU 3 Jun 30 10:29:24 ns1 kernel: [90099.127996] Modules linked in: veth xt_multiport iptable_filter ip_tables x_tables bnep rfcomm bluetooth snd_hda_codec_hdmi snd_hda_codec_realtek coretemp bridge kvm_intel kvm stp llc ghash_clmulni_intel cryptd i915 drm_kms_helper drm snd_hda_intel snd_hda_codec joydev mei serio_raw i2c_algo_bit snd_hwdep snd_pcm ppdev video lpc_ich mac_hid snd_seq_midi snd_rawmidi parport_pc snd_seq_midi_event snd_seq microcode snd_timer snd_seq_device snd soundcore snd_page_alloc lp parport hid_logitech ff_memless hid_generic usbhid hid raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov r8169 raid6_pq async_tx ahci libahci raid1 raid0 multipath linear Jun 30 10:29:24 ns1 kernel: [90099.128430] Jun 30 10:29:24 ns1 kernel: [90099.128436] Pid: 2634, comm: usbhid-ups Not tainted 3.5.0-34-generic #55~precise1-Ubuntu To Be Filled By O.E.M. To Be Filled By O.E.M./H77M Jun 30 10:29:24 ns1 kernel: [90099.128500] RIP: 0010:[<ffffffff811740db>] [<ffffffff811740db>] __kmalloc+0x7b/0x1a0 Jun 30 10:29:24 ns1 kernel: [90099.128545] RSP: 0018:ffff8800c188dd68 EFLAGS: 00010282 Jun 30 10:29:24 ns1 kernel: [90099.128571] RAX: 0000000000000000 RBX: ffff8800c188de60 RCX: 0000000000031bda Jun 30 10:29:24 ns1 kernel: [90099.128606] RDX: 0000000000031bd9 RSI: 0000000000000000 RDI: 0000000000016860 Jun 30 10:29:24 ns1 kernel: [90099.128640] RBP: ffff8800c188ddb8 R08: ffff88011f396860 R09: ffff88008eaf9780 Jun 30 10:29:24 ns1 kernel: [90099.128674] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88011ac02a00 Jun 30 10:29:24 ns1 kernel: [90099.128708] R13: ffff012c11947808 R14: 00000000000000d0 R15: ffffffff814caa00 Jun 30 10:29:24 ns1 kernel: [90099.128742] FS: 00007fd56798c700(0000) GS:ffff88011f380000(0000) knlGS:0000000000000000 Jun 30 10:29:24 ns1 kernel: [90099.128780] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 30 10:29:24 ns1 kernel: [90099.128808] CR2: 000000000804f7c8 CR3: 00000000c1896000 CR4: 00000000001407e0 Jun 30 10:29:24 ns1 kernel: [90099.128842] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jun 30 10:29:24 ns1 kernel: [90099.128876] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jun 30 10:29:24 ns1 kernel: [90099.128910] Process usbhid-ups (pid: 2634, threadinfo ffff8800c188c000, task ffff8800c6bec500) Jun 30 10:29:24 ns1 kernel: [90099.128950] Stack: Jun 30 10:29:24 ns1 kernel: [90099.128963] ffff8800c188df08 ffff8800c188de60 0000000000000000 0000000000000200 Jun 30 10:29:24 ns1 kernel: [90099.129007] ffff880115599280 ffff8800c188de60 0000000000000000 ffff880036b176c0 Jun 30 10:29:24 ns1 kernel: [90099.129050] ffff880115599280 ffff88008eaf9780 ffff8800c188de38 ffffffff814caa00 Jun 30 10:29:24 ns1 kernel: [90099.129094] Call Trace: Jun 30 10:29:24 ns1 kernel: [90099.129112] [<ffffffff814caa00>] proc_do_submiturb+0x4c0/0x9f0 Jun 30 10:29:24 ns1 kernel: [90099.129143] [<ffffffff814cbb4c>] usbdev_do_ioctl+0x3cc/0xc50 Jun 30 10:29:24 ns1 kernel: [90099.129172] [<ffffffff814cc3fe>] usbdev_ioctl+0xe/0x20 Jun 30 10:29:24 ns1 kernel: [90099.129202] [<ffffffff8119a10a>] do_vfs_ioctl+0x8a/0x340 Jun 30 10:29:24 ns1 kernel: [90099.129231] [<ffffffff8119a451>] sys_ioctl+0x91/0xa0 Jun 30 10:29:24 ns1 kernel: [90099.129259] [<ffffffff816a5669>] system_call_fastpath+0x16/0x1b Jun 30 10:29:24 ns1 kernel: [90099.129288] Code: 00 4d 8b 04 24 65 4c 03 04 25 48 dc 00 00 49 8b 50 08 4d 8b 28 4d 85 ed 0f 84 ec 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84 c0 74 c2 49 Jun 30 10:29:24 ns1 kernel: [90099.129540] RIP [<ffffffff811740db>] __kmalloc+0x7b/0x1a0 Jun 30 10:29:24 ns1 kernel: [90099.131778] RSP <ffff8800c188dd68> Jun 30 10:29:24 ns1 kernel: [90099.206651] ---[ end trace 6002a33ab3b18572 ]--- Jun 30 10:29:39 ns1 upsd[2636]: Data for UPS [cyber] is stale - check driver Jun 30 10:29:42 ns1 upsmon[2640]: Poll UPS [cyber at localhost] failed - Data stale Jun 30 10:29:42 ns1 upsmon[2640]: Communications with UPS cyber at localhost lost Jun 30 10:29:42 ns1 kernel: [90117.515437] general protection fault: 0000 [#2] SMP Jun 30 10:29:42 ns1 kernel: [90117.517705] CPU 3 Jun 30 10:29:42 ns1 kernel: [90117.517717] Modules linked in: veth xt_multiport iptable_filter ip_tables x_tables bnep rfcomm bluetooth snd_hda_codec_hdmi snd_hda_codec_realtek coretemp bridge kvm_intel kvm stp llc ghash_clmulni_intel cryptd i915 drm_kms_helper drm snd_hda_intel snd_hda_codec joydev mei serio_raw i2c_algo_bit snd_hwdep snd_pcm ppdev video lpc_ich mac_hid snd_seq_midi snd_rawmidi parport_pc snd_seq_midi_event snd_seq microcode snd_timer snd_seq_device snd soundcore snd_page_alloc lp parport hid_logitech ff_memless hid_generic usbhid hid raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov r8169 raid6_pq async_tx ahci libahci raid1 raid0 multipath linear Jun 30 10:29:42 ns1 kernel: [90117.527368] Jun 30 10:29:42 ns1 kernel: [90117.529789] Pid: 12632, comm: upsmon Tainted: G D 3.5.0-34-generic #55~precise1-Ubuntu To Be Filled By O.E.M. To Be Filled By O.E.M./H77M Jun 30 10:29:42 ns1 kernel: [90117.532309] RIP: 0010:[<ffffffff811740db>] [<ffffffff811740db>] __kmalloc+0x7b/0x1a0 Jun 30 10:29:42 ns1 kernel: [90117.534906] RSP: 0018:ffff880117eb3d08 EFLAGS: 00010282 Jun 30 10:29:42 ns1 kernel: [90117.537427] RAX: 0000000000000000 RBX: ffff8800a63eae00 RCX: 0000000000031bda Jun 30 10:29:42 ns1 kernel: [90117.539956] RDX: 0000000000031bd9 RSI: 0000000000000000 RDI: 0000000000016860 Jun 30 10:29:42 ns1 kernel: [90117.542476] RBP: ffff880117eb3d58 R08: ffff88011f396860 R09: 0000000000000000 Jun 30 10:29:42 ns1 kernel: [90117.544993] R10: 0000000000000002 R11: fefefefefefefeff R12: ffff88011ac02a00 Jun 30 10:29:42 ns1 kernel: [90117.547510] R13: ffff012c11947808 R14: 00000000000000d0 R15: ffffffff811da892 Jun 30 10:29:42 ns1 kernel: [90117.550026] FS: 00007f224c00b700(0000) GS:ffff88011f380000(0000) knlGS:0000000000000000 Jun 30 10:29:42 ns1 kernel: [90117.552550] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 30 10:29:42 ns1 kernel: [90117.555093] CR2: 0000000000408068 CR3: 000000011791a000 CR4: 00000000001407e0 Jun 30 10:29:42 ns1 kernel: [90117.557620] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jun 30 10:29:42 ns1 kernel: [90117.560144] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jun 30 10:29:42 ns1 kernel: [90117.562661] Process upsmon (pid: 12632, threadinfo ffff880117eb2000, task ffff8800b6cd0000) Jun 30 10:29:42 ns1 kernel: [90117.565169] Stack: Jun 30 10:29:42 ns1 kernel: [90117.567667] ffff880117eb3fd8 ffffffff811da779 0000000000000080 00000000000001f8 Jun 30 10:29:42 ns1 kernel: [90117.570202] ffff880117eb3fd8 ffff8800a63eae00 ffff880082132080 00000000fffffff4 Jun 30 10:29:42 ns1 kernel: [90117.572749] ffff880117eb3f58 00000000000001f8 ffff880117eb3e38 ffffffff811da892 Jun 30 10:29:42 ns1 kernel: [90117.575278] Call Trace: Jun 30 10:29:42 ns1 kernel: [90117.577795] [<ffffffff811da779>] ? load_elf_binary+0x49/0xe20 Jun 30 10:29:42 ns1 kernel: [90117.580354] [<ffffffff811da892>] load_elf_binary+0x162/0xe20 Jun 30 10:29:42 ns1 kernel: [90117.582876] [<ffffffff81151cb2>] ? get_user_pages+0x52/0x60 Jun 30 10:29:42 ns1 kernel: [90117.585381] [<ffffffff8118e92a>] ? get_arg_page+0xaa/0xe0 Jun 30 10:29:42 ns1 kernel: [90117.587866] [<ffffffff813540d6>] ? strnlen_user+0x36/0xf0 Jun 30 10:29:42 ns1 kernel: [90117.590356] [<ffffffff8118eb61>] ? get_user_arg_ptr.isra.23+0x31/0x70 Jun 30 10:29:42 ns1 kernel: [90117.592846] [<ffffffff8118dd10>] search_binary_handler+0x110/0x350 Jun 30 10:29:42 ns1 kernel: [90117.595357] [<ffffffff811da730>] ? load_elf_library+0x230/0x230 Jun 30 10:29:42 ns1 kernel: [90117.597844] [<ffffffff8119025d>] do_execve_common.isra.30+0x2bd/0x350 Jun 30 10:29:42 ns1 kernel: [90117.600328] [<ffffffff8119030b>] do_execve+0x1b/0x20 Jun 30 10:29:42 ns1 kernel: [90117.602732] [<ffffffff8101d707>] sys_execve+0x47/0x70 Jun 30 10:29:42 ns1 kernel: [90117.605068] [<ffffffff816a5aac>] stub_execve+0x6c/0xc0 Jun 30 10:29:42 ns1 kernel: [90117.607325] Code: 00 4d 8b 04 24 65 4c 03 04 25 48 dc 00 00 49 8b 50 08 4d 8b 28 4d 85 ed 0f 84 ec 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d 4a 01 <49> 8b 5c 05 00 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84 c0 74 c2 49 Jun 30 10:29:42 ns1 kernel: [90117.609856] RIP [<ffffffff811740db>] __kmalloc+0x7b/0x1a0 Jun 30 10:29:42 ns1 kernel: [90117.612226] RSP <ffff880117eb3d08> Jun 30 10:29:42 ns1 kernel: [90117.614581] ---[ end trace 6002a33ab3b18573 ]--- Jun 30 10:29:42 ns1 kernel: [90117.614745] BUG: Bad rss-counter state mm:ffff88006d861880 idx:1 val:2 Jun 30 10:29:47 ns1 upsmon[2640]: Poll UPS [cyber at localhost] failed - Data stale Jun 30 10:29:52 ns1 upsmon[2640]: Poll UPS [cyber at localhost] failed - Data stale Jun 30 10:29:57 ns1 kernel: [90131.585402] general protection fault: 0000 [#3] SMP Jun 30 10:29:57 ns1 kernel: [90131.587750] CPU 3 Jun 30 10:29:57 ns1 kernel: [90131.587762] Modules linked in: veth xt_multiport iptable_filter ip_tables x_tables bnep rfcomm bluetooth snd_hda_codec_hdmi snd_hda_codec_realtek coretemp bridge kvm_intel kvm stp llc ghash_clmulni_intel cryptd i915 drm_kms_helper drm snd_hda_intel snd_hda_codec joydev mei serio_raw i2c_algo_bit snd_hwdep snd_pcm ppdev video lpc_ich mac_hid snd_seq_midi snd_rawmidi parport_pc snd_seq_midi_event snd_seq microcode snd_timer snd_seq_device snd soundcore snd_page_alloc lp parport hid_logitech ff_memless hid_generic usbhid hid raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov r8169 raid6_pq async_tx ahci libahci raid1 raid0 multipath linear Jun 30 10:29:57 ns1 kernel: [90131.597760] Jun 30 10:29:57 ns1 kernel: [90131.600309] Pid: 2037, comm: postgres Tainted: G D 3.5.0-34-generic #55~precise1-Ubuntu To Be Filled By O.E.M. To Be Filled By O.E.M./H77M Jun 30 10:29:57 ns1 kernel: [90131.602946] RIP: 0010:[<ffffffff81176fcd>] [<ffffffff81176fcd>] __kmalloc_node_track_caller+0x13d/0x1f0 Jun 30 10:29:57 ns1 kernel: [90131.605596] RJun 30 11:03:33 ns1 kernel: imklog 5.8.6, log source = /proc/kmsg started.
Charles Lepple
2013-Jul-02 01:41 UTC
[Nut-upsuser] Kernel crash when using usbhid-ups driver.
On Jul 1, 2013, at 12:00 AM, Rod Roark wrote:> Jun 30 10:29:24 ns1 kernel: [90099.128500] RIP: 0010:[<ffffffff811740db>] > [<ffffffff811740db>] __kmalloc+0x7b/0x1a0The reference to kmalloc makes me think "memory corruption". I'd report this to Ubuntu as a kernel bug to see if this is an issue specific to that kernel version, but it might also be worth running a memory test. There is a slim chance that we are triggering some sort of kernel logic bug. (The usbhid-ups driver is making the same requests, over and over, and nothing intentionally has a period of 1-3 days.) The kernel should return an error to the ioctl() syscall if it's a userspace programming bug. -- Charles Lepple clepple at gmail
On 07/01/2013 06:41 PM, Charles Lepple wrote:> On Jul 1, 2013, at 12:00 AM, Rod Roark wrote: > >> Jun 30 10:29:24 ns1 kernel: [90099.128500] RIP: 0010:[<ffffffff811740db>] >> [<ffffffff811740db>] __kmalloc+0x7b/0x1a0 > The reference to kmalloc makes me think "memory corruption". I'd report this to Ubuntu as a kernel bug to see if this is an issue specific to that kernel version, but it might also be worth running a memory test. > > There is a slim chance that we are triggering some sort of kernel logic bug. (The usbhid-ups driver is making the same requests, over and over, and nothing intentionally has a period of 1-3 days.) The kernel should return an error to the ioctl() syscall if it's a userspace programming bug.Thanks. My first thought was hardware error, and so already did memory testing (everything passed) as well as swapping out both memory sticks and even lowering the DRAM refresh rate and some other BIOS tweaks. No difference. Then I remembered that all ran fine for a week before I installed nut, and after that there were 6 crashes in 10 days. So removed nut and it's been good since, though will have more certainty about that over time. I don't know what privilege level this stuff runs at, but if none of it is ring 0 then in theory the kernel should not crash, right? In that case it's got to be a kernel bug that's being exposed, but I don't think they'd act without more info or at least better reproducability. I expect the 1-3 days just comes from the probability of crashing in some period of time. As another point of interest, the crashes have all happened when the server was relatively idle, so UPS polling is (relatively speaking) a significant part of what little is going on. Anyway, just sharing what I can because it seemed important. Seems likely a kernel bug, but my solution for now is not using nut. I'll chime in again if any new info comes to light. Rod