Robert Hulme
2007-May-16 08:21 UTC
[Xen-devel] Timer going backwards and Unable to handle kernel NULL pointer
Hi, I wonder if anyone has any idea what the cause of these two problems is (and if they''re related). I''ve got a Thunder K8SRE with two dual core Opteron processors and eight 1 GB SDRAM sticks in it. Over the last few months I''ve been experiencing weird crashes with it which at first I thought was because of a dodgy motherboard, so I replaced the motherboard, then it looked like the PSU was dodgy (well, it stopped working), so that got replaced. Finally when that didn''t fix the problem I replaced the CPUs. Unfortunately the problem continued. At the time I was running some ancient version of Debian unstable AMD64 along with a not up to date version of Xen, so I decided the cause could be software related. So I''ve now installed Debian stable and the latest Xen stable source tarball from Xensource. Booting in to SMP mode resulted in this: May 16 00:43:37 weebl kernel: Timer ISR/3: Time went backwards: delta=-12967620 delta_cpu=217015619 shadow=46263756493 off=423276508 processed=46700000000 cpu_processed=46470016761 May 16 00:43:37 weebl kernel: 0: 46680000000 May 16 00:43:37 weebl kernel: 1: 46700016761 May 16 00:43:37 weebl kernel: 2: 45670016761 May 16 00:43:37 weebl kernel: 3: 46470016761 May 16 00:43:37 weebl kernel: Timer ISR/1: Time went backwards: delta=-14213208 delta_cpu=-4229969 shadow=46311295002 off=434492777 processed=46760000000 cpu_processed=46750016761 May 16 00:43:37 weebl kernel: 0: 46760000000 May 16 00:43:37 weebl kernel: 1: 46750016761 May 16 00:43:37 weebl kernel: 2: 46700016761 May 16 00:43:37 weebl kernel: 3: 46680016761 May 16 00:43:37 weebl kernel: Timer ISR/1: Time went backwards: delta=-10857604 delta_cpu=9125635 shadow=46311295002 off=457847726 processed=46780000000 cpu_processed=46760016761 May 16 00:43:37 weebl kernel: 0: 46780000000 May 16 00:43:37 weebl kernel: 1: 46760016761 May 16 00:43:37 weebl kernel: 2: 46700016761 May 16 00:43:37 weebl kernel: 3: 46680016761 May 16 00:43:37 weebl kernel: Timer ISR/1: Time went backwards: delta=-13339199 delta_cpu=6644040 shadow=46311295002 off=475366135 processed=46800000000 cpu_processed=46780016761 May 16 00:43:37 weebl kernel: 0: 46800000000 May 16 00:43:37 weebl kernel: 1: 46780016761 May 16 00:43:37 weebl kernel: 2: 46700016761 May 16 00:43:37 weebl kernel: 3: 46680016761 etc... Which appears to be this bug: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=195 So I rebooted with nosmp and things seemed much better. This was at 1AM, but when I woke up this morning although all the domains were reachable by ping (the box is colocated) I couldn''t ssh to them or access them on port 80 etc. I rebooted the system remotely back in to the earlier version of Xen / Debian and found the following errors in the log: May 16 06:28:34 weebl kernel: PGD 11d4ab067 PUD 11db77067 PMD 0 May 16 06:28:34 weebl kernel: CPU 0 May 16 06:28:34 weebl kernel: Modules linked in: xt_physdev iptable_filter ip_tables x_tables bridge ipv6 button ac battery raid5 xor ide_cd cdrom serio_raw pcspkr i2c_nforce2 i2c_core floppy dm_mirror dm_snapshot dm_mod raid1 ide_generic sd_mod tg3 sata_nv amd74xx libata ide_disk scsi_mod ehci_hcd ohci_hcd usbcore thermal processor fan May 16 06:28:34 weebl kernel: Pid: 4261, comm: find Not tainted 2.6.16.33-xen #5 May 16 06:28:34 weebl kernel: RIP: e030:[<ffffffff80195afc>] <ffffffff80195afc>{inotify_inode_queue_event+286} May 16 06:28:34 weebl kernel: RSP: e02b:ffff88011d49fec8 EFLAGS: 00010246 May 16 06:28:34 weebl kernel: RAX: ffff880112ec6c70 RBX: 0000000040000010 RCX: 0000000000000000 May 16 06:28:34 weebl kernel: RDX: 0000000000000000 RSI: 0000000040000010 RDI: ffff880112a96c80 May 16 06:28:34 weebl kernel: RBP: fffffffffffffff1 R08: ffff880112a96a78 R09: 0000000000000001 May 16 06:28:34 weebl kernel: R10: 0000000000000003 R11: ffffffffff578000 R12: ffff880112a96a78 May 16 06:28:34 weebl kernel: R13: ffff880112a96c80 R14: 0000000000000000 R15: fffffffffffffff1 May 16 06:28:34 weebl kernel: FS: 00002aad53da46d0(0000) GS:ffffffff80469000(0000) knlGS:0000000000000000 May 16 06:28:34 weebl kernel: CS: e033 DS: 0000 ES: 0000 May 16 06:28:34 weebl kernel: Process find (pid: 4261, threadinfo ffff88011d49e000, task ffff8801c0518100) May 16 06:28:34 weebl kernel: Stack: 0000000000000000 4000001000000000 ffff880112a96c70 0000000040000010 May 16 06:28:34 weebl kernel: ffff8801bfc8ae80 ffff880112a96a78 ffff880112a8f250 ffff8801c10ee0c0 May 16 06:28:34 weebl kernel: 0000000000576400 ffffffff801734e2 May 16 06:28:34 weebl kernel: Call Trace: <ffffffff801734e2>{__fput+97} <ffffffff80170d43>{filp_close+89} May 16 06:28:34 weebl kernel: <ffffffff801721a6>{sys_close+136} <ffffffff8010adf6>{system_call+134} May 16 06:28:34 weebl kernel: <ffffffff8010ad70>{system_call+0} May 16 06:28:34 weebl kernel: May 16 06:28:34 weebl kernel: Code: 4d 8b 7f 10 49 83 ef 10 48 8d 45 10 48 3b 44 24 10 0f 85 2d May 16 06:28:34 weebl kernel: <1>Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: May 16 06:28:34 weebl kernel: PGD 11d519067 PUD 11da61067 PMD 0 May 16 06:28:34 weebl kernel: CPU 0 May 16 06:28:34 weebl kernel: Modules linked in: xt_physdev iptable_filter ip_tables x_tables bridge ipv6 button ac battery raid5 xor ide_cd cdrom serio_raw pcspkr i2c_nforce2 i2c_core floppy dm_mirror dm_snapshot dm_mod raid1 ide_generic sd_mod tg3 sata_nv amd74xx libata ide_disk scsi_mod ehci_hcd ohci_hcd usbcore thermal processor fan May 16 06:28:34 weebl kernel: Pid: 4692, comm: run-parts Not tainted 2.6.16.33-xen #5 May 16 06:28:34 weebl kernel: RIP: e030:[<ffffffff80195afc>] <ffffffff80195afc>{inotify_inode_queue_event+286} May 16 06:28:34 weebl kernel: RSP: e02b:ffff88011db13e48 EFLAGS: 00010246 May 16 06:28:34 weebl kernel: RAX: ffff880112f86950 RBX: ffff8801bee14d00 RCX: 0000000000000000 May 16 06:28:34 weebl kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880112a96960 May 16 06:28:34 weebl kernel: RBP: fffffffffffffff0 R08: ffff880112a96758 R09: ffff88011ceba7f0 May 16 06:28:34 weebl kernel: R10: 0000000000000059 R11: ffff88011d4e23d8 R12: 0000000000000001 May 16 06:28:34 weebl kernel: R13: ffff880112a96960 R14: 0000000000000000 R15: fffffffffffffff0 May 16 06:28:34 weebl kernel: FS: 00002b82087656d0(0000) GS:ffffffff80469000(0000) knlGS:0000000000000000 May 16 06:28:34 weebl kernel: CS: e033 DS: 0000 ES: 0000 May 16 06:28:34 weebl kernel: Process run-parts (pid: 4692, threadinfo ffff88011db12000, task ffff8801c1080860) May 16 06:28:34 weebl kernel: Stack: 0000000000000000 0000000100000000 ffff880112a96950 ffff8801bee14d00 May 16 06:28:34 weebl kernel: 0000000000000059 0000000000000001 ffff880112a96758 00007fffff89f468 May 16 06:28:34 weebl kernel: 0000000000503010 ffffffff80172e73 May 16 06:28:34 weebl kernel: Call Trace: <ffffffff80172e73>{vfs_read+313} <ffffffff8017b76b>{kernel_read+65} May 16 06:28:34 weebl kernel: <ffffffff8017cc17>{do_execve+299} <ffffffff8010976f>{sys_execve+54} May 16 06:28:34 weebl kernel: <ffffffff8010b285>{stub_execve+61} May 16 06:28:34 weebl kernel: May 16 06:28:34 weebl kernel: Code: 4d 8b 7f 10 49 83 ef 10 48 8d 45 10 48 3b 44 24 10 0f 85 2d May 16 06:28:34 weebl kernel: <1>Unable to handle kernel paging request at 0000000000110050 RIP: May 16 06:28:34 weebl kernel: PGD 11d46f067 PUD 11da4a067 PMD 0 May 16 06:28:34 weebl kernel: CPU 0 May 16 06:28:34 weebl kernel: Modules linked in: xt_physdev iptable_filter ip_tables x_tables bridge ipv6 button ac battery raid5 xor ide_cd cdrom serio_raw pcspkr i2c_nforce2 i2c_core floppy dm_mirror dm_snapshot dm_mod raid1 ide_generic sd_mod tg3 sata_nv amd74xx libata ide_disk scsi_mod ehci_hcd ohci_hcd usbcore thermal processor fan May 16 06:28:34 weebl kernel: Pid: 4695, comm: run-parts Not tainted 2.6.16.33-xen #5 May 16 06:28:34 weebl kernel: RIP: e030:[<ffffffff8017a4a3>] <ffffffff8017a4a3>{cdev_put+6} May 16 06:28:34 weebl kernel: RSP: e02b:ffff88011db51dc8 EFLAGS: 00010206 May 16 06:28:34 weebl kernel: RAX: 0000000000000000 RBX: 0000000000000010 RCX: 0000000000000000 May 16 06:28:34 weebl kernel: RDX: ffffffffff578000 RSI: ffff8801c1107ec0 RDI: 0000000000110000 May 16 06:28:34 weebl kernel: RBP: ffff8801c1107ec0 R08: ffff880112a96438 R09: 0000000000105874 May 16 06:28:34 weebl kernel: R10: 0000000000000000 R11: ffffffff8026141c R12: ffff880112a96438 May 16 06:28:34 weebl kernel: R13: ffff88011d441e90 R14: ffff8801c10eebc0 R15: 0000000000503010 May 16 06:28:34 weebl kernel: FS: 00002b82087656d0(0000) GS:ffffffff80469000(0000) knlGS:0000000000000000 May 16 06:28:34 weebl kernel: CS: e033 DS: 0000 ES: 0000 May 16 06:28:34 weebl kernel: Process run-parts (pid: 4695, threadinfo ffff88011db50000, task ffff8801c0518100) May 16 06:28:34 weebl kernel: Stack: 0000000000000010 ffffffff8017353e ffff880143724a00 ffff880143724a00 May 16 06:28:34 weebl kernel: ffff88011db51f58 0000000000000000 ffff88011db51f58 ffffffff8019d278 May 16 06:28:34 weebl kernel: ffff880143724a00 ffffffff801536de May 16 06:28:34 weebl kernel: Call Trace: <ffffffff8017353e>{__fput+189} <ffffffff8019d278>{load_script+92} May 16 06:28:34 weebl kernel: <ffffffff801536de>{__alloc_pages+92} <ffffffff8017b407>{copy_strings+373} May 16 06:28:34 weebl kernel: <ffffffff8017b53f>{search_binary_handler+167} <ffffffff8017cc87>{do_execve+411} May 16 06:28:34 weebl kernel: <ffffffff8010976f>{sys_execve+54} <ffffffff8010b285>{stub_execve+61} May 16 06:28:34 weebl kernel: May 16 06:28:34 weebl kernel: Code: 48 8b 5f 50 e8 b4 78 10 00 48 85 db 74 2a 65 8b 04 25 24 00 May 16 06:28:35 weebl kernel: <6>savelog[4718]: segfault at 0000000000185ca8 rip 0000000000444236 rsp 00007fffffdf1fa0 error 4 May 16 06:28:35 weebl kernel: standard[4732]: segfault at 000000000008cca8 rip 0000000000489330 rsp 00007fffffb1f980 error 4 May 16 06:28:35 weebl kernel: standard[4733]: segfault at 000000000008cca8 rip 0000000000489330 rsp 00007fffffb1f980 error 4 May 16 06:28:35 weebl kernel: PGD 1c5b067 PUD 0 May 16 06:28:35 weebl kernel: CPU 0 May 16 06:28:35 weebl kernel: Modules linked in: xt_physdev iptable_filter ip_tables x_tables bridge ipv6 button ac battery raid5 xor ide_cd cdrom serio_raw pcspkr i2c_nforce2 i2c_core floppy dm_mirror dm_snapshot dm_mod raid1 ide_generic sd_mod tg3 sata_nv amd74xx libata ide_disk scsi_mod ehci_hcd ohci_hcd usbcore thermal processor fan May 16 06:28:35 weebl kernel: Pid: 4768, comm: sysklogd Not tainted 2.6.16.33-xen #5 May 16 06:28:35 weebl kernel: RIP: e030:[<ffffffff80119088>] <ffffffff80119088>{mm_walk+988} May 16 06:28:35 weebl kernel: RSP: e02b:ffff88011e713bc0 EFLAGS: 00010246 May 16 06:28:35 weebl kernel: RAX: 7fffffffffffffff RBX: ffff8801128d2008 RCX: 0000000055555555 May 16 06:28:35 weebl kernel: RDX: ffff880008d14000 RSI: 0000000155555555 RDI: 0000000155555555 May 16 06:28:35 weebl kernel: RBP: 8000000000000063 R08: 0000000000000000 R09: ffff880112a94518 May 16 06:28:35 weebl kernel: R10: 0000000112a94000 R11: 00000001128d2000 R12: ffff88011db12000 May 16 06:28:35 weebl kernel: R13: 0000000000000000 R14: ffff8801c10911c0 R15: ffff880143724200 May 16 06:28:35 weebl kernel: FS: 00002b138b6e86d0(0000) GS:ffffffff80469000(0000) knlGS:0000000000000000 May 16 06:28:35 weebl kernel: CS: e033 DS: 0000 ES: 0000 May 16 06:28:35 weebl kernel: Process sysklogd (pid: 4768, threadinfo ffff88011e712000, task ffff880000c6c820) May 16 06:28:35 weebl kernel: Stack: ffff88011d598c00 ffff880000c6c820 ffff88011d598c00 ffff880000c6c820 May 16 06:28:35 weebl kernel: ffffffff8011a275 ffff88011d598c00 ffffffff8011a4a4 0000000000000080 May 16 06:28:35 weebl kernel: ffff880143724200 ffff880143724200 May 16 06:28:35 weebl kernel: Call Trace: <ffffffff8011a275>{mm_unpin+427} <ffffffff8011a4a4>{_arch_exit_mmap+476} May 16 06:28:35 weebl kernel: <ffffffff8014fedd>{generic_file_aio_read+52} <ffffffff801724a6>{do_sync_read+199} May 16 06:28:35 weebl kernel: <ffffffff801612d2>{exit_mmap+34} <ffffffff80128e05>{mmput+40} May 16 06:28:35 weebl kernel: <ffffffff8017c509>{flush_old_exec+2443} <ffffffff80172e73>{vfs_read+313} May 16 06:28:35 weebl kernel: <ffffffff8019dd36>{load_elf_binary+1095} <ffffffff801536de>{__alloc_pages+92} May 16 06:28:35 weebl kernel: <ffffffff8017b407>{copy_strings+373} <ffffffff8017b53f>{search_binary_handler+167} May 16 06:28:35 weebl kernel: <ffffffff8017cc87>{do_execve+411} <ffffffff8010976f>{sys_execve+54} May 16 06:28:35 weebl kernel: <ffffffff8010b285>{stub_execve+61} May 16 06:28:35 weebl kernel: May 16 06:28:35 weebl kernel: Code: 48 8b 34 ca 48 21 c6 48 c1 e7 0c 48 b8 00 00 00 00 00 88 ff Has anyone got any idea why this is occuring? I''m completely out of ideas. Many months ago I thought the random crashing could be because of RAM, but it''s all ECC RAM, and with any of the RAM sticks in it would still randomly crash. I''m not sure where to go next - what to try. Does anyone know what the cause is, or have any idea how to diagnose it? Thanks -Rob -- http://www.robhulme.com/ http://robhu.livejournal.com/ On two occasions, I have been asked [by members of Parliament], "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question. -- Charles Babbage _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ian Pratt
2007-May-28 20:03 UTC
RE: [Xen-devel] Timer going backwards and Unable to handle kernel NULLpointer
> I''ve got a Thunder K8SRE with two dual core Opteron processors and > eight 1 GB SDRAM sticks in it. > Booting in to SMP mode resulted in this: > > May 16 00:43:37 weebl kernel: Timer ISR/3: Time went backwards: > delta=-12967620 delta_cpu=217015619 shadow=46263756493 off=423276508 > processed=46700000000 cpu_processed=46470016761 > May 16 00:43:37 weebl kernel: 0: 46680000000 > May 16 00:43:37 weebl kernel: 1: 46700016761 > May 16 00:43:37 weebl kernel: 2: 45670016761 > May 16 00:43:37 weebl kernel: 3: 46470016761The only time I''ve seen this is on a system that was overheating and bouncing in and out of thermal throttling. Any chance that could be happening here?> May 16 06:28:35 weebl kernel: Call Trace: > <ffffffff8011a275>{mm_unpin+427} > <ffffffff8011a4a4>{_arch_exit_mmap+476} > May 16 06:28:35 weebl kernel: > <ffffffff8014fedd>{generic_file_aio_read+52} > <ffffffff801724a6>{do_sync_read+199} > May 16 06:28:35 weebl kernel: <ffffffff801612d2>{exit_mmap+34} > <ffffffff80128e05>{mmput+40} > May 16 06:28:35 weebl kernel:That''s a nasty crash. There were some cleanups in this area in Xen 3.1, but I don''t think it was to address a specific crash like this. Might be worth trying 3.1 on the machine. Best, Ian _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2007-May-29 07:01 UTC
RE: [Xen-devel] Timer going backwards and Unable to handle kernel NULLpointer
>>> "Ian Pratt" <Ian.Pratt@cl.cam.ac.uk> 28.05.07 22:03 >>> >> I''ve got a Thunder K8SRE with two dual core Opteron processors and >> eight 1 GB SDRAM sticks in it. >> Booting in to SMP mode resulted in this: >> >> May 16 00:43:37 weebl kernel: Timer ISR/3: Time went backwards: >> delta=-12967620 delta_cpu=217015619 shadow=46263756493 off=423276508 >> processed=46700000000 cpu_processed=46470016761 >> May 16 00:43:37 weebl kernel: 0: 46680000000 >> May 16 00:43:37 weebl kernel: 1: 46700016761 >> May 16 00:43:37 weebl kernel: 2: 45670016761 >> May 16 00:43:37 weebl kernel: 3: 46470016761 > >The only time I''ve seen this is on a system that was overheating and >bouncing in and out of thermal throttling. Any chance that could be >happening here?I''ve been seeing these pretty regularly on a single-socket dual-core Athlon system for the last couple of months, and only on Friday finally found time to start looking into these. Besides the messages above, I also see hangs in about every other boot attempt but only if I do *not* use serial output (which makes debugging a little harder), and never once initial boot finished - this is why I finally needed to find time to look into the problem. I shall note though that the kernel we use does not disable CONFIG_GENERIC_TIME and makes use of a Xen clocksource as posted by Jeremy among the paravirt ops patches. What happens when the hang occurs (in do_nanosleep context) is that the time read/interpolated from the Xen provided values is in the past compared to the last value read (and cached inside the kernel), resulting in a huge timeout value rather than the intended 50ms one. Without having collected data proving this (will do later today), I currently think that the interpolation parameters are too imprecise until the first time local_time_calibration() runs on each CPU, i.e. during little less than the first second of dom0''s life). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel