We've been seeing the following bugs hit. This is happening with kernel versions 2.6.39 and 3.0.1. So far we've only seen this problem on Ubuntu servers, and it always seems to be the apache process that triggers it. Also, this time we were running a PCI compliance scan on the server. We think that may have triggered it.

2.6.39 Dump

------------[ cut here ]------------
kernel BUG at mm/swapfile.c:2527!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/vbd-51712/block/xvda/uevent
Modules linked in:
Pid: 30706, comm: apache2 Not tainted 2.6.39-2 #3
EIP: 0061:[<c01ab016>] EFLAGS: 00210246 CPU: 0
EIP is at swap_count_continued+0x176/0x190
EAX: 00000000 EBX: ebba0800 ECX: 80000001 EDX: f57ba95f
ESI: 00000080 EDI: ebbd7d40 EBP: 0000095f ESP: df4dbe38
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
Process apache2 (pid: 30706, ti=df4da000 task=e9259bd0 task.ti=df4da000)
Stack:
 ea298d40 0000495f ee11a000 00000000 c01ab157 0000495f 00092be0 ea298d40
 b8f33000 c01ac277 00000000 00092be0 e91ed998 c019dba7 6afaa065 80000001
 00000000 00000000 c01065b3 c01036cd b9531fff 00000000 e8fdb348 df4dbf0c
Call Trace:
 [<c01ab157>] ? swap_entry_free+0x127/0x150
 [<c01ac277>] ? free_swap_and_cache+0x27/0xd0
 [<c019dba7>] ? unmap_vmas+0x587/0x7f0
 [<c01065b3>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c01036cd>] ? xen_mc_flush+0xdd/0x190
 [<c01a1e0a>] ? exit_mmap+0x8a/0x140
 [<c0132aa1>] ? mmput+0x41/0xd0
 [<c0136afd>] ? exit_mm+0xed/0x110
 [<c0652710>] ? _raw_spin_lock_irq+0x10/0x20
 [<c01380d7>] ? do_exit+0x197/0x760
 [<c04417a7>] ? __xen_evtchn_do_upcall+0x1e7/0x240
 [<c0105d97>] ? xen_force_evtchn_callback+0x17/0x30
 [<c01386cf>] ? do_group_exit+0x2f/0x90
 [<c013873d>] ? sys_exit_group+0xd/0x10
 [<c0652a41>] ? syscall_call+0x7/0xb
 [<c0650000>] ? cpuup_callback+0x100/0x260
Code: d7 fe ff ff 89 d8 e8 7a 9f f7 ff 8d 54 05 00 c6 02 00 eb b0 0f 0b eb fe 0f 0b eb fe 89 f2 31 c0 80 fa 80 0f 94 c0 e9 b2 fe ff ff <0f> 0b eb fe 0f 0b eb fe 0f 0b eb fe 8d b4 26 00 00 00 00 8d bc
EIP: [<c01ab016>] swap_count_continued+0x176/0x190 SS:ESP 0069:df4dbe38
---[ end trace 9fa17c616c267728 ]---
Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: apache2/30706/0x00000001
Modules linked in:
Pid: 30706, comm: apache2 Tainted: G D 2.6.39-2 #3
Call Trace:
 [<c065104f>] ? schedule+0x76f/0x840
 [<c01358ff>] ? vprintk+0x19f/0x3a0
 [<c01065bc>] ? check_events+0x8/0xc
 [<c0652731>] ? _raw_spin_unlock_irqrestore+0x11/0x20
 [<c01358ff>] ? vprintk+0x19f/0x3a0
 [<c01385ea>] ? do_exit+0x6aa/0x760
 [<c06526e7>] ? _raw_spin_lock_irqsave+0x27/0x40
 [<c0652731>] ? _raw_spin_unlock_irqrestore+0x11/0x20
 [<c0135016>] ? kmsg_dump+0x36/0xd0
 [<c0109b90>] ? do_bounds+0x80/0x80
 [<c0135b1b>] ? printk+0x1b/0x20
 [<c0109b90>] ? do_bounds+0x80/0x80
 [<c010b98f>] ? oops_end+0x9f/0xa0
 [<c0109c0f>] ? do_invalid_op+0x7f/0x90
 [<c01ab016>] ? swap_count_continued+0x176/0x190
 [<c018a939>] ? free_pcppages_bulk+0x2c9/0x2f0
 [<c0105d97>] ? xen_force_evtchn_callback+0x17/0x30
 [<c01065bc>] ? check_events+0x8/0xc
 [<c01065b3>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c018b4f6>] ? free_hot_cold_page+0xd6/0x160
 [<c0103ff5>] ? pte_pfn_to_mfn+0xb5/0xd0
 [<c0104071>] ? xen_make_pte+0x41/0x110
 [<c0652fb6>] ? error_code+0x5a/0x60
 [<c0109b90>] ? do_bounds+0x80/0x80
 [<c01ab016>] ? swap_count_continued+0x176/0x190
 [<c01ab157>] ? swap_entry_free+0x127/0x150
 [<c01ac277>] ? free_swap_and_cache+0x27/0xd0
 [<c019dba7>] ? unmap_vmas+0x587/0x7f0
 [<c01065b3>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c01036cd>] ? xen_mc_flush+0xdd/0x190
 [<c01a1e0a>] ? exit_mmap+0x8a/0x140
 [<c0132aa1>] ? mmput+0x41/0xd0
 [<c0136afd>] ? exit_mm+0xed/0x110
 [<c0652710>] ? _raw_spin_lock_irq+0x10/0x20
 [<c01380d7>] ? do_exit+0x197/0x760
 [<c04417a7>] ? __xen_evtchn_do_upcall+0x1e7/0x240
 [<c0105d97>] ? xen_force_evtchn_callback+0x17/0x30
 [<c01386cf>] ? do_group_exit+0x2f/0x90
 [<c013873d>] ? sys_exit_group+0xd/0x10
 [<c0652a41>] ? syscall_call+0x7/0xb
 [<c0650000>] ? cpuup_callback+0x100/0x260

Here's the 3.0.1 dump; unfortunately I didn't catch a full dump.

BUG: unable to handle kernel paging request at f57ba13c
IP: [<c01ae845>] swap_count_continued+0x85/0x190
*pdpt = 0000000000959027 *pde = 00000000008f5067 *pte = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in:
Pid: 3666, comm: apache2 Not tainted 3.0.1-1 #1
EIP: 0061:[<c01ae845>] EFLAGS: 00010246 CPU: 0
EIP is at swap_count_continued+0x85/0x190
EAX: 00000080 EBX: ed302400 ECX: ecb870a0 EDX: f57ba13c
ESI: 00000080 EDI: ed3d7760 EBP: 0000013c ESP: ea479dec
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
Process apache2 (pid: 3666, ti=ea478000 task=ebe91bd0 task.ti=ea478000)
Stack:
 ec6915c0 0001913c ee129000 00000040 c01aea77 0001913c 00322780 ec6915c0
 b9275000 c01b0927 00000000 00322780 ea6533a8 c01a2d41 6e484067 80000001
 c01059ef 80000000 00000000 ebad13c0 eae13e48 ec6ebb1c ea479ee8 00000000
Call Trace:
 [<c01aea77>] ? swap_entry_free+0x127/0x150
 [<c01b0927>] ? free_swap_and_cache+0x27/0xd0
 [<c01a2d41>] ? zap_pte_range+0x321/0x420
 [<c01059ef>] ? xen_make_pte+0x3f/0xc0
 [<c01a2f98>] ? unmap_page_range+0x158/0x1a0
 [<c01a3058>] ? unmap_vmas+0x78/0xb0
 [<c01a524e>] ? exit_mmap+0x6e/0xf0
 [<c0136421>] ? mmput+0x41/0xd0
 [<c0139fcd>] ? exit_mm+0xed/0x110
 [<c06c76e0>] ? _raw_spin_lock_irq+0x10/0x20
 [<c013b7e7>] ? do_exit+0x197/0x340
 [<c01a5309>] ? remove_vma_list+0x39/0x50
 [<c013b9bf>] ? do_group_exit+0x2f/0x90
 [<c013ba2d>] ? sys_exit_group+0xd/0x10
 [<c06c7a11>] ? syscall_call+0x7/0xb
Code: 2a 90 8d 74 26 00 e9 15 01 00 00 89 d0 e8 c4 7e f7 ff 8b 5b 18 83 eb 18 39 df 0f 84 e5 00 00 00 89 d8 e8 3f 81 f7 ff 8d 54 05 00 <0f> b6 02 3c 80 74 d9 84 c0 0f 84 e2 00 00 00 83 e8 01 84 c0 88
EIP: [<c01ae845>] swap_count_continued+0x85/0x190 SS:ESP 0069:ea479dec
CR2: 00000000f57ba13c
---[ end trace 36a533bb83dd2812 ]---
Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: apache2/3666/0x00000001
Modules linked in:
Pid: 3666, comm: apache2 Tainted: G D 3.0.1-1 #1
Call Trace:
 [<c06c60ed>] ? schedule+0x50d/0x520
 [<c0106a23>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c01061d7>] ? xen_force_evtchn_callback+0x17/0x30
 [<c013b92f>] ? do_exit+0x2df/0x340
 [<c0138c3b>] ? printk+0x1b/0x20
 [<c010bf6f>] ? oops_end+0x9f/0xa0
 [<c0120f4f>] ? bad_area_nosemaphore+0xf/0x20
 [<c012149b>] ? do_page_fault+0x1bb/0x420
 [<c0177e85>] ? irq_get_irq_data+0x5/0x10
 [<c047da45>] ? info_for_irq+0x5/0x20
 [<c047e270>] ? evtchn_from_irq+0x10/0x40
 [<c01061d7>] ? xen_force_evtchn_callback+0x17/0x30
 [<c0106a2c>] ? check_events+0x8/0xc
 [<c0106a23>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<c0104bab>] ? xen_batched_set_pte+0xab/0xf0
 [<c01212e0>] ? vmalloc_fault+0x2c0/0x2c0
 [<c06c7f86>] ? error_code+0x5a/0x60
 [<c01212e0>] ? vmalloc_fault+0x2c0/0x2c0
 [<c01ae845>] ? swap_count_continued+0x85/0x190
 [<c01aea77>] ? swap_entry_free+0x127/0x150
 [<c01b0927>] ? free_swap_and_cache+0x27/0xd0
 [<c01a2d41>] ? zap_pte_range+0x321/0x420
 [<c01059ef>] ? xen_make_pte+0x3f/0xc0
 [<c01a2f98>] ? unmap_page_range+0x158/0x1a0
 [<c01a3058>] ? unmap_vmas+0x78/0xb0
 [<c01a524e>] ? exit_mmap+0x6e/0xf0
 [<c0136421>] ? mmput+0x41/0xd0
 [<c0139fcd>] ? exit_mm+0xed/0x110
 [<c06c76e0>] ? _raw_spin_lock_irq+0x10/0x20
 [<c013b7e7>] ? do_exit+0x197/0x340
 [<c01a5309>] ? remove_vma_list+0x39/0x50
 [<c013b9bf>] ? do_group_exit+0x2f/0x90
 [<c013ba2d>] ? sys_exit_group+0xd/0x10
 [<c06c7a11>] ? syscall_call+0x7/0xb

--
Shaun Retian
Chief Technical Officer
Network Data Center Host, Inc.
http://www.ndchost.com
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
I can just about reproduce this bug on the fly; a PCI compliance scan seems to be triggering it every time. Let me know what you guys need!

~Shaun
Konrad Rzeszutek Wilk
2011-Sep-16 08:24 UTC
Re: [Xen-devel] Re: kernel BUG at mm/swapfile.c:2527!
On Thu, Sep 15, 2011 at 12:52:42PM -0700, Shaun Reitan wrote:
> I can just about reproduce this bug on the fly, a PCI compliance
> scan seems to be triggering it every time. Let me know what you
> guys need!

How do I reproduce it? Is the PCI compliance easily available? Is there any chance we can get access to the physical box to figure out what is happening?

> ~Shaun
> How do I reproduce it? Is the PCI compliance easily available? Is
> there any chance we can get access to the physical box to figure
> out what is happening?

At this point I'm not able to reproduce the problem on the fly. We had thought it was a PCI compliance scan that was triggering the error, but now this customer is seeing the error constantly and the scans are not running. I'm thrashing a test server that I attempted to set up exactly like this customer's server, and so far no crash. The customer's server is crashing like crazy; I'm attempting to figure out the trigger, but it's proving difficult.

What do you need to see to figure out why it's crashing? I'm willing to do whatever it takes. I cannot give you access to the host, but the customer is willing to give you access to their virtual instance as a last resort.

~Shaun
Joining this thread late as a follow-on from a similar problem that is happening in Amazon AWS instances. There is a thread on the AWS forums where an instance owner has figured out how to cause this bug on demand using Apache and PHP:

https://forums.aws.amazon.com/thread.jspa?messageID=269851

In case those forums require a login, the PHP script to hit is:

<?php
$data = array();
for($x = 0; $x < 10000; $x++) {
    for($y = 0; $y < 1000; $y++) {
        $data[][] = rand(1, 100000);
    }
}
echo count($data);

I am not a PHP programmer, so I am unsure whether that php tag needs to be closed, but that is what is posted on the forum. Run apache bench against your test URL with 200 concurrent connections.

My Amazon instance isn't running PHP but encounters a similar problem once a day (1:48pm Pacific). I cannot allow people onto the instance but am willing to run diagnostics and post them here.

Kent
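For anyone who wants to script the load step, here is a minimal sketch of one way to build the ApacheBench invocation for 200 concurrent connections. The URL, the script name `stress.php`, and the total request count of 10000 are assumptions for illustration only; the forum post gives just the concurrency level. The only ab options used are `-n` (total requests) and `-c` (concurrency).

```python
import shutil
import subprocess


def build_ab_command(url, concurrency=200, total_requests=10000):
    """Return the argv list for an ApacheBench run against `url`.

    total_requests is an assumed value; only the concurrency of 200
    comes from the report in this thread.
    """
    # ab refuses a concurrency level larger than the total request count.
    if total_requests < concurrency:
        raise ValueError("total_requests must be >= concurrency")
    return ["ab", "-n", str(total_requests), "-c", str(concurrency), url]


def run_ab(url, **kwargs):
    """Run ab if it is installed; return its exit code, or None if ab is missing."""
    if shutil.which("ab") is None:
        return None
    return subprocess.call(build_ab_command(url, **kwargs))


if __name__ == "__main__":
    # Hypothetical test URL: point it at the guest serving the PHP script.
    print(" ".join(build_ab_command("http://test-guest.example/stress.php")))
```

Calling `run_ab(...)` with the real test URL starts the load. Whether this is enough to trigger the BUG on a given guest is not something this sketch can promise.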
On 9/16/2011 1:24 AM, Konrad Rzeszutek Wilk wrote:
> How do I reproduce it? Is the PCI compliance easily available? Is
> there any chance we can get access to the physical box to figure
> out what is happening?

Konrad,

did you get my email with the server I set up for you and logins?

--
Shaun
Konrad Rzeszutek Wilk
2011-Sep-22 11:06 UTC
Re: [Xen-devel] Re: kernel BUG at mm/swapfile.c:2527!
On Mon, Sep 19, 2011 at 09:33:40PM -0700, Shaun Reitan wrote:
> On 9/16/2011 1:24 AM, Konrad Rzeszutek Wilk wrote:
> >How do I reproduce it? Is the PCI compliance easily available? Is
> >there any chance we can get access to the physical box to figure
> >out what is happening?
>
> Konrad,
>
> did you get my email with the server I set up for you and logins?

Yup. Just came back from a conference so getting back to the groove.

> --
> Shaun