Alastair McKinstry
2007-Feb-21 02:14 UTC
[Lustre-discuss] Oops with 1.5.97 (on 2.6.18 kernel)
Hi,

While testing 1.5.97 on kernel 2.6.18, I had the following Oops on a client.
Has this been seen before?

Unable to handle kernel paging request at ffff88000b17fff8 RIP:
 [<ffffffff8843951a>] :osc:osc_brw_prep_request+0x3e3/0x9e2
PGD 1675067 PUD 1676067 PMD 16cf067 PTE 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: xt_tcpudp xt_physdev iptable_filter ip_tables x_tables osc mgc lustre lov lquota mdc ksocklnd ptlrpc obdclass lnet lvfs libcfs bridge netloop ipv6 dm_snapshot dm_mirror dm_mod usbhid usbkbd ipmi_watchdog ipmi_devintf ipmi_poweroff ipmi_si ipmi_msghandler dummy loop ide_generic ide_disk evdev psmouse shpchp pci_hotplug pcspkr serio_raw i2c_piix4 i2c_core ext3 jbd mbcache sd_mod ide_cd cdrom sata_svw libata scsi_mod tg3 ehci_hcd generic ohci_hcd serverworks ide_core fan
Pid: 3638, comm: cat Not tainted 2.6.18-3-xen-amd64 #1
RIP: e030:[<ffffffff8843951a>] [<ffffffff8843951a>] :osc:osc_brw_prep_request+0x3e3/0x9e2
RSP: e02b:ffff88000fbf1798 EFLAGS: 00010287
RAX: 0000000000000001 RBX: ffff88000eb79ce0 RCX: 0000000000000000
RDX: ffff88000b180000 RSI: ffff88000b180000 RDI: ffff880026c53cb0
RBP: ffff88005c6e4170 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88005c6e4088
R13: ffff88005c701a00 R14: ffff88005c58e498 R15: 0000000000000000
FS: 00002b0e147c86d0(0000) GS:ffffffff804c3000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process cat (pid: 3638, threadinfo ffff88000fbf0000, task ffff8800626cc7f0)
Stack: ffff88000b180000 00000100802a6c2b ffff880026c53cb0 ffff88005c58e498
       0000000000000001 0000000000000010 0000001062948240 ffff880056946000
       0000000102a13e28 0000000400000000
Call Trace:
 [<ffffffff8843d611>] :osc:osc_send_oap_rpc+0x80a/0xe6d
 [<ffffffff8843ddc0>] :osc:osc_check_rpcs+0x14c/0x29b
 [<ffffffff88445751>] :osc:osc_queue_async_io+0xc2b/0xd0a
 [<ffffffff8820e41e>] :libcfs:libcfs_debug_vmsg2+0x600/0x897
 [<ffffffff883a0d38>] :lov:lov_queue_async_io+0x2f1/0x3a1
 [<ffffffff883f996f>] :lustre:queue_or_sync_write+0x2b6/0xc43
 [<ffffffff883fcc99>] :lustre:ll_commit_write+0x269/0x5df
 [<ffffffff80210ba1>] generic_file_buffered_write+0x438/0x646
 [<ffffffff883b742f>] :lov:lov_update_enqueue_set+0x345/0x3a1
 [<ffffffff8020f13c>] current_fs_time+0x3b/0x40
 [<ffffffff884424b5>] :osc:osc_enqueue+0xfb/0x48e
 [<ffffffff8021646f>] __generic_file_aio_write_nolock+0x2e4/0x32f
 [<ffffffff883ef54b>] :lustre:ll_inode_size_unlock+0x81/0xd6
 [<ffffffff802a2809>] __generic_file_write_nolock+0x8f/0xa8
 [<ffffffff882e4886>] :ptlrpc:ldlm_completion_ast+0x0/0x5e2
 [<ffffffff80290415>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88408367>] :lustre:lt_get_mmap_locks+0x2b1/0x3a5
 [<ffffffff8024529d>] generic_file_write+0x49/0xa7
 [<ffffffff883ea8f3>] :lustre:ll_file_write+0x669/0x7e7
 [<ffffffff80216b9b>] vfs_write+0xce/0x174
 [<ffffffff802173be>] sys_write+0x45/0x6e
 [<ffffffff8025c81a>] system_call+0x86/0x8b
 [<ffffffff8025c794>] system_call+0x0/0x8b

Note: the client was running on a Xen dom0 with fairly small memory, so Lustre is not necessarily at fault.
Further details on request.

- Alastair McKinstry
On 2/21/07, Alastair McKinstry <alastair.mckinstry@ichec.ie> wrote:
> While testing 1.5.97 on kernel 2.6.18, I had the following Oops on a client.
> Has this been seen before?
>
> Unable to handle kernel paging request at ffff88000b17fff8 RIP:
>  [<ffffffff8843951a>] :osc:osc_brw_prep_request+0x3e3/0x9e2
> [...]
> Note: the client was running on a Xen dom0 with fairly small memory, so
> Lustre is not necessarily at fault.
> Further details on request.

Yeah, I found running Lustre in Xen is kinda difficult when trying to
partition out the memory; I use at least 512M for all Lustre-involved
components of my Xen/Lustre systems.

- David Brown
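[To make the memory partitioning concrete, here is a minimal sketch of how such a reservation is usually expressed on a Xen 3.x system; the file name and domain name are hypothetical illustrations, not taken from the thread.]

    # /etc/xen/lustre-client.cfg -- hypothetical domU running a Lustre client
    name   = "lustre-client"
    memory = 512        # MiB for the guest; per the suggestion above, at least 512M

    # For a Lustre client running in dom0 (as in the original report), dom0 memory
    # is instead pinned on the hypervisor line in the bootloader, e.g. in GRUB:
    #   kernel /boot/xen.gz dom0_mem=512M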
On Feb 21, 2007 08:30 -0800, David Brown wrote:
> > While testing 1.5.97 on kernel 2.6.18, I had the following Oops on a
> > client.
> >
> > Unable to handle kernel paging request at ffff88000b17fff8 RIP:
> >  [<ffffffff8843951a>] :osc:osc_brw_prep_request+0x3e3/0x9e2

Can you please decode this to a line number? If you do "gdb .../osc.ko" and
"list *(osc_brw_prep_request+0x3e3)" it should give you a line number.

> > Note: the client was running on a Xen dom0 with fairly small memory, so
> > Lustre is not necessarily at fault.
> > Further details on request.
>
> Yeah, I found running Lustre in Xen is kinda difficult when trying to
> partition out the memory; I use at least 512M for all Lustre-involved
> components of my Xen/Lustre systems.

Strange. I've run 1.4 with client + MDS + 5 OSTs in a single 96MB UML.
The memory allocated by Lustre generally scales as a function of RAM in
the system, so that it can at least function in lower-memory environments.
That said, I haven't run 1.6 in this environment.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
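[For concreteness, a minimal sketch of the gdb session suggested above; the find step and placeholder path are illustrative assumptions, not from the thread, and the osc module must have been built with debug symbols for a source line to resolve.]

    # Locate the osc module actually loaded on the client; the install path
    # varies between Lustre packages, so search rather than guess.
    find /lib/modules/$(uname -r) -name osc.ko

    # Load it into gdb and resolve the faulting offset to a source line.
    gdb /path/to/osc.ko
    (gdb) list *(osc_brw_prep_request+0x3e3)
    # gdb prints the source file and line corresponding to offset 0x3e3
    # within osc_brw_prep_request, which is what the RIP in the oops points at.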
> Strange. I've run 1.4 with client + MDS + 5 OSTs in a single 96MB UML.
> The memory allocated by Lustre generally scales as a function of RAM in
> the system, so that it can at least function in lower-memory environments.
> That said, I haven't run 1.6 in this environment.

Well, from my experience with Xen, I know he's working from an unstable
version of Xen, since Xen 3.0.4 is a 2.6.16.33-based system and only their
SCM repos have the 2.6.18 kernel. Also, I've seen funny things happen when a
dom0/domU starts running out of memory with Xen. It really depends on how
your system is set up; I've seen Xen kill processes simply because it's out
of memory.

Also, if you give dom0 512M of memory you really end up with 437M shown by
/proc/meminfo, so depending on how small he's got memory set to, it might
have less than he thought. I think this has something to do with the xen.gz
hypervisor taking up space.
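[As an illustration of the dom0 memory discrepancy described above, a sketch assuming the standard Xen 3.x xm toolstack is available in dom0.]

    # Memory the hypervisor has assigned to dom0, in MiB:
    xm list Domain-0

    # Memory the dom0 kernel itself can see; with a 512M dom0, MemTotal here
    # typically reads noticeably lower (~437M in the example above), since the
    # hypervisor's own footprint and reservations are not visible to the kernel:
    grep MemTotal /proc/meminfo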