Alastair McKinstry
2007-Feb-21 02:14 UTC
[Lustre-discuss] Oops with 1.5.97 (on 2.6.18 kernel)
Hi,

While testing 1.5.97 on kernel 2.6.18, I had the following Oops on a client.
Has this been seen before?

Unable to handle kernel paging request at ffff88000b17fff8 RIP:
 [<ffffffff8843951a>] :osc:osc_brw_prep_request+0x3e3/0x9e2
PGD 1675067 PUD 1676067 PMD 16cf067 PTE 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: xt_tcpudp xt_physdev iptable_filter ip_tables x_tables osc mgc lustre lov lquota mdc ksocklnd ptlrpc obdclass lnet lvfs libcfs bridge netloop ipv6 dm_snapshot dm_mirror dm_mod usbhid usbkbd ipmi_watchdog ipmi_devintf ipmi_poweroff ipmi_si ipmi_msghandler dummy loop ide_generic ide_disk evdev psmouse shpchp pci_hotplug pcspkr serio_raw i2c_piix4 i2c_core ext3 jbd mbcache sd_mod ide_cd cdrom sata_svw libata scsi_mod tg3 ehci_hcd generic ohci_hcd serverworks ide_core fan
Pid: 3638, comm: cat Not tainted 2.6.18-3-xen-amd64 #1
RIP: e030:[<ffffffff8843951a>] [<ffffffff8843951a>] :osc:osc_brw_prep_request+0x3e3/0x9e2
RSP: e02b:ffff88000fbf1798 EFLAGS: 00010287
RAX: 0000000000000001 RBX: ffff88000eb79ce0 RCX: 0000000000000000
RDX: ffff88000b180000 RSI: ffff88000b180000 RDI: ffff880026c53cb0
RBP: ffff88005c6e4170 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88005c6e4088
R13: ffff88005c701a00 R14: ffff88005c58e498 R15: 0000000000000000
FS: 00002b0e147c86d0(0000) GS:ffffffff804c3000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process cat (pid: 3638, threadinfo ffff88000fbf0000, task ffff8800626cc7f0)
Stack: ffff88000b180000 00000100802a6c2b ffff880026c53cb0 ffff88005c58e498
       0000000000000001 0000000000000010 0000001062948240 ffff880056946000
       0000000102a13e28 0000000400000000
Call Trace:
 [<ffffffff8843d611>] :osc:osc_send_oap_rpc+0x80a/0xe6d
 [<ffffffff8843ddc0>] :osc:osc_check_rpcs+0x14c/0x29b
 [<ffffffff88445751>] :osc:osc_queue_async_io+0xc2b/0xd0a
 [<ffffffff8820e41e>] :libcfs:libcfs_debug_vmsg2+0x600/0x897
 [<ffffffff883a0d38>] :lov:lov_queue_async_io+0x2f1/0x3a1
 [<ffffffff883f996f>] :lustre:queue_or_sync_write+0x2b6/0xc43
 [<ffffffff883fcc99>] :lustre:ll_commit_write+0x269/0x5df
 [<ffffffff80210ba1>] generic_file_buffered_write+0x438/0x646
 [<ffffffff883b742f>] :lov:lov_update_enqueue_set+0x345/0x3a1
 [<ffffffff8020f13c>] current_fs_time+0x3b/0x40
 [<ffffffff884424b5>] :osc:osc_enqueue+0xfb/0x48e
 [<ffffffff8021646f>] __generic_file_aio_write_nolock+0x2e4/0x32f
 [<ffffffff883ef54b>] :lustre:ll_inode_size_unlock+0x81/0xd6
 [<ffffffff802a2809>] __generic_file_write_nolock+0x8f/0xa8
 [<ffffffff882e4886>] :ptlrpc:ldlm_completion_ast+0x0/0x5e2
 [<ffffffff80290415>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88408367>] :lustre:lt_get_mmap_locks+0x2b1/0x3a5
 [<ffffffff8024529d>] generic_file_write+0x49/0xa7
 [<ffffffff883ea8f3>] :lustre:ll_file_write+0x669/0x7e7
 [<ffffffff80216b9b>] vfs_write+0xce/0x174
 [<ffffffff802173be>] sys_write+0x45/0x6e
 [<ffffffff8025c81a>] system_call+0x86/0x8b
 [<ffffffff8025c794>] system_call+0x0/0x8b

Note: the client was running on a Xen dom0 with fairly small memory, so Lustre is not necessarily at fault.
Further details on request.

- Alastair McKinstry
On 2/21/07, Alastair McKinstry <alastair.mckinstry@ichec.ie> wrote:
> While testing 1.5.97 on kernel 2.6.18, I had the following Oops on a client.
> Has this been seen before?
>
> Unable to handle kernel paging request at ffff88000b17fff8 RIP:
>  [<ffffffff8843951a>] :osc:osc_brw_prep_request+0x3e3/0x9e2
> [...]
> Note: the client was running on a Xen dom0 with fairly small memory, so
> Lustre is not necessarily at fault.
> Further details on request.

Yeah, I found running Lustre in Xen is kinda difficult when trying to
partition out the memory; I use at least 512M for all Lustre-involved
components of my Xen/Lustre systems.

- David Brown
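[To make the memory partitioning concrete, here is a minimal sketch of how such a reservation is usually expressed on a Xen 3.x system; the file name and domain name are hypothetical illustrations, not taken from the thread.]

    # /etc/xen/lustre-client.cfg -- hypothetical domU running a Lustre client
    name   = "lustre-client"
    memory = 512        # MiB for the guest; per the suggestion above, at least 512M

    # For a Lustre client running in dom0 (as in the original report), dom0 memory
    # is instead pinned on the hypervisor line in the bootloader, e.g. in GRUB:
    #   kernel /boot/xen.gz dom0_mem=512M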
On Feb 21, 2007 08:30 -0800, David Brown wrote:
> > While testing 1.5.97 on kernel 2.6.18, I had the following Oops on a
> > client.
> >
> > Unable to handle kernel paging request at ffff88000b17fff8 RIP:
> >  [<ffffffff8843951a>] :osc:osc_brw_prep_request+0x3e3/0x9e2

Can you please decode this to a line number? If you do "gdb .../osc.ko" and
"list *(osc_brw_prep_request+0x3e3)" it should give you a line number.

> > Note: the client was running on a Xen dom0 with fairly small memory, so
> > Lustre is not necessarily at fault.
> > Further details on request.
>
> Yeah, I found running Lustre in Xen is kinda difficult when trying to
> partition out the memory; I use at least 512M for all Lustre-involved
> components of my Xen/Lustre systems.

Strange. I've run 1.4 with client + MDS + 5 OSTs in a single 96MB UML.
The memory allocated by Lustre generally scales as a function of RAM in
the system, so that it can at least function in lower-memory environments.
That said, I haven't run 1.6 in this environment.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
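[For concreteness, a minimal sketch of the gdb session suggested above; the find step and placeholder path are illustrative assumptions, not from the thread, and the osc module must have been built with debug symbols for a source line to resolve.]

    # Locate the osc module actually loaded on the client; the install path
    # varies between Lustre packages, so search rather than guess.
    find /lib/modules/$(uname -r) -name osc.ko

    # Load it into gdb and resolve the faulting offset to a source line.
    gdb /path/to/osc.ko
    (gdb) list *(osc_brw_prep_request+0x3e3)
    # gdb prints the source file and line corresponding to offset 0x3e3
    # within osc_brw_prep_request, which is what the RIP in the oops points at.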
> Strange. I've run 1.4 with client + MDS + 5 OSTs in a single 96MB UML.
> The memory allocated by Lustre generally scales as a function of RAM in
> the system, so that it can at least function in lower-memory environments.
> That said, I haven't run 1.6 in this environment.

Well, from my experience with Xen, I know he's working from an unstable
version of Xen, since Xen 3.0.4 is a 2.6.16.33-based system and only their
SCM repos have the 2.6.18 kernel. Also, I've seen funny things happen when a
dom0/domU starts running out of memory with Xen. It really depends on how
your system is set up; I've seen Xen kill processes simply because it's out
of memory.

Also, if you give dom0 512M of memory you really end up with 437M shown by
/proc/meminfo, so depending on how small he's got memory set to, it might
have less than he thought. I think this has something to do with the xen.gz
hypervisor taking up space.
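[As an illustration of the dom0 memory discrepancy described above, a sketch assuming the standard Xen 3.x xm toolstack is available in dom0.]

    # Memory the hypervisor has assigned to dom0, in MiB:
    xm list Domain-0

    # Memory the dom0 kernel itself can see; with a 512M dom0, MemTotal here
    # typically reads noticeably lower (~437M in the example above), since the
    # hypervisor's own footprint and reservations are not visible to the kernel:
    grep MemTotal /proc/meminfo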