Hi all,

One of our OSS nodes, running the Lustre kernel "2.6.18-194.17.1.el5_lustre.1.8.5", crashed today.

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/2.6.18-194.17.1.el5_lustre.1.8.5//vmlinux
    DUMPFILE: /var/crash/2011-10-10-15:30/vmcore
        CPUS: 8
        DATE: Mon Oct 10 15:29:15 2011
      UPTIME: 05:30:52
LOAD AVERAGE: 3.74, 2.14, 1.57
       TASKS: 983
    NODENAME:
     RELEASE: 2.6.18-194.17.1.el5_lustre.1.8.5
     VERSION: #1 SMP Mon Nov 15 15:48:43 MST 2010
     MACHINE: x86_64  (2399 Mhz)
      MEMORY: 23.6 GB
       PANIC: "Oops: 0000 [1] SMP " (check log for details)

Here is the end of the crash dump log:

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
 [<ffffffff8895e7c6>] :obdfilter:filter_preprw+0x1746/0x1e00
PGD 2f8e86067 PUD 31c2c4067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 4
Modules linked in: autofs4(U) hidp(U) obdfilter(U) fsfilt_ldiskfs(U) ost(U)
 mgc(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
 ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rfcomm(U)
 l2cap(U) bluetooth(U) sunrpc(U) dm_multipath(U) scsi_dh(U) video(U)
 backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U)
 dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U)
 ac(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) parport_pc(U) lp(U) parport(U)
 joydev(U) ixgbe(U) 8021q(U) hpilo(U) sg(U) shpchp(U) dca(U) serio_raw(U)
 pcspkr(U) bnx2(U) dm_raid45(U) dm_message(U) dm_region_hash(U)
 dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U)
 usb_storage(U) lpfc(U) scsi_transport_fc(U) cciss(U) sd_mod(U) scsi_mod(U)
 ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 4252, comm: ll_ost_io_12 Tainted: G 2.6.18-194.17.1.el5_lustre.1.8.5 #1
RIP: 0010:[<ffffffff8895e7c6>] [<ffffffff8895e7c6>] :obdfilter:filter_preprw+0x1746/0x1e00
RSP: 0018:ffff81030bdcd8c0 EFLAGS: 00010206
RAX: 0000000000000021 RBX: 0000000000000000 RCX: ffff810011017300
RDX: ffff8101067b4c90 RSI: 000000000000000e RDI: 3533313130323331
RBP: ffff81030bdd1388 R08: ffff81061ff40b03 R09: 0000000000001000
R10: 0000000000000000 R11: 00000000000200d2 R12: 000000000000007e
R13: 000000000007e000 R14: 0000000000000100 R15: 0000000000000100
FS: 00002ba793935220(0000) GS:ffff81010af99240(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000002f92d1000 CR4: 00000000000006e0
Process ll_ost_io_12 (pid: 4252, threadinfo ffff81030bdcc000, task ffff81030bd17080)
Stack: 0000000000000000 ffff81031fc503c0 ffff8102c44c7200 0000000000000000
 ffff81031fc503c0 00020000c0a83281 00020000ca7a214e ffffffff88539543
 0000000000000000 ffffffff885a0d80 ffff8102c44c7200 ffffffff8853ba03
Call Trace:
 [<ffffffff88539543>] :lnet:lnet_ni_send+0x93/0xd0
 [<ffffffff885a0d80>] :obdclass:class_handle2object+0xe0/0x170
 [<ffffffff8853ba03>] :lnet:lnet_send+0x9a3/0x9d0
 [<ffffffff8002b84a>] truncate_inode_pages_range+0x222/0x2ba
 [<ffffffff88908ffc>] :ost:ost_brw_write+0xf9c/0x2480
 [<ffffffff8864a658>] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
 [<ffffffff886158b0>] :ptlrpc:target_committed_to_req+0x40/0x120
 [<ffffffff8864eb05>] :ptlrpc:lustre_msg_get_version+0x35/0xf0
 [<ffffffff8864ea15>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0
 [<ffffffff8008cf93>] default_wake_function+0x0/0xe
 [<ffffffff8864ebc8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
 [<ffffffff8890d08e>] :ost:ost_handle+0x2bae/0x55b0
 [<ffffffff80150d56>] __next_cpu+0x19/0x28
 [<ffffffff800767ae>] smp_send_reschedule+0x4e/0x53
 [<ffffffff8865e15a>] :ptlrpc:ptlrpc_server_handle_request+0x97a/0xdf0
 [<ffffffff8865e8a8>] :ptlrpc:ptlrpc_wait_event+0x2d8/0x310
 [<ffffffff8008b3bd>] __wake_up_common+0x3e/0x68
 [<ffffffff8865f817>] :ptlrpc:ptlrpc_main+0xf37/0x10f0
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8865e8e0>] :ptlrpc:ptlrpc_main+0x0/0x10f0
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: 44 39 23 7e 0c 48 83 c5 28 e9 56 fd ff ff 45 31 ed 48 8d bc
RIP [<ffffffff8895e7c6>] :obdfilter:filter_preprw+0x1746/0x1e00
 RSP <ffff81030bdcd8c0>

And here is the bt result:

crash> bt
PID: 4252   TASK: ffff81030bd17080   CPU: 4   COMMAND: "ll_ost_io_12"
 #0 [ffff81030bdcd620] crash_kexec at ffffffff800ad9c4
 #1 [ffff81030bdcd6e0] __die at ffffffff80065157
 #2 [ffff81030bdcd720] do_page_fault at ffffffff80066dd7
 #3 [ffff81030bdcd810] error_exit at ffffffff8005dde9
    [exception RIP: filter_preprw+5958]
    RIP: ffffffff8895e7c6  RSP: ffff81030bdcd8c0  RFLAGS: 00010206
    RAX: 0000000000000021  RBX: 0000000000000000  RCX: ffff810011017300
    RDX: ffff8101067b4c90  RSI: 000000000000000e  RDI: 3533313130323331
    RBP: ffff81030bdd1388   R8: ffff81061ff40b03   R9: 0000000000001000
    R10: 0000000000000000  R11: 00000000000200d2  R12: 000000000000007e
    R13: 000000000007e000  R14: 0000000000000100  R15: 0000000000000100
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff81030bdcd8f8] lnet_ni_send at ffffffff88539543
 #5 [ffff81030bdcd918] lnet_send at ffffffff8853ba03
 #6 [ffff81030bdcd9d8] truncate_inode_pages_range at ffffffff8002b84a
 #7 [ffff81030bdcdaf8] ptlrpc_send_reply at ffffffff8864a658
 #8 [ffff81030bdcdc18] lustre_msg_get_version at ffffffff8864eb05
 #9 [ffff81030bdcdc48] lustre_msg_check_version_v2 at ffffffff8864ebc8
#10 [ffff81030bdcdca8] ost_handle at ffffffff8890d08e
#11 [ffff81030bdcde38] ptlrpc_wait_event at ffffffff8865e8a8
#12 [ffff81030bdcdf48] kernel_thread at ffffffff8005dfb1

When the crash happened, the machine appeared to be in good condition: low load, low memory usage, and low iowait.

Do you have any suggestions for avoiding this kind of crash?

Thank you very much!

---------------------------------------------------------
Lu Wang
Computing Center, Institute of High Energy Physics
Chinese Academy of Sciences
Beijing, 100049 P. R. China
Tel: (86)010-88236010 ext 105
E-mail: Lu.Wang at ihep.ac.cn
---------------------------------------------------------
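As a consistency check on the two traces above: the oops log reports the faulting instruction as filter_preprw+0x1746, while the crash bt output reports filter_preprw+5958. The following is a minimal Python sketch (not part of the original report; the helper names parse_oops_rip and parse_bt_rip are hypothetical) that parses both lines, confirms they describe the same offset, and derives the start address of filter_preprw from the RIP value.

```python
import re

# Lines copied from the oops log and the crash `bt` output in this report.
OOPS_RIP = "RIP: [<ffffffff8895e7c6>] :obdfilter:filter_preprw+0x1746/0x1e00"
BT_RIP = "[exception RIP: filter_preprw+5958]"


def parse_oops_rip(line):
    """Return (address, module, symbol, offset, size) from an oops RIP line."""
    m = re.search(
        r"\[<([0-9a-f]+)>\]\s+:(\w+):(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", line
    )
    if m is None:
        raise ValueError("unrecognized oops RIP line: %r" % line)
    addr, module, symbol, off, size = m.groups()
    return int(addr, 16), module, symbol, int(off, 16), int(size, 16)


def parse_bt_rip(line):
    """Return (symbol, decimal offset) from a crash 'exception RIP' line."""
    m = re.search(r"exception RIP: (\w+)\+(\d+)", line)
    if m is None:
        raise ValueError("unrecognized bt RIP line: %r" % line)
    return m.group(1), int(m.group(2))


addr, module, symbol, offset, size = parse_oops_rip(OOPS_RIP)
bt_symbol, bt_offset = parse_bt_rip(BT_RIP)

# The oops prints the offset in hex; crash prints it in decimal.
same_location = (symbol == bt_symbol) and (offset == bt_offset)

# Subtracting the offset from RIP gives the address where the function begins.
function_start = addr - offset

print("module:         %s" % module)
print("symbol:         %s" % symbol)
print("offset:         0x%x (%d decimal)" % (offset, offset))
print("function size:  0x%x" % size)
print("function start: 0x%x" % function_start)
print("oops and bt agree: %s" % same_location)
```

Both traces therefore point at the same instruction inside the obdfilter module. With matching obdfilter debug symbols, this offset is the value one would look up to find the source line involved; that step depends on the local debuginfo and is not shown here.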