Hello again,
things have evolved over this long weekend. The kernel is still crashing, but
I've managed to get it a bit further.
At present, this is what I see in my logs:
Oct 15 15:05:44 DiskServer kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Oct 15 15:07:06 DiskServer kernel: kjournald starting. Commit interval 5
seconds
Oct 15 15:07:06 DiskServer kernel: LDISKFS FS on sdb, internal journal
Oct 15 15:07:06 DiskServer kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Oct 15 15:07:06 DiskServer kernel: Lustre:
8753:0:(mds_fs.c:239:mds_init_server_data()) mds1: initializing new last_rcvd
Oct 15 15:07:06 DiskServer kernel: Lustre: MDT mds1 now serving /dev/sdb
(ec110dc6-6bcc-4746-8134-5fa6777cabc5) with recovery enabled
Oct 15 15:07:07 DiskServer kernel: Lustre: MDT mds1 has stopped.
Oct 15 15:07:54 DiskServer kernel: kjournald starting. Commit interval 5
seconds
Oct 15 15:07:54 DiskServer kernel: LDISKFS FS on sdc, internal journal
Oct 15 15:07:54 DiskServer kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Oct 15 15:07:54 DiskServer kernel: Lustre:
8840:0:(filter.c:391:filter_init_server_data()) ost1: initializing new last_rcvd
Oct 15 15:07:54 DiskServer kernel: Lustre: OST ost1 now serving /dev/sdc
(e5fb59ef-e649-41c3-bccd-f5ff55d61ae6) with recovery enabled
Oct 15 15:07:54 DiskServer kernel: kjournald starting. Commit interval 5
seconds
Oct 15 15:07:54 DiskServer kernel: LDISKFS FS on sdd, internal journal
Oct 15 15:07:54 DiskServer kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Oct 15 15:07:54 DiskServer kernel: Lustre:
8845:0:(filter.c:391:filter_init_server_data()) ost2: initializing new last_rcvd
Oct 15 15:07:54 DiskServer kernel: Lustre: OST ost2 now serving /dev/sdd
(2bfe5d58-c150-4805-b51b-f44f496615b8) with recovery enabled
Oct 15 15:07:54 DiskServer kernel: kjournald starting. Commit interval 5
seconds
Oct 15 15:07:54 DiskServer kernel: LDISKFS FS on sdb, internal journal
Oct 15 15:07:54 DiskServer kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Oct 15 15:07:54 DiskServer kernel: LustreError:
8850:0:(llog_lvfs.c:552:llog_lvfs_create()) error looking up logfile
0x100000098393bc0:0x0: rc -2
Oct 15 15:07:54 DiskServer kernel: LustreError:
8850:0:(lov_log.c:205:lov_llog_init()) error osc_llog_init 0
Oct 15 15:07:54 DiskServer kernel: LustreError:
8850:0:(mds_log.c:197:mds_llog_init()) error lov_llog_init
Oct 15 15:07:54 DiskServer kernel: LustreError:
8850:0:(llog_obd.c:321:llog_cat_initialize()) rc: -2
Oct 15 15:07:54 DiskServer kernel: LustreError:
8850:0:(mds_lov.c:237:mds_lov_connect()) failed to initialize catalog -2
Oct 15 15:07:54 DiskServer kernel: Lustre:
8518:0:(recov_thread.c:155:llog_obd_repl_cancel()) no import for ctxt d4b43880
Oct 15 15:07:54 DiskServer kernel: Lustre:
8490:0:(recov_thread.c:155:llog_obd_repl_cancel()) no import for ctxt d350c900
Oct 15 15:07:54 DiskServer kernel: Unable to handle kernel NULL pointer
dereference at virtual address 0000018a
Oct 15 15:07:54 DiskServer kernel: printing eip:
Oct 15 15:07:54 DiskServer kernel: f98241d7
Oct 15 15:07:54 DiskServer kernel: *pde = 1124a001
Oct 15 15:07:54 DiskServer kernel: Oops: 0000 [#1]
Oct 15 15:07:54 DiskServer kernel: SMP
Oct 15 15:07:54 DiskServer kernel: Modules linked in: llite mds lov osc mdc
obdfilter fsfilt_ldiskfs ost ptlrpc obdclass lvfs ksocklnd lnet libcfs ldiskfs
loop iscsi_tcp
scsi_transport_iscsi md5 ipv6 sunrpc hw_random e1000 floppy sg ext3 jbd aic79xx
sd_mod scsi_mod
Oct 15 15:07:54 DiskServer kernel: CPU: 0
Oct 15 15:07:54 DiskServer kernel: EIP: 0060:[<f98241d7>] Not
tainted VLI
Oct 15 15:07:54 DiskServer kernel: EFLAGS: 00010282 (2.6.12.6-lustre)
Oct 15 15:07:54 DiskServer kernel: EIP is at mds_lov_clean+0xa7/0x13a0 [mds]
Oct 15 15:07:54 DiskServer kernel: eax: 00001ffe ebx: 00000000 ecx: f7fff080
edx: 00000002
Oct 15 15:07:54 DiskServer kernel: esi: f93321c8 edi: dbf14fc5 ebp: fffffffe
esp: dcccfc08
Oct 15 15:07:54 DiskServer kernel: ds: 007b es: 007b ss: 0068
Oct 15 15:07:54 DiskServer kernel: Process lctl (pid: 8850, threadinfo=dccce000
task=d1191a60)
Oct 15 15:07:54 DiskServer kernel: Stack: c01729a5 f93320b4 f93320b4 00000000
f896142f 00000000 00000000 f9331fa0
Oct 15 15:07:54 DiskServer kernel: d3c53a00 fffffffe dbf14fc0 c0179012
1e0af68c f9331f98 f93320b4 f9331f98
Oct 15 15:07:54 DiskServer kernel: f93321c8 fffffffe f9825a76 00000000
00000000 f93207c0 f9332296 00000001
Oct 15 15:07:54 DiskServer kernel: Call Trace:
Oct 15 15:07:54 DiskServer kernel: [<c01729a5>] dput+0x25/0x1d0
Oct 15 15:07:54 DiskServer kernel: [<f896142f>] pop_ctxt+0x9f/0x870
[lvfs]
Oct 15 15:07:54 DiskServer kernel: [<c0179012>] set_fs_pwd+0x62/0xc0
Oct 15 15:07:54 DiskServer kernel: [<f9825a76>] mds_postsetup+0x5a6/0xbd0
[mds]
Oct 15 15:07:54 DiskServer kernel: [<c01ce110>] vsnprintf+0x220/0x4b0
Oct 15 15:07:54 DiskServer kernel: [<f8969ff9>]
upcall_cache_init+0x59/0x870 [lvfs]
Oct 15 15:07:54 DiskServer kernel: [<f9821d4f>] mds_setup+0xd4f/0x3130
[mds]
Oct 15 15:07:54 DiskServer kernel: [<f92a0008>]
class_new_export+0x278/0xa20 [obdclass]
Oct 15 15:07:54 DiskServer kernel: [<f8bb003c>]
libcfs_debug_msg+0xdc/0x2e0 [libcfs]
Oct 15 15:07:54 DiskServer kernel: [<f92bdf73>] class_setup+0x1173/0x1580
[obdclass]
Oct 15 15:07:54 DiskServer kernel: [<f92c8eb4>]
class_process_config+0x1314/0x2680 [obdclass]
Oct 15 15:07:54 DiskServer kernel: [<f92915ea>]
class_handle_ioctl+0x32ba/0x81d0 [obdclass]
Oct 15 15:07:54 DiskServer kernel: [<c01f410b>] misc_open+0xeb/0x270
Oct 15 15:07:54 DiskServer kernel: [<c01f4020>] misc_open+0x0/0x270
Oct 15 15:07:54 DiskServer kernel: [<c0163ab5>] chrdev_open+0xc5/0x170
Oct 15 15:07:54 DiskServer kernel: [<c015b797>] get_empty_filp+0x87/0x120
Oct 15 15:07:54 DiskServer kernel: [<c016dbda>] do_ioctl+0x6a/0x90
Oct 15 15:07:54 DiskServer kernel: [<c016ddb3>] vfs_ioctl+0x63/0x1c0
Oct 15 15:07:54 DiskServer kernel: [<c016df88>] sys_ioctl+0x78/0x80
Oct 15 15:07:54 DiskServer kernel: [<c0102cd9>] syscall_call+0x7/0xb
Oct 15 15:07:54 DiskServer kernel: Code: 0c 3b 15 80 67 bc f8 0f 87 e7 0f 00 00
f6 05 e4 3a bc f8 01 74 0d f6 05 e0 3a bc f8 04 0f 85 81 0f 00 00 85 ed 0f 84 c0
0c 00 00
<8b> 85 8c 01 00 00 85 c0 0f 84 ec 0b 00 00 31 c0 85 c0 0f 84 e2
Oct 15 15:07:54 DiskServer kernel: <3>LustreError:
8586:0:(ldlm_lib.c:544:target_handle_connect()) @@@ UUID 'mds1_UUID' is not
available for connect (not set up) req@d378a800 x558/t0 o38-><?>@<?>:-1
lens 240/0 ref 0 fl Interpret:/0/0 rc 0/0
Oct 15 15:07:54 DiskServer kernel: LustreError:
8586:0:(ldlm_lib.c:1262:target_send_reply_msg()) @@@ processing error (-19)
req@d378a800 x558/t0 o38-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0
rc -19/0
Oct 15 15:07:54 DiskServer kernel: LustreError:
8674:0:(client.c:577:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err ==
-19 req@f1e7b600 x558/t0 o38->mds1_UUID@DiskServer_UUID:12 lens 240/272
ref 1 fl Rpc:R/0/0 rc 0/-19
[... with the last three lines repeating ad libitum ...]
I really hope you can help me. I know the data is still there: I've
seen my files on the OST devices!
Thanks in advance,
Giorgio Cardarelli
_____
From: Nathaniel Rutman [mailto:nathan@clusterfs.com]
To: Giorgio Cardarelli [mailto:giorgio.cardarelli@e-eureka.it]
Cc: lustre-discuss@clusterfs.com
Sent: Mon, 16 Oct 2006 19:27:32 +0200
Subject: Re: [Lustre-discuss] MDS not working anymore!
Giorgio Cardarelli wrote:
> Hello list,
> I'm having a big problem with my Lustre storage network. I have two
> computers, DiskServer (MDS and OSD1) and DiskServer1 (OSD2),
> configured with Lustre. In my configuration OSD1 + OSD2 = LOV1.
> Everything worked perfectly for several months, but today, after a
> single physical access error on OSD2, something happened to Lustre and
> everything stopped working. I have Linux OS (kernel 2.6.12.6 properly
> patched) and Lustre 1.4.6.2.
> When I start Lustre on my MDS:
>
> # lconf local.xml
> LOV: lov1 76e8c_lov1_9673c3d5bf mds1_UUID 0 1048576 0 0
> [u'ost1_UUID', u'ost2_UUID'] mds1
> OSC: OSC_DiskServer_ost1_MNT_DiskServer 76e8c_lov1_9673c3d5bf ost1_UUID
> OSC: OSC_DiskServer_ost2_MNT_DiskServer 76e8c_lov1_9673c3d5bf ost2_UUID
> MDC: MDC_DiskServer_mds1_MNT_DiskServer
> a2f96_MNT_DiskServer_105cb6d535 mds1_UUID
> MTPT: MNT_DiskServer MNT_DiskServer_UUID /storage mds1_UUID lov1_UUID
>
> and nothing more happens. This is what I see in the logfile:
>
>
>
> Oct 15 00:58:26 DiskServer kernel: Lustre:
> 2681:0:(module.c:381:init_libcfs_module()) maximum lustre stack 8192
> Oct 15 00:58:26 DiskServer kernel: Lustre: OBD class driver Build
> Version:
> 1.4.6.2-19700101010000-PRISTINE-.usr.src.linux-2.6.12.6-lustre.-2.6.12.6-lustre,
> info@clusterfs.com
> Oct 15 00:58:26 DiskServer kernel: Lustre: Added LNI 192.168.99.10@tcp
> [8/256]
> Oct 15 00:58:26 DiskServer kernel: Lustre: Accept secure, port 988
> Oct 15 00:58:26 DiskServer kernel: Lustre: Filtering OBD driver;
> info@clusterfs.com
> Oct 15 00:58:26 DiskServer kernel: Lustre: Lustre Lite Client File
> System; info@clusterfs.com
> Oct 15 00:58:27 DiskServer kernel: kjournald starting. Commit
> interval 5 seconds
> Oct 15 00:58:27 DiskServer kernel: LDISKFS FS on sdc, internal journal
> Oct 15 00:58:27 DiskServer kernel: LDISKFS-fs: mounted filesystem with
> ordered data mode.
> Oct 15 00:58:27 DiskServer kernel: kjournald starting. Commit
> interval 5 seconds
> Oct 15 00:58:27 DiskServer kernel: LDISKFS FS on sdb, internal journal
> Oct 15 00:58:27 DiskServer kernel: LDISKFS-fs: mounted filesystem with
> ordered data mode.
> Oct 15 00:58:27 DiskServer kernel: Unable to handle kernel NULL
> pointer dereference at virtual address 0000018a
> Oct 15 00:58:27 DiskServer kernel: printing eip:
> Oct 15 00:58:27 DiskServer kernel: f90e61d7
> Oct 15 00:58:27 DiskServer kernel: *pde = 33c7f001
> Oct 15 00:58:27 DiskServer kernel: Oops: 0000 [#1]
> [... skipping kernel dumps ...]
The kernel dumps are the important part, so we can see where it crashed.
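If the full log is too long to post, the crash location and call chain can be
pulled out on their own. The following is only a rough sketch, not a Lustre
tool: the function name summarize_oops and the regular expressions are
assumptions based on the syslog format shown in this thread, and they would
need adjusting for other kernels or log formats.

```python
import re

# Assumed syslog prefix, e.g. "Oct 15 15:07:54 DiskServer kernel: ",
# optionally preceded by a mail quote marker ("> ").
PREFIX = re.compile(
    r"^(?:>\s*)?\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+kernel:\s*"
)
# A call-trace frame, e.g. "[<c01729a5>] dput+0x25/0x1d0".
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\S+)")


def summarize_oops(log_text):
    """Return (faulting symbol, list of call-trace symbols) from log_text.

    Lines are stripped of their syslog prefix and joined with spaces first,
    so frames that the mailer wrapped onto two lines are still matched.
    """
    parts = [PREFIX.sub("", line.strip()) for line in log_text.splitlines()]
    flat = " ".join(p for p in parts if p)

    eip = re.search(r"EIP is at (\S+)", flat)
    frames = []
    start = flat.find("Call Trace:")
    if start != -1:
        # The trace ends where the "Code:" byte dump begins, if present.
        end = flat.find("Code:", start)
        trace = flat[start:end if end != -1 else len(flat)]
        frames = FRAME.findall(trace)
    return (eip.group(1) if eip else None), frames
```

Calling summarize_oops() on the text of the kernel log should give the
"EIP is at" symbol plus the frames listed under "Call Trace:", which is the
part worth posting in full.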