Hi,

After a power outage, I am having difficulties mounting an OST. I am running Lustre 1.6.3, and I get a panic on the OSS when I try to mount the OST. There are several other OSTs on the same system, and they mount properly. I tried running an fsck and a tunefs.lustre --writeconf on the OST, but the problem is still the same.

Any ideas?

Thanks,

Franck
Hello!

On Dec 12, 2007, at 11:39 AM, Franck Martinaux wrote:

> After a power outage, I am having difficulties mounting an OST.
> I am running Lustre 1.6.3, and I get a panic on the OSS when I try to
> mount the OST.

It would greatly help us if you could show us the panic message and, if possible, the stack trace.

Bye,
    Oleg
On Dec 12, 17:51, Oleg Drokin <Oleg.Dro... at Sun.COM> wrote:

> It would greatly help us if you could show us the panic message and,
> if possible, the stack trace.

Hi,

Please find below all the information we got this morning.

Environment
===========

,----
| [root@oss01 ~]# uname -a
| Linux oss01.data.cluster 2.6.9-55.0.9.EL_lustre.1.6.3smp #1 SMP Sun Oct 7 20:08:31 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
| [root@oss01 ~]#
`----

Mount of this specific OST
==========================

,----
| [root@oss01 ~]# mount -t lustre /dev/mpath/mpath0 /mnt/lustre/ost1
| Read from remote host oss01: Connection reset by peer
| Connection to oss01 closed.
| [ddn@admin01 ~]$
`----

/var/log/messages during the operation
======================================

--8<---------------cut here---------------start------------->8---
Dec 13 08:36:04 oss01 sshd(pam_unix)[13469]: session opened for user root by root(uid=0)
Dec 13 08:36:20 oss01 kernel: kjournald starting. Commit interval 5 seconds
Dec 13 08:36:20 oss01 kernel: LDISKFS FS on dm-1, internal journal
Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: recovery complete.
Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Dec 13 08:36:20 oss01 kernel: kjournald starting. Commit interval 5 seconds
Dec 13 08:36:20 oss01 kernel: LDISKFS FS on dm-1, internal journal
Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: file extents enabled
Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mballoc enabled
Dec 13 08:36:20 oss01 kernel: Lustre: OST lustre-OST0002 now serving dev (lustre-OST0002/0258906d-8eca-ba98-4e3d-19adfa472914) with recovery enabled
Dec 13 08:36:20 oss01 kernel: Lustre: Server lustre-OST0002 on device /dev/mpath/mpath1 has started
Dec 13 08:36:21 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
Dec 13 08:36:21 oss01 kernel: LustreError: Skipped 4 previous similar messages
Dec 13 08:36:21 oss01 kernel: LustreError: 13664:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@00000102244efe00 x146203/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Dec 13 08:36:21 oss01 kernel: LustreError: 13664:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 4 previous similar messages
Dec 13 08:36:41 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
Dec 13 08:36:41 oss01 kernel: LustreError: 13665:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@000001021d1d6800 x146233/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Dec 13 08:37:01 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
Dec 13 08:37:01 oss01 kernel: LustreError: 13666:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@0000010006b95c00 x146264/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Dec 13 08:37:21 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
Dec 13 08:37:21 oss01 kernel: LustreError: 13667:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@00000100cfe9ba00 x146300/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Dec 13 08:37:41 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
Dec 13 08:37:41 oss01 kernel: LustreError: Skipped 5 previous similar messages
Dec 13 08:37:41 oss01 kernel: LustreError: 13668:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@0000010037e88e00 x146373/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Dec 13 08:37:41 oss01 kernel: LustreError: 13668:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 5 previous similar messages
Dec 13 08:37:47 oss01 kernel: Lustre: Failing over lustre-OST0002
Dec 13 08:37:47 oss01 kernel: Lustre: *** setting obd lustre-OST0002 device 'unknown-block(253,1)' read-only ***
Dec 13 08:37:47 oss01 kernel: Turning device dm-1 (0xfd00001) read-only
Dec 13 08:37:47 oss01 kernel: Lustre: lustre-OST0002: shutting down for failover; client state will be preserved.
Dec 13 08:37:47 oss01 kernel: Lustre: OST lustre-OST0002 has stopped.
Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 blocks 1 reqs (0 success)
Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 extents scanned, 1 goal hits, 0 2^N hits, 0 breaks, 0 lost
Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 generated and it took 12560
Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 256 preallocated, 0 discarded
Dec 13 08:37:47 oss01 kernel: Removing read-only on dm-1 (0xfd00001)
Dec 13 08:37:47 oss01 kernel: Lustre: server umount lustre-OST0002 complete
Dec 13 08:37:57 oss01 sshd(pam_unix)[13946]: session opened for user root by root(uid=0)
Dec 13 08:38:18 oss01 kernel: kjournald starting. Commit interval 5 seconds
Dec 13 08:38:18 oss01 kernel: LDISKFS FS on dm-0, internal journal
Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Dec 13 08:38:18 oss01 kernel: kjournald starting. Commit interval 5 seconds
Dec 13 08:38:18 oss01 kernel: LDISKFS FS on dm-0, internal journal
Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: file extents enabled
Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mballoc enabled
Dec 13 08:43:52 oss01 syslogd 1.4.1: restart.
Dec 13 08:43:52 oss01 syslog: syslogd startup succeeded
Dec 13 08:43:52 oss01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
--8<---------------cut here---------------end--------------->8---

We have to do a power cycle to connect again
============================================

,----
| # ipmitool -I lan -H 192.168.99.101 -U $login -P $password power cycle
`----

The OST fsck seems correct
==========================

,----
| [root@oss01 log]# fsck.ext2 /dev/mpath/mpath0
| e2fsck 1.40.2.cfs1 (12-Jul-2007)
| lustre-OST0030: recovering journal
| lustre-OST0030: clean, 227/244195328 files, 15614685/976760320 blocks
| [root@oss01 log]#
`----

tunefs.lustre reads mpath0 information correctly
================================================

,----
| [root@oss01 log]# tunefs.lustre /dev/mpath/mpath0
| checking for existing Lustre data: found CONFIGS/mountdata
| Reading CONFIGS/mountdata
|
| Read previous values:
| Target:     lustre-OST0030
| Index:      48
| Lustre FS:  lustre
| Mount type: ldiskfs
| Flags:      0x142
|             (OST update writeconf )
| Persistent mount opts: errors=remount-ro,extents,mballoc
| Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80
|
|
| Permanent disk data:
| Target:     lustre-OST0030
| Index:      48
| Lustre FS:  lustre
| Mount type: ldiskfs
| Flags:      0x142
|             (OST update writeconf )
| Persistent mount opts: errors=remount-ro,extents,mballoc
| Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80
|
| Writing CONFIGS/mountdata
| [root@oss01 log]#
`----

DDN LUN is ready and working correctly
======================================

,----[ OSS view ]
| [root@oss01 log]# multipath -l | grep mpath0
| mpath0 (360001ff00fd4922302000800001d1c17)
| [root@oss01 log]#
`----

,----[ S2A9550 view ]
| [ddn@admin01 ~]$ s2a -h 10.141.0.92 -e "lun list" | grep -i 0fd492230200
|     2    1    Ready    3815470    0FD492230200
| [ddn@admin01 ~]$
`----

Stack trace (we got it from OSS02 via the serial line during a mount attempt; the serial console interleaved several messages, so the recoverable lines are shown deinterleaved below)
====================================================================================================================================================================================

--8<---------------cut here---------------start------------->8---
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
LustreError: 134-6: Trying to start OBD lustre-OST0030_UUID using the wrong disk lustre-OST0000_UUID. Were the /dev/ assignments rearranged?
LustreError: 10203:0:(filter.c:1022:filter_prep()) cannot read last_rcvd: rc = -22
LustreError: 10203:0:(obd_config.c:325:class_setup()) setup lustre-OST0030 failed (-22)
LustreError: 10203:0:(obd_config.c:1062:class_config_llog_handler()) Err -22 on cfg command: cmd=cf003 0:lustre-OST0030 1:dev 2:type 3:f
LustreError: 15b-f: MGC10.143.0.5@tcp: The configuration from log 'lustre-OST0030' failed (-22). Make sure this client and the MGS are running compatible versions of Lustre.
LustreError: 15c-8: MGC10.143.0.5@tcp: The configuration from log 'lustre-OST0030' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 10203:0:(obd_mount.c:1082:server_start_targets()) failed to start server lustre-OST0030: -22
LustreError: 10203:0:(obd_mount.c:1573:server_fill_super()) Unable to start targets: -22
LustreError: 10203:0:(obd_config.c:392:class_cleanup()) Device 2 not setup
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at spinlock:119
invalid operand: 0000 [1] SMP
CPU 3
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) lquota(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) lvfs(U) libcfs(U) md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) ds(U) yenta_socket(U) pcmcia_core(U) dm_mirror(U) dm_round_robin(U) dm_multipath(U) joydev(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U) myri10ge(U) bnx2(U) ext3(U) jbd(U) dm_mod(U) qla2400(U) ata_piix(U) megaraid_sas(U) qla2xxx(U) scsi_transport_fc(U) sd_mod(U) multipath(U)
Pid: 10286, comm: ptlrpcd Tainted: GF 2.6.9-55.0.9.EL_lustre.1.6.3smp
RIP: 0010:[<ffffffff80321465>] <ffffffff80321465>{__lock_text_start+32}
RSP: 0018:0000010218cd9bc8 EFLAGS: 00010216
RAX: 0000000000000016 RBX: 000001021654e4bc RCX: 0000000000020000
RDX: 000000000000baa7 RSI: 0000000000000246 RDI: ffffffff80396fc0
RBP: 000001021654e4a0 R08: 00000000fffffffe R09: 000001021654e4bc
R10: 0000000000000000 R11: 0000000000000000 R12: 00000102196e6058
R13: 00000102196e6000 R14: 0000010218cd9eb8 R15: 0000010218cd9e58
FS: 0000002a9557ab00(0000) GS:ffffffff804a6880(0000) knlGS: 0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a95557000 CR3: 0000000228514000 CR4: 00000000000006e0
Process ptlrpcd (pid: 10286, threadinfo 0000010218cd8000, task 00000102170b4030)
Stack: 000001021654e4bc ffffffffa03a2121 000001021a99304e ffffffffa03b32a0
       000001021654e0b0 ffffffffa04d6510 0000008000000000 0000000000000000
       0000000000000000 00000102203920c0
Call Trace:
 <ffffffffa03a2121>{:lquota:filter_quota_clearinfo+49}
 <ffffffffa04d6510>{:obdfilter:filter_destroy_export+560}
 <ffffffff80131923>{recalc_task_prio+337}
 <ffffffffa02586fd>{:obdclass:class_export_destroy+381}
 <ffffffffa025c336>{:obdclass:obd_zombie_impexp_cull+150}
 <ffffffffa0318345>{:ptlrpc:ptlrpcd_check+229}
 <ffffffffa031883a>{:ptlrpc:ptlrpcd+874}
 <ffffffff80133566>{default_wake_function+0}
 <ffffffffa02eb450>{:ptlrpc:ptlrpc_expired_set+0}
 <ffffffffa02eb450>{:ptlrpc:ptlrpc_expired_set+0}
 <ffffffff80133566>{default_wake_function+0}
 <ffffffff80110de3>{child_rip+8}
 <ffffffffa03184d0>{:ptlrpc:ptlrpcd+0}
 <ffffffff80110ddb>{child_rip+0}

Code: 0f 0b 04 c2 33 80 ff ff ff ff 77 00 f0 ff 0b 0f 88 8b 03 00
RIP <ffffffff80321465>{__lock_text_start+32} RSP <0000010218cd9bc8>
<0>Kernel panic - not syncing: Oops
LDISKFS-fs: mballoc: 1 blocks 1 reqs (0 success)
--8<---------------cut here---------------end--------------->8---

If you need more information or debugging output, feel free to ask. The problem occurs only with this OST.

Thanks, Ludo

--
Ludovic Francois    +33 (0)6 14 77 26 93
System Engineer     DataDirect Networks
Hi,

On 13 Dec 2007, at 10:55, Ludovic Francois wrote:

> [...]
> Dec 13 08:36:20 oss01 kernel: Lustre: OST lustre-OST0002 now serving
> dev (lustre-OST0002/0258906d-8eca-ba98-4e3d-19adfa472914) with
> recovery enabled

OK: because this is the only device I can see being mounted in this log, I assume that at this moment /dev/mpath0 = lustre-OST0002.

> [...]
> The OST fsck seems correct
> ==========================
>
> ,----
> | [root@oss01 log]# fsck.ext2 /dev/mpath/mpath0
> | e2fsck 1.40.2.cfs1 (12-Jul-2007)
> | lustre-OST0030: recovering journal
> | lustre-OST0030: clean, 227/244195328 files, 15614685/976760320 blocks
> | [root@oss01 log]#
> `----

This is after the power cycle, right? And now your mpath0 on the same server claims that it is lustre-OST0030. Isn't this strange? My first guess would be that your multipath devices are being mixed up every time you reboot your server.
Make sure that your multipath bindings file is the same on all servers, or create your own aliases based on the WWID of each LUN in /etc/multipath.conf.

> [...]
> LustreError: 134-6: Trying to start OBD lustre-OST0030_UUID using the
> wrong disk lustre-OST0000_UUID. Were the /dev/ assignments rearranged?

This seems to confirm my theory about mixed-up block devices.
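For example (a sketch only, not our exact production file: the alias name here is made up, and the WWID shown is the one multipath printed for mpath0 earlier in this thread; check that you do not already have a multipaths section before appending one):

# note the WWID printed in parentheses next to each map
multipath -l
# pin the name to the WWID so it never depends on SCSI scan order
cat >> /etc/multipath.conf <<'EOF'
multipaths {
        multipath {
                wwid   360001ff00fd4922302000800001d1c17
                alias  ost0030
        }
}
EOF
# rebuild the maps (only safe with no Lustre targets mounted)
multipath -F && multipath -v2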
> [...]

I hope this helps.

Cheers,

Wojciech

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517
Hi,

I would just like to add that you can do a very simple test to see whether mpath is working correctly. On your server oss01, run tunefs.lustre --print /dev/<each_mpath_device> and write down the target name for each mpath device. Reboot the server, do the same again, and compare whether the mpath -> target map is the same as it was before the reboot.
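A quick way to capture and compare that map (a sketch, assuming the OSTs sit on /dev/mpath/mpath0..mpath23 as elsewhere in this thread):

for i in $(seq 0 23); do
  dev=/dev/mpath/mpath$i
  # the first "Target:" line of the tunefs output is the current on-disk value
  echo "$dev $(tunefs.lustre --print $dev 2>/dev/null | awk '/Target:/{print $2; exit}')"
done > /tmp/mpath-map.before
# reboot, rerun the loop with "> /tmp/mpath-map.after", then:
diff /tmp/mpath-map.before /tmp/mpath-map.after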
Cheers

Wojciech

On 13 Dec 2007, at 10:55, Ludovic Francois wrote:

> [...]

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517
Hi Wojciech,

Here is more info:

[root@oss01 ~]# multipath -l mpath0
mpath0 (360001ff00fd4922302000800001d1c17)
[size=3726 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [active]
 \_ 2:0:0:2 sdaa 65:160 [active]
\_ round-robin 0 [enabled]
 \_ 1:0:0:2 sdc  8:32   [active]

[root@oss01 ~]# tunefs.lustre --print /dev/sdaa
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     lustre-OST0030
Index:      48
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80

Permanent disk data:
Target:     lustre-OST0030
Index:      48
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80

exiting before disk write.

[root@oss01 ~]# tunefs.lustre --print /dev/sdc
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
[... identical to the /dev/sdaa output above: Target lustre-OST0030, Index 48, same flags and parameters ...]
exiting before disk write.

Any ideas?

Regards

Franck

On 13 Dec 2007, at 12:53, Wojciech Turek wrote:
> Hi,
>
> I would just like to add that you can do a very simple test to see
> whether mpath is working correctly. On your server oss01, run
> tunefs.lustre --print /dev/<each_mpath_device> and write down the
> target name for each mpath device. Reboot the server, do the same
> again, and compare whether the mpath -> target map is the same as it
> was before the reboot.
> [...]
On Dec 13, 2007 12:53 PM, Wojciech Turek <wjt27@cam.ac.uk> wrote:

> Hi,

Hello Wojciech,

> I would just like to add that you can do a very simple test to see
> whether mpath is working correctly.

By the way, we are reworking this to integrate multipath nicely with the DDN disk array, generating the multipath.conf file from the LUN WWIDs and S2A labels.

> On your server oss01, run tunefs.lustre --print
> /dev/<all_mpath_devices> and write down the target name for each
> mpath device. Reboot the server, do the same again, and compare
> whether the mpath -> target map is the same as it was before the
> reboot.

I just did it, and it is exactly the same across two different reboots. I grant you that it is possible the device names moved at some point, given our current configuration. But how do you explain that the UUID can change over time? The UUID of an mpath device and the UUID of its corresponding sd devices should always be the same. Am I wrong?

I turned off multipathing, and I got exactly the same issue:

,----
| LustreError: 134-6: Trying to start OBD lustre-OST0030_UUID using the
| wrong disk lustre-OST0000_UUID. Were the /dev/ assignments rearranged?
`----

I think multipath is not the problem right now, but it is a weakness; maybe multipathing helped someone run a wrong command. Do you think it is possible someone overwrote the "label" with a tunefs command? I have already seen that happen with other filesystems.

Ludo,

--
Ludovic Francois    +33 (0)6 20 67 05 42
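One direct way to check that (a sketch using the same getuid_callout device-mapper-multipath uses on these systems; sdaa and sdc are the two paths behind mpath0 shown earlier in the thread):

/sbin/scsi_id -g -u -s /block/sdaa   # WWID via path 1
/sbin/scsi_id -g -u -s /block/sdc    # WWID via path 2; must match path 1
multipath -l mpath0 | head -1        # WWID multipath has bound to the map name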
On Dec 13, 2:59 pm, "Ludovic Francois" <lfranc... at gmail.com> wrote:
> Do you think it's possible someone overwrote the "label" with a tunefs
> command?

Or the system did.

> I already saw it with some other file system.

Here are two commands that point in that direction; lustre-OST0030 should
not exist:

[root at oss01 ~]# for i in $(seq 0 23); do tunefs.lustre --print /dev/mpath/mpath$i | grep Target; done | sort | uniq
Target: lustre-OST0001
Target: lustre-OST0002
Target: lustre-OST0003
Target: lustre-OST0004
Target: lustre-OST0005
Target: lustre-OST0006
Target: lustre-OST0007
Target: lustre-OST0008
Target: lustre-OST0009
Target: lustre-OST000a
Target: lustre-OST000b
Target: lustre-OST000c
Target: lustre-OST000d
Target: lustre-OST000e
Target: lustre-OST000f
Target: lustre-OST0010
Target: lustre-OST0011
Target: lustre-OST0012
Target: lustre-OST0013
Target: lustre-OST0014
Target: lustre-OST0015
Target: lustre-OST0016
Target: lustre-OST0017
Target: lustre-OST0030
[root at oss01 ~]#

[root at oss03 ~]# for i in $(seq 0 23); do tunefs.lustre --print /dev/mpath/mpath$i | grep Target; done | sort | uniq
Target: lustre-OST0018
Target: lustre-OST0019
Target: lustre-OST001a
Target: lustre-OST001b
Target: lustre-OST001c
Target: lustre-OST001d
Target: lustre-OST001e
Target: lustre-OST001f
Target: lustre-OST0020
Target: lustre-OST0021
Target: lustre-OST0022
Target: lustre-OST0023
Target: lustre-OST0024
Target: lustre-OST0025
Target: lustre-OST0026
Target: lustre-OST0027
Target: lustre-OST0028
Target: lustre-OST0029
Target: lustre-OST002a
Target: lustre-OST002b
Target: lustre-OST002c
Target: lustre-OST002d
Target: lustre-OST002e
Target: lustre-OST002f
[root at oss03 ~]#
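A stray label like this can also be spotted automatically. A sketch, under
the assumption that each OSS is supposed to carry 24 consecutive indices
(here 0 through 23 for oss01; adjust the ranges per OSS):

--8<---------------cut here---------------start------------->8---
#!/bin/bash
# Diff the Target labels found on disk against the indices this OSS
# is expected to carry (0..23, i.e. lustre-OST0000..lustre-OST0017).
actual=$(for i in $(seq 0 23); do
    tunefs.lustre --print /dev/mpath/mpath$i | awk '/Target:/ {print $2; exit}'
done | sort -u)
expected=$(for i in $(seq 0 23); do printf 'lustre-OST%04x\n' "$i"; done | sort -u)
# Lines prefixed '<' are expected but missing; '>' are unexpected labels.
diff <(echo "$expected") <(echo "$actual")
--8<---------------cut here---------------end--------------->8---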
On Dec 13, 3:12 pm, Ludovic Francois <lfranc... at gmail.com> wrote:
> On Dec 13, 2:59 pm, "Ludovic Francois" <lfranc... at gmail.com> wrote:
>
> > Do you think it's possible someone overwrote the "label" with a tunefs
> > command?
>
> or the system
>
> > I already saw it with some other file system.

We recreated the Target and Index with the tunefs.lustre command:

--8<---------------cut here---------------start------------->8---
[root at oss01 ~]# tunefs.lustre --writeconf --index 0 /dev/mpath/mpath0
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     lustre-OST0030
Index:      48
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80

Permanent disk data:
Target:     lustre-OST0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x102
            (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80

Writing CONFIGS/mountdata
[root at oss01 ~]#
--8<---------------cut here---------------end--------------->8---

But now we have some problems remounting the file system. Could you
confirm that this command only rewrites the index?

Best regards, Ludo

--
Ludovic Francois +33 (0)6 14 77 26 93
System Engineer, DataDirect Networks
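Before remounting, it may be worth re-reading the label to confirm that
only the Target, Index, and Flags were rewritten; a minimal check (the
egrep pattern is just illustrative):

--8<---------------cut here---------------start------------->8---
# Re-read the on-disk label; Target/Index should now show the new
# identity, and Flags should include "writeconf".
tunefs.lustre --print /dev/mpath/mpath0 | egrep 'Target:|Index:|Flags:'
--8<---------------cut here---------------end--------------->8---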
Hi,

Sorry for the long delay in responding, but I was away for Christmas
lunch.

Changing the index may help, but I am not certain of that. It is
definitely odd that you have OST0030 but no OST0000; that may cause
problems later with quotas when you try to turn them off. I think Lustre
will expect OST0000 to exist, and if it does not find it, Lustre will
complain and quota will not work.

However, if you do change indexes, I think you need to do it in a certain
way. I suggest the following:

# Unmount all OSTs and MDTs, and run for each target:
tunefs.lustre --reformat --index=<index> --writeconf /dev/<block_device_name>
# This needs to be done on all OSSs and on the MDS.

# Mount each target as an ldiskfs file system, again on all OSSs and on
# the MDS. For example:
mount -t ldiskfs /dev/dm-0 /mnt/mdt

# Delete the file /mnt/mdt/last_rcvd
# Mount the file system

After that you can do a writeconf for each target, then start the MGS/MDT
target, and then start the OST targets one by one, with mpath0 as the
first one.

Also have a look at our /etc/multipath.conf. As you can see it is very
static, but this way we can be sure that each dm-<number> device always
points to the same LUN:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    failover
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            tur
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_name      yes
        prio_callout            "/sbin/mpath_prio_my %n"
}

devnode_blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st|sda)[0-9]*"
        devnode "^hd[a-z]"
        devnode "^cciss!c[0-9]d[0-9]*"
}

multipaths {
        multipath {
                wwid                 360001ff007e6173300000800001d1c17
                alias                dm-0
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173401000800001d1c17
                alias                dm-1
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173502000800001d1c17
                alias                dm-2
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173906000800001d1c17
                alias                dm-3
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173a07000800001d1c17
                alias                dm-4
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173b08000800001d1c17
                alias                dm-5
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173603000800001d1c17
                alias                dm-6
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173704000800001d1c17
                alias                dm-7
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173805000800001d1c17
                alias                dm-8
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173c09000800001d1c17
                alias                dm-9
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173d0a000800001d1c17
                alias                dm-10
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173e0b000800001d1c17
                alias                dm-11
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
}

I hope this helps,
Wojciech

On 13 Dec 2007, at 15:18, Ludovic Francois wrote:

> We recreated the Target and Index with the tunefs.lustre command:
> [...]
> But now we have some problems remounting the file system. Could you
> confirm that this command only rewrites the index?
>
> Best regards, Ludo

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing Service
email: wjt27 at cam.ac.uk
tel. +44 1223 763517
On Dec 13, 2007 5:22 PM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> Hi,

Hello Wojciech,

> However, if you do change indexes, I think you need to do it in a certain
> way. I suggest the following:
> # Unmount all OSTs and MDTs, and run for each target:
> tunefs.lustre --reformat --index=<index> --writeconf
> /dev/<block_device_name>

We did that, but it was not enough: in the end there were two different
issues in the configuration, which we finished solving last night. The
system is now running fine. You can follow the end of the incident,
handled by Anand and Cliff, in Bugzilla:

https://bugzilla.lustre.org/show_bug.cgi?id=14470

Thanks for your help,
Ludo

--
Ludovic Francois +33 (0)6 20 67 05 42