Hi,

After a power outage, I am having difficulties mounting an OST. I am running Lustre 1.6.3, and I get a panic on the OSS when I try to mount the OST. There are several other OSTs on the same system, and they mount properly. I tried running an fsck and a tunefs.lustre --writeconf on the OST, but the problem is still the same.

Any ideas?

Thanks,

Franck
Hello!

On Dec 12, 2007, at 11:39 AM, Franck Martinaux wrote:

> After a power outage, I am having difficulties mounting an OST.
> I am running Lustre 1.6.3, and I get a panic on the OSS when I try to
> mount the OST.

It would greatly help us if you could show us the panic message and, if possible, the stack trace.

Bye,
    Oleg
On Dec 12, 17:51, Oleg Drokin <Oleg.Dro... at Sun.COM> wrote:

> It would greatly help us if you could show us the panic message and,
> if possible, the stack trace.

Hi,

Please find below all the information we got this morning.

Environment
===========

,----
| [root@oss01 ~]# uname -a
| Linux oss01.data.cluster 2.6.9-55.0.9.EL_lustre.1.6.3smp #1 SMP Sun Oct 7 20:08:31 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
| [root@oss01 ~]#
`----

Mount of this specific OST
==========================

,----
| [root@oss01 ~]# mount -t lustre /dev/mpath/mpath0 /mnt/lustre/ost1
| Read from remote host oss01: Connection reset by peer
| Connection to oss01 closed.
| [ddn@admin01 ~]$
`----

/var/log/messages during the operation
======================================

--8<---------------cut here---------------start------------->8---
Dec 13 08:36:04 oss01 sshd(pam_unix)[13469]: session opened for user root by root(uid=0)
Dec 13 08:36:20 oss01 kernel: kjournald starting. Commit interval 5 seconds
Dec 13 08:36:20 oss01 kernel: LDISKFS FS on dm-1, internal journal
Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: recovery complete.
Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Dec 13 08:36:20 oss01 kernel: kjournald starting. Commit interval 5 seconds
Dec 13 08:36:20 oss01 kernel: LDISKFS FS on dm-1, internal journal
Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: file extents enabled
Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mballoc enabled
Dec 13 08:36:20 oss01 kernel: Lustre: OST lustre-OST0002 now serving dev (lustre-OST0002/0258906d-8eca-ba98-4e3d-19adfa472914) with recovery enabled
Dec 13 08:36:20 oss01 kernel: Lustre: Server lustre-OST0002 on device /dev/mpath/mpath1 has started
Dec 13 08:36:21 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
Dec 13 08:36:21 oss01 kernel: LustreError: Skipped 4 previous similar messages
Dec 13 08:36:21 oss01 kernel: LustreError: 13664:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@00000102244efe00 x146203/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Dec 13 08:36:21 oss01 kernel: LustreError: 13664:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 4 previous similar messages
Dec 13 08:36:41 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
Dec 13 08:36:41 oss01 kernel: LustreError: 13665:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@000001021d1d6800 x146233/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Dec 13 08:37:01 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
Dec 13 08:37:01 oss01 kernel: LustreError: 13666:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@0000010006b95c00 x146264/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Dec 13 08:37:21 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
Dec 13 08:37:21 oss01 kernel: LustreError: 13667:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@00000100cfe9ba00 x146300/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Dec 13 08:37:41 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
Dec 13 08:37:41 oss01 kernel: LustreError: Skipped 5 previous similar messages
Dec 13 08:37:41 oss01 kernel: LustreError: 13668:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@0000010037e88e00 x146373/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Dec 13 08:37:41 oss01 kernel: LustreError: 13668:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 5 previous similar messages
Dec 13 08:37:47 oss01 kernel: Lustre: Failing over lustre-OST0002
Dec 13 08:37:47 oss01 kernel: Lustre: *** setting obd lustre-OST0002 device 'unknown-block(253,1)' read-only ***
Dec 13 08:37:47 oss01 kernel: Turning device dm-1 (0xfd00001) read-only
Dec 13 08:37:47 oss01 kernel: Lustre: lustre-OST0002: shutting down for failover; client state will be preserved.
Dec 13 08:37:47 oss01 kernel: Lustre: OST lustre-OST0002 has stopped.
Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 blocks 1 reqs (0 success)
Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 extents scanned, 1 goal hits, 0 2^N hits, 0 breaks, 0 lost
Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 generated and it took 12560
Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 256 preallocated, 0 discarded
Dec 13 08:37:47 oss01 kernel: Removing read-only on dm-1 (0xfd00001)
Dec 13 08:37:47 oss01 kernel: Lustre: server umount lustre-OST0002 complete
Dec 13 08:37:57 oss01 sshd(pam_unix)[13946]: session opened for user root by root(uid=0)
Dec 13 08:38:18 oss01 kernel: kjournald starting. Commit interval 5 seconds
Dec 13 08:38:18 oss01 kernel: LDISKFS FS on dm-0, internal journal
Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Dec 13 08:38:18 oss01 kernel: kjournald starting. Commit interval 5 seconds
Dec 13 08:38:18 oss01 kernel: LDISKFS FS on dm-0, internal journal
Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: file extents enabled
Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mballoc enabled
Dec 13 08:43:52 oss01 syslogd 1.4.1: restart.
Dec 13 08:43:52 oss01 syslog: syslogd startup succeeded
Dec 13 08:43:52 oss01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
--8<---------------cut here---------------end--------------->8---

We have to do a power cycle to connect again
============================================

,----
| # ipmitool -I lan -H 192.168.99.101 -U $login -P $password power cycle
`----

The OST fsck seems correct
==========================

,----
| [root@oss01 log]# fsck.ext2 /dev/mpath/mpath0
| e2fsck 1.40.2.cfs1 (12-Jul-2007)
| lustre-OST0030: recovering journal
| lustre-OST0030: clean, 227/244195328 files, 15614685/976760320 blocks
| [root@oss01 log]#
`----

tunefs.lustre reads mpath0 information correctly
================================================

,----
| [root@oss01 log]# tunefs.lustre /dev/mpath/mpath0
| checking for existing Lustre data: found CONFIGS/mountdata
| Reading CONFIGS/mountdata
|
| Read previous values:
| Target:     lustre-OST0030
| Index:      48
| Lustre FS:  lustre
| Mount type: ldiskfs
| Flags:      0x142
|             (OST update writeconf )
| Persistent mount opts: errors=remount-ro,extents,mballoc
| Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80
|
|
| Permanent disk data:
| Target:     lustre-OST0030
| Index:      48
| Lustre FS:  lustre
| Mount type: ldiskfs
| Flags:      0x142
|             (OST update writeconf )
| Persistent mount opts: errors=remount-ro,extents,mballoc
| Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80
|
| Writing CONFIGS/mountdata
| [root@oss01 log]#
`----

DDN LUN is ready and working correctly
======================================

,----[ OSS view ]
| [root@oss01 log]# multipath -l | grep mpath0
| mpath0 (360001ff00fd4922302000800001d1c17)
| [root@oss01 log]#
`----

,----[ S2A9550 view ]
| [ddn@admin01 ~]$ s2a -h 10.141.0.92 -e "lun list" | grep -i 0fd492230200
|     2    1    Ready    3815470    0FD492230200
| [ddn@admin01 ~]$
`----

Stack trace (we got it from OSS02 via the serial line during a mount attempt; the serial console interleaved several messages, so the recoverable lines are shown deinterleaved below)
====================================================================================================================================================================================

--8<---------------cut here---------------start------------->8---
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
LustreError: 134-6: Trying to start OBD lustre-OST0030_UUID using the wrong disk lustre-OST0000_UUID. Were the /dev/ assignments rearranged?
LustreError: 10203:0:(filter.c:1022:filter_prep()) cannot read last_rcvd: rc = -22
LustreError: 10203:0:(obd_config.c:325:class_setup()) setup lustre-OST0030 failed (-22)
LustreError: 10203:0:(obd_config.c:1062:class_config_llog_handler()) Err -22 on cfg command: cmd=cf003 0:lustre-OST0030 1:dev 2:type 3:f
LustreError: 15b-f: MGC10.143.0.5@tcp: The configuration from log 'lustre-OST0030' failed (-22). Make sure this client and the MGS are running compatible versions of Lustre.
LustreError: 15c-8: MGC10.143.0.5@tcp: The configuration from log 'lustre-OST0030' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 10203:0:(obd_mount.c:1082:server_start_targets()) failed to start server lustre-OST0030: -22
LustreError: 10203:0:(obd_mount.c:1573:server_fill_super()) Unable to start targets: -22
LustreError: 10203:0:(obd_config.c:392:class_cleanup()) Device 2 not setup
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at spinlock:119
invalid operand: 0000 [1] SMP
CPU 3
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) lquota(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) lvfs(U) libcfs(U) md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) ds(U) yenta_socket(U) pcmcia_core(U) dm_mirror(U) dm_round_robin(U) dm_multipath(U) joydev(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U) myri10ge(U) bnx2(U) ext3(U) jbd(U) dm_mod(U) qla2400(U) ata_piix(U) megaraid_sas(U) qla2xxx(U) scsi_transport_fc(U) sd_mod(U) multipath(U)
Pid: 10286, comm: ptlrpcd Tainted: GF 2.6.9-55.0.9.EL_lustre.1.6.3smp
RIP: 0010:[<ffffffff80321465>] <ffffffff80321465>{__lock_text_start+32}
RSP: 0018:0000010218cd9bc8 EFLAGS: 00010216
RAX: 0000000000000016 RBX: 000001021654e4bc RCX: 0000000000020000
RDX: 000000000000baa7 RSI: 0000000000000246 RDI: ffffffff80396fc0
RBP: 000001021654e4a0 R08: 00000000fffffffe R09: 000001021654e4bc
R10: 0000000000000000 R11: 0000000000000000 R12: 00000102196e6058
R13: 00000102196e6000 R14: 0000010218cd9eb8 R15: 0000010218cd9e58
FS: 0000002a9557ab00(0000) GS:ffffffff804a6880(0000) knlGS: 0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a95557000 CR3: 0000000228514000 CR4: 00000000000006e0
Process ptlrpcd (pid: 10286, threadinfo 0000010218cd8000, task 00000102170b4030)
Stack: 000001021654e4bc ffffffffa03a2121 000001021a99304e ffffffffa03b32a0
       000001021654e0b0 ffffffffa04d6510 0000008000000000 0000000000000000
       0000000000000000 00000102203920c0
Call Trace:
 <ffffffffa03a2121>{:lquota:filter_quota_clearinfo+49}
 <ffffffffa04d6510>{:obdfilter:filter_destroy_export+560}
 <ffffffff80131923>{recalc_task_prio+337}
 <ffffffffa02586fd>{:obdclass:class_export_destroy+381}
 <ffffffffa025c336>{:obdclass:obd_zombie_impexp_cull+150}
 <ffffffffa0318345>{:ptlrpc:ptlrpcd_check+229}
 <ffffffffa031883a>{:ptlrpc:ptlrpcd+874}
 <ffffffff80133566>{default_wake_function+0}
 <ffffffffa02eb450>{:ptlrpc:ptlrpc_expired_set+0}
 <ffffffffa02eb450>{:ptlrpc:ptlrpc_expired_set+0}
 <ffffffff80133566>{default_wake_function+0}
 <ffffffff80110de3>{child_rip+8}
 <ffffffffa03184d0>{:ptlrpc:ptlrpcd+0}
 <ffffffff80110ddb>{child_rip+0}

Code: 0f 0b 04 c2 33 80 ff ff ff ff 77 00 f0 ff 0b 0f 88 8b 03 00
RIP <ffffffff80321465>{__lock_text_start+32} RSP <0000010218cd9bc8>
<0>Kernel panic - not syncing: Oops
LDISKFS-fs: mballoc: 1 blocks 1 reqs (0 success)
--8<---------------cut here---------------end--------------->8---

If you need more information or debugging output, feel free to ask. The problem occurs only with this OST.

Thanks, Ludo

--
Ludovic Francois    +33 (0)6 14 77 26 93
System Engineer     DataDirect Networks
Hi,

On 13 Dec 2007, at 10:55, Ludovic Francois wrote:

> [...]
> Dec 13 08:36:20 oss01 kernel: Lustre: OST lustre-OST0002 now serving
> dev (lustre-OST0002/0258906d-8eca-ba98-4e3d-19adfa472914) with
> recovery enabled

OK: because this is the only device I can see being mounted in this log, I assume that at this moment /dev/mpath0 = lustre-OST0002.

> [...]
> The OST fsck seems correct
> ==========================
>
> ,----
> | [root@oss01 log]# fsck.ext2 /dev/mpath/mpath0
> | e2fsck 1.40.2.cfs1 (12-Jul-2007)
> | lustre-OST0030: recovering journal
> | lustre-OST0030: clean, 227/244195328 files, 15614685/976760320 blocks
> | [root@oss01 log]#
> `----

This is after the power cycle, right? And now your mpath0 on the same server claims that it is lustre-OST0030. Isn't this strange? My first guess would be that your multipath devices are being mixed up every time you reboot your server.
Make sure that your multipath bindings file is the same on all servers, or create your own aliases based on the WWID of each LUN in /etc/multipath.conf.

> [...]
> LustreError: 134-6: Trying to start OBD lustre-OST0030_UUID using the
> wrong disk lustre-OST0000_UUID. Were the /dev/ assignments rearranged?

This seems to confirm my theory about mixed-up block devices.
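For example (a sketch only, not our exact production file: the alias name here is made up, and the WWID shown is the one multipath printed for mpath0 earlier in this thread; check that you do not already have a multipaths section before appending one):

# note the WWID printed in parentheses next to each map
multipath -l
# pin the name to the WWID so it never depends on SCSI scan order
cat >> /etc/multipath.conf <<'EOF'
multipaths {
        multipath {
                wwid   360001ff00fd4922302000800001d1c17
                alias  ost0030
        }
}
EOF
# rebuild the maps (only safe with no Lustre targets mounted)
multipath -F && multipath -v2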
> [...]

I hope this helps.

Cheers,

Wojciech

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517
Hi,

I would just like to add that you can do a very simple test to see whether mpath is working correctly. On your server oss01, run tunefs.lustre --print /dev/<each_mpath_device> and write down the target name for each mpath device. Reboot the server, do the same again, and compare whether the mpath -> target map is the same as it was before the reboot.
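A quick way to capture and compare that map (a sketch, assuming the OSTs sit on /dev/mpath/mpath0..mpath23 as elsewhere in this thread):

for i in $(seq 0 23); do
  dev=/dev/mpath/mpath$i
  # the first "Target:" line of the tunefs output is the current on-disk value
  echo "$dev $(tunefs.lustre --print $dev 2>/dev/null | awk '/Target:/{print $2; exit}')"
done > /tmp/mpath-map.before
# reboot, rerun the loop with "> /tmp/mpath-map.after", then:
diff /tmp/mpath-map.before /tmp/mpath-map.after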
Cheers

Wojciech

On 13 Dec 2007, at 10:55, Ludovic Francois wrote:

> [...]

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk
tel. +441223763517
Hi Wojciech,

Here is more info:

[root@oss01 ~]# multipath -l mpath0
mpath0 (360001ff00fd4922302000800001d1c17)
[size=3726 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [active]
 \_ 2:0:0:2 sdaa 65:160 [active]
\_ round-robin 0 [enabled]
 \_ 1:0:0:2 sdc  8:32   [active]

[root@oss01 ~]# tunefs.lustre --print /dev/sdaa
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     lustre-OST0030
Index:      48
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80

Permanent disk data:
Target:     lustre-OST0030
Index:      48
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80

exiting before disk write.

[root@oss01 ~]# tunefs.lustre --print /dev/sdc
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
[... identical to the /dev/sdaa output above: Target lustre-OST0030, Index 48, same flags and parameters ...]
exiting before disk write.

Any ideas?

Regards

Franck

On 13 Dec 2007, at 12:53, Wojciech Turek wrote:
> Hi,
>
> I would just like to add that you can do a very simple test to see
> whether mpath is working correctly. On your server oss01, run
> tunefs.lustre --print /dev/<each_mpath_device> and write down the
> target name for each mpath device. Reboot the server, do the same
> again, and compare whether the mpath -> target map is the same as it
> was before the reboot.
> [...]
On Dec 13, 2007 12:53 PM, Wojciech Turek <wjt27@cam.ac.uk> wrote:

> Hi,

Hello Wojciech,

> I would just like to add that you can do a very simple test to see
> whether mpath is working correctly.

By the way, we are reworking this to integrate multipath nicely with the DDN disk array, generating the multipath.conf file from the LUN WWIDs and S2A labels.

> On your server oss01, run tunefs.lustre --print
> /dev/<all_mpath_devices> and write down the target name for each
> mpath device. Reboot the server, do the same again, and compare
> whether the mpath -> target map is the same as it was before the
> reboot.

I just did it, and it is exactly the same across two different reboots. I grant you that it is possible the device names moved at some point, given our current configuration. But how do you explain that the UUID can change over time? The UUID of an mpath device and the UUID of its corresponding sd devices should always be the same. Am I wrong?

I turned off multipathing, and I got exactly the same issue:

,----
| LustreError: 134-6: Trying to start OBD lustre-OST0030_UUID using the
| wrong disk lustre-OST0000_UUID. Were the /dev/ assignments rearranged?
`----

I think multipath is not the problem right now, but it is a weakness; maybe multipathing helped someone run a wrong command. Do you think it is possible someone overwrote the "label" with a tunefs command? I have already seen that happen with other filesystems.

Ludo,

--
Ludovic Francois    +33 (0)6 20 67 05 42
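One direct way to check that (a sketch using the same getuid_callout device-mapper-multipath uses on these systems; sdaa and sdc are the two paths behind mpath0 shown earlier in the thread):

/sbin/scsi_id -g -u -s /block/sdaa   # WWID via path 1
/sbin/scsi_id -g -u -s /block/sdc    # WWID via path 2; must match path 1
multipath -l mpath0 | head -1        # WWID multipath has bound to the map name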
On Dec 13, 2:59 pm, "Ludovic Francois" <lfranc... at gmail.com> wrote:
> Do you think it's possible someone overwrote the "label" with a tunefs
> command?

Or the system did.

> I already saw it with some other file system.

Here are two commands that point in that direction; lustre-OST0030 should
not exist:

[root at oss01 ~]# for i in $(seq 0 23); do tunefs.lustre --print /dev/mpath/mpath$i | grep Target; done | sort | uniq
Target: lustre-OST0001
Target: lustre-OST0002
Target: lustre-OST0003
Target: lustre-OST0004
Target: lustre-OST0005
Target: lustre-OST0006
Target: lustre-OST0007
Target: lustre-OST0008
Target: lustre-OST0009
Target: lustre-OST000a
Target: lustre-OST000b
Target: lustre-OST000c
Target: lustre-OST000d
Target: lustre-OST000e
Target: lustre-OST000f
Target: lustre-OST0010
Target: lustre-OST0011
Target: lustre-OST0012
Target: lustre-OST0013
Target: lustre-OST0014
Target: lustre-OST0015
Target: lustre-OST0016
Target: lustre-OST0017
Target: lustre-OST0030
[root at oss01 ~]#

[root at oss03 ~]# for i in $(seq 0 23); do tunefs.lustre --print /dev/mpath/mpath$i | grep Target; done | sort | uniq
Target: lustre-OST0018
Target: lustre-OST0019
Target: lustre-OST001a
Target: lustre-OST001b
Target: lustre-OST001c
Target: lustre-OST001d
Target: lustre-OST001e
Target: lustre-OST001f
Target: lustre-OST0020
Target: lustre-OST0021
Target: lustre-OST0022
Target: lustre-OST0023
Target: lustre-OST0024
Target: lustre-OST0025
Target: lustre-OST0026
Target: lustre-OST0027
Target: lustre-OST0028
Target: lustre-OST0029
Target: lustre-OST002a
Target: lustre-OST002b
Target: lustre-OST002c
Target: lustre-OST002d
Target: lustre-OST002e
Target: lustre-OST002f
[root at oss03 ~]#
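A stray label like this can also be spotted automatically. A sketch, under
the assumption that each OSS is supposed to carry 24 consecutive indices
(here 0 through 23 for oss01; adjust the ranges per OSS):

--8<---------------cut here---------------start------------->8---
#!/bin/bash
# Diff the Target labels found on disk against the indices this OSS
# is expected to carry (0..23, i.e. lustre-OST0000..lustre-OST0017).
actual=$(for i in $(seq 0 23); do
    tunefs.lustre --print /dev/mpath/mpath$i | awk '/Target:/ {print $2; exit}'
done | sort -u)
expected=$(for i in $(seq 0 23); do printf 'lustre-OST%04x\n' "$i"; done | sort -u)
# Lines prefixed '<' are expected but missing; '>' are unexpected labels.
diff <(echo "$expected") <(echo "$actual")
--8<---------------cut here---------------end--------------->8---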
On Dec 13, 3:12 pm, Ludovic Francois <lfranc... at gmail.com> wrote:
> On Dec 13, 2:59 pm, "Ludovic Francois" <lfranc... at gmail.com> wrote:
>
> > Do you think it's possible someone overwrote the "label" with a tunefs
> > command?
>
> or the system
>
> > I already saw it with some other file system.

We recreated the Target and Index with the tunefs.lustre command:

--8<---------------cut here---------------start------------->8---
[root at oss01 ~]# tunefs.lustre --writeconf --index 0 /dev/mpath/mpath0
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target:     lustre-OST0030
Index:      48
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80

Permanent disk data:
Target:     lustre-OST0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x102
            (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80

Writing CONFIGS/mountdata
[root at oss01 ~]#
--8<---------------cut here---------------end--------------->8---

But now we have some problems remounting the file system. Could you
confirm that this command only rewrites the index?

Best regards, Ludo

--
Ludovic Francois +33 (0)6 14 77 26 93
System Engineer, DataDirect Networks
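Before remounting, it may be worth re-reading the label to confirm that
only the Target, Index, and Flags were rewritten; a minimal check (the
egrep pattern is just illustrative):

--8<---------------cut here---------------start------------->8---
# Re-read the on-disk label; Target/Index should now show the new
# identity, and Flags should include "writeconf".
tunefs.lustre --print /dev/mpath/mpath0 | egrep 'Target:|Index:|Flags:'
--8<---------------cut here---------------end--------------->8---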
Hi,

Sorry for the long delay in responding, but I was away for Christmas
lunch.

Changing the index may help, but I am not certain of that. It is
definitely odd that you have OST0030 but no OST0000; that may cause
problems later with quotas when you try to turn them off. I think Lustre
will expect OST0000 to exist, and if it does not find it, Lustre will
complain and quota will not work.

However, if you do change indexes, I think you need to do it in a certain
way. I suggest the following:

# Unmount all OSTs and MDTs, and run for each target:
tunefs.lustre --reformat --index=<index> --writeconf /dev/<block_device_name>
# This needs to be done on all OSSs and on the MDS.

# Mount each target as an ldiskfs file system, again on all OSSs and on
# the MDS. For example:
mount -t ldiskfs /dev/dm-0 /mnt/mdt

# Delete the file /mnt/mdt/last_rcvd
# Mount the file system

After that you can do a writeconf for each target, then start the MGS/MDT
target, and then start the OST targets one by one, with mpath0 as the
first one.

Also have a look at our /etc/multipath.conf. As you can see it is very
static, but this way we can be sure that each dm-<number> device always
points to the same LUN:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    failover
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            tur
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_name      yes
        prio_callout            "/sbin/mpath_prio_my %n"
}

devnode_blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st|sda)[0-9]*"
        devnode "^hd[a-z]"
        devnode "^cciss!c[0-9]d[0-9]*"
}

multipaths {
        multipath {
                wwid                 360001ff007e6173300000800001d1c17
                alias                dm-0
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173401000800001d1c17
                alias                dm-1
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173502000800001d1c17
                alias                dm-2
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173906000800001d1c17
                alias                dm-3
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173a07000800001d1c17
                alias                dm-4
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173b08000800001d1c17
                alias                dm-5
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173603000800001d1c17
                alias                dm-6
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173704000800001d1c17
                alias                dm-7
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173805000800001d1c17
                alias                dm-8
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173c09000800001d1c17
                alias                dm-9
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173d0a000800001d1c17
                alias                dm-10
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
        multipath {
                wwid                 360001ff007e6173e0b000800001d1c17
                alias                dm-11
                path_grouping_policy failover
                path_checker         tur
                path_selector        "round-robin 0"
                failback             immediate
                rr_weight            priorities
                no_path_retry        5
        }
}

I hope this helps,
Wojciech

On 13 Dec 2007, at 15:18, Ludovic Francois wrote:

> We recreated the Target and Index with the tunefs.lustre command:
> [...]
> But now we have some problems remounting the file system. Could you
> confirm that this command only rewrites the index?
>
> Best regards, Ludo

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing Service
email: wjt27 at cam.ac.uk
tel. +44 1223 763517
On Dec 13, 2007 5:22 PM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> Hi,

Hello Wojciech,

> However, if you do change indexes, I think you need to do it in a certain
> way. I suggest the following:
> # Unmount all OSTs and MDTs, and run for each target:
> tunefs.lustre --reformat --index=<index> --writeconf
> /dev/<block_device_name>

We did that, but it was not enough: in the end there were two different
issues in the configuration, which we finished solving last night. The
system is now running fine. You can follow the end of the incident,
handled by Anand and Cliff, in Bugzilla:

https://bugzilla.lustre.org/show_bug.cgi?id=14470

Thanks for your help,
Ludo

--
Ludovic Francois +33 (0)6 20 67 05 42