Hi All,
On Friday, 1 of the 4 I/O nodes in our Lustre network began logging the following slow start_page_write errors for some OSTs:
--------------------------------------------------- messages -----------------------------------------------------------------------------------
Jul 15 12:55:32 IO-10 heartbeat: [21053]: info: Checking status of STONITH device [IPMI STONITH device ]
Jul 15 12:55:32 IO-10 heartbeat: [16746]: info: Managed STONITH-stat process 21053 exited with return code 0.
Jul 15 13:02:25 IO-10 kernel: Lustre: 18616:0:(filter_io.c:652:filter_preprw_write()) es1-OST000b: slow start_page_write 44s
Jul 15 13:02:25 IO-10 kernel: Lustre: 18688:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 44s
Jul 15 13:02:57 IO-10 kernel: Lustre: 18531:0:(filter_io.c:652:filter_preprw_write()) es1-OST000b: slow start_page_write 38s
Jul 15 13:03:29 IO-10 kernel: Lustre: 18566:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 32s
Jul 15 13:03:29 IO-10 kernel: host2: SRP abort called
Jul 15 13:03:29 IO-10 last message repeated 4 times
Jul 15 13:03:29 IO-10 kernel: host2: SRP reset_device called
Jul 15 13:03:29 IO-10 kernel: host2: ib_srp: SRP reset_host called state 0 qp_err 0
Jul 15 13:03:39 IO-10 kernel: sd 2:0:0:12: SCSI error: return code = 0x06000000
Jul 15 13:03:39 IO-10 kernel: end_request: I/O error, dev sdag, sector 15012385240
Jul 15 13:03:39 IO-10 kernel: device-mapper: multipath: Failing path 66:0.
Jul 15 13:03:39 IO-10 kernel: sd 2:0:0:12: SCSI error: return code = 0x06000000
Jul 15 13:03:39 IO-10 kernel: end_request: I/O error, dev sdag, sector 15012386176
Jul 15 13:03:39 IO-10 kernel: sd 2:0:0:12: SCSI error: return code = 0x06000000
Jul 15 13:03:39 IO-10 kernel: end_request: I/O error, dev sdag, sector 15012386696
Jul 15 13:03:39 IO-10 kernel: sd 2:0:0:12: SCSI error: return code = 0x06000000
Jul 15 13:03:39 IO-10 kernel: end_request: I/O error, dev sdag, sector 15012385256
Jul 15 13:03:39 IO-10 kernel: sd 2:0:0:12: SCSI error: return code = 0x06000000
Jul 15 13:03:39 IO-10 kernel: end_request: I/O error, dev sdag, sector 15012385568
Jul 15 13:03:39 IO-10 multipathd: 66:0: mark as failed
Jul 15 13:03:39 IO-10 multipathd: lun_12: remaining active paths: 1
Jul 15 13:04:35 IO-10 kernel: LustreError: 18538:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 50s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18447:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 31s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18525:0:(filter_io.c:652:filter_preprw_write()) es1-OST000b: slow start_page_write 46s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18598:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) es1-OST000b: slow journal start 56s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18393:0:(filter_io_26.c:792:filter_commitrw_write()) es1-OST000b: slow commitrw commit 56s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18393:0:(filter_io_26.c:792:filter_commitrw_write()) Skipped 1 previous similar message
Jul 15 13:04:35 IO-10 kernel: Lustre: 18739:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) es1-OST000b: slow journal start 46s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18739:0:(filter_io_26.c:792:filter_commitrw_write()) es1-OST000b: slow commitrw commit 46s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18422:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) es1-OST000b: slow journal start 31s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18422:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) Skipped 5 previous similar messages
Jul 15 13:04:35 IO-10 kernel: Lustre: 18422:0:(filter_io_26.c:792:filter_commitrw_write()) es1-OST000b: slow commitrw commit 31s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18422:0:(filter_io_26.c:792:filter_commitrw_write()) Skipped 5 previous similar messages
Jul 15 13:04:35 IO-10 kernel: LustreError: 18644:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) es1-OST000b: slow journal start 56s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18644:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) Skipped 5 previous similar messages
Jul 15 13:04:35 IO-10 kernel: LustreError: 18644:0:(filter_io_26.c:792:filter_commitrw_write()) es1-OST000b: slow commitrw commit 56s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18644:0:(filter_io_26.c:792:filter_commitrw_write()) Skipped 4 previous similar messages
Jul 15 13:04:35 IO-10 kernel: LustreError: 18598:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) Skipped 4 previous similar messages
Jul 15 13:04:35 IO-10 kernel: LustreError: 18766:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 55s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18766:0:(filter_io_26.c:779:filter_commitrw_write()) Skipped 1 previous similar message
Jul 15 13:05:07 IO-10 kernel: Lustre: 18650:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 31s
Jul 15 13:05:07 IO-10 kernel: Lustre: 18650:0:(filter_io_26.c:779:filter_commitrw_write()) Skipped 2 previous similar messages
Jul 15 13:05:08 IO-10 kernel: host1: SRP abort called
Jul 15 13:05:08 IO-10 kernel: host1: SRP reset_device called
Jul 15 13:05:08 IO-10 kernel: host1: ib_srp: SRP reset_host called state 0 qp_err 0
Jul 15 13:05:18 IO-10 kernel: sd 1:0:0:12: SCSI error: return code = 0x00000002
Jul 15 13:05:18 IO-10 kernel: end_request: I/O error, dev sdm, sector 861112
Jul 15 13:05:18 IO-10 kernel: device-mapper: multipath: Failing path 8:192.
Jul 15 13:05:18 IO-10 kernel: Buffer I/O error on device dm-11, logical block 107639
Jul 15 13:05:18 IO-10 kernel: lost page write due to I/O error on dm-11
Jul 15 13:05:18 IO-10 kernel: LDISKFS-fs error (device dm-11): kmmpd: Error writing to MMP block
Jul 15 13:05:18 IO-10 kernel: Remounting filesystem read-only
Jul 15 13:05:18 IO-10 kernel: LustreError: 18499:0:(fsfilt-ldiskfs.c:438:fsfilt_ldiskfs_brw_start()) can't get handle for 99 credits: rc = -30
Jul 15 13:05:18 IO-10 kernel: Lustre: 18657:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 36s
Jul 15 13:05:18 IO-10 multipathd: 8:192: mark as failed
Jul 15 13:05:18 IO-10 kernel: Lustre: 18657:0:(filter_io_26.c:779:filter_commitrw_write()) Skipped 1 previous similar message
Jul 15 13:05:18 IO-10 kernel: Buffer I/O error on device dm-11, logical block 79917
Jul 15 13:05:18 IO-10 multipathd: lun_12: remaining active paths: 0
Jul 15 13:05:18 IO-10 kernel: lost page write due to I/O error on dm-11
Jul 15 13:05:18 IO-10 kernel: Aborting journal on device dm-11.
Jul 15 13:05:18 IO-10 kernel: LustreError: 19103:0:(obd.h:1188:obd_transno_commit_cb()) es1-OST000b: transno 608801934 commit error: 2
------------------------------------------- end messages -------------------------------------------------------------------------

I then tried to fsck -n one of the mentioned devices to check for inconsistencies, at which point I noticed a kernel panic.
Rebooted the machine, which now appears unable to establish a connection over LNET:
----------------------------------------------------- messages ---------------------------------------------------------------------
Jul 18 09:24:44 IO-10 OpenSM[28708]: /var/log/opensm.log log file opened
Jul 18 09:24:44 IO-10 OpenSM[28708]: OpenSM 3.1.11
Jul 18 09:24:44 IO-10 OpenSM[28708]: Entering STANDBY state
Jul 18 09:28:23 IO-10 run_srp_daemon[28757]: failed srp_daemon: [HCA=mlx4_0] [port=2] [exit status=110]. Will try to restart srp_daemon periodically. No more warnings will be issued in the next 7200 seconds if the same problem repeats
Jul 18 09:28:23 IO-10 run_srp_daemon[28763]: failed srp_daemon: [HCA=mlx4_0] [port=1] [exit status=110]. Will try to restart srp_daemon periodically. No more warnings will be issued in the next 7200 seconds if the same problem repeats
Jul 18 09:28:24 IO-10 run_srp_daemon[28769]: starting srp_daemon: [HCA=mlx4_0] [port=2]
Jul 18 09:28:25 IO-10 run_srp_daemon[28782]: starting srp_daemon: [HCA=mlx4_0] [port=1]
Jul 18 09:33:10 IO-10 kernel: megasas: 00.00.03.15-RH1 Wed Nov. 21 10:29:45 PST 2007
Jul 18 09:33:10 IO-10 kernel: Lustre: OBD class driver, http://www.lustre.org/
Jul 18 09:33:10 IO-10 kernel: Lustre Version: 1.6.6
Jul 18 09:33:10 IO-10 kernel: Build Version: 1.6.6-1.6.6-ddn3.1-20090527173746
Jul 18 09:33:10 IO-10 kernel: Lustre: 30493:0:(o2iblnd_modparams.c:324:kiblnd_tunables_init()) Concurrent sends 7 is lower than message queue size: 8, performance may drop slightly.
Jul 18 09:33:10 IO-10 kernel: Lustre: Added LNI 10.1.0.229 at o2ib [8/64]
Jul 18 09:33:11 IO-10 kernel: Lustre: Lustre Client File System; http://www.lustre.org/
Jul 18 09:33:34 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 09:33:34 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jul 18 09:33:34 IO-10 kernel: LDISKFS FS on dm-10, internal journal
Jul 18 09:33:34 IO-10 kernel: LDISKFS-fs: recovery complete.
Jul 18 09:33:34 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 09:33:34 IO-10 multipathd: dm-10: umount map (uevent)
Jul 18 09:33:45 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 09:33:45 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jul 18 09:33:45 IO-10 kernel: LDISKFS FS on dm-10, internal journal
Jul 18 09:33:45 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 09:33:45 IO-10 kernel: LDISKFS-fs: file extents enabled
Jul 18 09:33:45 IO-10 kernel: LDISKFS-fs: mballoc enabled
Jul 18 09:33:45 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000a_UUID' is not available for connect (no target)
Jul 18 09:33:45 IO-10 kernel: LustreError: 30789:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff8101a6992c00 x5558026/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1310999725 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 09:33:45 IO-10 kernel: LustreError: 30789:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 2 previous similar messages
Jul 18 09:33:45 IO-10 kernel: fsfilt_ldiskfs: no version for "ldiskfs_free_blocks" found: kernel tainted.
Jul 18 09:33:45 IO-10 kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
Jul 18 09:33:45 IO-10 kernel: Lustre: 31834:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000a, 249 recoverable clients, last_rcvd 469616249
Jul 18 09:33:45 IO-10 kernel: Lustre: OST es1-OST000a now serving dev (es1-OST000a/15fae56a-7dae-ba24-4423-347c0a118367), but will be in recovery for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/obdfilter/es1-OST000a/recovery_status.
Jul 18 09:33:45 IO-10 kernel: Lustre: es1-OST000a.ost: set parameter quota_type=ug
Jul 18 09:33:45 IO-10 kernel: Lustre: Server es1-OST000a on device /dev/mpath/lun_11 has started
Jul 18 09:33:45 IO-10 kernel: Lustre: 30790:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000a: starting recovery timer
Jul 18 09:33:45 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available for connect (no target)
Jul 18 09:33:45 IO-10 kernel: LustreError: Skipped 4 previous similar messages
Jul 18 09:33:45 IO-10 kernel: LustreError: 30792:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff8103185c9000 x215076358/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1310999725 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 09:33:45 IO-10 kernel: Lustre: 30794:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 248 recoverable clients remain
Jul 18 09:33:46 IO-10 kernel: Lustre: 30806:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 247 recoverable clients remain
Jul 18 09:33:46 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000e_UUID' is not available for connect (no target)
Jul 18 09:33:46 IO-10 kernel: LustreError: 30815:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff8102f093d000 x220133374/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1310999726 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 09:33:46 IO-10 kernel: LustreError: 30815:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 11 previous similar messages
Jul 18 09:33:46 IO-10 kernel: LustreError: Skipped 14 previous similar messages
Jul 18 09:33:46 IO-10 kernel: Lustre: 30889:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 245 recoverable clients remain
Jul 18 09:33:46 IO-10 kernel: Lustre: 30889:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 1 previous similar message
Jul 18 09:33:47 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available for connect (no target)
Jul 18 09:33:47 IO-10 kernel: LustreError: 31004:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff8102fea35450 x33450807/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1310999727 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 09:33:47 IO-10 kernel: LustreError: 31004:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 47 previous similar messages
Jul 18 09:33:47 IO-10 kernel: LustreError: Skipped 47 previous similar messages
---------------------------------------------------------------------- end messages ------------------------------------------------------------------------------------------

I'm not seeing that Lustre is trying to recover at all. Any thoughts on where to go from here? Lustre (mounted at /scratch on clients) is starting to have problems cluster-wide.

Best Regards,
--
DaMiri Young
System Engineer
High Performance Computing Team
CITC Academic Computing and User Services | UNT
Hi Damiri,

From the logs you have provided it looks like you have a problem with your back-end storage. First of all, we can see that your SRP connection to the back-end storage reports abort and reset (I guess your back-end storage hardware is connected via InfiniBand if you are using SRP). Then Lustre reports slow messages and eventually the kernel reports SCSI errors. Device mapper reports that both paths to the device have failed, and Lustre remounts the filesystem read-only due to the I/O error. All this means that your I/O node lost contact with the OST because of errors either on the IB network connecting your host to the storage hardware or on the storage hardware itself.

From the first part of the log we can see that the device in trouble is OST es1-OST000b (dm-11). In the second part of your log I cannot see that device being mounted; I can only see OST es1-OST000a (dm-10) being mounted and entering recovery (is that a complete Lustre start-up log?). You need to check whether your I/O node can actually still see and access OST es1-OST000b (dm-11); if yes, then run fsck on that device and then try to mount it. If your host cannot see that device, you will have to investigate the IB network and the storage devices, and it is also worth checking the multipath configuration (an incorrect multipath config may not create all the necessary devices).

Good luck!

Wojciech
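[For readers hitting the same thing later, here is a rough sketch of the checks described above, run on the OSS itself. The device name /dev/mpath/lun_12 is taken from the multipathd messages for es1-OST000b in the log above; adjust it to your own layout, and note that ldiskfs targets are normally checked with the Lustre-patched e2fsprogs.]

# are both paths to the LUN backing es1-OST000b visible and active?
multipath -ll lun_12

# read-only consistency check of the ldiskfs backing device (-n makes no changes)
e2fsck -fn /dev/mpath/lun_12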
On 18 July 2011 16:28, Young, Damiri <Damiri.Young at unt.edu> wrote:
> [...]
So you were right about the I/O node losing contact with the OST. In short, after enabling Lustre debugging and restarting the opensmd and openibd services on the troubled node, the OSTs were remounted and Lustre entered recovery:
--------------------------- messages ----------------------------------------
Jul 18 10:02:56 IO-10 kernel: ib_ipath 0000:05:00.0: We got a lid: 0x75
Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 11 on device= ipath0
Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 13 on device= ipath0
Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 17 on device= ipath0
Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 9 on device= ipath0
Jul 18 10:02:59 IO-10 kernel: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
Jul 18 10:03:01 IO-10 avahi-daemon[24939]: New relevant interface ib0.IPv6 for mDNS.
Jul 18 10:03:01 IO-10 avahi-daemon[24939]: Joining mDNS multicast group on interface ib0.IPv6 with address fe80::211:7500:ff:7bf6.
Jul 18 10:03:01 IO-10 avahi-daemon[24939]: Registering new address record for fe80::211:7500:ff:7bf6 on ib0.
Jul 18 10:03:15 IO-10 ntpd[23084]: synchronized to 10.0.0.1, stratum 3
Jul 18 11:41:40 IO-10 kernel: megasas: 00.00.03.15-RH1 Wed Nov. 21 10:29:45 PST 2007
Jul 18 11:41:41 IO-10 kernel: Lustre: OBD class driver, http://www.lustre.org/
Jul 18 11:41:41 IO-10 kernel: Lustre Version: 1.6.6
Jul 18 11:41:41 IO-10 kernel: Build Version: 1.6.6-1.6.6-ddn3.1-20090527173746
Jul 18 11:41:41 IO-10 kernel: Lustre: 28686:0:(o2iblnd_modparams.c:324:kiblnd_tunables_init()) Concurrent sends 7 is lower than message queue size: 8, performance may drop slightly.
Jul 18 11:41:41 IO-10 kernel: Lustre: Added LNI 10.1.0.229 at o2ib [8/64]
Jul 18 11:41:41 IO-10 kernel: Lustre: Lustre Client File System; http://www.lustre.org/
Jul 18 11:42:07 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 11:42:07 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jul 18 11:42:07 IO-10 kernel: LDISKFS FS on dm-11, internal journal
Jul 18 11:42:07 IO-10 kernel: LDISKFS-fs: recovery complete.
Jul 18 11:42:07 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 11:42:07 IO-10 multipathd: dm-11: umount map (uevent)
Jul 18 11:42:18 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jul 18 11:42:18 IO-10 kernel: LDISKFS FS on dm-11, internal journal
Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs: file extents enabled
Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs: mballoc enabled
Jul 18 11:42:18 IO-10 kernel: fsfilt_ldiskfs: no version for "ldiskfs_free_blocks" found: kernel tainted.
Jul 18 11:42:18 IO-10 kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
Jul 18 11:42:18 IO-10 kernel: Lustre: 29999:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000a, 249 recoverable clients, last_rcvd 469628325
Jul 18 11:42:18 IO-10 kernel: Lustre: OST es1-OST000a now serving dev (es1-OST000a/15fae56a-7dae-ba24-4423-347c0a118367), but will be in recovery for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/obdfilter/es1-OST000a/recovery_status.
Jul 18 11:42:18 IO-10 kernel: Lustre: es1-OST000a.ost: set parameter quota_type=ug
Jul 18 11:42:18 IO-10 kernel: Lustre: Server es1-OST000a on device /dev/mpath/lun_11 has started
Jul 18 11:42:19 IO-10 kernel: Lustre: 28952:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000a: starting recovery timer
Jul 18 11:42:19 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000c_UUID' is not available for connect (no target)
Jul 18 11:42:19 IO-10 kernel: LustreError: 28957:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff810311f9f400 x36077513/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007439 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 11:42:19 IO-10 kernel: LustreError: Skipped 3 previous similar messages
Jul 18 11:42:19 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available for connect (no target)
Jul 18 11:42:19 IO-10 kernel: LustreError: 28985:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff8102f81ce800 x8649866/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007439 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 11:42:19 IO-10 kernel: LustreError: 28985:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 3 previous similar messages
Jul 18 11:42:19 IO-10 kernel: LustreError: Skipped 3 previous similar messages
Jul 18 11:42:19 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available for connect (no target)
Jul 18 11:42:19 IO-10 kernel: Lustre: 29068:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 248 recoverable clients remain
Jul 18 11:42:19 IO-10 kernel: LustreError: 29010:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff8102f81f2c00 x368697/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007439 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 11:42:19 IO-10 kernel: LustreError: 29010:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 19 previous similar messages
Jul 18 11:42:19 IO-10 kernel: LustreError: Skipped 19 previous similar messages
Jul 18 11:42:19 IO-10 kernel: Lustre: 29012:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 247 recoverable clients remain
Jul 18 11:42:20 IO-10 kernel: Lustre: 29106:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 240 recoverable clients remain
Jul 18 11:42:20 IO-10 kernel: Lustre: 29106:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 6 previous similar messages
Jul 18 11:42:20 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available for connect (no target)
Jul 18 11:42:20 IO-10 kernel: LustreError: 29149:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff81030eff2850 x68565826/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007440 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 11:42:20 IO-10 kernel: LustreError: 29149:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 31 previous similar messages
Jul 18 11:42:20 IO-10 kernel: LustreError: Skipped 31 previous similar messages
Jul 18 11:42:21 IO-10 kernel: Lustre: 29196:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 232 recoverable clients remain
Jul 18 11:42:21 IO-10 kernel: Lustre: 29196:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 7 previous similar messages
Jul 18 11:42:22 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available for connect (no target)
Jul 18 11:42:22 IO-10 kernel: LustreError: 29275:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff810302713c50 x519337/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007442 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 11:42:22 IO-10 kernel: LustreError: 29275:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 47 previous similar messages
Jul 18 11:42:22 IO-10 kernel: LustreError: Skipped 47 previous similar messages
Jul 18 11:42:23 IO-10 kernel: Lustre: 29320:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 221 recoverable clients remain
Jul 18 11:42:23 IO-10 kernel: Lustre: 29320:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 10 previous similar messages
Jul 18 11:42:27 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000c_UUID' is not available for connect (no target)
Jul 18 11:42:27 IO-10 kernel: LustreError: 29030:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff8102f87bac00 x435304948/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007447 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 11:42:27 IO-10 kernel: LustreError: 29030:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 91 previous similar messages
Jul 18 11:42:27 IO-10 kernel: LustreError: Skipped 91 previous similar messages
Jul 18 11:42:27 IO-10 kernel: Lustre: 29182:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 196 recoverable clients remain
Jul 18 11:42:27 IO-10 kernel: Lustre: 29182:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 24 previous similar messages
Jul 18 11:42:46 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 11:42:46 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jul 18 11:42:46 IO-10 kernel: LDISKFS FS on dm-10, internal journal
Jul 18 11:42:46 IO-10 kernel: LDISKFS-fs: recovery complete.
Jul 18 11:42:46 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 11:42:46 IO-10 multipathd: dm-10: umount map (uevent)
Jul 18 11:42:58 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jul 18 11:42:58 IO-10 kernel: LDISKFS FS on dm-10, internal journal
Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs: file extents enabled
Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs: mballoc enabled
Jul 18 11:42:58 IO-10 kernel: Lustre: 30227:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000b, 249 recoverable clients, last_rcvd 608808684
Jul 18 11:42:58 IO-10 kernel: Lustre: OST es1-OST000b now serving dev (es1-OST000b/1f38b48f-9a67-b3a6-4374-b25762e71391), but will be in recovery for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/obdfilter/es1-OST000b/recovery_status.
Jul 18 11:42:58 IO-10 kernel: Lustre: es1-OST000b.ost: set parameter quota_type=ug
Jul 18 11:42:58 IO-10 kernel: Lustre: Server es1-OST000b on device /dev/mpath/lun_12 has started
Jul 18 11:43:09 IO-10 kernel: Lustre: 28975:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000b: starting recovery timer
Jul 18 11:43:09 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000c_UUID' is not available for connect (no target)
Jul 18 11:43:09 IO-10 kernel: LustreError: Skipped 111 previous similar messages
Jul 18 11:43:09 IO-10 kernel: LustreError: 29079:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff8102eb3cb000 x36077574/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007489 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 11:43:09 IO-10 kernel: LustreError: 29079:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 114 previous similar messages
Jul 18 11:43:09 IO-10 kernel: Lustre: 28999:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000b: 248 recoverable clients remain
Jul 18 11:43:09 IO-10 kernel: Lustre: 28999:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 25 previous similar messages
Jul 18 11:43:21 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 11:43:21 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
Jul 18 11:43:21 IO-10 kernel: LDISKFS FS on dm-12, internal journal
Jul 18 11:43:21 IO-10 kernel: LDISKFS-fs: recovery complete.
Jul 18 11:43:21 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 11:43:21 IO-10 multipathd: dm-12: umount map (uevent)
Jul 18 11:43:32 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
Jul 18 11:43:32 IO-10 kernel: LDISKFS FS on dm-12, internal journal
Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs: file extents enabled
Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs: mballoc enabled
Jul 18 11:43:32 IO-10 kernel: Lustre: 30436:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000c, 249 recoverable clients, last_rcvd 370809064
Jul 18 11:43:32 IO-10 kernel: Lustre: OST es1-OST000c now serving dev (es1-OST000c/f8c1bf77-11b3-88be-4438-016f059a91b5), but will be in recovery for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/obdfilter/es1-OST000c/recovery_status.
Jul 18 11:43:32 IO-10 kernel: Lustre: es1-OST000c.ost: set parameter quota_type=ug
Jul 18 11:43:32 IO-10 kernel: Lustre: Server es1-OST000c on device /dev/mpath/lun_13 has started
Jul 18 11:43:46 IO-10 kernel: Lustre: 29050:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000c: starting recovery timer
Jul 18 11:43:46 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000d_UUID' is not available for connect (no target)
Jul 18 11:43:46 IO-10 kernel: LustreError: Skipped 229 previous similar messages
Jul 18 11:43:46 IO-10 kernel: LustreError: 29123:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff8102f6e36000 x36721236/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007526 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 11:43:46 IO-10 kernel: LustreError: 29123:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 229 previous similar messages
Jul 18 11:43:46 IO-10 kernel: Lustre: 28982:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000b: 171 recoverable clients remain
Jul 18 11:43:46 IO-10 kernel: Lustre: 28982:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 76 previous similar messages
Jul 18 11:43:54 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
Jul 18 11:43:54 IO-10 kernel: LDISKFS FS on dm-13, internal journal
Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs: recovery complete.
Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 11:43:55 IO-10 multipathd: dm-13: umount map (uevent)
Jul 18 11:44:06 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
Jul 18 11:44:06 IO-10 kernel: LDISKFS FS on dm-13, internal journal
Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: file extents enabled
Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: mballoc enabled
Jul 18 11:44:06 IO-10 kernel: Lustre: 30686:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000d, 249 recoverable clients, last_rcvd 694562245
Jul 18 11:44:06 IO-10 kernel: Lustre: OST es1-OST000d now serving dev (es1-OST000d/cf608dbd-accd-89b7-471a-f4487e9f8ba3), but will be in recovery for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/obdfilter/es1-OST000d/recovery_status.
Jul 18 11:44:06 IO-10 kernel: Lustre: es1-OST000d.ost: set parameter quota_type=ug
Jul 18 11:44:06 IO-10 kernel: Lustre: Server es1-OST000d on device /dev/mpath/lun_14 has started
Jul 18 11:44:06 IO-10 kernel: Lustre: 29293:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000d: starting recovery timer
Jul 18 11:44:18 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000e_UUID' is not available for connect (no target)
Jul 18 11:44:18 IO-10 kernel: LustreError: Skipped 199 previous similar messages
Jul 18 11:44:18 IO-10 kernel: Lustre: 29068:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000d: 175 recoverable clients remain
Jul 18 11:44:18 IO-10 kernel: LustreError: 29135:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff8102f4c1cc00 x56000488/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007558 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 11:44:18 IO-10 kernel: LustreError: 29135:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 199 previous similar messages
Jul 18 11:44:18 IO-10 kernel: Lustre: 29068:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 331 previous similar messages
Jul 18 11:44:28 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
Jul 18 11:44:28 IO-10 kernel: LDISKFS FS on dm-14, internal journal
Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs: recovery complete.
Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 11:44:28 IO-10 multipathd: dm-14: umount map (uevent)
Jul 18 11:44:39 IO-10 kernel: kjournald starting. Commit interval 5 seconds
Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
Jul 18 11:44:39 IO-10 kernel: LDISKFS FS on dm-14, internal journal
Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: file extents enabled
Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: mballoc enabled
Jul 18 11:44:39 IO-10 kernel: Lustre: 30893:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000e, 249 recoverable clients, last_rcvd 613643608
Jul 18 11:44:39 IO-10 kernel: Lustre: OST es1-OST000e now serving dev (es1-OST000e/478c7dc4-4936-bfe2-45ac-2fb7a2e69f62), but will be in recovery for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/obdfilter/es1-OST000e/recovery_status.
Jul 18 11:44:39 IO-10 kernel: Lustre: es1-OST000e.ost: set parameter quota_type=ug
Jul 18 11:44:39 IO-10 kernel: Lustre: Server es1-OST000e on device /dev/mpath/lun_15 has started
Jul 18 11:44:40 IO-10 kernel: Lustre: 29214:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000e: starting recovery timer
Jul 18 11:44:49 IO-10 kernel: Lustre: 29236:0:(service.c:939:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req at ffff8102f4419c00 x738214853/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
Jul 18 11:44:49 IO-10 kernel: Lustre: 28992:0:(service.c:939:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req at ffff8102f4419400 x738214855/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
---------------------- end messages -----------------------------
It mentioned completing the recovery, so I didn't bother running another fsck. Should I?
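[For reference, the recovery counters mentioned in those log lines can be read directly from /proc on the OSS. A minimal sketch; the path comes from the log messages above, and the loop itself is just illustrative:]

# print the recovery status of every OST served by this OSS
for f in /proc/fs/lustre/obdfilter/es1-OST000*/recovery_status; do
    echo "== $f"
    cat "$f"
done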
The problem now seems to be that STONITH on the troubled node's failover partner can't reset the node. It tries and fails incessantly:
------------------------ messages -------------------------------
Jul 18 16:45:17 IO-11 heartbeat: [25037]: info: Resetting node io-10.internal.acs.unt.prv with [IPMI STONITH device ]
Jul 18 16:45:18 IO-11 heartbeat: [25037]: info: glib: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset io-10.internal.acs.unt.prv' returned 256
Jul 18 16:45:18 IO-11 heartbeat: [25037]: ERROR: glib: external_reset_req: 'ipmi reset' for host io-10.internal.acs.unt.prv failed with rc 256
Jul 18 16:45:18 IO-11 heartbeat: [25037]: ERROR: Host io-10.internal.acs.unt.prv not reset!
Jul 18 16:45:18 IO-11 heartbeat: [15803]: WARN: Managed STONITH io-10.internal.acs.unt.prv process 25037 exited with return code 1.
Jul 18 16:45:18 IO-11 heartbeat: [15803]: ERROR: STONITH of io-10.internal.acs.unt.prv failed. Retrying...
---------------------- end messages ---------------------------------
I've checked the logic in /usr/lib64/stonith/plugins/external/ipmi, which doesn't seem to be using the correct address for the BMC controller. It's possible that the HA facilities could prevent mounting of the final OSTs, isn't it?

Wojciech Turek wrote:
> [...]

--
DaMiri Young
HPC System Engineer
High Performance Computing Team | ACUS/CITC | UNT
OK, good, so at least you now have the OSTs mounting. I would still have run fsck, even though it said that LDISKFS was recovered correctly.

As for your Heartbeat problems: if you use the old V1 heartbeat configuration with the haresources file, then I don't think STONITH has anything to do with your filesystem resources not starting. From your logs it looks like the STONITH device is not configured properly, so first you need to test your STONITH config as follows:

# stonith -t external/ipmi -n
HOSTNAME IP_ADDR IPMI_USER IPMI_PASSWD_FILE

# stonith -t external/ipmi -p "oss02 10.145.245.2 root /etc/ha.d/ipmitool.passwd" -lS
stonith: external/ipmi device OK.

As you can see, in my config the stonith command returns OK, so you need to look at your config and tweak it so it also returns OK.

regards

Wojciech
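[A related sanity check, not from the message above (the BMC address and credentials below are placeholders): before tweaking the stonith plugin config, confirm from the surviving node that the peer's BMC actually answers IPMI over the LAN.]

# query io-10's BMC directly; use -I lan instead of lanplus on older BMCs
ipmitool -I lanplus -H <io-10-bmc-ip> -U <ipmi_user> -f /etc/ha.d/ipmitool.passwd chassis power status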
On 18 July 2011 23:02, DaMiri Young <damiri at unt.edu> wrote:
> [...]
> Jul 18 11:43:32 IO-10 kernel: Lustre: es1-OST000c.ost: set parameter > quota_type=ug > Jul 18 11:43:32 IO-10 kernel: Lustre: Server es1-OST000c on device > /dev/mpath/lun_13 has started > Jul 18 11:43:46 IO-10 kernel: Lustre: 29050:0:(ldlm_lib.c:1226:** > check_and_start_recovery_**timer()) es1-OST000c: starting recovery timer > Jul 18 11:43:46 IO-10 kernel: LustreError: 137-5: UUID ''es1-OST000d_UUID'' > is not available for connect (no target) > Jul 18 11:43:46 IO-10 kernel: LustreError: Skipped 229 previous similar > messages > Jul 18 11:43:46 IO-10 kernel: LustreError: 29123:0:(ldlm_lib.c:1619:**target_send_reply_msg()) > @@@ processing error (-19) req at ffff8102f6e36000 x36721236/t0 > o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007526 ref 1 fl Interpret:/0/0 rc > -19/0 > Jul 18 11:43:46 IO-10 kernel: LustreError: 29123:0:(ldlm_lib.c:1619:**target_send_reply_msg()) > Skipped 229 previous similar messages > Jul 18 11:43:46 IO-10 kernel: Lustre: 28982:0:(ldlm_lib.c:1567:** > target_queue_last_replay_**reply()) es1-OST000b: 171 recoverable clients > remain > Jul 18 11:43:46 IO-10 kernel: Lustre: 28982:0:(ldlm_lib.c:1567:** > target_queue_last_replay_**reply()) Skipped 76 previous similar messages > Jul 18 11:43:54 IO-10 kernel: kjournald starting. Commit interval 5 > seconds > Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Jul 18 11:43:54 IO-10 kernel: LDISKFS FS on dm-13, internal journal > Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs: recovery complete. > Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. > Jul 18 11:43:55 IO-10 multipathd: dm-13: umount map (uevent) > Jul 18 11:44:06 IO-10 kernel: kjournald starting. Commit interval 5 > seconds > Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Jul 18 11:44:06 IO-10 kernel: LDISKFS FS on dm-13, internal journal > Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. > Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: file extents enabled > Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: mballoc enabled > Jul 18 11:44:06 IO-10 kernel: Lustre: 30686:0:(filter.c:868:filter_**init_server_data()) > RECOVERY: service es1-OST000d, 249 recoverable clients, last_rcvd 694562245 > Jul 18 11:44:06 IO-10 kernel: Lustre: OST es1-OST000d now serving dev > (es1-OST000d/cf608dbd-accd-**89b7-471a-f4487e9f8ba3), but will be in > recovery for at least 5:00, or until 249 clients reconnect. During this time > new clients will not be allowed to connect. Recovery progress can be > monitored by watching /proc/fs/lustre/obdfilter/es1-** > OST000d/recovery_status. 
> Jul 18 11:44:06 IO-10 kernel: Lustre: es1-OST000d.ost: set parameter > quota_type=ug > Jul 18 11:44:06 IO-10 kernel: Lustre: Server es1-OST000d on device > /dev/mpath/lun_14 has started > Jul 18 11:44:06 IO-10 kernel: Lustre: 29293:0:(ldlm_lib.c:1226:** > check_and_start_recovery_**timer()) es1-OST000d: starting recovery timer > Jul 18 11:44:18 IO-10 kernel: LustreError: 137-5: UUID ''es1-OST000e_UUID'' > is not available for connect (no target) > Jul 18 11:44:18 IO-10 kernel: LustreError: Skipped 199 previous similar > messages > Jul 18 11:44:18 IO-10 kernel: Lustre: 29068:0:(ldlm_lib.c:1567:** > target_queue_last_replay_**reply()) es1-OST000d: 175 recoverable clients > remain > Jul 18 11:44:18 IO-10 kernel: LustreError: 29135:0:(ldlm_lib.c:1619:**target_send_reply_msg()) > @@@ processing error (-19) req at ffff8102f4c1cc00 x56000488/t0 > o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007558 ref 1 fl Interpret:/0/0 rc > -19/0 > Jul 18 11:44:18 IO-10 kernel: LustreError: 29135:0:(ldlm_lib.c:1619:**target_send_reply_msg()) > Skipped 199 previous similar messages > Jul 18 11:44:18 IO-10 kernel: Lustre: 29068:0:(ldlm_lib.c:1567:** > target_queue_last_replay_**reply()) Skipped 331 previous similar messages > Jul 18 11:44:28 IO-10 kernel: kjournald starting. Commit interval 5 > seconds > Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Jul 18 11:44:28 IO-10 kernel: LDISKFS FS on dm-14, internal journal > Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs: recovery complete. > Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. > Jul 18 11:44:28 IO-10 multipathd: dm-14: umount map (uevent) > Jul 18 11:44:39 IO-10 kernel: kjournald starting. Commit interval 5 > seconds > Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Jul 18 11:44:39 IO-10 kernel: LDISKFS FS on dm-14, internal journal > Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. > Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: file extents enabled > Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: mballoc enabled > Jul 18 11:44:39 IO-10 kernel: Lustre: 30893:0:(filter.c:868:filter_**init_server_data()) > RECOVERY: service es1-OST000e, 249 recoverable clients, last_rcvd 613643608 > Jul 18 11:44:39 IO-10 kernel: Lustre: OST es1-OST000e now serving dev > (es1-OST000e/478c7dc4-4936-**bfe2-45ac-2fb7a2e69f62), but will be in > recovery for at least 5:00, or until 249 clients reconnect. During this time > new clients will not be allowed to connect. Recovery progress can be > monitored by watching /proc/fs/lustre/obdfilter/es1-** > OST000e/recovery_status. 
> Jul 18 11:44:39 IO-10 kernel: Lustre: es1-OST000e.ost: set parameter quota_type=ug
> Jul 18 11:44:39 IO-10 kernel: Lustre: Server es1-OST000e on device /dev/mpath/lun_15 has started
> Jul 18 11:44:40 IO-10 kernel: Lustre: 29214:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000e: starting recovery timer
> Jul 18 11:44:49 IO-10 kernel: Lustre: 29236:0:(service.c:939:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req at ffff8102f4419c00 x738214853/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
> Jul 18 11:44:49 IO-10 kernel: Lustre: 28992:0:(service.c:939:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s req at ffff8102f4419400 x738214855/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
> ---------------------- end messages -----------------------------
>
> It mentioned completing the recovery, so I didn't bother with running another fsck; should I? The problem now seems to be that STONITH on the troubled node's failover partner can't reset the node. It tries and fails incessantly:
> ------------------------ messages -------------------------------
> Jul 18 16:45:17 IO-11 heartbeat: [25037]: info: Resetting node io-10.internal.acs.unt.prv with [IPMI STONITH device ]
> Jul 18 16:45:18 IO-11 heartbeat: [25037]: info: glib: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset io-10.internal.acs.unt.prv' returned 256
> Jul 18 16:45:18 IO-11 heartbeat: [25037]: ERROR: glib: external_reset_req: 'ipmi reset' for host io-10.internal.acs.unt.prv failed with rc 256
> Jul 18 16:45:18 IO-11 heartbeat: [25037]: ERROR: Host io-10.internal.acs.unt.prv not reset!
> Jul 18 16:45:18 IO-11 heartbeat: [15803]: WARN: Managed STONITH io-10.internal.acs.unt.prv process 25037 exited with return code 1.
> Jul 18 16:45:18 IO-11 heartbeat: [15803]: ERROR: STONITH of io-10.internal.acs.unt.prv failed. Retrying...
> ---------------------- end messages ---------------------------------
>
> I've checked the logic in /usr/lib64/stonith/plugins/external/ipmi, which doesn't seem to be using the correct address for the BMC controller. Isn't it possible that the HA facilities could be preventing the final OSTs from being mounted?
>
> Wojciech Turek wrote:
>
>> Hi Damiri,
>>
>> From the logs you have provided it looks like you have a problem with your back-end storage. First of all we can see that your SRP connection to the backend storage reports abort and reset (I guess your backend storage hardware is connected via Infiniband if you are using SRP). Then Lustre reports slow messages and eventually the kernel reports SCSI errors. Device mapper reports that both paths to the device have failed and Lustre remounts the filesystem read-only due to the I/O error. All this means that your I/O node lost contact with the OST due to some errors either on the IB network connecting your host to the storage hardware or on the storage hardware itself. From the first part of the log we can see that the device in trouble is OST es1-OST000b (dm-11). In the second part of your log I can not see that device being mounted. From your log I can see that only OST es1-OST000a (dm-10) is mounted and enters recovery.
>
> --
> DaMiri Young
> HPC System Engineer
> High Performance Computing Team | ACUS/CITC | UNT
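A quick way to check the suspicion above about the plugin using the wrong BMC address is to query the controller directly with ipmitool, using the same credentials the external/ipmi STONITH plugin is configured with. This is only a sketch: the BMC address, IPMI user, password file and channel number are placeholders to be replaced with the values from the actual STONITH configuration (Wojciech's later example uses /etc/ha.d/ipmitool.passwd for the password file):

# from IO-11, query the power state of IO-10's BMC over the LAN interface;
# if this fails with the address the plugin currently uses, the plugin config is the problem
ipmitool -I lan -H <io-10-bmc-address> -U <ipmi_user> -f <ipmi_password_file> chassis power status

# on IO-10 itself, print the BMC's LAN configuration (channel 1 is typical)
# to find the address the plugin should be pointing at
ipmitool lan print 1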
Hi Damiri,

If heartbeat is not able to start (mount) one of the OSTs I would recommend stopping heartbeat on both servers and then mounting the troubled OST manually. Then you should see why the OST is not mounted. To check the consistency of the filesystem, in your case I would first run fsck with the -n switch to see the extent of the damage; this also prevents further damage to your filesystem if a faulty controller or bad links are corrupting data. In a normal situation I use the following command:

fsck -f -v /dev/<ost_dev> -C0

Make sure that you log the output from the fsck, which will be essential for further troubleshooting (a minimal logging example is sketched after the quoted messages below).

Best regards,

Wojciech

On 19 July 2011 16:58, Young, Damiri <Damiri.Young at unt.edu> wrote:

> Many thanks for the useful info, Turek. I mentioned HA (heartbeat v2) issues because after the troubled I/O node got its paths back to the OSTs it failed 4 of the 5 OSTs over to its sibling server, where they're now mounted. To me it seems the OSTs (we're using lustre v1.6, btw) won't be released until the failed-over node is reset by its sibling.
>
> The OSSs seem to have trouble connecting to the 1 OST I mentioned:
> -------------------------------- messages -------------------------------------
> Jul 19 10:29:02 IO-10 kernel: LustreError: 29429:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req at ffff810315ab7000 x62666211/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311089442 ref 1 fl Interpret:/0/0 rc -19/0
> Jul 19 10:29:02 IO-10 kernel: LustreError: 29429:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 2987 previous similar messages
> Jul 19 10:39:02 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000a_UUID' is not available for connect (stopping)
> Jul 19 10:39:02 IO-10 kernel: LustreError: Skipped 2988 previous similar messages
> ---------------------------------------------------------------------------------------
>
> --------------------- lfs check osts output --------------------------------
> lfs check osts
> es1-OST0000-osc-ffff810c5dc61000 active.
> es1-OST0001-osc-ffff810c5dc61000 active.
> es1-OST0002-osc-ffff810c5dc61000 active.
> es1-OST0003-osc-ffff810c5dc61000 active.
> es1-OST0004-osc-ffff810c5dc61000 active.
> es1-OST0005-osc-ffff810c5dc61000 active.
> es1-OST0006-osc-ffff810c5dc61000 active.
> es1-OST0007-osc-ffff810c5dc61000 active.
> es1-OST0008-osc-ffff810c5dc61000 active.
> es1-OST0009-osc-ffff810c5dc61000 active.
> error: check 'es1-OST000a-osc-ffff810c5dc61000': Resource temporarily unavailable (11)
> es1-OST000b-osc-ffff810c5dc61000 active.
> es1-OST000c-osc-ffff810c5dc61000 active.
> es1-OST000d-osc-ffff810c5dc61000 active.
> es1-OST000e-osc-ffff810c5dc61000 active.
> es1-OST000f-osc-ffff810c5dc61000 active.
> es1-OST0010-osc-ffff810c5dc61000 active.
> es1-OST0011-osc-ffff810c5dc61000 active.
> es1-OST0012-osc-ffff810c5dc61000 active.
> es1-OST0013-osc-ffff810c5dc61000 active.
> ---------------------------------------------------------------------------------------------------------------
>
> ----------------- cat /proc/fs/lustre/devices output --------------------
> cat /proc/fs/lustre/devices
> 1 UP ost OSS OSS_uuid 3
> 2 ST obdfilter es1-OST000a es1-OST000a_UUID 3
> --------------------------------------------------------------------------------------
>
> do you know the command(s) to fsck correctly?
> > Best Regards, > -- > DaMiri Young > > System Engineer > High Performance Computing Team > CITC Academic Computing and User Services | UNT > ------------------------------ > *From:* turek.wojciech at gmail.com [turek.wojciech at gmail.com] on behalf of > Wojciech Turek [wjt27 at cam.ac.uk] > *Sent:* Monday, July 18, 2011 6:06 PM > *To:* DaMiri Young > *Cc:* Young, Damiri; lustre-discuss at lists.lustre.org > *Subject:* Re: [Lustre-discuss] IO-Node issue > > Ok , Good so at least you have now the OST mounting. I would have run > fsck even though it said that LDISKFS was recovered correctly. > > As for you Heartbeat problems, If you use the old V1 heartbeat > configuration with haresources file then I don''t think that STONITH has > anything to do with your filesystem resources not starting. > From your logs it looks like STONITH device is not configured properly so > first you need to test yourSTONITH config as follows: > > # stonith -t external/ipmi -n > HOSTNAME IP_ADDR IPMI_USER IPMI_PASSWD_FILE > > > # stonith -t external/ipmi -p "oss02 10.145.245.2 root > /etc/ha.d/ipmitool.passwd" -lS > stonith: external/ipmi device OK. > > As you can see in my config stonith command returns OK so you need to look > at your config and tweak it so it also return OK. > > regards > > Wojciech > > On 18 July 2011 23:02, DaMiri Young <damiri at unt.edu> wrote: > >> So you were right about the I/O node losing contact with the OST. In >> short, after enabling lustre debugging, restarting opensmd and openibd >> services on the troubled node the OSTs were remounted and lustre entered >> recovery: >> --------------------------- messages ------------------------------** >> ---------- >> Jul 18 10:02:56 IO-10 kernel: ib_ipath 0000:05:00.0: We got a lid: 0x75 >> Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 11 on device= ipath0 >> Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 13 on device= ipath0 >> Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 17 on device= ipath0 >> Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 9 on device= ipath0 >> Jul 18 10:02:59 IO-10 kernel: ADDRCONF(NETDEV_CHANGE): ib0: link becomes >> ready >> Jul 18 10:03:01 IO-10 avahi-daemon[24939]: New relevant interface ib0.IPv6 >> for mDNS. >> Jul 18 10:03:01 IO-10 avahi-daemon[24939]: Joining mDNS multicast group on >> interface ib0.IPv6 with address fe80::211:7500:ff:7bf6. >> Jul 18 10:03:01 IO-10 avahi-daemon[24939]: Registering new address record >> for fe80::211:7500:ff:7bf6 on ib0. >> Jul 18 10:03:15 IO-10 ntpd[23084]: synchronized to 10.0.0.1, stratum 3 >> Jul 18 11:41:40 IO-10 kernel: megasas: 00.00.03.15-RH1 Wed Nov. 21 >> 10:29:45 PST 2007 >> Jul 18 11:41:41 IO-10 kernel: Lustre: OBD class driver, >> http://www.lustre.org/ >> Jul 18 11:41:41 IO-10 kernel: Lustre Version: 1.6.6 >> Jul 18 11:41:41 IO-10 kernel: Build Version: 1.6.6-1.6.6-ddn3.1-* >> *20090527173746 >> Jul 18 11:41:41 IO-10 kernel: Lustre: 28686:0:(o2iblnd_modparams.c:**324:kiblnd_tunables_init()) >> Concurrent sends 7 is lower than message queue size: 8, performance may drop >> slightly. >> Jul 18 11:41:41 IO-10 kernel: Lustre: Added LNI 10.1.0.229 at o2ib [8/64] >> Jul 18 11:41:41 IO-10 kernel: Lustre: Lustre Client File System; >> http://www.lustre.org/ >> Jul 18 11:42:07 IO-10 kernel: kjournald starting. 
Commit interval 5 >> seconds >> Jul 18 11:42:07 IO-10 kernel: LDISKFS-fs warning: checktime reached, >> running e2fsck is recommended >> Jul 18 11:42:07 IO-10 kernel: LDISKFS FS on dm-11, internal journal >> Jul 18 11:42:07 IO-10 kernel: LDISKFS-fs: recovery complete. >> Jul 18 11:42:07 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. >> Jul 18 11:42:07 IO-10 multipathd: dm-11: umount map (uevent) >> Jul 18 11:42:18 IO-10 kernel: kjournald starting. Commit interval 5 >> seconds >> Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs warning: checktime reached, >> running e2fsck is recommended >> Jul 18 11:42:18 IO-10 kernel: LDISKFS FS on dm-11, internal journal >> Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. >> Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs: file extents enabled >> Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs: mballoc enabled >> Jul 18 11:42:18 IO-10 kernel: fsfilt_ldiskfs: no version for >> "ldiskfs_free_blocks" found: kernel tainted. >> Jul 18 11:42:18 IO-10 kernel: Lustre: Filtering OBD driver; >> http://www.lustre.org/ >> Jul 18 11:42:18 IO-10 kernel: Lustre: 29999:0:(filter.c:868:filter_**init_server_data()) >> RECOVERY: service es1-OST000a, 249 recoverable clients, last_rcvd 469628325 >> Jul 18 11:42:18 IO-10 kernel: Lustre: OST es1-OST000a now serving dev >> (es1-OST000a/15fae56a-7dae-**ba24-4423-347c0a118367), but will be in >> recovery for at least 5:00, or until 249 clients reconnect. During this time >> new clients will not be allowed to connect. Recovery progress can be >> monitored by watching /proc/fs/lustre/obdfilter/es1-** >> OST000a/recovery_status. >> Jul 18 11:42:18 IO-10 kernel: Lustre: es1-OST000a.ost: set parameter >> quota_type=ug >> Jul 18 11:42:18 IO-10 kernel: Lustre: Server es1-OST000a on device >> /dev/mpath/lun_11 has started >> Jul 18 11:42:19 IO-10 kernel: Lustre: 28952:0:(ldlm_lib.c:1226:** >> check_and_start_recovery_**timer()) es1-OST000a: starting recovery timer >> Jul 18 11:42:19 IO-10 kernel: LustreError: 137-5: UUID ''es1-OST000c_UUID'' >> is not available for connect (no target) >> Jul 18 11:42:19 IO-10 kernel: LustreError: 28957:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> @@@ processing error (-19) req at ffff810311f9f400 x36077513/t0 >> o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007439 ref 1 fl Interpret:/0/0 rc >> -19/0 >> Jul 18 11:42:19 IO-10 kernel: LustreError: Skipped 3 previous similar >> messages >> Jul 18 11:42:19 IO-10 kernel: LustreError: 137-5: UUID ''es1-OST000b_UUID'' >> is not available for connect (no target) >> Jul 18 11:42:19 IO-10 kernel: LustreError: 28985:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> @@@ processing error (-19) req at ffff8102f81ce800 x8649866/t0 >> o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007439 ref 1 fl Interpret:/0/0 rc >> -19/0 >> Jul 18 11:42:19 IO-10 kernel: LustreError: 28985:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> Skipped 3 previous similar messages >> Jul 18 11:42:19 IO-10 kernel: LustreError: Skipped 3 previous similar >> messages >> Jul 18 11:42:19 IO-10 kernel: LustreError: 137-5: UUID ''es1-OST000b_UUID'' >> is not available for connect (no target) >> Jul 18 11:42:19 IO-10 kernel: Lustre: 29068:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) es1-OST000a: 248 recoverable clients >> remain >> Jul 18 11:42:19 IO-10 kernel: LustreError: 29010:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> @@@ processing error (-19) req at ffff8102f81f2c00 x368697/t0 >> o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007439 ref 1 fl 
Interpret:/0/0 rc >> -19/0 >> Jul 18 11:42:19 IO-10 kernel: LustreError: 29010:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> Skipped 19 previous similar messages >> Jul 18 11:42:19 IO-10 kernel: LustreError: Skipped 19 previous similar >> messages >> Jul 18 11:42:19 IO-10 kernel: Lustre: 29012:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) es1-OST000a: 247 recoverable clients >> remain >> Jul 18 11:42:20 IO-10 kernel: Lustre: 29106:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) es1-OST000a: 240 recoverable clients >> remain >> Jul 18 11:42:20 IO-10 kernel: Lustre: 29106:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) Skipped 6 previous similar messages >> Jul 18 11:42:20 IO-10 kernel: LustreError: 137-5: UUID ''es1-OST000b_UUID'' >> is not available for connect (no target) >> Jul 18 11:42:20 IO-10 kernel: LustreError: 29149:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> @@@ processing error (-19) req at ffff81030eff2850 x68565826/t0 >> o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007440 ref 1 fl Interpret:/0/0 rc >> -19/0 >> Jul 18 11:42:20 IO-10 kernel: LustreError: 29149:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> Skipped 31 previous similar messages >> Jul 18 11:42:20 IO-10 kernel: LustreError: Skipped 31 previous similar >> messages >> Jul 18 11:42:21 IO-10 kernel: Lustre: 29196:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) es1-OST000a: 232 recoverable clients >> remain >> Jul 18 11:42:21 IO-10 kernel: Lustre: 29196:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) Skipped 7 previous similar messages >> Jul 18 11:42:22 IO-10 kernel: LustreError: 137-5: UUID ''es1-OST000b_UUID'' >> is not available for connect (no target) >> Jul 18 11:42:22 IO-10 kernel: LustreError: 29275:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> @@@ processing error (-19) req at ffff810302713c50 x519337/t0 >> o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007442 ref 1 fl Interpret:/0/0 rc >> -19/0 >> Jul 18 11:42:22 IO-10 kernel: LustreError: 29275:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> Skipped 47 previous similar messages >> Jul 18 11:42:22 IO-10 kernel: LustreError: Skipped 47 previous similar >> messages >> Jul 18 11:42:23 IO-10 kernel: Lustre: 29320:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) es1-OST000a: 221 recoverable clients >> remain >> Jul 18 11:42:23 IO-10 kernel: Lustre: 29320:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) Skipped 10 previous similar messages >> Jul 18 11:42:27 IO-10 kernel: LustreError: 137-5: UUID ''es1-OST000c_UUID'' >> is not available for connect (no target) >> Jul 18 11:42:27 IO-10 kernel: LustreError: 29030:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> @@@ processing error (-19) req at ffff8102f87bac00 x435304948/t0 >> o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007447 ref 1 fl Interpret:/0/0 rc >> -19/0 >> Jul 18 11:42:27 IO-10 kernel: LustreError: 29030:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> Skipped 91 previous similar messages >> Jul 18 11:42:27 IO-10 kernel: LustreError: Skipped 91 previous similar >> messages >> Jul 18 11:42:27 IO-10 kernel: Lustre: 29182:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) es1-OST000a: 196 recoverable clients >> remain >> Jul 18 11:42:27 IO-10 kernel: Lustre: 29182:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) Skipped 24 previous similar messages >> Jul 18 11:42:46 IO-10 kernel: kjournald starting. 
Commit interval 5 >> seconds >> Jul 18 11:42:46 IO-10 kernel: LDISKFS-fs warning: checktime reached, >> running e2fsck is recommended >> Jul 18 11:42:46 IO-10 kernel: LDISKFS FS on dm-10, internal journal >> Jul 18 11:42:46 IO-10 kernel: LDISKFS-fs: recovery complete. >> Jul 18 11:42:46 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. >> Jul 18 11:42:46 IO-10 multipathd: dm-10: umount map (uevent) >> Jul 18 11:42:58 IO-10 kernel: kjournald starting. Commit interval 5 >> seconds >> Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs warning: checktime reached, >> running e2fsck is recommended >> Jul 18 11:42:58 IO-10 kernel: LDISKFS FS on dm-10, internal journal >> Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. >> Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs: file extents enabled >> Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs: mballoc enabled >> Jul 18 11:42:58 IO-10 kernel: Lustre: 30227:0:(filter.c:868:filter_**init_server_data()) >> RECOVERY: service es1-OST000b, 249 recoverable clients, last_rcvd 608808684 >> Jul 18 11:42:58 IO-10 kernel: Lustre: OST es1-OST000b now serving dev >> (es1-OST000b/1f38b48f-9a67-**b3a6-4374-b25762e71391), but will be in >> recovery for at least 5:00, or until 249 clients reconnect. During this time >> new clients will not be allowed to connect. Recovery progress can be >> monitored by watching /proc/fs/lustre/obdfilter/es1-** >> OST000b/recovery_status. >> Jul 18 11:42:58 IO-10 kernel: Lustre: es1-OST000b.ost: set parameter >> quota_type=ug >> Jul 18 11:42:58 IO-10 kernel: Lustre: Server es1-OST000b on device >> /dev/mpath/lun_12 has started >> Jul 18 11:43:09 IO-10 kernel: Lustre: 28975:0:(ldlm_lib.c:1226:** >> check_and_start_recovery_**timer()) es1-OST000b: starting recovery timer >> Jul 18 11:43:09 IO-10 kernel: LustreError: 137-5: UUID ''es1-OST000c_UUID'' >> is not available for connect (no target) >> Jul 18 11:43:09 IO-10 kernel: LustreError: Skipped 111 previous similar >> messages >> Jul 18 11:43:09 IO-10 kernel: LustreError: 29079:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> @@@ processing error (-19) req at ffff8102eb3cb000 x36077574/t0 >> o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007489 ref 1 fl Interpret:/0/0 rc >> -19/0 >> Jul 18 11:43:09 IO-10 kernel: LustreError: 29079:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> Skipped 114 previous similar messages >> Jul 18 11:43:09 IO-10 kernel: Lustre: 28999:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) es1-OST000b: 248 recoverable clients >> remain >> Jul 18 11:43:09 IO-10 kernel: Lustre: 28999:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) Skipped 25 previous similar messages >> Jul 18 11:43:21 IO-10 kernel: kjournald starting. Commit interval 5 >> seconds >> Jul 18 11:43:21 IO-10 kernel: LDISKFS-fs warning: maximal mount count >> reached, running e2fsck is recommended >> Jul 18 11:43:21 IO-10 kernel: LDISKFS FS on dm-12, internal journal >> Jul 18 11:43:21 IO-10 kernel: LDISKFS-fs: recovery complete. >> Jul 18 11:43:21 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. >> Jul 18 11:43:21 IO-10 multipathd: dm-12: umount map (uevent) >> Jul 18 11:43:32 IO-10 kernel: kjournald starting. Commit interval 5 >> seconds >> Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs warning: maximal mount count >> reached, running e2fsck is recommended >> Jul 18 11:43:32 IO-10 kernel: LDISKFS FS on dm-12, internal journal >> Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. 
>> Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs: file extents enabled >> Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs: mballoc enabled >> Jul 18 11:43:32 IO-10 kernel: Lustre: 30436:0:(filter.c:868:filter_**init_server_data()) >> RECOVERY: service es1-OST000c, 249 recoverable clients, last_rcvd 370809064 >> Jul 18 11:43:32 IO-10 kernel: Lustre: OST es1-OST000c now serving dev >> (es1-OST000c/f8c1bf77-11b3-**88be-4438-016f059a91b5), but will be in >> recovery for at least 5:00, or until 249 clients reconnect. During this time >> new clients will not be allowed to connect. Recovery progress can be >> monitored by watching /proc/fs/lustre/obdfilter/es1-** >> OST000c/recovery_status. >> Jul 18 11:43:32 IO-10 kernel: Lustre: es1-OST000c.ost: set parameter >> quota_type=ug >> Jul 18 11:43:32 IO-10 kernel: Lustre: Server es1-OST000c on device >> /dev/mpath/lun_13 has started >> Jul 18 11:43:46 IO-10 kernel: Lustre: 29050:0:(ldlm_lib.c:1226:** >> check_and_start_recovery_**timer()) es1-OST000c: starting recovery timer >> Jul 18 11:43:46 IO-10 kernel: LustreError: 137-5: UUID ''es1-OST000d_UUID'' >> is not available for connect (no target) >> Jul 18 11:43:46 IO-10 kernel: LustreError: Skipped 229 previous similar >> messages >> Jul 18 11:43:46 IO-10 kernel: LustreError: 29123:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> @@@ processing error (-19) req at ffff8102f6e36000 x36721236/t0 >> o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007526 ref 1 fl Interpret:/0/0 rc >> -19/0 >> Jul 18 11:43:46 IO-10 kernel: LustreError: 29123:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> Skipped 229 previous similar messages >> Jul 18 11:43:46 IO-10 kernel: Lustre: 28982:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) es1-OST000b: 171 recoverable clients >> remain >> Jul 18 11:43:46 IO-10 kernel: Lustre: 28982:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) Skipped 76 previous similar messages >> Jul 18 11:43:54 IO-10 kernel: kjournald starting. Commit interval 5 >> seconds >> Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs warning: maximal mount count >> reached, running e2fsck is recommended >> Jul 18 11:43:54 IO-10 kernel: LDISKFS FS on dm-13, internal journal >> Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs: recovery complete. >> Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. >> Jul 18 11:43:55 IO-10 multipathd: dm-13: umount map (uevent) >> Jul 18 11:44:06 IO-10 kernel: kjournald starting. Commit interval 5 >> seconds >> Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs warning: maximal mount count >> reached, running e2fsck is recommended >> Jul 18 11:44:06 IO-10 kernel: LDISKFS FS on dm-13, internal journal >> Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. >> Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: file extents enabled >> Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: mballoc enabled >> Jul 18 11:44:06 IO-10 kernel: Lustre: 30686:0:(filter.c:868:filter_**init_server_data()) >> RECOVERY: service es1-OST000d, 249 recoverable clients, last_rcvd 694562245 >> Jul 18 11:44:06 IO-10 kernel: Lustre: OST es1-OST000d now serving dev >> (es1-OST000d/cf608dbd-accd-**89b7-471a-f4487e9f8ba3), but will be in >> recovery for at least 5:00, or until 249 clients reconnect. During this time >> new clients will not be allowed to connect. Recovery progress can be >> monitored by watching /proc/fs/lustre/obdfilter/es1-** >> OST000d/recovery_status. 
>> Jul 18 11:44:06 IO-10 kernel: Lustre: es1-OST000d.ost: set parameter >> quota_type=ug >> Jul 18 11:44:06 IO-10 kernel: Lustre: Server es1-OST000d on device >> /dev/mpath/lun_14 has started >> Jul 18 11:44:06 IO-10 kernel: Lustre: 29293:0:(ldlm_lib.c:1226:** >> check_and_start_recovery_**timer()) es1-OST000d: starting recovery timer >> Jul 18 11:44:18 IO-10 kernel: LustreError: 137-5: UUID ''es1-OST000e_UUID'' >> is not available for connect (no target) >> Jul 18 11:44:18 IO-10 kernel: LustreError: Skipped 199 previous similar >> messages >> Jul 18 11:44:18 IO-10 kernel: Lustre: 29068:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) es1-OST000d: 175 recoverable clients >> remain >> Jul 18 11:44:18 IO-10 kernel: LustreError: 29135:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> @@@ processing error (-19) req at ffff8102f4c1cc00 x56000488/t0 >> o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007558 ref 1 fl Interpret:/0/0 rc >> -19/0 >> Jul 18 11:44:18 IO-10 kernel: LustreError: 29135:0:(ldlm_lib.c:1619:**target_send_reply_msg()) >> Skipped 199 previous similar messages >> Jul 18 11:44:18 IO-10 kernel: Lustre: 29068:0:(ldlm_lib.c:1567:** >> target_queue_last_replay_**reply()) Skipped 331 previous similar messages >> Jul 18 11:44:28 IO-10 kernel: kjournald starting. Commit interval 5 >> seconds >> Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs warning: maximal mount count >> reached, running e2fsck is recommended >> Jul 18 11:44:28 IO-10 kernel: LDISKFS FS on dm-14, internal journal >> Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs: recovery complete. >> Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. >> Jul 18 11:44:28 IO-10 multipathd: dm-14: umount map (uevent) >> Jul 18 11:44:39 IO-10 kernel: kjournald starting. Commit interval 5 >> seconds >> Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs warning: maximal mount count >> reached, running e2fsck is recommended >> Jul 18 11:44:39 IO-10 kernel: LDISKFS FS on dm-14, internal journal >> Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered >> data mode. >> Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: file extents enabled >> Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: mballoc enabled >> Jul 18 11:44:39 IO-10 kernel: Lustre: 30893:0:(filter.c:868:filter_**init_server_data()) >> RECOVERY: service es1-OST000e, 249 recoverable clients, last_rcvd 613643608 >> Jul 18 11:44:39 IO-10 kernel: Lustre: OST es1-OST000e now serving dev >> (es1-OST000e/478c7dc4-4936-**bfe2-45ac-2fb7a2e69f62), but will be in >> recovery for at least 5:00, or until 249 clients reconnect. During this time >> new clients will not be allowed to connect. Recovery progress can be >> monitored by watching /proc/fs/lustre/obdfilter/es1-** >> OST000e/recovery_status. 
>> Jul 18 11:44:39 IO-10 kernel: Lustre: es1-OST000e.ost: set parameter >> quota_type=ug >> Jul 18 11:44:39 IO-10 kernel: Lustre: Server es1-OST000e on device >> /dev/mpath/lun_15 has started >> Jul 18 11:44:40 IO-10 kernel: Lustre: 29214:0:(ldlm_lib.c:1226:** >> check_and_start_recovery_**timer()) es1-OST000e: starting recovery timer >> Jul 18 11:44:49 IO-10 kernel: Lustre: 29236:0:(service.c:939:ptlrpc_**server_handle_req_in()) >> @@@ Slow req_in handling 6s req at ffff8102f4419c00 x738214853/t0 >> o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >> Jul 18 11:44:49 IO-10 kernel: Lustre: 28992:0:(service.c:939:ptlrpc_**server_handle_req_in()) >> @@@ Slow req_in handling 6s req at ffff8102f4419400 x738214855/t0 >> o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0 >> ---------------------- end messages ----------------------------- >> >> It mentioned completing the recovery so I didn''t bother with running >> another fsck, should I? The problem now seems to be that STONITH on the >> troubled node''s failover can''t reset the node. It tries and fails >> incessantly: >> ------------------------ messages ------------------------------**- >> Jul 18 16:45:17 IO-11 heartbeat: [25037]: info: Resetting node >> io-10.internal.acs.unt.prv with [IPMI STONITH device ] >> Jul 18 16:45:18 IO-11 heartbeat: [25037]: info: glib: external_run_cmd: >> Calling ''/usr/lib64/stonith/plugins/**external/ipmi reset >> io-10.internal.acs.unt.prv'' returned 256 >> Jul 18 16:45:18 IO-11 heartbeat: [25037]: ERROR: glib: external_reset_req: >> ''ipmi reset'' for host io-10.internal.acs.unt.prv failed with rc 256 >> Jul 18 16:45:18 IO-11 heartbeat: [25037]: ERROR: Host >> io-10.internal.acs.unt.prv not reset! >> Jul 18 16:45:18 IO-11 heartbeat: [15803]: WARN: Managed STONITH >> io-10.internal.acs.unt.prv process 25037 exited with return code 1. >> Jul 18 16:45:18 IO-11 heartbeat: [15803]: ERROR: STONITH of >> io-10.internal.acs.unt.prv failed. Retrying... >> ---------------------- end messages ------------------------------**--- >> >> I''ve checked the logic in usr/lib64/stonith/plugins/**external/ipmi which >> doesn''t seem to be using the correct address for the BMC controller. It''s >> possible that the HA facilites could prevent mounting of the final OSTs >> isn''t it? >> >> >> >> Wojciech Turek wrote: >> >>> Hi Damiri, >>> >>> From the logs you have provided it looks like you have a problem with >>> your back end storage. First of all we can see that your SRP connection to >>> backend storage reports abort and reset (I guess your backend storage >>> hardware is connected via Infiniband if you are using SRP). Then Lustre >>> reports slow messages and eventually kernel reports SCSI errors. Device >>> mapper reports that both paths to the device are failed and Lustre remounts >>> filesystem read-only due to I/O error. All these means that your I/O node >>> lost contact with the OST due to some errors either on IB network connecting >>> your host to the storage hardware or on the storage hardware itself. From >>> the first part of the log we can see that the device being in trouble is OST >>> es1-OST000b (dm-11). In the second part of your log I can not see that >>> device being mounted. From your log I can see that only OST es1-OST000a >>> (dm-10) is mounted and enters recovery >>> >> >> >> -- >> DaMiri Young >> HPC System Engineer >> High Performance Computing Team | ACUS/CITC | UNT >> > >-------------- next part -------------- An HTML attachment was scrubbed... 
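To capture the fsck output as suggested above, one minimal approach is to tee it to a file. This is only a sketch: es1-OST000b on /dev/mpath/lun_12 is used as the example target because it is the OST that first reported errors, the log file names are arbitrary, and the device must be unmounted on both OSS nodes (and checked with the Lustre-patched e2fsck) before running it:

# read-only dry run first, keeping every line of output for later analysis
e2fsck -fn -C0 /dev/mpath/lun_12 2>&1 | tee /root/fsck-es1-OST000b-dryrun.log

# full verbose check, equivalent to the "fsck -f -v /dev/<ost_dev> -C0" suggested above
e2fsck -f -v -C0 /dev/mpath/lun_12 2>&1 | tee /root/fsck-es1-OST000b.log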
Hi Wojciech,
Stopping heartbeat sounds like a logical next step. Before I do that, though, I tried an fsck dry run using e2fsprogs v1.41.10 and got:
---------------------------------------------------------------
# e2fsck -n -v /dev/dm-11
e2fsck 1.41.10.sun2 (24-Feb-2010)
device /dev/dm-11 mounted by lustre per /proc/fs/lustre/obdfilter/es1-OST000a/mntdev
Warning!  /dev/dm-11 is mounted.
e2fsck: MMP: device currently active while trying to open /dev/dm-11

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 32768 <device>
------------------------------------------------------------------

Do you suppose stopping heartbeat will allow the OST to be unmounted all the way by lustre? I tried unmounting manually and got:
------------------------------------------------------------------
# umount /dev/dm-11
umount: /dev/dm-11: not mounted
------------------------------------------------------------------

Wojciech Turek wrote:
> Hi Damiri,
>
> If heartbeat is not able to start (mount) one of the OSTs I would recommend stopping heartbeat on both servers and then mounting the troubled OST manually. Then you should see why the OST is not mounted. To check the consistency of the filesystem, in your case I would first run fsck with the -n switch to see the extent of the damage; this also prevents further damage to your filesystem if a faulty controller or bad links are corrupting data. In a normal situation I use the following command: fsck -f -v /dev/<ost_dev> -C0
> Make sure that you log the output from the fsck, which will be essential for further troubleshooting.
>
> Best regards,
>
> Wojciech
>
> On 19 July 2011 16:58, Young, Damiri <Damiri.Young at unt.edu <mailto:Damiri.Young at unt.edu>> wrote:
>
> Many thanks for the useful info, Turek. I mentioned HA (heartbeat v2) issues because after the troubled I/O node got its paths back to the OSTs it failed 4 of the 5 OSTs over to its sibling server, where they're now mounted. To me it seems the OSTs (we're using lustre v1.6, btw) won't be released until the failed-over node is reset by its sibling.
>
> The OSSs seem to have trouble connecting to the 1 OST I mentioned:
> -------------------------------- messages -------------------------------------
> Jul 19 10:29:02 IO-10 kernel: LustreError: 29429:

--
DaMiri Young
HPC System Engineer
High Performance Computing Team | ACUS/CITC | UNT
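A side note on the failed umount above: it may simply be a device-name mismatch rather than proof that nothing is mounted. The startup logs earlier in the thread show es1-OST000a being started from /dev/mpath/lun_11, so the mount table records that alias (or the Lustre mount point) rather than the /dev/dm-11 spelling of the same device. A quick sketch of how to see what is actually mounted and unmount it under the recorded name (the device and paths are taken from the logs above and may differ on the node):

# list any lustre-type mounts and the exact device string recorded for them
grep lustre /proc/mounts

# confirm which block device the stuck obdfilter target is using
cat /proc/fs/lustre/obdfilter/es1-OST000a/mntdev

# then unmount using that exact device path (or the corresponding mount point)
umount /dev/mpath/lun_11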
Hi Damiri,

I use a more recent e2fsprogs: e2fsprogs-1.41.12.2.ora1-0redhat.x86_64. I think you can get an even more recent version from Whamcloud. I remember that your version of e2fsck does not allow read-only access to a mounted FS, which is annoying and not necessary; with a more recent version fsck -n should run even on a mounted FS. Anyway, for proper fsck'ing you need to umount the device that you want to check and make sure that it is not mounted on the other node. Stopping heartbeat should automatically umount your OSTs if your Filesystem resources are properly configured (a possible sequence is sketched below, after the quoted text).

Best regards,

Wojciech

On 20 July 2011 16:37, DaMiri Young <damiri at unt.edu> wrote:

> Hi Wojciech,
> Stopping heartbeat sounds like a logical next step. Before I do that, though, I tried an fsck dry run using e2fsprogs v1.41.10 and got:
> ---------------------------------------------------------------
> # e2fsck -n -v /dev/dm-11
> e2fsck 1.41.10.sun2 (24-Feb-2010)
> device /dev/dm-11 mounted by lustre per /proc/fs/lustre/obdfilter/es1-OST000a/mntdev
> Warning!  /dev/dm-11 is mounted.
> e2fsck: MMP: device currently active while trying to open /dev/dm-11
>
> The superblock could not be read or does not describe a correct ext2
> filesystem.  If the device is valid and it really contains an ext2
> filesystem (and not swap or ufs or something else), then the superblock
> is corrupt, and you might try running e2fsck with an alternate superblock:
>     e2fsck -b 32768 <device>
> ------------------------------------------------------------------
>
> Do you suppose stopping heartbeat will allow the OST to be unmounted all the way by lustre? I tried unmounting manually and got:
> ------------------------------------------------------------------
> # umount /dev/dm-11
> umount: /dev/dm-11: not mounted
> ------------------------------------------------------------------
>
> Wojciech Turek wrote:
>
>> Hi Damiri,
>>
>> If heartbeat is not able to start (mount) one of the OSTs I would recommend stopping heartbeat on both servers and then mounting the troubled OST manually. Then you should see why the OST is not mounted. To check the consistency of the filesystem, in your case I would first run fsck with the -n switch to see the extent of the damage; this also prevents further damage to your filesystem if a faulty controller or bad links are corrupting data. In a normal situation I use the following command: fsck -f -v /dev/<ost_dev> -C0
>> Make sure that you log the output from the fsck, which will be essential for further troubleshooting.
>>
>> Best regards,
>>
>> Wojciech
>>
>> On 19 July 2011 16:58, Young, Damiri <Damiri.Young at unt.edu <mailto:Damiri.Young at unt.edu>> wrote:
>>
>> Many thanks for the useful info, Turek. I mentioned HA (heartbeat v2) issues because after the troubled I/O node got its paths back to the OSTs it failed 4 of the 5 OSTs over to its sibling server, where they're now mounted. To me it seems the OSTs (we're using lustre v1.6, btw) won't be released until the failed-over node is reset by its sibling.
>>
>> The OSSs seem to have trouble connecting to the 1 OST I mentioned:
>> -------------------------------- messages -------------------------------------
>> Jul 19 10:29:02 IO-10 kernel: LustreError: 29429:
>
> --
> DaMiri Young
> HPC System Engineer
> High Performance Computing Team | ACUS/CITC | UNT
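To pull the advice above together, a possible order of operations is sketched below. Every concrete name in it (init script, device, mount point, and the OST chosen) is an example to adapt rather than the exact configuration of this cluster:

# on BOTH OSS nodes (IO-10 and IO-11): stop heartbeat so its Filesystem
# resources release the OSTs cleanly
service heartbeat stop

# on both nodes, confirm that no lustre target still has the device mounted
grep lustre /proc/mounts

# only once the device is unmounted everywhere, run the e2fsck dry run and
# repair passes with their output logged, as sketched earlier in the thread

# after a clean fsck, mount the OST manually to confirm it starts before
# handing control back to heartbeat (the mount point here is an example)
mount -t lustre /dev/mpath/lun_12 /mnt/es1-OST000b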