Jonathan Buch
2010-Jun-17 15:43 UTC
[Lustre-discuss] Lustre problem recovering from a hardware error
Hello everyone.

I hope someone on this list can advise me on what to do. A few days back one of our SAN systems started to produce I/O read errors. This affected one of our Lustre partitions.

Our setup: Lustre server 1.8.3 with a RHEL5 kernel. Clients are still 1.8.2.

client~# lfs df -h
UUID                   bytes    Used  Available  Use%  Mounted on
eu01-MDT0000_UUID      24.4G    3.3G      19.8G   13%  /mnt/home.eu01[MDT:0]
eu01-OST0000_UUID       3.5T    3.1T     257.3G   87%  /mnt/home.eu01[OST:0]
eu01-OST0001_UUID       3.7T    3.1T     397.6G   84%  /mnt/home.eu01[OST:1]
eu01-OST0002_UUID       3.7T    3.3T     164.8G   90%  /mnt/home.eu01[OST:2]
eu01-OST0003_UUID       3.7T    1.6T       1.9T   43%  /mnt/home.eu01[OST:3]
eu01-OST0004_UUID     889.2G  474.6G     369.4G   53%  /mnt/home.eu01[OST:4]
eu01-OST0005_UUID       6.3T    1.5T       4.4T   24%  /mnt/home.eu01[OST:5]
eu01-OST0006_UUID       6.3T    1.6T       4.4T   25%  /mnt/home.eu01[OST:6]
filesystem summary:    27.9T   14.7T      11.9T   52%  /mnt/home.eu01

(the eu01-OST0003_UUID is the "broken" one)

server~# df -h
/dev/sdd1  3,8T  1,7T  2,0T  46%  /mnt/lustre-eu-ost3

What I did next:

* the SAN system did not show any broken harddrives, but did output some information in its controller log:

  "Log Number","Concern Level","Date","Time","Device","Message",
  "4052","2","06/15/10","10:46:31","Configuration WWN: 20000050CC204369 Controller: 0","An unrecoverable drive error has occurred as a result of a command being issued. This may be due to a drive error in a non-fault tolerant array, such as RAID 0, or when the array is already in a degraded mode. The controller will pass the status from the drive back to the host system to allow the host recovery mechanisms to be used. Details: Host Loop = 0, Host Loop ID = 2, Mapped Logical Drive Requested = 2, Op Code = 0x88, Sense Data = 03/11/01."

  That Array is covered by a RAID5, so I would've preferred if it had just disabled the bad harddrive. I also can't remove the harddrive and let it rebuild the array because I can't deduce the exact harddrive from the logical block which had the I/O error.
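  For reference, the "Op Code" and "Sense Data" fields in the controller message above can be read against the standard SCSI code assignments. The following is only a minimal sketch: the lookup tables cover just the values that appear in this log entry (plus the neighbouring 11/00 code), and the function names `decode_sense` and `decode_opcode` are made up for illustration, not part of any vendor tool.

```python
# Tiny lookup tables built from the standard SCSI code assignments.
# They are deliberately incomplete: only codes relevant to this log entry.
SENSE_KEYS = {0x03: "MEDIUM ERROR"}
ADDITIONAL_SENSE = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x11, 0x01): "READ RETRIES EXHAUSTED",
}
OPCODES = {0x88: "READ(16)"}


def decode_sense(sense):
    """Split 'KK/AA/QQ' (hex) into sense key and ASC/ASCQ descriptions."""
    key, asc, ascq = (int(field, 16) for field in sense.strip(" .").split("/"))
    return (
        SENSE_KEYS.get(key, "unknown sense key 0x%02x" % key),
        ADDITIONAL_SENSE.get(
            (asc, ascq), "unknown ASC/ASCQ 0x%02x/0x%02x" % (asc, ascq)
        ),
    )


def decode_opcode(opcode):
    """Translate a hex SCSI operation code such as '0x88' into its name."""
    value = int(opcode, 16)
    return OPCODES.get(value, "unknown opcode 0x%02x" % value)


if __name__ == "__main__":
    # Values copied from the controller log entry quoted above.
    print(decode_opcode("0x88"), decode_sense("03/11/01."))
```

  Read this way, the entry describes a READ(16) command that failed with a medium error after the drive exhausted its read retries, which is consistent with an unreadable sector on one member drive rather than a fault at the filesystem level.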
* next I tried a filesystem check on the eu01-OST0003_UUID:

  e2fsck 1.41.10.sun2 (24-Feb-2010)

  I used -c to also check for bad blocks. The output contained quite a few errors in the filesystem, and some 2000 files were put into lost+found. I reran the check just to be sure.

* Now I remounted the partition (actually the server was so locked up that I had to force a reboot) and I now have more problems than before. Some kernel threads (ll_ost_40 and llog_process_th) are blocking CPUs and fill the syslog with kernel error messages about them getting stuck.

I attached relevant portions of my /var/log/messages file. I also attached output of `lctl dk` in case that helps.

As a sidenote: I can't unmount the OST; the unmount hangs indefinitely. I also can't reboot the system cleanly, as `reboot` hangs as well.

Right now I can access part of the filesystem, but accessing certain files/directories will lock up the client.

We do have backups (around 2 weeks old) on tape, but I would prefer not to have to replay them, as that would take around 5 days.

I'll provide more information if that is needed. Please advise on how I can resolve the situation.

Thank you very much.

Jonathan Buch

--
B.Sc. Jonathan Buch
Karlsruhe University of Applied Sciences
Institute of Materials and Processes (IMP)
CMSE - Systemadministration
Moltkestrasse 30
D-76133 Karlsruhe
Germany
jonathan.buch(at)hs-karlsruhe.de
Phone: +49 721 925 1415
Fax: +49 721 925 2348

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lctldk_20100617_2.log.gz
Type: application/x-gunzip
Size: 120677 bytes
Desc: not available
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100617/05b44497/attachment-0001.bin

-------------- next part --------------
jo at cmse-svr01:~$ sudo tail -f /var/log/messages | grep -v DHCP
Jun 17 16:20:00 cmse-svr01 kernel: Lustre: OBD class driver, http://www.lustre.org/
Jun 17 16:20:00 cmse-svr01 kernel: Lustre: Lustre Version: 1.8.3
Jun 17 16:20:00 cmse-svr01 kernel: Lustre: Build Version: 1.8.3-20100503161738-PRISTINE-2.6.18-164.11.1.el5lustre.1.8.2-0rc4
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: Added LNI 192.168.21.174 at tcp1 [8/256/0/180]
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: Added LNI 10.101.0.2 at tcp [8/256/0/180]
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: Accept secure, port 988
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: Lustre Client File System; http://www.lustre.org/
Jun 17 16:20:01 cmse-svr01 kernel: init dynlocks cache
Jun 17 16:20:01 cmse-svr01 kernel: ldiskfs created from ext3-2.6-rhel5
Jun 17 16:20:01 cmse-svr01 kernel: kjournald starting. Commit interval 5 seconds
Jun 17 16:20:01 cmse-svr01 kernel: LDISKFS FS on sdc1, internal journal
Jun 17 16:20:01 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:20:01 cmse-svr01 kernel: kjournald starting. Commit interval 5 seconds
Jun 17 16:20:01 cmse-svr01 kernel: LDISKFS FS on sdc1, internal journal
Jun 17 16:20:01 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: MGS MGS started Jun 17 16:20:01 cmse-svr01 kernel: Lustre: MGC192.168.21.174 at tcp1: Reactivating import Jun 17 16:20:01 cmse-svr01 kernel: Lustre: Enabling user_xattr Jun 17 16:20:01 cmse-svr01 kernel: Lustre: eu01-MDT0000: denying duplicate export for 324cd728-79b3-56ea-3a4a-b86fd985e0c7, -114 Jun 17 16:20:01 cmse-svr01 kernel: Lustre: 24142:0:(mds_fs.c:673:mds_init_server_data()) RECOVERY: service eu01-MDT0000, 17 recoverable clients, 0 delayed clients, last_transno 21474848741 Jun 17 16:20:01 cmse-svr01 kernel: Lustre: eu01-MDT0000: Now serving eu01-MDT0000 on /dev/sdc1 with recovery enabled Jun 17 16:20:01 cmse-svr01 kernel: Lustre: eu01-MDT0000: Will be in recovery for at least 5:00, or until 17 clients reconnect Jun 17 16:20:01 cmse-svr01 kernel: Lustre: 24142:0:(mds_lov.c:1167:mds_notify()) MDS eu01-MDT0000: add target eu01-OST0000_UUID Jun 17 16:20:02 cmse-svr01 kernel: Lustre: 24142:0:(mds_lov.c:1167:mds_notify()) MDS eu01-MDT0000: add target eu01-OST0001_UUID Jun 17 16:20:02 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480062988 sent from eu01-OST0005-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (5s prior to deadline). Jun 17 16:20:02 cmse-svr01 kernel: req at ffff810249acac00 x1338805480062988/t0 o8->eu01-OST0005_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784407 ref 1 fl Rpc:N/0/0 rc 0/0 Jun 17 16:20:02 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480062990 sent from eu01-OST0006-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (5s prior to deadline). 
Jun 17 16:20:02 cmse-svr01 kernel: req at ffff8101a65c9000 x1338805480062990/t0 o8->eu01-OST0006_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784407 ref 1 fl Rpc:N/0/0 rc 0/0 Jun 17 16:20:07 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480062983 sent from eu01-OST0000-osc to NID 0 at lo 5s ago has timed out (5s prior to deadline). Jun 17 16:20:07 cmse-svr01 kernel: req at ffff81029c757400 x1338805480062983/t0 o8->eu01-OST0000_UUID at 192.168.21.174@tcp1:28/4 lens 368/584 e 0 to 1 dl 1276784407 ref 1 fl Rpc:N/0/0 rc 0/0 Jun 17 16:20:08 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.126 at tcp1 Jun 17 16:20:10 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.133 at tcp1 Jun 17 16:20:15 cmse-svr01 kernel: Lustre: Failing over eu01-MDT0000 Jun 17 16:20:15 cmse-svr01 kernel: Lustre: *** setting obd eu01-MDT0000 device ''sdc1'' read-only *** Jun 17 16:20:15 cmse-svr01 kernel: Turning device sdc (0x800021) read-only Jun 17 16:20:15 cmse-svr01 kernel: Lustre: Failing over eu01-OST0006-osc Jun 17 16:20:15 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0000-osc: tried all connections, increasing latency to 1s Jun 17 16:20:15 cmse-svr01 kernel: Lustre: MGS has stopped. Jun 17 16:20:15 cmse-svr01 kernel: Lustre: eu01-MDT0000: shutting down for failover; client state will be preserved. Jun 17 16:20:15 cmse-svr01 kernel: Lustre: MDT eu01-MDT0000 has stopped. Jun 17 16:20:18 cmse-svr01 kernel: Removing read-only on unknown block (0x800021) Jun 17 16:20:18 cmse-svr01 kernel: Lustre: server umount eu01-MDT0000 complete Jun 17 16:20:46 cmse-svr01 kernel: kjournald starting. Commit interval 5 seconds Jun 17 16:20:46 cmse-svr01 kernel: LDISKFS FS on sdc1, internal journal Jun 17 16:20:46 cmse-svr01 kernel: LDISKFS-fs: recovery complete. 
Jun 17 16:20:46 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jun 17 16:20:46 cmse-svr01 kernel: kjournald starting. Commit interval 5 seconds Jun 17 16:20:46 cmse-svr01 kernel: LDISKFS FS on sdc1, internal journal Jun 17 16:20:46 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jun 17 16:20:46 cmse-svr01 kernel: Lustre: MGS MGS started Jun 17 16:20:46 cmse-svr01 kernel: Lustre: MGC192.168.21.174 at tcp1: Reactivating import Jun 17 16:20:47 cmse-svr01 kernel: Lustre: Enabling user_xattr Jun 17 16:20:47 cmse-svr01 kernel: Lustre: eu01-MDT0000: Now serving eu01-MDT0000 on /dev/sdc1 with recovery enabled Jun 17 16:20:47 cmse-svr01 kernel: Lustre: 24240:0:(mds_lov.c:1167:mds_notify()) MDS eu01-MDT0000: add target eu01-OST0000_UUID Jun 17 16:20:47 cmse-svr01 kernel: Lustre: 24240:0:(mds_lov.c:1167:mds_notify()) Skipped 5 previous similar messages Jun 17 16:20:47 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063008 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (5s prior to deadline). Jun 17 16:20:47 cmse-svr01 kernel: req at ffff81015d47c000 x1338805480063008/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784452 ref 1 fl Rpc:N/0/0 rc 0/0 Jun 17 16:20:47 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 4 previous similar messages Jun 17 16:20:47 cmse-svr01 kernel: Lustre: eu01-MDT0000: Aborting recovery. Jun 17 16:20:51 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.103 at tcp1 Jun 17 16:20:52 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063004 sent from eu01-OST0000-osc to NID 0 at lo 5s ago has timed out (5s prior to deadline). 
Jun 17 16:20:52 cmse-svr01 kernel: req at ffff810323104000 x1338805480063004/t0 o8->eu01-OST0000_UUID at 192.168.21.174@tcp1:28/4 lens 368/584 e 0 to 1 dl 1276784452 ref 1 fl Rpc:N/0/0 rc 0/0 Jun 17 16:20:52 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Jun 17 16:20:53 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.118 at tcp1 Jun 17 16:20:53 cmse-svr01 kernel: Lustre: Skipped 1 previous similar message Jun 17 16:20:59 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.107 at tcp1 Jun 17 16:20:59 cmse-svr01 kernel: Lustre: Skipped 1 previous similar message Jun 17 16:21:03 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.105 at tcp1 Jun 17 16:21:03 cmse-svr01 kernel: Lustre: Skipped 5 previous similar messages Jun 17 16:21:12 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.126 at tcp1 Jun 17 16:21:12 cmse-svr01 kernel: Lustre: Skipped 1 previous similar message Jun 17 16:21:17 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0000-osc: tried all connections, increasing latency to 1s Jun 17 16:21:17 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 3 previous similar messages Jun 17 16:21:17 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063018 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (6s prior to deadline). 
Jun 17 16:21:17 cmse-svr01 kernel: req at ffff8102fd548400 x1338805480063018/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784483 ref 1 fl Rpc:N/0/0 rc 0/0 Jun 17 16:21:17 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 3 previous similar messages Jun 17 16:21:29 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.103 at tcp1 Jun 17 16:21:29 cmse-svr01 kernel: Lustre: Skipped 9 previous similar messages Jun 17 16:21:32 cmse-svr01 kernel: kjournald starting. Commit interval 5 seconds Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS FS on sdc4, internal journal Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jun 17 16:21:32 cmse-svr01 kernel: kjournald starting. Commit interval 5 seconds Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS FS on sdc4, internal journal Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS-fs: file extents enabled Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS-fs: mballoc enabled Jun 17 16:21:42 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0000-osc: tried all connections, increasing latency to 2s Jun 17 16:21:42 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 6 previous similar messages Jun 17 16:21:42 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063031 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (7s prior to deadline). 
Jun 17 16:21:42 cmse-svr01 kernel: req at ffff81030dcf1000 x1338805480063031/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784509 ref 1 fl Rpc:N/0/0 rc 0/0 Jun 17 16:21:42 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 6 previous similar messages Jun 17 16:21:46 cmse-svr01 kernel: Lustre: Filtering OBD driver; http://www.lustre.org/ Jun 17 16:21:46 cmse-svr01 kernel: Lustre: eu01-OST0000: Now serving eu01-OST0000 on /dev/sdc4 with recovery enabled Jun 17 16:22:01 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.119 at tcp1 Jun 17 16:22:01 cmse-svr01 kernel: Lustre: Skipped 47 previous similar messages Jun 17 16:22:07 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0000-osc: tried all connections, increasing latency to 3s Jun 17 16:22:07 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 6 previous similar messages Jun 17 16:22:07 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063039 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (8s prior to deadline). Jun 17 16:22:07 cmse-svr01 kernel: req at ffff81029e491400 x1338805480063039/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784535 ref 1 fl Rpc:N/0/0 rc 0/0 Jun 17 16:22:07 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 2 previous similar messages Jun 17 16:22:07 cmse-svr01 kernel: Lustre: 23941:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/7 OSTs are active, abort quota recovery Jun 17 16:22:07 cmse-svr01 kernel: Lustre: eu01-OST0000: received MDS connection from 0 at lo Jun 17 16:22:07 cmse-svr01 kernel: Lustre: MDS eu01-MDT0000: eu01-OST0000_UUID now active, resetting orphans Jun 17 16:22:30 cmse-svr01 kernel: kjournald starting. 
Commit interval 5 seconds Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS FS on sdb1, internal journal Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jun 17 16:22:30 cmse-svr01 kernel: kjournald starting. Commit interval 5 seconds Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS FS on sdb1, internal journal Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS-fs: file extents enabled Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS-fs: mballoc enabled Jun 17 16:22:30 cmse-svr01 kernel: Lustre: eu01-OST0001: Now serving eu01-OST0001 on /dev/sdb1 with recovery enabled Jun 17 16:22:32 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0001-osc: tried all connections, increasing latency to 4s Jun 17 16:22:32 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 6 previous similar messages Jun 17 16:22:32 cmse-svr01 kernel: Lustre: 23941:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/7 OSTs are active, abort quota recovery Jun 17 16:22:32 cmse-svr01 kernel: Lustre: eu01-OST0001: received MDS connection from 0 at lo Jun 17 16:22:32 cmse-svr01 kernel: Lustre: MDS eu01-MDT0000: eu01-OST0001_UUID now active, resetting orphans Jun 17 16:22:57 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0002-osc: tried all connections, increasing latency to 5s Jun 17 16:22:57 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 5 previous similar messages Jun 17 16:22:57 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063100 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (10s prior to deadline). 
Jun 17 16:22:57 cmse-svr01 kernel: req at ffff8102a0cc1800 x1338805480063100/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784587 ref 1 fl Rpc:N/0/0 rc 0/0 Jun 17 16:22:57 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 5 previous similar messages Jun 17 16:23:10 cmse-svr01 kernel: kjournald starting. Commit interval 5 seconds Jun 17 16:23:10 cmse-svr01 kernel: LDISKFS FS on sdb2, internal journal Jun 17 16:23:10 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jun 17 16:23:11 cmse-svr01 kernel: kjournald starting. Commit interval 5 seconds Jun 17 16:23:11 cmse-svr01 kernel: LDISKFS FS on sdb2, internal journal Jun 17 16:23:11 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jun 17 16:23:11 cmse-svr01 kernel: LDISKFS-fs: file extents enabled Jun 17 16:23:11 cmse-svr01 kernel: LDISKFS-fs: mballoc enabled Jun 17 16:23:22 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0002-osc: tried all connections, increasing latency to 6s Jun 17 16:23:22 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 4 previous similar messages Jun 17 16:23:27 cmse-svr01 kernel: Lustre: eu01-OST0002: Now serving eu01-OST0002 on /dev/sdb2 with recovery enabled Jun 17 16:23:47 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0002-osc: tried all connections, increasing latency to 7s Jun 17 16:23:47 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 4 previous similar messages Jun 17 16:23:47 cmse-svr01 kernel: Lustre: 23941:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/7 OSTs are active, abort quota recovery Jun 17 16:23:47 cmse-svr01 kernel: Lustre: eu01-OST0002: received MDS connection from 0 at lo Jun 17 16:23:47 cmse-svr01 kernel: Lustre: MDS eu01-MDT0000: eu01-OST0002_UUID now active, resetting orphans Jun 17 16:24:12 cmse-svr01 
kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063143 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (13s prior to deadline). Jun 17 16:24:12 cmse-svr01 kernel: req at ffff810153414000 x1338805480063143/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784665 ref 1 fl Rpc:N/0/0 rc 0/0 Jun 17 16:24:12 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 8 previous similar messages Jun 17 16:24:37 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0003-osc: tried all connections, increasing latency to 9s Jun 17 16:24:37 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 8 previous similar messages Jun 17 16:25:52 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0003-osc: tried all connections, increasing latency to 12s Jun 17 16:25:52 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 11 previous similar messages Jun 17 16:26:42 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063191 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (19s prior to deadline). Jun 17 16:26:42 cmse-svr01 kernel: req at ffff8102c094e400 x1338805480063191/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784821 ref 1 fl Rpc:N/0/0 rc 0/0 Jun 17 16:26:42 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 17 previous similar messages Jun 17 16:26:44 cmse-svr01 kernel: kjournald starting. Commit interval 5 seconds Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS FS on sdd1, internal journal Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jun 17 16:26:44 cmse-svr01 kernel: kjournald starting. 
Commit interval 5 seconds Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS FS on sdd1, internal journal Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS-fs: file extents enabled Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS-fs: mballoc enabled Jun 17 16:26:45 cmse-svr01 kernel: Lustre: eu01-OST0003: Now serving eu01-OST0003 on /dev/sdd1 with recovery enabled Jun 17 16:27:07 cmse-svr01 kernel: Lustre: 23941:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/7 OSTs are active, abort quota recovery Jun 17 16:27:07 cmse-svr01 kernel: Lustre: eu01-OST0003: received MDS connection from 0 at lo Jun 17 16:27:07 cmse-svr01 kernel: Lustre: MDS eu01-MDT0000: eu01-OST0003_UUID now active, resetting orphans Jun 17 16:27:17 cmse-svr01 kernel: CPU 0: Jun 17 16:27:17 cmse-svr01 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ppdev(U) parport_pc(U) lp(U) parport(U) nfsd(U) exportfs(U) auth_rpcgss(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) fuse(U) loop(U) ixgbe(U) i2c_i801(U) pl2303(U) serio_raw(U) i2c_core(U) pcspkr(U) shpchp(U) usbserial(U) joydev(U) ext3(U) jbd(U) dm_mirror(U) dm_log(U) dm_snapshot(U) dm_mod(U) sg(U) sd_mod(U) st(U) ch(U) ide_cd(U) floppy(U) cdrom(U) 3w_9xxx(U) uhci_hcd(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) qla2xxx(U) scsi_transport_fc(U) ehci_hcd(U) scsi_mod(U) igb(U) 8021q(U) Jun 17 16:27:17 cmse-svr01 kernel: Pid: 24646, comm: llog_process_th Tainted: G 2.6.18-164.11.1.el5lustre.1.8.2-0rc4 #3 Jun 17 16:27:17 cmse-svr01 kernel: RIP: 0010:[<ffffffff88950645>] [<ffffffff88950645>] :ldiskfs:ldiskfs_find_entry+0x245/0x5b0 Jun 17 16:27:17 cmse-svr01 kernel: RSP: 0018:ffff810628abd970 EFLAGS: 00000202 Jun 17 16:27:17 cmse-svr01 kernel: RAX: 0000000000000000 RBX: 
0000000000000007 RCX: 000000004c1a30bb Jun 17 16:27:17 cmse-svr01 kernel: RDX: ffff81028f388000 RSI: ffff810628abd8e0 RDI: ffff81034a868118 Jun 17 16:27:17 cmse-svr01 kernel: RBP: ffff81063edcb100 R08: ffff81063bbdbff8 R09: ffff81063bbdb000 Jun 17 16:27:17 cmse-svr01 kernel: R10: ffff810628abd950 R11: 00000000000000e8 R12: 0000000000000000 Jun 17 16:27:17 cmse-svr01 kernel: R13: 0000000000000002 R14: ffff81063b511f10 R15: ffffffff80063adb Jun 17 16:27:17 cmse-svr01 kernel: FS: 00002ac84db0f6e0(0000) GS:ffffffff803c2000(0000) knlGS:0000000000000000 Jun 17 16:27:17 cmse-svr01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 17 16:27:17 cmse-svr01 kernel: CR2: 0000000002829808 CR3: 000000063e705000 CR4: 00000000000006e0 Jun 17 16:27:17 cmse-svr01 kernel: Jun 17 16:27:17 cmse-svr01 kernel: Call Trace: Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff80128a60>] avc_has_perm+0x46/0x58 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88952403>] :ldiskfs:ldiskfs_lookup+0x53/0x281 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff80036eb5>] __lookup_hash+0x10b/0x12f Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff800e73ba>] lookup_one_len+0x54/0x62 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88a9a35d>] :obdfilter:filter_fid2dentry+0x42d/0x740 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8873d6a4>] :ptlrpc:at_measured+0x114/0x320 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8873d6a4>] :ptlrpc:at_measured+0x114/0x320 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8000d3a5>] dput+0x2c/0x113 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88aa1fb4>] :obdfilter:filter_destroy+0x154/0x1fb0 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88723175>] :ptlrpc:lustre_msg_set_transno+0x45/0x120 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff887221f5>] :ptlrpc:lustre_msg_get_transno+0x35/0xf0 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8870f21a>] :ptlrpc:after_reply+0x97a/0xd00 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8002e3d3>] __wake_up+0x38/0x4f Jun 17 16:27:17 cmse-svr01 kernel: 
[<ffffffff887153c4>] :ptlrpc:ptlrpc_queue_wait+0x1654/0x16f0 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88722cb5>] :ptlrpc:lustre_msg_set_opc+0x45/0x120 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8870be35>] :ptlrpc:ptlrpc_at_set_req_timeout+0x85/0xd0 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8005c362>] cache_alloc_refill+0x106/0x186 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88ab1363>] :obdfilter:filter_recov_log_mds_ost_cb+0x5b3/0xf10 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88735f66>] :ptlrpc:llog_client_next_block+0x5a6/0x650 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8005c362>] cache_alloc_refill+0x106/0x186 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88643d22>] :obdclass:llog_process_thread+0x882/0xc30 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff886434a0>] :obdclass:llog_process_thread+0x0/0xc30 Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11 Jun 17 16:27:17 cmse-svr01 kernel: Jun 17 16:27:19 cmse-svr01 kernel: CPU 1: Jun 17 16:27:19 cmse-svr01 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ppdev(U) parport_pc(U) lp(U) parport(U) nfsd(U) exportfs(U) auth_rpcgss(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) fuse(U) loop(U) ixgbe(U) i2c_i801(U) pl2303(U) serio_raw(U) i2c_core(U) pcspkr(U) shpchp(U) usbserial(U) joydev(U) ext3(U) jbd(U) dm_mirror(U) dm_log(U) dm_snapshot(U) dm_mod(U) sg(U) sd_mod(U) st(U) ch(U) ide_cd(U) floppy(U) cdrom(U) 3w_9xxx(U) uhci_hcd(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) qla2xxx(U) scsi_transport_fc(U) ehci_hcd(U) scsi_mod(U) igb(U) 8021q(U) Jun 17 16:27:19 cmse-svr01 kernel: Pid: 24314, comm: ll_ost_40 Tainted: G 2.6.18-164.11.1.el5lustre.1.8.2-0rc4 #3 Jun 17 16:27:19 cmse-svr01 kernel: RIP: 
0010:[<ffffffff889505d8>] [<ffffffff889505d8>] :ldiskfs:ldiskfs_find_entry+0x1d8/0x5b0 Jun 17 16:27:19 cmse-svr01 kernel: RSP: 0018:ffff8102bc35d690 EFLAGS: 00000202 Jun 17 16:27:19 cmse-svr01 kernel: RAX: 0000000000000000 RBX: 0000000000000007 RCX: 000000004c1a30bd Jun 17 16:27:19 cmse-svr01 kernel: RDX: ffff81028f388000 RSI: ffff8102bc35d600 RDI: ffff81010b644610 Jun 17 16:27:19 cmse-svr01 kernel: RBP: ffff8102a3c0f080 R08: ffff8103148dfff8 R09: ffff8103148df000 Jun 17 16:27:19 cmse-svr01 kernel: R10: ffff8102bc35d670 R11: ffff81010b6e9ee8 R12: 0000000000000000 Jun 17 16:27:19 cmse-svr01 kernel: R13: 0000000000000002 R14: ffff8102e58855b0 R15: ffffffff80063adb Jun 17 16:27:19 cmse-svr01 kernel: FS: 00002ac84db0f6e0(0000) GS:ffff81010b699440(0000) knlGS:0000000000000000 Jun 17 16:27:19 cmse-svr01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 17 16:27:19 cmse-svr01 kernel: CR2: 00002b82d0a84ad8 CR3: 000000063e705000 CR4: 00000000000006e0 Jun 17 16:27:19 cmse-svr01 kernel: Jun 17 16:27:19 cmse-svr01 kernel: Call Trace: Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff80128a60>] avc_has_perm+0x46/0x58 Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88952403>] :ldiskfs:ldiskfs_lookup+0x53/0x281 Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff80036eb5>] __lookup_hash+0x10b/0x12f Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff800e73ba>] lookup_one_len+0x54/0x62 Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88a9a35d>] :obdfilter:filter_fid2dentry+0x42d/0x740 Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8027e2f0>] __down_trylock+0x44/0x4e Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88ab454b>] :obdfilter:filter_lvbo_init+0x3bb/0x68b Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88607ca7>] :lnet:lnet_prep_send+0x67/0xb0 Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff886e25be>] :ptlrpc:ldlm_resource_get+0x90e/0xa60 Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88a77530>] :ost:ost_blocking_ast+0x0/0x610 Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88701150>] 
:ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff886d8efa>] :ptlrpc:ldlm_lock_create+0xba/0xa00
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8871de21>] :ptlrpc:lustre_swab_buf+0x81/0x170
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff886fbb90>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff886fbb90>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88701150>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88a77530>] :ost:ost_blocking_ast+0x0/0x610
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff886fe27f>] :ptlrpc:ldlm_handle_enqueue+0x66f/0x1210
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8871ca38>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88a7edb7>] :ost:ost_handle+0x4e17/0x53e0
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88729c8d>] :ptlrpc:ptlrpc_server_handle_request+0xaad/0x1150
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8008c23d>] __activate_task+0x56/0x6d
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff80047205>] try_to_wake_up+0x473/0x485
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8003dbd8>] lock_timer_base+0x1b/0x3c
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8008ac3a>] __wake_up_common+0x3e/0x68
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8872d708>] :ptlrpc:ptlrpc_main+0x1258/0x1420
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8008c837>] default_wake_function+0x0/0xe
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8872c4b0>] :ptlrpc:ptlrpc_main+0x0/0x1420
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 17 16:27:19 cmse-svr01 kernel:
Jun 17 16:27:29 cmse-svr01 kernel: CPU 1:
Jun 17 16:27:29 cmse-svr01 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ppdev(U) parport_pc(U) lp(U) parport(U) nfsd(U) exportfs(U) auth_rpcgss(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) fuse(U) loop(U) ixgbe(U) i2c_i801(U) pl2303(U) serio_raw(U) i2c_core(U) pcspkr(U) shpchp(U) usbserial(U) joydev(U) ext3(U) jbd(U) dm_mirror(U) dm_log(U) dm_snapshot(U) dm_mod(U) sg(U) sd_mod(U) st(U) ch(U) ide_cd(U) floppy(U) cdrom(U) 3w_9xxx(U) uhci_hcd(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) qla2xxx(U) scsi_transport_fc(U) ehci_hcd(U) scsi_mod(U) igb(U) 8021q(U)
Jun 17 16:27:29 cmse-svr01 kernel: Pid: 24314, comm: ll_ost_40 Tainted: G 2.6.18-164.11.1.el5lustre.1.8.2-0rc4 #3
Jun 17 16:27:29 cmse-svr01 kernel: RIP: 0010:[<ffffffff8895064f>] [<ffffffff8895064f>] :ldiskfs:ldiskfs_find_entry+0x24f/0x5b0
Jun 17 16:27:29 cmse-svr01 kernel: RSP: 0018:ffff8102bc35d690 EFLAGS: 00000202
Jun 17 16:27:29 cmse-svr01 kernel: RAX: 0000000000000000 RBX: 0000000000000007 RCX: 000000004c1a30bd
Jun 17 16:27:29 cmse-svr01 kernel: RDX: ffff81028f388000 RSI: ffff8102bc35d600 RDI: ffff81010b644610
Jun 17 16:27:29 cmse-svr01 kernel: RBP: ffff8102a3c0f080 R08: ffff8103148dfff8 R09: ffff8103148df000
Jun 17 16:27:29 cmse-svr01 kernel: R10: ffff8102bc35d670 R11: ffff81010b6e9ee8 R12: 0000000000000000
Jun 17 16:27:29 cmse-svr01 kernel: R13: 0000000000000002 R14: ffff8102e58855b0 R15: ffffffff80063adb
Jun 17 16:27:29 cmse-svr01 kernel: FS: 00002ac84db0f6e0(0000) GS:ffff81010b699440(0000) knlGS:0000000000000000
Jun 17 16:27:29 cmse-svr01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 17 16:27:29 cmse-svr01 kernel: CR2: 00002b82d0a84ad8 CR3: 000000063e705000 CR4: 00000000000006e0
Jun 17 16:27:29 cmse-svr01 kernel:
Jun 17 16:27:29 cmse-svr01 kernel: Call Trace:
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff80128a60>] avc_has_perm+0x46/0x58
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88952403>] :ldiskfs:ldiskfs_lookup+0x53/0x281
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff80036eb5>] __lookup_hash+0x10b/0x12f
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff800e73ba>] lookup_one_len+0x54/0x62
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88a9a35d>] :obdfilter:filter_fid2dentry+0x42d/0x740
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8027e2f0>] __down_trylock+0x44/0x4e
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88ab454b>] :obdfilter:filter_lvbo_init+0x3bb/0x68b
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88607ca7>] :lnet:lnet_prep_send+0x67/0xb0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff886e25be>] :ptlrpc:ldlm_resource_get+0x90e/0xa60
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88a77530>] :ost:ost_blocking_ast+0x0/0x610
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88701150>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff886d8efa>] :ptlrpc:ldlm_lock_create+0xba/0xa00
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8871de21>] :ptlrpc:lustre_swab_buf+0x81/0x170
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff886fbb90>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff886fbb90>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88701150>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88a77530>] :ost:ost_blocking_ast+0x0/0x610
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff886fe27f>] :ptlrpc:ldlm_handle_enqueue+0x66f/0x1210
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8871ca38>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88a7edb7>] :ost:ost_handle+0x4e17/0x53e0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88729c8d>] :ptlrpc:ptlrpc_server_handle_request+0xaad/0x1150
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8008c23d>] __activate_task+0x56/0x6d
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff80047205>] try_to_wake_up+0x473/0x485
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8003dbd8>] lock_timer_base+0x1b/0x3c
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8008ac3a>] __wake_up_common+0x3e/0x68
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8872d708>] :ptlrpc:ptlrpc_main+0x1258/0x1420
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8008c837>] default_wake_function+0x0/0xe
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8872c4b0>] :ptlrpc:ptlrpc_main+0x0/0x1420
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 17 16:27:29 cmse-svr01 kernel:
(many more of these hung-process stack traces follow, about one every two seconds)
Jun 17 16:43:59 cmse-svr01 kernel: Lustre: 24299:0:(ldlm_lib.c:575:target_handle_reconnect()) eu01-OST0003: eu01-mdtlov_UUID reconnecting
Jun 17 16:43:59 cmse-svr01 kernel: Lustre: 24299:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 4 previous similar messages
Jun 17 16:43:59 cmse-svr01 kernel: Lustre: 24299:0:(ldlm_lib.c:875:target_handle_connect()) eu01-OST0003: refuse reconnection from eu01-mdtlov_UUID at 0@lo to 0xffff81034a921800; still busy with 1 active RPCs
Jun 17 16:43:59 cmse-svr01 kernel: Lustre: 24299:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 4 previous similar messages
Jun 17 17:02:41 cmse-svr01 kernel: Lustre: 24437:0:(niobuf.c:202:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff810324ad9480