Dear All,

I am seeing Lustre errors on my HPC cluster, as given below. Can anyone help me to resolve this problem? Thanks in advance.

Sep 30 08:40:23 service0 kernel: [343138.837222] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 30 08:40:23 service0 kernel: [343138.837233] Lustre: lustre-OST0008-osc-ffff880b272cf800: Connection to service lustre-OST0008 via nid 10.148.0.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 30 08:40:24 service0 kernel: [343139.837260] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380984193067288 sent from lustre-OST0006-osc-ffff880b272cf800 to NID 10.148.0.106@o2ib 7s ago has timed out (7s prior to deadline).
Sep 30 08:40:24 service0 kernel: [343139.837263] req@ffff880a5f800c00 x1380984193067288/t0 o3->lustre-OST0006_UUID@10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl 1317352224 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:24 service0 kernel: [343139.837269] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
Sep 30 08:40:24 service0 kernel: [343140.129284] LustreError: 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Sep 30 08:40:24 service0 kernel: [343140.129290] LustreError: 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous similar message
Sep 30 08:40:24 service0 kernel: [343140.129295] LustreError: 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Sep 30 08:40:24 service0 kernel: [343140.129299] LustreError: 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Sep 30 08:40:25 service0 kernel: [343140.837308] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380984193067299 sent from lustre-OST0010-osc-ffff880b272cf800 to NID 10.148.0.106@o2ib 7s ago has timed out (7s prior to deadline).
Sep 30 08:40:25 service0 kernel: [343140.837311] req@ffff880a557c4400 x1380984193067299/t0 o3->lustre-OST0010_UUID@10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl 1317352225 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:25 service0 kernel: [343140.837316] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Sep 30 08:40:26 service0 kernel: [343141.245365] LustreError: 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Sep 30 08:40:26 service0 kernel: [343141.245371] LustreError: 22729:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Sep 30 08:40:26 service0 kernel: [343141.245378] LustreError: 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous similar message
Sep 30 08:40:33 service0 kernel: [343148.245683] Lustre: 22725:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380984193067302 sent from lustre-OST0004-osc-ffff880b272cf800 to NID 10.148.0.106@o2ib 14s ago has timed out (14s prior to deadline).
Sep 30 08:40:33 service0 kernel: [343148.245686] req@ffff8805c879e800 x1380984193067302/t0 o103->lustre-OST0004_UUID@10.148.0.106@o2ib:17/18 lens 296/384 e 0 to 1 dl 1317352233 ref 1 fl Rpc:N/0/0 rc 0/0
Sep 30 08:40:33 service0 kernel: [343148.245692] Lustre: 22725:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 30 08:40:33 service0 kernel: [343148.245708] LustreError: 22725:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Sep 30 08:40:33 service0 kernel: [343148.245714] LustreError: 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Sep 30 08:40:33 service0 kernel: [343148.245717] LustreError: 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Sep 30 08:40:36 service0 kernel: [343151.548005] LustreError: 11-0: an error occurred while communicating with 10.148.0.106@o2ib. The ost_connect operation failed with -16
Sep 30 08:40:36 service0 kernel: [343151.548008] LustreError: Skipped 1 previous similar message
Sep 30 08:40:36 service0 kernel: [343151.548024] LustreError: 167-0: This client was evicted by lustre-OST000b; in progress operations using this service will fail.
Sep 30 08:40:36 service0 kernel: [343151.548250] LustreError: 30452:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Sep 30 08:40:36 service0 kernel: [343151.550210] LustreError: 8300:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88049528c400 x1380984193067406/t0 o3->lustre-OST000b_UUID@10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:36 service0 kernel: [343151.594742] Lustre: lustre-OST0000-osc-ffff880b272cf800: Connection restored to service lustre-OST0000 using nid 10.148.0.106@o2ib.
Sep 30 08:40:36 service0 kernel: [343151.837203] Lustre: lustre-OST0006-osc-ffff880b272cf800: Connection restored to service lustre-OST0006 using nid 10.148.0.106@o2ib.
Sep 30 08:40:37 service0 kernel: [343152.842631] Lustre: lustre-OST0003-osc-ffff880b272cf800: Connection restored to service lustre-OST0003 using nid 10.148.0.106@o2ib.
Sep 30 08:40:37 service0 kernel: [343152.842636] Lustre: Skipped 3 previous similar messages

Thanks and Regards
Ashok

--
Ashok Nulguda
TATA ELXSI LTD
Mb: +91 9689945767
Email: ashokn at tataelxsi.co.in
Hello Ashok,

Is the cluster hanging or otherwise behaving badly? The logs below show that the client lost its connection to 10.148.0.106 for 10 seconds or so; it should have recovered OK.

If you want further help from the list, you need to add more detail about the cluster: a general description of the number of OSSs/OSTs, the clients, the Lustre version, etc., and a description of what is actually going wrong (e.g. hanging, offline).

The first thing is to check the infrastructure; in this case you should check your IB network for errors.

On 30-September-2011 2:39 PM, Ashok nulguda wrote:
> Dear All,
>
> I am seeing Lustre errors on my HPC cluster, as given below. Can anyone help me to resolve this problem?
> [...]

--
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: briano at sgi.com, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA http://www.sgi.com/support/services
-------------------------------------------------
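As a rough illustration of the kind of IB checks suggested above (a minimal sketch, assuming the standard OFED infiniband-diags tools are installed on a node attached to the fabric; option syntax can vary between OFED releases):

    ibstat                   # local HCA state, link rate and LIDs
    ibdiagnet                # fabric-wide discovery and error report
    ibcheckerrors            # flag ports whose error counters exceed thresholds
    perfquery <lid> <port>   # raw PMA counters for one suspect LID/port

Here <lid> and <port> are placeholders for the LID and port number of the link you want to look at more closely.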
Dear Sir,

Thanks for your help. My system is an ICE 8400 cluster with 64 nodes and 30 TB of Lustre.

oss1:~ # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             100G  5.8G   95G   6% /
tmpfs                  12G  1.1M   12G   1% /dev
tmpfs                  12G   88K   12G   1% /dev/shm
/dev/sda1            1020M  181M  840M  18% /boot
/dev/sda4             170G  6.6M  170G   1% /data1
/dev/mapper/3600a0b8000755ee0000010964dc231bc_part1  2.1T   74G  1.9T   4% /OST1
/dev/mapper/3600a0b8000755ed1000010614dc23425_part1  1.7T   67G  1.5T   5% /OST4
/dev/mapper/3600a0b8000755ee0000010a04dc23323_part1  2.1T   67G  1.9T   4% /OST5
/dev/mapper/3600a0b8000755f1f000011224dc239d7_part1  1.7T   67G  1.5T   5% /OST8
/dev/mapper/3600a0b8000755dbe000010de4dc23997_part1  2.1T   66G  1.9T   4% /OST9
/dev/mapper/3600a0b8000755f1f000011284dc23b5a_part1  1.7T   66G  1.5T   5% /OST12
/dev/mapper/3600a0b8000755eb3000011304dc23db1_part1  2.1T   66G  1.9T   4% /OST13
/dev/mapper/3600a0b8000755f22000011104dc23ec7_part1  1.7T   66G  1.5T   5% /OST16

oss1:~ # rpm -qa | grep -i lustre
kernel-default-2.6.27.39-0.3_lustre.1.8.4
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default

oss2:~ # Filesystem    Size  Used Avail Use% Mounted on
/dev/sdcw3            100G  8.3G   92G   9% /
tmpfs                  12G  1.1M   12G   1% /dev
tmpfs                  12G   88K   12G   1% /dev/shm
/dev/sdcw1           1020M  144M  876M  15% /boot
/dev/sdcw4            170G   13M  170G   1% /data1
/dev/mapper/3600a0b8000755ed10000105e4dc23397_part1  1.7T   69G  1.5T   5% /OST2
/dev/mapper/3600a0b8000755ee00000109b4dc232a0_part1  2.1T   68G  1.9T   4% /OST3
/dev/mapper/3600a0b8000755ed1000010644dc2349f_part1  1.7T   67G  1.5T   5% /OST6
/dev/mapper/3600a0b8000755dbe000010d94dc23873_part1  2.1T   67G  1.9T   4% /OST7
/dev/mapper/3600a0b8000755f1f000011254dc23add_part1  1.7T   66G  1.5T   5% /OST10
/dev/mapper/3600a0b8000755dbe000010e34dc23a09_part1  2.1T   66G  1.9T   4% /OST11
/dev/mapper/3600a0b8000755f220000110d4dc23e36_part1  1.7T   66G  1.5T   5% /OST14
/dev/mapper/3600a0b8000755eb3000011354dc23e39_part1  2.1T   66G  1.9T   4% /OST15
/dev/mapper/3600a0b8000755eb30000113a4dc23ec4_part1  1.4T   66G  1.3T   6% /OST17
[1]+  Done                    df -h

oss2:~ # rpm -qa | grep -i lustre
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4
kernel-default-2.6.27.39-0.3_lustre.1.8.4
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default

mdc1:~ # Filesystem    Size  Used Avail Use% Mounted on
/dev/sde2             100G  5.2G   95G   6% /
tmpfs                  12G  184K   12G   1% /dev
tmpfs                  12G   88K   12G   1% /dev/shm
/dev/sde1            1020M  181M  840M  18% /boot
/dev/sde4             167G  196M  159G   1% /data1
/dev/mapper/3600a0b8000755f22000011134dc23f7e_part1  489G  2.3G  458G   1% /MDC
[1]+  Done                    df -h
mdc1:~ #

mdc1:~ # rpm -qa | grep -i lustre
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
kernel-default-2.6.27.39-0.3_lustre.1.8.4
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4
mdc1:~ #

mdc2:~ # Filesystem    Size  Used Avail Use% Mounted on
/dev/sde3             100G  5.0G   95G   5% /
tmpfs                  18G  184K   18G   1% /dev
tmpfs                 7.8G   88K  7.8G   1% /dev/shm
/dev/sde1            1020M  144M  876M  15% /boot
/dev/sde4             170G  6.6M  170G   1% /data1
[1]+  Done                    df -h

mdc2:~ # rpm -qqa | grep -i lustre
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4
kernel-default-2.6.27.39-0.3_lustre.1.8.4
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
mdc2:~ #

service0:~ # ibstat
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.7.0
        Hardware version: a0
        Node GUID: 0x0002c903000a6028
        System image GUID: 0x0002c903000a602b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 9
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c903000a6029
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 10
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c903000a602a
service0:~ #

service0:~ # ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:    fec0:0000:0000:0000:0002:c903:000a:6029
        base lid:       0x9
        sm lid:         0x1
        state:          4: ACTIVE
        phys state:     5: LinkUp
        rate:           40 Gb/sec (4X QDR)

Infiniband device 'mlx4_0' port 2 status:
        default gid:    fec0:0000:0000:0000:0002:c903:000a:602a
        base lid:       0xa
        sm lid:         0x1
        state:          4: ACTIVE
        phys state:     5: LinkUp
        rate:           40 Gb/sec (4X QDR)

service0:~ #

service0:~ # ibdiagnet
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib64/ibdm1.2
-W- A few ports of local device are up.
    Since port-num was not specified (-p option), port 1 of device 1 will be
    used as the local port.
-I- Discovering ... 88 nodes (9 Switches & 79 CA-s) discovered.

-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- No bad Guids were found

-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found

-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-I- No illegal PM counters values were found

-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------
-I- PKey:0x7fff Hosts:81 full:81 partial:0

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps

-I---------------------------------------------------
-I- Bad Links Info
-I- No bad link were found
-I---------------------------------------------------
----------------------------------------------------------------
-I- Stages Status Report:
    STAGE                           Errors Warnings
    Bad GUIDs/LIDs Check            0      0
    Link State Active Check         0      0
    Performance Counters Report     0      0
    Partitions Check                0      0
    IPoIB Subnets Check             0      1

Please see /tmp/ibdiagnet.log for complete log
----------------------------------------------------------------

-I- Done. Run time was 9 seconds.
service0:~ # service0:~ # ibcheckerrors #warn: counter VL15Dropped = 18584 (threshold 100) lid 1 port 1 Error check on lid 1 (r1lead HCA-1) port 1: FAILED #warn: counter SymbolErrors = 42829 (threshold 10) lid 9 port 1 #warn: counter RcvErrors = 9279 (threshold 10) lid 9 port 1 Error check on lid 9 (service0 HCA-1) port 1: FAILED ## Summary: 88 nodes checked, 0 bad nodes found ## 292 ports checked, 2 ports have errors beyond threshold service0:~ # service0:~ # ibchecknet # Checking Ca: nodeguid 0x0002c903000abfc2 # Checking Ca: nodeguid 0x0002c903000ac00e # Checking Ca: nodeguid 0x0002c903000a69dc # Checking Ca: nodeguid 0x0002c9030009cd46 # Checking Ca: nodeguid 0x003048fffff4d878 # Checking Ca: nodeguid 0x003048fffff4d880 # Checking Ca: nodeguid 0x003048fffff4d87c # Checking Ca: nodeguid 0x003048fffff4d884 # Checking Ca: nodeguid 0x003048fffff4d888 # Checking Ca: nodeguid 0x003048fffff4d88c # Checking Ca: nodeguid 0x003048fffff4d890 # Checking Ca: nodeguid 0x003048fffff4d894 # Checking Ca: nodeguid 0x0002c9020029fa50 #warn: counter VL15Dropped = 18617 (threshold 100) lid 1 port 1 Error check on lid 1 (r1lead HCA-1) port 1: FAILED # Checking Ca: nodeguid 0x0002c90300054eac # Checking Ca: nodeguid 0x0002c9030009cebe # Checking Ca: nodeguid 0x003048fffff4c9f8 # Checking Ca: nodeguid 0x003048fffff4db08 # Checking Ca: nodeguid 0x003048fffff4db40 # Checking Ca: nodeguid 0x003048fffff4db44 # Checking Ca: nodeguid 0x003048fffff4db48 # Checking Ca: nodeguid 0x003048fffff4db4c # Checking Ca: nodeguid 0x003048fffff4db0c # Checking Ca: nodeguid 0x003048fffff4dca0 # Checking Ca: nodeguid 0x0002c903000abfe2 # Checking Ca: nodeguid 0x0002c903000abfe6 # Checking Ca: nodeguid 0x0002c9030009dd28 # Checking Ca: nodeguid 0x003048fffff4db54 # Checking Ca: nodeguid 0x003048fffff4db58 # Checking Ca: nodeguid 0x003048fffff4c9f4 # Checking Ca: nodeguid 0x003048fffff4db50 # Checking Ca: nodeguid 0x003048fffff4db3c # Checking Ca: nodeguid 0x003048fffff4db38 # Checking Ca: nodeguid 0x003048fffff4db14 # Checking Ca: nodeguid 0x003048fffff4db10 # Checking Ca: nodeguid 0x003048fffff4d8a8 # Checking Ca: nodeguid 0x003048fffff4d8ac # Checking Ca: nodeguid 0x003048fffff4d8b4 # Checking Ca: nodeguid 0x003048fffff4d8b0 # Checking Ca: nodeguid 0x003048fffff4db70 # Checking Ca: nodeguid 0x003048fffff4db68 # Checking Ca: nodeguid 0x003048fffff4db64 # Checking Ca: nodeguid 0x003048fffff4db78 # Checking Ca: nodeguid 0x0002c903000a69f0 # Checking Ca: nodeguid 0x0002c9030006004a # Checking Ca: nodeguid 0x0002c9030009dd2c # Checking Ca: nodeguid 0x003048fffff4d8b8 # Checking Ca: nodeguid 0x003048fffff4d8bc # Checking Ca: nodeguid 0x003048fffff4d8a4 # Checking Ca: nodeguid 0x003048fffff4d8a0 # Checking Ca: nodeguid 0x003048fffff4db7c # Checking Ca: nodeguid 0x003048fffff4db80 # Checking Ca: nodeguid 0x003048fffff4db6c # Checking Ca: nodeguid 0x003048fffff4db74 # Checking Ca: nodeguid 0x003048fffff4dcb8 # Checking Ca: nodeguid 0x003048fffff4dcd0 # Checking Ca: nodeguid 0x003048fffff4dc5c # Checking Ca: nodeguid 0x003048fffff4dc60 # Checking Ca: nodeguid 0x003048fffff4dc54 # Checking Ca: nodeguid 0x003048fffff4dc50 # Checking Ca: nodeguid 0x003048fffff4dc4c # Checking Ca: nodeguid 0x003048fffff4dcd4 # Checking Ca: nodeguid 0x0002c903000a6164 # Checking Ca: nodeguid 0x003048fffff4dcf0 # Checking Ca: nodeguid 0x003048fffff4db5c # Checking Ca: nodeguid 0x003048fffff4dc90 # Checking Ca: nodeguid 0x003048fffff4dc8c # Checking Ca: nodeguid 0x003048fffff4dc58 # Checking Ca: nodeguid 0x003048fffff4dc94 # Checking Ca: nodeguid 
0x003048fffff4dc9c
# Checking Ca: nodeguid 0x003048fffff4db60
# Checking Ca: nodeguid 0x003048fffff4d89c
# Checking Ca: nodeguid 0x003048fffff4d898
# Checking Ca: nodeguid 0x003048fffff4dad8
# Checking Ca: nodeguid 0x003048fffff4dadc
# Checking Ca: nodeguid 0x003048fffff4db30
# Checking Ca: nodeguid 0x003048fffff4db34
# Checking Ca: nodeguid 0x003048fffff4d874
# Checking Ca: nodeguid 0x003048fffff4d870
# Checking Ca: nodeguid 0x0002c903000a6028
#warn: counter SymbolErrors = 44150 (threshold 10) lid 9 port 1
#warn: counter RcvErrors = 9283 (threshold 10) lid 9 port 1
Error check on lid 9 (service0 HCA-1) port 1: FAILED

## Summary: 88 nodes checked, 0 bad nodes found
## 292 ports checked, 0 bad ports found
## 2 ports have errors beyond threshold

service0:~ # ibcheckstate
## Summary: 88 nodes checked, 0 bad nodes found
## 292 ports checked, 0 ports with bad state found

service0:~ # ibcheckwidth
## Summary: 88 nodes checked, 0 bad nodes found
## 292 ports checked, 0 ports with 1x width in error found
service0:~ #

Thanks and Regards
Ashok

On 30 September 2011 12:39, Brian O'Connor <briano at sgi.com> wrote:
> Hello Ashok
> [...]
--
Ashok Nulguda
TATA ELXSI LTD
Mb: +91 9689945767
Email: ashokn at tataelxsi.co.in
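One caveat when reading the ibcheckerrors output above: the port error counters are cumulative since they were last cleared, so old history and live faults look the same. A minimal sketch of separating the two, assuming infiniband-diags (ibclearerrors is a helper script that may or may not be shipped with your version; perfquery -R per port achieves the same thing):

    ibclearerrors            # clear error counters fabric-wide, if the script is available
    perfquery -R 9 1         # or reset just the suspect port (LID 9, port 1 from the output above)
    perfquery -R 1 1         # and LID 1, port 1 (r1lead)
    # ... run the normal workload for a while ...
    ibcheckerrors            # see whether SymbolErrors/RcvErrors are climbing again

If the counters on those ports keep incrementing, that points at a live link problem (cable, connector or HCA) rather than stale history.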
Hi Ashok,

If you have a valid support contract, log a call with your local SGI office; you have a couple of bad IB ports, maybe a cable or some other such thing. Include the information you provided below and ask them to help out.

On 30-September-2011 6:37 PM, Ashok nulguda wrote:
> Dear Sir,
>
> Thanks for your help.
> [...]

--
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: briano at sgi.com, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA http://www.sgi.com/support/services
-------------------------------------------------
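For what it is worth, a sketch of how one might map a flaky port to a physical cable before calling it in, assuming a reasonably recent infiniband-diags (the first tool is named iblinkinfo.pl on some older installs):

    iblinkinfo               # list every link with both endpoints, plus width/speed
    ibtracert 9 1            # trace the hop-by-hop path from LID 9 (service0) to LID 1 (r1lead)

The iblinkinfo output shows which switch and switch port each HCA port is cabled to, so the entries for LID 1 and LID 9 identify the exact cables to reseat or swap; after that, clearing the counters and watching them again (as above) tells you whether the fix took.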
Hi, Looks like connection timeout, likely temporary as it appears to have reconnected and recovered without any problems. What other issue are you experiencing? -cf On 09/29/2011 10:39 PM, Ashok nulguda wrote:> Dear All, > > I am having lustre error on my HPC as given below.Please any one can > help me to resolve this problem. > Thanks in Advance. > Sep 30 08:40:23 service0 kernel: [343138.837222] Lustre: > 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1 previous > similar message > Sep 30 08:40:23 service0 kernel: [343138.837233] Lustre: > lustre-OST0008-osc-ffff880b272cf800: Connection to service > lustre-OST0008 via nid 10.148.0.106 at o2ib was lost; in progress > operations using this service will wait for recovery to complete. > Sep 30 08:40:24 service0 kernel: [343139.837260] Lustre: > 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request > x1380984193067288 sent from lustre-OST0006-osc-ffff880b272cf800 to NID > 10.148.0.106 at o2ib 7s ago has timed out (7s prior to deadline). > Sep 30 08:40:24 service0 kernel: [343139.837263] > req at ffff880a5f800c00 x1380984193067288/t0 > o3->lustre-OST0006_UUID at 10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl > 1317352224 ref 2 fl Rpc:/0/0 rc 0/0 > Sep 30 08:40:24 service0 kernel: [343139.837269] Lustre: > 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 38 previous > similar messages > Sep 30 08:40:24 service0 kernel: [343140.129284] LustreError: > 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from > cancel RPC: canceling anyway > Sep 30 08:40:24 service0 kernel: [343140.129290] LustreError: > 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous > similar message > Sep 30 08:40:24 service0 kernel: [343140.129295] LustreError: > 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) > ldlm_cli_cancel_list: -11 > Sep 30 08:40:24 service0 kernel: [343140.129299] LustreError: > 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 previous > similar message > Sep 30 08:40:25 service0 kernel: [343140.837308] Lustre: > 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request > x1380984193067299 sent from lustre-OST0010-osc-ffff880b272cf800 to NID > 10.148.0.106 at o2ib 7s ago has timed out (7s prior to deadline). > Sep 30 08:40:25 service0 kernel: [343140.837311] > req at ffff880a557c4400 x1380984193067299/t0 > o3->lustre-OST0010_UUID at 10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl > 1317352225 ref 2 fl Rpc:/0/0 rc 0/0 > Sep 30 08:40:25 service0 kernel: [343140.837316] Lustre: > 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 4 previous > similar messages > Sep 30 08:40:26 service0 kernel: [343141.245365] LustreError: > 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from > cancel RPC: canceling anyway > Sep 30 08:40:26 service0 kernel: [343141.245371] LustreError: > 22729:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) > ldlm_cli_cancel_list: -11 > Sep 30 08:40:26 service0 kernel: [343141.245378] LustreError: > 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous > similar message > Sep 30 08:40:33 service0 kernel: [343148.245683] Lustre: > 22725:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request > x1380984193067302 sent from lustre-OST0004-osc-ffff880b272cf800 to NID > 10.148.0.106 at o2ib 14s ago has timed out (14s prior to deadline). 
> Sep 30 08:40:33 service0 kernel: [343148.245686] > req at ffff8805c879e800 x1380984193067302/t0 > o103->lustre-OST0004_UUID at 10.148.0.106@o2ib:17/18 lens 296/384 e 0 to > 1 dl 1317352233 ref 1 fl Rpc:N/0/0 rc 0/0 > Sep 30 08:40:33 service0 kernel: [343148.245692] Lustre: > 22725:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 2 previous > similar messages > Sep 30 08:40:33 service0 kernel: [343148.245708] LustreError: > 22725:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from > cancel RPC: canceling anyway > Sep 30 08:40:33 service0 kernel: [343148.245714] LustreError: > 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) > ldlm_cli_cancel_list: -11 > Sep 30 08:40:33 service0 kernel: [343148.245717] LustreError: > 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 > previous similar message > Sep 30 08:40:36 service0 kernel: [343151.548005] LustreError: 11-0: an > error occurred while communicating with 10.148.0.106 at o2ib. The > ost_connect operation failed with -16 > Sep 30 08:40:36 service0 kernel: [343151.548008] LustreError: Skipped > 1 previous similar message > Sep 30 08:40:36 service0 kernel: [343151.548024] LustreError: 167-0: > This client was evicted by lustre-OST000b; in progress operations > using this service will fail. > Sep 30 08:40:36 service0 kernel: [343151.548250] LustreError: > 30452:0:(llite_mmap.c:210:ll_tree_unlock()) couldn''t unlock -5 > Sep 30 08:40:36 service0 kernel: [343151.550210] LustreError: > 8300:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID > req at ffff88049528c400 x1380984193067406/t0 > o3->lustre-OST000b_UUID at 10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl > 0 ref 2 fl Rpc:/0/0 rc 0/0 > Sep 30 08:40:36 service0 kernel: [343151.594742] Lustre: > lustre-OST0000-osc-ffff880b272cf800: Connection restored to service > lustre-OST0000 using nid 10.148.0.106 at o2ib. > Sep 30 08:40:36 service0 kernel: [343151.837203] Lustre: > lustre-OST0006-osc-ffff880b272cf800: Connection restored to service > lustre-OST0006 using nid 10.148.0.106 at o2ib. > Sep 30 08:40:37 service0 kernel: [343152.842631] Lustre: > lustre-OST0003-osc-ffff880b272cf800: Connection restored to service > lustre-OST0003 using nid 10.148.0.106 at o2ib. > Sep 30 08:40:37 service0 kernel: [343152.842636] Lustre: Skipped 3 > previous similar messages > > > Thanks and Regards > Ashok > > -- > *Ashok Nulguda > * > *TATA ELXSI LTD* > *Mb : +91 9689945767 > * > *Email :ashokn at tataelxsi.co.in <mailto:tshrikant at tataelxsi.co.in>* > > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss______________________________________________________________________ This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it. Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses. 
Is adaptive timeout enabled? (On the MGS/MDS: lctl get_param at_max)

Quentin Bouyer
System Engineer | SGI France
+33 6 80 36 49 64
qbouyer at sgi.com
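For reference, a rough sketch of how the adaptive timeout settings can be inspected and raised; the numbers below are only illustrative, and the fsname "lustre" in the last command is taken from the log, so adjust both for your site:

    # Current settings (on the MGS/MDS; also worth checking on a client).
    # at_max=0 means adaptive timeouts are disabled and the static
    # obd "timeout" value is used instead.
    lctl get_param at_max at_min at_history timeout

    # Raise the adaptive timeout ceiling on this node only (not persistent)
    lctl set_param at_max=1200

    # Make it persistent for the whole filesystem; run on the MGS
    lctl conf_param lustre.sys.at_max=1200

Raising at_max only buys more headroom, though; if the client keeps getting evicted as in the log above, the OSS and the IB fabric are still worth investigating.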