Dear All,

I am seeing Lustre errors on my HPC cluster, as given below. Can anyone help me to resolve this problem? Thanks in advance.

Sep 30 08:40:23 service0 kernel: [343138.837222] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 30 08:40:23 service0 kernel: [343138.837233] Lustre: lustre-OST0008-osc-ffff880b272cf800: Connection to service lustre-OST0008 via nid 10.148.0.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 30 08:40:24 service0 kernel: [343139.837260] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380984193067288 sent from lustre-OST0006-osc-ffff880b272cf800 to NID 10.148.0.106@o2ib 7s ago has timed out (7s prior to deadline).
Sep 30 08:40:24 service0 kernel: [343139.837263] req@ffff880a5f800c00 x1380984193067288/t0 o3->lustre-OST0006_UUID@10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl 1317352224 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:24 service0 kernel: [343139.837269] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
Sep 30 08:40:24 service0 kernel: [343140.129284] LustreError: 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Sep 30 08:40:24 service0 kernel: [343140.129290] LustreError: 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous similar message
Sep 30 08:40:24 service0 kernel: [343140.129295] LustreError: 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Sep 30 08:40:24 service0 kernel: [343140.129299] LustreError: 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Sep 30 08:40:25 service0 kernel: [343140.837308] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380984193067299 sent from lustre-OST0010-osc-ffff880b272cf800 to NID 10.148.0.106@o2ib 7s ago has timed out (7s prior to deadline).
Sep 30 08:40:25 service0 kernel: [343140.837311] req@ffff880a557c4400 x1380984193067299/t0 o3->lustre-OST0010_UUID@10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl 1317352225 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:25 service0 kernel: [343140.837316] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Sep 30 08:40:26 service0 kernel: [343141.245365] LustreError: 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Sep 30 08:40:26 service0 kernel: [343141.245371] LustreError: 22729:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Sep 30 08:40:26 service0 kernel: [343141.245378] LustreError: 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous similar message
Sep 30 08:40:33 service0 kernel: [343148.245683] Lustre: 22725:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380984193067302 sent from lustre-OST0004-osc-ffff880b272cf800 to NID 10.148.0.106@o2ib 14s ago has timed out (14s prior to deadline).
Sep 30 08:40:33 service0 kernel: [343148.245686] req@ffff8805c879e800 x1380984193067302/t0 o103->lustre-OST0004_UUID@10.148.0.106@o2ib:17/18 lens 296/384 e 0 to 1 dl 1317352233 ref 1 fl Rpc:N/0/0 rc 0/0
Sep 30 08:40:33 service0 kernel: [343148.245692] Lustre: 22725:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 30 08:40:33 service0 kernel: [343148.245708] LustreError: 22725:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Sep 30 08:40:33 service0 kernel: [343148.245714] LustreError: 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Sep 30 08:40:33 service0 kernel: [343148.245717] LustreError: 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Sep 30 08:40:36 service0 kernel: [343151.548005] LustreError: 11-0: an error occurred while communicating with 10.148.0.106@o2ib. The ost_connect operation failed with -16
Sep 30 08:40:36 service0 kernel: [343151.548008] LustreError: Skipped 1 previous similar message
Sep 30 08:40:36 service0 kernel: [343151.548024] LustreError: 167-0: This client was evicted by lustre-OST000b; in progress operations using this service will fail.
Sep 30 08:40:36 service0 kernel: [343151.548250] LustreError: 30452:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Sep 30 08:40:36 service0 kernel: [343151.550210] LustreError: 8300:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88049528c400 x1380984193067406/t0 o3->lustre-OST000b_UUID@10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:36 service0 kernel: [343151.594742] Lustre: lustre-OST0000-osc-ffff880b272cf800: Connection restored to service lustre-OST0000 using nid 10.148.0.106@o2ib.
Sep 30 08:40:36 service0 kernel: [343151.837203] Lustre: lustre-OST0006-osc-ffff880b272cf800: Connection restored to service lustre-OST0006 using nid 10.148.0.106@o2ib.
Sep 30 08:40:37 service0 kernel: [343152.842631] Lustre: lustre-OST0003-osc-ffff880b272cf800: Connection restored to service lustre-OST0003 using nid 10.148.0.106@o2ib.
Sep 30 08:40:37 service0 kernel: [343152.842636] Lustre: Skipped 3 previous similar messages

Thanks and Regards
Ashok

--
Ashok Nulguda
TATA ELXSI LTD
Mb: +91 9689945767
Email: ashokn at tataelxsi.co.in
Hello Ashok,

Is the cluster hanging or otherwise behaving badly? The logs below show that the client lost its connection to 10.148.0.106 for 10 seconds or so; it should have recovered OK.

If you want further help from the list, you need to add more detail about the cluster: a general description of the number of OSSs/OSTs, the clients, the Lustre version, etc., and a description of what is actually going wrong (e.g. hanging, offline).

The first thing is to check the infrastructure; in this case you should check your IB network for errors.

On 30-September-2011 2:39 PM, Ashok nulguda wrote:
> Dear All,
>
> I am seeing Lustre errors on my HPC cluster, as given below. Can anyone help me to resolve this problem?
> [...]

--
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: briano at sgi.com, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA http://www.sgi.com/support/services
-------------------------------------------------
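As a rough illustration of the kind of IB checks suggested above (a minimal sketch, assuming the standard OFED infiniband-diags tools are installed on a node attached to the fabric; option syntax can vary between OFED releases):

    ibstat                   # local HCA state, link rate and LIDs
    ibdiagnet                # fabric-wide discovery and error report
    ibcheckerrors            # flag ports whose error counters exceed thresholds
    perfquery <lid> <port>   # raw PMA counters for one suspect LID/port

Here <lid> and <port> are placeholders for the LID and port number of the link you want to look at more closely.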
Dear Sir,

Thanks for your help. My system is an ICE 8400 cluster with 64 nodes and 30 TB of Lustre.

oss1:~ # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             100G  5.8G   95G   6% /
tmpfs                  12G  1.1M   12G   1% /dev
tmpfs                  12G   88K   12G   1% /dev/shm
/dev/sda1            1020M  181M  840M  18% /boot
/dev/sda4             170G  6.6M  170G   1% /data1
/dev/mapper/3600a0b8000755ee0000010964dc231bc_part1  2.1T   74G  1.9T   4% /OST1
/dev/mapper/3600a0b8000755ed1000010614dc23425_part1  1.7T   67G  1.5T   5% /OST4
/dev/mapper/3600a0b8000755ee0000010a04dc23323_part1  2.1T   67G  1.9T   4% /OST5
/dev/mapper/3600a0b8000755f1f000011224dc239d7_part1  1.7T   67G  1.5T   5% /OST8
/dev/mapper/3600a0b8000755dbe000010de4dc23997_part1  2.1T   66G  1.9T   4% /OST9
/dev/mapper/3600a0b8000755f1f000011284dc23b5a_part1  1.7T   66G  1.5T   5% /OST12
/dev/mapper/3600a0b8000755eb3000011304dc23db1_part1  2.1T   66G  1.9T   4% /OST13
/dev/mapper/3600a0b8000755f22000011104dc23ec7_part1  1.7T   66G  1.5T   5% /OST16

oss1:~ # rpm -qa | grep -i lustre
kernel-default-2.6.27.39-0.3_lustre.1.8.4
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default

oss2:~ # Filesystem    Size  Used Avail Use% Mounted on
/dev/sdcw3            100G  8.3G   92G   9% /
tmpfs                  12G  1.1M   12G   1% /dev
tmpfs                  12G   88K   12G   1% /dev/shm
/dev/sdcw1           1020M  144M  876M  15% /boot
/dev/sdcw4            170G   13M  170G   1% /data1
/dev/mapper/3600a0b8000755ed10000105e4dc23397_part1  1.7T   69G  1.5T   5% /OST2
/dev/mapper/3600a0b8000755ee00000109b4dc232a0_part1  2.1T   68G  1.9T   4% /OST3
/dev/mapper/3600a0b8000755ed1000010644dc2349f_part1  1.7T   67G  1.5T   5% /OST6
/dev/mapper/3600a0b8000755dbe000010d94dc23873_part1  2.1T   67G  1.9T   4% /OST7
/dev/mapper/3600a0b8000755f1f000011254dc23add_part1  1.7T   66G  1.5T   5% /OST10
/dev/mapper/3600a0b8000755dbe000010e34dc23a09_part1  2.1T   66G  1.9T   4% /OST11
/dev/mapper/3600a0b8000755f220000110d4dc23e36_part1  1.7T   66G  1.5T   5% /OST14
/dev/mapper/3600a0b8000755eb3000011354dc23e39_part1  2.1T   66G  1.9T   4% /OST15
/dev/mapper/3600a0b8000755eb30000113a4dc23ec4_part1  1.4T   66G  1.3T   6% /OST17
[1]+  Done                    df -h

oss2:~ # rpm -qa | grep -i lustre
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4
kernel-default-2.6.27.39-0.3_lustre.1.8.4
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default

mdc1:~ # Filesystem    Size  Used Avail Use% Mounted on
/dev/sde2             100G  5.2G   95G   6% /
tmpfs                  12G  184K   12G   1% /dev
tmpfs                  12G   88K   12G   1% /dev/shm
/dev/sde1            1020M  181M  840M  18% /boot
/dev/sde4             167G  196M  159G   1% /data1
/dev/mapper/3600a0b8000755f22000011134dc23f7e_part1  489G  2.3G  458G   1% /MDC
[1]+  Done                    df -h
mdc1:~ #

mdc1:~ # rpm -qa | grep -i lustre
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
kernel-default-2.6.27.39-0.3_lustre.1.8.4
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4
mdc1:~ #

mdc2:~ # Filesystem    Size  Used Avail Use% Mounted on
/dev/sde3             100G  5.0G   95G   5% /
tmpfs                  18G  184K   18G   1% /dev
tmpfs                 7.8G   88K  7.8G   1% /dev/shm
/dev/sde1            1020M  144M  876M  15% /boot
/dev/sde4             170G  6.6M  170G   1% /data1
[1]+  Done                    df -h

mdc2:~ # rpm -qqa | grep -i lustre
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4
kernel-default-2.6.27.39-0.3_lustre.1.8.4
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
mdc2:~ #

service0:~ # ibstat
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.7.0
        Hardware version: a0
        Node GUID: 0x0002c903000a6028
        System image GUID: 0x0002c903000a602b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 9
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c903000a6029
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 10
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c903000a602a
service0:~ #

service0:~ # ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:    fec0:0000:0000:0000:0002:c903:000a:6029
        base lid:       0x9
        sm lid:         0x1
        state:          4: ACTIVE
        phys state:     5: LinkUp
        rate:           40 Gb/sec (4X QDR)

Infiniband device 'mlx4_0' port 2 status:
        default gid:    fec0:0000:0000:0000:0002:c903:000a:602a
        base lid:       0xa
        sm lid:         0x1
        state:          4: ACTIVE
        phys state:     5: LinkUp
        rate:           40 Gb/sec (4X QDR)

service0:~ #

service0:~ # ibdiagnet
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib64/ibdm1.2
-W- A few ports of local device are up.
    Since port-num was not specified (-p option), port 1 of device 1 will be
    used as the local port.
-I- Discovering ... 88 nodes (9 Switches & 79 CA-s) discovered.

-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- No bad Guids were found

-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found

-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-I- No illegal PM counters values were found

-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------
-I- PKey:0x7fff Hosts:81 full:81 partial:0

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps

-I---------------------------------------------------
-I- Bad Links Info
-I- No bad link were found
-I---------------------------------------------------
----------------------------------------------------------------
-I- Stages Status Report:
    STAGE                           Errors Warnings
    Bad GUIDs/LIDs Check            0      0
    Link State Active Check         0      0
    Performance Counters Report     0      0
    Partitions Check                0      0
    IPoIB Subnets Check             0      1

Please see /tmp/ibdiagnet.log for complete log
----------------------------------------------------------------

-I- Done. Run time was 9 seconds.
service0:~ # service0:~ # ibcheckerrors #warn: counter VL15Dropped = 18584 (threshold 100) lid 1 port 1 Error check on lid 1 (r1lead HCA-1) port 1: FAILED #warn: counter SymbolErrors = 42829 (threshold 10) lid 9 port 1 #warn: counter RcvErrors = 9279 (threshold 10) lid 9 port 1 Error check on lid 9 (service0 HCA-1) port 1: FAILED ## Summary: 88 nodes checked, 0 bad nodes found ## 292 ports checked, 2 ports have errors beyond threshold service0:~ # service0:~ # ibchecknet # Checking Ca: nodeguid 0x0002c903000abfc2 # Checking Ca: nodeguid 0x0002c903000ac00e # Checking Ca: nodeguid 0x0002c903000a69dc # Checking Ca: nodeguid 0x0002c9030009cd46 # Checking Ca: nodeguid 0x003048fffff4d878 # Checking Ca: nodeguid 0x003048fffff4d880 # Checking Ca: nodeguid 0x003048fffff4d87c # Checking Ca: nodeguid 0x003048fffff4d884 # Checking Ca: nodeguid 0x003048fffff4d888 # Checking Ca: nodeguid 0x003048fffff4d88c # Checking Ca: nodeguid 0x003048fffff4d890 # Checking Ca: nodeguid 0x003048fffff4d894 # Checking Ca: nodeguid 0x0002c9020029fa50 #warn: counter VL15Dropped = 18617 (threshold 100) lid 1 port 1 Error check on lid 1 (r1lead HCA-1) port 1: FAILED # Checking Ca: nodeguid 0x0002c90300054eac # Checking Ca: nodeguid 0x0002c9030009cebe # Checking Ca: nodeguid 0x003048fffff4c9f8 # Checking Ca: nodeguid 0x003048fffff4db08 # Checking Ca: nodeguid 0x003048fffff4db40 # Checking Ca: nodeguid 0x003048fffff4db44 # Checking Ca: nodeguid 0x003048fffff4db48 # Checking Ca: nodeguid 0x003048fffff4db4c # Checking Ca: nodeguid 0x003048fffff4db0c # Checking Ca: nodeguid 0x003048fffff4dca0 # Checking Ca: nodeguid 0x0002c903000abfe2 # Checking Ca: nodeguid 0x0002c903000abfe6 # Checking Ca: nodeguid 0x0002c9030009dd28 # Checking Ca: nodeguid 0x003048fffff4db54 # Checking Ca: nodeguid 0x003048fffff4db58 # Checking Ca: nodeguid 0x003048fffff4c9f4 # Checking Ca: nodeguid 0x003048fffff4db50 # Checking Ca: nodeguid 0x003048fffff4db3c # Checking Ca: nodeguid 0x003048fffff4db38 # Checking Ca: nodeguid 0x003048fffff4db14 # Checking Ca: nodeguid 0x003048fffff4db10 # Checking Ca: nodeguid 0x003048fffff4d8a8 # Checking Ca: nodeguid 0x003048fffff4d8ac # Checking Ca: nodeguid 0x003048fffff4d8b4 # Checking Ca: nodeguid 0x003048fffff4d8b0 # Checking Ca: nodeguid 0x003048fffff4db70 # Checking Ca: nodeguid 0x003048fffff4db68 # Checking Ca: nodeguid 0x003048fffff4db64 # Checking Ca: nodeguid 0x003048fffff4db78 # Checking Ca: nodeguid 0x0002c903000a69f0 # Checking Ca: nodeguid 0x0002c9030006004a # Checking Ca: nodeguid 0x0002c9030009dd2c # Checking Ca: nodeguid 0x003048fffff4d8b8 # Checking Ca: nodeguid 0x003048fffff4d8bc # Checking Ca: nodeguid 0x003048fffff4d8a4 # Checking Ca: nodeguid 0x003048fffff4d8a0 # Checking Ca: nodeguid 0x003048fffff4db7c # Checking Ca: nodeguid 0x003048fffff4db80 # Checking Ca: nodeguid 0x003048fffff4db6c # Checking Ca: nodeguid 0x003048fffff4db74 # Checking Ca: nodeguid 0x003048fffff4dcb8 # Checking Ca: nodeguid 0x003048fffff4dcd0 # Checking Ca: nodeguid 0x003048fffff4dc5c # Checking Ca: nodeguid 0x003048fffff4dc60 # Checking Ca: nodeguid 0x003048fffff4dc54 # Checking Ca: nodeguid 0x003048fffff4dc50 # Checking Ca: nodeguid 0x003048fffff4dc4c # Checking Ca: nodeguid 0x003048fffff4dcd4 # Checking Ca: nodeguid 0x0002c903000a6164 # Checking Ca: nodeguid 0x003048fffff4dcf0 # Checking Ca: nodeguid 0x003048fffff4db5c # Checking Ca: nodeguid 0x003048fffff4dc90 # Checking Ca: nodeguid 0x003048fffff4dc8c # Checking Ca: nodeguid 0x003048fffff4dc58 # Checking Ca: nodeguid 0x003048fffff4dc94 # Checking Ca: nodeguid 
0x003048fffff4dc9c
# Checking Ca: nodeguid 0x003048fffff4db60
# Checking Ca: nodeguid 0x003048fffff4d89c
# Checking Ca: nodeguid 0x003048fffff4d898
# Checking Ca: nodeguid 0x003048fffff4dad8
# Checking Ca: nodeguid 0x003048fffff4dadc
# Checking Ca: nodeguid 0x003048fffff4db30
# Checking Ca: nodeguid 0x003048fffff4db34
# Checking Ca: nodeguid 0x003048fffff4d874
# Checking Ca: nodeguid 0x003048fffff4d870
# Checking Ca: nodeguid 0x0002c903000a6028
#warn: counter SymbolErrors = 44150 (threshold 10) lid 9 port 1
#warn: counter RcvErrors = 9283 (threshold 10) lid 9 port 1
Error check on lid 9 (service0 HCA-1) port 1: FAILED

## Summary: 88 nodes checked, 0 bad nodes found
## 292 ports checked, 0 bad ports found
## 2 ports have errors beyond threshold

service0:~ # ibcheckstate
## Summary: 88 nodes checked, 0 bad nodes found
## 292 ports checked, 0 ports with bad state found

service0:~ # ibcheckwidth
## Summary: 88 nodes checked, 0 bad nodes found
## 292 ports checked, 0 ports with 1x width in error found
service0:~ #

Thanks and Regards
Ashok

On 30 September 2011 12:39, Brian O'Connor <briano at sgi.com> wrote:
> Hello Ashok
> [...]
--
Ashok Nulguda
TATA ELXSI LTD
Mb: +91 9689945767
Email: ashokn at tataelxsi.co.in
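One caveat when reading the ibcheckerrors output above: the port error counters are cumulative since they were last cleared, so old history and live faults look the same. A minimal sketch of separating the two, assuming infiniband-diags (ibclearerrors is a helper script that may or may not be shipped with your version; perfquery -R per port achieves the same thing):

    ibclearerrors            # clear error counters fabric-wide, if the script is available
    perfquery -R 9 1         # or reset just the suspect port (LID 9, port 1 from the output above)
    perfquery -R 1 1         # and LID 1, port 1 (r1lead)
    # ... run the normal workload for a while ...
    ibcheckerrors            # see whether SymbolErrors/RcvErrors are climbing again

If the counters on those ports keep incrementing, that points at a live link problem (cable, connector or HCA) rather than stale history.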
Hi Ashok,

If you have a valid support contract, log a call with your local SGI office; you have a couple of bad IB ports, maybe a cable or some other such thing. Include the information you provided below and ask them to help out.

On 30-September-2011 6:37 PM, Ashok nulguda wrote:
> Dear Sir,
>
> Thanks for your help.
> [...]

--
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: briano at sgi.com, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA http://www.sgi.com/support/services
-------------------------------------------------
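For what it is worth, a sketch of how one might map a flaky port to a physical cable before calling it in, assuming a reasonably recent infiniband-diags (the first tool is named iblinkinfo.pl on some older installs):

    iblinkinfo               # list every link with both endpoints, plus width/speed
    ibtracert 9 1            # trace the hop-by-hop path from LID 9 (service0) to LID 1 (r1lead)

The iblinkinfo output shows which switch and switch port each HCA port is cabled to, so the entries for LID 1 and LID 9 identify the exact cables to reseat or swap; after that, clearing the counters and watching them again (as above) tells you whether the fix took.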
Hi, Looks like connection timeout, likely temporary as it appears to have reconnected and recovered without any problems. What other issue are you experiencing? -cf On 09/29/2011 10:39 PM, Ashok nulguda wrote:> Dear All, > > I am having lustre error on my HPC as given below.Please any one can > help me to resolve this problem. > Thanks in Advance. > Sep 30 08:40:23 service0 kernel: [343138.837222] Lustre: > 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1 previous > similar message > Sep 30 08:40:23 service0 kernel: [343138.837233] Lustre: > lustre-OST0008-osc-ffff880b272cf800: Connection to service > lustre-OST0008 via nid 10.148.0.106 at o2ib was lost; in progress > operations using this service will wait for recovery to complete. > Sep 30 08:40:24 service0 kernel: [343139.837260] Lustre: > 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request > x1380984193067288 sent from lustre-OST0006-osc-ffff880b272cf800 to NID > 10.148.0.106 at o2ib 7s ago has timed out (7s prior to deadline). > Sep 30 08:40:24 service0 kernel: [343139.837263] > req at ffff880a5f800c00 x1380984193067288/t0 > o3->lustre-OST0006_UUID at 10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl > 1317352224 ref 2 fl Rpc:/0/0 rc 0/0 > Sep 30 08:40:24 service0 kernel: [343139.837269] Lustre: > 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 38 previous > similar messages > Sep 30 08:40:24 service0 kernel: [343140.129284] LustreError: > 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from > cancel RPC: canceling anyway > Sep 30 08:40:24 service0 kernel: [343140.129290] LustreError: > 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous > similar message > Sep 30 08:40:24 service0 kernel: [343140.129295] LustreError: > 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) > ldlm_cli_cancel_list: -11 > Sep 30 08:40:24 service0 kernel: [343140.129299] LustreError: > 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 previous > similar message > Sep 30 08:40:25 service0 kernel: [343140.837308] Lustre: > 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request > x1380984193067299 sent from lustre-OST0010-osc-ffff880b272cf800 to NID > 10.148.0.106 at o2ib 7s ago has timed out (7s prior to deadline). > Sep 30 08:40:25 service0 kernel: [343140.837311] > req at ffff880a557c4400 x1380984193067299/t0 > o3->lustre-OST0010_UUID at 10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl > 1317352225 ref 2 fl Rpc:/0/0 rc 0/0 > Sep 30 08:40:25 service0 kernel: [343140.837316] Lustre: > 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 4 previous > similar messages > Sep 30 08:40:26 service0 kernel: [343141.245365] LustreError: > 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from > cancel RPC: canceling anyway > Sep 30 08:40:26 service0 kernel: [343141.245371] LustreError: > 22729:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) > ldlm_cli_cancel_list: -11 > Sep 30 08:40:26 service0 kernel: [343141.245378] LustreError: > 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous > similar message > Sep 30 08:40:33 service0 kernel: [343148.245683] Lustre: > 22725:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request > x1380984193067302 sent from lustre-OST0004-osc-ffff880b272cf800 to NID > 10.148.0.106 at o2ib 14s ago has timed out (14s prior to deadline). 
> Sep 30 08:40:33 service0 kernel: [343148.245686] > req at ffff8805c879e800 x1380984193067302/t0 > o103->lustre-OST0004_UUID at 10.148.0.106@o2ib:17/18 lens 296/384 e 0 to > 1 dl 1317352233 ref 1 fl Rpc:N/0/0 rc 0/0 > Sep 30 08:40:33 service0 kernel: [343148.245692] Lustre: > 22725:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 2 previous > similar messages > Sep 30 08:40:33 service0 kernel: [343148.245708] LustreError: > 22725:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from > cancel RPC: canceling anyway > Sep 30 08:40:33 service0 kernel: [343148.245714] LustreError: > 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) > ldlm_cli_cancel_list: -11 > Sep 30 08:40:33 service0 kernel: [343148.245717] LustreError: > 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 > previous similar message > Sep 30 08:40:36 service0 kernel: [343151.548005] LustreError: 11-0: an > error occurred while communicating with 10.148.0.106 at o2ib. The > ost_connect operation failed with -16 > Sep 30 08:40:36 service0 kernel: [343151.548008] LustreError: Skipped > 1 previous similar message > Sep 30 08:40:36 service0 kernel: [343151.548024] LustreError: 167-0: > This client was evicted by lustre-OST000b; in progress operations > using this service will fail. > Sep 30 08:40:36 service0 kernel: [343151.548250] LustreError: > 30452:0:(llite_mmap.c:210:ll_tree_unlock()) couldn''t unlock -5 > Sep 30 08:40:36 service0 kernel: [343151.550210] LustreError: > 8300:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID > req at ffff88049528c400 x1380984193067406/t0 > o3->lustre-OST000b_UUID at 10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl > 0 ref 2 fl Rpc:/0/0 rc 0/0 > Sep 30 08:40:36 service0 kernel: [343151.594742] Lustre: > lustre-OST0000-osc-ffff880b272cf800: Connection restored to service > lustre-OST0000 using nid 10.148.0.106 at o2ib. > Sep 30 08:40:36 service0 kernel: [343151.837203] Lustre: > lustre-OST0006-osc-ffff880b272cf800: Connection restored to service > lustre-OST0006 using nid 10.148.0.106 at o2ib. > Sep 30 08:40:37 service0 kernel: [343152.842631] Lustre: > lustre-OST0003-osc-ffff880b272cf800: Connection restored to service > lustre-OST0003 using nid 10.148.0.106 at o2ib. > Sep 30 08:40:37 service0 kernel: [343152.842636] Lustre: Skipped 3 > previous similar messages > > > Thanks and Regards > Ashok > > -- > *Ashok Nulguda > * > *TATA ELXSI LTD* > *Mb : +91 9689945767 > * > *Email :ashokn at tataelxsi.co.in <mailto:tshrikant at tataelxsi.co.in>* > > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss______________________________________________________________________ This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it. Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses. 
Is adaptive timeout enabled? (On the MGS/MDS: lctl get_param at_max)

Quentin Bouyer
System Engineer | SGI France
+33 6 80 36 49 64
qbouyer at sgi.com
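For reference, a rough sketch of how the adaptive timeout settings can be inspected and raised; the numbers below are only illustrative, and the fsname "lustre" in the last command is taken from the log, so adjust both for your site:

    # Current settings (on the MGS/MDS; also worth checking on a client).
    # at_max=0 means adaptive timeouts are disabled and the static
    # obd "timeout" value is used instead.
    lctl get_param at_max at_min at_history timeout

    # Raise the adaptive timeout ceiling on this node only (not persistent)
    lctl set_param at_max=1200

    # Make it persistent for the whole filesystem; run on the MGS
    lctl conf_param lustre.sys.at_max=1200

Raising at_max only buys more headroom, though; if the client keeps getting evicted as in the log above, the OSS and the IB fabric are still worth investigating.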