Hi,

I have a small problem, but it is certainly due to my limited knowledge of the subject. I have a Lustre file system with one MGS/MDS node, two OSS nodes and one client. I start copying a large file onto Lustre and, while the copy is in progress, I reboot the OSS node that is handling the writes to the file system. The copy process stalls and, when the OSS node comes back up, I expected the copy to resume normally; instead it fails. This is a log from the MGS node:

May 19 13:43:43 mdt01prdpom kernel: Lustre: 3827:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230433 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 17s ago has timed out (17s prior to deadline).
May 19 13:43:43 mdt01prdpom kernel: req@ffff81012e11e400 x1336168048230433/t0 o400->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 192/384 e 0 to 1 dl 1274269423 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:43:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100.121@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 19 13:44:09 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230435 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 26s ago has timed out (26s prior to deadline).
May 19 13:44:09 mdt01prdpom kernel: req@ffff81012e5f2000 x1336168048230435/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269449 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 2s
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100.121@tcp: -113
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@ffff81012d3e5800 x1336168048230437/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 2 fl Rpc:N/0/0 rc 0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230437 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 0s ago has failed due to network error (27s prior to deadline).
May 19 13:44:37 mdt01prdpom kernel: req@ffff81012d3e5800 x1336168048230437/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 3s
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100.121@tcp: -113
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@ffff81012e11e400 x1336168048230441/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 2 fl Rpc:N/0/0 rc 0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230441 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 0s ago has failed due to network error (28s prior to deadline).
May 19 13:45:33 mdt01prdpom kernel: req@ffff81012e11e400 x1336168048230441/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:46:31 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 4s
May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
May 19 13:46:31 mdt01prdpom kernel: Lustre: 4099:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/2 OSTs are active, abort quota recovery
May 19 13:46:31 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100.121@tcp.
May 19 13:46:31 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000: lustre01-OST0000_UUID now active, resetting orphans

Is this a timeout problem? How can I change the timeout?

Thanks!

Ing. Stefano Elmopi
Gruppo Darco - Resp. ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma
cell. 3466147165
tel. 0657060500
email: stefano.elmopi@sociale.it
More important is to include the crash message from the client and the version of Lustre you are using.

Cheers, Andreas

On 2010-05-19, at 6:34, Stefano Elmopi <stefano.elmopi@sociale.it> wrote:
> [...]
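A minimal way to collect what Andreas asks for on the client (a sketch, assuming a stock 1.8.x node with lctl installed and kernel messages going to dmesg/syslog):

    # Lustre version as seen by this node
    lctl get_param version        # or: cat /proc/fs/lustre/version

    # Recent Lustre/LNet kernel messages around the failure
    dmesg | grep -iE 'lustre|lnet' | tail -n 50
    # syslog (typically /var/log/messages) usually keeps them as well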
Hi Stefano,

your client was evicted by the OST server. Could you explain whether you have any failover configuration? To understand the failover and timeout behaviour better, you could read the manual at paragraph 4.3.8.

Ciao.

On 05/19/2010 02:34 PM, Stefano Elmopi wrote:
> [...]

--
_Gabriele Paciucci_ http://www.linkedin.com/in/paciucci
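For reference, this is roughly how that timeout is changed on 1.8 (a sketch of the 1.8 tunables from memory, assuming the filesystem is named lustre01 as in the logs; verify against the manual section cited above):

    # Temporary, on a single node: raise the global obd timeout (seconds)
    lctl set_param timeout=300

    # Persistent, filesystem-wide: run on the MGS
    lctl conf_param lustre01.sys.timeout=300

Note that 1.8 enables adaptive timeouts by default (see the at_min/at_max tunables), so the static timeout mainly matters when adaptive timeouts are disabled.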
Hi Andreas,

my version of Lustre is 1.8.3. Sorry for my bad English; "crash" was the wrong word. Let me explain better: I start copying a large file onto the file system and, while the copy is running, I reboot the OSS server; the copy process stalls. I expected that once the server was back online the copy would resume and complete normally, but instead the copy process fails. So it is only the copy process that goes wrong; Lustre itself keeps working fine. Is the failure of the copy process a timeout issue? How can I change the timeout?

Thanks!

Cheers,
Stefano

On 19 May 2010, at 17:07, Andreas Dilger wrote:
> More important is to include the crash message from the client and
> the version of Lustre you are using.
>
> Cheers, Andreas
> [...]
On Thu, May 20, 2010 at 12:29:41PM +0200, Stefano Elmopi wrote:
> I expected that once the server was back online the copy would resume
> and complete normally, but instead the copy process fails.
> [...]

May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.

The cp process failed because the client got evicted by the OSS. We need to look at the OSS logs to figure out the root cause of the eviction.

Johann
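On the OSS side, something like the following is a reasonable starting point (a sketch, assuming syslog writes to /var/log/messages; adjust for your setup):

    # Lines around the eviction in the OSS syslog
    grep -E 'LustreError|evict|timed out' /var/log/messages | tail -n 50

    # Dump the in-kernel Lustre debug buffer for a closer look
    lctl dk /tmp/lustre-debug.log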
Hi,

I realized that the server clocks differed a lot across the machines; there were at least a few hours of difference. I am running tests and had not been paying attention to time synchronization, but now I have aligned the time on all servers and configured the ntpd service, and the problem no longer occurs. I can only imagine that the cause of the problem was the clock misalignment.

Thanks, and sorry for the trouble!

Cheers,
Stefano

On 20 May 2010, at 13:28, Johann Lombardi wrote:
> The cp process failed because the client got evicted by the OSS.
> We need to look at the OSS logs to figure out the root cause of
> the eviction.
>
> Johann
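A quick way to confirm the clocks really stay in sync once ntpd is running (standard NTP tools, nothing Lustre-specific; pool.ntp.org is just an example reference server):

    # Peer list and current offsets as seen by the local ntpd
    ntpq -p

    # One-shot offset check against a reference server, without stepping the clock
    ntpdate -q pool.ntp.org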
On Fri, May 21, 2010 at 01:49:41PM +0200, Stefano Elmopi wrote:
> I realized that the server clocks differed a lot across the machines;
> there were at least a few hours of difference.
> [...]

I doubt this eviction issue is related to time synchronization across the nodes. If that really were the case, it would be a bug in Lustre.

Johann
On 2010-05-21, at 5:49, Stefano Elmopi <stefano.elmopi@sociale.it> wrote:
> I realized that the server clocks differed a lot across the machines;
> there were at least a few hours of difference.
> [...]

The client and server clocks should have nothing to do with the functioning of Lustre, so it is surprising that this would be the cause.
Hi,

so this is a bit strange, at least to me! The problem occurs even when the clocks are all synchronized; what misled me is that, under the same conditions, the problem does not always occur. The test is always the same: launch a copy process and then reboot the OSS server involved. These are the logs when the recovery process succeeds:

MGS/MDS

May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100.121@tcp: -113
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:2441:LNetPut()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@ffff81011b6dd400 x1336357074007206/t0 o400->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 192/384 e 0 to 1 dl 1274688078 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:66:request_out_callback()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100.121@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 24 10:01:43 mdt01prdpom kernel: Lustre: 3690:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: 52138191-d920-7519-563b-ab022b922751 reconnecting
May 24 10:02:17 mdt01prdpom kernel: Lustre: 3559:0:(quota_master.c:1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota recovery
May 24 10:02:17 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100.121@tcp.
May 24 10:02:17 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000: lustre01-OST0000_UUID now active, resetting orphans

OSS

May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 10:01:19 oss1prdpom kernel: Lustre: 3640:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834049 sent from MGC172.16.100.111@tcp to NID 172.16.100.111@tcp 5s ago has timed out (5s prior to deadline).
May 24 10:01:19 oss1prdpom kernel: req@ffff81012ec78c00 x1336607320834049/t0 o250->MGS@MGC172.16.100.111@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1274688079 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: 3640:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834051 sent from MGC172.16.100.111@tcp to NID 0@lo 5s ago has timed out (5s prior to deadline).
May 24 10:01:44 oss1prdpom kernel: req@ffff81013ff1cc00 x1336607320834051/t0 o250->MGS@MGC172.16.100.111@tcp_1:26/25 lens 368/584 e 0 to 1 dl 1274688104 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: LustreError: 3557:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81012d15e400 x1336607320834052/t0 o101->MGS@MGC172.16.100.111@tcp_1:26/25 lens 296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
May 24 10:01:44 oss1prdpom kernel: Lustre: 4036:0:(filter.c:990:filter_init_server_data()) RECOVERY: service lustre01-OST0000, 2 recoverable clients, 0 delayed clients, last_rcvd 103079215728
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Now serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Will be in recovery for at least 5:00, or until 2 clients reconnect
May 24 10:01:44 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 10:01:44 oss1prdpom kernel: NET: Registered protocol family 31
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI device and connection manager initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI socket layer initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP socket layer initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HIDP (Human Interface Emulation) ver 1.1
May 24 10:01:44 oss1prdpom hidd[4106]: Bluetooth HID daemon
May 24 10:01:45 oss1prdpom kernel: Lustre: MGC172.16.100.111@tcp: Reactivating import
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd[4179]: Red Hat Network Services Daemon starting up.
May 24 10:02:12 oss1prdpom kernel: Lustre: 3761:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) lustre01-OST0000: 1 recoverable clients remain
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: Recovery period over after 0:06, of 2 clients 2 recovered and 0 were evicted.
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: sending delayed replies to recovered clients
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: received MDS connection from 172.16.100.111@tcp

CLIENT

May 24 09:59:13 mdt02prdpom kernel: LustreError: 11-0: an error occurred while communicating with 172.16.100.121@tcp. The ost_write operation failed with -107
May 24 09:59:13 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-ffff8101337bbc00: Connection to service lustre01-OST0000 via nid 172.16.100.121@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 24 10:00:13 mdt02prdpom kernel: Lustre: There was an unexpected network error while writing to 172.16.100.121: -110.
May 24 10:02:17 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-ffff8101337bbc00: Connection restored to service lustre01-OST0000 using nid 172.16.100.121@tcp.

These are the logs when the recovery process fails:

MGS/MDS

May 24 09:52:01 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100.121@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336357074007136 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 0s ago has failed due to network error (25s prior to deadline).
May 24 09:52:51 mdt01prdpom kernel: req@ffff81012d814c00 x1336357074007136/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274687596 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 21s
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:517:import_select_connection()) Skipped 4 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
May 24 09:53:43 mdt01prdpom kernel: Lustre: 5071:0:(quota_master.c:1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota recovery
May 24 09:53:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100.121@tcp.
May 24 09:53:43 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000: lustre01-OST0000_UUID now active, resetting orphans
May 24 09:54:35 mdt01prdpom kernel: Lustre: MGS: haven't heard from client b6c79384-6c45-6d9c-ab9b-f12969d74da0 (at 172.16.100.121@tcp) in 235 seconds. I think it's dead, and I am evicting it.

OSS

May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
May 24 09:53:02 oss1prdpom kernel: Lustre: OBD class driver, http://www.lustre.org/
May 24 09:53:02 oss1prdpom kernel: Lustre: Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version: 1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version: 1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:03 oss1prdpom kernel: Lustre: Added LNI 172.16.100.121@tcp [8/256/0/180]
May 24 09:53:03 oss1prdpom kernel: Lustre: Accept secure, port 988
May 24 09:53:03 oss1prdpom kernel: Lustre: Lustre Client File System; http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: init dynlocks cache
May 24 09:53:03 oss1prdpom kernel: ldiskfs created from ext3-2.6-rhel5
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
May 24 09:53:03 oss1prdpom multipathd: dm-0: umount map (uevent)
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 09:53:03 oss1prdpom kernel: Lustre: MGC172.16.100.111@tcp: Reactivating import
May 24 09:53:03 oss1prdpom kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: Lustre: lustre01-OST0000: Now serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 09:53:04 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 09:53:04 oss1prdpom kernel: NET: Registered protocol family 31
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI device and connection manager initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI socket layer initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP socket layer initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HIDP (Human Interface Emulation) ver 1.1
May 24 09:53:04 oss1prdpom hidd[4108]: Bluetooth HID daemon
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd[4181]: Red Hat Network Services Daemon starting up.
May 24 09:53:44 oss1prdpom kernel: Lustre: lustre01-OST0000: received MDS connection from 172.16.100.111@tcp

CLIENT

May 24 09:51:28 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-ffff8101337bbc00: Connection to service lustre01-OST0000 via nid 172.16.100.121@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 24 09:51:38 mdt02prdpom kernel: Lustre: There was an unexpected network error while writing to 172.16.100.121: -110.
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336351691692671 sent from lustre01-OST0000-osc-ffff8101337bbc00 to NID 172.16.100.121@tcp 0s ago has failed due to network error (27s prior to deadline).
May 24 09:52:22 mdt02prdpom kernel: req@ffff810109b92c00 x1336351691692671/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274687569 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 23 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc-ffff8101337bbc00: tried all connections, increasing latency to 23s
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:517:import_select_connection()) Skipped 5 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100.121@tcp: -113
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:2441:LNetPut()) Skipped 4 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@ffff810023fcfc00 x1336351691692679/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274687626 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:66:request_out_callback()) Skipped 4 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81012fa9f000 x1336351691692691/t0 o4->lustre01-OST0000_UUID@172.16.100.121@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:858:ptlrpc_import_delay_req()) Skipped 18 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-ffff8101337bbc00: Connection restored to service lustre01-OST0000 using nid 172.16.100.121@tcp.

Thanks!!

Cheers,
Stefano

On 21 May 2010, at 16:56, Andreas Dilger wrote:
> The client and server clocks should have nothing to do with the
> functioning of Lustre, so it is surprising that this would be the cause.
> [...]
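One difference that stands out between the two OSS logs above: in the successful case the OST opens a recovery window ("Will be in recovery for at least 5:00, or until 2 clients reconnect") and both clients reconnect, while in the failing case it mounts with no such window and the clients are evicted. When rerunning the test, watching the per-OST recovery state on the OSS makes the two cases easy to tell apart (a sketch assuming the 1.8 /proc layout; the parameter name may vary by version):

    # Poll the recovery state every 5 seconds while the OSS comes back up
    watch -n 5 'lctl get_param obdfilter.*.recovery_status'
    # equivalent to: cat /proc/fs/lustre/obdfilter/lustre01-OST0000/recovery_status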