thr3ads.net - Lustre discuss - [Lustre-discuss] Clients frozen during pressure test [Aug 2009]


Lu Wang

2009-Aug-03 14:10 UTC

[Lustre-discuss] Clients frozen during pressure test

Dear list,

I am running a stress test on a new 10-OSS Lustre file system using 70 client
nodes. Each server has a 10Gb Ethernet connection and each client has a 1Gb
Ethernet connection. Each OSS serves 3 OSTs on 3 RAID6 volumes.

Each time, after about 4 hours, the clients begin to freeze one after another.
The command "lfs check osts" shows that the frozen clients cannot access
some OSTs:

    error: check 'testfs-OST0007-osc-c9b82800': Resource temporarily unavailable (11)
    error: check 'testfs-OST0008-osc-c9b82800': Resource temporarily unavailable (11)
    error: check 'testfs-OST0009-osc-c9b82800': Resource temporarily unavailable (11)

and the command "lctl ping server" returns "Input/output error".

However, the servers are not busy (util% < 10) when the clients are frozen.
My questions are:

    1. Why can't the clients reconnect when the servers are not busy?
    2. I have set timeout=1000. Do I need to increase it further?
    3. Are there any other variables that need to be tuned under heavy load?
            




Best Regards
Lu Wang
--------------------------------------------------------------	  
Computing Center
IHEP						
Beijing 100049,China		Email: Lu.Wang at ihep.ac.cn							
--------------------------------------------------------------

Alexey Lyashkov

2009-Aug-04 17:38 UTC

[Lustre-discuss] Clients frozen during pressure test

On Mon, 2009-08-03 at 22:10 +0800, Lu Wang wrote:
> Dear list,
> I am doing pressure test for a new 10-OSS Lustre file system using 70
> client nodes. (each server has 10Gb Ethernet connection, each client has 1Gb
> Ethernet connection, there are 3 OSTs on 3 RAID6 volumes for one OSS)
> Each time, after about 4 hours, clients began to be frozen one after
> another. command "lfs check osts" shows that the frozen clients cannot
> access some OSTs.
>     error: check 'testfs-OST0007-osc-c9b82800': Resource temporarily unavailable (11)
>     error: check 'testfs-OST0008-osc-c9b82800': Resource temporarily unavailable (11)
>     error: check 'testfs-OST0009-osc-c9b82800': Resource temporarily unavailable (11)
>
> and command "lctl ping server" shows "Input/output error"

It looks like something is wrong with the network link, or the nodes serving
OST[7-9] are down (panic, deadlock, or other).



-- 
Alexey Lyashkov <Alexey.Lyashkov at Sun.COM>
Sun Microsystems

Andreas Dilger

2009-Aug-04 17:43 UTC

[Lustre-discuss] Clients frozen during pressure test

On Aug 03, 2009  22:10 +0800, Lu Wang wrote:
> I am doing pressure test for a new 10-OSS Lustre file system using 70
> client nodes. (each server has 10Gb Ethernet connection, each client has 1Gb
> Ethernet connection, there are 3 OSTs on 3 RAID6 volumes for one OSS)
> Each time, after about 4 hours, clients began to be frozen one after
> another. command "lfs check osts" shows that the frozen clients cannot
> access some OSTs.
>     error: check 'testfs-OST0007-osc-c9b82800': Resource temporarily unavailable (11)
>     error: check 'testfs-OST0008-osc-c9b82800': Resource temporarily unavailable (11)
>     error: check 'testfs-OST0009-osc-c9b82800': Resource temporarily unavailable (11)
>
> and command "lctl ping server" shows "Input/output error"
>
> However, the servers are not so busy (util% < 10) when clients
> are frozen. My question is:
>     1. Why clients cannot reconnect when servers are not so busy?
>     2. I am setting timeout=1000, do I need add timeout to a number larger?
>     3. Is there any other variable needed to be tuned under heavy pressure?

Using a timeout of 1000s is not good at all.  That means clients will
wait AT LEAST 1000s (16 minutes!) before even detecting that an RPC
failed.  If you are running Lustre 1.8 you don't even need to set the
timeouts - they are scaled automatically based on the load at the server.

You should check for error messages on the server that might indicate
why it is having a problem.  If all of the clients are reporting the
same OSTs, possibly on the same OSS, then that is a good sign there is
something wrong with that OSS.
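
One way to check whether the clients agree on the failing OSTs is to collect the "lfs check osts" output from each client and tally it. The following Python sketch is only an illustration: the error-line format is taken from the output quoted above, while the function names, the client names, and in particular the rule that OST index N lives on OSS N // 3 are assumptions made for the example. Substitute the real OST-to-OSS layout of your file system.

```python
import re
from collections import defaultdict

# Matches the error lines quoted in this thread, for example:
#   error: check 'testfs-OST0007-osc-c9b82800': Resource temporarily unavailable (11)
# One or more quote characters are accepted around the device name.
ERROR_RE = re.compile(
    r"error: check '+(?P<fs>[\w-]+?)-OST(?P<idx>[0-9a-fA-F]{4})-osc-\w+'+"
)


def failing_osts(check_output):
    """Return the set of OST indices reported as failing in one client's output."""
    # OST indices in device names are hexadecimal (OST000a is index 10).
    return {int(m.group("idx"), 16) for m in ERROR_RE.finditer(check_output)}


def group_by_oss(reports, osts_per_oss=3):
    """Map each OSS number to the set of clients that report a failing OST on it.

    ASSUMPTION (hypothetical layout): OST index N is served by OSS
    N // osts_per_oss. Replace this with the actual mapping for your site.
    """
    by_oss = defaultdict(set)
    for client, output in reports.items():
        for idx in failing_osts(output):
            by_oss[idx // osts_per_oss].add(client)
    return dict(by_oss)


# Example input: client names are made up, error text follows the thread.
sample = (
    "error: check 'testfs-OST0007-osc-c9b82800': Resource temporarily unavailable (11)\n"
    "error: check 'testfs-OST0008-osc-c9b82800': Resource temporarily unavailable (11)\n"
    "error: check 'testfs-OST0009-osc-c9b82800': Resource temporarily unavailable (11)\n"
)
reports = {"client01": sample}

for oss, clients in sorted(group_by_oss(reports).items()):
    print("OSS", oss, "reported by", sorted(clients))
```

If most clients end up listed under the same OSS, that server is the first place to look for error messages. If the failures are spread evenly across servers, the network or the client side is a more likely suspect.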

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Lu Wang

2009-Aug-04 23:30 UTC

[Lustre-discuss] Fw: Re: Clients frozen during pressure test

Hi,

After modifying the TCP stack settings on the OSS nodes, it seems that the
failed clients can reconnect again. The stress test has now been running for
12 hours.

> You should check for error messages on the server that might indicate
> why it is having a problem.  If all of the clients are reporting the
> same OSTs, possibly on the same OSS, then that is a good sign there is
> something wrong with that OSS.

Different clients reported lost connections to different OSTs, so the problem
might be caused by a certain OST's failure.
--------------				 
Lu Wang
2009-08-05

