Fabricio Cannini
2011-Jan-06 22:55 UTC
[Gluster-users] Instability with large setup and infiniband
Hi all,

I've set up glusterfs, version 3.0.5 (Debian squeeze amd64 stock packages), with each node acting as both a server and a client:

Client config: http://pastebin.com/6d4BjQwd
Server config: http://pastebin.com/4ZmX9ir1

Configuration of each node: 2x Intel Xeon 5420 2.5GHz, 16GB DDR2 ECC, 1 SATA2 HD of 750GB, of which ~600GB is a partition (/glstfs) dedicated to gluster. Each node also has a Mellanox MT25204 [InfiniHost III Lx] InfiniBand DDR HCA, used by gluster through the 'verbs' interface. (A sketch of what such a setup typically looks like is at the end of this message.)

This cluster of 22 nodes is used for scientific computing, and glusterfs provides a scratch area for I/O-intensive applications. And this is one of the problems: a *single* I/O-intensive job can bring the whole volume to its knees, with "Transport endpoint is not connected" errors and so on, until it is completely unusable; this is especially true if the job runs in parallel (through MPI) on more than one node.

The other problem is that gluster has been somewhat unstable even without I/O-intensive jobs. Out of the blue, a simple 'ls -la /scratch' is answered with a "Transport endpoint is not connected" error. When this happens, restarting all the servers brings things back to a working state.

If anybody here using glusterfs with InfiniBand has been through this (or something like it) and could share your experiences, please please please do.

TIA,
Fabricio.
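
P.S. For anyone unfamiliar with the setup: below is a minimal sketch of what a server volfile for a 3.0.x export over InfiniBand verbs typically looks like. This is an assumption based on stock 3.0 volfile syntax, not a copy of the pastebin configs above; the volume names, thread count, and auth pattern are illustrative (only the /glstfs path is from the real setup).

    # export the on-disk directory
    volume posix
      type storage/posix
      option directory /glstfs
    end-volume

    # POSIX locking on top of the export
    volume locks
      type features/locks
      subvolumes posix
    end-volume

    # io-threads so one slow request does not stall the whole brick
    volume brick
      type performance/io-threads
      option thread-count 8        # illustrative value
      subvolumes locks
    end-volume

    # serve the brick over ib-verbs instead of tcp
    volume server
      type protocol/server
      option transport-type ib-verbs
      option auth.addr.brick.allow *   # illustrative; restrict in production
      subvolumes brick
    end-volume

The client side would mirror this with protocol/client volumes (transport-type ib-verbs) for each of the 22 bricks, aggregated by a cluster/distribute volume.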