Fabricio Cannini
2011-Jan-06 22:55 UTC
[Gluster-users] Instability with large setup and infiniband
Hi all,

I've set up glusterfs, version 3.0.5 (Debian squeeze amd64 stock packages), with each node acting as both a server and a client:

Client config: http://pastebin.com/6d4BjQwd
Server config: http://pastebin.com/4ZmX9ir1

Configuration of each node: 2x Intel Xeon 5420 2.5GHz, 16GB DDR2 ECC, 1 SATA2 HD of 750GB, of which ~600GB is a partition (/glstfs) dedicated to gluster. Each node also has a Mellanox MT25204 [InfiniHost III Lx] InfiniBand DDR HCA, used by gluster through the 'verbs' interface. (A sketch of what such a setup typically looks like is at the end of this message.)

This cluster of 22 nodes is used for scientific computing, and glusterfs provides a scratch area for I/O-intensive applications. And this is one of the problems: a *single* I/O-intensive job can bring the whole volume to its knees, with "Transport endpoint is not connected" errors and so on, until it is completely unusable; this is especially true if the job runs in parallel (through MPI) on more than one node.

The other problem is that gluster has been somewhat unstable even without I/O-intensive jobs. Out of the blue, a simple 'ls -la /scratch' is answered with a "Transport endpoint is not connected" error. When this happens, restarting all the servers brings things back to a working state.

If anybody here using glusterfs with InfiniBand has been through this (or something like it) and could share your experiences, please please please do.

TIA,
Fabricio.
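
P.S. For anyone unfamiliar with the setup: below is a minimal sketch of what a server volfile for a 3.0.x export over InfiniBand verbs typically looks like. This is an assumption based on stock 3.0 volfile syntax, not a copy of the pastebin configs above; the volume names, thread count, and auth pattern are illustrative (only the /glstfs path is from the real setup).

    # export the on-disk directory
    volume posix
      type storage/posix
      option directory /glstfs
    end-volume

    # POSIX locking on top of the export
    volume locks
      type features/locks
      subvolumes posix
    end-volume

    # io-threads so one slow request does not stall the whole brick
    volume brick
      type performance/io-threads
      option thread-count 8        # illustrative value
      subvolumes locks
    end-volume

    # serve the brick over ib-verbs instead of tcp
    volume server
      type protocol/server
      option transport-type ib-verbs
      option auth.addr.brick.allow *   # illustrative; restrict in production
      subvolumes brick
    end-volume

The client side would mirror this with protocol/client volumes (transport-type ib-verbs) for each of the 22 bricks, aggregated by a cluster/distribute volume.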