I've been trying to work out why a Lustre filesystem on one of my clusters performs so poorly. I have two clusters: a 20-node dual-Opteron cluster with channel-bonded gigabit Ethernet, and a 10-node i386 cluster with 100 Mbit Ethernet. Both run the same version of Lustre (1.5.95) and were set up the same way: the frontend is the MDT/MGS, each compute node is an OST, and the /lustre filesystem is mounted on the frontend and on every node.

On the Opteron cluster, copying the NCBI databases into /lustre takes forever, if it finishes at all. Load on the frontend climbs to 8 or 9 and eventually I have to Ctrl-C the copy. I have also tried rsync, with the same results. About two weeks ago I tried to copy about 8 GB of data into /lustre and the Lustre filesystem crashed completely: the copy timed out about four times, and when I went back to check on it I couldn't see anything in /lustre. The logs claimed that a few nodes had "died", but I could ssh to them and they were fine.

If I do the same things on the 10-node i386 cluster, I get what I expect: excellent speed and almost no load. I am quite happy with it.

Why does Lustre perform so badly on the Opterons and not on the i386 cluster? My hunch is that the channel bonding is the culprit. Has channel bonding been a problem before? If so, are there any bonding tweaks I can apply on the Opteron cluster? Is anybody else using channel bonding with a Lustre filesystem? (I've sketched roughly what our bonding and LNET configuration looks like below my sig.)

Thanks for any help or tips you can provide.

--
Jeremy Mann
jeremy@biochem.uthscsa.edu
University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672
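
P.S. For anyone who wants specifics: the bonding and LNET setup on the Opteron nodes looks roughly like the sketch below. The mode and option values are illustrative (from memory, not pasted from the actual files), so treat them as assumptions rather than our exact config:

    # /etc/modprobe.conf (sketch; values illustrative)
    alias bond0 bonding
    # mode=balance-rr is the bonding driver's default; miimon=100 enables
    # link monitoring every 100 ms
    options bonding mode=balance-rr miimon=100
    # point LNET at the bonded interface for the tcp network
    options lnet networks=tcp0(bond0)

And the quick checks I can run on each node:

    cat /proc/net/bonding/bond0   # bonding mode, slave state, link failure counts
    lctl list_nids                # the NIDs Lustre is actually using on this node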