I've been trying to work out why a Lustre filesystem on one of my clusters performs so poorly. I have two clusters: a 20-node dual-Opteron cluster with channel-bonded gigabit Ethernet, and a 10-node i386 cluster with 100 Mbit Ethernet. Both run the same version of Lustre (1.5.95) and were set up the same way: the frontend is the MDT/MGS, each compute node is an OST, and the /lustre filesystem is mounted on the frontend and on every node.

On the Opteron cluster, copying the NCBI databases into /lustre takes forever, if it finishes at all. Load on the frontend climbs to 8 or 9 and eventually I have to Ctrl-C the copy. I have also tried rsync, with the same results. About two weeks ago I tried to copy about 8 GB of data into /lustre and the Lustre filesystem crashed completely: the copy timed out about four times, and when I went back to check on it I couldn't see anything in /lustre. The logs claimed that a few nodes had "died", but I could ssh to them and they were fine.

If I do the same things on the 10-node i386 cluster, I get what I expect: excellent speed and almost no load. I am quite happy with it.

Why does Lustre perform so badly on the Opterons and not on the i386 cluster? My hunch is that the channel bonding is the culprit. Has channel bonding been a problem before? If so, are there any bonding tweaks I can apply on the Opteron cluster? Is anybody else using channel bonding with a Lustre filesystem? (I've sketched roughly what our bonding and LNET configuration looks like below my sig.)

Thanks for any help or tips you can provide.

--
Jeremy Mann
jeremy@biochem.uthscsa.edu
University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672
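
P.S. For anyone who wants specifics: the bonding and LNET setup on the Opteron nodes looks roughly like the sketch below. The mode and option values are illustrative (from memory, not pasted from the actual files), so treat them as assumptions rather than our exact config:

    # /etc/modprobe.conf (sketch; values illustrative)
    alias bond0 bonding
    # mode=balance-rr is the bonding driver's default; miimon=100 enables
    # link monitoring every 100 ms
    options bonding mode=balance-rr miimon=100
    # point LNET at the bonded interface for the tcp network
    options lnet networks=tcp0(bond0)

And the quick checks I can run on each node:

    cat /proc/net/bonding/bond0   # bonding mode, slave state, link failure counts
    lctl list_nids                # the NIDs Lustre is actually using on this node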