Hi,

As some of you know, we have been attacking some scalability issues. This email gives an overview of the activities since we started this work in early Nov.

A summary of the work in progress is visible in Bugzilla:

https://bugzilla.clusterfs.com/showdependencytree.cgi?id=11228

You can drill down in this outline tree and see patches, design documents, etc.

1. Client Count Scalability

This work was preceded by studying the scalability of a few common operations: many clients mounting, opening files, and doing IO. We will review more use cases, but right now the following issues are being addressed:

- parallel lock callbacks (instead of sequential for each client) (11301)
- scanning lock lists with skiplists and interval trees instead of linear lists (10902, 11300)
- a hash for connection retrieval (11013)
- no synchronous IO upon connect (10906)
- remembering the quota master to avoid searching for it (11228)

2. File IO Scalability

- do not walk page lists to find pages covered by a lock (10718, 20 (the oldest open bug!))

3. Server-Based Load Simulator

We believe that testing our servers with artificial loads will be very helpful. We constructed a load generator and hope to make good use of it.

- this simulator will be available with the next 1.6 beta (11334)
- it required us to address Lustre device scalability (11307)
- an alternate load generator based on liblustre is forthcoming (11302)

These issues are all being worked on and are at various stages, depending on their size. We expect that these fixes will alleviate some problems we have seen on really large systems (Sandia, ORNL). I expect we will find new scalability issues, and that those, like these, will be relatively easy to address.

- Peter -